Re: [openib-general] opensm crash with topspin HCA
On Thu, 2006-11-02 at 13:33, Viswanath Krishnamurthy wrote:
> When we run the opensm (OFED) release with a Topspin HCA in the IB
> network, opensm crashes in umad_receiver with a NULL pointer exception.
> The transaction ID is zero in the MADs from the Topspin HCA on Windows.
> The crashes seem to be random in umad_receiver.

What OpenSM version? There was a problem like this fixed back at the end of August:

r8920 | halr | 2006-08-14 09:09:28 -0400 (Mon, 14 Aug 2006) | 11 lines

OpenSM/osm_vendor_ibumad.c: In get_madw, check for TID 0
(resolves NULL ptr crash with Cisco stack)

This change fixes an OSM crash when working with Cisco's stack.
Cisco's stack doesn't follow the same TID convention when generating
transaction IDs, which in some bad flow revealed this bug in the get_madw
lookup. The bug was in get_madw, which does not detect a lookup of its
reserved "free" entry of key==0.

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
Signed-off-by: Hal Rosenstock <[EMAIL PROTECTED]>

-- Hal

> HCA found:
> hca_id=InfiniHost0
> vendor_id=0x02C9
> vendor_part_id=0x5A44
> hw_ver=0xA0
> fw_ver=0x40006

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] opensm crash with topspin HCA
On 10:33 Thu 02 Nov , Viswanath Krishnamurthy wrote:
> When we run the opensm (OFED) release with a Topspin HCA in the IB network,
> opensm crashes in umad_receiver with a NULL pointer exception.

Do you have any logs, gdb backtrace or any other details?

Sasha

> The transaction ID is zero in the MADs from the Topspin HCA on Windows.
> The crashes seem to be random in umad_receiver.
>
> HCA found:
> hca_id=InfiniHost0
> vendor_id=0x02C9
> vendor_part_id=0x5A44
> hw_ver=0xA0
> fw_ver=0x40006
[openib-general] opensm crash with topspin HCA
When we run the opensm (OFED) release with a Topspin HCA in the IB network, opensm crashes in umad_receiver with a NULL pointer exception. The transaction ID is zero in the MADs from the Topspin HCA on Windows. The crashes seem to be random in umad_receiver.

HCA found:
hca_id=InfiniHost0
vendor_id=0x02C9
vendor_part_id=0x5A44
hw_ver=0xA0
fw_ver=0x40006
Re: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk
Thanks for the patch Aniruddha. Can you resend with a signed-off-by line? See "How do I submit source code patches?" at https://openib.org/tiki/tiki-index.php?page=OpenIBFAQ

> Also a minor patch, you can see that %P is printed as %P and not used as
> a format character.
>
> Index: common/dapl_ep_post_rdma_write.c
> ===
> --- common/dapl_ep_post_rdma_write.c (revision 3892)
> +++ common/dapl_ep_post_rdma_write.c (working copy)
> @@ -78,7 +78,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n",
> +        "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n",
>          ep_handle,
>          num_segments,
>          local_iov,
> Index: common/dapl_ep_post_send.c
> ===
> --- common/dapl_ep_post_send.c (revision 3892)
> +++ common/dapl_ep_post_send.c (working copy)
> @@ -75,7 +75,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_ep_post_send (%p, %d, %p, %P, %x)\n",
> +        "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
>          ep_handle,
>          num_segments,
>          local_iov,
> Index: common/dapl_srq_post_recv.c
> ===
> --- common/dapl_srq_post_recv.c (revision 3892)
> +++ common/dapl_srq_post_recv.c (working copy)
> @@ -79,7 +79,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_srq_post_recv (%p, %d, %p, %P)\n",
> +        "dapl_srq_post_recv (%p, %d, %p, %p)\n",
>          srq_handle,
>          num_segments,
>          local_iov,
> Index: common/dapl_ep_post_recv.c
> ===
> --- common/dapl_ep_post_recv.c (revision 3892)
> +++ common/dapl_ep_post_recv.c (working copy)
> @@ -79,7 +79,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n",
> +        "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n",
>          ep_handle,
>          num_segments,
>          local_iov,
>
> Thanks
> Aniruddha
Re: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk
Aniruddha Bohra wrote:
> Now, I have a problem with udapl :
>
> The following is a code snippet from : dapl_ib_dto.h
>
>     for (i = 0; i < segments; i++ ) {
>         if ( !local_iov[i].segment_length )
>             continue;
>         ds_array_p->addr = (uint64_t) local_iov[i].virtual_address;
>         ds_array_p->length = local_iov[i].segment_length;
>         ds_array_p->lkey = local_iov[i].lmr_context;
>         dapl_dbg_log ( DAPL_DBG_TYPE_EP,
>                        " post_snd: lkey 0x%x va %p len %d \n",
>                        ds_array_p->lkey, ds_array_p->addr,
>                        ds_array_p->length );
>         total_len += ds_array_p->length;
>         wr.num_sge++;
>         ds_array_p++;
>     }
>
> The following is the relevant part of the log with DAPL_DBG_TYPE=0x
>
> dapl_ep_post_send (0x8087110, 2, 0x80f9910, %P, b5f395bc)^M
> post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x80f9910 r_iov 0xbfc29060 f 0^M
> post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x80f9910^M
> post_snd: lkey 0x10de003b va 0xb5f3976c len 0 ^M
> post_snd: lkey 0x10de003b va 0xb5f39924 len 0 ^M
>
> From the above loop, how is this possible : If
> local_iov[i].segment_length == 0, it should not be printed. And if the
> assignment is successful, len must not be 0.
>
> Any ideas? Of course following this, the ep is disconnected in the next
> step :(

The local_iov (LMR) length is 64 bits and the ibv_sge (ds_array) length is 32 bits, so it truncates. Sounds like you set up a transfer greater than 4GB-1?

If you query the device via uDAPL you will see the max limits (2GB):

query_hca: (a0.0) ep 64512 ep_q 65535 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 59 lmr 131056 rmr 0

-arlin

> Also a minor patch, you can see that %P is printed as %P and not used as
> a format character.
>
> Index: common/dapl_ep_post_rdma_write.c
> ===
> --- common/dapl_ep_post_rdma_write.c (revision 3892)
> +++ common/dapl_ep_post_rdma_write.c (working copy)
> @@ -78,7 +78,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n",
> +        "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n",
>          ep_handle,
>          num_segments,
>          local_iov,
> Index: common/dapl_ep_post_send.c
> ===
> --- common/dapl_ep_post_send.c (revision 3892)
> +++ common/dapl_ep_post_send.c (working copy)
> @@ -75,7 +75,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_ep_post_send (%p, %d, %p, %P, %x)\n",
> +        "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
>          ep_handle,
>          num_segments,
>          local_iov,
> Index: common/dapl_srq_post_recv.c
> ===
> --- common/dapl_srq_post_recv.c (revision 3892)
> +++ common/dapl_srq_post_recv.c (working copy)
> @@ -79,7 +79,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_srq_post_recv (%p, %d, %p, %P)\n",
> +        "dapl_srq_post_recv (%p, %d, %p, %p)\n",
>          srq_handle,
>          num_segments,
>          local_iov,
> Index: common/dapl_ep_post_recv.c
> ===
> --- common/dapl_ep_post_recv.c (revision 3892)
> +++ common/dapl_ep_post_recv.c (working copy)
> @@ -79,7 +79,7 @@
>      DAT_RETURN dat_status;
>
>      dapl_dbg_log (DAPL_DBG_TYPE_API,
> -        "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n",
> +        "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n",
>          ep_handle,
>          num_segments,
>          local_iov,
>
> Thanks
> Aniruddha
uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk
Roland Dreier wrote:
> > OK so, what options do I have right now -- compile a new kernel and
> > apply patches and continue, or is there some patch that I can apply ?
>
> I don't think anyone has prepared a kzalloc() patch, but just adding
> something like
>
>     static void *kzalloc(size_t size, unsigned int flags)
>     {
>         void *ret = kmalloc(size, flags);
>         if (ret)
>             memset(ret, 0, size);
>         return ret;
>     }
>
> to files that use kzalloc() should let you use 2.6.13 (assuming there
> are no other incompatibilities).

Thanks, that works.

Now, I have a problem with udapl :

The following is a code snippet from : dapl_ib_dto.h

    for (i = 0; i < segments; i++ ) {
        if ( !local_iov[i].segment_length )
            continue;
        ds_array_p->addr = (uint64_t) local_iov[i].virtual_address;
        ds_array_p->length = local_iov[i].segment_length;
        ds_array_p->lkey = local_iov[i].lmr_context;
        dapl_dbg_log ( DAPL_DBG_TYPE_EP,
                       " post_snd: lkey 0x%x va %p len %d \n",
                       ds_array_p->lkey, ds_array_p->addr,
                       ds_array_p->length );
        total_len += ds_array_p->length;
        wr.num_sge++;
        ds_array_p++;
    }

The following is the relevant part of the log with DAPL_DBG_TYPE=0x

dapl_ep_post_send (0x8087110, 2, 0x80f9910, %P, b5f395bc)^M
post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x80f9910 r_iov 0xbfc29060 f 0^M
post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x80f9910^M
post_snd: lkey 0x10de003b va 0xb5f3976c len 0 ^M
post_snd: lkey 0x10de003b va 0xb5f39924 len 0 ^M

From the above loop, how is this possible : If local_iov[i].segment_length == 0, it should not be printed. And if the assignment is successful, len must not be 0.

Any ideas? Of course following this, the ep is disconnected in the next step :(

Also a minor patch, you can see that %P is printed as %P and not used as a format character.

Index: common/dapl_ep_post_rdma_write.c
===
--- common/dapl_ep_post_rdma_write.c (revision 3892)
+++ common/dapl_ep_post_rdma_write.c (working copy)
@@ -78,7 +78,7 @@
     DAT_RETURN dat_status;

     dapl_dbg_log (DAPL_DBG_TYPE_API,
-        "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n",
+        "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n",
         ep_handle,
         num_segments,
         local_iov,
Index: common/dapl_ep_post_send.c
===
--- common/dapl_ep_post_send.c (revision 3892)
+++ common/dapl_ep_post_send.c (working copy)
@@ -75,7 +75,7 @@
     DAT_RETURN dat_status;

     dapl_dbg_log (DAPL_DBG_TYPE_API,
-        "dapl_ep_post_send (%p, %d, %p, %P, %x)\n",
+        "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
         ep_handle,
         num_segments,
         local_iov,
Index: common/dapl_srq_post_recv.c
===
--- common/dapl_srq_post_recv.c (revision 3892)
+++ common/dapl_srq_post_recv.c (working copy)
@@ -79,7 +79,7 @@
     DAT_RETURN dat_status;

     dapl_dbg_log (DAPL_DBG_TYPE_API,
-        "dapl_srq_post_recv (%p, %d, %p, %P)\n",
+        "dapl_srq_post_recv (%p, %d, %p, %p)\n",
         srq_handle,
         num_segments,
         local_iov,
Index: common/dapl_ep_post_recv.c
===
--- common/dapl_ep_post_recv.c (revision 3892)
+++ common/dapl_ep_post_recv.c (working copy)
@@ -79,7 +79,7 @@
     DAT_RETURN dat_status;

     dapl_dbg_log (DAPL_DBG_TYPE_API,
-        "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n",
+        "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n",
         ep_handle,
         num_segments,
         local_iov,

Thanks
Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
> OK so, what options do I have right now -- compile a new kernel and
> apply patches and continue, or is there some patch that I can apply ?

I don't think anyone has prepared a kzalloc() patch, but just adding something like

    static void *kzalloc(size_t size, unsigned int flags)
    {
        void *ret = kmalloc(size, flags);
        if (ret)
            memset(ret, 0, size);
        return ret;
    }

to files that use kzalloc() should let you use 2.6.13 (assuming there are no other incompatibilities).

- R.
Re: [openib-general] OpenSM crash with today's trunk
Roland Dreier wrote:
> > With 3892 I now get the following warnings on compilation:
> > WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko needs unknown symbol kzalloc
> > WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko needs unknown symbol kzalloc
>
> Yes, kzalloc() was added in 2.6.14. Now that 2.6.14 has been released,
> the subversion trunk is targeted against that kernel rather than the
> old 2.6.13 release.
>
> - R.

OK so, what options do I have right now -- compile a new kernel and apply patches and continue, or is there some patch that I can apply ?

Thanks
Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
> With 3892 I now get the following warnings on compilation:
> WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko needs unknown symbol kzalloc
> WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko needs unknown symbol kzalloc

Yes, kzalloc() was added in 2.6.14. Now that 2.6.14 has been released, the subversion trunk is targeted against that kernel rather than the old 2.6.13 release.

- R.
Re: [openib-general] OpenSM crash with today's trunk
Roland Dreier wrote:
> > Now there is an OOPS in the dmesg :
>
> This really looks like the bug I fixed in r3889. What svn rev are your
> kernel modules built from?
>
> - R.

And of course, the module does not load :

Oct 28 16:21:57 hora-3 kernel: ib_mthca: Unknown symbol kzalloc
Oct 28 16:21:58 hora-3 kernel: ib_umad: Unknown symbol kzalloc

Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
Roland Dreier wrote:
> > Now there is an OOPS in the dmesg :
>
> This really looks like the bug I fixed in r3889. What svn rev are your
> kernel modules built from?
>
> - R.

With 3892 I now get the following warnings on compilation:

WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko needs unknown symbol kzalloc
WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko needs unknown symbol kzalloc

Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
> Now there is an OOPS in the dmesg :

This really looks like the bug I fixed in r3889. What svn rev are your kernel modules built from?

- R.
Re: [openib-general] OpenSM crash with today's trunk
Hal Rosenstock wrote:
> Or perhaps something crashed and didn't clean up properly. Does this
> occur immediately after a boot ?

After a fresh reboot of the machines on the switch, I get the log at http://www.cs.rutgers.edu/~bohra/osm-v2.log

The opensm process does not crash but hangs. The state of the port never changes.

Now there is an OOPS in the dmesg :

Oct 28 13:52:13 hora-3 OpenSM[5168]: OpenSM Rev:openib-1.1.0
Oct 28 13:52:14 hora-3 kernel: Unable to handle kernel paging request at virtual address 0910
Oct 28 13:52:14 hora-3 kernel: printing eip:
Oct 28 13:52:14 hora-3 kernel: f883f12d
Oct 28 13:52:14 hora-3 kernel: *pde =
Oct 28 13:52:14 hora-3 kernel: Oops: [#1]
Oct 28 13:52:14 hora-3 kernel: SMP
Oct 28 13:52:14 hora-3 kernel: Modules linked in: ib_uverbs ib_umad ipv6 i2c_dev i2c_core sunrpc dm_mod video button battery ac uhci_hcd hw_random ib_mthca ib_mad ib_core e1000 floppy
Oct 28 13:52:14 hora-3 kernel: CPU:1
Oct 28 13:52:14 hora-3 kernel: EIP:0060:[]Not tainted VLI
Oct 28 13:52:14 hora-3 kernel: EFLAGS: 00010286 (2.6.13bohra)
Oct 28 13:52:14 hora-3 kernel: EIP is at ib_post_send_mad+0x1c/0x1b1 [ib_mad]
Oct 28 13:52:14 hora-3 kernel: eax: 0900 ebx: c1a7d900 ecx: c1a7d918 edx:
Oct 28 13:52:14 hora-3 kernel: esi: c1a7d918 edi: f6571f68 ebp: f6571efc esp: f6571ed8
Oct 28 13:52:14 hora-3 kernel: ds: 007b es: 007b ss: 0068
Oct 28 13:52:14 hora-3 kernel: Process opensm (pid: 5224, threadinfo=f657 task=f7dfb020)
Oct 28 13:52:14 hora-3 kernel: Stack: f883ef5a c1a7d800 080bd018 f6571efc f6a42900 a0f684f6
Oct 28 13:52:14 hora-3 kernel: f6571f68 f6571f74 f88f1728 0018 00e8 00d0 f6a42948
Oct 28 13:52:14 hora-3 kernel: f68bda24 0009 a0f684f6 0009 c1a7d918 0100
Oct 28 13:52:14 hora-3 kernel: Call Trace:
Oct 28 13:52:14 hora-3 kernel: [] show_stack+0x7c/0x92
Oct 28 13:52:14 hora-3 kernel: [] show_registers+0x152/0x1ca
Oct 28 13:52:14 hora-3 kernel: [] die+0xf4/0x16f
Oct 28 13:52:14 hora-3 kernel: [] do_page_fault+0x463/0x649
Oct 28 13:52:14 hora-3 kernel: [] error_code+0x4f/0x54
Oct 28 13:52:14 hora-3 kernel: [] ib_umad_write+0x2d0/0x30e [ib_umad]
Oct 28 13:52:14 hora-3 kernel: [] vfs_write+0x155/0x15a
Oct 28 13:52:14 hora-3 kernel: [] sys_write+0x3d/0x64
Oct 28 13:52:14 hora-3 kernel: [] sysenter_past_esp+0x54/0x75
Oct 28 13:52:14 hora-3 kernel: Code: e8 d8 63 af c7 89 d8 83 c4 0c 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 18 85 f6 89 55 f0 0f 84 ff 00 00 00 8b 46 08 8d 5e e8 <8b> 50 10 8b 7b 14 85 d2 0f 84 7c 01 00 00 8b 4e 18 85 c9 74 0b

Thanks
Aniruddha

From: [EMAIL PROTECTED] on behalf of Sean Hefty
Sent: Fri 10/28/2005 12:01 PM
To: Aniruddha Bohra
Cc: openib-general@openib.org
Subject: Re: [openib-general] OpenSM crash with today's trunk

Aniruddha Bohra wrote:
>> Oh well, I guess this is a different bug. Is there an oops or
>> anything in your kernel log, or is this just a userspace crash?
>
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?

Is there any chance opensm is already running on the system? It sounds like something has already registered to receive the same MADs that opensm wants to receive.

- Sean
Re: [openib-general] OpenSM crash with today's trunk
Hal Rosenstock wrote:
> Or perhaps something crashed and didn't clean up properly. Does this
> occur immediately after a boot ?

This is after a clean reboot. There are two systems on the switch and this is the only active one. I will reboot both and see again.

Thanks
Aniruddha

From: [EMAIL PROTECTED] on behalf of Sean Hefty
Sent: Fri 10/28/2005 12:01 PM
To: Aniruddha Bohra
Cc: openib-general@openib.org
Subject: Re: [openib-general] OpenSM crash with today's trunk

Aniruddha Bohra wrote:
>> Oh well, I guess this is a different bug. Is there an oops or
>> anything in your kernel log, or is this just a userspace crash?
>
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?

Is there any chance opensm is already running on the system? It sounds like something has already registered to receive the same MADs that opensm wants to receive.

- Sean
RE: [openib-general] OpenSM crash with today's trunk
Or perhaps something crashed and didn't clean up properly. Does this occur immediately after a boot ?

From: [EMAIL PROTECTED] on behalf of Sean Hefty
Sent: Fri 10/28/2005 12:01 PM
To: Aniruddha Bohra
Cc: openib-general@openib.org
Subject: Re: [openib-general] OpenSM crash with today's trunk

Aniruddha Bohra wrote:
>> Oh well, I guess this is a different bug. Is there an oops or
>> anything in your kernel log, or is this just a userspace crash?
>
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?

Is there any chance opensm is already running on the system? It sounds like something has already registered to receive the same MADs that opensm wants to receive.

- Sean
Re: [openib-general] OpenSM crash with today's trunk
Aniruddha Bohra wrote:
>> Oh well, I guess this is a different bug. Is there an oops or
>> anything in your kernel log, or is this just a userspace crash?
>
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?

Is there any chance opensm is already running on the system? It sounds like something has already registered to receive the same MADs that opensm wants to receive.

- Sean
RE: [openib-general] OpenSM crash with today's trunk
This means you have another SM or application already registered for handling SubnetManagement packets. Thus OpenSM fails to start (register as the handler for such requests). The crash is a bug that should be solved.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -Original Message-
> From: Aniruddha Bohra [mailto:[EMAIL PROTECTED]]
> Sent: Friday, October 28, 2005 5:28 PM
> To: Roland Dreier
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] OpenSM crash with today's trunk
>
> Roland Dreier wrote:
>
> >     Aniruddha> I tried with r3888 and r3891 with the same result.
> >
> > Oh well, I guess this is a different bug. Is there an oops or
> > anything in your kernel log, or is this just a userspace crash?
>
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?
>
> Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
Roland Dreier wrote:
>     Aniruddha> I tried with r3888 and r3891 with the same result.
>
> Oh well, I guess this is a different bug. Is there an oops or
> anything in your kernel log, or is this just a userspace crash?

This is what I see :

Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM

Is this useful?

Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
    Aniruddha> I tried with r3888 and r3891 with the same result.

Oh well, I guess this is a different bug. Is there an oops or anything in your kernel log, or is this just a userspace crash? If it's just opensm crashing then I'm not much use in debugging.

- R.
Re: [openib-general] OpenSM crash with today's trunk
Roland Dreier wrote:
> I believe that this is in r3889.
>
> - R.

I tried with r3888 and r3891 with the same result.

Aniruddha
Re: [openib-general] OpenSM crash with today's trunk
I believe that this is in r3889.

- R.
[openib-general] OpenSM crash with today's trunk
Hello,

I updated the OpenIB stack today and I get the following error on starting OpenSM. The verbose log is available at http://www.cs.rutgers.edu/~bohra/osm-v.log

# opensm -V -d10 -r
-
OpenSM Rev:openib-1.1.0
Command Line Arguments:
 Big V selected
 d level = 0xa
 Reassign LIDs
 Log File: /var/log/osm.log
-
OpenSM Rev:openib-1.1.0
Using default guid 0x2c901081e7471
Error from osm_opensm_bind (0x2A)
Exiting SM
Segmentation fault

Please let me know what I can do to debug this.

Thanks
Aniruddha
Re: [openib-general] OpenSM crash
On Tue, 2005-05-31 at 16:43, Tom Duffy wrote:
> On Tue, 2005-05-31 at 13:09 -0400, Hal Rosenstock wrote:
> > There are certain changes where the makefiles need to be regenerated
> > (and this is not done automatically). Since there was an additional
> > compile flag added, they need to be regenerated or else it is being
> > built the old way (without the real RMPP support enabled).
>
> $ make automake
>
> at the toplevel should take care of this, no?

Yes.

-- Hal
Re: [openib-general] OpenSM crash
On Tue, 2005-05-31 at 13:09 -0400, Hal Rosenstock wrote:
> There are certain changes where the makefiles need to be regenerated
> (and this is not done automatically). Since there was an additional
> compile flag added, they need to be regenerated or else it is being
> built the old way (without the real RMPP support enabled).

$ make automake

at the toplevel should take care of this, no?

-tduffy
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 17:30, Tom Duffy wrote:
> > Also, did you pick up the user_mad.c fix on Tuesday AM ? If it was,
> > any other changes are either not related or trivial.
> >
> > After you picked up these changes, did you regenerate the various
> > OpenSM makefiles (a define for RMPP changed in them) or just rebuild ?
> > [This would not explain the crash, but is different from how my
> > OpenSM is built.]
>
> I just reran make from the toplevel (management) after updating. I
> would think it would rebuild them if something changed, no?

There are certain changes where the makefiles need to be regenerated (and this is not done automatically). Since there was an additional compile flag added, they need to be regenerated or else it is being built the old way (without the real RMPP support enabled).

-- Hal
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 17:37, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 17:33, Roland Dreier wrote:
> > > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
> >
> >     Hal> I take that back. That's just a lot of MADs have been sent
> >     Hal> (on the IB wire). OpenSM was probably up and running for a
> >     Hal> while...
> >
> > I find it hard to believe that OpenSM has sent 4 billion MADs --
> > that's more than 1000 MADs a second for a solid month. It also looks
> > very suspicious that the value is equal to ((unsigned int) -1).
>                                              ^^ on a 32 bit machine.
>
> Good point. The fact that it gets to -1 is significant as I think that
> is used as a magic value for some computations.

I'm pretty sure that I see a way this could have gone negative in the vendor layer. I'm working on a patch for this.

-- Hal
Re: [openib-general] OpenSM crash
    Roland> I find it hard to believe that OpenSM has sent 4 billion
    Roland> MADs -- that's more than 1000 MADs a second for a solid
    Roland> month. It also looks very suspicious that the value is
    Roland> equal to ((unsigned int) -1).

    Hal> ^^ on a 32 bit machine.

This is really a very minor point but the following program

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        printf("%u\n", ((unsigned int) -1));
        return 0;
    }

prints 4294967295 on any 64-bit Linux machine I have access to...

- R.
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 17:33, Roland Dreier wrote:
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
>
>     Hal> I take that back. That's just a lot of MADs have been sent
>     Hal> (on the IB wire). OpenSM was probably up and running for a
>     Hal> while...
>
> I find it hard to believe that OpenSM has sent 4 billion MADs --
> that's more than 1000 MADs a second for a solid month. It also looks
> very suspicious that the value is equal to ((unsigned int) -1).
                                             ^^ on a 32 bit machine.

Good point. The fact that it gets to -1 is significant as I think that is used as a magic value for some computations.

-- Hal
Re: [openib-general] OpenSM crash
> May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.

    Hal> I take that back. That's just a lot of MADs have been sent
    Hal> (on the IB wire). OpenSM was probably up and running for a
    Hal> while...

I find it hard to believe that OpenSM has sent 4 billion MADs -- that's more than 1000 MADs a second for a solid month. It also looks very suspicious that the value is equal to ((unsigned int) -1).

 - R.
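The "more than 1000 MADs a second" estimate can be checked with a short calculation. This is a sketch that assumes a "solid month" means 30 days of continuous uptime; the helper name avg_rate_per_second is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of messages per second needed to reach `count`
 * messages in `days` days of continuous operation. */
static double avg_rate_per_second(uint64_t count, unsigned days)
{
    assert(days > 0); /* a zero-length interval has no meaningful rate */
    const double seconds = (double)days * 24.0 * 60.0 * 60.0;
    return (double)count / seconds;
}
```

For count = 4294967295 and days = 30 this comes to roughly 1657 MADs per second, so the figure quoted in the message holds under that assumption.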
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 17:15 -0400, Hal Rosenstock wrote: > On Fri, 2005-05-27 at 14:31, Tom Duffy wrote: > > On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote: > > > I just noticed that my opensm had segv'ed and dumped core. > > > > BTW, here was the tail of the osm.log: > > > > May 27 01:44:09 [43005960] -> osm_vendor_get: [ > > May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = > > 0x5678f0 (mad 0x5f33f0 req 1) > > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = > > 0x567908, size = 256. > > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size > > = 256. > > May 27 01:44:09 [43005960] -> osm_vendor_get: ] > > May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, > > p_mad = 0x5f3670, size = 256. > > May 27 01:44:09 [43005960] -> osm_mad_pool_get: ] > > May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), > > modifier = 0x10001, TID = 0x1c149. > > May 27 01:44:09 [43005960] -> osm_vl15_post: [ > > May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 > > (mad 0x5f3670 req 1) > > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 > > MADs outstanding. >^^ > This looks weird. > > > May 27 01:44:09 [43005960] -> osm_vl15_poll: [ > > May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread. > > May 27 01:44:09 [43005960] -> osm_vl15_poll: ] > > May 27 01:44:09 [43005960] -> osm_vl15_post: ] > > May 27 01:44:09 [43005960] -> osm_req_get: ] > > May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ] > > May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ] > > May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ] > > May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [ > > Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack > shown in the previous email as this makes it look like it should be. > > Could you go back a little further in the log ? 
> I'd like to see what is
> before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
> osm_pi_rcv_process.

The log had grown to almost 1G, so I actually deleted it. Shit, sorry.

> It also seems weird to me that there is no other
> log message between these two.
>
> From the stack trace:
> #3 osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
>    at osm_helper.c:1446
> #4 0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575
>
> It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller
>
>   if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
>   {
>     if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
>     {
>       osm_log( p_vl->p_log, OSM_LOG_DEBUG,
>                "__osm_vl15_poller: "
>                "Servicing p_madw = %p (mad %p req %d)\n",
>                p_madw, p_madw->p_mad, p_madw->resp_expected);
>     }
>
>     if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
>     {
>       osm_dump_dr_smp( p_vl->p_log,
>                        osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );   <=== here
>     }
>
> when it died but I didn't see the previous log message in the code
> "osm_vl15_poller: Servicing p_madw" which I also would have expected.
> [This would have been telling as p_madw->p_mad would have been logged].
> I also didn't see the __osm_vl15_poller entry message either.

well, if it segv'ed maybe it never finished writing out to the file...

-tduffy
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 16:25 -0400, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 15:26, Tom Duffy wrote:
> > > Also, what version of OpenSM are you using ?
> >
> > It was pretty close to the head of the tree, although a couple of files
> > were updated when I did a svn update after the crash.
>
> When was your last update of OpenSM ? Was it after Tues AM ?

To be honest, I can't remember.

> Also, did
> you pick up the user_mad.c fix on Tuesday AM ? If it was, any other
> changes are either not related or trivial.
>
> After you picked up these changes, did you regenerate the various OpenSM
> makefiles (a define for RMPP changed in them) or just rebuild ? [This
> would not explain the crash, but is different from how my OpenSM is
> built.]

I just reran make from the toplevel (management) after updating. I would think it would rebuild them if something changed, no?

-tduffy
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 17:15, Hal Rosenstock wrote:
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
> ^^
> This looks weird.

I take that back. That's just a lot of MADs have been sent (on the IB wire). OpenSM was probably up and running for a while...

-- Hal
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 14:31, Tom Duffy wrote: > On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote: > > I just noticed that my opensm had segv'ed and dumped core. > > BTW, here was the tail of the osm.log: > > May 27 01:44:09 [43005960] -> osm_vendor_get: [ > May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 > (mad 0x5f33f0 req 1) > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = > 0x567908, size = 256. > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = > 256. > May 27 01:44:09 [43005960] -> osm_vendor_get: ] > May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, > p_mad = 0x5f3670, size = 256. > May 27 01:44:09 [43005960] -> osm_mad_pool_get: ] > May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), > modifier = 0x10001, TID = 0x1c149. > May 27 01:44:09 [43005960] -> osm_vl15_post: [ > May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad > 0x5f3670 req 1) > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs > outstanding. ^^ This looks weird. > May 27 01:44:09 [43005960] -> osm_vl15_poll: [ > May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread. > May 27 01:44:09 [43005960] -> osm_vl15_poll: ] > May 27 01:44:09 [43005960] -> osm_vl15_post: ] > May 27 01:44:09 [43005960] -> osm_req_get: ] > May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ] > May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ] > May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ] > May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [ Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack shown in the previous email as this makes it look like it should be. Could you go back a little further in the log ? I'd like to see what is before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and osm_pi_rcv_process. 
It also seems weird to me that there is no other log message between these two.

From the stack trace:
#3 osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
   at osm_helper.c:1446
#4 0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575

It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller

  if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
  {
    if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
    {
      osm_log( p_vl->p_log, OSM_LOG_DEBUG,
               "__osm_vl15_poller: "
               "Servicing p_madw = %p (mad %p req %d)\n",
               p_madw, p_madw->p_mad, p_madw->resp_expected);
    }

    if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
    {
      osm_dump_dr_smp( p_vl->p_log,
                       osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );   <=== here
    }

when it died but I didn't see the previous log message in the code "osm_vl15_poller: Servicing p_madw" which I also would have expected. [This would have been telling as p_madw->p_mad would have been logged]. I also didn't see the __osm_vl15_poller entry message either.

-- Hal
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 15:26, Tom Duffy wrote:
> > Also, what version of OpenSM are you using ?
>
> It was pretty close to the head of the tree, although a couple of files
> were updated when I did a svn update after the crash.

When was your last update of OpenSM ? Was it after Tues AM ? Also, did you pick up the user_mad.c fix on Tuesday AM ? If it was, any other changes are either not related or trivial.

After you picked up these changes, did you regenerate the various OpenSM makefiles (a define for RMPP changed in them) or just rebuild ? [This would not explain the crash, but is different from how my OpenSM is built.]

-- Hal
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 14:54 -0400, Hal Rosenstock wrote:
> Anything "special" about your configuration/what was going on ?

This was in the middle of the night. I wasn't doing anything to the systems at the time.

> Can you reproduce this ?

nope.

> Also, what version of OpenSM are you using ?

It was pretty close to the head of the tree, although a couple of files were updated when I did a svn update after the crash.

-tduffy

--
I wish we lived in the America of yesteryear that only exists in the minds of us Republicans. -- Ned Flanders
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 14:27, Tom Duffy wrote:
> I just noticed that my opensm had segv'ed and dumped core. Here is the
> gdb backtrace.
>
> #0 stack_dump () at src/stack.c:72
> 72 if (!__builtin_frame_address(2))
> (gdb) bt
> #0 stack_dump () at src/stack.c:72
> #1 0x2abb71a6 in handler (x=11) at src/stack.c:151
> #2

Looks like osm_dump_dr_smp was called with a NULL p_smp, so osm_madw_get_smp_ptr(p_madw) returned NULL for some unknown reason. That is an unexpected (should not occur) condition.

> #3 osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
>    at osm_helper.c:1446
> #4 0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575
> #5 0x2adc911e in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61
> #6 0x0036d28060aa in start_thread () from /lib64/tls/libpthread.so.0
> #7 0x0036d19c53d3 in clone () from /lib64/tls/libc.so.6
> #8 0x in ?? ()

Anything "special" about your configuration/what was going on ? Can you reproduce this ? Also, what version of OpenSM are you using ?

-- Hal
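The failure mode described in the reply above (a dump routine dereferencing a NULL p_smp) can be illustrated with a small defensive sketch. The types smp_t and madw_t and the functions madw_get_smp and dump_smp_checked are hypothetical stand-ins written for this example. They are not the OpenSM definitions, and a guard like this would only avoid the crash at the dump site, not explain why the pointer was NULL.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the OpenSM structures. */
typedef struct { unsigned char data[256]; } smp_t;
typedef struct { smp_t *p_mad; } madw_t;

/* Accessor that simply returns the stored pointer, which may be NULL. */
static smp_t *madw_get_smp(const madw_t *p_madw)
{
    return p_madw->p_mad;
}

/* Dump helper that refuses a NULL SMP instead of dereferencing it.
 * Returns 0 if something was dumped, -1 if the dump was skipped. */
static int dump_smp_checked(FILE *out, const smp_t *p_smp)
{
    if (p_smp == NULL) {
        fprintf(out, "dump_smp_checked: NULL SMP, skipping dump\n");
        return -1;
    }
    fprintf(out, "dump_smp_checked: first byte 0x%02x\n", p_smp->data[0]);
    return 0;
}
```

With a wrapper whose p_mad is NULL, dump_smp_checked(stderr, madw_get_smp(&w)) returns -1 and logs the condition rather than faulting.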
Re: [openib-general] OpenSM crash
On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> I just noticed that my opensm had segv'ed and dumped core.

BTW, here was the tail of the osm.log:

May 27 01:44:09 [43005960] -> osm_vendor_get: [
May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 (mad 0x5f33f0 req 1)
May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x567908, size = 256.
May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = 256.
May 27 01:44:09 [43005960] -> osm_vendor_get: ]
May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, p_mad = 0x5f3670, size = 256.
May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), modifier = 0x10001, TID = 0x1c149.
May 27 01:44:09 [43005960] -> osm_vl15_post: [
May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad 0x5f3670 req 1)
May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
May 27 01:44:09 [43005960] -> osm_vl15_poll: [
May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
May 27 01:44:09 [43005960] -> osm_vl15_post: ]
May 27 01:44:09 [43005960] -> osm_req_get: ]
May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [

-tduffy
[openib-general] OpenSM crash
I just noticed that my opensm had segv'ed and dumped core. Here is the gdb backtrace.

#0 stack_dump () at src/stack.c:72
72 if (!__builtin_frame_address(2))
(gdb) bt
#0 stack_dump () at src/stack.c:72
#1 0x2abb71a6 in handler (x=11) at src/stack.c:151
#2
#3 osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
   at osm_helper.c:1446
#4 0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575
#5 0x2adc911e in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61
#6 0x0036d28060aa in start_thread () from /lib64/tls/libpthread.so.0
#7 0x0036d19c53d3 in clone () from /lib64/tls/libc.so.6
#8 0x in ?? ()

-tduffy

--
I wish we lived in the America of yesteryear that only exists in the minds of us Republicans. -- Ned Flanders
Re: [openib-general] opensm crash
Hi Tom,

On Thu, 2005-01-27 at 12:53, Tom Duffy wrote:
> I hit control-c to kill osm and got:
>
> Jan 27 18:47:09 [44808960] -> osm_mad_pool_get: [
> opensm[4627]: *** exception handler: died with signal 11
> Segmentation fault

Looks to me like the following could be the case: One thread was shutting down the OSM (osm_opensm_destroy was called and got at least as far as destroying the SA; subsequent to this the MAD pool is destroyed) and another thread attempted a get from the MAD pool. I'm not sure what would prevent this from occurring. I am looking into this crash further and am trying to reproduce the same.

-- Hal

> Here is the last 100 lines of the osm.log
>
> [EMAIL PROTECTED] bin]# tail -100 /var/log/osm.log
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Retiring MAD with TID = 0x2bf9.
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: [
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: Releasing p_madw = 0x56d9c0, p_mad = 0x599140.
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: [
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: Retiring UMAD 0x599140.
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: ]
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: ]
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: 0 QP0 MADs outstanding.
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Posting Dispatcher message OSM_MSG_NO_SMPS_OUTSTANDING.
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: ]
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: ]
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: [
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_LIGHT.
> Jan 27 18:47:04 [43005960] -> __osm_state_mgr_light_sweep_done_msg: > > > ** > ** LIGHT SWEEP COMPLETE ** > ** > > > Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal > OSM_SIGNAL_IDLE_TIME_PROCESS in state OSM_SM_STATE_PROCESS_REQUEST. > Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: [ > Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: ] > Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: ] > Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: [ > Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: ] > Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: [ > Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: ] > Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: [ > Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: Off schedule sweep signalled. > Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: ] > Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: [ > Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: [ > Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: event_wheel ptr:0x5575f8 > Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: ] > Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: [ > Jan 27 18:47:09 
[9597F060] -> osm_ucast_mgr_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: ] > Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: [ > Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: ] > Jan
[openib-general] opensm crash
I hit control-c to kill osm and got: Jan 27 18:47:09 [44808960] -> osm_mad_pool_get: [ opensm[4627]: *** exception handler: died with signal 11 Segmentation fault Here is the last 100 lines of the osm.log [EMAIL PROTECTED] bin]# tail -100 /var/log/osm.log Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Retiring MAD with TID = 0x2bf9. Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: [ Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: Releasing p_madw = 0x56d9c0, p_mad = 0x599140. Jan 27 18:47:04 [43005960] -> osm_vendor_put: [ Jan 27 18:47:04 [43005960] -> osm_vendor_put: Retiring UMAD 0x599140. Jan 27 18:47:04 [43005960] -> osm_vendor_put: ] Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: ] Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: 0 QP0 MADs outstanding. Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Posting Dispatcher message OSM_MSG_NO_SMPS_OUTSTANDING. Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: ] Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: ] Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: [ Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_LIGHT. Jan 27 18:47:04 [43005960] -> __osm_state_mgr_light_sweep_done_msg: ** ** LIGHT SWEEP COMPLETE ** ** Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_IDLE_TIME_PROCESS in state OSM_SM_STATE_PROCESS_REQUEST. 
Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: [ Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: ] Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: ] Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: [ Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: ] Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: [ Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: ] Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: [ Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: Off schedule sweep signalled. Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: ] Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: [ Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: [ Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: event_wheel ptr:0x5575f8 Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: ] Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: [ Jan 27 18:47:09 
[9597F060] -> osm_lft_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_mcast_mgr_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_mcast_mgr_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_sa_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_nr_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_nr_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_pir_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_pir_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_lr_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_lr_rcv_destroy: ] Jan 27 18:47:09 [9597F060] -> osm_pr_rcv_destroy: [ Jan 27 18:47:09 [9597F060] -> osm_pr_rcv_destroy: ] Jan 27 18: