Re: [openib-general] opensm crash with topspin HCA

2006-11-02 Thread Hal Rosenstock
On Thu, 2006-11-02 at 13:33, Viswanath Krishnamurthy wrote:
> 
> When we run the opensm (OFED) release and a Topspin HCA is in the IB
> network, opensm crashes in umad_receiver with a NULL pointer exception.
> The transaction ID is zero in the MADs from the Topspin HCA on Windows.
> The crashes seem to occur at random in umad_receiver.

What OpenSM version ? 

There was a problem like this fixed back in August:

r8920 | halr | 2006-08-14 09:09:28 -0400 (Mon, 14 Aug 2006) | 11 lines

OpenSM/osm_vendor_ibumad.c: In get_madw, check for TID 0 (resolves
NULL ptr crash with Cisco stack)

This change fixes an OSM crash when working with Cisco's stack.
Cisco's stack doesn't follow the same TID convention when generating transaction
id which in some bad flow revealed this bug in the get_madw lookup.

The bug was in get_madw which does not detect lookup of its reserved "free"
entry of key==0.

Signed-off-by: Yevgeny Kliteynik <[EMAIL PROTECTED]>
Signed-off-by: Hal Rosenstock <[EMAIL PROTECTED]>
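
The failure mode described in the commit message can be sketched in a few lines of C. This is only an illustration, not the OpenSM source: the table layout and the names pending_entry, put_request, and get_request are hypothetical, and it assumes a match table that marks free slots with TID 0 and leaves the old pointer in place when a slot is released. Under those assumptions, a response carrying TID 0 would match a free slot and hand back a stale pointer unless the lookup rejects key 0 up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8                /* hypothetical table size */

/* Hypothetical pending-request table: tid == 0 marks a free slot. */
struct pending_entry {
    uint64_t tid;                   /* 0 means "slot is free" */
    void *madw;                     /* request wrapper; may be stale when free */
};

static struct pending_entry table[TABLE_SIZE];

/* Store a request under a nonzero TID; returns 0 on success, -1 otherwise. */
int put_request(uint64_t tid, void *madw)
{
    if (tid == 0 || madw == NULL)
        return -1;                  /* never store under the reserved key */
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i].tid == 0) {
            table[i].tid = tid;
            table[i].madw = madw;
            return 0;
        }
    }
    return -1;                      /* table full */
}

/* Look up and release the request matching a response TID. */
void *get_request(uint64_t tid)
{
    if (tid == 0)
        return NULL;                /* guard: key 0 is the "free" marker */
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i].tid == tid) {
            table[i].tid = 0;       /* slot is free again; madw stays stale */
            return table[i].madw;
        }
    }
    return NULL;                    /* no matching request */
}

void demo_tid_guard(void)
{
    int req = 42;

    assert(put_request(7, &req) == 0);
    assert(get_request(7) == &req); /* normal response matches its request */
    assert(get_request(0) == NULL); /* TID 0 never matches a free slot */
}
```

Without the early return in get_request, the last lookup in demo_tid_guard would match the slot that was just released and return the stale pointer to req.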

-- Hal

> 
> 
> 
> HCA found:
> 
> hca_id=InfiniHost0
> 
> vendor_id=0x02C9
> 
> vendor_part_id=0x5A44
> 
> hw_ver=0xA0
> 
> fw_ver=0x40006
> 
> 
> 
> 
> __
> 
> ___
> openib-general mailing list
> openib-general@openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



Re: [openib-general] opensm crash with topspin HCA

2006-11-02 Thread Sasha Khapyorsky
On 10:33 Thu 02 Nov , Viswanath Krishnamurthy wrote:
> When we run the opensm (OFED) release and a Topspin HCA is in the IB network,
> opensm crashes in umad_receiver with a NULL pointer exception.

Do you have any logs, gdb backtrace or any other details?

Sasha

> The
> transaction ID is zero in the MADs from the Topspin HCA on Windows. The crashes
> seem to occur at random in umad_receiver.
> 
> 
> HCA found:
> hca_id=InfiniHost0
> vendor_id=0x02C9
> vendor_part_id=0x5A44
> hw_ver=0xA0
> fw_ver=0x40006




[openib-general] opensm crash with topspin HCA

2006-11-02 Thread Viswanath Krishnamurthy
When we run the opensm (OFED) release and a Topspin HCA is in the IB network, opensm crashes in umad_receiver with a NULL pointer exception. The transaction ID is zero in the MADs from the Topspin HCA on Windows. The crashes seem to occur at random in umad_receiver.

HCA found:
    hca_id=InfiniHost0
    vendor_id=0x02C9
    vendor_part_id=0x5A44
    hw_ver=0xA0
    fw_ver=0x40006


Re: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk

2005-10-31 Thread James Lentini

Thanks for the patch, Aniruddha. Can you resend it with a Signed-off-by
line?

See "How do I submit source code patches?" at 
https://openib.org/tiki/tiki-index.php?page=OpenIBFAQ


> Also a minor patch, you can see that %P is printed as %P and not used as
> a format character.
> 
> Index: common/dapl_ep_post_rdma_write.c
> ===
> --- common/dapl_ep_post_rdma_write.c(revision 3892)
> +++ common/dapl_ep_post_rdma_write.c(working copy)
> @@ -78,7 +78,7 @@
> DAT_RETURN dat_status;
> 
> dapl_dbg_log (DAPL_DBG_TYPE_API,
> - "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n",
> + "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n",
>  ep_handle,
>  num_segments,
>  local_iov,
> Index: common/dapl_ep_post_send.c
> ===
> --- common/dapl_ep_post_send.c  (revision 3892)
> +++ common/dapl_ep_post_send.c  (working copy)
> @@ -75,7 +75,7 @@
> DAT_RETURN dat_status;
> 
> dapl_dbg_log (DAPL_DBG_TYPE_API,
> - "dapl_ep_post_send (%p, %d, %p, %P, %x)\n",
> + "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
>  ep_handle,
>  num_segments,
>  local_iov,
> Index: common/dapl_srq_post_recv.c
> ===
> --- common/dapl_srq_post_recv.c (revision 3892)
> +++ common/dapl_srq_post_recv.c (working copy)
> @@ -79,7 +79,7 @@
> DAT_RETURN dat_status;
> 
> dapl_dbg_log (DAPL_DBG_TYPE_API,
> - "dapl_srq_post_recv (%p, %d, %p, %P)\n",
> + "dapl_srq_post_recv (%p, %d, %p, %p)\n",
>  srq_handle,
>  num_segments,
>  local_iov,
> Index: common/dapl_ep_post_recv.c
> ===
> --- common/dapl_ep_post_recv.c  (revision 3892)
> +++ common/dapl_ep_post_recv.c  (working copy)
> @@ -79,7 +79,7 @@
> DAT_RETURN dat_status;
> 
> dapl_dbg_log (DAPL_DBG_TYPE_API,
> - "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n",
> + "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n",
>  ep_handle,
>  num_segments,
>  local_iov,
> 
> Thanks
> Aniruddha
> 
> 
> 


Re: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Arlin Davis

Aniruddha Bohra wrote:



> Now I have a problem with uDAPL.
>
> The following is a code snippet from dapl_ib_dto.h:
>
> for (i = 0; i < segments; i++ ) {
>     if ( !local_iov[i].segment_length )
>         continue;
>
>     ds_array_p->addr  = (uint64_t) local_iov[i].virtual_address;
>     ds_array_p->length = local_iov[i].segment_length;
>     ds_array_p->lkey  = local_iov[i].lmr_context;
>
>     dapl_dbg_log (  DAPL_DBG_TYPE_EP,
>                     " post_snd: lkey 0x%x va %p len %d \n",
>                     ds_array_p->lkey, ds_array_p->addr,
>                     ds_array_p->length );
>
>     total_len += ds_array_p->length;
>     wr.num_sge++;
>     ds_array_p++;
> }
>
> The following is the relevant part of the log with DAPL_DBG_TYPE=0x
>
> dapl_ep_post_send (0x8087110, 2, 0x80f9910, %P, b5f395bc)
> post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x80f9910 r_iov 0xbfc29060 f 0
> post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x80f9910
> post_snd: lkey 0x10de003b va 0xb5f3976c len 0
> post_snd: lkey 0x10de003b va 0xb5f39924 len 0
>
> Given the loop above, how is this possible?
> If local_iov[i].segment_length == 0, the segment is skipped and nothing should
> be printed. And if the assignment succeeds, len cannot be 0.
>
> Any ideas? Of course, following this, the ep is disconnected in the next
> step :(


The local_iov (LMR) length is 64 bits and the ibv_sge (ds_array) length is 32
bits, so it truncates.

Sounds like you set up a transfer greater than 4GB-1?

If you query the device via uDAPL you will see the max limits (2GB):

query_hca: (a0.0) ep 64512 ep_q 65535 evd 65408 evd_q 131071
query_hca: msg 2147483648 rdma 2147483648 iov 59 lmr 131056 rmr 0
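
The truncation can be reproduced in isolation. The sketch below is not uDAPL code: seg_len64 and sge_len32 are hypothetical stand-ins for the 64-bit segment length and the 32-bit ibv_sge length, and checked_length is one possible way to reject an oversized segment instead of silently dropping the upper bits. It shows how a nonzero 64-bit length can pass a "length != 0" test and still become 0 once stored in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins: a 64-bit segment length going into a 32-bit field. */
typedef uint64_t seg_len64;     /* plays the role of the LMR segment length */
typedef uint32_t sge_len32;     /* plays the role of the ibv_sge length */

/* Plain assignment: the upper 32 bits are silently discarded. */
sge_len32 truncate_length(seg_len64 len)
{
    return (sge_len32)len;
}

/* Checked conversion: returns 0 on success, -1 if len does not fit. */
int checked_length(seg_len64 len, sge_len32 *out)
{
    if (len > UINT32_MAX)
        return -1;
    *out = (sge_len32)len;
    return 0;
}

void demo_truncation(void)
{
    sge_len32 out = 0;
    seg_len64 four_gib = (seg_len64)1 << 32;

    assert(four_gib != 0);                          /* passes a zero-length check */
    assert(truncate_length(four_gib) == 0);         /* yet becomes 0 in 32 bits */
    assert(truncate_length(four_gib + 5) == 5);     /* only the low 32 bits survive */
    assert(checked_length(four_gib, &out) == -1);   /* rejected instead */
    assert(checked_length(4096, &out) == 0 && out == 4096);
}
```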

-arlin





uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Roland Dreier wrote:


>> OK so, what options do I have right now -- compile a new kernel and
>> apply patches and
>> continue, or is there some patch that I can apply ?
>
> I don't think anyone has prepared a kzalloc() patch, but just adding
> something like
>
> static void *kzalloc(size_t size, unsigned int flags)
> {
>         void *ret = kmalloc(size, flags);
>         if (ret)
>                 memset(ret, 0, size);
>         return ret;
> }
>
> to files that use kzalloc() should let you use 2.6.13 (assuming there
> are no other incompatibilities).




Thanks, that works.

Now I have a problem with uDAPL.

The following is a code snippet from dapl_ib_dto.h:

for (i = 0; i < segments; i++ ) {
    if ( !local_iov[i].segment_length )
        continue;

    ds_array_p->addr  = (uint64_t) local_iov[i].virtual_address;
    ds_array_p->length = local_iov[i].segment_length;
    ds_array_p->lkey  = local_iov[i].lmr_context;

    dapl_dbg_log (  DAPL_DBG_TYPE_EP,
                    " post_snd: lkey 0x%x va %p len %d \n",
                    ds_array_p->lkey, ds_array_p->addr,
                    ds_array_p->length );

    total_len += ds_array_p->length;
    wr.num_sge++;
    ds_array_p++;
}

The following is the relevant part of the log with DAPL_DBG_TYPE=0x

dapl_ep_post_send (0x8087110, 2, 0x80f9910, %P, b5f395bc)
post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x80f9910 r_iov 0xbfc29060 f 0
post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x80f9910
post_snd: lkey 0x10de003b va 0xb5f3976c len 0
post_snd: lkey 0x10de003b va 0xb5f39924 len 0




Given the loop above, how is this possible?
If local_iov[i].segment_length == 0, the segment is skipped and nothing should
be printed. And if the assignment succeeds, len cannot be 0.

Any ideas? Of course, following this, the ep is disconnected in the next
step :(


Also a minor patch, you can see that %P is printed as %P and not used as
a format character.

Index: common/dapl_ep_post_rdma_write.c
===
--- common/dapl_ep_post_rdma_write.c  (revision 3892)
+++ common/dapl_ep_post_rdma_write.c  (working copy)
@@ -78,7 +78,7 @@
DAT_RETURN dat_status;

dapl_dbg_log (DAPL_DBG_TYPE_API,
- "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n",
+ "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n",
 ep_handle,
 num_segments,
 local_iov,
Index: common/dapl_ep_post_send.c
===
--- common/dapl_ep_post_send.c  (revision 3892)
+++ common/dapl_ep_post_send.c  (working copy)
@@ -75,7 +75,7 @@
DAT_RETURN dat_status;

dapl_dbg_log (DAPL_DBG_TYPE_API,
- "dapl_ep_post_send (%p, %d, %p, %P, %x)\n",
+ "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
 ep_handle,
 num_segments,
 local_iov,
Index: common/dapl_srq_post_recv.c
===
--- common/dapl_srq_post_recv.c (revision 3892)
+++ common/dapl_srq_post_recv.c (working copy)
@@ -79,7 +79,7 @@
DAT_RETURN dat_status;

dapl_dbg_log (DAPL_DBG_TYPE_API,
- "dapl_srq_post_recv (%p, %d, %p, %P)\n",
+ "dapl_srq_post_recv (%p, %d, %p, %p)\n",
 srq_handle,
 num_segments,
 local_iov,
Index: common/dapl_ep_post_recv.c
===
--- common/dapl_ep_post_recv.c  (revision 3892)
+++ common/dapl_ep_post_recv.c  (working copy)
@@ -79,7 +79,7 @@
DAT_RETURN dat_status;

dapl_dbg_log (DAPL_DBG_TYPE_API,
- "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n",
+ "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n",
 ep_handle,
 num_segments,
 local_iov,
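
For reference, printf-style conversion specifiers are case-sensitive: %p is the pointer conversion, while %P is not defined by standard C, so what a given libc does with it should not be relied on. The sketch below shows the corrected format string in a standalone helper; format_post_send is a hypothetical name and is not part of uDAPL, and snprintf stands in for dapl_dbg_log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a trace line using the corrected specifiers. */
int format_post_send(char *buf, size_t size, void *ep, int segs,
                     void *iov, void *cookie, unsigned int flags)
{
    /* %p expects a void * argument; all three pointers are passed as such. */
    return snprintf(buf, size, "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
                    ep, segs, iov, cookie, flags);
}

void demo_format(void)
{
    char buf[128];
    int a = 0, b = 0, c = 0;
    int n = format_post_send(buf, sizeof buf, &a, 2, &b, &c, 0x10u);

    assert(n > 0 && (size_t)n < sizeof buf);
    assert(strstr(buf, "%P") == NULL);      /* no unexpanded specifier left */
    assert(strstr(buf, ", 2, ") != NULL);   /* segment count was formatted */
    assert(strstr(buf, ", 10)\n") != NULL); /* flags printed in hex */
}
```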

Thanks
Aniruddha





Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Roland Dreier
> OK so, what options do I have right now -- compile a new kernel and
> apply patches and
> continue, or is there some patch that I can apply ?

I don't think anyone has prepared a kzalloc() patch, but just adding
something like

static void *kzalloc(size_t size, unsigned int flags)
{
        void *ret = kmalloc(size, flags);
        if (ret)
                memset(ret, 0, size);
        return ret;
}

to files that use kzalloc() should let you use 2.6.13 (assuming there
are no other incompatibilities).
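
The same allocate-then-zero pattern can be tried outside the kernel. The sketch below is a userspace analogue only: zalloc is a hypothetical name, malloc() stands in for kmalloc(), and there is no flags argument. In a kernel tree, such a shim would presumably also need to be compiled only for kernels older than 2.6.14, so that it does not clash with the real kzalloc(); that part is an assumption and is not shown here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace analogue of the shim: malloc() stands in for kmalloc(). */
void *zalloc(size_t size)
{
    void *ret = malloc(size);

    if (ret)
        memset(ret, 0, size);   /* zero only if the allocation succeeded */
    return ret;
}

void demo_zalloc(void)
{
    unsigned char *p = zalloc(64);

    assert(p != NULL);
    for (size_t i = 0; i < 64; i++)
        assert(p[i] == 0);      /* every byte starts out zeroed */
    free(p);
}
```

In ordinary userspace code calloc() already provides this behavior; the wrapper is written out here only to mirror the structure of the kernel shim.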

 - R.


Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Roland Dreier wrote:


>> With 3892 I now get the following warnings on compilation:
>> WARNING:
>> /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko
>> needs unknown symbol kzalloc
>> WARNING:
>> /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko
>> needs unknown symbol kzalloc
>
> Yes, kzalloc() was added in 2.6.14.  Now that 2.6.14 has been
> released, the subversion trunk is targeted against that kernel rather
> than the old 2.6.13 release.
>
> - R.

OK so, what options do I have right now -- compile a new kernel and
apply patches and continue, or is there some patch that I can apply?

Thanks
Aniruddha



Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Roland Dreier
> With 3892 I now get the following warnings on compilation:
> WARNING:
> /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko
> needs unknown symbol kzalloc
> WARNING:
> /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko
> needs unknown symbol kzalloc

Yes, kzalloc() was added in 2.6.14.  Now that 2.6.14 has been
released, the subversion trunk is targeted against that kernel rather
than the old 2.6.13 release.

 - R.


Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Roland Dreier wrote:


>> Now there is an OOPS in the dmesg :
>
> This really looks like the bug I fixed in r3889.  What svn rev are
> your kernel modules built from?
>
> - R.


And of course, the module does not load :
Oct 28 16:21:57 hora-3 kernel: ib_mthca: Unknown symbol kzalloc
Oct 28 16:21:58 hora-3 kernel: ib_umad: Unknown symbol kzalloc

Aniruddha




Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Roland Dreier wrote:


>> Now there is an OOPS in the dmesg :
>
> This really looks like the bug I fixed in r3889.  What svn rev are
> your kernel modules built from?
>
> - R.

With 3892 I now get the following warnings on compilation:
WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko needs unknown symbol kzalloc
WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko needs unknown symbol kzalloc



Aniruddha



Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Roland Dreier
> Now there is an OOPS in the dmesg :

This really looks like the bug I fixed in r3889.  What svn rev are
your kernel modules built from?

 - R.


Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Hal Rosenstock wrote:


> Or perhaps something crashed and didn't clean up properly. Does this occur
> immediately after a boot ?

 



After a fresh reboot of the machines on the switch, I get the log at
http://www.cs.rutgers.edu/~bohra/osm-v2.log

The opensm process does not crash but hangs. The state of the port never 
changes.


Now there is an OOPS in the dmesg :

Oct 28 13:52:13 hora-3 OpenSM[5168]: OpenSM Rev:openib-1.1.0
Oct 28 13:52:14 hora-3 kernel: Unable to handle kernel paging request at virtual address 0910
Oct 28 13:52:14 hora-3 kernel:  printing eip:
Oct 28 13:52:14 hora-3 kernel: f883f12d
Oct 28 13:52:14 hora-3 kernel: *pde =
Oct 28 13:52:14 hora-3 kernel: Oops:  [#1]
Oct 28 13:52:14 hora-3 kernel: SMP
Oct 28 13:52:14 hora-3 kernel: Modules linked in: ib_uverbs ib_umad ipv6 i2c_dev i2c_core sunrpc dm_mod video button battery ac uhci_hcd hw_random ib_mthca ib_mad ib_core e1000 floppy
Oct 28 13:52:14 hora-3 kernel: CPU:1
Oct 28 13:52:14 hora-3 kernel: EIP:0060:[]Not tainted VLI
Oct 28 13:52:14 hora-3 kernel: EFLAGS: 00010286   (2.6.13bohra)
Oct 28 13:52:14 hora-3 kernel: EIP is at ib_post_send_mad+0x1c/0x1b1 [ib_mad]
Oct 28 13:52:14 hora-3 kernel: eax: 0900   ebx: c1a7d900   ecx: c1a7d918   edx:
Oct 28 13:52:14 hora-3 kernel: esi: c1a7d918   edi: f6571f68   ebp: f6571efc   esp: f6571ed8
Oct 28 13:52:14 hora-3 kernel: ds: 007b   es: 007b   ss: 0068
Oct 28 13:52:14 hora-3 kernel: Process opensm (pid: 5224, threadinfo=f657 task=f7dfb020)
Oct 28 13:52:14 hora-3 kernel: Stack: f883ef5a  c1a7d800 080bd018 f6571efc  f6a42900 a0f684f6
Oct 28 13:52:14 hora-3 kernel:f6571f68 f6571f74 f88f1728  0018 00e8 00d0 f6a42948
Oct 28 13:52:14 hora-3 kernel:f68bda24  0009 a0f684f6 0009 c1a7d918  0100
Oct 28 13:52:14 hora-3 kernel: Call Trace:
Oct 28 13:52:14 hora-3 kernel:  [] show_stack+0x7c/0x92
Oct 28 13:52:14 hora-3 kernel:  [] show_registers+0x152/0x1ca
Oct 28 13:52:14 hora-3 kernel:  [] die+0xf4/0x16f
Oct 28 13:52:14 hora-3 kernel:  [] do_page_fault+0x463/0x649
Oct 28 13:52:14 hora-3 kernel:  [] error_code+0x4f/0x54
Oct 28 13:52:14 hora-3 kernel:  [] ib_umad_write+0x2d0/0x30e [ib_umad]
Oct 28 13:52:14 hora-3 kernel:  [] vfs_write+0x155/0x15a
Oct 28 13:52:14 hora-3 kernel:  [] sys_write+0x3d/0x64
Oct 28 13:52:14 hora-3 kernel:  [] sysenter_past_esp+0x54/0x75
Oct 28 13:52:14 hora-3 kernel: Code: e8 d8 63 af c7 89 d8 83 c4 0c 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 18 85 f6 89 55 f0 0f 84 ff 00 00 00 8b 46 08 8d 5e e8 <8b> 50 10 8b 7b 14 85 d2 0f 84 7c 01 00 00 8b 4e 18 85 c9 74 0b



Thanks
Aniruddha






Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Hal Rosenstock wrote:


> Or perhaps something crashed and didn't clean up properly. Does this occur
> immediately after a boot ?

 


This is after a clean reboot.
There are two systems on the switch and this is the only active one.
I will reboot both and see again.

Thanks
Aniruddha






RE: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Hal Rosenstock
Or perhaps something crashed and didn't clean up properly. Does this occur 
immediately after a boot ?



From: [EMAIL PROTECTED] on behalf of Sean Hefty
Sent: Fri 10/28/2005 12:01 PM
To: Aniruddha Bohra
Cc: openib-general@openib.org
Subject: Re: [openib-general] OpenSM crash with today's trunk



Aniruddha Bohra wrote:
>> Oh well, I guess this is a different bug.  Is there an oops or
>> anything in your kernel log, or is this just a userspace crash?
>> 
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?

Is there any chance opensm is already running on the system?  It sounds like
something has already registered to receive the same MADs that opensm wants to
receive.

- Sean


Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Sean Hefty

Aniruddha Bohra wrote:

>> Oh well, I guess this is a different bug.  Is there an oops or
>> anything in your kernel log, or is this just a userspace crash?
>
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
>
> Is this useful?


Is there any chance opensm is already running on the system?  It sounds like 
something has already registered to receive the same MADs that opensm wants to 
receive.


- Sean


RE: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Eitan Zahavi
This means you have another SM or application already registered for handling SubnetManagement packets. Thus OpenSM fails to start (register as the handler for such requests). The crash is a bug that should be solved. 

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL



> -----Original Message-----
> From: Aniruddha Bohra [mailto:[EMAIL PROTECTED]]
> Sent: Friday, October 28, 2005 5:28 PM
> To: Roland Dreier
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] OpenSM crash with today's trunk
> 
> Roland Dreier wrote:
> 
> >    Aniruddha> I tried with r3888 and r3891 with the same result.
> >
> >Oh well, I guess this is a different bug.  Is there an oops or
> >anything in your kernel log, or is this just a userspace crash?
> >
> >
> This is what I see :
> Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
> Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
> Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM
> 
> Is this useful?
> 
> Aniruddha
> 
> 

Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Roland Dreier wrote:


>>    Aniruddha> I tried with r3888 and r3891 with the same result.
>
> Oh well, I guess this is a different bug.  Is there an oops or
> anything in your kernel log, or is this just a userspace crash?
 


This is what I see :
Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0
Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use
Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM

Is this useful?

Aniruddha




Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Roland Dreier
Aniruddha> I tried with r3888 and r3891 with the same result.

Oh well, I guess this is a different bug.  Is there an oops or
anything in your kernel log, or is this just a userspace crash?

If it's just opensm crashing then I'm not much use in debugging.

 - R.


Re: [openib-general] OpenSM crash with today's trunk

2005-10-28 Thread Aniruddha Bohra

Roland Dreier wrote:


> I believe that this is in r3889.
>
> - R.
 


I tried with r3888 and r3891 with the same result.

Aniruddha



Re: [openib-general] OpenSM crash with today's trunk

2005-10-27 Thread Roland Dreier
I believe that this is in r3889.

 - R.


[openib-general] OpenSM crash with today's trunk

2005-10-27 Thread Aniruddha Bohra
Hello,
I updated the OpenIB stack today and I get the following error
on starting OpenSM. The verbose log is available at
http://www.cs.rutgers.edu/~bohra/osm-v.log


# opensm -V -d10 -r
-
OpenSM Rev:openib-1.1.0
Command Line Arguments:
 Big V selected
 d level = 0xa
 Reassign LIDs
 Log File: /var/log/osm.log
-
OpenSM Rev:openib-1.1.0

Using default guid 0x2c901081e7471

Error from osm_opensm_bind (0x2A)
Exiting SM

Segmentation fault


Please let me know what I can do to debug this.

Thanks
Aniruddha




Re: [openib-general] OpenSM crash

2005-05-31 Thread Hal Rosenstock
On Tue, 2005-05-31 at 16:43, Tom Duffy wrote:
> On Tue, 2005-05-31 at 13:09 -0400, Hal Rosenstock wrote:
> > There are certain changes where the makefiles need to be regenerated
> > (and this is not done automatically). Since there was an additional
> > compile flag added, they need to be regenerated or else it is being
> > built the old way (without the real RMPP support enabled).
> 
> $ make automake
> 
> at the toplevel should take care of this, no?

Yes.

-- Hal



Re: [openib-general] OpenSM crash

2005-05-31 Thread Tom Duffy
On Tue, 2005-05-31 at 13:09 -0400, Hal Rosenstock wrote:
> There are certain changes where the makefiles need to be regenerated
> (and this is not done automatically). Since there was an additional
> compile flag added, they need to be regenerated or else it is being
> built the old way (without the real RMPP support enabled).

$ make automake

at the toplevel should take care of this, no?

-tduffy



Re: [openib-general] OpenSM crash

2005-05-31 Thread Hal Rosenstock
On Fri, 2005-05-27 at 17:30, Tom Duffy wrote: 
> > Also, did
> > you pick up the user_mad.c fix on Tuesday AM ? If it was, any other
> > changes are either not related or trivial.
> > 
> > After you picked up these changes, did you regenerate the various OpenSM
> > makefiles (a define for RMPP changed in them) or just rebuild ? [This
> > would not explain the crash, but is different from how my OpenSM is
> > built.]
> 
> I just reran make from the toplevel (management) after updating.  I
> would think it would rebuild them if something changed, no?

There are certain changes where the makefiles need to be regenerated
(and this is not done automatically). Since there was an additional
compile flag added, they need to be regenerated or else it is being
built the old way (without the real RMPP support enabled).

-- Hal



Re: [openib-general] OpenSM crash

2005-05-28 Thread Hal Rosenstock
On Fri, 2005-05-27 at 17:37, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 17:33, Roland Dreier wrote:
> > > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
> > 
> > Hal> I take that back. That's just a lot of MADs have been sent
> > Hal> (on the IB wire). OpenSM was probably up and running for a
> > Hal> while...
> > 
> > I find it hard to believe that OpenSM has sent 4 billion MADs --
> > that's more than 1000 MADs a second for a solid month.  It also looks
> > very suspicious that the value is equal to ((unsigned int) -1).
>   ^^
> on a 32 bit machine.
> 
> Good point. The fact that it gets to -1 is significant as I think that
> is used as a magic value for some computations.

I'm pretty sure that I see a way this could have gone negative in the
vendor layer. I'm working on a patch for this.

-- Hal




Re: [openib-general] OpenSM crash

2005-05-27 Thread Roland Dreier
Roland> I find it hard to believe that OpenSM has sent 4 billion
Roland> MADs -- that's more than 1000 MADs a second for a solid
Roland> month.  It also looks very suspicious that the value is
Roland> equal to ((unsigned int) -1).

Hal> ^^ on a 32 bit machine.

This is really a very minor point but the following program

#include <stdio.h>
int main(int argc, char *argv[]) {
printf("%u\n", ((unsigned int) -1)); return 0;
}

prints 4294967295 on any 64-bit Linux machine I have access to...

 - R.


Re: [openib-general] OpenSM crash

2005-05-27 Thread Hal Rosenstock
On Fri, 2005-05-27 at 17:33, Roland Dreier wrote:
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 
> MADs outstanding.
> 
> Hal> I take that back. That's just a lot of MADs have been sent
> Hal> (on the IB wire). OpenSM was probably up and running for a
> Hal> while...
> 
> I find it hard to believe that OpenSM has sent 4 billion MADs --
> that's more than 1000 MADs a second for a solid month.  It also looks
> very suspicious that the value is equal to ((unsigned int) -1).
  ^^
on a 32 bit machine.

Good point. The fact that it gets to -1 is significant as I think that
is used as a magic value for some computations.

-- Hal



Re: [openib-general] OpenSM crash

2005-05-27 Thread Roland Dreier
> May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 
MADs outstanding.

Hal> I take that back. That's just a lot of MADs have been sent
Hal> (on the IB wire). OpenSM was probably up and running for a
Hal> while...

I find it hard to believe that OpenSM has sent 4 billion MADs --
that's more than 1000 MADs a second for a solid month.  It also looks
very suspicious that the value is equal to ((unsigned int) -1).

 - R.


Re: [openib-general] OpenSM crash

2005-05-27 Thread Tom Duffy
On Fri, 2005-05-27 at 17:15 -0400, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 14:31, Tom Duffy wrote:
> > On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> > > I just noticed that my opensm had segv'ed and dumped core.
> > 
> > BTW, here was the tail of the osm.log:
> > 
> > May 27 01:44:09 [43005960] -> osm_vendor_get: [
> > May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 
> > 0x5678f0 (mad 0x5f33f0 req 1)
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 
> > 0x567908, size = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size 
> > = 256.
> > May 27 01:44:09 [43005960] -> osm_vendor_get: ]
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, 
> > p_mad = 0x5f3670, size = 256.
> > May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), 
> > modifier = 0x10001, TID = 0x1c149.
> > May 27 01:44:09 [43005960] -> osm_vl15_post: [
> > May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 
> > (mad 0x5f3670 req 1)
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 
> > MADs outstanding.
>^^
> This looks weird.
> 
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: [
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
> > May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
> > May 27 01:44:09 [43005960] -> osm_vl15_post: ]
> > May 27 01:44:09 [43005960] -> osm_req_get: ]
> > May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
> > May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
> > May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
> > May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [
> 
> Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack
> shown in the previous email as this makes it look like it should be.
> 
> Could you go back a little further in the log ? I'd like to see what is
> before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
> osm_pi_rcv_process.

The log had grown to almost 1G, so I actually deleted it.  Shit, sorry.

> It also seems weird to me that there is no other
> log message between these two.
> 
> From the stack trace:
> #3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
> at osm_helper.c:1446
> #4  0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at
> osm_madw.h:575
> 
> It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller
> 
> if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
> {
>   if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
>   {
>     osm_log( p_vl->p_log, OSM_LOG_DEBUG,
>              "__osm_vl15_poller: "
>              "Servicing p_madw = %p (mad %p req %d)\n",
>              p_madw, p_madw->p_mad, p_madw->resp_expected);
>   }
> 
>   if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
>   {
>     osm_dump_dr_smp( p_vl->p_log,
>                      osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );  <=== here
>   }
> 
> when it died but I didn't see the previous log message in the code
> "osm_vl15_poller: Servicing p_madw" which I also would have expected.
> [This would have been telling as p_madw->p_mad would have been logged].
> I also didn't see the __osm_vl15_poller entry message either.

well, if it segv'ed maybe it never finished writing out to the file...

-tduffy



Re: [openib-general] OpenSM crash

2005-05-27 Thread Tom Duffy
On Fri, 2005-05-27 at 16:25 -0400, Hal Rosenstock wrote:
> On Fri, 2005-05-27 at 15:26, Tom Duffy wrote:
> > > Also, what version of OpenSM are you using ?
> > 
> > It was pretty close to the head of the tree, although a couple of files
> > were updated when I did a svn update after the crash.
> 
> When was your last update of OpenSM ? Was it after Tues AM ?

To be honest, I can't remember.

> Also, did
> you pick up the user_mad.c fix on Tuesday AM ? If it was, any other
> changes are either not related or trivial.
> 
> After you picked up these changes, did you regenerate the various OpenSM
> makefiles (a define for RMPP changed in them) or just rebuild ? [This
> would not explain the crash, but is different from how my OpenSM is
> built.]

I just reran make from the toplevel (management) after updating.  I
would think it would rebuild them if something changed, no?

-tduffy



Re: [openib-general] OpenSM crash

2005-05-27 Thread Hal Rosenstock
On Fri, 2005-05-27 at 17:15, Hal Rosenstock wrote:
> > May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 
> > MADs outstanding.
>^^
> This looks weird.

I take that back. That just means a lot of MADs have been sent (on the IB
wire). OpenSM was probably up and running for a while...

-- Hal



Re: [openib-general] OpenSM crash

2005-05-27 Thread Hal Rosenstock
On Fri, 2005-05-27 at 14:31, Tom Duffy wrote:
> On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> > I just noticed that my opensm had segv'ed and dumped core.
> 
> BTW, here was the tail of the osm.log:
> 
> May 27 01:44:09 [43005960] -> osm_vendor_get: [
> May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 
> (mad 0x5f33f0 req 1)
> May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 
> 0x567908, size = 256.
> May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = 
> 256.
> May 27 01:44:09 [43005960] -> osm_vendor_get: ]
> May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, 
> p_mad = 0x5f3670, size = 256.
> May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
> May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), 
> modifier = 0x10001, TID = 0x1c149.
> May 27 01:44:09 [43005960] -> osm_vl15_post: [
> May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad 
> 0x5f3670 req 1)
> May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs 
> outstanding.
   ^^
This looks weird.

> May 27 01:44:09 [43005960] -> osm_vl15_poll: [
> May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
> May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
> May 27 01:44:09 [43005960] -> osm_vl15_post: ]
> May 27 01:44:09 [43005960] -> osm_req_get: ]
> May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
> May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
> May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
> May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [

Wonder why __osm_sm_mad_ctrl_disp_done_callback wasn't on the stack
shown in the previous email as this makes it look like it should be.

Could you go back a little further in the log ? I'd like to see what is
before the start of __osm_pi_rcv_get_pkey_slvl_vla_tables and
osm_pi_rcv_process. It also seems weird to me that there is no other
log message between these two.

From the stack trace:
#3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
at osm_helper.c:1446
#4  0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at
osm_madw.h:575

It looks like OpenSM was in osm_vl15intf.c::__osm_vl15_poller

if( p_madw != (osm_madw_t*)cl_qlist_end( p_fifo ) )
{
  if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
  {
    osm_log( p_vl->p_log, OSM_LOG_DEBUG,
             "__osm_vl15_poller: "
             "Servicing p_madw = %p (mad %p req %d)\n",
             p_madw, p_madw->p_mad, p_madw->resp_expected);
  }

  if( osm_log_is_active( p_vl->p_log, OSM_LOG_FRAMES ) )
  {
    osm_dump_dr_smp( p_vl->p_log,
                     osm_madw_get_smp_ptr( p_madw ), OSM_LOG_FRAMES );  <=== here
  }

when it died but I didn't see the previous log message in the code
"osm_vl15_poller: Servicing p_madw" which I also would have expected.
[This would have been telling as p_madw->p_mad would have been logged].
I didn't see the __osm_vl15_poller entry message either.

-- Hal




Re: [openib-general] OpenSM crash

2005-05-27 Thread Hal Rosenstock
On Fri, 2005-05-27 at 15:26, Tom Duffy wrote:
> > Also, what version of OpenSM are you using ?
> 
> It was pretty close to the head of the tree, although a couple of files
> were updated when I did a svn update after the crash.

When was your last update of OpenSM ? Was it after Tues AM ? Also, did
you pick up the user_mad.c fix on Tuesday AM ? If it was, any other
changes are either not related or trivial.

After you picked up these changes, did you regenerate the various OpenSM
makefiles (a define for RMPP changed in them) or just rebuild ? [This
would not explain the crash, but is different from how my OpenSM is
built.]

-- Hal



Re: [openib-general] OpenSM crash

2005-05-27 Thread Tom Duffy
On Fri, 2005-05-27 at 14:54 -0400, Hal Rosenstock wrote:
> Anything "special" about your configuration/what was going on ?

This was in the middle of the night.  I wasn't doing anything to the
systems at the time.

> Can you reproduce this ? 

nope.

> Also, what version of OpenSM are you using ?

It was pretty close to the head of the tree, although a couple of files
were updated when I did a svn update after the crash.

-tduffy




-- 
I wish we lived in the America of yesteryear that only exists in the
minds of us Republicans.
-- Ned Flanders



Re: [openib-general] OpenSM crash

2005-05-27 Thread Hal Rosenstock
On Fri, 2005-05-27 at 14:27, Tom Duffy wrote:
> I just noticed that my opensm had segv'ed and dumped core.  Here is the
> gdb backtrace.
> 
> #0  stack_dump () at src/stack.c:72
> 72  if (!__builtin_frame_address(2))
> (gdb) bt
> #0  stack_dump () at src/stack.c:72
> #1  0x2abb71a6 in handler (x=11) at src/stack.c:151
> #2  <signal handler called>

Looks like osm_dump_dr_smp was called with a NULL p_smp so:
osm_madw_get_smp_ptr(p_madw) returned NULL for some unknown reason
and that is an unexpected (should not occur) condition.

> #3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
> at osm_helper.c:1446
> #4  0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575
> #5  0x2adc911e in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61
> #6  0x0036d28060aa in start_thread () from /lib64/tls/libpthread.so.0
> #7  0x0036d19c53d3 in clone () from /lib64/tls/libc.so.6
> #8  0x in ?? ()

Anything "special" about your configuration/what was going on ?

Can you reproduce this ? 

Also, what version of OpenSM are you using ?

-- Hal



Re: [openib-general] OpenSM crash

2005-05-27 Thread Tom Duffy
On Fri, 2005-05-27 at 11:27 -0700, Tom Duffy wrote:
> I just noticed that my opensm had segv'ed and dumped core.

BTW, here was the tail of the osm.log:

May 27 01:44:09 [43005960] -> osm_vendor_get: [
May 27 01:44:09 [43806960] -> __osm_vl15_poller: Servicing p_madw = 0x5678f0 (mad 0x5f33f0 req 1)
May 27 01:44:09 [43005960] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x567908, size = 256.
May 27 01:44:09 [43005960] -> osm_vendor_get: Acquired UMAD 0x5f3640, size = 256.
May 27 01:44:09 [43005960] -> osm_vendor_get: ]
May 27 01:44:09 [43005960] -> osm_mad_pool_get: Acquired p_madw = 0x5678f0, p_mad = 0x5f3670, size = 256.
May 27 01:44:09 [43005960] -> osm_mad_pool_get: ]
May 27 01:44:09 [43005960] -> osm_req_get: Getting P_KeyTable (0x16), modifier = 0x10001, TID = 0x1c149.
May 27 01:44:09 [43005960] -> osm_vl15_post: [
May 27 01:44:09 [43005960] -> osm_vl15_post: Servicing p_madw = 0x5678f0 (mad 0x5f3670 req 1)
May 27 01:44:09 [43005960] -> osm_vl15_post: 4294967295 MADs on wire, 2 MADs outstanding.
May 27 01:44:09 [43005960] -> osm_vl15_poll: [
May 27 01:44:09 [43005960] -> osm_vl15_poll: Signalling poller thread.
May 27 01:44:09 [43005960] -> osm_vl15_poll: ]
May 27 01:44:09 [43005960] -> osm_vl15_post: ]
May 27 01:44:09 [43005960] -> osm_req_get: ]
May 27 01:44:09 [43005960] -> osm_physp_has_pkey: ]
May 27 01:44:09 [43005960] -> __osm_pi_rcv_get_pkey_slvl_vla_tables: ]
May 27 01:44:09 [43005960] -> osm_pi_rcv_process: ]
May 27 01:44:09 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: [

-tduffy



[openib-general] OpenSM crash

2005-05-27 Thread Tom Duffy
I just noticed that my opensm had segv'ed and dumped core.  Here is the
gdb backtrace.

#0  stack_dump () at src/stack.c:72
72  if (!__builtin_frame_address(2))
(gdb) bt
#0  stack_dump () at src/stack.c:72
#1  0x2abb71a6 in handler (x=11) at src/stack.c:151
#2  <signal handler called>
#3  osm_dump_dr_smp (p_log=0x552498, p_smp=0x0, log_level=32 ' ')
at osm_helper.c:1446
#4  0x0042eed1 in __osm_vl15_poller (p_ptr=0x552498) at osm_madw.h:575
#5  0x2adc911e in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61
#6  0x0036d28060aa in start_thread () from /lib64/tls/libpthread.so.0
#7  0x0036d19c53d3 in clone () from /lib64/tls/libc.so.6
#8  0x in ?? ()



-tduffy

-- 
I wish we lived in the America of yesteryear that only exists in the
minds of us Republicans.
-- Ned Flanders



Re: [openib-general] opensm crash

2005-01-27 Thread Hal Rosenstock
Hi Tom,

On Thu, 2005-01-27 at 12:53, Tom Duffy wrote:
> I hit control-c to kill osm and got:
> 
> Jan 27 18:47:09 [44808960] -> osm_mad_pool_get: [
> opensm[4627]: *** exception handler: died with signal 11
> Segmentation fault

Looks to me like the following could be the case: 

One thread was shutting down the OSM (osm_opensm_destroy was called and
got at least as far as destroying the SA; subsequent to this the MAD
pool is destroyed) and another thread attempted a get from the MAD pool.
I'm not sure what would prevent this from occurring. 

I am looking into this crash further and am trying to reproduce the
same.

-- Hal


> Here is the last 100 lines of the osm.log
> 
> [EMAIL PROTECTED] bin]# tail -100 /var/log/osm.log
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Retiring 
> MAD with TID = 0x2bf9.
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: [
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: Releasing p_madw = 0x56d9c0, 
> p_mad = 0x599140.
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: [
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: Retiring UMAD 0x599140.
> Jan 27 18:47:04 [43005960] -> osm_vendor_put: ]
> Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: ]
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: 0 QP0 MADs 
> outstanding.
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Posting 
> Dispatcher message OSM_MSG_NO_SMPS_OUTSTANDING.
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: ]
> Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: ]
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: [
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal 
> OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_LIGHT.
> Jan 27 18:47:04 [43005960] -> __osm_state_mgr_light_sweep_done_msg:
> 
> 
> **
> ** LIGHT SWEEP COMPLETE **
> **
> 
> 
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal 
> OSM_SIGNAL_IDLE_TIME_PROCESS in state OSM_SM_STATE_PROCESS_REQUEST.
> Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: [
> Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: ]
> Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: ]
> Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: [
> Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: ]
> Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: [
> Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: ]
> Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: [
> Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: Off schedule sweep signalled.
> Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: ]
> Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: [
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: [
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: event_wheel ptr:0x5575f8
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: ]
> Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: ]
> Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: [
> Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: ]
> Jan

[openib-general] opensm crash

2005-01-27 Thread Tom Duffy
I hit control-c to kill osm and got:

Jan 27 18:47:09 [44808960] -> osm_mad_pool_get: [
opensm[4627]: *** exception handler: died with signal 11
Segmentation fault

Here are the last 100 lines of the osm.log

[EMAIL PROTECTED] bin]# tail -100 /var/log/osm.log
Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Retiring MAD with TID = 0x2bf9.
Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: [
Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: Releasing p_madw = 0x56d9c0, p_mad = 0x599140.
Jan 27 18:47:04 [43005960] -> osm_vendor_put: [
Jan 27 18:47:04 [43005960] -> osm_vendor_put: Retiring UMAD 0x599140.
Jan 27 18:47:04 [43005960] -> osm_vendor_put: ]
Jan 27 18:47:04 [43005960] -> osm_mad_pool_put: ]
Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: 0 QP0 MADs outstanding.
Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: Posting Dispatcher message OSM_MSG_NO_SMPS_OUTSTANDING.
Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_retire_trans_mad: ]
Jan 27 18:47:04 [43005960] -> __osm_sm_mad_ctrl_disp_done_callback: ]
Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: [
Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS in state OSM_SM_STATE_SWEEP_LIGHT.
Jan 27 18:47:04 [43005960] -> __osm_state_mgr_light_sweep_done_msg:


**
** LIGHT SWEEP COMPLETE **
**


Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_IDLE_TIME_PROCESS in state OSM_SM_STATE_PROCESS_REQUEST.
Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: [
Jan 27 18:47:04 [43005960] -> __process_idle_time_queue_start: ]
Jan 27 18:47:04 [43005960] -> osm_state_mgr_process: ]
Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: [
Jan 27 18:47:09 [9597F060] -> osm_vl15_shutdown: ]
Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: [
Jan 27 18:47:09 [9597F060] -> osm_vendor_set_sm: ]
Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: [
Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: Off schedule sweep signalled.
Jan 27 18:47:09 [44007960] -> __osm_sm_sweeper: ]
Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: [
Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: [
Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: event_wheel ptr:0x5575f8
Jan 27 18:47:09 [9597F060] -> cl_event_wheel_dump: ]
Jan 27 18:47:09 [9597F060] -> cl_event_wheel_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_trap_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_sminfo_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_ni_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_pi_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_si_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_nd_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_lid_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_ucast_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_link_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_drop_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_lft_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_mft_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_slvl_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_vla_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_pkey_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_state_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_sm_state_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_mcast_mgr_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_mcast_mgr_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_sm_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_sa_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_nr_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_nr_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_pir_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_pir_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_lr_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_lr_rcv_destroy: ]
Jan 27 18:47:09 [9597F060] -> osm_pr_rcv_destroy: [
Jan 27 18:47:09 [9597F060] -> osm_pr_rcv_destroy: ]
Jan 27 18: