Re: [openib-general] oops at device removal

2007-01-29 Thread Sean Hefty
> @@ -71,6 +70,7 @@ struct mcast_device {
>  	int start_port;
>  	int end_port;
>  	struct mcast_port   port[0];
> +	struct ib_event_handler event_handler;
>  };

The mcast_port data is allocated at the end of the structure, so nothing can
follow port[0].  event_handler needs to be placed above port[0] in the
structure.
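
To illustrate the layout point, here is a minimal userspace sketch, not the
actual kernel code. The names handler_stub, port_stub, mcast_device_sketch,
alloc_dev and ports_do_not_clobber_handler are hypothetical stand-ins, and
the member sizes are made up. It only shows the general rule that a
variable-size trailing array has to be the last member, with fixed-size
members such as the event handler placed before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; sizes are illustrative only. */
struct handler_stub { void *device; void (*fn)(void); };
struct port_stub { int port_num; char pad[28]; };

struct mcast_device_sketch {
	int start_port;
	int end_port;
	struct handler_stub event_handler;	/* fixed-size members come first */
	struct port_stub port[];		/* variable-size tail stays last */
};

/* Allocate the header plus nports trailing entries in one block. */
struct mcast_device_sketch *alloc_dev(int start_port, int nports)
{
	struct mcast_device_sketch *dev;

	assert(nports > 0);
	dev = calloc(1, sizeof(*dev) + (size_t)nports * sizeof(dev->port[0]));
	if (!dev)
		return NULL;
	dev->start_port = start_port;
	dev->end_port = start_port + nports - 1;
	return dev;
}

/* Returns 1 if overwriting every port entry leaves event_handler intact. */
int ports_do_not_clobber_handler(int nports)
{
	struct mcast_device_sketch *dev = alloc_dev(1, nports);
	int i, ok;

	if (!dev)
		return 0;
	dev->event_handler.device = dev;
	for (i = 0; i < nports; i++)
		memset(&dev->port[i], 0xff, sizeof(dev->port[i]));
	ok = dev->event_handler.device == (void *)dev &&
	     dev->event_handler.fn == NULL;
	free(dev);
	return ok;
}
```

With the handler placed after a zero-length port[0], the handler would sit at
the same offset as the first port entry, so initializing the ports would
overwrite it.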

- Sean

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] oops at device removal

2007-01-28 Thread Michael S. Tsirkin
We have observed the following crash:

Unable to handle kernel paging request at 00100108 RIP:
8823af5f{:ib_core:ib_unregister_event_handler+31}
PGD 117034067 PUD 102047067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci:00/:00:06.0/:08:00.0/subsystem_device
CPU 2
Modules linked in: autofs4 ipv6 raw ib_sa ib_uverbs ib_umad nfs lockd nfs_acl
sunrpc ib_mthca ib_mad ib_core memtrack af_packet button battery ac apparmor
aamatch_pcre loop dm_mod ehci_hcd uhci_hcd ide_cd cdrom i8xx_tco usbcore
shpchp e1000 pci_hotplug floppy ext3 jbd edd fan thermal processor sg mptspi
mptscsih mptbase scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
Pid: 9241, comm: modprobe Tainted: G U 2.6.16.21-0.8-smp #1
RIP: 0010:[8823af5f] 8823af5f{:ib_core:ib_unregister_event_handler+31}
RSP: :810100801e68  EFLAGS: 00010046
RAX: 00200200 RBX: 883282e0 RCX: 883282f0
RDX: 00100100 RSI: 0282 RDI: 810119836058
RBP: 8101119ce480 R08: 8101119ce608 R09: 810100801e40
R10: 0001 R11: 81010f493e38 R12: 
R13: 88324020 R14: 810119af9080 R15: 0292
FS:  2b7fa92af6d0() GS:81011c06b340() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 00100108 CR3: 000101a04000 CR4: 06e0
Process modprobe (pid: 9241, threadinfo 81010080, task 81011b1b0080)
Stack: 8101119ce490 8831f036 810119af9070 810119836000
   810111ab2de0 8823b615 810110005820 8033da60
   88324080 0080
Call Trace: 8831f036{:ib_sa:mcast_remove_one+43}
   8823b615{:ib_core:ib_unregister_client+55}
   8831efc8{:ib_sa:mcast_cleanup+16} 8832001d{:ib_sa:ib_sa_cleanup+9}
   8014aa3c{sys_delete_module+540} 80167e37{do_munmap+619}
   801e7e6b{__up_write+33} 8010a7be{system_call+126}

Code: 48 89 42 08 48 89 10 48 c7 41 08 00 02 20 00 48 8b 3b 48 c7
RIP 8823af5f{:ib_core:ib_unregister_event_handler+31} RSP 810100801e68
CR2: 00100108

Address ib_unregister_event_handler+31 is here:

/tmp/openib_gen2/last_stable/gen2_devel_kernel/drivers/infiniband/core/device.c:450
list_del():
/usr/src/linux-2.6.16.21-0.8/include/linux/list.h:165
__list_del():
/usr/src/linux-2.6.16.21-0.8/include/linux/list.h:153
    1cdb:   48 89 42 08             mov    %rax,0x8(%rdx)
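
The register values are consistent with list_del() running on an entry that
was already deleted. RDX (00100100) and RAX (00200200) match the
LIST_POISON1 and LIST_POISON2 values that list_del() stores in a removed
node, and the faulting address in CR2 is RDX + 8, the offset of prev on a
64-bit machine. A minimal userspace sketch of that mechanism follows. The
names list_node, list_add_tail_sketch and list_del_sketch are hypothetical,
and the explicit poison check exists only so the example can report the bad
address instead of faulting; the kernel's list_del() of that era performs no
such check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_node { struct list_node *next, *prev; };

/* Poison values stored in a deleted node (as in the kernel's list code). */
#define POISON1 ((struct list_node *)(uintptr_t)0x00100100)
#define POISON2 ((struct list_node *)(uintptr_t)0x00200200)

void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

void list_add_tail_sketch(struct list_node *n, struct list_node *head)
{
	assert(n != head);
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Unlink n and poison it.  Returns 0 on success.  If n was already
 * deleted, returns the address an unchecked list_del() would write to:
 * next->prev, i.e. POISON1 plus the offset of prev.
 */
uintptr_t list_del_sketch(struct list_node *n)
{
	if (n->next == POISON1)
		return (uintptr_t)POISON1 + offsetof(struct list_node, prev);
	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->next = POISON1;
	n->prev = POISON2;
	return 0;
}
```

Deleting the same node twice makes the second call report POISON1 plus the
size of one pointer, which on x86_64 is 00100108, the address in CR2 above.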



-- 
MST




Re: [openib-general] oops at device removal

2007-01-28 Thread Michael S. Tsirkin
> We have observed the following crash:

OK, I think I see a reason for this.

I noticed the following code in multicast.c, function mcast_add_one:

	ib_set_client_data(device, &mcast_client, dev);

	INIT_IB_EVENT_HANDLER(&event_handler, device,
			      mcast_event_handler);
	ib_register_event_handler(&event_handler);

So it seems that with 2 devices, the single global event_handler will be
registered twice. This will corrupt the list, because the same entry gets
added to a list twice.

Or so it seems. Sean, what's the idea here?
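
A minimal userspace sketch of why one shared handler cannot serve two
devices. It assumes that each device keeps its own handler list and that the
init macro resets the list node before registration; both are assumptions
made for illustration, and device_sketch, handler_sketch, register_handler
and list_is_consistent are hypothetical names rather than the ib_core API.

```c
#include <assert.h>
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };

struct device_sketch { struct list_node event_handler_list; };

struct handler_sketch {
	struct device_sketch *device;
	struct list_node list;
};

static void list_init(struct list_node *h)
{
	h->next = h->prev = h;
}

/* Models initializing a handler and then registering it with dev. */
void register_handler(struct handler_sketch *h, struct device_sketch *dev)
{
	struct list_node *head = &dev->event_handler_list;

	assert(h != NULL && dev != NULL);
	h->device = dev;
	list_init(&h->list);		/* assumed: init resets the node */
	h->list.prev = head->prev;	/* then link it at the list tail */
	h->list.next = head;
	head->prev->next = &h->list;
	head->prev = &h->list;
}

/* Returns 1 if every next/prev pair agrees on a walk back to head. */
int list_is_consistent(const struct list_node *head)
{
	const struct list_node *pos = head;
	int steps = 0;

	do {
		if (pos->next->prev != pos)
			return 0;
		pos = pos->next;
	} while (pos != head && ++steps < 16);
	return pos == head;
}
```

Registering one handler_sketch with a first device and then with a second
leaves the first device's list pointing at a node that now belongs to the
second list, so list_is_consistent() returns 0 for the first device. Giving
each device its own handler keeps both lists consistent.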

-- 
MST




Re: [openib-general] oops at device removal

2007-01-28 Thread Michael S. Tsirkin
> Quoting Michael S. Tsirkin [EMAIL PROTECTED]:
> Subject: Re: oops at device removal
> 
> > We have observed the following crash:
> 
> OK, I think I see a reason for this.
> 
> I notice the following in code, file multicast.c, function mcast_add_one:
> 
> 	ib_set_client_data(device, &mcast_client, dev);
> 
> 	INIT_IB_EVENT_HANDLER(&event_handler, device,
> 			      mcast_event_handler);
> 	ib_register_event_handler(&event_handler);
> 
> So it seems like if I have 2 devices, event_handler will be registered twice.
> This will trigger data corruption as same entry will be added to list twice.
> 
> Or so it seems. Sean, what's the idea here?

It seems something like the following would fix it (untested).



Make new multicast code not crash on platforms with multiple HCAs.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index fde977e..e51a078 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -51,7 +51,6 @@ static struct ib_client mcast_client = {
 };
 
 static struct ib_sa_client sa_client;
-static struct ib_event_handler event_handler;
 static struct workqueue_struct *mcast_wq;
 static union ib_gid mgid0;
 
@@ -71,6 +70,7 @@ struct mcast_device {
 	int start_port;
 	int end_port;
 	struct mcast_port   port[0];
+	struct ib_event_handler event_handler;
 };
 
 enum mcast_state {
@@ -793,8 +793,8 @@ static void mcast_add_one(struct ib_device *device)
 	dev->device = device;
 	ib_set_client_data(device, &mcast_client, dev);
 
-	INIT_IB_EVENT_HANDLER(&event_handler, device, mcast_event_handler);
-	ib_register_event_handler(&event_handler);
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device, mcast_event_handler);
+	ib_register_event_handler(&dev->event_handler);
 }
 
 static void mcast_remove_one(struct ib_device *device)
@@ -807,7 +807,7 @@ static void mcast_remove_one(struct ib_device *device)
 	if (!dev)
 		return;
 
-	ib_unregister_event_handler(&event_handler);
+	ib_unregister_event_handler(&dev->event_handler);
 	flush_workqueue(mcast_wq);
 
 	for (i = 0; i <= dev->end_port - dev->start_port; i++) {

-- 
MST
