Re: [openib-general] oops at device removal
> @@ -71,6 +70,7 @@ struct mcast_device {
>  	int start_port;
>  	int end_port;
>  	struct mcast_port port[0];
> +	struct ib_event_handler event_handler;
>  };

The mcast_port data is allocated at the end of the structure, so
event_handler will need to be located further up in the structure,
before port[0].

- Sean
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
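To illustrate Sean's point, here is a minimal userspace sketch of the layout he seems to be asking for. The fake_* types and helpers are hypothetical stand-ins, not the kernel definitions, and the sketch uses a C99 flexible array member (port[]) where the driver uses the GNU zero-length form (port[0]). The idea is only that the fixed-size member sits before the trailing array, so the per-port storage allocated past the end of the structure cannot overlap it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; only the layout matters. */
struct fake_event_handler {
	void *device;
	void (*handler)(void);
};

struct fake_port {
	int port_num;
};

struct fake_mcast_device {
	struct fake_event_handler event_handler; /* fixed-size member moved up */
	int start_port;
	int end_port;
	struct fake_port port[];                 /* variable-length tail stays last */
};

/* Allocates the header plus one fake_port per port in [start_port, end_port]. */
static struct fake_mcast_device *fake_alloc_dev(int start_port, int end_port)
{
	size_t nports;
	struct fake_mcast_device *dev;

	assert(end_port >= start_port);
	nports = (size_t)(end_port - start_port) + 1;

	dev = calloc(1, sizeof(*dev) + nports * sizeof(dev->port[0]));
	if (!dev)
		return NULL;

	dev->start_port = start_port;
	dev->end_port = end_port;
	return dev;
}

/* Nonzero if the port[] storage starts after the end of event_handler. */
static int fake_layout_is_safe(void)
{
	return offsetof(struct fake_mcast_device, port) >=
	       offsetof(struct fake_mcast_device, event_handler) +
	       sizeof(struct fake_event_handler);
}
```

With the member placed after port[0] instead, as in the quoted hunk, the first port entry and event_handler would share the same bytes, so writes to one would clobber the other.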
[openib-general] oops at device removal
We have observed the following crash:

Unable to handle kernel paging request at 00100108 RIP:
8823af5f{:ib_core:ib_unregister_event_handler+31}
PGD 117034067 PUD 102047067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci:00/:00:06.0/:08:00.0/subsystem_device
CPU 2
Modules linked in: autofs4 ipv6 raw ib_sa ib_uverbs ib_umad nfs lockd
nfs_acl sunrpc ib_mthca ib_mad ib_core memtrack af_packet button battery
ac apparmor aamatch_pcre loop dm_mod ehci_hcd uhci_hcd ide_cd cdrom
i8xx_tco usbcore shpchp e1000 pci_hotplug floppy ext3 jbd edd fan thermal
processor sg mptspi mptscsih mptbase scsi_transport_spi piix sd_mod
scsi_mod ide_disk ide_core
Pid: 9241, comm: modprobe Tainted: G U 2.6.16.21-0.8-smp #1
RIP: 0010:[8823af5f] 8823af5f{:ib_core:ib_unregister_event_handler+31}
RSP: :810100801e68 EFLAGS: 00010046
RAX: 00200200 RBX: 883282e0 RCX: 883282f0
RDX: 00100100 RSI: 0282 RDI: 810119836058
RBP: 8101119ce480 R08: 8101119ce608 R09: 810100801e40
R10: 0001 R11: 81010f493e38 R12:
R13: 88324020 R14: 810119af9080 R15: 0292
FS: 2b7fa92af6d0() GS:81011c06b340() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 00100108 CR3: 000101a04000 CR4: 06e0
Process modprobe (pid: 9241, threadinfo 81010080, task 81011b1b0080)
Stack: 8101119ce490 8831f036 810119af9070 810119836000
       810111ab2de0 8823b615 810110005820 8033da60
       88324080 0080
Call Trace: 8831f036{:ib_sa:mcast_remove_one+43}
       8823b615{:ib_core:ib_unregister_client+55}
       8831efc8{:ib_sa:mcast_cleanup+16}
       8832001d{:ib_sa:ib_sa_cleanup+9}
       8014aa3c{sys_delete_module+540}
       80167e37{do_munmap+619}
       801e7e6b{__up_write+33}
       8010a7be{system_call+126}

Code: 48 89 42 08 48 89 10 48 c7 41 08 00 02 20 00 48 8b 3b 48 c7
RIP 8823af5f{:ib_core:ib_unregister_event_handler+31} RSP 810100801e68
CR2: 00100108

Address ib_unregister_event_handler+31 is here:

/tmp/openib_gen2/last_stable/gen2_devel_kernel/drivers/infiniband/core/device.c:450
list_del():
/usr/src/linux-2.6.16.21-0.8/include/linux/list.h:165
__list_del():
/usr/src/linux-2.6.16.21-0.8/include/linux/list.h:153
    1cdb:	48 89 42 08	mov %rax,0x8(%rdx)

--
MST
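A note for readers decoding the register dump: RDX (00100100) and RAX (00200200) look like the kernel's list poison values, which list_del() stores into an entry's next and prev pointers after unlinking it. The following is a minimal userspace sketch, assuming LIST_POISON1 and LIST_POISON2 have the values 0x00100100 and 0x00200200 as in the list.h of this kernel generation. The helper names next_del_store_addr and demo_double_del_addr are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the kernel's list node layout (illustrative only). */
struct list_head {
	struct list_head *next, *prev;
};

/* Assumed poison values, as in 2.6.16-era include/linux/list.h. */
#define LIST_POISON1 ((uintptr_t)0x00100100)
#define LIST_POISON2 ((uintptr_t)0x00200200)

/* Unlink an entry and poison its pointers, mirroring list_del(). */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;   /* the store that faults in the oops */
	entry->prev->next = entry->next;
	entry->next = (struct list_head *)LIST_POISON1;
	entry->prev = (struct list_head *)LIST_POISON2;
}

/* Address that a further list_del(entry) would write to first. */
static uintptr_t next_del_store_addr(const struct list_head *entry)
{
	return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}

/* Deletes an entry once and reports where a second delete would fault. */
static uintptr_t demo_double_del_addr(void)
{
	struct list_head head = { &head, &head };
	struct list_head entry = { &head, &head };

	head.next = &entry;
	head.prev = &entry;

	list_del(&entry);
	assert(head.next == &head && head.prev == &head); /* list is empty again */

	return next_del_store_addr(&entry);
}
```

On a 64-bit build, demo_double_del_addr() yields 0x00100108, the same value as CR2 above. That suggests list_del() ran on a handler entry that had already been deleted once.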
Re: [openib-general] oops at device removal
> We have observed the following crash:

OK, I think I see a reason for this. I notice the following code in
multicast.c, function mcast_add_one:

	ib_set_client_data(device, &mcast_client, dev);

	INIT_IB_EVENT_HANDLER(&event_handler, device, mcast_event_handler);
	ib_register_event_handler(&event_handler);

So it seems that if I have 2 devices, the same event_handler will be
registered twice. This will trigger data corruption, as the same entry
will be added to a list twice. Or so it seems.

Sean, what's the idea here?

--
MST
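The suspected corruption can be reproduced in a small userspace model. This is a sketch under assumptions, not the ib_core code: it assumes that INIT_IB_EVENT_HANDLER re-initializes the handler's list node and that ib_register_event_handler appends that node to a per-device list. All fake_* names and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = node;
	node->next = head;
	node->prev = prev;
	prev->next = node;
}

/* Hypothetical stand-ins for struct ib_device and struct ib_event_handler. */
struct fake_device {
	struct list_head event_handler_list;
};

struct fake_handler {
	struct fake_device *device;
	struct list_head list;
};

/* Assumed behaviour of INIT_IB_EVENT_HANDLER + ib_register_event_handler. */
static void fake_register(struct fake_handler *h, struct fake_device *dev)
{
	h->device = dev;
	INIT_LIST_HEAD(&h->list);  /* forgets any list the node is already on */
	list_add_tail(&h->list, &dev->event_handler_list);
}

/* A list is consistent if every node's neighbours point back at it. */
static bool list_is_consistent(const struct list_head *head)
{
	const struct list_head *pos = head;
	int guard = 0;

	do {
		if (pos->next->prev != pos || pos->prev->next != pos)
			return false;
		pos = pos->next;
	} while (pos != head && ++guard < 1000);

	return pos == head;
}

/* Registers one shared handler on two devices, as the static variable did. */
static bool shared_handler_keeps_first_list_intact(void)
{
	struct fake_device a, b;
	struct fake_handler shared;

	INIT_LIST_HEAD(&a.event_handler_list);
	INIT_LIST_HEAD(&b.event_handler_list);

	fake_register(&shared, &a);
	assert(list_is_consistent(&a.event_handler_list));

	fake_register(&shared, &b);  /* second HCA reuses the same node */
	return list_is_consistent(&a.event_handler_list);
}
```

Under these assumptions the last function returns false: the first device's list still points at the handler, but the handler is now linked only into the second device's list. Unregistering once per device would then run list_del() on the same node twice.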
Re: [openib-general] oops at device removal
Quoting Michael S. Tsirkin [EMAIL PROTECTED]:
> Subject: Re: oops at device removal
>
> > We have observed the following crash:
>
> OK, I think I see a reason for this. I notice the following in code,
> file multicast.c, function mcast_add_one:
>
> 	ib_set_client_data(device, &mcast_client, dev);
>
> 	INIT_IB_EVENT_HANDLER(&event_handler, device, mcast_event_handler);
> 	ib_register_event_handler(&event_handler);
>
> So it seems like if I have 2 devices, event_handler will be registered
> twice. This will trigger data corruption as same entry will be added
> to list twice. Or so it seems.
>
> Sean, what's the idea here?

It seems something like the following would fix it (untested).

Make new multicast code not crash on platforms with multiple HCAs.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index fde977e..e51a078 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -51,7 +51,6 @@ static struct ib_client mcast_client = {
 };
 
 static struct ib_sa_client sa_client;
-static struct ib_event_handler event_handler;
 static struct workqueue_struct *mcast_wq;
 static union ib_gid mgid0;
 
@@ -71,6 +70,7 @@ struct mcast_device {
 	int start_port;
 	int end_port;
 	struct mcast_port port[0];
+	struct ib_event_handler event_handler;
 };
 
 enum mcast_state {
@@ -793,8 +793,8 @@ static void mcast_add_one(struct ib_device *device)
 	dev->device = device;
 	ib_set_client_data(device, &mcast_client, dev);
 
-	INIT_IB_EVENT_HANDLER(&event_handler, device, mcast_event_handler);
-	ib_register_event_handler(&event_handler);
+	INIT_IB_EVENT_HANDLER(&dev->event_handler, device, mcast_event_handler);
+	ib_register_event_handler(&dev->event_handler);
 }
 
 static void mcast_remove_one(struct ib_device *device)
@@ -807,7 +807,7 @@ static void mcast_remove_one(struct ib_device *device)
 	if (!dev)
 		return;
 
-	ib_unregister_event_handler(&event_handler);
+	ib_unregister_event_handler(&dev->event_handler);
 	flush_workqueue(mcast_wq);
 
 	for (i = 0; i <= dev->end_port - dev->start_port; i++) {

--
MST