[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
https://bugs.openfabrics.org/show_bug.cgi?id=263

[EMAIL PROTECTED] changed:

What|Removed|Added
Status|RESOLVED|CLOSED

--- Comment #14 from [EMAIL PROTECTED] 2007-02-27 21:00 ---
With OFED 1.2 alpha1, I was able to failover/failback an IB port every 10 seconds for 8 hours on RHEL4 x86_64 LionMini SDR and DDR. Will keep testing on other platforms.

--
Configure bugmail: https://bugs.openfabrics.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

[EMAIL PROTECTED] changed:

What|Removed|Added
Status|NEW|RESOLVED
Resolution||FIXED

--- Comment #13 from [EMAIL PROTECTED] 2006-11-13 21:59 ---
Fix is merged into Linus's tree as commit 39798695 (i.e., sometime after 2.6.19-rc5).
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #12 from [EMAIL PROTECTED] 2006-11-06 20:26 ---
I finally spent some time tracking this down and I believe the problem is actually in the MAD layer. I will post more details and a patch to openib-general.
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #11 from [EMAIL PROTECTED] 2006-10-18 09:56 ---
Roland, I enabled debug_level=1 with OFED 1.1 rc7 RHEL4 U3 x86_64, and got the same crash (netserver machine). I could only see the debug_level=1 info by running dmesg in a loop, and the info did not get saved into any /var/log files. Is there some extra configuration needed for syslog? Shouldn't IPoIB debug_level=1 info go into a syslog file by default?

Here's what I saw from the dmesg loop right before the crash.

ib1: Port state change event
ib0: Port state change event
ib1: Port state change event
ib0: flushing
ib0: downing ib_dev
ib1: flushing
ib1: downing ib_dev
ib0: Created ah 0101beffa800
ib1: Created ah 0101be636800
ib0: Created ah 0101be5724c0
ib1: Created ah 0101be9c8a80
ib0: Created ah 0101bfc57100
ib1: Created ah 0101be49f700
ib0: Created ah 0101beffa3c0
ib1: Created ah 0101beffae80
ib0: Created ah 0101be636b40
ib1: Created ah 01019dfecd40
ib0: Start path record lookup for fe80::::0005:ad00:0020:0861 MTU 1024
ib0: PathRec LID 0x0006 for GID fe80::::0005:ad00:0020:0861
ib0: Created ah 01019dfec600
ib0: created address handle 01019dfecac0 for LID 0x0006, SL 0
ib0: Port state change event
ib1: Port state change event
ib0: flushing
ib0: downing ib_dev
ib1: flushing
ib1: downing ib_dev
ib0: Start path record lookup for fe80::::0005:ad00:0020:0861 MTU 1024
ib0: PathRec LID 0x0006 for GID fe80::::0005:ad00:0020:0861
ib0: Created ah 0101beffa300
ib0: created address handle 01019dfec1c0 for LID 0x0006, SL 0
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: Created ah 0101bfc55e80
ib0: Created ah 0101bfc4cc80
ib0: Created ah 01019dfec480
ib0: Created ah 01019dfec3c0
ib0: Created ah 01019dfec100
Tue Oct 17 01:05:42 PDT 2006
Message from [EMAIL PROTECTED] at Tue Oct 17 01:05:43 2006 ...
svbu-qa-pcie-1 kernel: general protection fault: [1] SMP

Here's the serial console output from the netserver machine.

ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault: [1] SMP
CPU 0
Modules linked in: rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_ipoib(U) ib_mthca7Losing some ticks... checking if CPU frequency changed.
(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 floppy sg ext3 jbd aic79xx sd_mod scsi_mod
Pid: 7838, comm: ib_mad1 Not tainted 2.6.9-34.ELsmp
RIP: 0010:[a01c384b] a01c384b{:ib_ipoib:path_rec_completion+178}
RSP: 0018:0101a756bc70 EFLAGS: 00010202
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip mwait_idle+0x56/0x7c
RAX: RBX: RCX:
RDX: 0101bbeffc80 RSI: RDI: fffc
RBP: 0101bbeffc80 R08: 0003 R09: 0101bbeffca0
R10: 8011dfe0 R11: 8011dfe0 R12: 1b60167f
R13: fffc R14: R15: 1b6012ff
FS: () GS:804d7b00() knlGS:
CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 006cf5e8 CR3: 00101000 CR4: 06e0
Process ib_mad1 (pid: 7838, threadinfo 0101a756a000, task 0101bdc3b030)
Stack: a00e547d 0101afda5000 0002 0101afda5380 0246 0246 802ab017 0101bc16a500 0101bbeffca0 0101bbeffc80
Call Trace:
a00e547d{:ib_sa:ib_sa_path_rec_callback+0}
802ab017{dev_queue_xmit+525}
a01c3b0e{:ib_ipoib:path_rec_completion+885}
a00e54bd{:ib_sa:ib_sa_path_rec_callback+64}
a00e5a56{:ib_sa:send_handler+74}
a00db763{:ib_mad:ib_mad_complete_send_wr+418}
a00dbce5{:ib_mad:ib_mad_completion_handler+979}
a00db912{:ib_mad:ib_mad_completion_handler+0}
80146e1e{worker_thread+419}
801333c8{default_wake_function+0}
801333c8{default_wake_function+0}
8014aabc{keventd_create_kthread+0}
80146c7b{worker_thread+0}
8014aabc{keventd_create_kthread+0}
8014aa93{kthread+200}
80110e17{child_rip+8}
8014aabc{keventd_create_kthread+0}
8014a9cb{kthread+0}
80110e0f{child_rip+0}
Code: 49 8b 74 24 08 50 0f b6 42 16 50 0f b6 42 15 50 0f b6 42 14
RIP a01c384b{:ib_ipoib:path_rec_completion+178} RSP 0101a756bc70
0Kernel panic - not
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #10 from [EMAIL PROTECTED] 2006-10-16 23:05 ---
I'm trying debug_level=1 now, sorry for the delay, but I wanted to finish other rc7 testing.
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #6 from [EMAIL PROTECTED] 2006-10-12 06:00 ---
Calling netif_stop_queue() doesn't immediately stop the transmit queue. It might be necessary to take priv->tx_lock when calling netif_stop_queue() from ipoib_stop() to ensure that ipoib_start_xmit() isn't in the middle of some work.
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #7 from [EMAIL PROTECTED] 2006-10-12 06:07 ---
Why is it necessary to ensure that ipoib_start_xmit() isn't in the middle of some work?
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #8 from [EMAIL PROTECTED] 2006-10-12 07:17 ---
ipoib_start_xmit() only checks at entry to see if the queue is stopped, so it could still call unicast_arp_send() after a netif_stop_queue(). In ipoib_stop(), I guess this will be synchronized somewhat by the ipoib_flush_paths() in ipoib_ib_dev_down(), which also takes priv->tx_lock, but this doesn't seem intentional.
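The interleaving described in comments #6 to #8 can be sketched as a toy model. This is not driver code: the step names `check`, `send`, and `stop` and both helper functions are made up for illustration. It only enumerates the orderings of a transmit path that tests the queue state once at entry against a concurrent stop, with and without a lock held across the transmit path.

```python
from itertools import permutations

XMIT = ("check", "send")  # stands in for ipoib_start_xmit(): test queue state, then transmit
STOP = ("stop",)          # stands in for ipoib_stop() calling netif_stop_queue()


def schedules(hold_tx_lock):
    """Yield each interleaving of XMIT and STOP that keeps program order.

    With hold_tx_lock set, 'stop' may not run between 'check' and 'send',
    which models both paths taking the same lock.
    """
    for order in sorted(set(permutations(XMIT + STOP))):
        if order.index("check") > order.index("send"):
            continue  # the transmit steps must stay in program order
        if hold_tx_lock and order.index("check") < order.index("stop") < order.index("send"):
            continue  # stop would block on the lock until the transmit finishes
        yield order


def sends_after_stop(order):
    """Return True if this schedule transmits after the queue was stopped."""
    stopped = False
    saw_running = False
    for step in order:
        if step == "stop":
            stopped = True
        elif step == "check":
            saw_running = not stopped  # the only point where the flag is read
        elif step == "send" and saw_running and stopped:
            return True
    return False


def racy_schedules(hold_tx_lock):
    """List the schedules in which a send slips in after the stop."""
    return [order for order in schedules(hold_tx_lock) if sends_after_stop(order)]
```

In this model `racy_schedules(False)` contains the single ordering check, stop, send, and `racy_schedules(True)` is empty. Whether the real driver needs such a lock is exactly what the commenters are debating, so treat this as a picture of the argument rather than a diagnosis.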
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #9 from [EMAIL PROTECTED] 2006-10-12 07:54 ---
Created an attachment (id=62)
--> (http://openib.org/bugzilla/attachment.cgi?id=62&action=view)
Please test this patch - does the crash happen with it?

Interesting. As a test, Scott, could you please check what happens with the attached patch applied?
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

[EMAIL PROTECTED] changed:

What|Removed|Added
CC||[EMAIL PROTECTED]
OS/Version|SLES 10|All

--- Comment #1 from [EMAIL PROTECTED] 2006-10-11 10:10 ---
I tried OFED 1.1 rc7 on RHEL4 U3 x86_64, using two hosts, each with dual-port HCAs. I am looping a script that turns IB ports on a Cisco IB switch off and back on, such that there will be an IPoIB failover every 20 seconds on one of the hosts. I ran ping and netserver on host 1, and netperf on host 2. After a few hours, host 1 gets an Oops:

ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault: [1] SMP
CPU 1
Modules linked in: ib_sdp(U) rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_ipoib7Losing some ticks... checking if CPU frequency changed.
(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 floppy sg ext3 jbd aic79xx sd_mod scsi_mod
Pid: 7155, comm: ib_mad1 Not tainted 2.6.9-34.ELsmp
RIP: 0010:[8030596b] 8030596b{_spin_lock_irqsave+12}4warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip mwait_idle+0x56/0x7c
RSP: 0018:0101bccd1c58 EFLAGS: 00010086
RAX: 0101bccd1cb8 RBX: 1b60167f RCX: a00e547d
RDX: dead4ead0001 RSI: RDI: 1b60167f
RBP: 0101b9c0f480 R08: 0003 R09: 0101b9c0f4a0
R10: 8040a900 R11: 8040a900 R12: 1b60167f
R13: fffc R14: R15: 1b6012ff
FS: () GS:804d7b80() knlGS:
CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 003e2678f4b0 CR3: bff28000 CR4: 06e0
Process ib_mad1 (pid: 7155, threadinfo 0101bccd, task 0101b94cd030)
Stack: 0286 a011195b 0101beca7000 0002 0101beca7380 0246 0246 802ab017
Call Trace:
a011195b{:ib_ipoib:path_rec_completion+450}
802ab017{dev_queue_xmit+525}
a00e54bd{:ib_sa:ib_sa_path_rec_callback+64}
a00e5a56{:ib_sa:send_handler+74}
a00db763{:ib_mad:ib_mad_complete_send_wr+418}
a00dbce5{:ib_mad:ib_mad_completion_handler+979}
a00db912{:ib_mad:ib_mad_completion_handler+0}
80146e1e{worker_thread+419}
801333c8{default_wake_function+0}
801333c8{default_wake_function+0}
8014aabc{keventd_create_kthread+0}
80146c7b{worker_thread+0}
8014aabc{keventd_create_kthread+0}
8014aa93{kthread+200}
80110e17{child_rip+8}
8014aabc{keventd_create_kthread+0}
8014a9cb{kthread+0}
80110e0f{child_rip+0}
Code: 81 7f 04 ad 4e ad de 74 1f 48 8b 74 24 18 48 c7 c7 ed f2 31
RIP 8030596b{_spin_lock_irqsave+12} RSP 0101bccd1c58
0Kernel panic - not syncing: Oops
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

[EMAIL PROTECTED] changed:

What|Removed|Added
Version|1.1rc6|1.1rc7

--- Comment #2 from [EMAIL PROTECTED] 2006-10-11 11:23 ---
Roland, can you look into this please? I don't think we want to take the risk of changing ipoib at this point, but this is still relevant for upstream.
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #3 from [EMAIL PROTECTED] 2006-10-11 15:13 ---
In both cases the final crash seems to be in the call spin_lock_irqsave(&priv->lock, flags); in path_rec_completion(). This would seem to indicate some sort of memory corruption, I guess. I don't know yet why dev_queue_xmit() would fail...
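One way to cross-check that reading is to decode the start of the "Code:" line from the comment #1 oops, whose RIP is at _spin_lock_irqsave+12. The sketch below assumes that the Code bytes begin at the faulting instruction, which the trace does not confirm. It is not a general disassembler; the function name is made up, and it handles only the single x86 encoding that appears here.

```python
import struct

# First seven bytes of the "Code:" line in the comment #1 oops.
CODE = bytes.fromhex("81 7f 04 ad 4e ad de")


def decode_cmp_imm32(code):
    """Decode 'cmp dword ptr [reg + disp8], imm32' (opcode 81 /7, mod=01).

    Returns (base_register_number, displacement, immediate).
    Any other encoding is rejected rather than guessed at.
    """
    opcode, modrm = code[0], code[1]
    if opcode != 0x81 or (modrm >> 3) & 7 != 7 or modrm >> 6 != 1:
        raise ValueError("not the 'cmp [reg+disp8], imm32' form")
    base_reg = modrm & 7                         # 7 means RDI when no REX prefix is present
    (disp8,) = struct.unpack("<b", code[2:3])    # signed 8-bit displacement
    (imm32,) = struct.unpack("<I", code[3:7])    # little-endian 32-bit immediate
    return base_reg, disp8, imm32
```

Under that assumption the instruction compares the 32-bit word at RDI+4 with 0xdead4ead, which is, as far as I know, the magic value that spinlock debugging stores in an initialized lock. The same pattern shows up in RDX in that register dump. A fault on this compare would mean the lock pointer in RDI was already bad when path_rec_completion() tried to take the lock, which fits the memory-corruption guess above but does not prove it.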
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #4 from [EMAIL PROTECTED] 2006-10-11 15:25 ---
OK, most likely dev_queue_xmit() is returning an error because the device is down, but that should be OK. I guess we have a race somewhere with up/downing the device at the same time as handling traffic.
[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop
http://openib.org/bugzilla/show_bug.cgi?id=263

--- Comment #5 from [EMAIL PROTECTED] 2006-10-11 16:35 ---
Scott, could you add debug_level=1 to the ib_ipoib module flags and rerun one of these tests? That will generate a boatload of logging output, but I'd just like to see the last part before a crash -- say the final 1000 lines or so. Thanks...

(unfortunately I don't have an appropriate setup to reproduce this at the moment but I'd like to try and make progress...)
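Since only the tail of a very large log is wanted here, a bounded buffer is one simple way to keep just the final lines of a captured console or dmesg stream. This is a minimal sketch: the function name and the default of 1000 are taken from the request above, not from any OFED tool.

```python
from collections import deque


def last_lines(lines, count=1000):
    """Return the final `count` items of an iterable of log lines.

    A deque with maxlen drops the oldest entry on each append, so memory
    use stays bounded however much debug output precedes the crash.
    """
    return list(deque(lines, maxlen=count))


# Example use (hypothetical file name):
#     with open("console.log") as log:
#         tail = last_lines(log)
```

Because the input is consumed as a plain iterable, the same helper works on an open file, on `sys.stdin`, or on lines read from a serial console capture.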