[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2007-02-27 Thread bugzilla-daemon
https://bugs.openfabrics.org/show_bug.cgi?id=263


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED




--- Comment #14 from [EMAIL PROTECTED]  2007-02-27 21:00 ---
With OFED 1.2 alpha1, I was able to failover/failback an IB port every 10
seconds for 8 hours on RHEL4 x86_64 LionMini SDR and DDR.  Will keep testing on
other platforms.


-- 
Configure bugmail: https://bugs.openfabrics.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-11-13 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




--- Comment #13 from [EMAIL PROTECTED]  2006-11-13 21:59 ---
The fix is merged into Linus's tree as commit 39798695 (i.e., sometime after
2.6.19-rc5).







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-11-06 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #12 from [EMAIL PROTECTED]  2006-11-06 20:26 ---
I finally spent some time tracking this down and I believe the problem is
actually in the MAD layer.  I will post more details and a patch to
openib-general.







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-18 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #11 from [EMAIL PROTECTED]  2006-10-18 09:56 ---
Roland, I enabled debug_level=1 with OFED 1.1 rc7 on RHEL4 U3 x86_64, and got
the same crash (on the netserver machine).

I could only see the debug_level=1 info by running dmesg in a loop, and the
info did not get saved into any /var/log files.  Is there some extra
configuration needed for syslog?  Shouldn't IPoIB debug_level=1 info go into a
syslog file by default?

Here's what I saw from the dmesg loop right before the crash.

ib1: Port state change event
ib0: Port state change event
ib1: Port state change event
ib0: flushing
ib0: downing ib_dev
ib1: flushing
ib1: downing ib_dev
ib0: Created ah 0101beffa800
ib1: Created ah 0101be636800
ib0: Created ah 0101be5724c0
ib1: Created ah 0101be9c8a80
ib0: Created ah 0101bfc57100
ib1: Created ah 0101be49f700
ib0: Created ah 0101beffa3c0
ib1: Created ah 0101beffae80
ib0: Created ah 0101be636b40
ib1: Created ah 01019dfecd40
ib0: Start path record lookup for fe80:0000:0000:0000:0005:ad00:0020:0861 MTU 1024
ib0: PathRec LID 0x0006 for GID fe80:0000:0000:0000:0005:ad00:0020:0861
ib0: Created ah 01019dfec600
ib0: created address handle 01019dfecac0 for LID 0x0006, SL 0
ib0: Port state change event
ib1: Port state change event
ib0: flushing
ib0: downing ib_dev
ib1: flushing
ib1: downing ib_dev
ib0: Start path record lookup for fe80:0000:0000:0000:0005:ad00:0020:0861 MTU 1024
ib0: PathRec LID 0x0006 for GID fe80:0000:0000:0000:0005:ad00:0020:0861
ib0: Created ah 0101beffa300
ib0: created address handle 01019dfec1c0 for LID 0x0006, SL 0
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: Created ah 0101bfc55e80
ib0: Created ah 0101bfc4cc80
ib0: Created ah 01019dfec480
ib0: Created ah 01019dfec3c0
ib0: Created ah 01019dfec100
Tue Oct 17 01:05:42 PDT 2006

Message from [EMAIL PROTECTED] at Tue Oct 17 01:05:43 2006 ...
svbu-qa-pcie-1 kernel: general protection fault:  [1] SMP


Here's the serial console output from the netserver machine.

ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault:  [1] SMP
CPU 0
Modules linked in: rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_ipoib(U) ib_mthca(U)
ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) md5
ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc ds
yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button battery ac
uhci_hcd ehci_hcd hw_random shpchp e1000 floppy sg ext3 jbd aic79xx sd_mod
scsi_mod
<7>Losing some ticks... checking if CPU frequency changed.
Pid: 7838, comm: ib_mad1 Not tainted 2.6.9-34.ELsmp
RIP: 0010:[a01c384b] a01c384b{:ib_ipoib:path_rec_completion+178}
RSP: 0018:0101a756bc70  EFLAGS: 00010202
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip mwait_idle+0x56/0x7c
RAX:  RBX:  RCX: 
RDX: 0101bbeffc80 RSI:  RDI: fffc
RBP: 0101bbeffc80 R08: 0003 R09: 0101bbeffca0
R10: 8011dfe0 R11: 8011dfe0 R12: 1b60167f
R13: fffc R14:  R15: 1b6012ff
FS:  () GS:804d7b00() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 006cf5e8 CR3: 00101000 CR4: 06e0
Process ib_mad1 (pid: 7838, threadinfo 0101a756a000, task 0101bdc3b030)
Stack: a00e547d 0101afda5000 0002 0101afda5380
   0246 0246 802ab017 0101bc16a500
   0101bbeffca0 0101bbeffc80
Call Trace:
   a00e547d{:ib_sa:ib_sa_path_rec_callback+0}
   802ab017{dev_queue_xmit+525}
   a01c3b0e{:ib_ipoib:path_rec_completion+885}
   a00e54bd{:ib_sa:ib_sa_path_rec_callback+64}
   a00e5a56{:ib_sa:send_handler+74}
   a00db763{:ib_mad:ib_mad_complete_send_wr+418}
   a00dbce5{:ib_mad:ib_mad_completion_handler+979}
   a00db912{:ib_mad:ib_mad_completion_handler+0}
   80146e1e{worker_thread+419}
   801333c8{default_wake_function+0}
   801333c8{default_wake_function+0}
   8014aabc{keventd_create_kthread+0}
   80146c7b{worker_thread+0}
   8014aabc{keventd_create_kthread+0}
   8014aa93{kthread+200}
   80110e17{child_rip+8}
   8014aabc{keventd_create_kthread+0}
   8014a9cb{kthread+0}
   80110e0f{child_rip+0}

Code: 49 8b 74 24 08 50 0f b6 42 16 50 0f b6 42 15 50 0f b6 42 14
RIP a01c384b{:ib_ipoib:path_rec_completion+178} RSP 0101a756bc70
<0>Kernel panic - not

[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-17 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #10 from [EMAIL PROTECTED]  2006-10-16 23:05 ---
I'm trying debug_level=1 now, sorry for the delay, but I wanted to finish other
rc7 testing.







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-12 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #6 from [EMAIL PROTECTED]  2006-10-12 06:00 ---
Calling netif_stop_queue() doesn't immediately stop the transmit queue.  It
might be necessary to take priv->tx_lock when calling netif_stop_queue() from
ipoib_stop() to ensure that ipoib_start_xmit() isn't in the middle of some work.







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-12 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #7 from [EMAIL PROTECTED]  2006-10-12 06:07 ---
Why is it necessary to ensure that ipoib_start_xmit()
isn't in the middle of some work?







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-12 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #8 from [EMAIL PROTECTED]  2006-10-12 07:17 ---
ipoib_start_xmit() only checks at entry to see if the queue is stopped.
ipoib_start_xmit() could still call unicast_arp_send() after a
netif_stop_queue().  In ipoib_stop(), I guess this will be synchronized somewhat
by the ipoib_flush_paths() in ipoib_ib_dev_down(), which also takes
priv->tx_lock, but this doesn't seem intentional.
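
To make the concern concrete, here is a minimal user-space sketch in C.  It is
not the attached patch and not the fix that was eventually merged; it only
models the idea of setting the "stopped" state under the same lock the transmit
path holds.  It assumes a hypothetical struct model_priv in which a pthread
mutex stands in for priv->tx_lock and a boolean stands in for the netif queue
state; every model_* name is invented for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the driver's private data. */
struct model_priv {
    pthread_mutex_t tx_lock;  /* plays the role of priv->tx_lock */
    bool queue_stopped;       /* plays the role of the netif queue state */
    int sent;                 /* packets handed to the "hardware" */
};

void model_init(struct model_priv *p)
{
    int rc = pthread_mutex_init(&p->tx_lock, NULL);

    assert(rc == 0);
    (void)rc;                 /* silence unused warning when NDEBUG is set */
    p->queue_stopped = false;
    p->sent = 0;
}

/*
 * Transmit path: the stopped flag is tested while tx_lock is held, so the
 * check and the send cannot be separated by a concurrent stop.
 */
bool model_start_xmit(struct model_priv *p)
{
    bool ok;

    pthread_mutex_lock(&p->tx_lock);
    ok = !p->queue_stopped;
    if (ok)
        p->sent++;
    pthread_mutex_unlock(&p->tx_lock);
    return ok;
}

/*
 * Stop path: the flag is set under tx_lock, so once this function returns
 * no transmit is still in progress and no new one can get past the check.
 */
void model_stop(struct model_priv *p)
{
    pthread_mutex_lock(&p->tx_lock);
    p->queue_stopped = true;
    pthread_mutex_unlock(&p->tx_lock);
}
```

In this model, a model_start_xmit() call made after model_stop() has returned
is rejected, and one that was already running must finish before model_stop()
can return.  Whether the real driver needs this is exactly the open question in
this comment.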







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-12 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #9 from [EMAIL PROTECTED]  2006-10-12 07:54 ---
Created an attachment (id=62)
 --> (http://openib.org/bugzilla/attachment.cgi?id=62&action=view)
Please test this patch - does the crash happen with it?

Interesting. As a test, Scott, could you pls check what
happens with the attached patch applied?







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-11 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[EMAIL PROTECTED]
         OS/Version|SLES 10                     |All




--- Comment #1 from [EMAIL PROTECTED]  2006-10-11 10:10 ---
I tried OFED 1.1 rc7 on RHEL4 U3 x86_64, using two hosts each with dual port
HCAs.  I am looping a script that turns IB ports on a Cisco IB switch off and
back on, such that there will be an IPoIB failover every 20 seconds on one of
the hosts.  I ran ping and netserver on host 1, and netperf on host 2.  After a
few hours, host 1 gets an Oops:

ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
general protection fault:  [1] SMP
CPU 1
Modules linked in: ib_sdp(U) rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_ipoib(U)
ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U)
ib_core(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd
nfs_acl sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod button
battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 floppy sg ext3 jbd aic79xx
sd_mod scsi_mod
<7>Losing some ticks... checking if CPU frequency changed.
Pid: 7155, comm: ib_mad1 Not tainted 2.6.9-34.ELsmp
RIP: 0010:[8030596b] 8030596b{_spin_lock_irqsave+12}
<4>warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip mwait_idle+0x56/0x7c
RSP: 0018:0101bccd1c58  EFLAGS: 00010086
RAX: 0101bccd1cb8 RBX: 1b60167f RCX: a00e547d
RDX: dead4ead0001 RSI:  RDI: 1b60167f
RBP: 0101b9c0f480 R08: 0003 R09: 0101b9c0f4a0
R10: 8040a900 R11: 8040a900 R12: 1b60167f
R13: fffc R14:  R15: 1b6012ff
FS:  () GS:804d7b80() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 003e2678f4b0 CR3: bff28000 CR4: 06e0
Process ib_mad1 (pid: 7155, threadinfo 0101bccd, task 0101b94cd030)
Stack:  0286  a011195b
   0101beca7000 0002 0101beca7380 0246
   0246 802ab017
Call Trace:
   a011195b{:ib_ipoib:path_rec_completion+450}
   802ab017{dev_queue_xmit+525}
   a00e54bd{:ib_sa:ib_sa_path_rec_callback+64}
   a00e5a56{:ib_sa:send_handler+74}
   a00db763{:ib_mad:ib_mad_complete_send_wr+418}
   a00dbce5{:ib_mad:ib_mad_completion_handler+979}
   a00db912{:ib_mad:ib_mad_completion_handler+0}
   80146e1e{worker_thread+419}
   801333c8{default_wake_function+0}
   801333c8{default_wake_function+0}
   8014aabc{keventd_create_kthread+0}
   80146c7b{worker_thread+0}
   8014aabc{keventd_create_kthread+0}
   8014aa93{kthread+200}
   80110e17{child_rip+8}
   8014aabc{keventd_create_kthread+0}
   8014a9cb{kthread+0}
   80110e0f{child_rip+0}

Code: 81 7f 04 ad 4e ad de 74 1f 48 8b 74 24 18 48 c7 c7 ed f2 31
RIP 8030596b{_spin_lock_irqsave+12} RSP 0101bccd1c58
<0>Kernel panic - not syncing: Oops







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-11 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263


[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|1.1rc6                      |1.1rc7




--- Comment #2 from [EMAIL PROTECTED]  2006-10-11 11:23 ---
Roland, can you look into this please?
I don't think we want to take the risk of changing ipoib at this point,
but this is still relevant for upstream.







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-11 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #3 from [EMAIL PROTECTED]  2006-10-11 15:13 ---
In both cases the final crash seems to be in the call

spin_lock_irqsave(&priv->lock, flags);

in path_rec_completion().  This would seem to indicate some sort of memory
corruption I guess.  I don't know yet why dev_queue_xmit() would fail...
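
One detail in the first Oops may be consistent with that guess: its "Code:"
bytes begin with 81 7f 04 ad 4e ad de, which on x86-64 decodes as
cmpl $0xdead4ead, 0x4(%rdi).  Assuming the dump starts at the faulting
instruction, and assuming that constant is the spinlock debug magic value (an
inference from 2.6-era kernel sources, not something stated in this thread),
the fault would have happened while reading the lock through the pointer in
RDI.  The sketch below only extracts the little-endian immediate from those
bytes; cmp_imm32() is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First seven bytes of the "Code:" line from the first Oops in this thread. */
const uint8_t oops_code[] = { 0x81, 0x7f, 0x04, 0xad, 0x4e, 0xad, 0xde };

/*
 * Hypothetical helper: for an instruction encoded as
 *   81 <ModRM> <disp8> <imm32>     (cmp r/m32, imm32 with an 8-bit offset)
 * return the 32-bit immediate, which x86 stores in little-endian order
 * after the opcode, ModRM and displacement bytes.
 */
uint32_t cmp_imm32(const uint8_t *code, size_t len)
{
    assert(len >= 7);
    assert(code[0] == 0x81);             /* group-1 opcode taking an imm32 */
    assert((code[1] >> 6) == 1);         /* ModRM mod=01: disp8 follows    */
    assert(((code[1] >> 3) & 7) == 7);   /* ModRM reg=7 selects CMP        */

    return (uint32_t)code[3]
         | (uint32_t)code[4] << 8
         | (uint32_t)code[5] << 16
         | (uint32_t)code[6] << 24;
}
```

Calling cmp_imm32(oops_code, sizeof oops_code) yields 0xdead4ead, and the
displacement byte 0x04 is the offset from RDI at which the comparison reads.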







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-11 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #4 from [EMAIL PROTECTED]  2006-10-11 15:25 ---
OK, most likely dev_queue_xmit() is returning an error because the device is
down, but that should be OK.  I guess we have a race somewhere with up/downing
the device at the same time as handling traffic.







[openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop

2006-10-11 Thread bugzilla-daemon
http://openib.org/bugzilla/show_bug.cgi?id=263





--- Comment #5 from [EMAIL PROTECTED]  2006-10-11 16:35 ---
Scott, could you add debug_level=1 to the ib_ipoib module flags and rerun one
of these tests?  That will generate a boatload of logging output, but I'd just
like to see the last part before a crash -- say the final 1000 lines or so. 
Thanks...

(unfortunately I don't have an appropriate setup to reproduce this at the
moment but I'd like to try and make progress...)



