Another point: this seems to be crashing while we requeue the packet through dev_queue_xmit upon path record completion. It looks like this could try to requeue even though the interface is going down - could that be what triggers the problem?
Quoting Michael S. Tsirkin <[EMAIL PROTECTED]>:

Subject: Fwd: Re: Problems with OFED IPoIB HA on SLES10

BTW, any idea? The ipoib_ha is just a script that ups/downs and configures interfaces, so it seems this crash could also happen on systems without it.

-- MST

Date: Tue, 3 Oct 2006 22:39:54 -0700
From: "Scott Weitzenkamp (sweitzen)" <[EMAIL PROTECTED]>
Subject: Re: [openib-general] Problems with OFED IPoIB HA on SLES10

If I fail back and forth between ib0 and ib1 every 30 seconds or so for several hours, while IPoIB traffic is running, the IPoIB host gets an Oops and IPoIB stops working.

ib1: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
[message repeated many times for both ib0 and ib1]

general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 7
Modules linked in: af_packet ib_sdp rdma_ucm rdma_cm ib_addr ib_cm ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core nls_utf8 st ipv6 nfs lockd nfs_acl sunrpc button battery ac apparmor aamatch_pcre loop usbhid dm_mod hw_random ide_cd ehci_hcd uhci_hcd cdrom i8xx_tco ide_floppy usbcore shpchp e1000 pci_hotplug floppy reiserfs edd fan thermal processor siimage sg mptspi mptscsih mptbase scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
Pid: 23541, comm: ib_mad1 Tainted: G U 2.6.16.21-0.8-smp #1
RIP: 0010:[<ffffffff802cffea>] <ffffffff802cffea>{_spin_lock_irqsave+3}
RSP: 0018:ffff810132a4fc20 EFLAGS: 00010086
RAX: 0000000000000286 RBX: 0000000000000000 RCX: ffffffff883324ee
RDX: ffff810128d5e380 RSI: 0000000000000000 RDI: 0000ffff1b6017ff
RBP: 00000000fffffffc R08: ffffffff803d3260 R09: ffff810140333800
R10: ffff81000107d400 R11: 0000000000000292 R12: ffff810128d5e380
R13: ffff810132a4fc78 R14: 0000ffff1b6017ff R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff810142d19740(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b0b5e6ae180 CR3: 0000000128cbc000 CR4: 00000000000006e0
Process ib_mad1 (pid: 23541, threadinfo ffff810132a4e000, task ffff810142b56100)
Stack: ffffffff8833c5f5 ffff8101302b3000 0000ffff1b6012ff 0000000000000002
       0000000000000296 ffff8101302b3500 ffffffff8027753e ffff810128d5e3a0
       ffff81012bce1680 ffff810128d5e380
Call Trace: <ffffffff8833c5f5>{:ib_ipoib:path_rec_completion+862}
       <ffffffff8027753e>{dev_queue_xmit+545}
       <ffffffff8833c5b2>{:ib_ipoib:path_rec_completion+795}
       <ffffffff8833252e>{:ib_sa:ib_sa_path_rec_callback+64}
       <ffffffff80138f17>{lock_timer_base+27}
       <ffffffff80138f89>{try_to_del_timer_sync+81}
       <ffffffff883322b3>{:ib_sa:send_handler+72}
       <ffffffff8826762f>{:ib_mad:ib_mad_complete_send_wr+421}
       <ffffffff88267f00>{:ib_mad:ib_mad_completion_handler+947}
       <ffffffff88267b4d>{:ib_mad:ib_mad_completion_handler+0}
       <ffffffff80140177>{run_workqueue+153}
       <ffffffff8014081e>{worker_thread+0}
       <ffffffff801437e5>{keventd_create_kthread+0}
       <ffffffff80140927>{worker_thread+265}
       <ffffffff8012787f>{__wake_up_common+62}
       <ffffffff8012905a>{default_wake_function+0}
       <ffffffff801437e5>{keventd_create_kthread+0}
       <ffffffff80143aca>{kthread+236}
       <ffffffff8010b60a>{child_rip+8}
       <ffffffff801437e5>{keventd_create_kthread+0}
       <ffffffff801439de>{kthread+0}
       <ffffffff8010b602>{child_rip+0}
Code: f0 ff 0f 0f 88 29 01 00 00 c3 fa f0 ff 0f 0f 88 2a 01 00 00
RIP <ffffffff802cffea>{_spin_lock_irqsave+3} RSP <ffff810132a4fc20>

Scott Weitzenkamp
SQA and Release Manager, Server Virtualization Business Unit
Cisco Systems

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Scott Weitzenkamp (sweitzen)
Sent: Tuesday, October 03, 2006 2:53 PM
To: Vladimir Sokolovsky
Cc: EWG; openib-general
Subject: Re: [openib-general] [openfabrics-ewg] Problems with OFED IPoIB HA on SLES10

Vlad, thanks for the fast response. I have some follow-up questions about configuring IPoIB HA, see below.

3) I got IPoIB HA working on SLES 10, but the documentation is a little lacking. Looks like I have to put the same IP address in ifcfg-ib0 and ifcfg-ib1, is this correct?
Yes, the IP address should be the same. Actually, the configuration of the secondary interface does not matter: the High Availability daemon reads the configuration of the primary interface and migrates it between the interfaces in case of failure.

If I don't have an ifcfg-ib1 file, then ipoib_ha.pl won't start. I would prefer not to configure ifcfg-ib1, since I don't plan to use it.

# ipoib_ha.pl --with-arping --with-multicast -v
Can't open conf /etc/sysconfig/network/ifcfg-ib1: No such file or directory
[message repeated]
...

If I put different IP addresses in ifcfg-ib0 and ifcfg-ib1, then the ifcfg-ib1 IP address is used for both ib0 and ib1!

# pwd
/etc/sysconfig/network
# cat ifcfg-ib0
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.2.46
NETMASK=255.255.255.0
ONBOOT=yes
# cat ifcfg-ib1
DEVICE=ib1
BOOTPROTO=static
IPADDR=192.168.6.46
NETMASK=255.255.255.0
ONBOOT=yes
# /etc/init.d/openibd start
Loading HCA driver and Access Layer: [ OK ]
Setting up InfiniBand network interfaces:
    ib0 device: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
    ib0 configuration: ib1
Bringing up interface ib0: [ OK ]
    ib1 device: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)
Bringing up interface ib1: [ OK ]
Setting up service network . . .
[ done ]
# ifconfig ib0
ib0   Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
      inet addr:192.168.6.46  Bcast:192.168.6.255  Mask:255.255.255.0
      inet6 addr: fe80::202:c902:21:700d/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
      RX packets:0 errors:0 dropped:0 overruns:0 frame:0
      TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:128
      RX bytes:0 (0.0 b)  TX bytes:224 (224.0 b)
# ifconfig ib1
ib1   Link encap:UNSPEC  HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
      inet addr:192.168.6.46  Bcast:192.168.6.255  Mask:255.255.255.0
      inet6 addr: fe80::202:c902:21:700e/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
      RX packets:0 errors:0 dropped:0 overruns:0 frame:0
      TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:128
      RX bytes:0 (0.0 b)  TX bytes:304 (304.0 b)

Notice how both ib0 and ib1 have the IP address from ifcfg-ib1. This contradicts this info from ipoib_release_notes.txt:
b. The ib1 interface uses the configuration script of ib0.

Scott

_______________________________________________
openfabrics-ewg mailing list
[EMAIL PROTECTED]
http://openib.org/mailman/listinfo/openfabrics-ewg

_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- MST
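For reference, per Vladimir's answer earlier in the thread (the secondary interface's configuration is ignored, and the primary's is migrated on failover, but the file must exist or ipoib_ha.pl refuses to start), a working setup would give both files the same address. A sketch under that assumption, reusing the thread's example address:

```shell
# /etc/sysconfig/network/ifcfg-ib0 -- primary interface
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.2.46
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network/ifcfg-ib1 -- same IPADDR as ib0; its own
# settings are not used by the HA daemon, but the file must exist
DEVICE=ib1
BOOTPROTO=static
IPADDR=192.168.2.46
NETMASK=255.255.255.0
ONBOOT=yes
```

This avoids the surprise shown above, where a different IPADDR in ifcfg-ib1 ended up on both interfaces.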