Re: [ewg] possible bug in rds?
On Wed, Mar 10, 2010 at 03:51:36PM -0800, Andy Grover wrote:
>
> I've opened a bug:
>
> https://bugs.openfabrics.org/show_bug.cgi?id=1983
>
> Did this just start happening? What is the test doing when this
> occurred? Please add to the bug if possible, and I'll try to diagnose
> further.
>

Follow up response in bugzilla.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] possible bug in rds?
Eli Cohen wrote:
> Hi Andy,
>
> in our regression tests we've encountered a kernel oops with the
> following stack dump:
> Examining the dump I see the failure results in trying to call
> hlist_del() twice on the same pointer (I can see that by the poisoned
> pointer RCX: 00200200).
> Could it be that rds will call rdma_destroy_id() which will result in
> the described behaviour?

I've opened a bug:

https://bugs.openfabrics.org/show_bug.cgi?id=1983

Did this just start happening? What is the test doing when this
occurred? Please add to the bug if possible, and I'll try to diagnose
further.

Thanks -- Regards -- Andy
[ewg] possible bug in rds?
Hi Andy,

in our regression tests we've encountered a kernel oops with the following stack dump:

Call trace:
Mar 1 05:45:50 sw134 kernel: mlx4_en: eth2: Link Down
Mar 1 05:46:00 sw134 kernel: mlx4_en: eth2: Link Up
Mar 1 05:46:00 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar 1 05:46:01 sw134 /usr/sbin/cron[16940]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null)
Mar 1 05:46:01 sw134 /usr/sbin/cron[16941]: (root) CMD (/usr/check_mswg.csh >/dev/null)
Mar 1 05:46:01 sw134 /usr/sbin/cron[16942]: (root) CMD (/.autodirect/LIT/CRONTABS/do_it_now.sh > /dev/null)
Mar 1 05:46:03 sw134 kernel: Unable to handle kernel paging request at 00200200 RIP:
Mar 1 05:46:03 sw134 kernel: {:rdma_cm:rdma_destroy_id+399}
Mar 1 05:46:03 sw134 kernel: PGD 0
Mar 1 05:46:03 sw134 kernel: Oops: 0002 [1] SMP
Mar 1 05:46:03 sw134 kernel: last sysfs file: /class/infiniband/mlx4_0/ports/1/gids/127
Mar 1 05:46:03 sw134 kernel: CPU 0
Mar 1 05:46:03 sw134 kernel: Modules linked in: 8021q mst_pciconf mst_pci rdma_ucm rds_tcp rds_rdma rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_cm ib_sa ib_uverbs ib_umad mlx4_en mlx4_core ib_mad ib_core memtrack autofs4 cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table nfs lockd nfs_acl sunrpc ipv6 af_packet dock button battery ac apparmor nls_iso8859_1 nls_cp437 vfat fat loop dm_mod ohci_hcd ide_cd cdrom generic ehci_hcd shpchp pci_hotplug i2c_piix4 i2c_core usbcore mptctl tg3 floppy ext3 jbd edd fan thermal processor mptsas mptscsih sg mptbase scsi_transport_sas sata_svw libata serverworks sd_mod scsi_mod ide_disk ide_core
Mar 1 05:46:03 sw134 kernel: Pid: 15000, comm: krdsd Tainted: GU 2.6.16.60-0.54.5-smp #1
Mar 1 05:46:03 sw134 kernel: RIP: 0010:[] {:rdma_cm:rdma_destroy_id+399}
Mar 1 05:46:03 sw134 kernel: RSP: 0018:81000dad7dd8 EFLAGS: 00010206
Mar 1 05:46:03 sw134 kernel: RAX: 00100100 RBX: 81012d2ba740 RCX: 00200200
Mar 1 05:46:03 sw134 kernel: RDX: 81010ee445b8 RSI: 8101248c0048 RDI: 81012bdaf800
Mar 1 05:46:03 sw134 kernel: RBP: 81010ee44400 R08: R09:
Mar 1 05:46:03 sw134 kernel: R10: R11: R12: 8101248c0048
Mar 1 05:46:03 sw134 kernel: R13: 8101248c0290 R14: 8846ca40 R15:
Mar 1 05:46:03 sw134 kernel: FS: 2b2f96622ae0() GS:803dc000() knlGS:
Mar 1 05:46:03 sw134 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
Mar 1 05:46:03 sw134 kernel: CR2: 00200200 CR3: 00101000 CR4: 06e0
Mar 1 05:46:03 sw134 kernel: Process krdsd (pid: 15000, threadinfo 81000dad6000, task 810126bf5040)
Mar 1 05:46:03 sw134 kernel: Stack: 81012bdaf800 81012bdaf800 81010e7fa000 884891f4
Mar 1 05:46:03 sw134 kernel:81000100f700 230b363c2cf0 0002220f 0f5eb200
Mar 1 05:46:03 sw134 kernel:8101248c0048 802f0652
Mar 1 05:46:03 sw134 kernel: Call Trace: {:rds_rdma:rds_ib_conn_shutdown+477}
Mar 1 05:46:03 sw134 kernel:{mutex_lock+13} {:rds:rds_shutdown_worker+163}
Mar 1 05:46:04 sw134 kernel: {run_workqueue+139} {worker_thread+0}
Mar 1 05:46:04 sw134 kernel: {keventd_create_kthread+0} {worker_thread+244}
Mar 1 05:46:04 sw134 kernel: {default_wake_function+0} {kthread+236}
Mar 1 05:46:04 sw134 kernel:{child_rip+8} {keventd_create_kthread+0}
Mar 1 05:46:04 sw134 kernel:{kthread+0} {child_rip+0}
Mar 1 05:46:04 sw134 kernel:
Mar 1 05:46:04 sw134 kernel: Code: 48 89 01 74 04 48 89 48 08 48 c7 85 b8 01 00 00 00 01 10 00
Mar 1 05:46:04 sw134 kernel: RIP {:rdma_cm:rdma_destroy_id+399} RSP
Mar 1 05:46:04 sw134 kernel: CR2: 00200200
Mar 1 05:46:09 sw134 kernel: <6>mlx4_en: eth2: Link Down
Mar 1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up
Mar 1 05:46:20 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar 1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Down
Mar 1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up
Mar 1 05:46:21 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready

Examining the dump I see the failure results in trying to call hlist_del() twice on the same pointer (I can see that by the poisoned pointer RCX: 00200200).

Could it be that rds will call rdma_destroy_id() which will result in the described behaviour?
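
For anyone following the poisoned-pointer reasoning, here is a minimal user-space sketch of how hlist_del() poisons a node and why a second call on the same node would fault at 00200200. It is an illustration only: the structs and function bodies are re-typed from memory of the kernel's hlist implementation of that era, not taken from the 2.6.16.60 source under test, and hlist_node_poisoned() and second_del_write_addr() are hypothetical helpers that do not exist in the kernel. The sketch deliberately never performs the second delete, since that would crash.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Poison values the kernel writes into a node once it has been unlinked. */
#define LIST_POISON1 ((void *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((void *)(uintptr_t)0x00200200)

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;   /* address of the pointer that points at us */
};

struct hlist_head {
    struct hlist_node *first;
};

/* Insert n at the front of the list h. */
static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    struct hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n, then poison its pointers so that reuse is caught. */
static void hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    /* On a second call pprev == LIST_POISON2, so this store is the
     * write to 00200200 that shows up in CR2. */
    *pprev = next;
    if (next)
        next->pprev = pprev;

    n->next = LIST_POISON1;
    n->pprev = LIST_POISON2;
}

/* Hypothetical helper: true if n has already been through hlist_del(). */
static bool hlist_node_poisoned(const struct hlist_node *n)
{
    return (void *)n->next == LIST_POISON1 &&
           (void *)n->pprev == LIST_POISON2;
}

/* Hypothetical helper: the address a further hlist_del(n) would write to. */
static uintptr_t second_del_write_addr(const struct hlist_node *n)
{
    return (uintptr_t)n->pprev;
}
```

This lines up with the dump: RAX holds 00100100 (the poisoned next), RCX holds 00200200 (the poisoned pprev), and the first bytes of the Code: line, 48 89 01, decode to a store of RAX through RCX, which is the `*pprev = next` step. The oops error code 0002 likewise indicates a kernel-mode write to a non-present page.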