Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
David Sterba wrote: Hi, a candidate fix: http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=commit;h=aa0467d8d2a00e75b2bb6a56a4ee6d70c5d1928f With Linus' tree, today's linux-next build (powercp ppc64_defconfig) produced this warning: fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode': fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used uninitialized in this function Introduced by commit 16cdcec736cd (btrfs: implement delayed inode items operation). This fixes a bug in btrfs_update_inode(): if the returned value from btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not updated and several call paths may hit a BUG_ON or fail with strange code. if you can reproduce it reliably, add this patch on top of the delayed inodes. I cherry-picked aa0467d8d2a00e on top of 16cdcec736cd21, which gave me the following instead of a BUG: [ 246.986087] [ cut here ] [ 246.990714] WARNING: at mm/page_alloc.c:2032 __alloc_pages_slowpath+0x54/0x3c5() [ 246.998100] Hardware name: PowerEdge 1950 [ 247.002110] Modules linked in: btrfs zlib_deflate lzo_compress ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom ib_mthca qla2xxx serio_raw ib_mad ib_core scsi_transport_fc button scsi_tgt ata_piix libata scsi_mod dcdbas i5k_amb tpm_tis tpm ioatdma hwmon tpm_bios ehci_hcd dca i5000_edac pcspkr iTCO_wdt uhci_hcd iTCO_vendor_support edac_core rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last unloaded: freq_table] [ 247.076305] Pid: 6044, comm: cosd Not tainted 2.6.39-2-gf47e9fd #25 [ 247.082911] Call Trace: [ 247.085358] [810ccf8a] ? __alloc_pages_slowpath+0x54/0x3c5 [ 247.091793] [81049379] ? warn_slowpath_common+0x85/0x9e [ 247.097974] [810493ac] ? warn_slowpath_null+0x1a/0x1c [ 247.103974] [810ccf8a] ? __alloc_pages_slowpath+0x54/0x3c5 [ 247.110409] [810ccae2] ? get_page_from_freelist+0x166/0x198 [ 247.116928] [810cd3c5] ? __alloc_pages_nodemask+0xca/0xf4 [ 247.123297] [a0702884] ? unmap_extent_buffer+0x11/0x13 [btrfs] [ 247.130079] [810fd957] ? alloc_pages_current+0xa3/0xac [ 247.136166] [810cc5f7] ? alloc_pages+0xe/0x10 [ 247.141472] [810cc607] ? __get_free_pages+0xe/0x4b [ 247.147218] [811061eb] ? kmalloc_order_trace+0x27/0x55 [ 247.153304] [8110664e] ? __kmalloc+0x37/0x100 [ 247.158625] [a072622a] ? btrfs_batch_insert_items+0xe0/0x229 [btrfs] [ 247.165933] [a06d454b] ? btrfs_block_rsv_release+0x39/0x3b [btrfs] [ 247.173072] [a07265e7] ? btrfs_insert_delayed_items+0xac/0xef [btrfs] [ 247.180472] [a0726798] ? btrfs_run_delayed_items+0x68/0xd9 [btrfs] [ 247.187610] [a06e85fc] ? btrfs_commit_transaction+0x266/0x5c9 [btrfs] [ 247.195000] [81066118] ? list_del_init+0x21/0x21 [ 247.200583] [a0710c4c] ? create_subvol+0x420/0x440 [btrfs] [ 247.207018] [810362ec] ? need_resched+0x23/0x2d [ 247.212511] [a0710d7a] ? btrfs_mksubvol+0x10e/0x167 [btrfs] [ 247.219043] [a071129f] ? btrfs_ioctl_snap_create_transid+0x9c/0x121 [btrfs] [ 247.226962] [a071145e] ? btrfs_ioctl_snap_create+0x50/0x67 [btrfs] [ 247.234101] [a0712c5a] ? btrfs_ioctl+0x1d0/0x2c6 [btrfs] [ 247.240364] [8111fe12] ? vfs_ioctl+0x1d/0x34 [ 247.245582] [8112048d] ? do_vfs_ioctl+0x171/0x17a [ 247.251242] [811130a2] ? fget_light+0x69/0x81 [ 247.256549] [811204f2] ? sys_ioctl+0x5c/0x7c [ 247.261770] [8111d48d] ? putname+0x33/0x37 [ 247.266819] [813b21eb] ? system_call_fastpath+0x16/0x1b [ 247.272993] ---[ end trace 9c75d74017f060f5 ]--- The mkcephfs command I was attempting succeeded, so I'm not sure if the above matters; mm/page_alloc.c:2032 is /* * In the slowpath, we sanity check order to avoid ever trying to * reclaim = MAX_ORDER areas which will never succeed. Callers may * be using allocators in order of preference for an area that is * too large. */ if (order = MAX_ORDER) { WARN_ON_ONCE(!(gfp_mask __GFP_NOWARN)); return NULL; } When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178]
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. [ rest of the oops below for context ] -chris [ 276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0 [ 276.365127] Oops: [#1] SMP [ 276.365127] CPU 2 [ 276.365127] Modules linked in: btrfs zlib_deflate lzo_compress ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc scsi_tgt dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon iTCO_wdt scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core uhci_hcd pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last unloaded: freq_table] [ 276.365127] [ 276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 #26 Dell Inc. PowerEdge 1950/0DT097 [ 276.365127] RIP: 0010:[a05434b1] [a05434b1] journal_start+0x3e/0x9c [jbd] [ 276.365127] RSP: 0018:8801e2897b28 EFLAGS: 00010286 [ 276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 0002 [ 276.365127] RDX: 19b2d000 RSI: 000e RDI: 000e [ 276.365127] RBP: 8801e2897b48 R08: 0003 R09: 8801e2897c38 [ 276.365127] R10: 8801e2897ed8 R11: 0001 R12: 880223ff4400 [ 276.365127] R13: 880218522d60 R14: 0ec6 R15: 88021f54d878 [ 276.365127] FS: 7f8ff0bbb710() GS:88022fc8() knlGS: [ 276.365127] CS: 0010 DS: ES: CR0: 8005003b [ 276.365127] CR2: 000a CR3: 00021744f000 CR4: 06e0 [ 276.365127] DR0: DR1: DR2: [ 276.365127] DR3: DR6: 0ff0 DR7: 0400 [ 276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 880218522d60) [ 276.365127] Stack: [ 276.365127] 8801e2897b68 ea000756e788 88021f54d728 8801e2897c78 [ 276.365127] 8801e2897b58 a05670ce 8801e2897b68 a055c72d [ 276.365127] 8801e2897be8 a055f044 8801e2897c38 0074 [ 276.365127] Call Trace: [ 276.365127] [a05670ce] ext3_journal_start_sb+0x4f/0x51 [ext3] [ 276.365127] [a055c72d] ext3_journal_start+0x12/0x14 [ext3] [ 276.365127] [a055f044] ext3_write_begin+0x93/0x1a1 [ext3] [ 276.365127] [810c6f0e] ? __kunmap_atomic+0xe/0x10 [ 276.365127] [810c75e5] generic_perform_write+0xb1/0x172 [ 276.365127] [81036a33] ? need_resched+0x23/0x2d [ 276.365127] [810c76ea] generic_file_buffered_write+0x44/0x6f [ 276.365127] [810c91f5] __generic_file_aio_write+0x253/0x2a8 [ 276.365127] [810c92ad] generic_file_aio_write+0x63/0xb8 [ 276.365127] [81113b26] do_sync_write+0xc7/0x10b [ 276.365127] [81036a4b] ? should_resched+0xe/0x2f [ 276.365127] [813b0faf] ? _cond_resched+0xe/0x22 [ 276.365127] [811986c3] ? security_file_permission+0x2c/0x31 [ 276.365127] [81113d21] ? rw_verify_area+0xac/0xdb [ 276.365127] [81114253] vfs_write+0xac/0xe4 [ 276.365127] [8111434f] sys_write+0x4c/0x71 [ 276.365127] [813b8beb] system_call_fastpath+0x16/0x1b [ 276.365127] Code: 89 fc 48 c7 c3 e2 ff ff ff 89 f7 65 4c 8b 2c 25 c0 b5 00 00 4d 85 e4 49 8b 85 48 06 00 00 74
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? sage [ rest of the oops below for context ] -chris [ 276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0 [ 276.365127] Oops: [#1] SMP [ 276.365127] CPU 2 [ 276.365127] Modules linked in: btrfs zlib_deflate lzo_compress ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc scsi_tgt dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon iTCO_wdt scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core uhci_hcd pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last unloaded: freq_table] [ 276.365127] [ 276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 #26 Dell Inc. PowerEdge 1950/0DT097 [ 276.365127] RIP: 0010:[a05434b1] [a05434b1] journal_start+0x3e/0x9c [jbd] [ 276.365127] RSP: 0018:8801e2897b28 EFLAGS: 00010286 [ 276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 0002 [ 276.365127] RDX: 19b2d000 RSI: 000e RDI: 000e [ 276.365127] RBP: 8801e2897b48 R08: 0003 R09: 8801e2897c38 [ 276.365127] R10: 8801e2897ed8 R11: 0001 R12: 880223ff4400 [ 276.365127] R13: 880218522d60 R14: 0ec6 R15: 88021f54d878 [ 276.365127] FS: 7f8ff0bbb710() GS:88022fc8() knlGS: [ 276.365127] CS: 0010 DS: ES: CR0: 8005003b [ 276.365127] CR2: 000a CR3: 00021744f000 CR4: 06e0 [ 276.365127] DR0: DR1: DR2: [ 276.365127] DR3: DR6: 0ff0 DR7: 0400 [ 276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 880218522d60) [ 276.365127] Stack: [ 276.365127] 8801e2897b68 ea000756e788 88021f54d728 8801e2897c78 [ 276.365127] 8801e2897b58 a05670ce 8801e2897b68 a055c72d [ 276.365127] 8801e2897be8 a055f044
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? Anyway, assuming it's useful, I think the below would fix the problem. Want to give it a shot, Jim? Thanks! sage diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index c571734..fd04ad7 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1196,6 +1196,9 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, put_transaction(cur_trans); mutex_unlock(root-fs_info-trans_mutex); + if (current-journal_info == trans) + current-journal_info = NULL; + return 0; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Oh I see, okay. So clearing it in btrfs_commit_transaction_async should be fine then, right? When btrfs_commit_transaction runs in the other thread it won't care that current-journal_info is NULL. sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On 06/10/2011 02:35 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Oh I see, okay. So clearing it in btrfs_commit_transaction_async should be fine then, right? When btrfs_commit_transaction runs in the other thread it won't care that current-journal_info is NULL. Oh yeah your patch is good :), Josef -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:35 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Oh I see, okay. So clearing it in btrfs_commit_transaction_async should be fine then, right? When btrfs_commit_transaction runs in the other thread it won't care that current-journal_info is NULL. Oh yeah your patch is good :), Okay cool. Here's the fix with a proper changelog and a little use-after-free paranoia. Thanks! sage From 9881c0752293769d5133c01dff3ab04c0c24c61b Mon Sep 17 00:00:00 2001 From: Sage Weil s...@newdream.net Date: Fri, 10 Jun 2011 11:41:25 -0700 Subject: [PATCH] Btrfs: clear current-journal_info on async transaction commit Normally current-jouranl_info is cleared by commit_transaction. For an async snap or subvol creation, though, it runs in a work queue. Clear it in btrfs_commit_transaction_async() to avoid leaking a non-NULL journal_info when we return to userspace. When the actual commit runs in the other thread it won't care that it's current-journal_info is already NULL. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/transaction.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index dd71966..9d516ed 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1118,8 +1118,11 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, wait_current_trans_commit_start_and_unblock(root, cur_trans); else wait_current_trans_commit_start(root, cur_trans); - put_transaction(cur_trans); + if (current-journal_info == trans) + current-journal_info = NULL; + + put_transaction(cur_trans); return 0; } -- 1.7.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
MDS configuration
Hi! Which configuration would you recommended for cluster with 50-80 million files? WBR, Fyodor. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected
Sage Weil wrote: On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:35 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Josef Bacik wrote: On 06/10/2011 02:14 PM, Sage Weil wrote: On Fri, 10 Jun 2011, Sage Weil wrote: On Fri, 10 Jun 2011, Chris Mason wrote: Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400: [ two different btrfs crashes ] I think your two crashes in btrfs were from the uninit variables and those should be fixed in rc2. When I did my bisection, my criteria for success/failure was did mkcephfs succeed?. When I apply this criteria to a recent linus kernel (e.g. 06e86849cf4019), which includes the fix you mentioned (aa0467d8d2a00e), I get still a different failure mode, which doesn't actually reference btrfs: [ 276.364178] BUG: unable to handle kernel NULL pointer dereference at 000a [ 276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd] Looking at the resulting code in the oops, we're here in journal_start: if (handle) { J_ASSERT(handle-h_transaction-t_journal == journal); handle comes from current-journal_info, and we're doing a deref on handle-h_transaction, which is probably 0xa. So, we're leaving crud in current-journal_info and ext3 is finding it. Perhaps its from ceph starting a transaction but leaving it running? The bug came with Josef's transaction performance fixes, but it is probably a mixture of his code with the ioctls ceph is using. Ah, yeah, that's the problem. We saw a similar problem a while back with the start/stop transaction ioctls. In this case, create_snapshot is doing trans = btrfs_start_transaction(root-fs_info-extent_root, 5); if (IS_ERR(trans)) { ret = PTR_ERR(trans); goto fail; } which sets current-journal_info. Then ret = btrfs_snap_reserve_metadata(trans, pending_snapshot); BUG_ON(ret); list_add(pending_snapshot-list, trans-transaction-pending_snapshots); if (async_transid) { *async_transid = trans-transid; ret = btrfs_commit_transaction_async(trans, root-fs_info-extent_root, 1); } else { ret = btrfs_commit_transaction(trans, root-fs_info-extent_root); } but the async snap creation ioctl takes the async path, which runs btrfs_commit_transaction in a worker thread. I'm not sure what the right thing to do is here is... can whatever is in journal_info be attached to trans instead in btrfs_commit_transaction_async()? It looks like it's not used for anything in btrfs, actually; it's just set and cleared. What's the point of that? It is used now, check the beginning of start_transaction(). Thanks, Oh I see, okay. So clearing it in btrfs_commit_transaction_async should be fine then, right? When btrfs_commit_transaction runs in the other thread it won't care that current-journal_info is NULL. Oh yeah your patch is good :), Okay cool. Here's the fix with a proper changelog and a little use-after-free paranoia. This patch solves my issue - thanks a lot. Tested-by: Jim Schutt jasc...@sandia.gov -- Jim Thanks! sage From 9881c0752293769d5133c01dff3ab04c0c24c61b Mon Sep 17 00:00:00 2001 From: Sage Weil s...@newdream.net Date: Fri, 10 Jun 2011 11:41:25 -0700 Subject: [PATCH] Btrfs: clear current-journal_info on async transaction commit Normally current-jouranl_info is cleared by commit_transaction. For an async snap or subvol creation, though, it runs in a work queue. Clear it in btrfs_commit_transaction_async() to avoid leaking a non-NULL journal_info when we return to userspace. When the actual commit runs in the other thread it won't care that it's current-journal_info is already NULL. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/transaction.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index dd71966..9d516ed 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1118,8 +1118,11 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, wait_current_trans_commit_start_and_unblock(root, cur_trans); else wait_current_trans_commit_start(root, cur_trans); - put_transaction(cur_trans); + if (current-journal_info == trans) + current-journal_info = NULL; + + put_transaction(cur_trans); return 0; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ceph to copy files. 500000 journal throttle: waited for ops and fault with nothing to send
hi,all When I copy a contains about 50 files folder to ceph file system! All files size is under 2 M ! In the executive to half of the time I found that the error log : [root@ceph_osd1 ~]# tail /var/log/ceph/osd.1.log -f 2011-06-11 00:26:56.654530 7f19f221c710 journal throttle: waited for ops 2011-06-11 00:26:56.845956 7f19f2a1d710 journal throttle: waited for ops 2011-06-11 00:27:22.378345 7f19f2a1d710 journal throttle: waited for ops 2011-06-11 00:27:22.561483 7f19f221c710 journal throttle: waited for ops 2011-06-11 00:42:35.491143 7f19ef216710 -- 192.168.0.209:6801/31773 192.168.0.208:6801/26514 pipe(0x1281280 sd=13 pgs=4 cs=5 l=0).fault with nothing to send, going to standby 2011-06-11 06:03:07.834786 7f19ef216710 -- 192.168.0.209:6801/31773 192.168.0.208:6801/26514 pipe(0x4c20780 sd=16 pgs=0 cs=0 l=0).accept connect_seq 6 vs existing 6 state 1 2011-06-11 06:03:08.513372 7f19f2a1d710 osd1 4 pg[1.70( v 4'3143 (4'3141,4'3143] n=385 ec=2 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'3141 !hml active+clean] sending commit on repgather(0x281df4b0 applied 4'3143 rep_tid=256248 wfack= wfdisk= op=osd_op(mds0.1:334034 200.02a6 [write 702932~3164] 1.a9f0) v2) 0x30140540 2011-06-11 06:04:02.963002 7f19f221c710 osd1 4 pg[1.70( v 4'3144 (4'3142,4'3144] n=385 ec=2 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'3142 !hml active+clean] sending commit on repgather(0x281df4b0 applied 4'3144 rep_tid=256249 wfack= wfdisk= op=osd_op(mds0.1:334037 200.02a6 [write 706096~1232] 1.a9f0) v2) 0x30140380 2011-06-11 06:04:09.489288 7f19f2a1d710 osd1 4 pg[1.70( v 4'3145 (4'3143,4'3145] n=385 ec=2 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'3143 !hml active+clean] sending commit on repgather(0x281df5a0 applied 4'3145 rep_tid=256250 wfack= wfdisk= op=osd_op(mds0.1:334039 200.02a6 [write 707328~1196] 1.a9f0) v2) 0x30140700 2011-06-11 06:19:34.799410 7f19eea0e710 -- 192.168.0.209:6801/31773 192.168.0.208:6801/26514 pipe(0x1281280 sd=13 pgs=6 cs=7 l=0).fault with nothing to send, going to standby [root@ceph_osd0 ~]# tail /var/log/ceph/osd.0.log -f 2011-06-11 12:26:05.366366 7fbfe3ae8710 osd0 4 OSD::ms_handle_reset() s=0x3eebc60 2011-06-11 12:39:55.959237 7fbfe08e0710 -- 192.168.0.208:6801/26514 192.168.0.209:6801/31773 pipe(0x2de2280 sd=13 pgs=3 cs=5 l=0).fault with nothing to send, going to standby 2011-06-11 18:00:28.570888 7fbfdd7d2710 -- 192.168.0.208:6801/26514 192.168.0.209:6801/31773 pipe(0x2769780 sd=14 pgs=0 cs=0 l=0).accept connect_seq 6 vs existing 6 state 6 2011-06-11 18:00:29.140549 7fbfe7af0710 osd0 4 pg[1.14( v 4'4797 (4'4795,4'4797] n=416 ec=2 les/c 4/4 3/3/2) [0,1] r=0 mlcod 4'4795 !hml active+clean] sending commit on repgather(0x2107d4b0 applied 4'4797 rep_tid=315438 wfack= wfdisk= op=osd_op(mds0.1:334035 200. [writefull 0~84] 1.3494) v2) 0x1d347000 2011-06-11 18:01:23.652245 7fbfe07df710 osd0 4 pg[1.14( v 4'4798 (4'4796,4'4798] n=416 ec=2 les/c 4/4 3/3/2) [0,1] r=0 mlcod 4'4796 !hml active+clean] sending commit on repgather(0x2107d3c0 applied 4'4798 rep_tid=315439 wfack= wfdisk= op=osd_op(mds0.1:334038 200. [writefull 0~84] 1.3494) v2) 0x21659700 2011-06-11 18:01:55.254019 7fbfe08e0710 -- 192.168.0.208:6800/26514 192.168.0.210:0/605992965 pipe(0xcf9d000 sd=13 pgs=0 cs=0 l=0).accept peer addr is really 192.168.0.210:0/605992965 (socket is 192.168.0.210:39797/0) 2011-06-11 18:01:55.378638 7fbfe07df710 osd0 4 pg[0.60( v 4'1853 (4'1851,4'1853] n=1853 ec=2 les/c 4/4 3/3/2) [0,1] r=0 mlcod 4'1851 !hml active+clean] sending commit on repgather(0x2107d3c0 applied 4'1853 rep_tid=315440 wfack= wfdisk= op=osd_op(client4101.1:237118 10474b9. [write 0~2560 [1@-1]] 0.c60 snapc 1=[])) 0x1df74c40 2011-06-11 18:03:05.460675 7fbfe3ae8710 osd0 4 OSD::ms_handle_reset() 2011-06-11 18:03 And unable to carry out writing data, such as: time dd if=/dev/zero bs=5 count=512 of=file My operating system is redhat6.0 kernel 2.6.38.6 x86_64 4 a home computer If you can give me some Suggestions? thank you -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html