Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Jim Schutt

David Sterba wrote:

Hi,

a candidate fix:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=commit;h=aa0467d8d2a00e75b2bb6a56a4ee6d70c5d1928f


 With Linus' tree, today's linux-next build (powercp ppc64_defconfig)
 produced this warning:

 fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
 fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
 uninitialized in this function
 Introduced by commit 16cdcec736cd (btrfs: implement delayed inode items
 operation).

 This fixes a bug in btrfs_update_inode(): if the returned value from
 btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not
 updated and several call paths may hit a BUG_ON or fail with strange
 code.


if you can reproduce it reliably, add this patch on top of the delayed inodes.



I cherry-picked aa0467d8d2a00e on top of 16cdcec736cd21, which
gave me the following instead of a BUG:

[  246.986087] [ cut here ]
[  246.990714] WARNING: at mm/page_alloc.c:2032 
__alloc_pages_slowpath+0x54/0x3c5()
[  246.998100] Hardware name: PowerEdge 1950
[  247.002110] Modules linked in: btrfs zlib_deflate lzo_compress 
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp 
i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm rdma_cm 
ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror 
dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot 
battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom 
ib_mthca qla2xxx serio_raw ib_mad ib_core scsi_transport_fc button scsi_tgt 
ata_piix libata scsi_mod dcdbas i5k_amb tpm_tis tpm ioatdma hwmon tpm_bios 
ehci_hcd dca i5000_edac pcspkr iTCO_wdt uhci_hcd iTCO_vendor_support edac_core 
rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last unloaded: 
freq_table]
[  247.076305] Pid: 6044, comm: cosd Not tainted 2.6.39-2-gf47e9fd #25
[  247.082911] Call Trace:
[  247.085358]  [810ccf8a] ? __alloc_pages_slowpath+0x54/0x3c5
[  247.091793]  [81049379] ? warn_slowpath_common+0x85/0x9e
[  247.097974]  [810493ac] ? warn_slowpath_null+0x1a/0x1c
[  247.103974]  [810ccf8a] ? __alloc_pages_slowpath+0x54/0x3c5
[  247.110409]  [810ccae2] ? get_page_from_freelist+0x166/0x198
[  247.116928]  [810cd3c5] ? __alloc_pages_nodemask+0xca/0xf4
[  247.123297]  [a0702884] ? unmap_extent_buffer+0x11/0x13 [btrfs]
[  247.130079]  [810fd957] ? alloc_pages_current+0xa3/0xac
[  247.136166]  [810cc5f7] ? alloc_pages+0xe/0x10
[  247.141472]  [810cc607] ? __get_free_pages+0xe/0x4b
[  247.147218]  [811061eb] ? kmalloc_order_trace+0x27/0x55
[  247.153304]  [8110664e] ? __kmalloc+0x37/0x100
[  247.158625]  [a072622a] ? btrfs_batch_insert_items+0xe0/0x229 
[btrfs]
[  247.165933]  [a06d454b] ? btrfs_block_rsv_release+0x39/0x3b [btrfs]
[  247.173072]  [a07265e7] ? btrfs_insert_delayed_items+0xac/0xef 
[btrfs]
[  247.180472]  [a0726798] ? btrfs_run_delayed_items+0x68/0xd9 [btrfs]
[  247.187610]  [a06e85fc] ? btrfs_commit_transaction+0x266/0x5c9 
[btrfs]
[  247.195000]  [81066118] ? list_del_init+0x21/0x21
[  247.200583]  [a0710c4c] ? create_subvol+0x420/0x440 [btrfs]
[  247.207018]  [810362ec] ? need_resched+0x23/0x2d
[  247.212511]  [a0710d7a] ? btrfs_mksubvol+0x10e/0x167 [btrfs]
[  247.219043]  [a071129f] ? 
btrfs_ioctl_snap_create_transid+0x9c/0x121 [btrfs]
[  247.226962]  [a071145e] ? btrfs_ioctl_snap_create+0x50/0x67 [btrfs]
[  247.234101]  [a0712c5a] ? btrfs_ioctl+0x1d0/0x2c6 [btrfs]
[  247.240364]  [8111fe12] ? vfs_ioctl+0x1d/0x34
[  247.245582]  [8112048d] ? do_vfs_ioctl+0x171/0x17a
[  247.251242]  [811130a2] ? fget_light+0x69/0x81
[  247.256549]  [811204f2] ? sys_ioctl+0x5c/0x7c
[  247.261770]  [8111d48d] ? putname+0x33/0x37
[  247.266819]  [813b21eb] ? system_call_fastpath+0x16/0x1b
[  247.272993] ---[ end trace 9c75d74017f060f5 ]---

The mkcephfs command I was attempting succeeded, so I'm
not sure if the above matters; mm/page_alloc.c:2032 is

/*
 * In the slowpath, we sanity check order to avoid ever trying to
 * reclaim = MAX_ORDER areas which will never succeed. Callers may
 * be using allocators in order of preference for an area that is
 * too large.
 */
if (order = MAX_ORDER) {
WARN_ON_ONCE(!(gfp_mask  __GFP_NOWARN));
return NULL;
}

When I did my bisection, my criteria for success/failure was
did mkcephfs succeed?.  When I apply this criteria to a recent
linus kernel (e.g. 06e86849cf4019), which includes the fix you
mentioned (aa0467d8d2a00e), I get still a different failure mode,
which doesn't actually reference btrfs:

[  276.364178] 

Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Chris Mason
Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:

[ two different btrfs crashes ]

I think your two crashes in btrfs were from the uninit variables and
those should be fixed in rc2.

 When I did my bisection, my criteria for success/failure was
 did mkcephfs succeed?.  When I apply this criteria to a recent
 linus kernel (e.g. 06e86849cf4019), which includes the fix you
 mentioned (aa0467d8d2a00e), I get still a different failure mode,
 which doesn't actually reference btrfs:
 
 [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
 000a
 [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]

Looking at the resulting code in the oops, we're here in journal_start:

if (handle) {
J_ASSERT(handle-h_transaction-t_journal == journal);

handle comes from current-journal_info, and we're doing a deref on
handle-h_transaction, which is probably 0xa.

So, we're leaving crud in current-journal_info and ext3 is finding it.

Perhaps its from ceph starting a transaction but leaving it running?
The bug came with Josef's transaction performance fixes, but it is
probably a mixture of his code with the ioctls ceph is using.

[ rest of the oops below for context ]

-chris

 [  276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0
 [  276.365127] Oops:  [#1] SMP
 [  276.365127] CPU 2
 [  276.365127] Modules linked in: btrfs zlib_deflate lzo_compress 
 ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
 nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge 
 stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm 
 rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror 
 dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot 
 battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod cdrom 
 ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc scsi_tgt 
 dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon iTCO_wdt 
 scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core uhci_hcd 
 pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 e1000 [last 
 unloaded: freq_table]
 [  276.365127]
 [  276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 #26 
 Dell Inc. PowerEdge 1950/0DT097
 [  276.365127] RIP: 0010:[a05434b1]  [a05434b1] 
 journal_start+0x3e/0x9c [jbd]
 [  276.365127] RSP: 0018:8801e2897b28  EFLAGS: 00010286
 [  276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 
 0002
 [  276.365127] RDX: 19b2d000 RSI: 000e RDI: 
 000e
 [  276.365127] RBP: 8801e2897b48 R08: 0003 R09: 
 8801e2897c38
 [  276.365127] R10: 8801e2897ed8 R11: 0001 R12: 
 880223ff4400
 [  276.365127] R13: 880218522d60 R14: 0ec6 R15: 
 88021f54d878
 [  276.365127] FS:  7f8ff0bbb710() GS:88022fc8() 
 knlGS:
 [  276.365127] CS:  0010 DS:  ES:  CR0: 8005003b
 [  276.365127] CR2: 000a CR3: 00021744f000 CR4: 
 06e0
 [  276.365127] DR0:  DR1:  DR2: 
 
 [  276.365127] DR3:  DR6: 0ff0 DR7: 
 0400
 [  276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 
 880218522d60)
 [  276.365127] Stack:
 [  276.365127]  8801e2897b68 ea000756e788 88021f54d728 
 8801e2897c78
 [  276.365127]  8801e2897b58 a05670ce 8801e2897b68 
 a055c72d
 [  276.365127]  8801e2897be8 a055f044 8801e2897c38 
 0074
 [  276.365127] Call Trace:
 [  276.365127]  [a05670ce] ext3_journal_start_sb+0x4f/0x51 [ext3]
 [  276.365127]  [a055c72d] ext3_journal_start+0x12/0x14 [ext3]
 [  276.365127]  [a055f044] ext3_write_begin+0x93/0x1a1 [ext3]
 [  276.365127]  [810c6f0e] ? __kunmap_atomic+0xe/0x10
 [  276.365127]  [810c75e5] generic_perform_write+0xb1/0x172
 [  276.365127]  [81036a33] ? need_resched+0x23/0x2d
 [  276.365127]  [810c76ea] generic_file_buffered_write+0x44/0x6f
 [  276.365127]  [810c91f5] __generic_file_aio_write+0x253/0x2a8
 [  276.365127]  [810c92ad] generic_file_aio_write+0x63/0xb8
 [  276.365127]  [81113b26] do_sync_write+0xc7/0x10b
 [  276.365127]  [81036a4b] ? should_resched+0xe/0x2f
 [  276.365127]  [813b0faf] ? _cond_resched+0xe/0x22
 [  276.365127]  [811986c3] ? security_file_permission+0x2c/0x31
 [  276.365127]  [81113d21] ? rw_verify_area+0xac/0xdb
 [  276.365127]  [81114253] vfs_write+0xac/0xe4
 [  276.365127]  [8111434f] sys_write+0x4c/0x71
 [  276.365127]  [813b8beb] system_call_fastpath+0x16/0x1b
 [  276.365127] Code: 89 fc 48 c7 c3 e2 ff ff ff 89 f7 65 4c 8b 2c 25 c0 b5 00 
 00 4d 85 e4 49 8b 85 48 06 00 00 74 

Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Chris Mason wrote:
 Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
 
 [ two different btrfs crashes ]
 
 I think your two crashes in btrfs were from the uninit variables and
 those should be fixed in rc2.
 
  When I did my bisection, my criteria for success/failure was
  did mkcephfs succeed?.  When I apply this criteria to a recent
  linus kernel (e.g. 06e86849cf4019), which includes the fix you
  mentioned (aa0467d8d2a00e), I get still a different failure mode,
  which doesn't actually reference btrfs:
  
  [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
  000a
  [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
 
 Looking at the resulting code in the oops, we're here in journal_start:
 
 if (handle) {
 J_ASSERT(handle-h_transaction-t_journal == journal);
 
 handle comes from current-journal_info, and we're doing a deref on
 handle-h_transaction, which is probably 0xa.
 
 So, we're leaving crud in current-journal_info and ext3 is finding it.
 
 Perhaps its from ceph starting a transaction but leaving it running?
 The bug came with Josef's transaction performance fixes, but it is
 probably a mixture of his code with the ioctls ceph is using.

Ah, yeah, that's the problem.  We saw a similar problem a while back with 
the start/stop transaction ioctls.  In this case, create_snapshot is doing

trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
goto fail;
}

which sets current-journal_info.  Then

ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
BUG_ON(ret);

list_add(pending_snapshot-list,
 trans-transaction-pending_snapshots);
if (async_transid) {
*async_transid = trans-transid;
ret = btrfs_commit_transaction_async(trans,
 root-fs_info-extent_root, 1);
} else {
ret = btrfs_commit_transaction(trans,
   root-fs_info-extent_root);
}

but the async snap creation ioctl takes the async path, which runs 
btrfs_commit_transaction in a worker thread.

I'm not sure what the right thing to do is here is... can whatever is in 
journal_info be attached to trans instead in 
btrfs_commit_transaction_async()?

sage






 
 [ rest of the oops below for context ]
 
 -chris
 
  [  276.365127] PGD 1e4469067 PUD 1e1658067 PMD 0
  [  276.365127] Oops:  [#1] SMP
  [  276.365127] CPU 2
  [  276.365127] Modules linked in: btrfs zlib_deflate lzo_compress 
  ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
  nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge 
  stp i2c_dev i2c_core ext3 jbd scsi_transport_iscsi rds ib_ipoib rdma_ucm 
  rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror 
  dm_region_hash dm_log dm_multipath scsi_dh dm_mod video sbs sbshc pci_slot 
  battery acpi_pad ac kvm sg ses sd_mod enclosure megaraid_sas ide_cd_mod 
  cdrom ib_mthca ib_mad qla2xxx button ib_core serio_raw scsi_transport_fc 
  scsi_tgt dcdbas ata_piix libata tpm_tis tpm i5k_amb ioatdma tpm_bios hwmon 
  iTCO_wdt scsi_mod i5000_edac iTCO_vendor_support ehci_hcd dca edac_core 
  uhci_hcd pcspkr rtc nfs nfs_acl auth_rpcgss fscache lockd sunrpc tg3 bnx2 
  e1000 [last unloaded: freq_table]
  [  276.365127]
  [  276.365127] Pid: 6076, comm: cosd Not tainted 3.0.0-rc2-00196-g06e8684 
  #26 Dell Inc. PowerEdge 1950/0DT097
  [  276.365127] RIP: 0010:[a05434b1]  [a05434b1] 
  journal_start+0x3e/0x9c [jbd]
  [  276.365127] RSP: 0018:8801e2897b28  EFLAGS: 00010286
  [  276.365127] RAX: 000a RBX: 8801de8e1090 RCX: 
  0002
  [  276.365127] RDX: 19b2d000 RSI: 000e RDI: 
  000e
  [  276.365127] RBP: 8801e2897b48 R08: 0003 R09: 
  8801e2897c38
  [  276.365127] R10: 8801e2897ed8 R11: 0001 R12: 
  880223ff4400
  [  276.365127] R13: 880218522d60 R14: 0ec6 R15: 
  88021f54d878
  [  276.365127] FS:  7f8ff0bbb710() GS:88022fc8() 
  knlGS:
  [  276.365127] CS:  0010 DS:  ES:  CR0: 8005003b
  [  276.365127] CR2: 000a CR3: 00021744f000 CR4: 
  06e0
  [  276.365127] DR0:  DR1:  DR2: 
  
  [  276.365127] DR3:  DR6: 0ff0 DR7: 
  0400
  [  276.365127] Process cosd (pid: 6076, threadinfo 8801e2896000, task 
  880218522d60)
  [  276.365127] Stack:
  [  276.365127]  8801e2897b68 ea000756e788 88021f54d728 
  8801e2897c78
  [  276.365127]  8801e2897b58 a05670ce 8801e2897b68 
  a055c72d
  [  276.365127]  8801e2897be8 a055f044 

Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Sage Weil wrote:
 On Fri, 10 Jun 2011, Chris Mason wrote:
  Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
  
  [ two different btrfs crashes ]
  
  I think your two crashes in btrfs were from the uninit variables and
  those should be fixed in rc2.
  
   When I did my bisection, my criteria for success/failure was
   did mkcephfs succeed?.  When I apply this criteria to a recent
   linus kernel (e.g. 06e86849cf4019), which includes the fix you
   mentioned (aa0467d8d2a00e), I get still a different failure mode,
   which doesn't actually reference btrfs:
   
   [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
   000a
   [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
  
  Looking at the resulting code in the oops, we're here in journal_start:
  
  if (handle) {
  J_ASSERT(handle-h_transaction-t_journal == journal);
  
  handle comes from current-journal_info, and we're doing a deref on
  handle-h_transaction, which is probably 0xa.
  
  So, we're leaving crud in current-journal_info and ext3 is finding it.
  
  Perhaps its from ceph starting a transaction but leaving it running?
  The bug came with Josef's transaction performance fixes, but it is
  probably a mixture of his code with the ioctls ceph is using.
 
 Ah, yeah, that's the problem.  We saw a similar problem a while back with 
 the start/stop transaction ioctls.  In this case, create_snapshot is doing
 
   trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
   if (IS_ERR(trans)) {
   ret = PTR_ERR(trans);
   goto fail;
   }
 
 which sets current-journal_info.  Then
 
   ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
   BUG_ON(ret);
 
   list_add(pending_snapshot-list,
trans-transaction-pending_snapshots);
   if (async_transid) {
   *async_transid = trans-transid;
   ret = btrfs_commit_transaction_async(trans,
root-fs_info-extent_root, 1);
   } else {
   ret = btrfs_commit_transaction(trans,
  root-fs_info-extent_root);
   }
 
 but the async snap creation ioctl takes the async path, which runs 
 btrfs_commit_transaction in a worker thread.
 
 I'm not sure what the right thing to do is here is... can whatever is in 
 journal_info be attached to trans instead in 
 btrfs_commit_transaction_async()?

It looks like it's not used for anything in btrfs, actually; it's just set 
and cleared.  What's the point of that?

Anyway, assuming it's useful, I think the below would fix the problem.  
Want to give it a shot, Jim?

Thanks!
sage


diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index c571734..fd04ad7 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1196,6 +1196,9 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
put_transaction(cur_trans);
mutex_unlock(root-fs_info-trans_mutex);
 
+   if (current-journal_info == trans)
+   current-journal_info = NULL;
+
return 0;
 }
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Josef Bacik
On 06/10/2011 02:14 PM, Sage Weil wrote:
 On Fri, 10 Jun 2011, Sage Weil wrote:
 On Fri, 10 Jun 2011, Chris Mason wrote:
 Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:

 [ two different btrfs crashes ]

 I think your two crashes in btrfs were from the uninit variables and
 those should be fixed in rc2.

 When I did my bisection, my criteria for success/failure was
 did mkcephfs succeed?.  When I apply this criteria to a recent
 linus kernel (e.g. 06e86849cf4019), which includes the fix you
 mentioned (aa0467d8d2a00e), I get still a different failure mode,
 which doesn't actually reference btrfs:

 [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
 000a
 [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]

 Looking at the resulting code in the oops, we're here in journal_start:

 if (handle) {
 J_ASSERT(handle-h_transaction-t_journal == journal);

 handle comes from current-journal_info, and we're doing a deref on
 handle-h_transaction, which is probably 0xa.

 So, we're leaving crud in current-journal_info and ext3 is finding it.

 Perhaps its from ceph starting a transaction but leaving it running?
 The bug came with Josef's transaction performance fixes, but it is
 probably a mixture of his code with the ioctls ceph is using.

 Ah, yeah, that's the problem.  We saw a similar problem a while back with 
 the start/stop transaction ioctls.  In this case, create_snapshot is doing

  trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
  if (IS_ERR(trans)) {
  ret = PTR_ERR(trans);
  goto fail;
  }

 which sets current-journal_info.  Then

  ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
  BUG_ON(ret);

  list_add(pending_snapshot-list,
   trans-transaction-pending_snapshots);
  if (async_transid) {
  *async_transid = trans-transid;
  ret = btrfs_commit_transaction_async(trans,
   root-fs_info-extent_root, 1);
  } else {
  ret = btrfs_commit_transaction(trans,
 root-fs_info-extent_root);
  }

 but the async snap creation ioctl takes the async path, which runs 
 btrfs_commit_transaction in a worker thread.

 I'm not sure what the right thing to do is here is... can whatever is in 
 journal_info be attached to trans instead in 
 btrfs_commit_transaction_async()?
 
 It looks like it's not used for anything in btrfs, actually; it's just set 
 and cleared.  What's the point of that?
 

It is used now, check the beginning of start_transaction().  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Josef Bacik wrote:
 On 06/10/2011 02:14 PM, Sage Weil wrote:
  On Fri, 10 Jun 2011, Sage Weil wrote:
  On Fri, 10 Jun 2011, Chris Mason wrote:
  Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
 
  [ two different btrfs crashes ]
 
  I think your two crashes in btrfs were from the uninit variables and
  those should be fixed in rc2.
 
  When I did my bisection, my criteria for success/failure was
  did mkcephfs succeed?.  When I apply this criteria to a recent
  linus kernel (e.g. 06e86849cf4019), which includes the fix you
  mentioned (aa0467d8d2a00e), I get still a different failure mode,
  which doesn't actually reference btrfs:
 
  [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
  000a
  [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
 
  Looking at the resulting code in the oops, we're here in journal_start:
 
  if (handle) {
  J_ASSERT(handle-h_transaction-t_journal == journal);
 
  handle comes from current-journal_info, and we're doing a deref on
  handle-h_transaction, which is probably 0xa.
 
  So, we're leaving crud in current-journal_info and ext3 is finding it.
 
  Perhaps its from ceph starting a transaction but leaving it running?
  The bug came with Josef's transaction performance fixes, but it is
  probably a mixture of his code with the ioctls ceph is using.
 
  Ah, yeah, that's the problem.  We saw a similar problem a while back with 
  the start/stop transaction ioctls.  In this case, create_snapshot is doing
 
 trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
 if (IS_ERR(trans)) {
 ret = PTR_ERR(trans);
 goto fail;
 }
 
  which sets current-journal_info.  Then
 
 ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
 BUG_ON(ret);
 
 list_add(pending_snapshot-list,
  trans-transaction-pending_snapshots);
 if (async_transid) {
 *async_transid = trans-transid;
 ret = btrfs_commit_transaction_async(trans,
  root-fs_info-extent_root, 1);
 } else {
 ret = btrfs_commit_transaction(trans,
root-fs_info-extent_root);
 }
 
  but the async snap creation ioctl takes the async path, which runs 
  btrfs_commit_transaction in a worker thread.
 
  I'm not sure what the right thing to do is here is... can whatever is in 
  journal_info be attached to trans instead in 
  btrfs_commit_transaction_async()?
  
  It looks like it's not used for anything in btrfs, actually; it's just set 
  and cleared.  What's the point of that?
  
 
 It is used now, check the beginning of start_transaction().  Thanks,

Oh I see, okay.

So clearing it in btrfs_commit_transaction_async should be fine then, 
right?  When btrfs_commit_transaction runs in the other thread it won't 
care that current-journal_info is NULL.

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Josef Bacik
On 06/10/2011 02:35 PM, Sage Weil wrote:
 On Fri, 10 Jun 2011, Josef Bacik wrote:
 On 06/10/2011 02:14 PM, Sage Weil wrote:
 On Fri, 10 Jun 2011, Sage Weil wrote:
 On Fri, 10 Jun 2011, Chris Mason wrote:
 Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:

 [ two different btrfs crashes ]

 I think your two crashes in btrfs were from the uninit variables and
 those should be fixed in rc2.

 When I did my bisection, my criteria for success/failure was
 did mkcephfs succeed?.  When I apply this criteria to a recent
 linus kernel (e.g. 06e86849cf4019), which includes the fix you
 mentioned (aa0467d8d2a00e), I get still a different failure mode,
 which doesn't actually reference btrfs:

 [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
 000a
 [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]

 Looking at the resulting code in the oops, we're here in journal_start:

 if (handle) {
 J_ASSERT(handle-h_transaction-t_journal == journal);

 handle comes from current-journal_info, and we're doing a deref on
 handle-h_transaction, which is probably 0xa.

 So, we're leaving crud in current-journal_info and ext3 is finding it.

 Perhaps its from ceph starting a transaction but leaving it running?
 The bug came with Josef's transaction performance fixes, but it is
 probably a mixture of his code with the ioctls ceph is using.

 Ah, yeah, that's the problem.  We saw a similar problem a while back with 
 the start/stop transaction ioctls.  In this case, create_snapshot is doing

trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
goto fail;
}

 which sets current-journal_info.  Then

ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
BUG_ON(ret);

list_add(pending_snapshot-list,
 trans-transaction-pending_snapshots);
if (async_transid) {
*async_transid = trans-transid;
ret = btrfs_commit_transaction_async(trans,
 root-fs_info-extent_root, 1);
} else {
ret = btrfs_commit_transaction(trans,
   root-fs_info-extent_root);
}

 but the async snap creation ioctl takes the async path, which runs 
 btrfs_commit_transaction in a worker thread.

 I'm not sure what the right thing to do is here is... can whatever is in 
 journal_info be attached to trans instead in 
 btrfs_commit_transaction_async()?

 It looks like it's not used for anything in btrfs, actually; it's just set 
 and cleared.  What's the point of that?


 It is used now, check the beginning of start_transaction().  Thanks,
 
 Oh I see, okay.
 
 So clearing it in btrfs_commit_transaction_async should be fine then, 
 right?  When btrfs_commit_transaction runs in the other thread it won't 
 care that current-journal_info is NULL.
 

Oh yeah your patch is good :),

Josef
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Sage Weil
On Fri, 10 Jun 2011, Josef Bacik wrote:
 On 06/10/2011 02:35 PM, Sage Weil wrote:
  On Fri, 10 Jun 2011, Josef Bacik wrote:
  On 06/10/2011 02:14 PM, Sage Weil wrote:
  On Fri, 10 Jun 2011, Sage Weil wrote:
  On Fri, 10 Jun 2011, Chris Mason wrote:
  Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
 
  [ two different btrfs crashes ]
 
  I think your two crashes in btrfs were from the uninit variables and
  those should be fixed in rc2.
 
  When I did my bisection, my criteria for success/failure was
  did mkcephfs succeed?.  When I apply this criteria to a recent
  linus kernel (e.g. 06e86849cf4019), which includes the fix you
  mentioned (aa0467d8d2a00e), I get still a different failure mode,
  which doesn't actually reference btrfs:
 
  [  276.364178] BUG: unable to handle kernel NULL pointer dereference 
  at 000a
  [  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]
 
  Looking at the resulting code in the oops, we're here in journal_start:
 
  if (handle) {
  J_ASSERT(handle-h_transaction-t_journal == journal);
 
  handle comes from current-journal_info, and we're doing a deref on
  handle-h_transaction, which is probably 0xa.
 
  So, we're leaving crud in current-journal_info and ext3 is finding it.
 
  Perhaps its from ceph starting a transaction but leaving it running?
  The bug came with Josef's transaction performance fixes, but it is
  probably a mixture of his code with the ioctls ceph is using.
 
  Ah, yeah, that's the problem.  We saw a similar problem a while back 
  with 
  the start/stop transaction ioctls.  In this case, create_snapshot is 
  doing
 
   trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
   if (IS_ERR(trans)) {
   ret = PTR_ERR(trans);
   goto fail;
   }
 
  which sets current-journal_info.  Then
 
   ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
   BUG_ON(ret);
 
   list_add(pending_snapshot-list,
trans-transaction-pending_snapshots);
   if (async_transid) {
   *async_transid = trans-transid;
   ret = btrfs_commit_transaction_async(trans,
root-fs_info-extent_root, 1);
   } else {
   ret = btrfs_commit_transaction(trans,
  root-fs_info-extent_root);
   }
 
  but the async snap creation ioctl takes the async path, which runs 
  btrfs_commit_transaction in a worker thread.
 
  I'm not sure what the right thing to do is here is... can whatever is in 
  journal_info be attached to trans instead in 
  btrfs_commit_transaction_async()?
 
  It looks like it's not used for anything in btrfs, actually; it's just 
  set 
  and cleared.  What's the point of that?
 
 
  It is used now, check the beginning of start_transaction().  Thanks,
  
  Oh I see, okay.
  
  So clearing it in btrfs_commit_transaction_async should be fine then, 
  right?  When btrfs_commit_transaction runs in the other thread it won't 
  care that current-journal_info is NULL.
  
 
 Oh yeah your patch is good :),

Okay cool.  Here's the fix with a proper changelog and a little 
use-after-free paranoia.

Thanks!
sage


From 9881c0752293769d5133c01dff3ab04c0c24c61b Mon Sep 17 00:00:00 2001
From: Sage Weil s...@newdream.net
Date: Fri, 10 Jun 2011 11:41:25 -0700
Subject: [PATCH] Btrfs: clear current-journal_info on async transaction commit

Normally current-jouranl_info is cleared by commit_transaction.  For an
async snap or subvol creation, though, it runs in a work queue.  Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace.  When the actual commit runs in
the other thread it won't care that it's current-journal_info is already
NULL.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/transaction.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index dd71966..9d516ed 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1118,8 +1118,11 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
wait_current_trans_commit_start_and_unblock(root, cur_trans);
else
wait_current_trans_commit_start(root, cur_trans);
-   put_transaction(cur_trans);
 
+   if (current-journal_info == trans)
+   current-journal_info = NULL;
+
+   put_transaction(cur_trans);
return 0;
 }
 
-- 
1.7.0


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


MDS configuration

2011-06-10 Thread Fyodor Ustinov

Hi!

Which configuration would you recommended for cluster with 50-80 million 
files?


WBR,
Fyodor.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Jim Schutt

Sage Weil wrote:

On Fri, 10 Jun 2011, Josef Bacik wrote:

On 06/10/2011 02:35 PM, Sage Weil wrote:

On Fri, 10 Jun 2011, Josef Bacik wrote:

On 06/10/2011 02:14 PM, Sage Weil wrote:

On Fri, 10 Jun 2011, Sage Weil wrote:

On Fri, 10 Jun 2011, Chris Mason wrote:

Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:

[ two different btrfs crashes ]

I think your two crashes in btrfs were from the uninit variables and
those should be fixed in rc2.


When I did my bisection, my criteria for success/failure was
did mkcephfs succeed?.  When I apply this criteria to a recent
linus kernel (e.g. 06e86849cf4019), which includes the fix you
mentioned (aa0467d8d2a00e), I get still a different failure mode,
which doesn't actually reference btrfs:

[  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
000a
[  276.365127] IP: [a05434b1] journal_start+0x3e/0x9c [jbd]

Looking at the resulting code in the oops, we're here in journal_start:

if (handle) {
J_ASSERT(handle-h_transaction-t_journal == journal);

handle comes from current-journal_info, and we're doing a deref on
handle-h_transaction, which is probably 0xa.

So, we're leaving crud in current-journal_info and ext3 is finding it.

Perhaps its from ceph starting a transaction but leaving it running?
The bug came with Josef's transaction performance fixes, but it is
probably a mixture of his code with the ioctls ceph is using.
Ah, yeah, that's the problem.  We saw a similar problem a while back with 
the start/stop transaction ioctls.  In this case, create_snapshot is doing


trans = btrfs_start_transaction(root-fs_info-extent_root, 5);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
goto fail;
}

which sets current-journal_info.  Then

ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
BUG_ON(ret);

list_add(pending_snapshot-list,
 trans-transaction-pending_snapshots);
if (async_transid) {
*async_transid = trans-transid;
ret = btrfs_commit_transaction_async(trans,
 root-fs_info-extent_root, 1);
} else {
ret = btrfs_commit_transaction(trans,
   root-fs_info-extent_root);
}

but the async snap creation ioctl takes the async path, which runs 
btrfs_commit_transaction in a worker thread.


I'm not sure what the right thing to do is here is... can whatever is in 
journal_info be attached to trans instead in 
btrfs_commit_transaction_async()?
It looks like it's not used for anything in btrfs, actually; it's just set 
and cleared.  What's the point of that?



It is used now, check the beginning of start_transaction().  Thanks,

Oh I see, okay.

So clearing it in btrfs_commit_transaction_async should be fine then, 
right?  When btrfs_commit_transaction runs in the other thread it won't 
care that current-journal_info is NULL.



Oh yeah your patch is good :),


Okay cool.  Here's the fix with a proper changelog and a little 
use-after-free paranoia.


This patch solves my issue - thanks a lot.

Tested-by: Jim Schutt jasc...@sandia.gov

-- Jim



Thanks!
sage



From 9881c0752293769d5133c01dff3ab04c0c24c61b Mon Sep 17 00:00:00 2001

From: Sage Weil s...@newdream.net
Date: Fri, 10 Jun 2011 11:41:25 -0700
Subject: [PATCH] Btrfs: clear current-journal_info on async transaction commit

Normally current-jouranl_info is cleared by commit_transaction.  For an
async snap or subvol creation, though, it runs in a work queue.  Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace.  When the actual commit runs in
the other thread it won't care that it's current-journal_info is already
NULL.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/transaction.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index dd71966..9d516ed 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1118,8 +1118,11 @@ int btrfs_commit_transaction_async(struct 
btrfs_trans_handle *trans,
wait_current_trans_commit_start_and_unblock(root, cur_trans);
else
wait_current_trans_commit_start(root, cur_trans);
-   put_transaction(cur_trans);
 
+	if (current-journal_info == trans)

+   current-journal_info = NULL;
+
+   put_transaction(cur_trans);
return 0;
 }
 



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ceph to copy files. 500000 journal throttle: waited for ops and fault with nothing to send

2011-06-10 Thread biyan chen
hi,all

When I copy a contains about 50 files folder to ceph file system!
All files size is under 2 M !

In the executive to half of the time

I found that the error log :

[root@ceph_osd1 ~]# tail /var/log/ceph/osd.1.log -f
2011-06-11 00:26:56.654530 7f19f221c710 journal throttle: waited for ops
2011-06-11 00:26:56.845956 7f19f2a1d710 journal throttle: waited for ops
2011-06-11 00:27:22.378345 7f19f2a1d710 journal throttle: waited for ops
2011-06-11 00:27:22.561483 7f19f221c710 journal throttle: waited for ops
2011-06-11 00:42:35.491143 7f19ef216710 -- 192.168.0.209:6801/31773 
192.168.0.208:6801/26514 pipe(0x1281280 sd=13 pgs=4 cs=5 l=0).fault
with nothing to send, going to standby
2011-06-11 06:03:07.834786 7f19ef216710 -- 192.168.0.209:6801/31773 
192.168.0.208:6801/26514 pipe(0x4c20780 sd=16 pgs=0 cs=0 l=0).accept
connect_seq 6 vs existing 6 state 1
2011-06-11 06:03:08.513372 7f19f2a1d710 osd1 4 pg[1.70( v 4'3143
(4'3141,4'3143] n=385 ec=2 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'3141
!hml active+clean]  sending commit on repgather(0x281df4b0 applied
4'3143 rep_tid=256248 wfack= wfdisk= op=osd_op(mds0.1:334034
200.02a6 [write 702932~3164] 1.a9f0) v2) 0x30140540
2011-06-11 06:04:02.963002 7f19f221c710 osd1 4 pg[1.70( v 4'3144
(4'3142,4'3144] n=385 ec=2 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'3142
!hml active+clean]  sending commit on repgather(0x281df4b0 applied
4'3144 rep_tid=256249 wfack= wfdisk= op=osd_op(mds0.1:334037
200.02a6 [write 706096~1232] 1.a9f0) v2) 0x30140380
2011-06-11 06:04:09.489288 7f19f2a1d710 osd1 4 pg[1.70( v 4'3145
(4'3143,4'3145] n=385 ec=2 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'3143
!hml active+clean]  sending commit on repgather(0x281df5a0 applied
4'3145 rep_tid=256250 wfack= wfdisk= op=osd_op(mds0.1:334039
200.02a6 [write 707328~1196] 1.a9f0) v2) 0x30140700
2011-06-11 06:19:34.799410 7f19eea0e710 -- 192.168.0.209:6801/31773 
192.168.0.208:6801/26514 pipe(0x1281280 sd=13 pgs=6 cs=7 l=0).fault
with nothing to send, going to standby

[root@ceph_osd0 ~]# tail /var/log/ceph/osd.0.log -f
2011-06-11 12:26:05.366366 7fbfe3ae8710 osd0 4 OSD::ms_handle_reset()
s=0x3eebc60
2011-06-11 12:39:55.959237 7fbfe08e0710 -- 192.168.0.208:6801/26514 
192.168.0.209:6801/31773 pipe(0x2de2280 sd=13 pgs=3 cs=5 l=0).fault
with nothing to send, going to standby
2011-06-11 18:00:28.570888 7fbfdd7d2710 -- 192.168.0.208:6801/26514 
192.168.0.209:6801/31773 pipe(0x2769780 sd=14 pgs=0 cs=0 l=0).accept
connect_seq 6 vs existing 6 state 6
2011-06-11 18:00:29.140549 7fbfe7af0710 osd0 4 pg[1.14( v 4'4797
(4'4795,4'4797] n=416 ec=2 les/c 4/4 3/3/2) [0,1] r=0 mlcod 4'4795
!hml active+clean]  sending commit on repgather(0x2107d4b0 applied
4'4797 rep_tid=315438 wfack= wfdisk= op=osd_op(mds0.1:334035
200. [writefull 0~84] 1.3494) v2) 0x1d347000
2011-06-11 18:01:23.652245 7fbfe07df710 osd0 4 pg[1.14( v 4'4798
(4'4796,4'4798] n=416 ec=2 les/c 4/4 3/3/2) [0,1] r=0 mlcod 4'4796
!hml active+clean]  sending commit on repgather(0x2107d3c0 applied
4'4798 rep_tid=315439 wfack= wfdisk= op=osd_op(mds0.1:334038
200. [writefull 0~84] 1.3494) v2) 0x21659700
2011-06-11 18:01:55.254019 7fbfe08e0710 -- 192.168.0.208:6800/26514 
192.168.0.210:0/605992965 pipe(0xcf9d000 sd=13 pgs=0 cs=0 l=0).accept
peer addr is really 192.168.0.210:0/605992965 (socket is
192.168.0.210:39797/0)
2011-06-11 18:01:55.378638 7fbfe07df710 osd0 4 pg[0.60( v 4'1853
(4'1851,4'1853] n=1853 ec=2 les/c 4/4 3/3/2) [0,1] r=0 mlcod 4'1851
!hml active+clean]  sending commit on repgather(0x2107d3c0 applied
4'1853 rep_tid=315440 wfack= wfdisk= op=osd_op(client4101.1:237118
10474b9. [write 0~2560 [1@-1]] 0.c60 snapc 1=[]))
0x1df74c40
2011-06-11 18:03:05.460675 7fbfe3ae8710 osd0 4 OSD::ms_handle_reset()
2011-06-11 18:03

And unable to carry out writing data, such as: time dd if=/dev/zero
bs=5 count=512 of=file


My operating system is redhat6.0 kernel 2.6.38.6 x86_64
4 a home computer


If you can give me some Suggestions? thank you
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html