Adventures in btrfs raid5 disk recovery

2016-06-19 Thread Zygo Blaxell
Not so long ago, I had a disk fail in a btrfs filesystem with raid1
metadata and raid5 data.  I mounted the filesystem readonly, replaced
the failing disk, and attempted to recover by adding the new disk and
deleting the missing disk.

It's not going well so far.  Pay attention, there are at least four
separate problems in here and we're not even half done yet.

I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
also pulled in some patches from the 4.7-rc series.

This fixed a few problems I encountered early on, and I'm still making
forward progress, but I've only replaced 50% of the failed disk so far,
and this is week four of this particular project.
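
For reference, assembling such a kernel looks roughly like this; the
branch name and commit ids below are placeholders, not my actual pick
list:

    git checkout -b btrfs-recovery v4.6.2
    # commit ids are placeholders for the actual fixes
    git cherry-pick FIX1 FIX2 FIX3    # btrfs fixes in 4.5.7 but not 4.6.2
    git cherry-pick RC1 RC2           # selected patches from 4.7-rc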

What worked:

'mount -odegraded,...' successfully mounts the filesystem RW.  
'btrfs device add' adds the new disk.  Success!

The first thing I did was balance the metadata onto non-missing disks.
That went well.  Now there are only data chunks to recover from the
missing disk.  Success!
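
In shell terms, the sequence so far was roughly this (device names and
mount point are stand-ins, not my actual layout):

    mount -o degraded /dev/vdb /mnt   # RW mount with one device missing
    btrfs device add /dev/vde /mnt    # add the replacement disk
    btrfs balance start -m /mnt       # rewrite metadata chunks onto present disks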

The normal 'device delete' operation got about 25% of the way in,
then got stuck on some corrupted sectors, aborting with EIO.
That ends the success, but I've had similar problems with raid5 arrays
before and been able to solve them.

I've managed to remove about half of the data from the missing disk
so far.  'balance start -ddevid=,drange=0..1000'
(with increasing values for drange) is able to move data off the failed
disk while avoiding the damaged regions.  It looks like this process
could reduce the amount of data on "missing" devices to a manageable
level; then I could identify the offending corrupted extents with
'btrfs scrub', remove the files containing them, and finish the device
delete operation.  Hope!
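
A sketch of that balance loop, assuming a hypothetical devid of 3 and
100 GiB steps (drange takes byte offsets, and the real values depend on
the chunk layout):

    #!/bin/bash
    mnt=/mnt
    devid=3                   # hypothetical id of the failed device
    size=$((6 * 10**12))      # ~6 TB device
    step=$((100 * 2**30))     # 100 GiB per pass
    for ((start = 0; start < size; start += step)); do
        btrfs balance start -ddevid=$devid,drange=$start..$((start + step)) "$mnt" ||
            echo "EIO near byte offset $start, skipping this range for now"
    done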

What doesn't work:

The first problem is that the kernel keeps crashing.  I put the
filesystem and all its disks in a KVM so the crashes are less
disruptive and I can debug them (or at least collect panic logs).
OK, now crashes are merely a performance problem.
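
The VM setup is roughly this (host disk paths are hypothetical; inside
the guest the disks show up as vdb, vdc, and so on, and a panic costs a
guest reboot instead of a hung host):

    qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
        -drive file=boot.img,format=qcow2,if=virtio \
        -drive file=/dev/disk/by-id/scsi-disk1,format=raw,if=virtio,cache=none \
        -drive file=/dev/disk/by-id/scsi-disk2,format=raw,if=virtio,cache=none \
        -drive file=/dev/disk/by-id/scsi-disk3,format=raw,if=virtio,cache=none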

Why did I mention 'btrfs scrub' above?  Because 'btrfs scrub' tells me
where corrupted blocks are.  'device delete' fills my kernel logs with
lines like this:

[26054.744158] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[26809.746993] BTRFS warning (device vdc): csum failed ino 404 off 6021976064 csum 778377694 expected csum 2827380172
[26809.747029] BTRFS warning (device vdc): csum failed ino 404 off 6021980160 csum 3776938678 expected csum 514150079
[26809.747077] BTRFS warning (device vdc): csum failed ino 404 off 6021984256 csum 470593400 expected csum 642831408
[26809.747093] BTRFS warning (device vdc): csum failed ino 404 off 6021988352 csum 796755777 expected csum 690854341
[26809.747108] BTRFS warning (device vdc): csum failed ino 404 off 6021992448 csum 4115095129 expected csum 249712906
[26809.747122] BTRFS warning (device vdc): csum failed ino 404 off 6021996544 csum 2337431338 expected csum 1869250975
[26809.747138] BTRFS warning (device vdc): csum failed ino 404 off 6022000640 csum 3543852608 expected csum 1929026437
[26809.747154] BTRFS warning (device vdc): csum failed ino 404 off 6022004736 csum 3417780495 expected csum 3698318115
[26809.747169] BTRFS warning (device vdc): csum failed ino 404 off 6022008832 csum 3423877520 expected csum 2981727596
[26809.747183] BTRFS warning (device vdc): csum failed ino 404 off 6022012928 csum 550838742 expected csum 1005563554
[26896.379773] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[27791.128098] __readpage_endio_check: 7 callbacks suppressed
[27791.236794] BTRFS warning (device vdc): csum failed ino 405 off 6021980160 csum 3776938678 expected csum 514150079
[27791.236799] BTRFS warning (device vdc): csum failed ino 405 off 6021971968 csum 3304844252 expected csum 4171523312
[27791.236821] BTRFS warning (device vdc): csum failed ino 405 off 6021984256 csum 470593400 expected csum 642831408
[27791.236825] BTRFS warning (device vdc): csum failed ino 405 off 6021988352 csum 796755777 expected csum 690854341
[27791.236842] BTRFS warning (device vdc): csum failed ino 405 off 6021992448 csum 4115095129 expected csum 249712906
[27791.236847] BTRFS warning (device vdc): csum failed ino 405 off 6021996544 csum 2337431338 expected csum 1869250975
[27791.236857] BTRFS warning (device vdc): csum failed ino 405 off 6022004736 csum 3417780495 expected csum 3698318115
[27791.236864] BTRFS warning (device vdc): csum failed ino 405 off 6022000640 csum 3543852608 expected csum 1929026437
[27791.236874] BTRFS warning (device vdc): csum failed ino 405 off 6022008832 csum 3423877520 expected csum 2981727596
[27791.236978] BTRFS warning (device vdc): csum failed ino 405 off 6021976064 csum 778377694 expected
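
Scrub, by contrast, attributes errors to inodes and usually file paths.
The workflow I'm aiming for looks roughly like this (mount point is
hypothetical, and for balance-time messages like the above the inode may
be an internal relocation inode rather than a resolvable file):

    btrfs scrub start -Bd /mnt     # -B: run in foreground, -d: per-device stats
    dmesg | grep 'checksum error'  # scrub errors carry root/inode/offset
                                   # and, for regular files, a path
    btrfs inspect-internal inode-resolve 404 /mnt   # map an inode to path(s)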

[PATCH] btrfs: fix WARNING in btrfs_select_ref_head()

2016-06-19 Thread Wang Xiaoguang
This issue was found when testing in-band dedupe enospc behaviour:
sometimes run_one_delayed_ref() may fail for enospc reasons, and
__btrfs_run_delayed_refs() will then return without adding
num_heads_ready back, which triggers the
"WARN_ON(delayed_refs->num_heads_ready == 0)" in
btrfs_select_ref_head().

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6146729..eeedff3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2665,7 +2665,10 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
 		btrfs_free_delayed_extent_op(extent_op);
 		if (ret) {
+			spin_lock(&delayed_refs->lock);
 			locked_ref->processing = 0;
+			delayed_refs->num_heads_ready++;
+			spin_unlock(&delayed_refs->lock);
 			btrfs_delayed_ref_unlock(locked_ref);
 			btrfs_put_delayed_ref(ref);
 			btrfs_debug(fs_info, "run_one_delayed_ref returned %d", ret);
-- 
1.8.3.1





[PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-19 Thread Deepa Dinamani
The series is aimed at getting rid of the CURRENT_TIME and
CURRENT_TIME_SEC macros.  The macros are not y2038 safe, and there is
no plan to transition them into being y2038 safe.  The ktime_get_*
APIs can be used in their place, and these are y2038 safe.

Thanks to Arnd Bergmann for all the guidance and discussions.

Patches 2-4 were mostly generated using coccinelle scripts.
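
(A tree-wide conversion like this is typically driven with spatch; a
hypothetical invocation, with a made-up script name:)

    spatch --sp-file current_time.cocci --in-place --dir fs/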

All filesystem timestamps use current_fs_time() for the right
granularity, as mentioned in the respective commit texts of the
patches.  This function has a changed signature: it has been renamed to
current_time() and moved to fs/inode.c.

This series also serves as a preparatory series to transition vfs to 64 bit
timestamps as outlined here: https://lkml.org/lkml/2016/2/12/104 .

As per Linus's suggestion in https://lkml.org/lkml/2016/5/24/663 , all
the inode timestamp changes have been squashed into a single patch.
Also, current_time() is now used as the single generic vfs filesystem
timestamp api.  It takes struct inode* as its argument instead of
struct super_block*.  All the patches are posted together in a bigger
series so that the big picture is clear.

As per the suggestion in https://lwn.net/Articles/672598/, the
CURRENT_TIME macro bug fixes are being handled in a series separate
from the vfs transition to 64-bit timestamps.

Changes from v1:
* Change current_fs_time(struct super_block *) to current_time(struct inode *)

* Note that the change to add time64_to_tm() is already part of John's
  kernel tree: https://lkml.org/lkml/2016/6/17/875 .

Deepa Dinamani (24):
  vfs: Add current_time() api
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace current_fs_time() with current_time()
  fs: jfs: Replace CURRENT_TIME_SEC by current_time()
  fs: ext4: Use current_time() for inode timestamps
  fs: ubifs: Replace CURRENT_TIME_SEC with current_time
  fs: btrfs: Use ktime_get_real_ts for root ctime
  fs: udf: Replace CURRENT_TIME with current_time()
  fs: cifs: Replace CURRENT_TIME by current_time()
  fs: cifs: Replace CURRENT_TIME with ktime_get_real_ts()
  fs: cifs: Replace CURRENT_TIME by get_seconds
  fs: f2fs: Use ktime_get_real_seconds for sit_info times
  drivers: staging: lustre: Replace CURRENT_TIME with current_time()
  fs: ocfs2: Use time64_t to represent orphan scan times
  fs: ocfs2: Replace CURRENT_TIME with ktime_get_real_seconds()
  audit: Use timespec64 to represent audit timestamps
  fs: nfs: Make nfs boot time y2038 safe
  fnic: Use time64_t to represent trace timestamps
  block: Replace CURRENT_TIME with ktime_get_real_ts
  libceph: Replace CURRENT_TIME with ktime_get_real_ts
  fs: ceph: Replace current_fs_time for request stamp
  time: Delete CURRENT_TIME_SEC and CURRENT_TIME macro
  time: Delete current_fs_time() function

 arch/powerpc/platforms/cell/spufs/inode.c       |  2 +-
 arch/s390/hypfs/inode.c                         |  4 ++--
 drivers/block/rbd.c                             |  2 +-
 drivers/char/sonypi.c                           |  2 +-
 drivers/infiniband/hw/qib/qib_fs.c              |  2 +-
 drivers/misc/ibmasm/ibmasmfs.c                  |  2 +-
 drivers/oprofile/oprofilefs.c                   |  2 +-
 drivers/platform/x86/sony-laptop.c              |  2 +-
 drivers/scsi/fnic/fnic_trace.c                  |  4 ++--
 drivers/scsi/fnic/fnic_trace.h                  |  2 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c | 16 ++---
 drivers/staging/lustre/lustre/llite/namei.c     |  4 ++--
 drivers/staging/lustre/lustre/mdc/mdc_reint.c   |  6 ++---
 .../lustre/lustre/obdclass/linux/linux-obdo.c   |  6 ++---
 drivers/staging/lustre/lustre/obdclass/obdo.c   |  6 ++---
 drivers/staging/lustre/lustre/osc/osc_io.c      |  2 +-
 drivers/usb/core/devio.c                        | 18 +++---
 drivers/usb/gadget/function/f_fs.c              |  8 +++
 drivers/usb/gadget/legacy/inode.c               |  2 +-
 fs/9p/vfs_inode.c                               |  2 +-
 fs/adfs/inode.c                                 |  2 +-
 fs/affs/amigaffs.c                              |  6 ++---
 fs/affs/inode.c                                 |  2 +-
 fs/attr.c                                       |  2 +-
 fs/autofs4/inode.c                              |  2 +-
 fs/autofs4/root.c                               |  6 ++---
 fs/bad_inode.c                                  |  2 +-
 fs/bfs/dir.c                                    | 14 +--
 fs/binfmt_misc.c                                |  2 +-
 fs/btrfs/file.c                                 |  6 ++---
 fs/btrfs/inode.c                                | 22 -
 fs/btrfs/ioctl.c                                |  8 +++
 fs/btrfs/root-tree.c                            |  3 ++-
 fs/btrfs/transaction.c                          |  4 ++--
 fs/btrfs/xattr.c                                |  2 +-
 fs/ceph/file.c

[PATCH v2 08/24] fs: btrfs: Use ktime_get_real_ts for root ctime

2016-06-19 Thread Deepa Dinamani
btrfs_root_item maintains the ctime for root updates.
This is not part of vfs_inode.

Since current_time() uses struct inode* as an argument, as Linus
suggested, it cannot be used to update root times unless we modify
the signature to take an inode.

Since btrfs uses nanosecond time granularity, it can also
use ktime_get_real_ts directly to obtain timestamp for
the root. It is necessary to use the timespec time api
here because the same btrfs_set_stack_timespec_*() apis
are used for vfs inode times as well. These can be
transitioned to using timespec64 when btrfs internally
changes to use timespec64 as well.

Signed-off-by: Deepa Dinamani 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/root-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index f1c3086..161118b 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -496,10 +496,11 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
 	struct btrfs_root_item *item = &root->root_item;
-	struct timespec ct = current_fs_time(root->fs_info->sb);
+	struct timespec ct;
 
 	spin_lock(&root->root_item_lock);
 	btrfs_set_root_ctransid(item, trans->transid);
+	ktime_get_real_ts(&ct);
 	btrfs_set_stack_timespec_sec(&item->ctime, ct.tv_sec);
 	btrfs_set_stack_timespec_nsec(&item->ctime, ct.tv_nsec);
 	spin_unlock(&root->root_item_lock);
-- 
1.9.1



Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-19 Thread boli
For completeness, here's a summary of my replacement of all four 6 TB
drives (henceforth "6T") with 8 TB drives ("8T") in a btrfs raid1 volume.
I included transfer rates so others can get a rough idea of what to
expect when doing similar things.  All capacity units are SI, not base 2.

Filesystem usage was ~17.84 of 24 TB when I started.

The first steps all happened while the machine was booted into emergency mode.

 1. Physically replaced 1st 6T with 1st 8T,
    without having done a logical remove beforehand.
    Should have done that to maintain redundancy.
 2. Mounted volume degraded and ran 'btrfs device remove missing'
    (commands for these steps are sketched after the list).
    Took over 4 days, and 1.4 TB were still missing afterwards.
    Also it was a close call: 17.84 TB of 18 TB used!
    (Two of the drives were completely full after this.)
    Transfer rate of ~46 MB/s (~6 h/TB)
 3. Restored the missing 1.4 TB onto the 1st 8T with 'btrfs replace -r'.
    Would have been more efficient to try and complete step 2.
    Transfer rate of ~159 MB/s (~1.75 h/TB)
 4. Resized to full size of 1st 8T
 5. 'btrfs device remove'd a 2nd 6T
 6. Physically replaced this 2nd 6T with 2nd 8T

At this point the machine was rebooted into normal mode.   

 7. Logically replaced 3rd 6T onto 2nd 8T with 'btrfs replace'.
    Transfer rate of ~140 MB/s (~1.98 h/TB)
 8. Resized to full size of 2nd 8T
 9. Physically replaced 3rd 6T with 3rd 8T

Another reboot followed, for a kernel update to 4.5.6.  The machine also
received a few of the backups that were previously held back, so it
could restore in peace.

10. Logically replaced 4th 6T onto 3rd 8T with 'btrfs replace'.
    Transfer rate of ~151 MB/s (~1.84 h/TB)
11. Resized to full size of 3rd 8T
12. Physically replaced 4th 6T with 4th 8T (reboot)
13. Logically added 4th 8T to volume with 'btrfs device add'
14. Ran a full balance (~18 TB used). Took about 2 days.
    Transfer rate of ~104 MB/s (~2.67 h/TB)
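
For reference, the per-drive commands used above look roughly like this
(devid, device names, and mount point are hypothetical):

    btrfs device remove missing /mnt         # step 2, on a degraded mount
    btrfs replace start -r 5 /dev/sdX /mnt   # steps 3/7/10: 5 = devid being
                                             # replaced; -r reads from the
                                             # other mirror where possible
    btrfs replace status /mnt                # watch progress
    btrfs filesystem resize 5:max /mnt       # steps 4/8/11: grow to device size
    btrfs device add /dev/sdY /mnt           # step 13
    btrfs balance start /mnt                 # step 14: full balance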
