[PATCH] fstests: btrfs, test directory fsync after deleting snapshots
From: Filipe Manana

Test that if we fsync a directory that had a snapshot entry in it that
was deleted and crash, the next time we mount the filesystem, the log
replay procedure will not fail and the snapshot is not present anymore.

This issue is fixed by the following patch for the linux kernel:

  "Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"

Signed-off-by: Filipe Manana
---
 tests/btrfs/118     | 86 +
 tests/btrfs/118.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 89 insertions(+)
 create mode 100755 tests/btrfs/118
 create mode 100644 tests/btrfs/118.out

diff --git a/tests/btrfs/118 b/tests/btrfs/118
new file mode 100755
index 000..3ed1cbe
--- /dev/null
+++ b/tests/btrfs/118
@@ -0,0 +1,86 @@
+#! /bin/bash
+# FSQA Test No. 118
+#
+# Test that if we fsync a directory that had a snapshot entry in it that was
+# deleted and crash, the next time we mount the filesystem, the log replay
+# procedure will not fail and the snapshot is not present anymore.
+#
+#---
+#
+# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#

+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create a snapshot at the root of our filesystem (mount point path), delete it,
+# fsync the mount point path, crash and mount to replay the log. This should
+# succeed and after the filesystem is mounted the snapshot should not be visible
+# anymore.
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
+_flakey_drop_and_remount
+[ -e $SCRATCH_MNT/snap1 ] && echo "Snapshot snap1 still exists after log replay"
+
+# Similar scenario as above, but this time the snapshot is created inside a
+# directory and not directly under the root (mount point path).
+mkdir $SCRATCH_MNT/testdir
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
+_flakey_drop_and_remount
+[ -e $SCRATCH_MNT/testdir/snap2 ] && \
+	echo "Snapshot snap2 still exists after log replay"
+
+_unmount_flakey
+
+echo "Silence is golden"
+
+status=0
+exit
diff --git a/tests/btrfs/118.out b/tests/btrfs/118.out
new file mode 100644
index 000..3daed86
--- /dev/null
+++ b/tests/btrfs/118.out
@@ -0,0 +1,2 @@
+QA output created by 118
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f74ffbb..a2fa412 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -118,3 +118,4 @@
 115 auto qgroup
 116 auto quick metadata
 117 auto quick send clone
+118 auto quick snapshot metadata
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions
On 2016-02-09 15:39, Chris Murphy wrote:
> On Fri, Feb 5, 2016 at 12:36 PM, Mackenzie Meyer wrote:
>> RAID 6 write holes?
>
> I don't even understand the nature of the write hole on Btrfs. If
> modification is still always COW, then either an fs block, a strip, or
> whole stripe write happens, I'm not sure where the hole comes from. It
> suggests some raid56 writes are not atomic.

It's an issue of torn writes in this case, not of atomicity of BTRFS.
Disks can't atomically write more than sector size chunks, which means
that almost all BTRFS filesystems are doing writes that disks can't
atomically complete. Add to that that we serialize writes to different
devices, and it becomes trivial to lose some data if the system crashes
while BTRFS is writing out a stripe (it shouldn't screw up existing
data though, you'll just lose whatever you were trying to write).

One way to minimize this which would also boost performance on slow
storage would be to avoid writing parts of the stripe that aren't
changed (so for example, if only one disk in the stripe actually has
changed data, only write that and the parities).

> If you're worried about raid56 write holes, then a.) you need a server
> running this raid where power failures or crashes don't happen b.)
> don't use raid56 c.) use ZFS.

It's not just BTRFS that has this issue though, ZFS does too, it just
recovers more gracefully than BTRFS does, and even with the journaled
RAID{5,6} support that's being added in MDRAID (and by extension
DM-RAID and therefore LVM), it still has the same issue, it just moves
it elsewhere (in this case, it has problems if there's a torn write to
the journal).

>> RAID 6 stability? Any articles I've tried looking for online seem to
>> be from early 2014, I can't find anything recent discussing the
>> stability of RAID 5 or 6. Are there or have there recently been any
>> data corruption bugs which impact RAID 6? Would you consider RAID 6
>> safe/stable enough for production use?
>
> It's not stable for your use case, if you have to ask others if it's
> stable enough for your use case. Simple as that.
>
> Right now some raid6 users are experiencing remarkably slow balances,
> on the order of weeks. If device replacement rebuild times are that
> long, I'd say it's disqualifying for most any use case, just because
> there are alternatives that have better fail over behavior than this.
> So far there's no word from any developers what the problem might be,
> or where to gather more information. So chances are they're already
> aware of it but haven't reproduced it, or isolated it, or have a fix
> for it yet.

Double on this, we should probably put something similar on the wiki,
and this really applies to any feature, not just raid56.

>> Do you still strongly recommend backups, or has stability reached a
>> point where backups aren't as critical? I'm thinking from a data
>> consistency standpoint, not a hardware failure standpoint.
>
> You can't separate them. On completely stable hardware, stem to stern,
> you'd have no backups, no Btrfs or ZFS, you'd just run linear/concat
> arrays with XFS, for example. So you can't just hand wave the hardware
> part away. There are bugs in the entire storage stack, there are
> connectors that can become intermittent, the system could crash. All
> of these affect data consistency.

I may be wrong, but I believe the intent of this question was to try
and figure out how likely BTRFS itself is to cause crashes or data
corruption, independent of the hardware. In other words, 'Do I need to
worry significantly about BTRFS in planning for disaster recovery, or
can I focus primarily on the hardware itself?' or 'Is the most likely
failure mode going to be hardware failure, or software?'.

In general, right now I'd say that using BTRFS in a traditional
multi-device setup (nothing more than raid1 or possibly raid10), you've
got roughly a 50% chance of an arbitrary crash being a software issue
instead of hardware. Single disk, I'd say it's probably closer to 25%,
and raid56 I'd say it's probably closer to 75%. By comparison, I'd say
that with ZFS it's maybe a 5% chance (ZFS is developed as enterprise
level software, it has to work, period), and with XFS on LVM raid,
probably about 15% (similar to ZFS, XFS is supposed to be enterprise
level software; the difference here comes from LVM, which has had some
interesting issues recently due to incomplete testing of certain things
before they got pushed upstream).

> Stability has not reached a point where backups aren't as critical. I
> don't really even know what that means though. No matter Btrfs or not,
> you need to be doing backups such that a 100% loss of the primary
> stack, without notice, is not a disaster. Plan on having to use them.
> If you don't like the sound of that, look elsewhere.

What you're using has an impact on how you need to do backups. For
someone who can afford long periods of down time for example, it may be
perfectly fine to use something like Amazon S3 Glacier storage
[PATCH] Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
From: Filipe Manana

If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.

For example:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
  $ btrfs subvolume delete /mnt/testdir/snap
  $ xfs_io -c "fsync" /mnt/testdir
  < crash / power failure and reboot >
  $ mount /dev/sdc /mnt
  mount: mount(2) failed: No such file or directory

And in dmesg/syslog we get the following message and trace:

[192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
[192066.363010] [ cut here ]
[192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
[192066.367250] BTRFS: Transaction aborted (error -2)
[192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: GW 4.4.0-rc6-btrfs-next-20+ #1
[192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[192066.380889] 880143923670 81257570 8801439236b8
[192066.382561] 8801439236a8 8104ec07 a039dc2c fffe
[192066.384191] 8801ed31d000 8801b9fc9c88 8801086875e0 880143923710
[192066.385827] Call Trace:
[192066.386373] [] dump_stack+0x4e/0x79
[192066.387387] [] warn_slowpath_common+0x99/0xb2
[192066.388429] [] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.389236] [] warn_slowpath_fmt+0x48/0x50
[192066.389884] [] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.390621] [] ? iput+0xb0/0x266
[192066.391200] [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[192066.391930] [] check_item_in_log+0x1fe/0x29b [btrfs]
[192066.392715] [] replay_dir_deletes+0x167/0x1cf [btrfs]
[192066.393510] [] replay_one_buffer+0x417/0x570 [btrfs]
[192066.394241] [] walk_up_log_tree+0x10e/0x1dc [btrfs]
[192066.394958] [] walk_log_tree+0xa5/0x190 [btrfs]
[192066.395628] [] btrfs_recover_log_trees+0x239/0x32c [btrfs]
[192066.396790] [] ? replay_one_extent+0x50a/0x50a [btrfs]
[192066.397891] [] open_ctree+0x1d8b/0x2167 [btrfs]
[192066.398897] [] btrfs_mount+0x5ef/0x729 [btrfs]
[192066.399823] [] ? trace_hardirqs_on+0xd/0xf
[192066.400739] [] ? lockdep_init_map+0xb9/0x1b3
[192066.401700] [] mount_fs+0x67/0x131
[192066.402482] [] vfs_kern_mount+0x6c/0xde
[192066.403930] [] btrfs_mount+0x1cb/0x729 [btrfs]
[192066.404831] [] ? trace_hardirqs_on+0xd/0xf
[192066.405726] [] ? lockdep_init_map+0xb9/0x1b3
[192066.406621] [] mount_fs+0x67/0x131
[192066.407401] [] vfs_kern_mount+0x6c/0xde
[192066.408247] [] do_mount+0x893/0x9d2
[192066.409047] [] ? strndup_user+0x3f/0x8c
[192066.409842] [] SyS_mount+0x75/0xa1
[192066.410621] [] entry_SYSCALL_64_fastpath+0x12/0x6b
[192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
[192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
[192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
[192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
[192066.444613] BTRFS: open_ctree failed

This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.

So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Some alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree
Re: RAID5 Unable to remove Failing HD
Anand, thanks for the tip. What kernels are these meant for? I am not
able to apply these cleanly to the kernels i have tried. Or is there a
kernel with these incorporated?

I have tried rebooting without the disk attached and am unable to mount
the partition. Complaining about bad tree and failed to read chunk. So
at the moment the disk is still readable, though not sure how long that
will last.

I have posted a copy of my messages log, only the last couple of days.
https://www.dropbox.com/s/9f05e1q5w4zkp38/messages_trimmed2?dl=0

If you or anybody else has some tips i would appreciate it.

Regards

On 10 February 2016 at 17:58, Rene Castberg wrote:
> Anand, thanks for the tip. What kernels are these meant for? I am not able
> to apply these cleanly to the kernels i have tried. Or is there a kernel
> with these incorporated?
>
> I have tried rebooting without the disk attached and am unable to mount the
> partition. Complaining about bad tree and failed to read chunk. So at the
> moment the disk is still readable, though not sure how long that will last.
>
> I have posted a copy of my messages log, only the last couple of days.
> https://www.dropbox.com/s/9f05e1q5w4zkp38/messages_trimmed2?dl=0
>
> If you or anybody else has some tips i would appreciate it.
>
> Regards
>
> Rene Castberg
>
> On 10 February 2016 at 10:00, Anand Jain wrote:
>>
>> Rene,
>>
>> Thanks for the report. Fixes are in the following patch sets
>>
>> concern1:
>> Btrfs to fail/offline a device for write/flush error:
>>    [PATCH 00/15] btrfs: Hot spare and Auto replace
>>
>> concern2:
>> User should be able to delete a device when device has failed:
>>    [PATCH 0/7] Introduce device delete by devid
>>
>> If you were able to tryout these patches, pls lets know.
>>
>> Thanks, Anand
>>
>>
>> On 02/10/2016 03:17 PM, Rene Castberg wrote:
>>>
>>> Hi,
>>>
>>> This morning i woke up to a failing disk:
>>>
>>> [230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush 503, corrupt 0, gen 0
>>> [230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush 503, corrupt 0, gen 0
>>> [230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
>>> [230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush 503, corrupt 0, gen 0
>>> [230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush 503, corrupt 0, gen 0
>>> [230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
>>> [230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush 503, corrupt 0, gen 0
>>> [230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush 503, corrupt 0, gen 0
>>> [230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush 503, corrupt 0, gen 0
>>> [230802.000425] nfsd: last server has exited, flushing export cache
>>> [230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush 503, corrupt 0, gen 0
>>> [230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush 503, corrupt 0, gen 0
>>> [230898.793723] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush 503, corrupt 0, gen 0
>>> [230898.830893] BTRFS info (device sdd): allowing degraded mounts
>>> [230898.830902] BTRFS info (device sdd): disk space caching is enabled
>>>
>>> Eventually i remounted it as degraded, hopefully to prevent any loss of
>>> data.
>>>
>>> It seems that the btrfs filesystem still hasn't noticed that the disk
>>> has failed:
>>> $ btrfs fi show
>>> Label: 'RenesData'  uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
>>>         Total devices 5 FS bytes used 5.38TiB
>>>         devid 1 size 2.73TiB used 1.84TiB path /dev/sdb
>>>         devid 2 size 2.73TiB used 1.84TiB path /dev/sde
>>>         devid 3 size 3.64TiB used 1.84TiB path /dev/sdf
>>>         devid 4 size 2.73TiB used 1.84TiB path /dev/sdd
>>>         devid 5 size 3.64TiB used 1.84TiB path /dev/sdc
>>>
>>> I tried deleting the device:
>>> # btrfs device delete /dev/sdc /mnt2/RenesData/
>>> ERROR: error removing device '/dev/sdc': Invalid argument
>>>
>>> I have been unlucky and already had a failure last friday, where a
>>> RAID5 array failed after a disk failure. I rebooted, and the data was
>>> unrecoverable. Fortunately this was only temp data so the failure
>>> wasn't a real issue.
>>>
>>> Can somebody give me some advice how to delete the failing disk? I
>>>
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
http://fpaste.org/320720/45511028/

What is rb_next? See if you can explode that out and find out more
about why there's so much time going on with that. I see that rb_next
gets used for lots of things, including btrfs. In mine, rb_next is less
than 1% overhead, but for you it's the top item. That's suspicious.

http://fpaste.org/320718/10016145/ line 72-73.

We both have counts for qgroup stuff. Mine is much much less than
yours. I have never had quotas enabled on any of my filesystems, so I
don't know why there are any such counts at all. But since your values
are nearly three orders of magnitude greater than mine, I have to ask
if you have quotas enabled or have ever had them enabled? That might be
a factor here...

Chris Murphy
Re: [PATCH] fstests: btrfs, test directory fsync after deleting snapshots
On Wed, Feb 10, 2016 at 01:32:51PM +, fdman...@kernel.org wrote:
> From: Filipe Manana
> 
> Test that if we fsync a directory that had a snapshot entry in it that
> was deleted and crash, the next time we mount the filesystem, the log
> replay procedure will not fail and the snapshot is not present anymore.
> 
> This issue is fixed by the following patch for the linux kernel:
> 
> "Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"
> 

Tested-by: Liu Bo
Reviewed-by: Liu Bo

Thanks,

-liubo

> Signed-off-by: Filipe Manana
> ---
>  tests/btrfs/118     | 86 +
>  tests/btrfs/118.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 89 insertions(+)
>  create mode 100755 tests/btrfs/118
>  create mode 100644 tests/btrfs/118.out
> 
> diff --git a/tests/btrfs/118 b/tests/btrfs/118
> new file mode 100755
> index 000..3ed1cbe
> --- /dev/null
> +++ b/tests/btrfs/118
> @@ -0,0 +1,86 @@
> +#! /bin/bash
> +# FSQA Test No. 118
> +#
> +# Test that if we fsync a directory that had a snapshot entry in it that was
> +# deleted and crash, the next time we mount the filesystem, the log replay
> +# procedure will not fail and the snapshot is not present anymore.
> +#
> +#---
> +#
> +# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
> +# Author: Filipe Manana
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	_cleanup_flakey
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_need_to_be_root
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +_require_metadata_journaling $SCRATCH_DEV
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_init_flakey
> +_mount_flakey
> +
> +# Create a snapshot at the root of our filesystem (mount point path), delete
> it,
> +# fsync the mount point path, crash and mount to replay the log. This should
> +# succeed and after the filesystem is mounted the snapshot should not be
> visible
> +# anymore.
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
> +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
> +_flakey_drop_and_remount
> +[ -e $SCRATCH_MNT/snap1 ] && echo "Snapshot snap1 still exists after log
> replay"
> +
> +# Similar scenario as above, but this time the snapshot is created inside a
> +# directory and not directly under the root (mount point path).
> +mkdir $SCRATCH_MNT/testdir
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT
> $SCRATCH_MNT/testdir/snap2
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
> +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
> +_flakey_drop_and_remount
> +[ -e $SCRATCH_MNT/testdir/snap2 ] && \
> +	echo "Snapshot snap2 still exists after log replay"
> +
> +_unmount_flakey
> +
> +echo "Silence is golden"
> +
> +status=0
> +exit
> diff --git a/tests/btrfs/118.out b/tests/btrfs/118.out
> new file mode 100644
> index 000..3daed86
> --- /dev/null
> +++ b/tests/btrfs/118.out
> @@ -0,0 +1,2 @@
> +QA output created by 118
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index f74ffbb..a2fa412 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -118,3 +118,4 @@
>  115 auto qgroup
>  116 auto quick metadata
>  117 auto quick send clone
> +118 auto quick snapshot metadata
> -- 
> 2.7.0.rc3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs-progs: copy functionality of btrfs-debug-tree to inspect-internal subcommand
On Tue, Feb 09, 2016 at 05:12:08PM +0100, Alexander Fougner wrote:
> The long-term plan is to merge the features of standalone tools
> into the btrfs binary, reducing the number of shipped binaries.
>
> Signed-off-by: Alexander Fougner

Applied, thanks.
Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions
On Wed, Feb 10, 2016 at 6:57 AM, Austin S. Hemmelgarn wrote:
> It's an issue of torn writes in this case, not of atomicity of BTRFS.
> Disks can't atomically write more than sector size chunks, which means
> that almost all BTRFS filesystems are doing writes that disks can't
> atomically complete. Add to that that we serialize writes to different
> devices, and it becomes trivial to lose some data if the system
> crashes while BTRFS is writing out a stripe (it shouldn't screw up
> existing data though, you'll just lose whatever you were trying to
> write).

I follow all of this. I still don't know how a torn write leads to a
write hole in the conventional sense though. If the write is partial, a
pointer never should have been written to that unfinished write. So the
pointer that's there after a crash should either point to the old
stripe or new stripe (which includes parity), not to the new data
strips but an old (stale) parity strip for that partial stripe write
that was interrupted.

It's easy to see how conventional raid gets this wrong because it has
no pointers to strips; those locations are known due to the geometry
(raid level, layout, number of devices) and fixed. I don't know what
rmw looks like on Btrfs raid56 without overwriting the stripe - a whole
new cow'd stripe, and then metadata is updated to reflect the new
location of that stripe?

> One way to minimize this which would also boost performance on slow
> storage would be to avoid writing parts of the stripe that aren't
> changed (so for example, if only one disk in the stripe actually has
> changed data, only write that and the parities).

I'm pretty sure that's part of rmw, which is not a full stripe write.
At least there appears to be some distinction in raid56.c between them.

The additional optimization that md raid has had for some time is the
ability during rmw of a single data chunk (what they call strips, or
the smallest unit in a stripe) to actually optimize the change down to
a sector write. So they aren't even doing full chunk/strip writes
either. The parity strip though I think must be completely rewritten.

>> If you're worried about raid56 write holes, then a.) you need a server
>> running this raid where power failures or crashes don't happen b.)
>> don't use raid56 c.) use ZFS.
>
> It's not just BTRFS that has this issue though, ZFS does too,

Well it's widely considered to not have the write hole. From a ZFS
conference I got this tidbit on how they closed the write hole, but I
still don't understand why they'd be pointing to a partial (torn) write
in the first place:

"key insight was realizing instead of treating a stripe as it's a
"stripe of separate blocks" you can take a block and break it up into
many sectors and have a stripe across the sectors that is of one logic
block, that eliminates the write hole because even if the write is
partial until all of those writes are complete there's not going to be
an uber block referencing any of that." –Bonwick

https://www.youtube.com/watch?v=dcV2PaMTAJ4 14:45

> What you're using has an impact on how you need to do backups. For
> someone who can afford long periods of down time for example, it may
> be perfectly fine to use something like Amazon S3 Glacier storage
> (which has a 4 hour lead time on restoration for read access) for
> backups. OTOH, if you can't afford more than a few minutes of down
> time and want to use BTRFS, you should probably have full on-line
> on-site backups which you can switch in on a moments notice while you
> fix things.

Right, or use glusterfs or ceph if you need to stay up and running
during a total brick implosion.

Quite honestly, I would much rather see Btrfs single support multiple
streams per device, like XFS does with allocation groups when used on
linear/concat of multiple devices; two to four per

-- 
Chris Murphy
Re: [PATCH] Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
On Wed, Feb 10, 2016 at 01:30:48PM +, fdman...@kernel.org wrote:
> From: Filipe Manana
> 
> If we delete a snapshot, fsync its parent directory and crash/power fail
> before the next transaction commit, on the next mount when we attempt to
> replay the log tree of the root containing the parent directory we will
> fail and prevent the filesystem from mounting, which is solvable by wiping
> out the log trees with the btrfs-zero-log tool but very inconvenient as
> we will lose any data and metadata fsynced before the parent directory
> was fsynced.
> 
> For example:
> 
> $ mkfs.btrfs -f /dev/sdc
> $ mount /dev/sdc /mnt
> $ mkdir /mnt/testdir
> $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
> $ btrfs subvolume delete /mnt/testdir/snap
> $ xfs_io -c "fsync" /mnt/testdir
> < crash / power failure and reboot >
> $ mount /dev/sdc /mnt
> mount: mount(2) failed: No such file or directory
> 
> And in dmesg/syslog we get the following message and trace:
> 
> [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
> [192066.363010] [ cut here ]
> [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
> [192066.367250] BTRFS: Transaction aborted (error -2)
> [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
> [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: GW 4.4.0-rc6-btrfs-next-20+ #1
> [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [192066.380889] 880143923670 81257570 8801439236b8
> [192066.382561] 8801439236a8 8104ec07 a039dc2c fffe
> [192066.384191] 8801ed31d000 8801b9fc9c88 8801086875e0 880143923710
> [192066.385827] Call Trace:
> [192066.386373] [] dump_stack+0x4e/0x79
> [192066.387387] [] warn_slowpath_common+0x99/0xb2
> [192066.388429] [] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
> [192066.389236] [] warn_slowpath_fmt+0x48/0x50
> [192066.389884] [] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
> [192066.390621] [] ? iput+0xb0/0x266
> [192066.391200] [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
> [192066.391930] [] check_item_in_log+0x1fe/0x29b [btrfs]
> [192066.392715] [] replay_dir_deletes+0x167/0x1cf [btrfs]
> [192066.393510] [] replay_one_buffer+0x417/0x570 [btrfs]
> [192066.394241] [] walk_up_log_tree+0x10e/0x1dc [btrfs]
> [192066.394958] [] walk_log_tree+0xa5/0x190 [btrfs]
> [192066.395628] [] btrfs_recover_log_trees+0x239/0x32c [btrfs]
> [192066.396790] [] ? replay_one_extent+0x50a/0x50a [btrfs]
> [192066.397891] [] open_ctree+0x1d8b/0x2167 [btrfs]
> [192066.398897] [] btrfs_mount+0x5ef/0x729 [btrfs]
> [192066.399823] [] ? trace_hardirqs_on+0xd/0xf
> [192066.400739] [] ? lockdep_init_map+0xb9/0x1b3
> [192066.401700] [] mount_fs+0x67/0x131
> [192066.402482] [] vfs_kern_mount+0x6c/0xde
> [192066.403930] [] btrfs_mount+0x1cb/0x729 [btrfs]
> [192066.404831] [] ? trace_hardirqs_on+0xd/0xf
> [192066.405726] [] ? lockdep_init_map+0xb9/0x1b3
> [192066.406621] [] mount_fs+0x67/0x131
> [192066.407401] [] vfs_kern_mount+0x6c/0xde
> [192066.408247] [] do_mount+0x893/0x9d2
> [192066.409047] [] ? strndup_user+0x3f/0x8c
> [192066.409842] [] SyS_mount+0x75/0xa1
> [192066.410621] [] entry_SYSCALL_64_fastpath+0x12/0x6b
> [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
> [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
> [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
> [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
> [192066.444613] BTRFS: open_ctree failed
> 
> This happens because when we are replaying the log and processing the
> directory entry pointing to the snapshot in the subvolume tree, we treat
> its btrfs_dir_item item as having a location with a key type matching
> BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
> BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
> object id refers to a root number and not to an inode in the root
> containing the parent directory.
> 
> So fix this by triggering a transaction commit if an fsync against the
> parent directory is requested after deleting a snapshot. This is the
> simplest approach for a rare use case. Some alternative that avoids the
> transaction
[LFS/MM TOPIC] fs reflink issues, fs online scrub/check, etc
[resend, email exploded, sorry...] Hi, I want to discuss a few FS related topics that I haven't already seen on the mailing lists:

* Shared pagecache pages for reflinked files (and by extension making dax work with reflink on xfs)
* Providing a simple interface for scrubbing filesystem metadata in the background (the online check thing). Ideally we'd make it easy to discover what kind of metadata there is to check and provide a simple interface to check the metadata, once discovered. This is a tricky interface topic since FS design differs pretty widely.
* Rudimentary online repair and rebuilding (xfs) from secondary metadata
* Working out the variances in the btrfs/xfs/ocfs2/nfs reflink implementations and making sure they all get test coverage

I would also like to participate in some of the proposed discussions:

* The ext4 summit (and whatever meeting of XFS devs may happen)
* Integrating existing filesystems into pmem, or hallway bofs about designing new filesystems for pmem
* Actually seeing the fs developers (well, everyone!) in person again :)

--Darrick -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
Sometimes when things are really slow or even hung up with Btrfs, yet there's no blocked task being reported, a dev has asked for sysrq+t, so that might also be something to issue while the slow balance is happening, and then dmesg to grab the result. The thing is, I have no idea how to read the output, but maybe if it gets posted up somewhere we can figure it out. I mean, obviously this is a bug, it shouldn't take two weeks or more to balance a raid6 volume. I'd like to think this would have been caught much sooner in regression testing before it'd be released, so it makes me wonder if this is an edge case related to hardware, kernel build, or more likely some state of the affected file systems that the test file systems aren't in. It might be more helpful to sort through xfstests that call balance and raid56, and see if there's something that's just not being tested, but applies to the actual filesystems involved; rather than trying to decipher kernel output. *shrug* Chris Murphy
Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions
On 2016-02-10 14:06, Chris Murphy wrote: On Wed, Feb 10, 2016 at 6:57 AM, Austin S. Hemmelgarn wrote: It's an issue of torn writes in this case, not of atomicity of BTRFS. Disks can't atomically write more than sector size chunks, which means that almost all BTRFS filesystems are doing writes that disks can't atomically complete. Add to that that we serialize writes to different devices, and it becomes trivial to lose some data if the system crashes while BTRFS is writing out a stripe (it shouldn't screw up existing data though, you'll just lose whatever you were trying to write). I follow all of this. I still don't know how a torn write leads to a write hole in the conventional sense though. If the write is partial, a pointer never should have been written to that unfinished write. So the pointer that's there after a crash should either point to the old stripe or new stripe (which includes parity), not to the new data strips but an old (stale) parity strip for that partial stripe write that was interrupted. It's easy to see how conventional raid gets this wrong because it has no pointers to strips, those locations are known due to the geometry (raid level, layout, number of devices) and fixed. I don't know what rmw looks like on Btrfs raid56 without overwriting the stripe - a whole new cow'd stripe, and then metadata is updated to reflect the new location of that stripe? I agree, it's not technically a write hole in the conventional sense, but the terminology has become commonplace for data loss in RAID{5,6} due to a failure somewhere in the write path, and this does fit in that sense. In this case the failure is in writing out the metadata that references the blocks instead of in writing out the blocks themselves. Even though you don't lose any existing data, you still lose anything that you were trying to write out. 
One way to minimize this which would also boost performance on slow storage would be to avoid writing parts of the stripe that aren't changed (so for example, if only one disk in the stripe actually has changed data, only write that and the parities). I'm pretty sure that's part of rmw, which is not a full stripe write. At least there appears to be some distinction in raid56.c between them. The additional optimization that md raid has had for some time is the ability during rmw of a single data chunk (what they call strips, or the smallest unit in a stripe), they can actually optimize the change down to a sector write. So they aren't even doing full chunk/strip writes either. The parity strip though I think must be completely rewritten. I actually wasn't aware that BTRFS did this (it's been a while since I looked at the kernel code), although I'm glad to hear it does. If you're worried about raid56 write holes, then a.) you need a server running this raid where power failures or crashes don't happen b.) don't use raid56 c.) use ZFS. It's not just BTRFS that has this issue though, ZFS does too. Well, it's widely considered to not have the write hole. From a ZFS conference I got this tidbit on how they closed the write hole, but I still don't understand why they'd be pointing to a partial (torn) write in the first place: "key insight was realizing instead of treating a stripe as it's a "stripe of separate blocks" you can take a block and break it up into many sectors and have a stripe across the sectors that is of one logic block, that eliminates the write hole because even if the write is partial until all of those writes are complete there's not going to be an uber block referencing any of that." –Bonwick https://www.youtube.com/watch?v=dcV2PaMTAJ4 14:45 Again, a torn write to the metadata referencing the block (stripe in this case I believe) will result in losing anything written by the update to the stripe. 
There is no way that _any_ system can avoid this issue without having the ability to truly atomically write out the entire metadata tree after the block (stripe) update. Doing so would require a degree of tight hardware level integration that's functionally impossible for any general purpose system (in essence, the filesystem would have to be implemented in the hardware, not software). What you're using has an impact on how you need to do backups. For someone who can afford long periods of down time for example, it may be perfectly fine to use something like Amazon S3 Glacier storage (which has a 4 hour lead time on restoration for read access) for backups. OTOH, if you can't afford more than a few minutes of down time and want to use BTRFS, you should probably have full on-line on-site backups which you can switch in on a moment's notice while you fix things. Right, or use glusterfs or ceph if you need to stay up and running during a total brick implosion. Quite honestly, I would much rather see Btrfs single support multiple streams per device, like XFS does with allocation groups when used on linear/concat of
Re: task btrfs-cleaner:770 blocked for more than 120 seconds.
2016-02-03 9:48 GMT+05:00 Chris Murphy: > Mike: From your attachment, looks like you rebooted. So do this: > > echo 1 > /proc/sys/kernel/sysrq > Reproduce the problem where you get blocked task messages in dmesg > echo w > /proc/sysrq-trigger > journalctl -k > kernel-sysrqw-btrfscleaner770blocked-2.txt > > > Make sure you use the same mount options. Looks like you're using > autodefrag, and inode_cache. Are there others? And can you say what > the workload is? Especially because inode_cache is not a default mount > option and isn't recommended except for certain workloads, but still I > think it shouldn't hang. But that's a question for Liu Bo. > > > Chris Murphy

Thanks, Chris, for the clarification. I do not have an exact way of reproducing this, but it happened with my btrfs partition again.

*Hang occurred here*
echo 1 > /proc/sys/kernel/sysrq
Reproduce the problem where you get blocked task messages in dmesg
echo w > /proc/sysrq-trigger
journalctl -k > kernel-sysrqw-btrfscleaner770blocked-2.txt

Here is the full log: http://btrfs.sy24.ru/kernel-sysrqw-btrfscleaner770blocked-2.txt

I am so sorry if this log is useless. If "sysrq" needs to be enabled before the hang, then I need to set it permanently, because, as I said, I have no exact way of reproducing this.

My mount options: UUID=82df2d84-bf54-46cb-84ba-c88e93677948 /home btrfs subvolid=5,autodefrag,noatime,space_cache,inode_cache 0 0

-- Best Regards, Mike Gavrilov.
Re: task btrfs-cleaner:770 blocked for more than 120 seconds.
On Wed, Feb 10, 2016 at 1:39 PM, Михаил Гаврилов wrote: > > Here full log: http://btrfs.sy24.ru/kernel-sysrqw-btrfscleaner770blocked-2.txt > > I am so sorry if this log is useless. Looks good to me. The blocked task happens out of nowhere with nothing reported for almost an hour before the blocking. And I see the sysrq: SysRq : Show Blocked State was issued and lots of information is in the file. > If "sysrq" is needed enabled before hang then I need set this > permanently because as I said I not having exactly reproducing this. echo 1 > /proc/sys/kernel/sysrq can happen anytime, it just enables the sysrq triggering functions, which are not enabled by default on Fedora kernels. The main thing is that the echo w to the sysrq trigger needs to happen at the time of the problem to show the state. You did that. Let's see what Liu Bo has to say about it. -- Chris Murphy
Re: RAID5 Unable to remove Failing HD
On 02/11/2016 12:58 AM, Rene Castberg wrote:
> Anand, thanks for the tip. What kernels are these meant for? I am not able to apply these cleanly to the kernels I have tried. Or is there a kernel with these incorporated?

As I am trying again, they apply cleanly on v4.4-rc8 (last commit b82dde0230439215b55e545880e90337ee16f51a). You are probably missing some unrelated, independent patches. To make things easier, I have attached here a tar of the patches on 4.4-rc8; these patches are already in the ML, as individual patches and as sets where there are dependencies. Pls apply them in the same order as the dir names.

> I have tried rebooting without the disk attached and am unable to mount the partition. Complaining about bad tree and failed to read chunk. So at the moment the disk is still readable, though not sure how long that will last.

Pls physically remove the disk (/dev/sdc), and as you are already using -o degraded, pls continue to use it. So now you can delete the missing device.

Thanks, Anand

2to5.tar.gz Description: application/gzip
Re: btrfs-image failure (btrfs-tools 4.4)
On Tue, Jan 26, 2016 at 09:03:07AM +0800, Qu Wenruo wrote: > If the fs is small enough, would you please do a btrfs-image dump? > That would help a lot to locate the direct cause.

I started making a dump, image was growing past 3GB, and then it failed and the image got deleted:

gargamel:~# btrfs-image -s -c 9 /dev/mapper/dshelf1old /mnt/dshelf1/ds1old.dump
Error adding space cache blocks -5
Error flushing pending -5
create failed (Success)

gargamel:~# dpkg --status btrfs-tools
Package: btrfs-tools
Status: install ok installed
Priority: optional
Section: admin
Installed-Size: 3605
Maintainer: Dimitri John Ledkov
Architecture: amd64
Version: 4.4-1

Is there a 4G file size limit, or did I hit another problem?

Thanks, Marc
-- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
On Wednesday 10 Feb 2016 11:39:25 David Sterba wrote: > > The explanations and the table would be good in the changelog and as > comments. I think we'll need to consider the smaller blocks more often > so some examples and locking rules would be useful, eg. documented in > this file. David, I agree. As suggested, I will add the documentation to the commit message and as comments in the code. -- chandan
Re: btrfs-image failure (btrfs-tools 4.4)
Marc MERLIN wrote on 2016/02/10 22:31 -0800:
> On Tue, Jan 26, 2016 at 09:03:07AM +0800, Qu Wenruo wrote:
> > If the fs is small enough, would you please do a btrfs-image dump?
> > That would help a lot to locate the direct cause.
>
> I started making a dump, image was growing past 3GB, and then it failed and the image got deleted:
> gargamel:~# btrfs-image -s -c 9 /dev/mapper/dshelf1old /mnt/dshelf1/ds1old.dump
> Error adding space cache blocks -5

It seems that btrfs-image failed to read the space cache, in the read_data_extent() function. And since there is no "Couldn't map the block " error message, either some device is missing or pread64 failed to read the desired data.

> Error flushing pending -5
> create failed (Success)
> gargamel:~# dpkg --status btrfs-tools
> Package: btrfs-tools
> Status: install ok installed
> Priority: optional
> Section: admin
> Installed-Size: 3605
> Maintainer: Dimitri John Ledkov
> Architecture: amd64
> Version: 4.4-1
> Is there a 4G file size limit, or did I hit another problem?

For the 4G file size limit, did you mean the limit from old filesystems like FAT32? I don't think there is such a limit for a modern Linux filesystem, and normal read/write operations won't have such a limit either.

Thanks, Qu

> Thanks, Marc
[PULL 4.6] Preparatory work for subpage-blocksize patchset
Hi,

the preparatory patchset has been split from the core subpage-blocksize so we can merge it in smaller pieces. It has been pending for a long time and IMHO should be merged so we can focus on the core patchset. The branch contains the unmodified v10, partial reviews from Josef and Liu Bo, tested it by fstests. Please pull to 4.6.

The following changes since commit e410e34fad913dd568ec28d2a9949694324c14db:

  Revert "btrfs: synchronize incompat feature bits with sysfs files" (2016-01-29 08:19:37 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git chandan/prep-subpage-blocksize

for you to fetch changes up to 65bfa6580791f8c01fbc9cd8bd73d92aea53723f:

  Btrfs: btrfs_ioctl_clone: Truncate complete page after performing clone operation (2016-02-01 19:24:29 +0100)

Chandan Rajendra (12):
      Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
      Btrfs: Compute and look up csums based on sectorsized blocks
      Btrfs: Direct I/O read: Work on sectorsized blocks
      Btrfs: fallocate: Work with sectorsized blocks
      Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
      Btrfs: Search for all ordered extents that could span across a page
      Btrfs: Use (eb->start, seq) as search key for tree modification log
      Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
      Btrfs: Limit inline extents to root->sectorsize
      Btrfs: Fix block size returned to user space
      Btrfs: Clean pte corresponding to page straddling i_size
      Btrfs: btrfs_ioctl_clone: Truncate complete page after performing clone operation

 fs/btrfs/ctree.c     |  34 +++
 fs/btrfs/ctree.h     |   5 +-
 fs/btrfs/extent_io.c |   3 +-
 fs/btrfs/file-item.c |  92 ---
 fs/btrfs/file.c      |  99 
 fs/btrfs/inode.c     | 248 ---
 fs/btrfs/ioctl.c     |   5 +-
 7 files changed, 321 insertions(+), 165 deletions(-)

-- 2.6.3
Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions
On 05/02/16 20:36, Mackenzie Meyer wrote: RAID 6 stability? I'll say more: currently, btrfs is in a state of flux where, if you don't have a very recent kernel, that's the first recommendation you're going to receive in case of problems. This means going outside the stable packages in most distros. Once you're on the bleeding kernel edge, you are obviously more likely to run into undiscovered bugs. I even see people here who have to patch the kernel with still non-mainline patches when trying to recover. So don't use it for anything but testing.
Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
On Fri, Jun 19, 2015 at 03:15:01PM +0530, Chandan Rajendra wrote:
> > private->io_lock is not acquired here but not in below.
> >
> > IIUC, this can be protected by EXTENT_LOCKED.
>
> private->io_lock plays the same role as BH_Uptodate_Lock (see
> end_buffer_async_read()) i.e. without the io_lock we may end up in the
> following situation,
>
> NOTE: Assume 64k page size and 4k block size. Also assume that the first 12
> blocks of the page are contiguous while the next 4 blocks are contiguous. When
> reading the page we end up submitting two "logical address space" bios. So
> end_bio_extent_readpage function is invoked twice (once for each bio).
>
> |--------------------------+--------------------------+---------------|
> | Task A                   | Task B                   | Task C        |
> |--------------------------+--------------------------+---------------|
> | end_bio_extent_readpage  |                          |               |
> | process block 0          |                          |               |
> | - clear BLK_STATE_IO     |                          |               |
> | - page_read_complete     |                          |               |
> | process block 1          |                          |               |
> | ...                      |                          |               |
> | ...                      |                          |               |
> | ...                      | end_bio_extent_readpage  |               |
> | ...                      | process block 0          |               |
> | ...                      | - clear BLK_STATE_IO     |               |
> | ...                      | - page_read_complete     |               |
> | ...                      | process block 1          |               |
> | ...                      | ...                      |               |
> | process block 11         | process block 3          |               |
> | - clear BLK_STATE_IO     | - clear BLK_STATE_IO     |               |
> | - page_read_complete     | - page_read_complete     |               |
> | - returns true           | - returns true           |               |
> | - unlock_page()          |                          |               |
> |                          |                          | lock_page()   |
> |                          | - unlock_page()          |               |
> |--------------------------+--------------------------+---------------|
>
> So we end up incorrectly unlocking the page twice and "Task C" ends up working
> on an unlocked page. So private->io_lock makes sure that only one of the tasks
> gets "true" as the return value when page_read_complete() is invoked. As an
> optimization the patch gets the io_lock only when nr_sectors counter reaches
> the value 0 (i.e. when the last block of the bio_vec is being processed).
> Please let me know if my analysis was incorrect.
>
> Also, I noticed that page_read_complete() and page_write_complete() can be
> replaced by just one function i.e. page_io_complete().

The explanations and the table would be good in the changelog and as comments. 
I think we'll need to consider the smaller blocks more often so some examples and locking rules would be useful, eg. documented in this file.
Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.
On Tue, Jun 23, 2015 at 04:37:48PM +0800, Liu Bo wrote:
...
> > | - clear BLK_STATE_IO     | - clear BLK_STATE_IO     |               |
> > | - page_read_complete     | - page_read_complete     |               |
> > | - returns true           | - returns true           |               |
> > | - unlock_page()          |                          |               |
> > |                          |                          | lock_page()   |
> > |                          | - unlock_page()          |               |
> > |--------------------------+--------------------------+---------------|
> >
> > So we end up incorrectly unlocking the page twice and "Task C" ends up working
> > on an unlocked page. So private->io_lock makes sure that only one of the tasks
> > gets "true" as the return value when page_read_complete() is invoked. As an
> > optimization the patch gets the io_lock only when nr_sectors counter reaches
> > the value 0 (i.e. when the last block of the bio_vec is being processed).
> > Please let me know if my analysis was incorrect.
>
> Thanks for the nice explanation, it looks reasonable to me.

Please don't hesitate to add your reviewed-by if you spent time on that and think it's ok, this really helps to make decisions about merging.
Re: [GIT PULL] Fujitsu for 4.5
On Wed, Jan 13, 2016 at 05:28:12PM +0800, Zhao Lei wrote: > This is collection of some bug fix, enhance and cleanup from fujitsu > against btrfs for v4.5, mainly for reada, plus some small fix and cleanup > for scrub and raid56. > > All patchs are in btrfs-maillist, rebased on top of integration-4.5. I was trying to isolate safe fixes for 4.5 but saw hangs (same as Chris reported) and was not able to find the right followups. Can you please collect all your readahead patches sent recently? I got lost. Make it a git branch and let me know, I'll add it to for-next and send pull request for 4.6 later.
Re: RAID5 Unable to remove Failing HD
Rene, Thanks for the report. Fixes are in the following patch sets.

concern1: Btrfs to fail/offline a device for write/flush error: [PATCH 00/15] btrfs: Hot spare and Auto replace
concern2: User should be able to delete a device when device has failed: [PATCH 0/7] Introduce device delete by devid

If you were able to try out these patches, pls let us know. Thanks, Anand

On 02/10/2016 03:17 PM, Rene Castberg wrote: Hi, This morning I woke up to a failing disk:

[230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush 503, corrupt 0, gen 0
[230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush 503, corrupt 0, gen 0
[230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
[230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
[230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush 503, corrupt 0, gen 0
[230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush 503, corrupt 0, gen 0
[230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
[230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush 503, corrupt 0, gen 0
[230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush 503, corrupt 0, gen 0
[230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush 503, corrupt 0, gen 0
[230802.000425] nfsd: last server has exited, flushing export cache
[230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush 503, corrupt 0, gen 0
[230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush 503, corrupt 0, gen 0
[230898.793723] BTRFS: lost page write due to
I/O error on /dev/sdc
[230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush 503, corrupt 0, gen 0
[230898.830893] BTRFS info (device sdd): allowing degraded mounts
[230898.830902] BTRFS info (device sdd): disk space caching is enabled

Eventually I remounted it as degraded, hopefully to prevent any loss of data. It seems that the btrfs filesystem still hasn't noticed that the disk has failed:

$ btrfs fi show
Label: 'RenesData' uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
Total devices 5 FS bytes used 5.38TiB
devid 1 size 2.73TiB used 1.84TiB path /dev/sdb
devid 2 size 2.73TiB used 1.84TiB path /dev/sde
devid 3 size 3.64TiB used 1.84TiB path /dev/sdf
devid 4 size 2.73TiB used 1.84TiB path /dev/sdd
devid 5 size 3.64TiB used 1.84TiB path /dev/sdc

I tried deleting the device:

# btrfs device delete /dev/sdc /mnt2/RenesData/
ERROR: error removing device '/dev/sdc': Invalid argument

I have been unlucky and already had a failure last Friday, where a RAID5 array failed after a disk failure. I rebooted, and the data was unrecoverable. Fortunately this was only temp data, so the failure wasn't a real issue. Can somebody give me some advice on how to delete the failing disk? I plan on replacing the disk, but unfortunately the system doesn't have hotplug, so I will need to shut down to replace the disk without losing any of the data stored on these devices. 
Regards Rene Castberg

# uname -a
Linux midgard 4.3.3-1.el7.elrepo.x86_64 #1 SMP Tue Dec 15 11:18:19 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@midgard ~]# btrfs --version
btrfs-progs v4.3.1
[root@midgard ~]# btrfs fi df /mnt2/RenesData/
Data, RAID6: total=5.52TiB, used=5.37TiB
System, RAID6: total=96.00MiB, used=480.00KiB
Metadata, RAID6: total=17.53GiB, used=11.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
# btrfs device stats /mnt2/RenesData/
[/dev/sdb].write_io_errs 0
[/dev/sdb].read_io_errs 0
[/dev/sdb].flush_io_errs 0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sde].write_io_errs 0
[/dev/sde].read_io_errs 0
[/dev/sde].flush_io_errs 0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs 0
[/dev/sdf].read_io_errs 0
[/dev/sdf].flush_io_errs 0
[/dev/sdf].corruption_errs 0
[/dev/sdf].generation_errs 0
[/dev/sdd].write_io_errs 0
[/dev/sdd].read_io_errs 0
[/dev/sdd].flush_io_errs 0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sdc].write_io_errs 1583
[/dev/sdc].read_io_errs 45652
[/dev/sdc].flush_io_errs 503
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
Hey btrfs-folks, I did a bit of digging using "perf":

1) * "perf stat -B -p 3933 sleep 60"
   * "perf stat -e 'btrfs:*' -a sleep 60"
   -> http://fpaste.org/320718/10016145/

2) * "perf record -e block:block_rq_issue -ag" for about 30 seconds:
   -> http://fpaste.org/320719/51101751/raw/

3) * perf top
   -> http://fpaste.org/320720/45511028/

Regards Christian
A letter from Ekaterina
Hello. Just as you are seeing this message, other people will be able to see your letter in the same way. Prices from 1500. -- Best regards, manager Ekaterina. Cell: 7961136 3521
Business Partnership
Hello, I am Mr. LAURENT EYADEMA from the Republic of Togo. Please read the attached proposal. Thanks in anticipation of your urgent response, LAURENT EYADEMA proposal.docx Description: Binary data