[PATCH] fstests: btrfs, test directory fsync after deleting snapshots

2016-02-10 Thread fdmanana
From: Filipe Manana 

Test that if we fsync a directory that had a snapshot entry in it that
was deleted and crash, the next time we mount the filesystem, the log
replay procedure will not fail and the snapshot is not present anymore.

This issue is fixed by the following patch for the linux kernel:

  "Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"

Signed-off-by: Filipe Manana 
---
 tests/btrfs/118 | 86 +
 tests/btrfs/118.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 89 insertions(+)
 create mode 100755 tests/btrfs/118
 create mode 100644 tests/btrfs/118.out

diff --git a/tests/btrfs/118 b/tests/btrfs/118
new file mode 100755
index 000..3ed1cbe
--- /dev/null
+++ b/tests/btrfs/118
@@ -0,0 +1,86 @@
+#! /bin/bash
+# FSQA Test No. 118
+#
+# Test that if we fsync a directory that had a snapshot entry in it that was
+# deleted and crash, the next time we mount the filesystem, the log replay
+# procedure will not fail and the snapshot is not present anymore.
+#
+#---
+#
+# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create a snapshot at the root of our filesystem (mount point path), delete it,
+# fsync the mount point path, crash and mount to replay the log. This should
+# succeed and after the filesystem is mounted the snapshot should not be visible
+# anymore.
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
+_flakey_drop_and_remount
+[ -e $SCRATCH_MNT/snap1 ] && echo "Snapshot snap1 still exists after log replay"
+
+# Similar scenario as above, but this time the snapshot is created inside a
+# directory and not directly under the root (mount point path).
+mkdir $SCRATCH_MNT/testdir
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
+_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
+_flakey_drop_and_remount
+[ -e $SCRATCH_MNT/testdir/snap2 ] && \
+   echo "Snapshot snap2 still exists after log replay"
+
+_unmount_flakey
+
+echo "Silence is golden"
+
+status=0
+exit
diff --git a/tests/btrfs/118.out b/tests/btrfs/118.out
new file mode 100644
index 000..3daed86
--- /dev/null
+++ b/tests/btrfs/118.out
@@ -0,0 +1,2 @@
+QA output created by 118
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index f74ffbb..a2fa412 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -118,3 +118,4 @@
 115 auto qgroup
 116 auto quick metadata
 117 auto quick send clone
+118 auto quick snapshot metadata
-- 
2.7.0.rc3



Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions

2016-02-10 Thread Austin S. Hemmelgarn

On 2016-02-09 15:39, Chris Murphy wrote:

On Fri, Feb 5, 2016 at 12:36 PM, Mackenzie Meyer  wrote:



RAID 6 write holes?


I don't even understand the nature of the write hole on Btrfs. If
modification is still always COW, then either an fs block, a strip, or
whole stripe write happens, I'm not sure where the hole comes from. It
suggests some raid56 writes are not atomic.
It's an issue of torn writes in this case, not of atomicity of BTRFS.
Disks can't atomically write more than sector-size chunks, which means
that almost all BTRFS filesystems are doing writes that disks can't
atomically complete.  Add to that that we serialize writes to different
devices, and it becomes trivial to lose some data if the system crashes
while BTRFS is writing out a stripe (it shouldn't screw up existing data
though, you'll just lose whatever you were trying to write).


One way to minimize this which would also boost performance on slow 
storage would be to avoid writing parts of the stripe that aren't 
changed (so for example, if only one disk in the stripe actually has 
changed data, only write that and the parities).
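
As a rough illustration of why the parity still has to be rewritten even
for a single-strip update: RAID5 parity is just the XOR of the data strips,
so a sub-stripe update boils down to new_parity = old_parity XOR old_data
XOR new_data. A minimal sketch in shell arithmetic (the byte values are
made up, purely illustrative):

  # toy RAID5 stripe: three data strips and one parity strip
  d0=0xA5; d1=0x3C; d2=0x0F
  p=$(( d0 ^ d1 ^ d2 ))          # full-stripe parity
  new_d1=0x77                    # only strip 1 changes
  new_p=$(( p ^ d1 ^ new_d1 ))   # read-modify-write of the parity alone
  # sanity check: recomputing parity from scratch gives the same value
  [ "$new_p" -eq $(( d0 ^ new_d1 ^ d2 )) ] && echo "parity update consistent"

If a crash lands between writing new_d1 and new_p, the stripe is left with
new data but stale parity, which is exactly the torn-write window discussed
above.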


If you're worried about raid56 write holes, then a.) you need a server
running this raid where power failures or crashes don't happen b.)
don't use raid56 c.) use ZFS.
It's not just BTRFS that has this issue though, ZFS does too, it just 
recovers more gracefully than BTRFS does, and even with the journaled 
RAID{5,6} support that's being added in MDRAID (and by extension DM-RAID 
and therefore LVM), it still has the same issue, it just moves it 
elsewhere (in this case, it has problems if there's a torn write to the 
journal).



RAID 6 stability?
Any articles I've tried looking for online seem to be from early 2014,
I can't find anything recent discussing the stability of RAID 5 or 6.
Are there or have there recently been any data corruption bugs which
impact RAID 6? Would you consider RAID 6 safe/stable enough for
production use?


It's not stable for your use case, if you have to ask others if it's
stable enough for your use case. Simple as that. Right now some raid6
users are experiencing remarkably slow balances, on the order of
weeks. If device replacement rebuild times are that long, I'd say it's
disqualifying for most any use case, just because there are
alternatives that have better fail over behavior than this. So far
there's no word from any developers what the problem might be, or
where to gather more information. So chances are they're already aware
of it but haven't reproduced it, or isolated it, or have a fix for it
yet.
Seconding this; we should probably put something similar on the wiki, 
and this really applies to any feature, not just raid56.



Do you still strongly recommend backups, or has stability reached a
point where backups aren't as critical? I'm thinking from a data
consistency standpoint, not a hardware failure standpoint.


You can't separate them. On completely stable hardware, stem to stern,
you'd have no backups, no Btrfs or ZFS, you'd just run linear/concat
arrays with XFS, for example. So you can't just hand wave the hardware
part away. There are bugs in the entire storage stack, there are
connectors that can become intermittent, the system could crash. All
of these affect data consistency.
I may be wrong, but I believe the intent of this question was to try and 
figure out how likely BTRFS itself is to cause crashes or data 
corruption, independent of the hardware. In other words, 'Do I need to 
worry significantly about BTRFS in planning for disaster recovery, or 
can I focus primarily on the hardware itself?' or 'Is the most likely 
failure mode going to be hardware failure, or software?'. In general, 
right now I'd say that with BTRFS in a traditional multi-device setup 
(nothing more than raid1 or possibly raid10), you've got roughly a 50% 
chance of an arbitrary crash being a software issue instead of hardware. 
Single disk, I'd say it's probably closer to 25%, and raid56 I'd say 
it's probably closer to 75%. By comparison, I'd say that with ZFS it's 
maybe a 5% chance (ZFS is developed as enterprise level software, it has 
to work, period), and with XFS on LVM raid, probably about 15% (similar 
to ZFS, XFS is supposed to be enterprise level software, the difference 
here comes from LVM, which has had some interesting issues recently due 
to incomplete testing of certain things before they got pushed upstream).


Stability has not reached a point where backups aren't as critical. I
don't really even know what that means though. Whether it's Btrfs or not,
you need to be doing backups such that if the primary stack is a 100%
loss without notice, it is not a disaster. Plan on having to use it. If
you don't like the sound of that, look elsewhere.
What you're using has an impact on how you need to do backups.  For someone 
who can afford long periods of down time, for example, it may be 
perfectly fine to use something like Amazon S3 Glacier storage 

[PATCH] Btrfs: fix unreplayable log after snapshot delete + parent dir fsync

2016-02-10 Thread fdmanana
From: Filipe Manana 

If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.

For example:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
  $ btrfs subvolume delete /mnt/testdir/snap
  $ xfs_io -c "fsync" /mnt/testdir
  < crash / power failure and reboot >
  $ mount /dev/sdc /mnt
  mount: mount(2) failed: No such file or directory

And in dmesg/syslog we get the following message and trace:

[192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, 
inode 257 parent 257
[192066.363010] [ cut here ]
[192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 
__btrfs_unlink_inode+0x17a/0x354 [btrfs]()
[192066.367250] BTRFS: Transaction aborted (error -2)
[192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic 
xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 
tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport 
i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 
ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix 
libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last 
unloaded: btrfs]
[192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: GW   
4.4.0-rc6-btrfs-next-20+ #1
[192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by 
qemu-project.org 04/01/2014
[192066.380889]   880143923670 81257570 
8801439236b8
[192066.382561]  8801439236a8 8104ec07 a039dc2c 
fffe
[192066.384191]  8801ed31d000 8801b9fc9c88 8801086875e0 
880143923710
[192066.385827] Call Trace:
[192066.386373]  [] dump_stack+0x4e/0x79
[192066.387387]  [] warn_slowpath_common+0x99/0xb2
[192066.388429]  [] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.389236]  [] warn_slowpath_fmt+0x48/0x50
[192066.389884]  [] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
[192066.390621]  [] ? iput+0xb0/0x266
[192066.391200]  [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
[192066.391930]  [] check_item_in_log+0x1fe/0x29b [btrfs]
[192066.392715]  [] replay_dir_deletes+0x167/0x1cf [btrfs]
[192066.393510]  [] replay_one_buffer+0x417/0x570 [btrfs]
[192066.394241]  [] walk_up_log_tree+0x10e/0x1dc [btrfs]
[192066.394958]  [] walk_log_tree+0xa5/0x190 [btrfs]
[192066.395628]  [] btrfs_recover_log_trees+0x239/0x32c 
[btrfs]
[192066.396790]  [] ? replay_one_extent+0x50a/0x50a [btrfs]
[192066.397891]  [] open_ctree+0x1d8b/0x2167 [btrfs]
[192066.398897]  [] btrfs_mount+0x5ef/0x729 [btrfs]
[192066.399823]  [] ? trace_hardirqs_on+0xd/0xf
[192066.400739]  [] ? lockdep_init_map+0xb9/0x1b3
[192066.401700]  [] mount_fs+0x67/0x131
[192066.402482]  [] vfs_kern_mount+0x6c/0xde
[192066.403930]  [] btrfs_mount+0x1cb/0x729 [btrfs]
[192066.404831]  [] ? trace_hardirqs_on+0xd/0xf
[192066.405726]  [] ? lockdep_init_map+0xb9/0x1b3
[192066.406621]  [] mount_fs+0x67/0x131
[192066.407401]  [] vfs_kern_mount+0x6c/0xde
[192066.408247]  [] do_mount+0x893/0x9d2
[192066.409047]  [] ? strndup_user+0x3f/0x8c
[192066.409842]  [] SyS_mount+0x75/0xa1
[192066.410621]  [] entry_SYSCALL_64_fastpath+0x12/0x6b
[192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
[192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: 
errno=-2 No such entry
[192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 
No such entry (Failed to recover log tree)
[192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned 
-30
[192066.444613] BTRFS: open_ctree failed

This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.
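
(For anyone who wants to see the item in question, a rough sketch, with the
tree dump invocation and output shape written from memory and therefore only
illustrative: dumping the fs tree (tree id 5) of the unmounted device after
the simulated crash should show the directory's DIR_ITEM whose location key
uses the ROOT_ITEM type rather than INODE_ITEM, e.g.

  btrfs-debug-tree -t 5 /dev/sdc | grep -A 2 'DIR_ITEM'
  # illustrative shape of the interesting entry:
  #   location key (257 ROOT_ITEM 18446744073709551615) type DIR

where the objectid would be the snapshot's root id, not an inode number.)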

So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Some alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree

Re: RAID5 Unable to remove Failing HD

2016-02-10 Thread Rene Castberg
Anand, thanks for the tip. What kernels are these meant for? I am not
able to apply these cleanly to the kernels i have tried. Or is there a
kernel with these incorporated?

I have tried rebooting without the disk attached and am unable to
mount the partition. Complaining about bad tree and
failed to read chunk. So at the moment the disk is still readable,
though not sure how long that will last.

I have posted a copy of my messages log, only the last couple of days.
https://www.dropbox.com/s/9f05e1q5w4zkp38/messages_trimmed2?dl=0

If you or anybody else has some tips i would appreciate it.

Regards

On 10 February 2016 at 17:58, Rene Castberg  wrote:
> Anand, thanks for the tip. What kernels are these meant for? I am not able
> to apply these cleanly to the kernels i have tried. Or is there a kernel
> with these incorporated?
>
> I have tried rebooting without the disk attached and am unable to mount the
> partition. Complaining about bad tree and
> failed to read chunk. So at the moment the disk is still readable, though
> not sure how long that will last.
>
> I have posted a copy of my messages log, only the last couple of days.
> https://www.dropbox.com/s/9f05e1q5w4zkp38/messages_trimmed2?dl=0
>
> If you or anybody else has some tips i would appreciate it.
>
> Regards
>
> Rene Castberg
>
> On 10 February 2016 at 10:00, Anand Jain  wrote:
>>
>>
>>
>> Rene,
>>
>> Thanks for the report. Fixes are in the following patch sets
>>
>>  concern1:
>>  Btrfs to fail/offline a device for write/flush error:
>>[PATCH 00/15] btrfs: Hot spare and Auto replace
>>
>>  concern2:
>>  User should be able to delete a device when device has failed:
>>[PATCH 0/7] Introduce device delete by devid
>>
>>  If you were able to tryout these patches, pls lets know.
>>
>> Thanks, Anand
>>
>>
>>
>> On 02/10/2016 03:17 PM, Rene Castberg wrote:
>>>
>>> Hi,
>>>
>>> This morning i woke up to a failing disk:
>>>
>>> [230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush
>>> 503, corrupt 0, gen 0
>>> [230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush
>>> 503, corrupt 0, gen 0
>>> [230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
>>> [230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush
>>> 503, corrupt 0, gen 0
>>> [230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
>>> [230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230802.000425] nfsd: last server has exited, flushing export cache
>>> [230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230898.793723] BTRFS: lost page write due to I/O error on /dev/sdc
>>> [230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush
>>> 503, corrupt 0, gen 0
>>> [230898.830893] BTRFS info (device sdd): allowing degraded mounts
>>> [230898.830902] BTRFS info (device sdd): disk space caching is enabled
>>>
>>> Eventually i remounted it as degraded, hopefully to prevent any loss of
>>> data.
>>>
>>> It seems that the btrfs filesystem still hasn't noticed that the disk
>>> has failed:
>>> $btrfs fi show
>>> Label: 'RenesData'  uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
>>>  Total devices 5 FS bytes used 5.38TiB
>>>  devid1 size 2.73TiB used 1.84TiB path /dev/sdb
>>>  devid2 size 2.73TiB used 1.84TiB path /dev/sde
>>>  devid3 size 3.64TiB used 1.84TiB path /dev/sdf
>>>  devid4 size 2.73TiB used 1.84TiB path /dev/sdd
>>>  devid5 size 3.64TiB used 1.84TiB path /dev/sdc
>>>
>>> I tried deleting the device:
>>> # btrfs device delete /dev/sdc /mnt2/RenesData/
>>> ERROR: error removing device '/dev/sdc': Invalid argument
>>>
>>> I have been unlucky and already had a failure last friday, where a
>>> RAID5 array failed after a disk failure.  I rebooted, and the data was
>>> unrecoverable. Fortunately this was only temp data so the failure
>>> wasn't a real issue.
>>>
>>> Can somebody give me some advice how to delete the failing disk? I
>>> 

Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-10 Thread Chris Murphy
http://fpaste.org/320720/45511028/

What is rb_next? See if you can expand that out and find out more
about why so much time is being spent in it. I see that rb_next
gets used for lots of things, including btrfs. In mine, rb_next is
less than 1% overhead, but for you it's the top item. That's
suspicious.
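
A hedged way to dig into that, assuming the profile came from perf (adjust
if you used something else): record with call graphs while the balance is
running and then look at rb_next's callers in the report.

  perf record -a -g -- sleep 30          # sample the whole system for ~30s
  perf report --stdio > perf-report.txt  # call chains are included since -g was used
  grep -A 20 ' rb_next' perf-report.txt | head -60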


http://fpaste.org/320718/10016145/
line 72-73. We both have counts for qgroup stuff. Mine is much much
less than yours. I have never had quotas enabled on any of my
filesystems, so I don't know why there are any such counts at all. But
since your values are nearly three orders of magnitude greater than
mine, I have to ask if you have quotas enabled or have ever had them
enabled? That might be a factor here...
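
A quick, hedged way to check (commands as in recent btrfs-progs; exact
output varies by version):

  # errors out if quotas were never enabled on the filesystem
  btrfs qgroup show /path/to/mountpoint
  # if they are enabled and unwanted, they can be turned off with:
  #   btrfs quota disable /path/to/mountpoint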



Chris Murphy


Re: [PATCH] fstests: btrfs, test directory fsync after deleting snapshots

2016-02-10 Thread Liu Bo
On Wed, Feb 10, 2016 at 01:32:51PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that if we fsync a directory that had a snapshot entry in it that
> was deleted and crash, the next time we mount the filesystem, the log
> replay procedure will not fail and the snapshot is not present anymore.
> 
> This issue is fixed by the following patch for the linux kernel:
> 
>   "Btrfs: fix unreplayable log after snapshot delete + parent dir fsync"
> 

Tested-by: Liu Bo 
Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Filipe Manana 
> ---
>  tests/btrfs/118 | 86 
> +
>  tests/btrfs/118.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 89 insertions(+)
>  create mode 100755 tests/btrfs/118
>  create mode 100644 tests/btrfs/118.out
> 
> diff --git a/tests/btrfs/118 b/tests/btrfs/118
> new file mode 100755
> index 000..3ed1cbe
> --- /dev/null
> +++ b/tests/btrfs/118
> @@ -0,0 +1,86 @@
> +#! /bin/bash
> +# FSQA Test No. 118
> +#
> +# Test that if we fsync a directory that had a snapshot entry in it that was
> +# deleted and crash, the next time we mount the filesystem, the log replay
> +# procedure will not fail and the snapshot is not present anymore.
> +#
> +#---
> +#
> +# Copyright (C) 2016 SUSE Linux Products GmbH. All Rights Reserved.
> +# Author: Filipe Manana 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + _cleanup_flakey
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_need_to_be_root
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +_require_metadata_journaling $SCRATCH_DEV
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_init_flakey
> +_mount_flakey
> +
> +# Create a snapshot at the root of our filesystem (mount point path), delete it,
> +# fsync the mount point path, crash and mount to replay the log. This should
> +# succeed and after the filesystem is mounted the snapshot should not be visible
> +# anymore.
> +# anymore.
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
> +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
> +_flakey_drop_and_remount
> +[ -e $SCRATCH_MNT/snap1 ] && echo "Snapshot snap1 still exists after log replay"
> +
> +# Similar scenario as above, but this time the snapshot is created inside a
> +# directory and not directly under the root (mount point path).
> +mkdir $SCRATCH_MNT/testdir
> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
> +_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
> +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
> +_flakey_drop_and_remount
> +[ -e $SCRATCH_MNT/testdir/snap2 ] && \
> + echo "Snapshot snap2 still exists after log replay"
> +
> +_unmount_flakey
> +
> +echo "Silence is golden"
> +
> +status=0
> +exit
> diff --git a/tests/btrfs/118.out b/tests/btrfs/118.out
> new file mode 100644
> index 000..3daed86
> --- /dev/null
> +++ b/tests/btrfs/118.out
> @@ -0,0 +1,2 @@
> +QA output created by 118
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index f74ffbb..a2fa412 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -118,3 +118,4 @@
>  115 auto qgroup
>  116 auto quick metadata
>  117 auto quick send clone
> +118 auto quick snapshot metadata
> -- 
> 2.7.0.rc3
> 

Re: [PATCH 1/2] btrfs-progs: copy functionality of btrfs-debug-tree to inspect-internal subcommand

2016-02-10 Thread David Sterba
On Tue, Feb 09, 2016 at 05:12:08PM +0100, Alexander Fougner wrote:
> The long-term plan is to merge the features of standalone tools
> into the btrfs binary, reducing the number of shipped binaries.
> 
> Signed-off-by: Alexander Fougner 

Applied, thanks.


Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions

2016-02-10 Thread Chris Murphy
On Wed, Feb 10, 2016 at 6:57 AM, Austin S. Hemmelgarn
 wrote:

> It's an issue of torn writes in this case, not of atomicity of BTRFS. Disks
> can't atomically write more than sector-size chunks, which means that almost
> all BTRFS filesystems are doing writes that disks can't atomically complete.
> Add to that that we serialize writes to different devices, and it becomes
> trivial to lose some data if the system crashes while BTRFS is writing out a
> stripe (it shouldn't screw up existing data though, you'll just lose
> whatever you were trying to write).

I follow all of this. I still don't know how a torn write leads to a
write hole in the conventional sense though. If the write is partial,
a pointer never should have been written to that unfinished write. So
the pointer that's there after a crash should either point to the old
stripe or new stripe (which includes parity), not to the new data
strips but an old (stale) parity strip for that partial stripe write
that was interrupted. It's easy to see how conventional raid gets this
wrong because it has no pointers to strips, those locations are known
due to the geometry (raid level, layout, number of devices) and fixed.
I don't know what rmw looks like on Btrfs raid56 without overwriting
the stripe - a whole new cow'd stripe, and then metadata is updated to
reflect the new location of that stripe?




> One way to minimize this which would also boost performance on slow storage
> would be to avoid writing parts of the stripe that aren't changed (so for
> example, if only one disk in the stripe actually has changed data, only
> write that and the parities).

I'm pretty sure that's part of rmw, which is not a full stripe write.
At least there appears to be some distinction in raid56.c between
them. The additional optimization that md raid has had for some time
is the ability during rmw of a single data chunk (what they call
strips, or the smallest unit in a stripe), they can actually optimize
the change down to a sector write. So they aren't even doing full
chunk/strip writes either. The parity strip though I think must be
completely rewritten.


>>
>>
>> If you're worried about raid56 write holes, then a.) you need a server
>> running this raid where power failures or crashes don't happen b.)
>> don't use raid56 c.) use ZFS.
>
> It's not just BTRFS that has this issue though, ZFS does too,

Well it's widely considered to not have the write hole. From a ZFS
conference I got this tidbit on how they closed the write hole, but I
still don't understand why they'd be pointing to a partial (torn)
write in the first place:

"key insight was realizing instead of treating a stripe as it's a
"stripe of separate blocks" you can take a block and break it up into
many sectors and have a stripe across the sectors that is of one logic
block, that eliminates the write hole because even if the write is
partial until all of those writes are complete there's not going to be
an uber block referencing any of that." –Bonwick
https://www.youtube.com/watch?v=dcV2PaMTAJ4
14:45


> What you're using has an impact on how you need to do backups.  For someone who
> can afford long periods of down time for example, it may be perfectly fine
> to use something like Amazon S3 Glacier storage (which has a 4 hour lead
> time on restoration for read access) for backups. OTOH, if you can't afford
> more than a few minutes of down time and want to use BTRFS, you should
> probably have full on-line on-site backups which you can switch in on a
> moments notice while you fix things.

Right or use glusterfs or ceph if you need to stay up and running
during a total brick implosion. Quite honestly, I would much rather
see Btrfs single support multiple streams per device, like XFS does
with allocation groups when used on linear/concat of multiple devices;
two to four per



-- 
Chris Murphy


Re: [PATCH] Btrfs: fix unreplayable log after snapshot delete + parent dir fsync

2016-02-10 Thread Liu Bo
On Wed, Feb 10, 2016 at 01:30:48PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If we delete a snapshot, fsync its parent directory and crash/power fail
> before the next transaction commit, on the next mount when we attempt to
> replay the log tree of the root containing the parent directory we will
> fail and prevent the filesystem from mounting, which is solvable by wiping
> out the log trees with the btrfs-zero-log tool but very inconvenient as
> we will lose any data and metadata fsynced before the parent directory
> was fsynced.
> 
> For example:
> 
>   $ mkfs.btrfs -f /dev/sdc
>   $ mount /dev/sdc /mnt
>   $ mkdir /mnt/testdir
>   $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
>   $ btrfs subvolume delete /mnt/testdir/snap
>   $ xfs_io -c "fsync" /mnt/testdir
>   < crash / power failure and reboot >
>   $ mount /dev/sdc /mnt
>   mount: mount(2) failed: No such file or directory
> 
> And in dmesg/syslog we get the following message and trace:
> 
> [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, 
> inode 257 parent 257
> [192066.363010] [ cut here ]
> [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 
> __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
> [192066.367250] BTRFS: Transaction aborted (error -2)
> [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev 
> sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq 
> tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 
> psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper 
> button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic 
> virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod 
> e1000 virtio floppy [last unloaded: btrfs]
> [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: GW   
> 4.4.0-rc6-btrfs-next-20+ #1
> [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> by qemu-project.org 04/01/2014
> [192066.380889]   880143923670 81257570 
> 8801439236b8
> [192066.382561]  8801439236a8 8104ec07 a039dc2c 
> fffe
> [192066.384191]  8801ed31d000 8801b9fc9c88 8801086875e0 
> 880143923710
> [192066.385827] Call Trace:
> [192066.386373]  [] dump_stack+0x4e/0x79
> [192066.387387]  [] warn_slowpath_common+0x99/0xb2
> [192066.388429]  [] ? __btrfs_unlink_inode+0x17a/0x354 
> [btrfs]
> [192066.389236]  [] warn_slowpath_fmt+0x48/0x50
> [192066.389884]  [] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
> [192066.390621]  [] ? iput+0xb0/0x266
> [192066.391200]  [] btrfs_unlink_inode+0x1c/0x3d [btrfs]
> [192066.391930]  [] check_item_in_log+0x1fe/0x29b [btrfs]
> [192066.392715]  [] replay_dir_deletes+0x167/0x1cf [btrfs]
> [192066.393510]  [] replay_one_buffer+0x417/0x570 [btrfs]
> [192066.394241]  [] walk_up_log_tree+0x10e/0x1dc [btrfs]
> [192066.394958]  [] walk_log_tree+0xa5/0x190 [btrfs]
> [192066.395628]  [] btrfs_recover_log_trees+0x239/0x32c 
> [btrfs]
> [192066.396790]  [] ? replay_one_extent+0x50a/0x50a [btrfs]
> [192066.397891]  [] open_ctree+0x1d8b/0x2167 [btrfs]
> [192066.398897]  [] btrfs_mount+0x5ef/0x729 [btrfs]
> [192066.399823]  [] ? trace_hardirqs_on+0xd/0xf
> [192066.400739]  [] ? lockdep_init_map+0xb9/0x1b3
> [192066.401700]  [] mount_fs+0x67/0x131
> [192066.402482]  [] vfs_kern_mount+0x6c/0xde
> [192066.403930]  [] btrfs_mount+0x1cb/0x729 [btrfs]
> [192066.404831]  [] ? trace_hardirqs_on+0xd/0xf
> [192066.405726]  [] ? lockdep_init_map+0xb9/0x1b3
> [192066.406621]  [] mount_fs+0x67/0x131
> [192066.407401]  [] vfs_kern_mount+0x6c/0xde
> [192066.408247]  [] do_mount+0x893/0x9d2
> [192066.409047]  [] ? strndup_user+0x3f/0x8c
> [192066.409842]  [] SyS_mount+0x75/0xa1
> [192066.410621]  [] entry_SYSCALL_64_fastpath+0x12/0x6b
> [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
> [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: 
> errno=-2 No such entry
> [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 
> No such entry (Failed to recover log tree)
> [192066.415458] BTRFS error (device dm-0): cleaner transaction attach 
> returned -30
> [192066.444613] BTRFS: open_ctree failed
> 
> This happens because when we are replaying the log and processing the
> directory entry pointing to the snapshot in the subvolume tree, we treat
> its btrfs_dir_item item as having a location with a key type matching
> BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
> BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
> object id refers to a root number and not to an inode in the root
> containing the parent directory.
> 
> So fix this by triggering a transaction commit if an fsync against the
> parent directory is requested after deleting a snapshot. This is the
> simplest approach for a rare use case. Some alternative that avoids the
> transaction 

[LFS/MM TOPIC] fs reflink issues, fs online scrub/check, etc

2016-02-10 Thread Darrick J. Wong
[resend, email exploded, sorry...]

Hi,

I want to discuss a few FS related topics that I haven't already seen on
the mailing lists:

 * Shared pagecache pages for reflinked files (and by extension making dax
   work with reflink on xfs)

 * Providing a simple interface for scrubbing filesystem metadata in the
   background (the online check thing).  Ideally we'd make it easy to discover
   what kind of metadata there is to check and provide a simple interface to
   check the metadata, once discovered.  This is a tricky interface topic
   since FS design differs pretty widely.

 * Rudimentary online repair and rebuilding (xfs) from secondary metadata

 * Working out the variances in the btrfs/xfs/ocfs2/nfs reflink implementations
   and making sure they all get test coverage

I would also like participate in some of the proposed discussions:

 * The ext4 summit (and whatever meeting of XFS devs may happen)

 * Integrating existing filesystems into pmem, or hallway bofs about designing
   new filesystems for pmem

 * Actually seeing the fs developers (well, everyone!) in person again :)

--Darrick


Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-10 Thread Chris Murphy
Sometimes when things are really slow or even hung up with Btrfs, yet
there's no blocked task being reported, a dev has asked for sysrq+t,
so that might also be something to issue while the slow balance is
happening, and then dmesg to grab the result. The thing is, I have no
idea how to read the output, but maybe if it gets posted up somewhere
we can figure it out. I mean, obviously this is a bug, it shouldn't
take two weeks or more to balance a raid6 volume.
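
Concretely, something along these lines while the balance is crawling (a
hedged sketch, same pattern as the sysrq+w dance used elsewhere in this
thread; note the task dump can be large, so the kernel log buffer may
truncate it):

  echo 1 > /proc/sys/kernel/sysrq     # enable sysrq triggers if not already on
  echo t > /proc/sysrq-trigger        # dump the state of every task
  dmesg > sysrq-t-during-balance.txt  # save the result for posting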

I'd like to think this would have been caught much sooner in
regression testing before it'd be released, so it makes me wonder if
this is an edge case related to hardware, kernel build, or more likely
some state of the affected file systems that the test file systems
aren't in. It might be more helpful to sort through xfstests that call
balance and raid56, and see if there's something that's just not being
tested, but applies to the actual filesystems involved; rather than
trying to decipher kernel output. *shrug*
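
A hedged starting point for that search, assuming an fstests checkout (test
numbering and contents differ between releases):

  cd xfstests
  # btrfs tests that exercise balance...
  grep -il 'balance' tests/btrfs/??? | sort
  # ...and the subset that also touches raid5/raid6 profiles
  grep -il 'balance' tests/btrfs/??? | xargs grep -il 'raid[56]'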


Chris Murphy


Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions

2016-02-10 Thread Austin S. Hemmelgarn

On 2016-02-10 14:06, Chris Murphy wrote:

On Wed, Feb 10, 2016 at 6:57 AM, Austin S. Hemmelgarn
 wrote:


It's an issue of torn writes in this case, not of atomicity of BTRFS. Disks
can't atomically write more than sector-size chunks, which means that almost
all BTRFS filesystems are doing writes that disks can't atomically complete.
Add to that that we serialize writes to different devices, and it becomes
trivial to lose some data if the system crashes while BTRFS is writing out a
stripe (it shouldn't screw up existing data though, you'll just lose
whatever you were trying to write).


I follow all of this. I still don't know how a torn write leads to a
write hole in the conventional sense though. If the write is partial,
a pointer never should have been written to that unfinished write. So
the pointer that's there after a crash should either point to the old
stripe or new stripe (which includes parity), not to the new data
strips but an old (stale) parity strip for that partial stripe write
that was interrupted. It's easy to see how conventional raid gets this
wrong because it has no pointers to strips, those locations are known
due to the geometry (raid level, layout, number of devices) and fixed.
I don't know what rmw looks like on Btrfs raid56 without overwriting
the stripe - a whole new cow'd stripe, and then metadata is updated to
reflect the new location of that stripe?

I agree, it's not technically a write hole in the conventional sense, 
but the terminology has become commonplace for data loss in RAID{5,6} 
due to a failure somewhere in the write path, and this does fit in that 
sense.  In this case the failure is in writing out the metadata that 
references the blocks instead of in writing out the blocks themselves. 
Even though you don't lose any existing data, you still lose anything 
that you were trying to write out.





One way to minimize this which would also boost performance on slow storage
would be to avoid writing parts of the stripe that aren't changed (so for
example, if only one disk in the stripe actually has changed data, only
write that and the parities).


I'm pretty sure that's part of rmw, which is not a full stripe write.
At least there appears to be some distinction in raid56.c between
them. The additional optimization that md raid has had for some time
is the ability during rmw of a single data chunk (what they call
strips, or the smallest unit in a stripe), they can actually optimize
the change down to a sector write. So they aren't even doing full
chunk/strip writes either. The parity strip though I think must be
completely rewritten.
I actually wasn't aware that BTRFS did this (it's been a while since I 
looked at the kernel code), although I'm glad to hear it does.






If you're worried about raid56 write holes, then a.) you need a server
running this raid where power failures or crashes don't happen b.)
don't use raid56 c.) use ZFS.


It's not just BTRFS that has this issue though, ZFS does too,


Well it's widely considered to not have the write hole. From a ZFS
conference I got this tidbit on how they closed the write hole, but I
still don't understand why they'd be pointing to a partial (torn)
write in the first place:

"key insight was realizing instead of treating a stripe as it's a
"stripe of separate blocks" you can take a block and break it up into
many sectors and have a stripe across the sectors that is of one logic
block, that eliminates the write hole because even if the write is
partial until all of those writes are complete there's not going to be
an uber block referencing any of that." –Bonwick
https://www.youtube.com/watch?v=dcV2PaMTAJ4
14:45
Again, a torn write to the metadata referencing the block (stripe in 
this case I believe) will result in losing anything written by the 
update to the stripe.  There is no way that _any_ system can avoid this 
issue without having the ability to truly atomically write out the 
entire metadata tree after the block (stripe) update.  Doing so would 
require a degree of tight hardware level integration that's functionally 
impossible for any general purpose system (in essence, the filesystem 
would have to be implemented in the hardware, not software).




What you're using has an impact on how you need to do backups.  For someone who
can afford long periods of down time for example, it may be perfectly fine
to use something like Amazon S3 Glacier storage (which has a 4 hour lead
time on restoration for read access) for backups. OTOH, if you can't afford
more than a few minutes of down time and want to use BTRFS, you should
probably have full on-line on-site backups which you can switch in on a
moments notice while you fix things.


Right or use glusterfs or ceph if you need to stay up and running
during a total brick implosion. Quite honestly, I would much rather
see Btrfs single support multiple streams per device, like XFS does
with allocation groups when used on linear/concat of 

Re: task btrfs-cleaner:770 blocked for more than 120 seconds.

2016-02-10 Thread Михаил Гаврилов
2016-02-03 9:48 GMT+05:00 Chris Murphy :
> Mike: From your attachment, looks like you rebooted. So do this:
>
> echo 1 > /proc/sys/kernel/sysrq
> Reproduce the problem where you get blocked task messages in dmesg
> echo w > /proc/sysrq-trigger
> journalctl -k > kernel-sysrqw-btrfscleaner770blocked-2.txt
>
>
> Make sure you use the same mount options. Looks like you're using
> autodefrag, and inode_cache. Are there others? And can you say what
> the workload is? Especially because inode_cache is not a default mount
> option and isn't recommended except for certain workloads, but still I
> think it shouldn't hang. But that's a question for Liu Bo.
>
>
> Chris Murphy

Thanks Chris for the clarification.
I do not have an exact way of reproducing this.
But it has happened with my btrfs partition again.

*Hang occurred here*
echo 1 > /proc/sys/kernel/sysrq
Reproduce the problem where you get blocked task messages in dmesg
echo w > /proc/sysrq-trigger
journalctl -k > kernel-sysrqw-btrfscleaner770blocked-2.txt

Here full log: http://btrfs.sy24.ru/kernel-sysrqw-btrfscleaner770blocked-2.txt

I am so sorry if this log is useless.
If "sysrq" needs to be enabled before the hang, then I need to set it
permanently, because as I said I cannot reproduce this reliably.

My mount options:
UUID=82df2d84-bf54-46cb-84ba-c88e93677948 /home btrfs subvolid=5,autodefrag,noatime,space_cache,inode_cache 0 0
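
If it turns out inode_cache is not actually needed for this workload (as
noted above, it is not a default mount option), a possible variant of that
fstab line without it would be:

  UUID=82df2d84-bf54-46cb-84ba-c88e93677948 /home btrfs subvolid=5,autodefrag,noatime,space_cache 0 0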


--
Best Regards,
Mike Gavrilov.


Re: task btrfs-cleaner:770 blocked for more than 120 seconds.

2016-02-10 Thread Chris Murphy
On Wed, Feb 10, 2016 at 1:39 PM, Михаил Гаврилов
 wrote:


>
> Here full log: http://btrfs.sy24.ru/kernel-sysrqw-btrfscleaner770blocked-2.txt
>
> I am so sorry if this log is useless.

Looks good to me. The blocked task happens out of nowhere, with
nothing reported for almost an hour before the blocking. And I see the
sysrq "SysRq : Show Blocked State" was issued and lots of information
is in the file.

> If "sysrq" is needed enabled before hang then I need set this
> permanently because as I said I not having exactly reproducing this.

echo 1 > /proc/sys/kernel/sysrq can happen anytime, it just enables
sysrq triggering functions which on Fedora kernels is not enabled by
default. The main thing is that the echo w to the sysrq trigger needs
to happen at the time of the problem to show the state. You did that.
Let's see what Liu Bo has to say about it.
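
If you want the trigger enabled persistently across reboots anyway, a
hedged sketch (the file name is arbitrary; path convention per
systemd-based distros such as Fedora):

  # make the setting persistent and apply it now
  echo 'kernel.sysrq = 1' > /etc/sysctl.d/90-sysrq.conf
  sysctl --system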


-- 
Chris Murphy


Re: RAID5 Unable to remove Failing HD

2016-02-10 Thread Anand Jain



On 02/11/2016 12:58 AM, Rene Castberg wrote:

Anand, thanks for the tip. What kernels are these meant for? I am not
able to apply these cleanly to the kernels i have tried. Or is there a
kernel with these incorporated?


 As I am trying again, they apply cleanly on v4.4-rc8
 (last commit b82dde0230439215b55e545880e90337ee16f51a)

 You may be missing some unrelated independent patches.
 To make things easier, I have attached here a tar of the patches
 on top of 4.4-rc8; these patches are already on the ML, individually
 and as sets where there are dependencies. Pls apply them in the
 same order as the dir names.


I have tried rebooting without the disk attached and am unable to
mount the partition. Complaining about bad tree and
failed to read chunk. So at the moment the disk is still readable,
though not sure how long that will last.


  Pls physically remove the disk (/dev/sdc), and as you are already
  using -o degraded, pls continue to use it.

  Then you can delete the missing device.
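
  In other words, something along these lines (a rough sketch; adjust the
  mount point to yours):

    # with /dev/sdc physically pulled and the fs mounted -o degraded
    btrfs device delete missing /mnt2/RenesData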

Thanks, Anand



2to5.tar.gz
Description: application/gzip


Re: btrfs-image failure (btrfs-tools 4.4)

2016-02-10 Thread Marc MERLIN
On Tue, Jan 26, 2016 at 09:03:07AM +0800, Qu Wenruo wrote:
> If the fs is small enough, would you please do a btrfs-image dump?
> That would help a lot to locate the direct cause.
 
I started making a dump, image was growing past 3GB, and then it failed
and the image got deleted:

gargamel:~# btrfs-image -s -c 9 /dev/mapper/dshelf1old /mnt/dshelf1/ds1old.dump
Error adding space cache blocks -5
Error flushing pending -5
create failed (Success)

gargamel:~# dpkg --status btrfs-tools
Package: btrfs-tools
Status: install ok installed
Priority: optional
Section: admin
Installed-Size: 3605
Maintainer: Dimitri John Ledkov 
Architecture: amd64
Version: 4.4-1

Is there a 4G file size limit, or did I hit another problem?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.

2016-02-10 Thread Chandan Rajendra
On Wednesday 10 Feb 2016 11:39:25 David Sterba wrote:
> 
> The explanations and the table would be good in the changelog and as
> comments. I think we'll need to consider the smaller blocks more often
> so some examples and locking rules would be useful, eg. documented in
> this file.

David, I agree.  As suggested, I will add the documentation to the commit
message and as comments in the code.

-- 
chandan



Re: btrfs-image failure (btrfs-tools 4.4)

2016-02-10 Thread Qu Wenruo



Marc MERLIN wrote on 2016/02/10 22:31 -0800:

On Tue, Jan 26, 2016 at 09:03:07AM +0800, Qu Wenruo wrote:

If the fs is small enough, would you please do a btrfs-image dump?
That would help a lot to locate the direct cause.


I started making a dump, image was growing past 3GB, and then it failed
and the image got deleted:

gargamel:~# btrfs-image -s -c 9 /dev/mapper/dshelf1old /mnt/dshelf1/ds1old.dump
Error adding space cache blocks -5


It seems that btrfs-image failed to read the space cache, in the
read_data_extent() function.


And since there is no "Couldn't map the block " error message, 
either some device is missing or pread64 failed to read the desired data.



Error flushing pending -5
create failed (Success)

gargamel:~# dpkg --status btrfs-tools
Package: btrfs-tools
Status: install ok installed
Priority: optional
Section: admin
Installed-Size: 3605
Maintainer: Dimitri John Ledkov 
Architecture: amd64
Version: 4.4-1

Is there a 4G file size limit, or did I hit another problem?


For the 4G file size limit, did you mean the limit from an old filesystem 
like FAT32?


I don't think there is such a limit on a modern Linux filesystem, and a 
normal read/write operation won't hit such a limit either.
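
A hedged way to rule the size-limit theory out on the destination
filesystem before digging further (the file name here is arbitrary):

  # create and check a sparse file larger than 4GiB on the destination fs
  truncate -s 5G /mnt/dshelf1/bigfile-test
  stat -c '%s %n' /mnt/dshelf1/bigfile-test
  rm /mnt/dshelf1/bigfile-test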


Thanks,
Qu



Thanks,
Marc






[PULL 4.6] Preparatory work for subpage-blocksize patchset

2016-02-10 Thread David Sterba
Hi,

the preparatory patchset has been split from the core subpage-blocksize so we
can merge it in smaller pieces. It has been pending for a long time and IMHO
should be merged so we can focus on the core patchset.

The branch contains the unmodified v10, partial reviews from Josef and Liu Bo,
tested it by fstests.

Please pull to 4.6.


The following changes since commit e410e34fad913dd568ec28d2a9949694324c14db:

  Revert "btrfs: synchronize incompat feature bits with sysfs files" 
(2016-01-29 08:19:37 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git 
chandan/prep-subpage-blocksize

for you to fetch changes up to 65bfa6580791f8c01fbc9cd8bd73d92aea53723f:

  Btrfs: btrfs_ioctl_clone: Truncate complete page after performing clone 
operation (2016-02-01 19:24:29 +0100)



Chandan Rajendra (12):
  Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to
block size
  Btrfs: Compute and look up csums based on sectorsized blocks
  Btrfs: Direct I/O read: Work on sectorsized blocks
  Btrfs: fallocate: Work with sectorsized blocks
  Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
  Btrfs: Search for all ordered extents that could span across a page
  Btrfs: Use (eb->start, seq) as search key for tree modification log
  Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
  Btrfs: Limit inline extents to root->sectorsize
  Btrfs: Fix block size returned to user space
  Btrfs: Clean pte corresponding to page straddling i_size
  Btrfs: btrfs_ioctl_clone: Truncate complete page after performing
clone operation

 fs/btrfs/ctree.c |  34 +++
 fs/btrfs/ctree.h |   5 +-
 fs/btrfs/extent_io.c |   3 +-
 fs/btrfs/file-item.c |  92 ---
 fs/btrfs/file.c  |  99 
 fs/btrfs/inode.c | 248 ---
 fs/btrfs/ioctl.c |   5 +-
 7 files changed, 321 insertions(+), 165 deletions(-)

-- 
2.6.3



Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions

2016-02-10 Thread Psalle

On 05/02/16 20:36, Mackenzie Meyer wrote:

RAID 6 stability?
I'll say more: currently, btrfs is in a state of flux where, if you don't 
have a very recent kernel, that's the first recommendation you're going 
to receive in case of problems. This means going outside the stable packages 
in most distros.


Once you're on the bleeding edge of kernels, you are obviously more likely 
to run into undiscovered bugs. I even see people here who have to patch 
the kernel with still-non-mainline patches when trying to recover.


So don't use it for anything but testing.


Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.

2016-02-10 Thread David Sterba
On Fri, Jun 19, 2015 at 03:15:01PM +0530, Chandan Rajendra wrote:
> > private->io_lock is not acquired here but not in below.
> > 
> > IIUC, this can be protected by EXTENT_LOCKED.
> >
> 
> private->io_lock plays the same role as BH_Uptodate_Lock (see
> end_buffer_async_read()) i.e. without the io_lock we may end up in the
> following situation,
> 
> NOTE: Assume 64k page size and 4k block size. Also assume that the first 12
> blocks of the page are contiguous while the next 4 blocks are contiguous. When
> reading the page we end up submitting two "logical address space" bios. So
> end_bio_extent_readpage function is invoked twice (once for each bio).
> 
> |-+-+-|
> | Task A  | Task B  | Task C  |
> |-+-+-|
> | end_bio_extent_readpage | | |
> | process block 0 | | |
> | - clear BLK_STATE_IO| | |
> | - page_read_complete| | |
> | process block 1 | | |
> | ... | | |
> | ... | | |
> | ... | end_bio_extent_readpage | |
> | ... | process block 0 | |
> | ... | - clear BLK_STATE_IO| |
> | ... | - page_read_complete| |
> | ... | process block 1 | |
> | ... | ... | |
> | process block 11| process block 3 | |
> | - clear BLK_STATE_IO| - clear BLK_STATE_IO| |
> | - page_read_complete| - page_read_complete| |
> |   - returns true|   - returns true| |
> |   - unlock_page()   | | |
> | | | lock_page() |
> | |   - unlock_page()   | |
> |-+-+-|
> 
> So we end up incorrectly unlocking the page twice and "Task C" ends up working
> on an unlocked page. So private->io_lock makes sure that only one of the tasks
> gets "true" as the return value when page_read_complete() is invoked. As an
> optimization the patch gets the io_lock only when nr_sectors counter reaches
> the value 0 (i.e. when the last block of the bio_vec is being processed).
> Please let me know if my analysis was incorrect.
> 
> Also, I noticed that page_read_complete() and page_write_complete() can be
> replaced by just one function i.e. page_io_complete().

The explanations and the table would be good in the changelog and as
comments. I think we'll need to consider the smaller blocks more often
so some examples and locking rules would be useful, eg. documented in
this file.


Re: [RFC PATCH V11 01/21] Btrfs: subpagesize-blocksize: Fix whole page read.

2016-02-10 Thread David Sterba
On Tue, Jun 23, 2015 at 04:37:48PM +0800, Liu Bo wrote:
...
> > | - clear BLK_STATE_IO| - clear BLK_STATE_IO| |
> > | - page_read_complete| - page_read_complete| |
> > |   - returns true|   - returns true| |
> > |   - unlock_page()   | | |
> > | | | lock_page() |
> > | |   - unlock_page()   | |
> > |-+-+-|
> > 
> > So we end up incorrectly unlocking the page twice and "Task C" ends up 
> > working
> > on an unlocked page. So private->io_lock makes sure that only one of the 
> > tasks
> > gets "true" as the return value when page_read_complete() is invoked. As an
> > optimization the patch gets the io_lock only when nr_sectors counter reaches
> > the value 0 (i.e. when the last block of the bio_vec is being processed).
> > Please let me know if my analysis was incorrect.
> 
> Thanks for the nice explanation, it looks reasonable to me.

Please don't hesitate to add your reviewed-by if you spent time on that
and think it's ok, this really helps to make decisions about merging.


Re: [GIT PULL] Fujitsu for 4.5

2016-02-10 Thread David Sterba
On Wed, Jan 13, 2016 at 05:28:12PM +0800, Zhao Lei wrote:
> This is a collection of bug fixes, enhancements and cleanups from Fujitsu
> against btrfs for v4.5, mainly for reada, plus some small fixes and cleanups
> for scrub and raid56.
> 
> All patches are on the btrfs mailing list, rebased on top of integration-4.5.

I was trying to isolate safe fixes for 4.5, but I saw hangs (the same ones
Chris reported) and was not able to find the right follow-ups.

Can you please collect all the readahead patches you sent recently? I got
lost. Make it a git branch and let me know; I'll add it to for-next and
send a pull request for 4.6 later.


Re: RAID5 Unable to remove Failing HD

2016-02-10 Thread Anand Jain



Rene,

Thanks for the report. Fixes are in the following patch sets:

 Concern 1:
 Btrfs should fail/offline a device on write/flush errors:
   [PATCH 00/15] btrfs: Hot spare and Auto replace

 Concern 2:
 The user should be able to delete a device when the device has failed:
   [PATCH 0/7] Introduce device delete by devid

 If you are able to try out these patches, please let us know.

Thanks, Anand


On 02/10/2016 03:17 PM, Rene Castberg wrote:

Hi,

This morning I woke up to a failing disk:

[230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush
503, corrupt 0, gen 0
[230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush
503, corrupt 0, gen 0
[230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
[230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
[230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush
503, corrupt 0, gen 0
[230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush
503, corrupt 0, gen 0
[230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
[230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush
503, corrupt 0, gen 0
[230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush
503, corrupt 0, gen 0
[230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush
503, corrupt 0, gen 0
[230802.000425] nfsd: last server has exited, flushing export cache
[230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush
503, corrupt 0, gen 0
[230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush
503, corrupt 0, gen 0
[230898.793723] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush
503, corrupt 0, gen 0
[230898.830893] BTRFS info (device sdd): allowing degraded mounts
[230898.830902] BTRFS info (device sdd): disk space caching is enabled

Eventually I remounted it as degraded, hopefully to prevent any loss of data.

It seems that the btrfs filesystem still hasn't noticed that the disk
has failed:
$ btrfs fi show
Label: 'RenesData'  uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
 Total devices 5 FS bytes used 5.38TiB
 devid    1 size 2.73TiB used 1.84TiB path /dev/sdb
 devid    2 size 2.73TiB used 1.84TiB path /dev/sde
 devid    3 size 3.64TiB used 1.84TiB path /dev/sdf
 devid    4 size 2.73TiB used 1.84TiB path /dev/sdd
 devid    5 size 3.64TiB used 1.84TiB path /dev/sdc

I tried deleting the device:
# btrfs device delete /dev/sdc /mnt2/RenesData/
ERROR: error removing device '/dev/sdc': Invalid argument

I have been unlucky and already had a failure last Friday, when a RAID5
array failed after a disk failure. I rebooted, and the data was
unrecoverable. Fortunately it was only temporary data, so the failure
wasn't a real issue.

Can somebody give me some advice on how to delete the failing disk? I plan
on replacing the disk, but unfortunately the system doesn't have hotplug,
so I will need to shut down to replace it without losing any of the data
stored on these devices.

Regards

Rene Castberg

# uname -a
Linux midgard 4.3.3-1.el7.elrepo.x86_64 #1 SMP Tue Dec 15 11:18:19 EST
2015 x86_64 x86_64 x86_64 GNU/Linux
[root@midgard ~]# btrfs --version
btrfs-progs v4.3.1
[root@midgard ~]# btrfs fi df  /mnt2/RenesData/
Data, RAID6: total=5.52TiB, used=5.37TiB
System, RAID6: total=96.00MiB, used=480.00KiB
Metadata, RAID6: total=17.53GiB, used=11.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


# btrfs device stats /mnt2/RenesData/
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs    0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs   0
[/dev/sdf].read_io_errs    0
[/dev/sdf].flush_io_errs   0
[/dev/sdf].corruption_errs 0
[/dev/sdf].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs    0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sdc].write_io_errs   1583
[/dev/sdc].read_io_errs    45652
[/dev/sdc].flush_io_errs   503
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0

Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-10 Thread Christian Rohmann
Hey btrfs-folks,


I did a bit of digging using "perf":


1)
 * "perf stat -B -p 3933 sleep 60"
 * "perf stat -e 'btrfs:*' -a sleep 60"
 -> http://fpaste.org/320718/10016145/



2)
 * "perf record -e block:block_rq_issue -ag" for about 30 seconds:
 -> http://fpaste.org/320719/51101751/raw/


3)
 * "perf top"
 -> http://fpaste.org/320720/45511028/






Regards

Christian


Letter from Ekaterina

2016-02-10 Thread Alevtina Vinogradova
Hello.

Just as you can see this message, other people will be able to see your letter.

Prices from 1500.

--

Best regards, Ekaterina, manager.

Cell: 7961136 3521



Business Partnership

2016-02-10 Thread EYADEMA
Hello,

I am Mr. LAURENT EYADEMA from the Republic of Togo. Please read the attached
proposal.
Thanks in anticipation of your urgent response,


LAURENT EYADEMA

proposal.docx
Description: Binary data

