Re: Chunk root problem

2017-07-06 Thread Daniel Brady
On 7/6/2017 2:26 AM, Duncan wrote:
> Daniel Brady posted on Wed, 05 Jul 2017 22:10:35 -0600 as excerpted:
>
>> My system suddenly decided it did not want to mount my BTRFS setup. I
>> recently rebooted the computer. When it came back, the file system was
>> in read only mode. I gave it another boot, but now it does not want to
>> mount at all. Anything I can do to recover? This is a Rockstor setup
>> that I have had running for about a year.
>>
>> uname -a
>> Linux hobonas 4.10.6-1.el7.elrepo.x86_64 #1 SMP Sun Mar 26
>> 12:19:32 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>>
>> btrfs --version
>> btrfs-progs v4.10.1
>
> FWIW, open ctree failed is the btrfs-generic error, but the transid
> faileds may provide some help.
>
> Addressing the easy answer first...
>
> What btrfs raid mode was it configured for?  If raid56, you want the
> brand new 4.12 kernel at least, as there were serious bugs in previous
> kernels' raid56 mode.  DO NOT ATTEMPT A FIX OF RAID56 MODE WITH AN
> EARLIER KERNEL, IT'S VERY LIKELY TO ONLY CAUSE FURTHER DAMAGE!  But if
> you're lucky, kernel 4.12 can auto-repair it.
>
> With those fixes the known bugs are fixed, but we'll need to wait a few
> cycles to see what the reports are.  Even then, however, due to the
> infamous parity-raid write hole and the fact that the parity isn't
> checksummed, it's not going to be as stable as raid1 or raid10 mode.
> Parity-checksumming will take a new implementation and I'm not sure if
> anyone's actually working on that or not.  But at least until we see how
> stable the newer raid56 code is, 2-4 kernel cycles, it's not recommended
> except for testing only, with even more backups than normal.
>
> If you were raid1 or raid10 mode, the raid mode is stable so it's a
> different issue.  I'll let the experts take it from here.  Single or
> raid0 mode would of course be similar, but without the protection of the
> second copy, making it less resilient.

The raid mode was configured for raid56... unfortunately. I learned of
the potential instability after it died. I have not attempted to repair
it yet because of the possible corruption. I've only tried various ways
of mounting it and dry runs of the restore function.
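
(For reference, attempts of that sort look roughly like the following -- illustrative only, with a placeholder mount point; /dev/sdb is taken from the dmesg below:)

mount -o ro /dev/sdb /mnt                  # plain read-only mount attempt
mount -o ro,usebackuproot /dev/sdb /mnt    # fall back to an older tree root (4.6+ kernels)
btrfs restore -D /dev/sdb /tmp/unused      # dry run of restore: lists what it would copy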

I did as you mentioned and upgraded to kernel 4.12. The auto-repair
seemed to fix quite a few things, but it is not quite there, even after a
few reboots.

uname -r
4.12.0-1.el7.elrepo.x86_64

rpm -qa | grep btrfs
btrfs-progs-4.10.1-0.rockstor.x86_64

dmesg
[   21.400190] BTRFS info (device sdb): use no compression
[   21.400191] BTRFS info (device sdb): disk space caching is enabled
[   21.400192] BTRFS info (device sdb): has skinny extents
[   21.584923] BTRFS info (device sdb): bdev /dev/sde errs: wr 402545,
rd 234683174, flush 194501, corrupt 0, gen 0
[   23.394788] BTRFS error (device sdb): parent transid verify failed on
5257838690304 wanted 591492 found 489231
[   23.416489] BTRFS error (device sdb): parent transid verify failed on
5257838690304 wanted 591492 found 489231
[   23.416524] BTRFS error (device sdb): failed to read block groups: -5
[   23.448478] BTRFS error (device sdb): open_ctree failed

-Dan


[PATCH] Btrfs: fix early ENOSPC due to delalloc

2017-07-06 Thread Omar Sandoval
From: Omar Sandoval 

If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem.

Signed-off-by: Omar Sandoval 
---
I don't have a good reproducer for this except for the btrfs send stream
I was given by someone internally, unfortunately.

 fs/btrfs/extent-tree.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9ea2a..83eecd33ad96 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4776,10 +4776,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
else
flush = BTRFS_RESERVE_NO_FLUSH;
spin_lock(&space_info->lock);
-   if (can_overcommit(root, space_info, orig, flush)) {
-   spin_unlock(&space_info->lock);
-   break;
-   }
if (list_empty(&space_info->tickets) &&
list_empty(&space_info->priority_tickets)) {
spin_unlock(&space_info->lock);
-- 
2.13.2



Re: Chunk root problem

2017-07-06 Thread Roman Mamedov
On Wed, 5 Jul 2017 22:10:35 -0600
Daniel Brady  wrote:

> parent transid verify failed

Typically in Btrfs terms this means "you're screwed": fsck will not fix it, and
nobody will know how to fix it or what the cause is, either. Time to restore from
backups! Or look into "btrfs restore" if you don't have any.
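
A dry run of that kind of recovery might look something like this (device and
target paths are placeholders; this is only a sketch, not a guaranteed procedure):

btrfs restore -D -v /dev/sdb /mnt/recovery   # -D: dry run, just list what would be copied
btrfs-find-root /dev/sdb                     # if the default root is unusable, list candidate roots
# ...then retry restore with "-t <bytenr>" pointing at one of the candidates.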

In your case it's especially puzzling as the difference in transid numbers is
really significant (about 100K), almost like the FS was operating for months
without updating some parts of itself -- and no checksum errors either, so
all looks correct, except that everything is horribly wrong.

This kind of error seems to occur more often in RAID setups, either Btrfs
native RAID or Btrfs on top of other RAID layers -- i.e. where making sure
that writes to multiple devices all complete in order after an unclean
shutdown becomes a complex problem (it is much simpler on a single-device FS).

Also, one of your disks or cables is failing (it was /dev/sde on that boot, but
it may get a different index next boot); check its SMART data and replace it.

> [   21.230919] BTRFS info (device sdf): bdev /dev/sde errs: wr 402545, rd
> 234683174, flush 194501, corrupt 0, gen 0
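
A quick way to follow up on that (device names are examples; smartmontools assumed installed):

smartctl -a /dev/sde           # look at reallocated/pending sectors and CRC error counts
btrfs device stats /mnt/pool   # per-device write/read/flush/corruption error counters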

-- 
With respect,
Roman


Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Omar Sandoval
On Thu, Jul 06, 2017 at 04:59:39PM -0700, Marc MERLIN wrote:
> On Thu, Jul 06, 2017 at 04:44:51PM -0700, Omar Sandoval wrote:
> > In the bug report, you commented that CURRENT contained MANIFEST-010814,
> > is that indeed the case or was it actually something newer? If it was
> > the newer one, then it's still tricky how we'd end up that way but not
> > as outlandish.
> 
> You are correct, my bad.
> At this point, I'm going to have to assume that something bad happened with
> rsync when I rsync'ed an old profile over the one that caused chrome to fail
> to restart.
> Especially because CURRENT was dated Oct 4th, which does not make sense.

Okay, this makes sense.

> Now that I know what to look for, I'll have a much closer look next time
> this happens, with the understanding that it would be a while if I've
> successfully fixed the reason why my laptop was crashing too often.

Sounds good.

> But you said you've also seen issues with google-chrome profile and btrfs.
> What did you experience?

I never debugged it, but I had to blow away the profile a couple of times. Then
there's this weird one which looks like a Btrfs bug:

┌[osandov@vader ~/.config]
└$ ls -al google-chrome-busted/**
ls: cannot access 'google-chrome-busted/Local State': No such file or directory
google-chrome-busted/Default:
ls: cannot access 'google-chrome-busted/Default/Preferences': No such file or 
directory
ls: cannot access 'google-chrome-busted/Default/.com.google.Chrome.VfAUNx': No 
such file or directory
total 0
drwx-- 1 osandov users 12 Feb  7 16:50 .
drwx-- 1 osandov users 14 Feb  7 16:50 ..
-? ? ?   ?  ?? .com.google.Chrome.VfAUNx
-? ? ?   ?  ?? Preferences


Re: ctree.c:197: update_ref_for_cow: BUG_ON `ret` triggered, value -5

2017-07-06 Thread Marc MERLIN
On Thu, Jul 06, 2017 at 10:37:18PM -0700, Marc MERLIN wrote:
> I'm still trying to fix my filesystem.
> It seems to work well enough since the damage is apparently localized, but
> I'd really like check --repair to actually bring it back to a working
> state; now, however, it's crashing.
> 
> This is btrfs tools from git from a few days ago
> 
> Failed to find [4068943577088, 168, 16384]
> btrfs unable to find ref byte nr 4068943577088 parent 0 root 4  owner 1 
> offset 0
> Failed to find [5905106075648, 168, 16384]
> btrfs unable to find ref byte nr 5906282119168 parent 0 root 4  owner 0 
> offset 1
> Failed to find [21037056, 168, 16384]
> btrfs unable to find ref byte nr 21037056 parent 0 root 3  owner 1 offset 0
> Failed to find [21053440, 168, 16384]
> btrfs unable to find ref byte nr 21053440 parent 0 root 3  owner 0 offset 1
> Failed to find [21299200, 168, 16384]
> btrfs unable to find ref byte nr 21299200 parent 0 root 3  owner 0 offset 1
> Failed to find [5523931971584, 168, 16384]
> btrfs unable to find ref byte nr 5524037566464 parent 0 root 3861  owner 3 
> offset 0
> ctree.c:197: update_ref_for_cow: BUG_ON `ret` triggered, value -5
> btrfs(+0x113cf)[0x5651e60443cf]
> btrfs(__btrfs_cow_block+0x576)[0x5651e6045848]
> btrfs(btrfs_cow_block+0xea)[0x5651e6045dc6]
> btrfs(btrfs_search_slot+0x11df)[0x5651e604969d]
> btrfs(+0x59184)[0x5651e608c184]
> btrfs(cmd_check+0x2bd4)[0x5651e60987b3]
> btrfs(main+0x85)[0x5651e60442c3]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f34f523d2b1]
> btrfs(_start+0x2a)[0x5651e6043e3a]

Mmmh, never mind, it seems that the software raid suffered yet another
double disk failure due to some undetermined flakiness in the underlying block
device cabling :-/
That would likely explain the failures here.
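
Before retrying check --repair, it's probably worth confirming the underlying
array and cabling are healthy again; something along these lines (device names
are examples only):

cat /proc/mdstat                 # overall md array state
mdadm --detail /dev/md0          # per-member state, failed/removed counts
smartctl -a /dev/sdc | grep -i -e reallocated -e pending -e crc   # per-disk health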

 
> Full log:
> enabling repair mode
> Checking filesystem on /dev/mapper/dshelf2
> UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
> checking extents
> Fixed 0 roots.
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> checksum verify failed on 3037243965440 found 179689AF wanted 82B97043
> checksum verify failed on 3037243965440 found 179689AF wanted 82B97043
> checksum verify failed on 3037243998208 found 60EA5C5B wanted 0CF5948F
> checksum verify failed on 3037243998208 found 60EA5C5B wanted 0CF5948F
> checksum verify failed on 3037244293120 found 38382803 wanted 39E4F85E
> checksum verify failed on 3037244293120 found 38382803 wanted 39E4F85E
> checksum verify failed on 3037244342272 found E84F1D8F wanted 472DA98C
> checksum verify failed on 3037244342272 found E84F1D8F wanted 472DA98C
> checksum verify failed on 3037244669952 found 2F6E4C0E wanted E00BBF09
> checksum verify failed on 3037244669952 found 2F6E4C0E wanted E00BBF09
> checksum verify failed on 3037248913408 found CE2E4AEE wanted EF22F9CA
> checksum verify failed on 3037248913408 found CE2E4AEE wanted EF22F9CA
> checksum verify failed on 3037248929792 found C989CB0E wanted E27527BC
> checksum verify failed on 3037248929792 found C989CB0E wanted E27527BC
> checksum verify failed on 3037247569920 found 05848C79 wanted EF3D5598
> checksum verify failed on 3037247569920 found 05848C79 wanted EF3D5598
> checksum verify failed on 3037247586304 found 9D1E4E39 wanted F1EC8135
> checksum verify failed on 3037247586304 found 9D1E4E39 wanted F1EC8135
> checksum verify failed on 3037247619072 found BFE40520 wanted 627DB20D
> checksum verify failed on 3037247619072 found BFE40520 wanted 627DB20D
> checksum verify failed on 3037249208320 found A6B5775F wanted B1E6C0FC
> checksum verify failed on 3037249208320 found A6B5775F wanted B1E6C0FC
> checksum verify failed on 3037252534272 found 207AD7DF wanted DE72BDF7
> checksum verify failed on 3037252534272 found 207AD7DF wanted DE72BDF7
> checksum verify failed on 3111569391616 found 3C623707 wanted D955D668
> checksum verify failed on 3111569391616 found 3C623707 wanted D955D668
> checksum verify failed on 3111569768448 found 0C129F3C wanted C509003A
> checksum verify failed on 3111569768448 found 0C129F3C wanted C509003A
> checksum verify failed on 3111569735680 found E94C9D41 wanted 55836DD2
> checksum verify failed on 3111569735680 found E94C9D41 wanted 55836DD2
> checksum verify failed on 3037253435392 found 8E124EB5 wanted A3291C35
> checksum verify failed on 3037253435392 found 8E124EB5 wanted A3291C35
> checksum verify failed on 3037253746688 found 2B6A4DCD wanted 4323B339
> checksum verify failed on 3037253746688 found 2B6A4DCD wanted 4323B339
> checksum verify failed on 3111569702912 found 1048610C wanted 9856BB43
> checksum verify failed on 3111569702912 found 1048610C wanted 9856BB43
> checksum verify failed on 3111569801216 found CD7AAF82 wanted C1DA44DF
> checksum verify failed on 3111569801216 found CD7AAF82 wanted C1DA44DF
> checksum verify failed on 3037251878912 found 86FB02F3 wanted 728772CE
> checksum verify failed on 3037251878912 found 86FB02F3 wanted 728772CE
> checksum verify fail

ctree.c:197: update_ref_for_cow: BUG_ON `ret` triggered, value -5

2017-07-06 Thread Marc MERLIN
I'm still trying to fix my filesystem.
It seems to work well enough since the damage is apparently localized, but
I'd really like check --repair to actually bring it back to a working
state; now, however, it's crashing.

This is btrfs tools from git from a few days ago

Failed to find [4068943577088, 168, 16384]
btrfs unable to find ref byte nr 4068943577088 parent 0 root 4  owner 1 offset 0
Failed to find [5905106075648, 168, 16384]
btrfs unable to find ref byte nr 5906282119168 parent 0 root 4  owner 0 offset 1
Failed to find [21037056, 168, 16384]
btrfs unable to find ref byte nr 21037056 parent 0 root 3  owner 1 offset 0
Failed to find [21053440, 168, 16384]
btrfs unable to find ref byte nr 21053440 parent 0 root 3  owner 0 offset 1
Failed to find [21299200, 168, 16384]
btrfs unable to find ref byte nr 21299200 parent 0 root 3  owner 0 offset 1
Failed to find [5523931971584, 168, 16384]
btrfs unable to find ref byte nr 5524037566464 parent 0 root 3861  owner 3 
offset 0
ctree.c:197: update_ref_for_cow: BUG_ON `ret` triggered, value -5
btrfs(+0x113cf)[0x5651e60443cf]
btrfs(__btrfs_cow_block+0x576)[0x5651e6045848]
btrfs(btrfs_cow_block+0xea)[0x5651e6045dc6]
btrfs(btrfs_search_slot+0x11df)[0x5651e604969d]
btrfs(+0x59184)[0x5651e608c184]
btrfs(cmd_check+0x2bd4)[0x5651e60987b3]
btrfs(main+0x85)[0x5651e60442c3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f34f523d2b1]
btrfs(_start+0x2a)[0x5651e6043e3a]


Full log:
enabling repair mode
Checking filesystem on /dev/mapper/dshelf2
UUID: 85441c59-ad11-4b25-b1fe-974f9e4acede
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
checksum verify failed on 3037243965440 found 179689AF wanted 82B97043
checksum verify failed on 3037243965440 found 179689AF wanted 82B97043
checksum verify failed on 3037243998208 found 60EA5C5B wanted 0CF5948F
checksum verify failed on 3037243998208 found 60EA5C5B wanted 0CF5948F
checksum verify failed on 3037244293120 found 38382803 wanted 39E4F85E
checksum verify failed on 3037244293120 found 38382803 wanted 39E4F85E
checksum verify failed on 3037244342272 found E84F1D8F wanted 472DA98C
checksum verify failed on 3037244342272 found E84F1D8F wanted 472DA98C
checksum verify failed on 3037244669952 found 2F6E4C0E wanted E00BBF09
checksum verify failed on 3037244669952 found 2F6E4C0E wanted E00BBF09
checksum verify failed on 3037248913408 found CE2E4AEE wanted EF22F9CA
checksum verify failed on 3037248913408 found CE2E4AEE wanted EF22F9CA
checksum verify failed on 3037248929792 found C989CB0E wanted E27527BC
checksum verify failed on 3037248929792 found C989CB0E wanted E27527BC
checksum verify failed on 3037247569920 found 05848C79 wanted EF3D5598
checksum verify failed on 3037247569920 found 05848C79 wanted EF3D5598
checksum verify failed on 3037247586304 found 9D1E4E39 wanted F1EC8135
checksum verify failed on 3037247586304 found 9D1E4E39 wanted F1EC8135
checksum verify failed on 3037247619072 found BFE40520 wanted 627DB20D
checksum verify failed on 3037247619072 found BFE40520 wanted 627DB20D
checksum verify failed on 3037249208320 found A6B5775F wanted B1E6C0FC
checksum verify failed on 3037249208320 found A6B5775F wanted B1E6C0FC
checksum verify failed on 3037252534272 found 207AD7DF wanted DE72BDF7
checksum verify failed on 3037252534272 found 207AD7DF wanted DE72BDF7
checksum verify failed on 3111569391616 found 3C623707 wanted D955D668
checksum verify failed on 3111569391616 found 3C623707 wanted D955D668
checksum verify failed on 3111569768448 found 0C129F3C wanted C509003A
checksum verify failed on 3111569768448 found 0C129F3C wanted C509003A
checksum verify failed on 3111569735680 found E94C9D41 wanted 55836DD2
checksum verify failed on 3111569735680 found E94C9D41 wanted 55836DD2
checksum verify failed on 3037253435392 found 8E124EB5 wanted A3291C35
checksum verify failed on 3037253435392 found 8E124EB5 wanted A3291C35
checksum verify failed on 3037253746688 found 2B6A4DCD wanted 4323B339
checksum verify failed on 3037253746688 found 2B6A4DCD wanted 4323B339
checksum verify failed on 3111569702912 found 1048610C wanted 9856BB43
checksum verify failed on 3111569702912 found 1048610C wanted 9856BB43
checksum verify failed on 3111569801216 found CD7AAF82 wanted C1DA44DF
checksum verify failed on 3111569801216 found CD7AAF82 wanted C1DA44DF
checksum verify failed on 3037251878912 found 86FB02F3 wanted 728772CE
checksum verify failed on 3037251878912 found 86FB02F3 wanted 728772CE
checksum verify failed on 3037252861952 found CFD54426 wanted E91774C0
checksum verify failed on 3037252861952 found CFD54426 wanted E91774C0
checksum verify failed on 3037255974912 found E3655B7C wanted 8163FDDE
checksum verify failed on 3037255974912 found E3655B7C wanted 8163FDDE
checksum verify failed on 3037252927488 found E7AD88A3 wanted F6BA5B10
checksum verify failed on 3037252927488 found E7AD88A3 wanted F6BA5B10
checksum verify failed on 303725350

Re: [PATCH v2] btrfs/146: Test various btrfs operations rounding behavior

2017-07-06 Thread Eryu Guan
On Fri, Jun 23, 2017 at 04:25:43PM +0800, Eryu Guan wrote:
> On Wed, Jun 21, 2017 at 10:50:35AM +0300, Nikolay Borisov wrote:
> > When changing the size of disks/filesystem we should always be
> > rounding down to a multiple of sectorsize
> > 
> > Signed-off-by: Nikolay Borisov 
> 
> Thanks for the update! But I still need some reviews from btrfs list to
> see if this is a valid test.

Still looking for reviews on this test. Thanks a lot!

Eryu

> 
> (There're still two places in _cleanup() that have trailing whitespace
> issue, but I can fix them at commit time if test passes review of other
> btrfs developers.)
> 
> Thanks,
> Eryu
> 
> > ---
> > 
> > Changes since v1: 
> >  - Incorporated feedback from Eryu
> >  - Changed test number to 146 to avoid clashes
> > 
> >  tests/btrfs/146 | 147 
> > 
> >  tests/btrfs/146.out |  20 +++
> >  tests/btrfs/group   |   1 +
> >  3 files changed, 168 insertions(+)
> >  create mode 100755 tests/btrfs/146
> >  create mode 100644 tests/btrfs/146.out
> > 
> > diff --git a/tests/btrfs/146 b/tests/btrfs/146
> > new file mode 100755
> > index 000..7e6d40f
> > --- /dev/null
> > +++ b/tests/btrfs/146
> > @@ -0,0 +1,147 @@
> > +#! /bin/bash
> > +# FS QA Test No. btrfs/146
> > +#
> > +# Test that various code paths which deal with modifying the total size
> > +# of devices/superblock correctly round the value to a multiple of
> > +# sector size
> > +#
> > +#---
> > +#
> > +# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
> > +# Author: Nikolay Borisov 
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#---
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +tmp=/tmp/$$
> > +status=1   # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +   $UMOUNT_PROG $loop_mnt 
> > +   _destroy_loop_device $loop_dev1
> > +   _destroy_loop_device $loop_dev2
> > +   cd /
> > +   rm -f $tmp.*
> > +   
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# real QA test starts here
> > +_supported_fs btrfs
> > +_supported_os Linux
> > +_require_scratch
> > +_require_loop
> > +_require_block_device $SCRATCH_DEV
> > +_require_btrfs_command inspect-internal dump-super
> > +# ~2.1 gigabytes just to be on the safe side
> > +_require_fs_space $SCRATCH_MNT $(( 2202009 ))
> > +
> > +rm -f $seqres.full
> > +
> > +_scratch_mkfs >>$seqres.full 2>&1
> > +_scratch_mount
> > +
> > +
> > +# Size of devices which are going to be half a page larger than
> > +# default sectorsize (4kb)
> > +expectedsize=$(( 1 * 1024 * 1024 * 1024 ))
> > +filesize=$(( $expectedsize + 2048 ))
> > +loop_mnt=$SCRATCH_MNT/mount
> > +fs_img1=$SCRATCH_MNT/disk1.img
> > +fs_img2=$SCRATCH_MNT/disk2.img
> > +
> > +mkdir $loop_mnt
> > +
> > +#create files to hold dummy devices
> > +$XFS_IO_PROG -f -c "falloc 0 $filesize" $fs_img1 &> /dev/null
> > +$XFS_IO_PROG -f -c "falloc 0 $filesize" $fs_img2 &> /dev/null
> > +
> > +loop_dev1=$(_create_loop_device $fs_img1)
> > +loop_dev2=$(_create_loop_device $fs_img2)
> > +
> > +#create fs only on the first device
> > +_mkfs_dev $loop_dev1
> > +_mount $loop_dev1 $loop_mnt
> > +
> > +echo "Size from mkfs"
> > +$BTRFS_UTIL_PROG inspect-internal dump-super /dev/loop0 | grep total
> > +
> > +#resize down to 768mb + 2k
> > +$BTRFS_UTIL_PROG filesystem resize 824182784 $loop_mnt >>$seqres.full 2>&1
> > +sync
> > +
> > +echo ""
> > +
> > +echo "Size after resizing down"
> > +$BTRFS_UTIL_PROG inspect-internal dump-super $loop_dev1 | grep total
> > +
> > +echo ""
> > +
> > +#now resize back up to 1 gb
> > +$BTRFS_UTIL_PROG filesystem resize max $loop_mnt >>$seqres.full 2>&1
> > +sync
> > +
> > +echo "Size after resizing up"
> > +$BTRFS_UTIL_PROG inspect-internal dump-super /dev/loop0 | grep total
> > +
> > +# Add fs load for later balance operations, we need to do this
> > +# before adding a second device
> > +$FSSTRESS_PROG -w -n 15000 -p 2 -d $loop_mnt >> $seqres.full 2>&1
> > +
> > +# add second unaligned device, this checks the btrfs_init_new_device 
> > codepath
> > +# dev

Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Marc MERLIN
On Thu, Jul 06, 2017 at 04:44:51PM -0700, Omar Sandoval wrote:
> In the bug report, you commented that CURRENT contained MANIFEST-010814,
> is that indeed the case or was it actually something newer? If it was
> the newer one, then it's still tricky how we'd end up that way but not
> as outlandish.

You are correct, my bad.
At this point, I'm going to have to assume that something bad happened with
rsync when I rsync'ed an old profile over the one that caused chrome to fail
to restart.
Especially because CURRENT was dated Oct 4th, which does not make sense.

Now that I know what to look for, I'll have a much closer look next time
this happens, with the understanding that it would be a while if I've
successfully fixed the reason why my laptop was crashing too often.

But you said you've also seen issues with google-chrome profile and btrfs.
What did you experience?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Omar Sandoval
On Thu, Jul 06, 2017 at 04:31:52PM -0700, Marc MERLIN wrote:
> On Thu, Jul 06, 2017 at 04:01:41PM -0700, Omar Sandoval wrote:
> > What doesn't add up about your bug report is that your CURRENT points to
> > a MANIFEST-010814 way behind all of the other files in that directory,
> > which are numbered 022745+. If there were a bug here, I'd expect the
> > stale MANIFEST file would be one older than the new one. The filenames
> > seem to be allocated sequentially, so that old MANIFEST file CURRENT
> > refers to must be really old, which doesn't make sense. I don't see how
> > Btrfs would screw that up :) I'd be interested to see if you can make
> > the same condition trigger again.
> > 
> 
> First, thanks for looking at it.
> 
> Second, you are right on the numbers being so far apart that something was
> wrong. I checked my snapshots, and I've been carrying that MANIFEST-010814
> for a long time.
> In other words, it's an old stale manifest that never got deleted.
> 
> The actual previous one apparently got deleted, the new one was created but
> didn't make it to disk, yet the pointer in CURRENT did get repointed to the
> new one that never made it to actual disk.
> 
> So I think what happened is something like this:
> MANIFEST-new got created
> echo MANIFEST-new > CURRENT
> MANIFEST-old got deleted
> system crashed
> 
> MANIFEST-old was indeed deleted, and MANIFEST-new never made it to disk.
> 
> Does that sound more plausible?

In the bug report, you commented that CURRENT contained MANIFEST-010814,
is that indeed the case or was it actually something newer? If it was
the newer one, then it's still tricky how we'd end up that way but not
as outlandish.

> As for redoing this at will, apparently I may have been hit by the skylake
> hyperthreading CPU bug that I just installed a microcode update for, which
> was causing random crashes, which hopefully are now solved.
> I can't say if those in turn messed with btrfs writing data, but I'd rather
> not recreate this since it's my real filesystem I care about and don't want
> to corrupt on purpose :)

Understandable :)

> That said, the google-chrome on my previous haswell CPU also had routine
> problems when restarting chrome, although at this point I don't know if they
> were due to leveldb or sqlite or something else.
> I'm just mentioning this to say that I'm pretty sure that the haswell HT bug
> isn't the sole culprit of this problem, likely just the trigger of some of
> my crashes.
>
> Hope this helps
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>    what McDonalds is to gourmet 
> cooking
> Home page: http://marc.merlins.org/  


Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Marc MERLIN
On Thu, Jul 06, 2017 at 04:01:41PM -0700, Omar Sandoval wrote:
> What doesn't add up about your bug report is that your CURRENT points to
> a MANIFEST-010814 way behind all of the other files in that directory,
> which are numbered 022745+. If there were a bug here, I'd expect the
> stale MANIFEST file would be one older than the new one. The filenames
> seem to be allocated sequentially, so that old MANIFEST file CURRENT
> refers to must be really old, which doesn't make sense. I don't see how
> Btrfs would screw that up :) I'd be interested to see if you can make
> the same condition trigger again.
> 

First, thanks for looking at it.

Second, you are right on the numbers being so far apart that something was
wrong. I checked my snapshots, and I've been carrying that MANIFEST-010814
for a long time.
In other words, it's an old stale manifest that never got deleted.

The actual previous one apparently got deleted, the new one was created but
didn't make it to disk, yet the pointer in CURRENT did get repointed to the
new one that never made it to actual disk.

So I think what happened is something like this:
MANIFEST-new got created
echo MANIFEST-new > CURRENT
MANIFEST-old got deleted
system crashed

MANIFEST-old was indeed deleted, and MANIFEST-new never made it to disk.

Does that sound more plausible?

As for redoing this at will, apparently I may have been hit by the skylake
hyperthreading CPU bug that I just installed a microcode update for, which
was causing random crashes, which hopefully are now solved.
I can't say if those in turn messed with btrfs writing data, but I'd rather
not recreate this since it's my real filesystem I care about and don't want
to corrupt on purpose :)
That said, the google-chrome on my previous haswell CPU also had routine
problems when restarting chrome, although at this point I don't know if they
were due to leveldb or sqlite or something else.
I'm just mentioning this to say that I'm pretty sure that the haswell HT bug
isn't the sole culprit of this problem, likely just the trigger of some of
my crashes.

Hope this helps
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Omar Sandoval
On Thu, Jul 06, 2017 at 02:24:22PM -0700, Marc MERLIN wrote:
> On Thu, Jul 06, 2017 at 02:13:20PM -0700, Omar Sandoval wrote:
> > On Thu, Jul 06, 2017 at 08:00:46AM -0700, Marc MERLIN wrote:
> > > I don't know who else uses google-chrome here, but for me, for as long as
> > > I've used btrfs (3+ years now), I've had no end of troubles recovering 
> > > from
> > > a linux crash, and google-chrome has had problems recovering my tabs and
> > > usually complains about plenty of problems, some of which look like corruption.
> > 
> > I've also had issues with chrome and Btrfs, not just you.
> > 
> > [snip]
> > 
> > > Does anyone know if it's leveldb relying on non POSIX semantics that just
> > > happen to work out on ext4, or if btrfs COW and atomicity doesn't quite
> > > handle multi file updates in a way that is expected by a spec, or by
> > > application developers?
> > 
> > A quick google search turned this up: 
> > https://github.com/google/leveldb/issues/195.
> > Unless anything has changed since that issue was last updated, it does
> > sound like LevelDB is making some unsafe assumptions. I'll take a look.
> 
> Thanks Omar, this very much looks related indeed.
> 
> Marc

Hm, tracing a simple program I put together using LevelDB, it looks like
the relevant sequence of events on open is pretty much:

create new MANIFEST-nn
write out new MANIFEST-nn
fsync() parent directory
fdatasync() MANIFEST-nn

create temporary CURRENT
write out temporary CURRENT to point to new MANIFEST-nn name
fdatasync() temporary CURRENT
rename temporary CURRENT to CURRENT

unlink old MANIFEST-nn

There's no fsync of the parent directory between renaming CURRENT and
unlinking the old MANIFEST, but that shouldn't be an issue on Btrfs.
Either both of those operations will be committed in the same filesystem
transaction, or the rename will be committed before the unlink. I can't
think of any way that the unlink would end up reordered before the
rename.
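
To make that ordering explicit, here is a rough shell rendition of the sequence
traced above (paths and file names are made up; LevelDB does this in C with
fdatasync()/fsync(), approximated here with coreutils >= 8.24 "sync FILE"):

db=/path/to/example.leveldb                  # hypothetical database directory
write_manifest > "$db/MANIFEST-000002"       # 1. write the new manifest ("write_manifest" is a stand-in)
sync "$db"                                   # 2. fsync() the parent directory
sync --data "$db/MANIFEST-000002"            # 3. fdatasync() the new manifest
echo "MANIFEST-000002" > "$db/CURRENT.tmp"   # 4. temporary CURRENT naming the new manifest
sync --data "$db/CURRENT.tmp"                # 5. fdatasync() temporary CURRENT
mv "$db/CURRENT.tmp" "$db/CURRENT"           # 6. rename() over CURRENT
rm "$db/MANIFEST-000001"                     # 7. unlink the old manifest
# Note: no directory fsync between steps 6 and 7 -- the ordering question above.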

What doesn't add up about your bug report is that your CURRENT points to
a MANIFEST-010814 way behind all of the other files in that directory,
which are numbered 022745+. If there were a bug here, I'd expect the
stale MANIFEST file would be one older than the new one. The filenames
seem to be allocated sequentially, so that old MANIFEST file CURRENT
refers to must be really old, which doesn't make sense. I don't see how
Btrfs would screw that up :) I'd be interested to see if you can make
the same condition trigger again.


Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Marc MERLIN
On Thu, Jul 06, 2017 at 02:13:20PM -0700, Omar Sandoval wrote:
> On Thu, Jul 06, 2017 at 08:00:46AM -0700, Marc MERLIN wrote:
> > I don't know who else uses google-chrome here, but for me, for as long as
> > I've used btrfs (3+ years now), I've had no end of troubles recovering from
> > a linux crash, and google-chrome has had problems recovering my tabs and
> > usually complains about plenty of problems, some of which look like corruption.
> 
> I've also had issues with chrome and Btrfs, not just you.
> 
> [snip]
> 
> > Does anyone know if it's leveldb relying on non POSIX semantics that just
> > happen to work out on ext4, or if btrfs COW and atomicity doesn't quite
> > handle multi file updates in a way that is expected by a spec, or by
> > application developers?
> 
> A quick google search turned this up: 
> https://github.com/google/leveldb/issues/195.
> Unless anything has changed since that issue was last updated, it does
> sound like LevelDB is making some unsafe assumptions. I'll take a look.

Thanks Omar, this very much looks related indeed.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Omar Sandoval
On Thu, Jul 06, 2017 at 08:00:46AM -0700, Marc MERLIN wrote:
> I don't know who else uses google-chrome here, but for me, for as long as
> I've used btrfs (3+ years now), I've had no end of troubles recovering from
> a linux crash, and google-chrome has had problems recovering my tabs and
> usually complains about plenty of problems, some of which look like corruption.

I've also had issues with chrome and Btrfs, not just you.

[snip]

> Does anyone know if it's leveldb relying on non POSIX semantics that just
> happen to work out on ext4, or if btrfs COW and atomicity doesn't quite
> handle multi file updates in a way that is expected by a spec, or by
> application developers?

A quick google search turned this up: 
https://github.com/google/leveldb/issues/195.
Unless anything has changed since that issue was last updated, it does
sound like LevelDB is making some unsafe assumptions. I'll take a look.


Re: [PATCH v2 3/4] btrfs: Add zstd support

2017-07-06 Thread Adam Borowski
On Thu, Jun 29, 2017 at 12:41:07PM -0700, Nick Terrell wrote:
> Add zstd compression and decompression support to BtrFS. zstd at its
> fastest level compresses almost as well as zlib, while offering much
> faster compression and decompression, approaching lzo speeds.

Got a reproducible crash on amd64:

[98235.266511] BUG: unable to handle kernel paging request at c90001251000
[98235.267485] IP: ZSTD_storeSeq.constprop.24+0x67/0xe0
[98235.269395] PGD 227034067 
[98235.269397] P4D 227034067 
[98235.271587] PUD 227035067 
[98235.273657] PMD 223323067 
[98235.275744] PTE 0

[98235.281545] Oops: 0002 [#1] SMP
[98235.283353] Modules linked in: loop veth tun fuse arc4 rtl8xxxu mac80211 
cfg80211 cp210x pl2303 rfkill usbserial nouveau video mxm_wmi ttm
[98235.285203] CPU: 0 PID: 10850 Comm: kworker/u12:9 Not tainted 4.12.0+ #1
[98235.287070] Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011
[98235.288964] Workqueue: btrfs-delalloc btrfs_delalloc_helper
[98235.290934] task: 880224984140 task.stack: c90007e5c000
[98235.292731] RIP: 0010:ZSTD_storeSeq.constprop.24+0x67/0xe0
[98235.294579] RSP: 0018:c90007e5fa68 EFLAGS: 00010282
[98235.296395] RAX: c90001251001 RBX: 0094 RCX: c9000118f930
[98235.298380] RDX: 0006 RSI: c900011b06b0 RDI: c9000118d1e0
[98235.300321] RBP: 009f R08: 1fffbe58 R09: 
[98235.302282] R10: c9000118f970 R11: 0005 R12: c9000118f878
[98235.304221] R13: 005b R14: c9000118f915 R15: c900011cfe88
[98235.306147] FS:  () GS:88022fc0() 
knlGS:
[98235.308162] CS:  0010 DS:  ES:  CR0: 80050033
[98235.310129] CR2: c90001251000 CR3: 00021018d000 CR4: 06f0
[98235.312095] Call Trace:
[98235.314008]  ? ZSTD_compressBlock_fast+0x94b/0xb30
[98235.315975]  ? ZSTD_compressContinue_internal+0x1a0/0x580
[98235.317938]  ? ZSTD_compressStream_generic+0x248/0x2f0
[98235.319877]  ? ZSTD_compressStream+0x41/0x60
[98235.321821]  ? zstd_compress_pages+0x236/0x5d0
[98235.323724]  ? btrfs_compress_pages+0x5e/0x80
[98235.325684]  ? compress_file_range.constprop.79+0x1eb/0x750
[98235.327668]  ? async_cow_start+0x2e/0x50
[98235.329594]  ? btrfs_worker_helper+0x1b9/0x1d0
[98235.331486]  ? process_one_work+0x158/0x2f0
[98235.61]  ? worker_thread+0x45/0x3a0
[98235.335253]  ? process_one_work+0x2f0/0x2f0
[98235.337189]  ? kthread+0x10e/0x130
[98235.339020]  ? kthread_park+0x60/0x60
[98235.340819]  ? ret_from_fork+0x22/0x30
[98235.342637] Code: 8b 4e d0 4c 89 48 d0 4c 8b 4e d8 4c 89 48 d8 4c 8b 4e e0 
4c 89 48 e0 4c 8b 4e e8 4c 89 48 e8 4c 8b 4e f0 4c 89 48 f0 4c 8b 4e f8 <4c> 89 
48 f8 48 39 f1 75 a2 4e 8d 04 c0 48 8b 31 48 83 c0 08 48 
[98235.346773] RIP: ZSTD_storeSeq.constprop.24+0x67/0xe0 RSP: c90007e5fa68
[98235.348809] CR2: c90001251000
[98235.363216] ---[ end trace 5fb3ad0f2aec0605 ]---
[98235.363218] BUG: unable to handle kernel paging request at c9000393a000
[98235.363239] IP: ZSTD_storeSeq.constprop.24+0x67/0xe0
[98235.363241] PGD 227034067 
[98235.363242] P4D 227034067 
[98235.363243] PUD 227035067 
[98235.363244] PMD 21edec067 
[98235.363245] PTE 0
(More of the above follows.)

My reproducer copies an uncompressed tarball onto a fresh filesystem:
#!/bin/sh
set -e

losetup -D; umount /mnt/vol1 ||:
dd if=/dev/zero of=/tmp/disk bs=2048 seek=1048575 count=1
mkfs.btrfs -msingle /tmp/disk
losetup -f /tmp/disk
sleep 1 # yay udev races
mount -onoatime,compress=$1 /dev/loop0 /mnt/vol1
time sh -c 'cp -p ~kilobyte/tmp/kernel.tar /mnt/vol1 && umount /mnt/vol1'
losetup -D
(run it with arg of "zstd")

Kernel is 4.12.0 + btrfs-for-4.13 + v4 of Qu's chunk check + some unrelated
stuff + zstd; in case it matters I've pushed my tree to
https://github.com/kilobyte/linux/tree/zstd-crash

The payload is a tarball of the above, but for debugging compression you
need the exact byte stream: https://angband.pl/tmp/kernel.tar.xz -- use it
without the xz layer; I compressed it only for transport.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Leveldb in google-chrome incompatible with btrfs?

2017-07-06 Thread Marc MERLIN
I don't know who else uses google-chrome here, but for me, for as long as
I've used btrfs (3+ years now), I've had no end of troubles recovering from
a linux crash, and google-chrome has had problems recovering my tabs and
usually complains about plenty of problems, some of which look like corruption.

The way I recover, every time, is to simply rsync the previous snapshot's
data for ~/.config/google-chrome-dir back over the current corrupted one,
and move on.
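
Roughly, with hypothetical snapshot paths, that recovery is something like:

rsync -a --delete \
    /mnt/snapshots/daily.1/home/merlin/.config/google-chrome-dir/ \
    ~/.config/google-chrome-dir/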

I had a closer look at this with a Chrome engineer, and there seems to be
at least one clear problem with leveldb where the file containing the index
name does not match the files on disk.
Details are in https://bugs.chromium.org/p/chromium/issues/detail?id=738961
but I'll summarize here.


All the files are in a non-corrupted state, but I end up with this:
saruman:~/.config/google-chrome-dir/google-chrome-beta/Default/IndexedDB/https_docs.google.com_0.indexeddb.leveldb$
 l
(...)
-rw------- 1 merlin merlin   16 Oct  4  2016 CURRENT
-rw------- 1 merlin merlin    0 Apr  9  2016 LOCK
-rw------- 1 merlin merlin    0 Jul  5 22:31 LOG
-rw------- 1 merlin merlin    0 Jul  5 22:31 LOG.old
-rw------- 1 merlin merlin   23 Jul  3 08:31 MANIFEST-01
-rw------- 1 merlin merlin 8639 May 12 17:36 MANIFEST-022745
saruman:~/.config/google-chrome-dir/google-chrome-beta/Default/IndexedDB/https_docs.google.com_0.indexeddb.leveldb$
 cat CURRENT 
MANIFEST-010814

In other words, I think things go like this:
- MANIFEST-010814 is replaced by MANIFEST-022745, 
- MANIFEST-022745 is written into the file CURRENT
- ideally both MANIFEST-010814 and MANIFEST-022745 should be present on disk
  and MANIFEST-010814 deleted only after CURRENT has been written to with
  the new pointer, but I'm not sure if or how that is done.
- my laptop crashes
- after reboot, btrfs has not been able to commit the update to CURRENT to disk
  in a consistent state, therefore discards the COW data to CURRENT and
  leaves it in its prior state, which now holds the old data
- somehow MANIFEST-010814 has, however, been deleted, so now CURRENT points to
  a non-existent file.

I don't know how leveldb works or where its (f)sync calls are, but I'm
guessing it works reliably on ext4 since it's used by many users and I'm the
first one reporting this problem with leveldb.

Does anyone know if it's leveldb relying on non-POSIX semantics that just
happen to work out on ext4, or if btrfs COW and atomicity doesn't quite
handle multi-file updates in a way that is expected by a spec, or by
application developers?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


Re: [PATCH v2 3/4] btrfs: Add zstd support

2017-07-06 Thread Austin S. Hemmelgarn

On 2017-07-06 08:09, Lionel Bouton wrote:

On 06/07/2017 at 13:59, Austin S. Hemmelgarn wrote:

On 2017-07-05 20:25, Nick Terrell wrote:

On 7/5/17, 12:57 PM, "Austin S. Hemmelgarn" 
wrote:

It's the slower compression speed that has me arguing for the
possibility of configurable levels on zlib.  11MB/s is painfully slow
considering that most decent HDD's these days can get almost 5-10x that
speed with no compression.  There are cases (WORM pattern archival
storage for example) where slow writes to that degree may be
acceptable,
but for most users they won't be, and zlib at level 9 would probably be
a better choice.  I don't think it can beat zstd at level 15 for
compression ratio, but if they're even close, then zlib would still
be a
better option at that high of a compression level most of the time.


I don't imagine the very high zstd levels would be useful to too many
btrfs users, except in rare cases. However, lower levels of zstd should
outperform zlib level 9 in all aspects except memory usage. I would
expect
zstd level 7 would compress as well as or better than zlib 9 with faster
compression and decompression speed. It's worth benchmarking to
ensure that
it holds for many different workloads, but I wouldn't expect zlib 9 to
compress better than zstd 7 often. zstd up to level 12 should
compress as
fast as or faster than zlib level 9. zstd levels 12 and beyond allow
stronger compression than zlib, at the cost of slow compression and more
memory usage.

While I generally agree that most people probably won't use zstd
levels above 3, it shouldn't be hard to support them if we're going to
have configurable compression levels, so I would argue that it's still
worth supporting anyway.


One use case for the higher compression levels would be manual
defragmentation with recompression for a subset of data (files that
won't be updated and are stored for long periods typically). The
filesystem would be mounted with a low level for general usage low
latencies and the subset of files would be recompressed with a high
level asynchronously.
That's one of the cases I was thinking of; we actually do that on a
couple of systems where I work that see mostly WORM access patterns, so
zstd level 15's insanely good decompression speed would be great for this.




RE: Btrfs Compression

2017-07-06 Thread Paul Jones
> -----Original Message-----
> From: Austin S. Hemmelgarn [mailto:ahferro...@gmail.com]
> Sent: Thursday, 6 July 2017 9:52 PM
> To: Paul Jones ; linux-btrfs@vger.kernel.org
> Subject: Re: Btrfs Compression
> 
> On 2017-07-05 23:19, Paul Jones wrote:
> > While reading the thread about adding zstd compression, it occurred to
> > me that there is potentially another thing affecting performance -
> > Compressed extent size. (correct my terminology if it's incorrect). I
> > have two near identical RAID1 filesystems (used for backups) on near
> > identical discs (HGST 3T), one compressed and one not. The filesystems
> > have about 40 snapshots and are about 50% full. The uncompressed
> > filesystem runs at about 60 MB/s, the compressed filesystem about 5-10
> > MB/s. There is noticeably more "noise" from the compressed filesystem
> > from all the head thrashing that happens while rsync is happening.
> >
> > Which brings me to my point - In terms of performance for compression,
> > is there some low hanging fruit in adjusting the extent size to be
> > more like uncompressed extents so there is not so much seeking
> > happening? With spinning discs with large data sets it seems pointless
> > making the numerical calculations faster if the discs can't keep up.
> > Obviously this is assuming optimisation for speed over compression
> > ratio.
> >
> > Thoughts?
> That really depends on too much to be certain.  In all likelihood, your
> CPU or memory are your bottleneck, not your storage devices.  The data
> itself gets compressed in memory, and then sent to the storage device, it's
> not streamed directly there from the compression thread, so if the CPU was
> compressing data faster than the storage devices could transfer it, you would
> (or at least, should) be seeing better performance on the compressed
> filesystem than the uncompressed one (because you transfer less data on
> the compressed filesystem), assuming the datasets are functionally identical.
> 
> That in turn brings up a few other questions:
> * What are the other hardware components involved (namely, CPU, RAM,
> and storage controller)?  If you're using some dinky little Atom or
> Cortex-A7 CPU (or almost anything else 32-bit running at less than 2GHz
> peak), then that's probably your bottleneck.  Similarly, if you've got a cheap
> storage controller that needs a lot of attention from the CPU, then that's
> probably your bottleneck (you can check this by seeing how much processing
> power is being used when just writing to the uncompressed array (check
> how much processing power rsync uses copying between two tmpfs mounts,
> then subtract that from the total for copying the same data to the
> uncompressed filesystem)).

Hardware is AMD Phenom II X6 1055T with 8GB DDR3 on the compressed filesystem, 
Intel i7-3770K with 8GB DDR3 on the uncompressed. Slight difference, but both 
are up to the task.
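
For what it's worth, the tmpfs baseline Austin suggests above can be
approximated like this (paths are examples, GNU time assumed; rough numbers only):

mkdir -p /tmp/src /tmp/dst                      # assuming /tmp is tmpfs
cp -a /data/sample-subset/. /tmp/src/           # a representative slice of the backup set
/usr/bin/time -v rsync -a /tmp/src/ /tmp/dst/            # rsync cost alone, no real storage
/usr/bin/time -v rsync -a /tmp/src/ /mnt/uncompressed/   # plus filesystem, no compression
/usr/bin/time -v rsync -a /tmp/src/ /mnt/compressed/     # plus filesystem with zlib compression
# Comparing user+sys times across the runs separates rsync overhead from
# filesystem/compression overhead.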

> * Which compression algorithm are you using, lzo or zlib?  If the answer is
> zlib, then what you're seeing is generally expected behavior except on
> systems with reasonably high-end CPU's and fast memory, because zlib is
> _slow_.

Zlib.

> * Are you storing the same data on both arrays?  If not, then that
> immediately makes the comparison suspect (if one array is storing lots of
> small files and the other is mostly storing small numbers of large files, 
> then I
> would expect the one with lots of small files to get worse performance, and
> compression on that one will just make things worse).
> This is even more important when using rsync, because the size of the files
> involved has a pretty big impact on it's hashing performance and even data
> transfer rate (lots of small files == more time spent in syscalls other than
> read() and write()).

The dataset is rsync-ed to the primary backup and then to the secondary backup, 
so contains the same content.

> 
> Additionally, when you're referring to extent size, I assume you mean the
> huge number of 128k extents that the FIEMAP ioctl (and at least older
> versions of `filefrag`) shows for compressed files?  If that's the case, then 
> it's
> important to understand that that's due to an issue with FIEMAP, it doesn't
> understand compressed extents in BTRFS correctly, so it shows one extent
> per compressed _block_ instead, even if they are internally an extent in
> BTRFS.  You can verify the actual number of extents by checking how many
> runs of continuous 128k 'extents' there are.

It was my understanding that compressed extents are significantly smaller in
size than uncompressed ones? (Like 64k vs 128M? Perhaps I'm thinking of
something else.) I couldn't find any info about this, but I remember it being
mentioned here before. Either way, disk I/O is maxed out, so something is
different with compression in a way that spinning rust doesn't seem to like.


Paul.


Re: [PATCH v2 3/4] btrfs: Add zstd support

2017-07-06 Thread Lionel Bouton
On 06/07/2017 at 13:59, Austin S. Hemmelgarn wrote:
> On 2017-07-05 20:25, Nick Terrell wrote:
>> On 7/5/17, 12:57 PM, "Austin S. Hemmelgarn" 
>> wrote:
>>> It's the slower compression speed that has me arguing for the
>>> possibility of configurable levels on zlib.  11MB/s is painfully slow
>>> considering that most decent HDD's these days can get almost 5-10x that
>>> speed with no compression.  There are cases (WORM pattern archival
>>> storage for example) where slow writes to that degree may be
>>> acceptable,
>>> but for most users they won't be, and zlib at level 9 would probably be
>>> a better choice.  I don't think it can beat zstd at level 15 for
>>> compression ratio, but if they're even close, then zlib would still
>>> be a
>>> better option at that high of a compression level most of the time.
>>
>> I don't imagine the very high zstd levels would be useful to too many
>> btrfs users, except in rare cases. However, lower levels of zstd should
>> outperform zlib level 9 in all aspects except memory usage. I would
>> expect
>> zstd level 7 would compress as well as or better than zlib 9 with faster
>> compression and decompression speed. It's worth benchmarking to
>> ensure that
>> it holds for many different workloads, but I wouldn't expect zlib 9 to
>> compress better than zstd 7 often. zstd up to level 12 should
>> compress as
>> fast as or faster than zlib level 9. zstd levels 12 and beyond allow
>> stronger compression than zlib, at the cost of slow compression and more
>> memory usage.
> While I generally agree that most people probably won't use zstd
> levels above 3, it shouldn't be hard to support them if we're going to
> have configurable compression levels, so I would argue that it's still
> worth supporting anyway.

One use case for the higher compression levels would be manual
defragmentation with recompression for a subset of data (files that
won't be updated and are stored for long periods typically). The
filesystem would be mounted with a low level for general usage low
latencies and the subset of files would be recompressed with a high
level asynchronously.
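
As a sketch of that workflow with today's tooling (paths are examples, and note
that a per-file compression *level* isn't selectable yet -- defrag only takes an
algorithm via -c, so the "high level" part is hypothetical here):

mount -o compress=lzo /dev/sdX /data                     # cheap compression for day-to-day writes
btrfs filesystem defragment -r -v -czlib /data/archive   # recompress the cold subset afterwards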

Best regards,

Lionel


Re: Btrfs Compression

2017-07-06 Thread Lionel Bouton
On 06/07/2017 at 13:51, Austin S. Hemmelgarn wrote:
>
> Additionally, when you're referring to extent size, I assume you mean
> the huge number of 128k extents that the FIEMAP ioctl (and at least
> older versions of `filefrag`) shows for compressed files?  If that's
> the case, then it's important to understand that that's due to an
> issue with FIEMAP, it doesn't understand compressed extents in BTRFS
> correctly, so it shows one extent per compressed _block_ instead, even
> if they are internally an extent in BTRFS.  You can verify the actual
> number of extents by checking how many runs of continuous 128k
> 'extents' there are.

This is in fact the problem: compressed extents are far less likely to
be contiguous than uncompressed extents (even compensating for the
fiemap limitations). When calling defrag on these files BTRFS is likely
to ignore the fragmentation too: when I modeled the cost of reading a
file as stored vs the ideal cost if it were in one single block I got
this surprise. Uncompressed files can be fully defragmented most of the
time; compressed files usually reach a fragmentation cost of
approximately 1.5x to 2.5x the ideal case after defragmentation (it
seems to depend on how the whole filesystem is used).
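
A quick way to eyeball this from userspace, assuming filefrag from e2fsprogs
and keeping in mind the 128 KiB per-block artefact mentioned above (file path
is an example):

filefrag -v /backup/somefile | head -n 20   # per-'extent' listing (inflated for compressed files)
filefrag /backup/somefile                   # summary count, also inflated for compressed files
# Runs of consecutive 128 KiB entries with contiguous physical offsets are really
# one on-disk extent; breaks in the physical offsets are the real seek boundaries.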

Lionel


Re: [PATCH v2 3/4] btrfs: Add zstd support

2017-07-06 Thread Austin S. Hemmelgarn

On 2017-07-05 20:25, Nick Terrell wrote:

On 7/5/17, 12:57 PM, "Austin S. Hemmelgarn"  wrote:

It's the slower compression speed that has me arguing for the
possibility of configurable levels on zlib.  11MB/s is painfully slow
considering that most decent HDD's these days can get almost 5-10x that
speed with no compression.  There are cases (WORM pattern archival
storage for example) where slow writes to that degree may be acceptable,
but for most users they won't be, and zlib at level 9 would probably be
a better choice.  I don't think it can beat zstd at level 15 for
compression ratio, but if they're even close, then zlib would still be a
better option at that high of a compression level most of the time.


I don't imagine the very high zstd levels would be useful to too many
btrfs users, except in rare cases. However, lower levels of zstd should
outperform zlib level 9 in all aspects except memory usage. I would expect
zstd level 7 would compress as well as or better than zlib 9 with faster
compression and decompression speed. It's worth benchmarking to ensure that
it holds for many different workloads, but I wouldn't expect zlib 9 to
compress better than zstd 7 often. zstd up to level 12 should compress as
fast as or faster than zlib level 9. zstd levels 12 and beyond allow
stronger compression than zlib, at the cost of slow compression and more
memory usage.
While I generally agree that most people probably won't use zstd levels 
above 3, it shouldn't be hard to support them if we're going to have 
configurable compression levels, so I would argue that it's still worth 
supporting anyway.
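
For anyone who wants to sanity-check the level comparison on their own data
before kernel-side numbers exist, the userspace CLIs give a rough (not
kernel-accurate) idea; the file path is an example:

zstd -b1 -e19 /data/sample.tar                  # benchmark zstd levels 1..19: ratio and speed
gzip -9 -c /data/sample.tar | wc -c             # zlib level-9 ratio for comparison
time gzip -9 -c /data/sample.tar > /dev/null    # and its compression speed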


Looking at it another way, ZFS (at least, ZFS on FreeBSD) supports zlib 
compression (they call it gzip) with selectable compression levels, but 
95% of the time nobody uses anything but levels 1, 3, and 9.  Despite 
this, they still support other levels, and I have seen cases where other 
levels have been better than one of those three 'normal' ones.


Supporting multiple zlib compression levels could be interesting for older
kernels, lower memory usage, or backwards compatibility with older btrfs
versions. But for every zlib level, zstd has a level that provides better
compression ratio, compression speed, and decompression speed.
Just the memory footprint is a remarkably strong argument in many cases. 
 It appears to be one of the few things that zlib does better than zstd 
(although I'm not 100% certain about that), and can make a very big 
difference when dealing with small systems.
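
For anyone who wants to sanity-check the ratio/speed claims on their own
data, a quick userspace benchmark sketch is below. It uses Python's
stdlib zlib module and the third-party zstandard package rather than the
kernel implementations, so absolute numbers will differ, and the sample
filename is a placeholder; the relative ordering across levels is still
a reasonable guide.

import time
import zlib
import zstandard

def bench(name, compress, decompress, data):
    # Time one compression and one decompression pass and report MB/s.
    t0 = time.perf_counter(); blob = compress(data); t1 = time.perf_counter()
    decompress(blob);                                 t2 = time.perf_counter()
    mb = len(data) / 1e6
    print(f"{name:8s} ratio {len(data)/len(blob):5.2f}  "
          f"comp {mb/(t1-t0):7.1f} MB/s  decomp {mb/(t2-t1):7.1f} MB/s")

with open("testdata.bin", "rb") as f:   # placeholder: any representative file
    data = f.read()

for level in (1, 3, 9):
    bench(f"zlib-{level}",
          lambda d, l=level: zlib.compress(d, l), zlib.decompress, data)

for level in (1, 3, 7, 12, 15):
    cctx = zstandard.ZstdCompressor(level=level)
    dctx = zstandard.ZstdDecompressor()
    bench(f"zstd-{level}", cctx.compress, dctx.decompress, data)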

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs Compression

2017-07-06 Thread Austin S. Hemmelgarn

On 2017-07-05 23:19, Paul Jones wrote:

While reading the thread about adding zstd compression, it occurred
to me that there is potentially another thing affecting performance -
Compressed extent size. (correct my terminology if it's incorrect). I
have two near identical RAID1 filesystems (used for backups) on near
identical discs (HGST 3T), one compressed and one not. The
filesystems have about 40 snapshots and are about 50% full. The
uncompressed filesystem runs at about 60 MB/s, the compressed
filesystem about 5-10 MB/s. There is noticeably more "noise" from the
compressed filesystem from all the head thrashing that happens while
rsync is happening.

Which brings me to my point - In terms of performance for
compression, is there some low hanging fruit in adjusting the extent
size to be more like uncompressed extents so there is not so much
seeking happening? With spinning discs with large data sets it seems
pointless making the numerical calculations faster if the discs can't
keep up. Obviously this is assuming optimisation for speed over
compression ratio.

Thoughts?

That really depends on too much to be certain.  In all likelihood, your
CPU or memory are your bottleneck, not your storage devices.  The data 
itself gets compressed in memory and then sent to the storage device; 
it is not streamed directly there from the compression thread.  So if the 
CPU were compressing data faster than the storage devices could transfer 
it, you would (or at least should) be seeing better performance on the 
compressed filesystem than the uncompressed one (because you transfer 
less data on the compressed filesystem), assuming the datasets are 
functionally identical.


That in turn brings up a few other questions:
* What are the other hardware components involved (namely, CPU, RAM, and 
storage controller)?  If you're using some dinky little Atom or 
Cortex-A7 CPU (or almost anything else 32-bit running at less than 2GHz 
peak), then that's probably your bottleneck.  Similarly, if you've got a 
cheap storage controller that needs a lot of attention from the CPU, 
then that's probably your bottleneck.  You can check this by seeing how 
much processing power is used when just writing to the uncompressed 
array: measure how much CPU time rsync uses copying between two tmpfs 
mounts, then subtract that from the total for copying the same data to 
the uncompressed filesystem (see the sketch after this list).
* Which compression algorithm are you using, lzo or zlib?  If the answer 
is zlib, then what you're seeing is generally expected behavior except 
on systems with reasonably high-end CPU's and fast memory, because zlib 
is _slow_.
* Are you storing the same data on both arrays?  If not, then that 
immediately makes the comparison suspect (if one array is storing lots 
of small files and the other is mostly storing small numbers of large 
files, then I would expect the one with lots of small files to get worse 
performance, and compression on that one will just make things worse). 
This is even more important when using rsync, because the size of the 
files involved has a pretty big impact on its hashing performance and 
even data transfer rate (lots of small files == more time spent in 
syscalls other than read() and write()).
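
As a rough sketch of the measurement suggested in the first point above 
(the mount points are placeholders, adjust them for your setup):

import resource
import subprocess

def rsync_cpu_seconds(src, dst):
    # CPU time consumed by the rsync child processes for one copy.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(["rsync", "-a", src, dst], check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return ((after.ru_utime - before.ru_utime) +
            (after.ru_stime - before.ru_stime))

tmpfs_cpu = rsync_cpu_seconds("/mnt/tmpfs-src/", "/mnt/tmpfs-dst/")
disk_cpu  = rsync_cpu_seconds("/mnt/tmpfs-src/", "/mnt/btrfs-nocomp/")
print(f"extra CPU attributable to storage path: {disk_cpu - tmpfs_cpu:.1f} s")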


Additionally, when you're referring to extent size, I assume you mean 
the huge number of 128k extents that the FIEMAP ioctl (and at least 
older versions of `filefrag`) shows for compressed files?  If that's the 
case, then it's important to understand that that's due to an issue with 
FIEMAP, it doesn't understand compressed extents in BTRFS correctly, so 
it shows one extent per compressed _block_ instead, even if they are 
internally an extent in BTRFS.  You can verify the actual number of 
extents by checking how many runs of continuous 128k 'extents' there are.
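
If it helps, here is an illustrative sketch of that counting.  It assumes 
the extent-table layout printed by a reasonably current `filefrag -v` 
(older versions use a different layout) and a loose definition of "same 
run" (anything that doesn't force a head seek), so treat the numbers as 
approximate rather than exact:

import re
import subprocess
import sys

EXTENT_RE = re.compile(r"\s*\d+:\s*\d+\.\.\s*\d+:\s*(\d+)\.\.\s*(\d+):")
# Matches lines like:
#    0:        0..      31:      34304..     34335:     32:        encoded

def contiguous_runs(path):
    out = subprocess.run(["filefrag", "-v", path], capture_output=True,
                         text=True, check=True).stdout
    runs, prev = 0, None
    for line in out.splitlines():
        m = EXTENT_RE.match(line)
        if not m:
            continue
        start, end = int(m.group(1)), int(m.group(2))
        # Count a new run whenever reaching this entry needs a seek: it
        # starts behind the previous entry, or leaves a forward gap.
        if prev is None or not (prev[0] < start <= prev[1] + 1):
            runs += 1
        prev = (start, end)
    return runs

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f"{f}: {contiguous_runs(f)} contiguous run(s)")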

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Chunk root problem

2017-07-06 Thread Duncan
Daniel Brady posted on Wed, 05 Jul 2017 22:10:35 -0600 as excerpted:

> My system suddenly decided it did not want to mount my BTRFS setup. I
> recently rebooted the computer. When it came back, the file system was
> in read only mode. I gave it another boot, but now it does not want to
> mount at all. Anything I can do to recover? This is a Rockstor setup
> that I have had running for about a year.
> 
> uname -a
> Linux hobonas 4.10.6-1.el7.elrepo.x86_64 #1 SMP Sun Mar 26
> 12:19:32 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
> btrfs --version
> btrfs-progs v4.10.1

FWIW, open ctree failed is the btrfs-generic error, but the transid 
faileds may provide some help.

Addressing the easy answer first...

What btrfs raid mode was it configured for?  If raid56, you want the 
brand new 4.12 kernel at least, as there were serious bugs in previous 
kernels' raid56 mode.  DO NOT ATTEMPT A FIX OF RAID56 MODE WITH AN 
EARLIER KERNEL, IT'S VERY LIKELY TO ONLY CAUSE FURTHER DAMAGE!  But if 
you're lucky, kernel 4.12 can auto-repair it.

With those fixes the known bugs are fixed, but we'll need to wait a few 
cycles to see what the reports are.  Even then, however, due to the 
infamous parity-raid write hole and the fact that the parity isn't 
checksummed, it's not going to be as stable as raid1 or raid10 mode.  
Parity-checksumming will take a new implementation and I'm not sure if 
anyone's actually working on that or not.  But at least until we see how 
stable the newer raid56 code is, 2-4 kernel cycles, it's not recommended 
except for testing only, with even more backups than normal.

If you were raid1 or raid10 mode, the raid mode is stable so it's a 
different issue.  I'll let the experts take it from here.  Single or 
raid0 mode would of course be similar, but without the protection of the 
second copy, making it less resilient.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html