Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 7:30 PM, Kai Krakow  wrote:

> Both filesystems on this PC show similar corruption now - but they are
> connected to completely different buses (SATA3 bcache + 3x SATA2
> backing store bcache{0,1,2}, and USB3 without bcache = sde), use
> different compression (compress=lzo vs. compress-force=zlib), but
> similar redundancy scheme (draid=0,mraid=1 vs. draid=single,mraid=dup).
> A hardware problem would induce completely random errors on these
> paths.
>
> Completely different hardware shows similar problems - but that system
> is currently not available to me, and will stay there for a while
> (it's a non-production installation at my workplace). Why would similar
> errors show up here, if it'd be a hardware error of the first system?

Then there's something about the particular combination of mount
options you're using with the workload that's inducing this, if it's
reproducing on two different systems. What's the workload and what's
the full history of the mount options? It looks like the filesystem
started life with compress=lzo, later switched to compress-force=zlib,
and after that added space_cache=v2?
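
Just to make sure we're talking about the same thing, I mean roughly
this kind of progression over time (device and mount point are
placeholders, not your actual paths):

mount -o compress=lzo /dev/sdX /mnt                         # originally
mount -o compress-force=zlib /dev/sdX /mnt                  # later
mount -o compress-force=zlib,space_cache=v2 /dev/sdX /mnt   # most recently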

Hopefully Qu has some advice on what's next. It might not be a bad
idea to get a btrfs-image going.
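Something along these lines should capture a metadata-only image for
inspection (the device path is just an example; -s sanitizes file names
if that's a concern):

btrfs-image -c9 -t4 -s /dev/sde1 usb-backup-metadata.img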



-- 
Chris Murphy


Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 7:50 PM, Kai Krakow  wrote:

>
> # now let's wait for the backup to mount the FS and look at dmesg:
>
> [21375.606479] BTRFS info (device sde1): force zlib compression
> [21375.606483] BTRFS info (device sde1): using free space tree

You're using space_cache=v2. You're aware the new free space tree
option sets a read-only compat (compat_ro) feature flag on the file
system? You've got quite a few non-default mount options on this backup
volume. Hopefully Qu has some idea what to try next, or whether you're
better off just starting over with a new file system.
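
For the record, something like this (same tool you already used) makes
the flag easy to spot; presumably the compat_ro_flags 0x1 in your dump
below is exactly that:

btrfs-show-super /dev/sde1 | grep -E 'compat_ro_flags|incompat_flags'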



> I only saw unreliable behavior with 4.4.5, 4.4.6, and 4.5.0, though
> the problem may have existed longer in my FS.
>
> $ sudo btrfs-show-super /dev/sde1
> superblock: bytenr=65536, device=/dev/sde1
> -
> csum                    0xcc976d97 [match]
> bytenr                  65536
> flags                   0x1
>                         ( WRITTEN )
> magic                   _BHRfS_M [match]
> fsid                    1318ec21-c421-4e36-a44a-7be3d41f9c3f
> label                   usb-backup
> generation              50814
> root                    1251250159616
> sys_array_size          129
> chunk_root_generation   50784
> root_level              1
> chunk_root              2516518567936
> chunk_root_level        1
> log_root                0
> log_root_transid        0
> log_root_level          0
> total_bytes             2000397864960
> bytes_used              1860398493696
> sectorsize              4096
> nodesize                16384
> leafsize                16384
> stripesize              4096
> root_dir                6
> num_devices             1
> compat_flags            0x0
> compat_ro_flags         0x1
> incompat_flags          0x169
>                         ( MIXED_BACKREF |
>                           COMPRESS_LZO |
>                           BIG_METADATA |
>                           EXTENDED_IREF |
>                           SKINNY_METADATA )
> csum_type               0
> csum_size               4
> cache_generation        50208
> uuid_tree_generation    50742
> dev_item.uuid           9008d5a0-ac7b-4505-8193-27428429f953
> dev_item.fsid           1318ec21-c421-4e36-a44a-7be3d41f9c3f [match]
> dev_item.type           0
> dev_item.total_bytes    2000397864960
> dev_item.bytes_used     1912308039680
> dev_item.io_align       4096
> dev_item.io_width       4096
> dev_item.sector_size    4096
> dev_item.devid          1
> dev_item.dev_group      0
> dev_item.seek_speed     0
> dev_item.bandwidth      0
> dev_item.generation     0
>
>
>
> BTW: btrfsck thinks that the space tree is invalid every time it is
> run, no matter whether the FS was cleanly unmounted, uncleanly
> unmounted, or repaired with "btrfsck --repair" and checked a second time.


-- 
Chris Murphy


Re: RAID-1 refuses to balance large drive

2016-03-26 Thread Brad Templeton



For those curious as to the result, the reduction to single and
restoration to RAID1 did indeed balance the array. It was extremely
slow, of course, on a 12TB array. I did not bother doing this with the
metadata. I also stopped the conversion to single once it had freed up
enough space on the 2 smaller drives, because at that point it was moving
data onto the big drive, which seemed sub-optimal considering what was
to come.
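
In command form, it was roughly this (mount point is a placeholder; as
noted above, I cancelled the first conversion partway through once the
smaller drives had enough free space):

btrfs balance start -dconvert=single /mnt
btrfs balance cancel /mnt
btrfs balance start -dconvert=raid1 /mnt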

In general, obviously, I hope the long term goal is to not need this,
indeed not to need manual balance at all.   I would hope the goal is to
just be able to add and remove drives, tell the system what type of
redundancy you need and let it figure out the rest.  But I know this is
an FS in development.

I've actually come to feel that when it comes to personal drive arrays,
we need something much smarter than today's filesystems. The truth is,
for example, that once my infrequently accessed files, such as old
photo and video archives, have a solid backup made, there is no real
need to keep them redundant at all, except for speed, while the much
smaller volume of frequently accessed files does need that (or even
extra redundancy, not for safety but for extra speed, and of course a
cache on an SSD is even better). This requires not just the filesystem
and OS to get smarter about this, but even the apps. It may happen some
day -- no matter how cheap storage gets, we keep coming up with ways to
fill it.

Thanks for the help.


Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Kai Krakow
Am Sat, 26 Mar 2016 20:30:35 +0100
schrieb Kai Krakow :

> Am Wed, 23 Mar 2016 12:16:24 +0800
> schrieb Qu Wenruo :
> 
> > Kai Krakow wrote on 2016/03/22 19:48 +0100:  
> > > Am Tue, 22 Mar 2016 16:47:10 +0800
> > > schrieb Qu Wenruo :
> > >
>  [...]  
> >  [...]
>  [...]  
> > >
> > > Apparently, that system does not boot now due to errors in the
> > > bcache b-tree. That being so, it may well be some bcache error and
> > > not btrfs' fault. Apparently I couldn't catch the output, I was
> > > in a hurry. It said "write error" and had some backtrace. I will
> > > come back to this later.
> > >
> > > Let's go to the system I currently care about (that one with the
> > > always breaking VDI file):
> > >
> >  [...]
>  [...]  
> > >
> > > After the error occurred?
> > >
> > > Yes, some text about the extent being compressed and btrfs repair
> > > not currently handling that case (I tried --repair as I have
> > > a backup). I simply decided not to investigate that further at
> > > that point but to delete and restore the affected file from
> > > backup. However, this is the message from dmesg (though I didn't
> > > catch the backtrace):
> > >
> > > btrfs_run_delayed_refs:2927: errno=-17 Object already exists
> > 
> > That's nice, at least we have some clue.
> > 
> > It's almost certainly a bug: either the btrfs kernel doesn't
> > handle delayed refs well (low possibility), or a corrupted fs
> > created something the kernel can't handle (I bet that's the case).
> 
> [kernel 4.5.0 gentoo, btrfs-progs 4.4.1]
> 
> Well, this time it hit me on the USB backup drive which uses no bcache
> and no other fancy options except compress-force=zlib. Apparently,
> I've only got a (real) screenshot which I'm going to link here:
> 
> https://www.dropbox.com/s/9qbc7np23y8lrii/IMG_20160326_200033.jpg?dl=0
> 
> The same drive has no problems except "bad metadata crossing stripe
> boundary" - but a lot of them. This drive was never converted, it was
> freshly generated several months ago.
> [...]

I finally got copy data:

# before mounting let's check the FS:

$ sudo btrfsck /dev/disk/by-label/usb-backup 
Checking filesystem on /dev/disk/by-label/usb-backup
UUID: 1318ec21-c421-4e36-a44a-7be3d41f9c3f
checking extents
bad metadata [156041216, 156057600) crossing stripe boundary
bad metadata [181403648, 181420032) crossing stripe boundary
bad metadata [392167424, 392183808) crossing stripe boundary
bad metadata [783482880, 783499264) crossing stripe boundary
bad metadata [784924672, 784941056) crossing stripe boundary
bad metadata [130151612416, 130151628800) crossing stripe boundary
bad metadata [162826813440, 162826829824) crossing stripe boundary
bad metadata [162927083520, 162927099904) crossing stripe boundary
bad metadata [619740659712, 619740676096) crossing stripe boundary
bad metadata [619781947392, 619781963776) crossing stripe boundary
bad metadata [619795644416, 619795660800) crossing stripe boundary
bad metadata [619816091648, 619816108032) crossing stripe boundary
bad metadata [620011388928, 620011405312) crossing stripe boundary
bad metadata [890992459776, 890992476160) crossing stripe boundary
bad metadata [891022737408, 891022753792) crossing stripe boundary
bad metadata [891101773824, 891101790208) crossing stripe boundary
bad metadata [891301199872, 891301216256) crossing stripe boundary
bad metadata [1012219314176, 1012219330560) crossing stripe boundary
bad metadata [1017202409472, 1017202425856) crossing stripe boundary
bad metadata [1017365397504, 1017365413888) crossing stripe boundary
bad metadata [1020764422144, 1020764438528) crossing stripe boundary
bad metadata [1251103342592, 1251103358976) crossing stripe boundary
bad metadata [1251144695808, 1251144712192) crossing stripe boundary
bad metadata [1251147055104, 1251147071488) crossing stripe boundary
bad metadata [1259271225344, 1259271241728) crossing stripe boundary
bad metadata [1266223611904, 1266223628288) crossing stripe boundary
> bad metadata [1304750063616, 1304750080000) crossing stripe boundary
bad metadata [1304790106112, 1304790122496) crossing stripe boundary
bad metadata [1304850792448, 1304850808832) crossing stripe boundary
bad metadata [1304869928960, 1304869945344) crossing stripe boundary
bad metadata [1305089540096, 1305089556480) crossing stripe boundary
bad metadata [1309561651200, 1309561667584) crossing stripe boundary
bad metadata [1309581443072, 1309581459456) crossing stripe boundary
bad metadata [1309583671296, 1309583687680) crossing stripe boundary
bad metadata [1309942808576, 1309942824960) crossing stripe boundary
bad metadata [1310050549760, 1310050566144) crossing stripe boundary
bad metadata [1313031585792, 1313031602176) crossing stripe boundary
bad metadata [1313232912384, 1313232928768) crossing stripe boundary
bad metadata [1555210764288, 1555210780672) crossing stripe boundary
bad metadata [1555395182592, 

Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Kai Krakow
Am Sat, 26 Mar 2016 15:04:13 -0600
schrieb Chris Murphy :

> On Sat, Mar 26, 2016 at 2:28 PM, Chris Murphy
>  wrote:
> > On Sat, Mar 26, 2016 at 1:30 PM, Kai Krakow 
> > wrote: 
> >> Well, this time it hit me on the USB backup drive which uses no
> >> bcache and no other fancy options except compress-force=zlib.
> >> Apparently, I've only got a (real) screenshot which I'm going to
> >> link here:
> >>
> >> https://www.dropbox.com/s/9qbc7np23y8lrii/IMG_20160326_200033.jpg?dl=0  
> >
> > This is a curious screen shot. It's a dracut pre-mount shell, so
> > nothing should be mounted yet. And btrfs check only works on an
> > unmounted file system. And yet the bottom part of the trace shows a
> > Btrfs volume being made read only, as if it was mounted read write
> > and is still mounted. Huh?  
> 
> Wait. You said no bcache, and yet in this screen shot it shows 'btrfs
> check /dev/bcache2 ...' right before the back trace.
> 
> This thread is confusing. You're talking about two different btrfs
> volumes intermixed, one uses bcache the other doesn't, yet they both
> have corruption. I think it's hardware related: bad cable, bad RAM,
> bad power, something.  

No it's not, it's tested. That system ran rock stable until somewhere
in the 4.4 kernel series (probably). It ran high loads without problems
(loadavg >50), it ran huge concurrent IO copies without problems, it
survived unintentional reboots without FS corruption, and it ran
VirtualBox VMs without problems. And the system still runs almost
without problems: except for the "object already exists" error which
forced my rootfs RO, I did not even notice that the FS has corruption:
nothing in dmesg, everything fine. There's just VirtualBox crashing a
VM now, and I see csum errors in that very VDI file - even after
recovering the file from backup, it happens again and again. Qu
mentioned that this may be a follow-up of other corruption - and tada:
yes, there are lots of them now (my last check was back in the 4.1 or
4.2 series). But because I can still rsync all my important files, I'd
like to get my backup drive into a sane state again first.

Both filesystems on this PC show similar corruption now - but they are
connected to completely different buses (SATA3 bcache + 3x SATA2
backing store bcache{0,1,2}, and USB3 without bcache = sde), use
different compression (compress=lzo vs. compress-force=zlib), but
similar redundancy scheme (draid=0,mraid=1 vs. draid=single,mraid=dup).
A hardware problem would induce completely random errors on these
paths.

Completely different hardware shows similar problems - but that system
is currently not available to me, and will stay there for a while
(it's a non-production installation at my workplace). Why would similar
errors show up here, if it'd be a hardware error of the first system?

Meanwhile, I conclude we can rule out bcache or hardware - three file
systems show similar errors:

1. bcache on Crucial MX100 SATA3, 3x SATA2 backing HDD
2. bcache on Samsung Evo 850 SATA2, 1x SATA1 backing HDD
3. 1x plain USB3 btrfs (no bcache)

Not even the SSD hardware is the same, just the system configuration in
general (Gentoo kernel, rootfs on btrfs) and workload (I do lots of
similar things on both machines).

I need to grab the errors for machine setup 2 - though I can't do that
currently; that system is offline and will be for a while.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Kai Krakow
Am Sat, 26 Mar 2016 14:28:22 -0600
schrieb Chris Murphy :

> On Sat, Mar 26, 2016 at 1:30 PM, Kai Krakow 
> wrote:
> 
> > Well, this time it hit me on the USB backup drive which uses no
> > bcache and no other fancy options except compress-force=zlib.
> > Apparently, I've only got a (real) screenshot which I'm going to
> > link here:
> >
> > https://www.dropbox.com/s/9qbc7np23y8lrii/IMG_20160326_200033.jpg?dl=0  
> 
> This is a curious screen shot. It's a dracut pre-mount shell, so
> nothing should be mounted yet. And btrfs check only works on an
> unmounted file system. And yet the bottom part of the trace shows a
> Btrfs volume being made read only, as if it was mounted read write and
> is still mounted. Huh?

It's a pre-mount shell because I wanted to check the rootfs from there.
I mounted it once (and unmounted) before checking (that's
bcache{0,1,2}). Yeah, you can get there forcibly by using
rd.break=pre-mount - and yeah, nothing "should" be mounted unless I did
so previously. But I cut that away as it contained errors unrelated to
this problem and would have been even more confusing.

The file system that then failed was the one I had just mounted to put
the stdout of btrfsck on (sde1). That one showed these kernel console
logs (the screenshot) right in the middle of my typing the command - so
a few seconds after mounting.

What may be confusing to you: I use more than one btrfs. ;-)

bcache{0,1,2} = my rootfs (plus subvolumes)
sde = my USB3 backup drive (or whatever the kernel assigns)

Both run btrfs. The bcache*'s have their own problems currently, I'd
like to set those aside first and get the backup drive back in good
shape. The latter seems easier to fix.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: RAID Assembly with Missing Empty Drive

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 3:01 PM, John Marrett  wrote:
>> Well off hand it seems like the missing 2.73TB has nothing on it at
>> all, and doesn't need to be counted as missing. The other missing is
>> counted, and should have all of its data replicated elsewhere. But
>> then you're running into csum errors. So something still isn't right,
>> we just don't understand what it is.
>
> I'm not sure what we can do to get a better understanding of these
> errors, that said it may not be necessary if replace helps, more
> below.
>
>> Btrfs replace has been around for a while. See 'man btrfs-replace': the
>> command takes the form 'btrfs replace start' plus three required
>> pieces of information. You should be able to infer the missing devid
>> using 'btrfs fi show'; looks like it's 6.
>
> I was looking under btrfs device, sorry about that. I do have the
> command. I tried replace and it seemed more promising than the last
> attempt, it wrote enough data to the new drive to overflow and break
> my overlay. I'm trying it without the overlay on the destination
> device, I'll report back later with the results.
>
> I'm running ubuntu linux-image-4.2.0-34-generic with a patch to remove
> this check:
>
> https://github.com/torvalds/linux/blob/master/fs/btrfs/super.c#L1770
>
> I can switch to whatever kernel though as desired. Would you prefer a
> mainline ubuntu packaged kernel? Straight from kernel.org?

Things are a lot more deterministic for developers and testers if
you're using something current. It might not matter in this case that
you're using 4.2, but all you have to do is look at the git pulls in
the list archives to see many hundreds, often over 1000, btrfs changes
per kernel cycle. So, lots and lots of fixes have happened since 4.2.
And any bugs found in 4.2 don't really matter, because you'd have to
reproduce them in 4.4.6 or 4.5 anyway, and then the fix would go into
4.6 before being backported, and 4.2 won't be getting backports from
upstream. That's why list folks always suggest using something so
recent. Again, in this case it might not matter; I don't read or
understand every single commit.

If you do want to use a newer one, I'd build against kernel.org, just
because the developers only use that base. And use 4.4.6 or 4.5.
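
A minimal sketch of that, assuming you reuse your current distro config
(install steps vary by distro):

wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.5.tar.xz
tar xf linux-4.5.tar.xz && cd linux-4.5
cp /boot/config-$(uname -r) .config && make olddefconfig
make -j$(nproc)
sudo make modules_install install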

It's reasonable to keep the overlay on the existing devices, but
remove the overlay for the replacement so that you're directly writing
to it. If that blows up with 4.2 you can still start over with a newer
kernel. *shrug*


-- 
Chris Murphy


Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 2:28 PM, Chris Murphy  wrote:
> On Sat, Mar 26, 2016 at 1:30 PM, Kai Krakow  wrote:
>
>> Well, this time it hit me on the USB backup drive which uses no bcache
>> and no other fancy options except compress-force=zlib. Apparently, I've
>> only got a (real) screenshot which I'm going to link here:
>>
>> https://www.dropbox.com/s/9qbc7np23y8lrii/IMG_20160326_200033.jpg?dl=0
>
> This is a curious screen shot. It's a dracut pre-mount shell, so
> nothing should be mounted yet. And btrfs check only works on an
> unmounted file system. And yet the bottom part of the trace shows a
> Btrfs volume being made read only, as if it was mounted read write and
> is still mounted. Huh?

Wait. You said no bcache, and yet in this screen shot it shows 'btrfs
check /dev/bcache2 ...' right before the back trace.

This thread is confusing. You're talking about two different btrfs
volumes intermixed, one uses bcache the other doesn't, yet they both
have corruption. I think it's hardware related: bad cable, bad RAM,
bad power, something.



-- 
Chris Murphy


Re: RAID Assembly with Missing Empty Drive

2016-03-26 Thread John Marrett
> Well off hand it seems like the missing 2.73TB has nothing on it at
> all, and doesn't need to be counted as missing. The other missing is
> counted, and should have all of its data replicated elsewhere. But
> then you're running into csum errors. So something still isn't right,
> we just don't understand what it is.

I'm not sure what we can do to get a better understanding of these
errors, that said it may not be necessary if replace helps, more
below.

> Btrfs replace has been around for a while. See 'man btrfs-replace': the
> command takes the form 'btrfs replace start' plus three required
> pieces of information. You should be able to infer the missing devid
> using 'btrfs fi show'; looks like it's 6.

I was looking under btrfs device, sorry about that. I do have the
command. I tried replace and it seemed more promising than the last
attempt, it wrote enough data to the new drive to overflow and break
my overlay. I'm trying it without the overlay on the destination
device, I'll report back later with the results.

I'm running ubuntu linux-image-4.2.0-34-generic with a patch to remove
this check:

https://github.com/torvalds/linux/blob/master/fs/btrfs/super.c#L1770

I can switch to whatever kernel though as desired. Would you prefer a
mainline ubuntu packaged kernel? Straight from kernel.org?

-JohnF


Re: Possible Raid Bug

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 8:00 AM, Stephen Williams  wrote:

> I know this is quite a rare occurrence for home use but for Data center
> use this is something that will happen A LOT.
> This really should be placed in the wiki while we wait for a fix. I can
> see a lot of sys admins crying over this.

Maybe on the gotchas page? While it's not a data loss bug, it might be
viewed as an uptime bug, because the dataset is stuck being ro and
hence unmodifiable until a restore to a rw volume is complete.

Since we can ro mount a volume, some way to safely make it a seed
device could be useful. All that's needed to make it rw is adding even
a small USB stick, for example, and then at least ro snapshots can be
taken and data migrated off the volume. A larger device used for rw
would allow this raid to be brought back online. And then, once the new
array is up and has most data restored, a short downtime is enough to
send the latest incremental changes over.
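
For illustration, the usual seed-device dance on a healthy, unmounted
filesystem is roughly this (device names are placeholders; the point
above is that there's no equally clean path once the volume is already
degraded):

btrfstune -S 1 /dev/sdX          # mark the old fs as a seed; it mounts ro
mount /dev/sdX /mnt
btrfs device add /dev/sdY /mnt   # sprout onto a writable device
mount -o remount,rw /mnt         # new writes land on /dev/sdY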

Yeah, the alternative to this is a cluster, and you just consider this
one brick a loss and move on. But most regular users don't do
clusters, even with big (for them) storage.


-- 
Chris Murphy


Re: Possible Raid Bug

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 5:51 AM, Patrik Lundquist
 wrote:

> # btrfs replace start -B 4 /dev/sde /mnt; dmesg | tail
>
> # btrfs device stats /mnt
>
> [/dev/sde].write_io_errs   0
> [/dev/sde].read_io_errs    0
> [/dev/sde].flush_io_errs   0
> [/dev/sde].corruption_errs 0
> [/dev/sde].generation_errs 0
>
> We didn't inherit the /dev/sde error count. Is that a bug?

I'm not sure where this information is stored. Presumably in the fs
metadata? So when mounted degraded the counters read as zero - is that
what's going on?



-- 
Chris Murphy


Re: RAID Assembly with Missing Empty Drive

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 6:15 AM, John Marrett  wrote:
> Chris,
>
>> Post 'btrfs fi usage' for the fileystem. That may give some insight
>> what's expected to be on all the missing drives.
>
> Here's the information, I believe that the missing we see in most
> entries is the failed and absent drive, only the unallocated shows two
> missing entries, the 2.73 TB is the missing but empty device. I don't
> know if there's a way to prove it however.
>
> ubuntu@btrfs-recovery:~$ sudo btrfs fi usage /mnt
> Overall:
> Device size:  15.45TiB
> Device allocated:  12.12TiB
> Device unallocated:   3.33TiB
> Device missing:   5.46TiB
> Used:  10.93TiB
> Free (estimated):   2.25TiB(min: 2.25TiB)
> Data ratio:  2.00
> Metadata ratio:  2.00
> Global reserve: 512.00MiB(used: 0.00B)
>
> Data,RAID1: Size:6.04TiB, Used:5.46TiB
>/dev/sda   2.61TiB
>/dev/sdb   1.71TiB
>/dev/sdc   1.72TiB
>/dev/sdd   1.72TiB
>/dev/sdf   1.71TiB
>missing   2.61TiB
>
> Metadata,RAID1: Size:14.00GiB, Used:11.59GiB
>/dev/sda   8.00GiB
>/dev/sdb   2.00GiB
>/dev/sdc   3.00GiB
>/dev/sdd   4.00GiB
>/dev/sdf   3.00GiB
>missing   8.00GiB
>
> System,RAID1: Size:32.00MiB, Used:880.00KiB
>/dev/sda  32.00MiB
>missing  32.00MiB
>
> Unallocated:
>/dev/sda 111.49GiB
>/dev/sdb  98.02GiB
>/dev/sdc  98.02GiB
>/dev/sdd  98.02GiB
>/dev/sdf  98.02GiB
>missing 111.49GiB
>missing   2.73TiB
>
> I tried to remove missing, first remove missing only removes the
> 2.73TiB missing entry seen above. All the other missing entries
> remain.

Well off hand it seems like the missing 2.73TB has nothing on it at
all, and doesn't need to be counted as missing. The other missing is
counted, and should have all of its data replicated elsewhere. But
then you're running into csum errors. So something still isn't right,
we just don't understand what it is.


> I can't "replace", it's not a valid command on my btrfs tools version;
> I upgraded btrfs this morning in order to have the btrfs fi usage
> command.

Btrfs replace has been around for a while. See 'man btrfs-replace': the
command takes the form 'btrfs replace start' plus three required
pieces of information. You should be able to infer the missing devid
using 'btrfs fi show'; looks like it's 6.
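
Something like this, where 6 is the missing devid and /dev/sdg stands
in for whatever the new drive shows up as on your system:

btrfs replace start -B 6 /dev/sdg /mnt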



> ubuntu@btrfs-recovery:~$ sudo btrfs version
> btrfs-progs v4.0
> ubuntu@btrfs-recovery:~$ dpkg -l | grep btrfs
> ii  btrfs-tools4.0-2
>  amd64Checksumming Copy on Write Filesystem utilities

I would use something newer, but btrfs replace is in 4.0. But I also
don't see in this thread what kernel version you're using.



-- 
Chris Murphy


Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Chris Murphy
On Sat, Mar 26, 2016 at 1:30 PM, Kai Krakow  wrote:

> Well, this time it hit me on the USB backup drive which uses no bcache
> and no other fancy options except compress-force=zlib. Apparently, I've
> only got a (real) screenshot which I'm going to link here:
>
> https://www.dropbox.com/s/9qbc7np23y8lrii/IMG_20160326_200033.jpg?dl=0

This is a curious screen shot. It's a dracut pre-mount shell, so
nothing should be mounted yet. And btrfs check only works on an
unmounted file system. And yet the bottom part of the trace shows a
Btrfs volume being made read only, as if it was mounted read write and
is still mounted. Huh?


Chris Murphy


Re: csum errors in VirtualBox VDI files

2016-03-26 Thread Kai Krakow
Am Wed, 23 Mar 2016 12:16:24 +0800
schrieb Qu Wenruo :

> Kai Krakow wrote on 2016/03/22 19:48 +0100:
> > Am Tue, 22 Mar 2016 16:47:10 +0800
> > schrieb Qu Wenruo :
> >  
> >> Hi,
> >>
> >> Kai Krakow wrote on 2016/03/22 09:03 +0100:  
>  [...]  
> >>
> >> When it goes RO, it must have some warning in kernel log.
> >> Would you please paste the kernel log?  
> >
> > Apparently, that system does not boot now due to errors in the bcache
> > b-tree. That being so, it may well be some bcache error and not
> > btrfs' fault. Apparently I couldn't catch the output, I was in a
> > hurry. It said "write error" and had some backtrace. I will come
> > back to this later.
> >
> > Let's go to the system I currently care about (that one with the
> > always breaking VDI file):
> >  
>  [...]  
> >> Does btrfs check report anything wrong?  
> >
> > After the error occurred?
> >
> > Yes, some text about the extent being compressed and btrfs repair
> > not currently handling that case (I tried --repair as I have a
> > backup). I simply decided not to investigate that further at that
> > point but to delete and restore the affected file from backup.
> > However, this is the message from dmesg (though I didn't catch the
> > backtrace):
> >
> > btrfs_run_delayed_refs:2927: errno=-17 Object already exists  
> 
> That's nice, at least we have some clue.
> 
> It's almost certainly a bug: either the btrfs kernel doesn't
> handle delayed refs well (low possibility), or a corrupted fs
> created something the kernel can't handle (I bet that's the case).

[kernel 4.5.0 gentoo, btrfs-progs 4.4.1]

Well, this time it hit me on the USB backup drive which uses no bcache
and no other fancy options except compress-force=zlib. Apparently, I've
only got a (real) screenshot which I'm going to link here:

https://www.dropbox.com/s/9qbc7np23y8lrii/IMG_20160326_200033.jpg?dl=0

The same drive has no problems except "bad metadata crossing stripe
boundary" - but a lot of them. This drive was never converted, it was
freshly generated several months ago.

---8<---
$ sudo btrfsck /dev/disk/by-label/usb-backup 
Checking filesystem on /dev/disk/by-label/usb-backup
UUID: 1318ec21-c421-4e36-a44a-7be3d41f9c3f
checking extents
bad metadata [156041216, 156057600) crossing stripe boundary
bad metadata [181403648, 181420032) crossing stripe boundary
bad metadata [392167424, 392183808) crossing stripe boundary
bad metadata [783482880, 783499264) crossing stripe boundary
bad metadata [784924672, 784941056) crossing stripe boundary
bad metadata [130151612416, 130151628800) crossing stripe boundary
bad metadata [162826813440, 162826829824) crossing stripe boundary
bad metadata [162927083520, 162927099904) crossing stripe boundary
bad metadata [619740659712, 619740676096) crossing stripe boundary
bad metadata [619781947392, 619781963776) crossing stripe boundary
bad metadata [619795644416, 619795660800) crossing stripe boundary
bad metadata [619816091648, 619816108032) crossing stripe boundary
bad metadata [620011388928, 620011405312) crossing stripe boundary
bad metadata [890992459776, 890992476160) crossing stripe boundary
bad metadata [891022737408, 891022753792) crossing stripe boundary
bad metadata [891101773824, 891101790208) crossing stripe boundary
bad metadata [891301199872, 891301216256) crossing stripe boundary
[...]
--->8---

My main drive (which this thread was about) has a huge number of
different problems according to btrfsck. Repair doesn't work: it says
something about overlapping extents and that it needs careful thought.
I wanted to catch the output when the above problem occurred. So I'd
like to defer that until later and first fix my backup drive. If I
lose my main drive, I'll simply restore from backup. It is very old
anyway (still using 4k node size). Only downside: it takes 24+ hours
to restore.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Possible Raid Bug

2016-03-26 Thread Stephen Williams
Can confirm that you only get one chance to fix the problem before the
array is dead.

I know this is quite a rare occurrence for home use but for Data center
use this is something that will happen A LOT. 
This really should be placed in the wiki while we wait for a fix. I can
see a lot of sys admins crying over this. 

-- 
  Stephen Williams
  steph...@veryfast.biz

On Sat, Mar 26, 2016, at 11:51 AM, Patrik Lundquist wrote:
> So with the lessons learned:
> 
> # mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
> 
> # mount /dev/sdb /mnt; dmesg | tail
> # touch /mnt/test1; sync; btrfs device usage /mnt
> 
> Only raid10 profiles.
> 
> # echo 1 >/sys/block/sde/device/delete
> 
> We lost a disk.
> 
> # touch /mnt/test2; sync; dmesg | tail
> 
> We've got write errors.
> 
> # btrfs device usage /mnt
> 
> No 'single' profiles because we haven't remounted yet.
> 
> # reboot
> # wipefs -a /dev/sde; reboot
> 
> # mount -o degraded /dev/sdb /mnt; dmesg | tail
> # btrfs device usage /mnt
> 
> Still only raid10 profiles.
> 
> # touch /mnt/test3; sync; btrfs device usage /mnt
> 
> Now we've got 'single' profiles. Replace now or get hosed.
> 
> # btrfs replace start -B 4 /dev/sde /mnt; dmesg | tail
> 
> # btrfs device stats /mnt
> 
> [/dev/sde].write_io_errs   0
> [/dev/sde].read_io_errs    0
> [/dev/sde].flush_io_errs   0
> [/dev/sde].corruption_errs 0
> [/dev/sde].generation_errs 0
> 
> We didn't inherit the /dev/sde error count. Is that a bug?
> 
> # btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft \
>     -sconvert=raid10,soft -vf /mnt; dmesg | tail
> 
> # btrfs device usage /mnt
> 
> Back to only 'raid10' profiles.
> 
> # umount /mnt; mount /dev/sdb /mnt; dmesg | tail
> 
> # btrfs device stats /mnt
> 
> [/dev/sde].write_io_errs   11
> [/dev/sde].read_io_errs0
> [/dev/sde].flush_io_errs   2
> [/dev/sde].corruption_errs 0
> [/dev/sde].generation_errs 0
> 
> The old counters are back. That's good, but wtf?
> 
> # btrfs device stats -z /dev/sde
> 
> Give /dev/sde a clean bill of health. Won't warn when mounting again.


Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method

2016-03-26 Thread Qu Wenruo



On 03/25/2016 11:11 PM, Chris Mason wrote:

On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:



Chris Mason wrote on 2016/03/24 16:58 -0400:

Are you storing the entire hash, or just the parts not represented in
the key?  I'd like to keep the on-disk part as compact as possible for
this part.


Currently, it's the entire hash.

More details can be found in another mail.

Although it's OK for me to truncate the duplicated last 8 bytes (64 bits),
I still quite like the current implementation, as a single memcpy() is simpler.


[ sorry FB makes urls look ugly, so I delete them from replys ;) ]

Right, I saw that but wanted to reply to the specific patch.  One of the
lessons learned from the extent allocation tree and file extent items is
they are just too big.  Let's save those bytes, it'll add up.


OK, I'll reduce the duplicated last 8 bytes.

And I'll also remove the "length" member, as it can always be fetched
from dedupe_info->block_size.


The length itself was used to verify whether we are in the transaction
that switches to a new dedupe size, but since we later switched to a
full sync_fs(), such behavior is not needed any more.










+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * all its content is hash.
+ * So no special item struct is needed.
+ */
+


Can we do this instead with a backref from the extent?  It'll save us a
huge amount of IO as we delete things.


That's the original implementation from Liu Bo.

The problem is, it changes the data backref rules (originally, only an
EXTENT_DATA item can cause a data backref), and will make dedupe INCOMPAT
instead of the current RO_COMPAT.
So I really don't want to change the data backref rule.


Let me reread this part, the cost of maintaining the second index is
dramatically higher than adding a backref.  I do agree that it's nice
to be able to delete the dedup trees without impacting the rest, but
over the long term I think we'll regret the added balances.


Thanks for pointing the problem. Yes, I didn't even consider this fact.

But, on the other hand, such removal only happens when we remove the
*last* reference to the extent.
So, for the medium to high dedupe rate case, that path is not hit that
frequently, which reduces the impact.

(Which is quite different for non-dedupe case)

And for the low dedupe rate case, why use dedupe anyway? In that case,
compression would be much more appropriate if the user just wants to
reduce disk usage, IMO.



Another reason I don't want to touch the delayed-ref code is that it
has already caused us quite some pain.

We have been fighting with delayed refs from the beginning.
Delayed refs, especially the ability to run them asynchronously, are
the biggest problem we have met.


And that's why we added the ability to increase the data ref while
holding delayed_refs->lock in patch 5, and then use a long
lock-and-try-inc method to search the hash in patch 6.


Any modification to delayed refs can easily lead to new bugs (yes, I have
proven that several times myself).

So I choose to use current method.





If we only want to reduce on-disk space, just dropping the hash and making
DEDUPE_BYTENR_ITEM carry no data would be good enough.

As (bytenr, DEDUPE_BYTENR_ITEM) can locate the hash uniquely.


For the second index, the big problem is the cost of the btree
operations.  We're already pretty expensive in terms of the cost of
deleting an extent; with dedup it's 2x higher, and with dedup + extra
index, it's 3x higher.


The good news is, we only delete the hash bytenr item and its ref at the
last de-reference.
And in the normal (medium to high dedupe rate) case, that's not a frequent
operation IMHO.


Thanks,
Qu





In fact no code really checks the hash for the dedupe bytenr item; they all
just swap objectid and offset, reset the type and search for
DEDUPE_HASH_ITEM.

So it's OK to omit the hash.


If we have to go with the second index, I do agree here.

-chris


Re: RAID Assembly with Missing Empty Drive

2016-03-26 Thread John Marrett
Chris,

> Post 'btrfs fi usage' for the fileystem. That may give some insight
> what's expected to be on all the missing drives.

Here's the information, I believe that the missing we see in most
entries is the failed and absent drive, only the unallocated shows two
missing entries, the 2.73 TB is the missing but empty device. I don't
know if there's a way to prove it however.

ubuntu@btrfs-recovery:~$ sudo btrfs fi usage /mnt
Overall:
Device size:  15.45TiB
Device allocated:  12.12TiB
Device unallocated:   3.33TiB
Device missing:   5.46TiB
Used:  10.93TiB
Free (estimated):   2.25TiB(min: 2.25TiB)
Data ratio:  2.00
Metadata ratio:  2.00
Global reserve: 512.00MiB(used: 0.00B)

Data,RAID1: Size:6.04TiB, Used:5.46TiB
   /dev/sda   2.61TiB
   /dev/sdb   1.71TiB
   /dev/sdc   1.72TiB
   /dev/sdd   1.72TiB
   /dev/sdf   1.71TiB
   missing   2.61TiB

Metadata,RAID1: Size:14.00GiB, Used:11.59GiB
   /dev/sda   8.00GiB
   /dev/sdb   2.00GiB
   /dev/sdc   3.00GiB
   /dev/sdd   4.00GiB
   /dev/sdf   3.00GiB
   missing   8.00GiB

System,RAID1: Size:32.00MiB, Used:880.00KiB
   /dev/sda  32.00MiB
   missing  32.00MiB

Unallocated:
   /dev/sda 111.49GiB
   /dev/sdb  98.02GiB
   /dev/sdc  98.02GiB
   /dev/sdd  98.02GiB
   /dev/sdf  98.02GiB
   missing 111.49GiB
   missing   2.73TiB

I tried to remove missing, first remove missing only removes the
2.73TiB missing entry seen above. All the other missing entries
remain.

I can't "replace", it's not a valid command on my btrfs tools version;
I upgraded btrfs this morning in order to have the btrfs fi usage
command.

ubuntu@btrfs-recovery:~$ sudo btrfs version
btrfs-progs v4.0
ubuntu@btrfs-recovery:~$ dpkg -l | grep btrfs
ii  btrfs-tools4.0-2
 amd64Checksumming Copy on Write Filesystem utilities

For those interested in my recovery techniques, here's how I rebuild
the overlay loop devices. Be careful: these scripts make certain
assumptions that may not be accurate for your system:

On Client:

sudo umount /mnt
sudo /etc/init.d/open-iscsi stop

On Server:

/etc/init.d/iscsitarget stop
loop_devices=$(losetup -a | grep overlay | tr ":" " " | \
  awk ' { printf $1 " " } END { print "" } ')
for fn in /dev/mapper/sd??; do dmsetup remove $fn; done
for ln in $loop_devices; do losetup -d $ln; done
cd /home/ubuntu
rm sd*overlay

for device in sda3 sdb3 sdc1 sdd1 sde1 sdf1; do
  dev=/dev/$device
  ovl=/home/ubuntu/$device-overlay
  truncate -s512M $ovl
  newdevname=$device
  size=$(blockdev --getsize "$dev")
  loop=$(losetup -f --show -- "$ovl")
  echo "Setting up loop for $dev using overlay $ovl on loop $loop for target $newdevname"
  printf '%s\n' "0 $size snapshot $dev $loop P 8" | dmsetup create "$newdevname"
done

Start the targets

/etc/init.d/iscsitarget start

On Client:

sudo /etc/init.d/open-iscsi start

-JohnF


Re: Possible Raid Bug

2016-03-26 Thread Patrik Lundquist
So with the lessons learned:

# mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# mount /dev/sdb /mnt; dmesg | tail
# touch /mnt/test1; sync; btrfs device usage /mnt

Only raid10 profiles.

# echo 1 >/sys/block/sde/device/delete

We lost a disk.

# touch /mnt/test2; sync; dmesg | tail

We've got write errors.

# btrfs device usage /mnt

No 'single' profiles because we haven't remounted yet.

# reboot
# wipefs -a /dev/sde; reboot

# mount -o degraded /dev/sdb /mnt; dmesg | tail
# btrfs device usage /mnt

Still only raid10 profiles.

# touch /mnt/test3; sync; btrfs device usage /mnt

Now we've got 'single' profiles. Replace now or get hosed.

# btrfs replace start -B 4 /dev/sde /mnt; dmesg | tail

# btrfs device stats /mnt

[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs    0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0

We didn't inherit the /dev/sde error count. Is that a bug?

# btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft \
    -sconvert=raid10,soft -vf /mnt; dmesg | tail

# btrfs device usage /mnt

Back to only 'raid10' profiles.

# umount /mnt; mount /dev/sdb /mnt; dmesg | tail

# btrfs device stats /mnt

[/dev/sde].write_io_errs   11
[/dev/sde].read_io_errs0
[/dev/sde].flush_io_errs   2
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0

The old counters are back. That's good, but wtf?

# btrfs device stats -z /dev/sde

Give /dev/sde a clean bill of health. Won't warn when mounting again.