Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Duncan
Kai Krakow posted on Sun, 03 Apr 2016 06:02:02 +0200 as excerpted:

> No, other files are affected, too. And it looks like those files are
> easily affected even when removed and recreated from whatever backup
> source.

I've seen you say that several times now, I think.  But none of those 
times has it apparently occurred to you to double-check whether it's the 
/same/ corruptions every time, or at least, if you checked it, I've not 
seen it actually /reported/.  (Note that I didn't say you didn't report 
it, only that I've not seen it.  A difference there is! =:^)

If I'm getting repeated corruptions of something, the first thing I'd 
check is whether there's some repeating pattern to those corruptions: 
same place in the file, same "wanted" (expected) value, same "got" 
(actually found) value, etc.

Then I'd try different variations like renaming the file, putting it in a 
different directory with all of the same other files, putting it in a 
different directory with all different files, putting it in a different 
directory by itself, putting it in the same directory but in a different 
subvolume... you get the point.

Then I'd try different mount options, with and without compression, with 
different kinds of compression, with compress-force and with simple 
compress, with and without autodefrag...

I could try it with nocow enabled for the file (note that the file has to 
be created with nocow before it gets content, for nocow to take effect), 
tho of course that'll turn off btrfs checksumming, but I could still for 
instance md5sum the original source and the nocowed test version and see 
if it tests clean that way.

I could try it with nocow on the file but with a bunch of snapshots 
interwoven with writing changes to the file (obviously this will kill 
comparison against the original, but I could arrange to write the same 
changes to the test file on btrfs, and to a control copy of the file on 
non-btrfs, and then md5sum or whatever compare them).

Then, if I had the devices available to do so, I'd try it in a different 
btrfs of the same layout (same redundancy mode and number of devices), 
both single and dup mode on a single device, etc.

And again if available, I'd try swapping the filesystem to different 
machines...

OK, so trying /all/ the above might be a bit overboard but I think you 
get the point.  Try to find some pattern or common element in the whole 
thing, and report back the results at least for the "simple" experiments 
like whether the corruption appears to be the same (same got at the same 
spot) or different, and whether putting the file in a different subdir or 
using a different name for it matters at all.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Kai Krakow
Am Sat, 2 Apr 2016 18:14:17 -0600
schrieb Chris Murphy :

> On Sat, Apr 2, 2016 at 2:16 PM, Kai Krakow wrote:
> 
> > I'll go check the RAM for problems - though that would be the first
> > time in twenty years that a faulty RAM module didn't have errors
> > from the beginning. Well, you never know. But I expect no error,
> > since bad RAM usually means all sorts of different and random
> > problems, which I don't have. My problems are very specific, which
> > is atypical for RAM errors.
> 
> Well so far it's just the VDI that's experiencing csum mismatch
> errors, right? So that's not bad RAM, which would affect other files
> too. And same for a failing SSD.

No, other files are affected, too. And it looks like those files are
easily affected even when removed and recreated from whatever backup
source.

> I think you've got a bug somewhere and it's just hard to say where it
> is based on the available information. I've already lost track if
> others have all of the exact same setup you do: bcache + nossd +
> autodefrag + lzo + VirtualBox writing to VDI on this Btrfs volume.
> There are others who have some of those options, but I don't know if
> there's anyone who has all of those going on.

I haven't run VirtualBox since the incident, so I'd rule out VirtualBox.
Currently, there seem to be no csum errors for the VDI file; instead
another file now gets corruptions, even after being recreated. I think
it is the result of another corruption and thus a side effect.

Also, I think having the options nossd+autodefrag+lzo shouldn't be an
exotic or unsupported combination. Having this on top of bcache should
just work.

Let's not rule out that bcache had a problem, although in that case I'd
usually expect bcache to freak out with internal btree corruption.

> Maybe Qu has some suggestions, but if it were me I'd do this. Build
> mainline 4.5.0, it's a known quantity by Btrfs devs.

4.5.0-gentoo currently adds only a few patches, so I could easily build
vanilla.

> Build the kernel
> with BTRFS_FS_CHECK_INTEGRITY enabled in kernel config. And when you
> mount the file system, don't use mount option check_int, just use your
> regular mount options and try to reproduce the VDI corruption. If you
> can reproduce it, then start over, this time with check_int mount
> option included along with the others you're using and try to
> reproduce. It's possible there will be fairly verbose kernel messages,
> so use boot parameter log_buf_len=1M and then that way you can use
> dmesg rather than depending on journalctl -k which sometimes drops
> messages if there are too many.

Does it make sense while I still have the corruptions in the FS? I'd
like to wait for Qu to say whether I should recreate the FS, take some
image, or send info to improve btrfsck...

I'm pretty sure I do not have reproducible corruptions that aren't
caused by another corruption - so check_int would probably be of little
use currently.

> If you reproduce the corruption while check_int is enabled, kernel
> messages should have clues and then you can put that in a file and
> attach to the list or open a bug. FWIW, I'm pretty sure your MUA is
> wrapping poorly, when I look at this URL for your post with smartctl
> output, it wraps in a way that's essentially impossible to sort out at
> a glance. Whether it's your MUA or my web browser pretty much doesn't
> matter, it's not legible so what I do is just attach as file to a bug
> report or if small enough onto the list itself.
> http://www.spinics.net/lists/linux-btrfs/msg53790.html

Claws Mail is just too smart for me... It showed up correctly in the
editor before I hit the send button. I wish I could go back to KNode
(which did its job right), but that's currently an unsupported orphan
project of KDE. :-(

> Finally, I would retest yet again with check_int_data as a mount
> option and try to reproduce. This is reported to be dirt slow, but it
> might capture something that check_int doesn't. But I admit this is
> throwing spaghetti on the wall, and is something of a goose chase just
> because I don't know what else to recommend other than iterating all
> of your mount options from none, adding just one at a time, and trying
> to reproduce. That somehow sounds more tedious. But chances are you'd
> find out what mount option is causing it; OR maybe you'd find out the
> corruption always happens, even with defaults, even without bcache, in
> which case that'd seem to implicate either a gentoo patch, or a
> virtual box bug of some sort.

I think the latter two are by far the least probable sorts of bugs. But
I'll give it a try. For the time being, I could switch bcache to
write-around mode - that way it at least could not corrupt btrfs during
writes.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: scrub: Tree block spanning stripes, ignored

2016-04-02 Thread Qu Wenruo



On 04/03/2016 12:29 AM, Ivan P wrote:

It's about 800Mb, I think I could upload that.

I ran it with the -s parameter, is that enough to remove all personal
info from the image?
Also, I had to run it with -w because otherwise it died on the same
corrupt node.


You can also use -c9 to further compress the data.

Thanks,
Qu
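For reference, a dump command combining the flags mentioned in this exchange might look like the following. The device and output filename are placeholders; check btrfs-image(8) for the exact flags in your progs version.

```shell
# -s   sanitize: zero out file names and similar private metadata
# -w   walk all trees manually (to get past the corrupt node)
# -c9  compress the image at the highest level, as Qu suggests
btrfs-image -s -w -c9 /dev/sdX metadata.img
```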


On Fri, Apr 1, 2016 at 2:25 AM, Qu Wenruo  wrote:



Ivan P wrote on 2016/03/31 18:04 +0200:


Ok, it will take a while until I can attempt repairing it, since I
will have to order a spare HDD to copy the data to.
Should I take some sort of debug snapshot of the fs so you can take a
look at it? I think I read somewhere about a snapshot that only
contains the fs metadata but not the data.


That's btrfs-image.

It would be good, but if your metadata is over 3G, I think it would
take a lot of time to upload.

Thanks,
Qu



Regards,
Ivan.

On Tue, Mar 29, 2016 at 3:57 AM, Qu Wenruo 
wrote:




Ivan P wrote on 2016/03/28 23:21 +0200:



Well, the file in this inode is fine, I was able to copy it off the
disk. However, rm-ing the file causes a segmentation fault. Shortly
after that, I get a kernel oops. Same thing happens if I attempt to
re-run scrub.

How can I delete that inode? Could deleting it destroy the filesystem
beyond repair?




The kernel oops should protect you from completely destroying the fs.

However, it seems that the problem is beyond the kernel's ability to
handle (hence the oops).

So no safe recovery method now.

  From now on, any repair advice from me *MAY* *destroy* your fs.
So please back up while you still can.


The best possible try would be "btrfsck --init-extent-tree --repair".

If it works, then mount it and run "btrfs balance start ".
Lastly, umount and use btrfsck to re-check if it fixes the problem.

Thanks,
Qu




Regards,
Ivan

On Mon, Mar 28, 2016 at 3:10 AM, Qu Wenruo 
wrote:





Ivan P wrote on 2016/03/27 16:31 +0200:




Thanks for the reply,

the raid1 array was created from scratch, so not converted from ext*.
I used btrfs-progs version 4.2.3 on kernel 4.2.5 to create the array,
btw.





I don't remember any strange behavior after 4.0, so no clue here.

Go to the subvolume 5 (the top-level subvolume), find inode 71723 and
try
to
remove it.
Then, use 'btrfs filesystem sync ' to sync the inode
removal.

Finally use latest btrfs-progs to check if the problem disappears.

This problem seems to be quite strange, so I can't locate the root
cause, but try to remove the file and hope the kernel can handle it.

Thanks,
Qu





Is there a way to fix the current situation without taking the whole
data off the disk?
I'm not familiar with file system terms, so what exactly could I have
lost, if anything?

Regards,
Ivan

On Sun, Mar 27, 2016 at 4:23 PM, Qu Wenruo wrote:



   On 03/27/2016 05:54 PM, Ivan P wrote:

   Read the info on the wiki, here's the rest of the requested
   information:

   # uname -r
   4.4.5-1-ARCH

   # btrfs fi show
   Label: 'ArchVault'  uuid: cd8a92b6-c5b5-4b19-b5e6-a839828d12d8
       Total devices 1 FS bytes used 2.10GiB
       devid 1 size 14.92GiB used 4.02GiB path /dev/sdc1

   Label: 'Vault'  uuid: 013cda95-8aab-4cb2-acdd-2f0f78036e02
       Total devices 2 FS bytes used 800.72GiB
       devid 1 size 931.51GiB used 808.01GiB path /dev/sda
       devid 2 size 931.51GiB used 808.01GiB path /dev/sdb

   # btrfs fi df /mnt/vault/
   Data, RAID1: total=806.00GiB, used=799.81GiB
   System, RAID1: total=8.00MiB, used=128.00KiB
   Metadata, RAID1: total=2.00GiB, used=936.20MiB
   GlobalReserve, single: total=320.00MiB, used=0.00B

   On Fri, Mar 25, 2016 at 3:16 PM, Ivan P wrote:

   Hello,

   using kernel 4.4.5 and btrfs-progs 4.4.1, I today ran a scrub on my
   2x1Tb btrfs raid1 array and it finished with 36 unrecoverable errors
   [1], all blaming the treeblock 741942071296. Running "btrfs check
   --readonly" on one of the devices lists that extent as corrupted [2].

   How can I recover, how much did I really lose, and how can I prevent
   it from happening again?
   If you need me to provide more info, do tell.

   [1] http://cwillu.com:8080/188.110.141.36/1


   This message itself is normal; it just means a tree block is
   crossing a 64K stripe boundary.
   And due to a scrub limitation, it can't check whether it's good or
   bad. But

   [2] http://pastebin.com/xA5zezqw

   This one is much more meaningful, showing several strange bugs.

   1. corrupt 

Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Chris Murphy
On Sat, Apr 2, 2016 at 2:16 PM, Kai Krakow  wrote:

> I'll go check the RAM for problems - though that would be the first
> time in twenty years that a faulty RAM module didn't have errors from
> the beginning. Well, you never know. But I expect no error, since bad
> RAM usually means all sorts of different and random problems, which I
> don't have. My problems are very specific, which is atypical for RAM
> errors.

Well so far it's just the VDI that's experiencing csum mismatch
errors, right? So that's not bad RAM, which would affect other files
too. And same for a failing SSD.

I think you've got a bug somewhere and it's just hard to say where it
is based on the available information. I've already lost track if
others have all of the exact same setup you do: bcache + nossd +
autodefrag + lzo + VirtualBox writing to VDI on this Btrfs volume.
There are others who have some of those options, but I don't know if
there's anyone who has all of those going on.

Maybe Qu has some suggestions, but if it were me, I'd do this: build
mainline 4.5.0; it's a known quantity to the Btrfs devs. Build the
kernel with BTRFS_FS_CHECK_INTEGRITY enabled in the kernel config. And
when you mount the file system, don't use the check_int mount option;
just use your regular mount options and try to reproduce the VDI
corruption. If you can reproduce it, then start over, this time with
the check_int mount option included along with the others you're
using, and try to reproduce. It's possible there will be fairly
verbose kernel messages, so use the boot parameter log_buf_len=1M;
that way you can use dmesg rather than depending on journalctl -k,
which sometimes drops messages if there are too many.
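Sketched as commands, the two rounds might look like this. The device, mount point, and the exact option set (taken from the bcache + nossd + autodefrag + lzo setup discussed in this thread) are assumptions, not a prescription:

```shell
# Assumes a kernel built with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and
# booted with log_buf_len=1M on the kernel command line.

# Round 1: regular options only; try to reproduce the corruption.
mount -o nossd,autodefrag,compress=lzo /dev/bcache0 /mnt/data

# Round 2: same options plus the integrity checker.
umount /mnt/data
mount -o nossd,autodefrag,compress=lzo,check_int /dev/bcache0 /mnt/data

# Once the corruption reproduces, capture the checker's output:
dmesg > check_int.log
```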

If you reproduce the corruption while check_int is enabled, kernel
messages should have clues, and then you can put those in a file and
attach it to the list or open a bug. FWIW, I'm pretty sure your MUA is
wrapping poorly; when I look at this URL for your post with smartctl
output, it wraps in a way that's essentially impossible to sort out at
a glance. Whether it's your MUA or my web browser pretty much doesn't
matter; it's not legible, so what I do is just attach the output as a
file to a bug report or, if small enough, onto the list itself.
http://www.spinics.net/lists/linux-btrfs/msg53790.html

Finally, I would retest yet again with check_int_data as a mount
option and try to reproduce. This is reported to be dirt slow, but it
might capture something that check_int doesn't. But I admit this is
throwing spaghetti at the wall, and is something of a goose chase just
because I don't know what else to recommend, other than iterating
through your mount options from none, adding just one at a time, and
trying to reproduce. That somehow sounds more tedious. But chances are
you'd find out which mount option is causing it; OR maybe you'd find
out the corruption always happens, even with defaults, even without
bcache, in which case that would seem to implicate either a Gentoo
patch or a VirtualBox bug of some sort.



-- 
Chris Murphy


Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Kai Krakow
Am Sat, 2 Apr 2016 19:17:55 +0200
schrieb Henk Slager :

> On Sat, Apr 2, 2016 at 11:00 AM, Kai Krakow wrote:
> > Am Fri, 1 Apr 2016 01:27:21 +0200
> > schrieb Henk Slager :
> >  
> >> It is not clear to me what 'Gentoo patch-set r1' is and does. So
> >> just boot a vanilla v4.5 kernel from kernel.org and see if you get
> >> csum errors in dmesg.  
> >
> > It is the gentoo patchset, I don't think anything there relates to
> > btrfs:
> > https://dev.gentoo.org/~mpagano/genpatches/trunk/4.5/
> >  
> >> Also, where does 'duplicate object' come from? dmesg ? then please
> >> post its surroundings, straight from dmesg.  
> >
> > It was in dmesg. I already posted it in the other thread and Qu took
> > note of it. Apparently, I didn't manage to capture anything else
> > than:
> >
> > btrfs_run_delayed_refs:2927: errno=-17 Object already exists
> >
> > It hit me unexpectedly. This was the first time btrfs went RO for me.
> > It was with kernel 4.4.5 I think.
> >
> > I suspect this is the outcome of unnoticed corruptions that sneaked
> > in earlier over some period of time. The system had no problems
> > until this incident, and only then I discovered the huge pile of
> > corruptions when I ran btrfsck.
> >
> > I'm also pretty convinced now that VirtualBox itself is not the
> > problem but only victim of these corruptions, that's why it
> > primarily shows up in the VDI file.
> >
> > However, I now found csum errors in unrelated files (see other post
> > in this thread), even for files not touched in a long time.  
> 
> Ok, this is some good further status and background. That there are
> more csum errors elsewhere is quite worrying I would say. You said HW
> is tested, but are you sure there are no rare undetected failures,
> like ones due to overclocking or just aging. It might just be that spurious
> HW errors just now start to happen and are unrelated to kernel upgrade
> from 4.4.x to 4.5.
> I had once a RAM module going bad; Windows7 ran fine (at least no
> crashes), but when I booted with Linux/btrfs, all kinds of strange
> btrfs errors started to appear including csum errors.

I'll go check the RAM for problems - though that would be the first
time in twenty years that a faulty RAM module didn't have errors from
the beginning. Well, you never know. But I expect no error, since bad
RAM usually means all sorts of different and random problems, which I
don't have. My problems are very specific, which is atypical for RAM
errors.

The hardware is not overclocked, every part was tested when installed.

> The other thing you could think about is the SSD cache partition. I
> don't remember if blocks from RAM to SSD get an extra CRC attached
> (independent of BTRFS). But if data gets corrupted while in the SSD,
> you could get very nasty errors, how nasty depends a bit on the
> various bcache settings. It is not unthinkable that dirty changed data
> gets written to the harddisks. But at least btrfs (scrub) can detect
> that (the situation you are in now).

Well, the SSD could in fact soon become a problem. It's at 97% of its
rated lifetime according to SMART. I'm probably somewhere near 85TB
(the lifetime spec of the SSD) of written data within one year, thanks
to an unfortunate disk replacement (btrfs replace) action done with
btrfs through bcache, plus weekly scrubs (which do not just read, but
also write).

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 100   100   000    Pre-fail Always   -           1
  5 Reallocate_NAND_Blk_Cnt 0x0033 100   100   000    Pre-fail Always   -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always   -           8705
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always   -           286
171 Program_Fail_Count      0x0032 100   100   000    Old_age  Always   -           0
172 Erase_Fail_Count        0x0032 100   100   000    Old_age  Always   -           0
173 Ave_Block-Erase_Count   0x0032 003   003   000    Old_age  Always   -           2913
174 Unexpect_Power_Loss_Ct  0x0032 100   100   000    Old_age  Always   -           112
180 Unused_Reserve_NAND_Blk 0x0033 000   000   000    Pre-fail Always   -           1036
183 SATA_Interfac_Downshift 0x0032 100   100   000    Old_age  Always   -           0
184 Error_Correction_Count  0x0032 100   100   000    Old_age  Always   -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always   -           0
194 Temperature_Celsius     0x0022 067   057   000    Old_age  Always   -           33 (Min/Max 20/43)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always   -           0
197 Current_Pending_Sector  0x0032 100   100   000    Old_age  Always   -           0
198 Offline_Uncorrectable   0x0030 100   100   000    Old_age  Offline  -           0
199 UDMA_CRC_Error_Count    0x0032 100   100   000    Old_age  Always   -           0
202 

Re: bad metadata crossing stripe boundary

2016-04-02 Thread Chris Murphy
On Thu, Mar 31, 2016 at 11:57 PM, Marc Haber
 wrote:
> On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:
>> Am Thu, 31 Mar 2016 23:00:04 +0200
>> schrieb Marc Haber :
>> > I find it somewhere between funny and disturbing that the first call
>> > of btrfs check made my kernel log the following:
>> > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31): mounted
>> > filesystem with ordered data mode. Opts: (null) Mar 31 22:45:38 fan
>> > kernel: [ 6255.361328] BTRFS: device label fanbtr devid 1 transid
>> > 67526 /dev/dm-31
>> >
>> > No, the filesystem was not converted, it was directly created as
>> > btrfs, and no, I didn't try mounting it.
>>
>> I suggest that your partition contained ext4 before, and you didn't run
>> wipefs before running mkfs.btrfs.
>
> I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed it.
> That should do a much better job than wipefsing it, shouldn't it?

Not really. The first btrfs superblock is at 64K, the second at 64M,
and the third at 256G. While wipefs will remove the magic only from the
first, mkfs.btrfs will take care of all three. And luksFormat only
overwrites the first 132K of a block device. There's a scant chance of
bugs related to previous filesystems not being erased; I think this is
more likely when mixing and matching filesystems, just because the
superblocks of different filesystems aren't in the same location.

If you're concerned about traces of previous file systems, then use the
dmcrypt device itself, rather than the original block device where
merely 132K at the beginning has been overwritten. Every time you
luksFormat a device, a new random key is generated (even if you're
using the same passphrase), so the resulting dmcrypt logical device is
in effect full of completely random data.
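A quick way to see the point about the three superblock copies is to scan for the magic directly. This sketch uses a scratch image file with a fake magic planted at the primary offset, so it is runnable without a real device; the magic string `_BHRfS_M` and its 64-byte offset within each superblock come from the btrfs on-disk format:

```shell
#!/bin/sh
# Look for btrfs superblock magic at all three mirror offsets
# (64KiB, 64MiB, 256GiB). DEV would normally be a block device.
DEV=$(mktemp)
truncate -s 128K "$DEV"
# Plant a fake magic at the primary superblock location for the demo.
printf '_BHRfS_M' | dd of="$DEV" bs=1 seek=$((64*1024 + 64)) conv=notrunc 2>/dev/null

found=0
for off in $((64*1024)) $((64*1024*1024)) $((256*1024*1024*1024)); do
    # The 8-byte magic sits 64 bytes into each superblock copy.
    magic=$(dd if="$DEV" bs=1 skip=$((off + 64)) count=8 2>/dev/null)
    if [ "$magic" = "_BHRfS_M" ]; then
        echo "btrfs magic at byte $off"
        found=$((found + 1))
    fi
done
echo "superblock copies found: $found"
rm -f "$DEV"
```

This also illustrates why wipefs alone can miss stale superblocks: it probes the primary copy at 64K, while mkfs.btrfs rewrites all three locations.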

-- 
Chris Murphy


Re: bad metadata crossing stripe boundary

2016-04-02 Thread Patrik Lundquist
On 2 April 2016 at 20:31, Kai Krakow  wrote:
> Am Sat, 2 Apr 2016 11:44:32 +0200
> schrieb Marc Haber :
>
>> On Sat, Apr 02, 2016 at 11:03:53AM +0200, Kai Krakow wrote:
>> > Am Fri, 1 Apr 2016 07:57:25 +0200
>> > schrieb Marc Haber :
>> > > On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:
>>  [...]
>>  [...]
>>  [...]
>> > >
>> > > I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed
>> > > it. That should do a much better job than wipefsing it, shouldn't
>> > > it?
>> >
>> > Not sure how luksFormat works. If it encrypts what is already on the
>> > device, it would also encrypt orphan superblocks.
>>
>> It overwrites the LUKS metadata including the symmetric key that was
>> used to encrypt the existing data. Short of Shor's Algorithm and
>> Quantum Computers, after that operation it is no longer possible to
>> even guess what was on the disk before.
>
> If it was encrypted before... ;-)

What does wipefs -n find?


Re: bad metadata crossing stripe boundary

2016-04-02 Thread Kai Krakow
Am Sat, 2 Apr 2016 11:44:32 +0200
schrieb Marc Haber :

> On Sat, Apr 02, 2016 at 11:03:53AM +0200, Kai Krakow wrote:
> > Am Fri, 1 Apr 2016 07:57:25 +0200
> > schrieb Marc Haber :  
> > > On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:  
>  [...]  
>  [...]  
>  [...]  
> > > 
> > > I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed
> > > it. That should do a much better job than wipefsing it, shouldn't
> > > it?  
> > 
> > Not sure how luksFormat works. If it encrypts what is already on the
> > device, it would also encrypt orphan superblocks.  
> 
> It overwrites the LUKS metadata including the symmetric key that was
> used to encrypt the existing data. Short of Shor's Algorithm and
> Quantum Computers, after that operation it is no longer possible to
> even guess what was on the disk before.

If it was encrypted before... ;-)

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Henk Slager
On Sat, Apr 2, 2016 at 11:00 AM, Kai Krakow  wrote:
> Am Fri, 1 Apr 2016 01:27:21 +0200
> schrieb Henk Slager :
>
>> It is not clear to me what 'Gentoo patch-set r1' is and does. So just
>> boot a vanilla v4.5 kernel from kernel.org and see if you get csum
>> errors in dmesg.
>
> It is the gentoo patchset, I don't think anything there relates to
> btrfs:
> https://dev.gentoo.org/~mpagano/genpatches/trunk/4.5/
>
>> Also, where does 'duplicate object' come from? dmesg ? then please
>> post its surroundings, straight from dmesg.
>
> It was in dmesg. I already posted it in the other thread and Qu took
> note of it. Apparently, I didn't manage to capture anything else than:
>
> btrfs_run_delayed_refs:2927: errno=-17 Object already exists
>
> It hit me unexpectedly. This was the first time btrfs went RO for me. It
> was with kernel 4.4.5 I think.
>
> I suspect this is the outcome of unnoticed corruptions that sneaked in
> earlier over some period of time. The system had no problems until this
> incident, and only then I discovered the huge pile of corruptions when I
> ran btrfsck.
>
> I'm also pretty convinced now that VirtualBox itself is not the problem
> but only victim of these corruptions, that's why it primarily shows up
> in the VDI file.
>
> However, I now found csum errors in unrelated files (see other post in
> this thread), even for files not touched in a long time.

Ok, this is some good further status and background. That there are
more csum errors elsewhere is quite worrying, I would say. You said the
HW is tested, but are you sure there are no rare undetected failures,
like ones due to overclocking or just aging? It might just be that
spurious HW errors are only now starting to happen and are unrelated to
the kernel upgrade from 4.4.x to 4.5.
I once had a RAM module going bad; Windows 7 ran fine (at least no
crashes), but when I booted Linux/btrfs, all kinds of strange btrfs
errors started to appear, including csum errors.

The other thing you could think about is the SSD cache partition. I
don't remember if blocks going from RAM to SSD get an extra CRC
attached (independent of btrfs). But if data gets corrupted while on
the SSD, you could get very nasty errors; how nasty depends a bit on
the various bcache settings. It is not unthinkable that dirty, changed
data gets written to the harddisks. But at least btrfs (scrub) can
detect that (the situation you are in now).

Maybe, to further isolate just btrfs, you could temporarily rule out
bcache by making sure the cache is clean, then increasing the start
sectors of the second partitions on the harddisks by 16 (8KiB), and
then rebooting. Of course, after any write to the partitions, you'll
have to recreate all the bcache setup.

But maybe the fs was just silently corrupted due to bugs in older
kernels, and now kernel 4.5 cannot handle it anymore, and any use of
the fs increases the corruption.


Re: scrub: Tree block spanning stripes, ignored

2016-04-02 Thread Ivan P
It's about 800Mb, I think I could upload that.

I ran it with the -s parameter, is that enough to remove all personal
info from the image?
Also, I had to run it with -w because otherwise it died on the same
corrupt node.

On Fri, Apr 1, 2016 at 2:25 AM, Qu Wenruo  wrote:
>
>
> Ivan P wrote on 2016/03/31 18:04 +0200:
>>
>> Ok, it will take a while until I can attempt repairing it, since I
>> will have to order a spare HDD to copy the data to.
>> Should I take some sort of debug snapshot of the fs so you can take a
>> look at it? I think I read somewhere about a snapshot that only
>> contains the fs metadata but not the data.
>
> That's btrfs-image.
>
> It would be good, but if your metadata is over 3G, I think it would
> take a lot of time to upload.
>
> Thanks,
> Qu
>
>>
>> Regards,
>> Ivan.
>>
>> On Tue, Mar 29, 2016 at 3:57 AM, Qu Wenruo 
>> wrote:
>>>
>>>
>>>
>>> Ivan P wrote on 2016/03/28 23:21 +0200:


 Well, the file in this inode is fine, I was able to copy it off the
 disk. However, rm-ing the file causes a segmentation fault. Shortly
 after that, I get a kernel oops. Same thing happens if I attempt to
 re-run scrub.

 How can I delete that inode? Could deleting it destroy the filesystem
 beyond repair?
>>>
>>>
>>>
>>> The kernel oops should protect you from completely destroying the fs.
>>>
>>> However, it seems that the problem is beyond the kernel's ability to
>>> handle (hence the oops).
>>>
>>> So no safe recovery method now.
>>>
>>>  From now on, any repair advice from me *MAY* *destroy* your fs.
>>> So please back up while you still can.
>>>
>>>
>>> The best possible try would be "btrfsck --init-extent-tree --repair".
>>>
>>> If it works, then mount it and run "btrfs balance start ".
>>> Lastly, umount and use btrfsck to re-check if it fixes the problem.
>>>
>>> Thanks,
>>> Qu
>>>
>>>

 Regards,
 Ivan

 On Mon, Mar 28, 2016 at 3:10 AM, Qu Wenruo 
 wrote:
>
>
>
>
> Ivan P wrote on 2016/03/27 16:31 +0200:
>>
>>
>>
>> Thanks for the reply,
>>
>> the raid1 array was created from scratch, so not converted from ext*.
>> I used btrfs-progs version 4.2.3 on kernel 4.2.5 to create the array,
>> btw.
>
>
>
>
> I don't remember any strange behavior after 4.0, so no clue here.
>
> Go to the subvolume 5 (the top-level subvolume), find inode 71723 and
> try
> to
> remove it.
> Then, use 'btrfs filesystem sync ' to sync the inode
> removal.
>
> Finally use latest btrfs-progs to check if the problem disappears.
>
> This problem seems to be quite strange, so I can't locate the root
> cause,
> but try to remove the file and hope the kernel can handle it.
>
> Thanks,
> Qu
>>
>>
>>
>>
>> Is there a way to fix the current situation without taking the whole
>> data off the disk?
>> I'm not familiar with file system terms, so what exactly could I have
>> lost, if anything?
>>
>> Regards,
>> Ivan
>>
>> On Sun, Mar 27, 2016 at 4:23 PM, Qu Wenruo wrote:
>>
>>
>>   On 03/27/2016 05:54 PM, Ivan P wrote:
>>
>>   Read the info on the wiki, here's the rest of the requested
>>   information:
>>
>>   # uname -r
>>   4.4.5-1-ARCH
>>
>>   # btrfs fi show
>>   Label: 'ArchVault'  uuid: cd8a92b6-c5b5-4b19-b5e6-a839828d12d8
>>       Total devices 1 FS bytes used 2.10GiB
>>       devid 1 size 14.92GiB used 4.02GiB path /dev/sdc1
>>
>>   Label: 'Vault'  uuid: 013cda95-8aab-4cb2-acdd-2f0f78036e02
>>       Total devices 2 FS bytes used 800.72GiB
>>       devid 1 size 931.51GiB used 808.01GiB path /dev/sda
>>       devid 2 size 931.51GiB used 808.01GiB path /dev/sdb
>>
>>   # btrfs fi df /mnt/vault/
>>   Data, RAID1: total=806.00GiB, used=799.81GiB
>>   System, RAID1: total=8.00MiB, used=128.00KiB
>>   Metadata, RAID1: total=2.00GiB, used=936.20MiB
>>   GlobalReserve, single: total=320.00MiB, used=0.00B
>>
>>   On Fri, Mar 25, 2016 at 3:16 PM, Ivan P wrote:
>>
>>   Hello,
>>
>>   using kernel 4.4.5 and btrfs-progs 4.4.1, today I ran a
>>   scrub on my
>>   2x1Tb btrfs raid1 array and it finished with 36
>>   unrecoverable errors
>>   [1], all blaming the treeblock 741942071296. Running
>> "btrfs
>>   check
>>   --readonly" on one 

Non-mountable FS (raid10, can't read superblock / corrupt leaf / open_ctree failed)

2016-04-02 Thread Jarno Elonen
Hi,

I've got a big BTRFS raid10 set that fails to mount. I'm currently
running "btrfs restore" on the most important stuff, but would
appreciate any ideas on how to fix the FS gracefully.

# btrfs de scan
# mount -o clear_cache,recovery,degraded /dev/sdh1 /mnt/raid/
mount: /dev/sdh1: can't read superblock
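[For reference, the read-only rescue mentioned above looks roughly like this; DEV and DST are placeholders, and the command is printed rather than run:]

```shell
# Placeholders: any member device of the array, and a destination with
# enough free space for the restored data.
DEV=/dev/sdh1
DST=/mnt/rescue

# btrfs restore reads the filesystem offline, without mounting or
# modifying it; -i continues past errors, -v lists files as restored.
cmd="btrfs restore -i -v $DEV $DST"
echo "$cmd"
```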


The FS consists of 8 disks:

# btrfs fi show
Label: 'btrfs-raid'  uuid: 8b5b4aa6-c3c6-4db0-9ead-df93812e51bc
Total devices 8 FS bytes used 9.26TiB
devid    4 size 2.73TiB used 2.34TiB path /dev/sdi1
devid    5 size 2.73TiB used 2.34TiB path /dev/sdj1
devid    7 size 2.73TiB used 2.34TiB path /dev/sdc1
devid    9 size 3.64TiB used 2.34TiB path /dev/sdh1
devid   11 size 2.73TiB used 2.34TiB path /dev/sda1
devid   16 size 2.73TiB used 2.34TiB path /dev/sdk1
devid   17 size 2.73TiB used 2.34TiB path /dev/sdl1
devid   18 size 3.64TiB used 2.34TiB path /dev/sdd1


Software versions are pretty recent:

# uname -a
Linux serv-archive 4.4.0-1-amd64 #1 SMP Debian 4.4.6-1 (2016-03-17)
x86_64 GNU/Linux

# btrfs version
btrfs-progs v4.4


Errors don't *look* that bad...

 # btrfsck -p /dev/sda1

 Checking filesystem on /dev/sda1
 UUID: 8b5b4aa6-c3c6-4db0-9ead-df93812e51bc
 bad key ordering 173 174
 bad block 103797139668992

 Errors found in extent allocation tree or chunk allocation
 block group 98636232261632 has wrong amount of free space
 failed to load free space cache for block group 98636232261632
 checking free space cache [o]
 checking fs roots [.]
(...haven't seen this through, takes forever)

 # btrfsck -s 2 -p /dev/sda1

 using SB copy 2, bytenr 274877906944
 Checking filesystem on /dev/sda1
 UUID: 8b5b4aa6-c3c6-4db0-9ead-df93812e51bc
 bad key ordering 173 174
 bad block 103797139668992

 Errors found in extent allocation tree or chunk allocation
 block group 98636232261632 has wrong amount of free space
 failed to load free space cache for block group 98636232261632
 checking free space cache [o]
 checking fs roots [o]
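[The bytenr reported above checks out: btrfs keeps up to three superblock copies at fixed offsets (64 KiB, 64 MiB, 256 GiB), and "-s 2" selects the third. Shell arithmetic confirms that 256 GiB is exactly the 274877906944 printed:]

```shell
# Btrfs superblock mirror offsets, in bytes: primary plus two copies.
sb_primary=$((64 * 1024))               # 64 KiB
sb_copy1=$((64 * 1024 * 1024))          # 64 MiB
sb_copy2=$((256 * 1024 * 1024 * 1024))  # 256 GiB
echo "$sb_primary $sb_copy1 $sb_copy2"
```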


...but mount attempts show "corrupt leaf, bad key" and end in "BTRFS:
open_ctree failed":

# dmesg

...

[   13.156643] Btrfs loaded
[   13.228545] BTRFS: device label btrfs-raid devid 17 transid
15891267 /dev/sdl1
[   13.228606] BTRFS: device label btrfs-raid devid 16 transid
15891267 /dev/sdk1
[   13.228666] BTRFS: device label btrfs-raid devid 5 transid 15891267 /dev/sdj1
[   13.228723] BTRFS: device label btrfs-raid devid 4 transid 15891267 /dev/sdi1
[   13.228799] BTRFS: device label btrfs-raid devid 9 transid 15891267 /dev/sdh1
[   13.228858] BTRFS: device label btrfs-raid devid 18 transid
15891267 /dev/sdd1
[   13.228920] BTRFS: device label btrfs-raid devid 7 transid 15891267 /dev/sdc1
[   13.228977] BTRFS: device label btrfs-raid devid 11 transid
15891267 /dev/sda1

...

[  191.183172] BTRFS info (device sda1): force clearing of disk cache
[  191.183181] BTRFS info (device sda1): enabling auto recovery
[  191.183184] BTRFS info (device sda1): allowing degraded mounts
[  191.183188] BTRFS info (device sda1): disk space caching is enabled
[  192.057675] BTRFS info (device sda1): bdev /dev/sdi1 errs: wr 0, rd
0, flush 0, corrupt 10, gen 0
[  192.057678] BTRFS info (device sda1): bdev /dev/sdj1 errs: wr 0, rd
7, flush 0, corrupt 20, gen 0
[  210.226357] BTRFS critical (device sda1): corrupt leaf, bad key
order: block=103797139668992,root=1, slot=173
[  210.226535] BTRFS critical (device sda1): corrupt leaf, bad key
order: block=103797139668992,root=1, slot=173
[  210.236768] BTRFS critical (device sda1): corrupt leaf, bad key
order: block=103797139668992,root=1, slot=173
[  210.263439] BTRFS critical (device sda1): corrupt leaf, bad key
order: block=103797139668992,root=1, slot=173
[  210.263447] [ cut here ]
[  210.263464] WARNING: CPU: 1 PID: 118 at
/build/linux-dFNWCu/linux-4.4.6/fs/btrfs/extent-tree.c:6552
__btrfs_free_extent.isra.70+0x314/0xd70 [btrfs]()
[  210.263465] BTRFS: Transaction aborted (error -5)
[  210.263466] Modules linked in: nfsd auth_rpcgss oid_registry
nfs_acl nfs lockd grace fscache sunrpc intel_rapl iosf_mbi
x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support eeepc_wmi
asus_wmi sparse_keymap rfkill sha256_ssse3 sha256_generic hmac drbg
ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper
cryptd psmouse pcspkr nouveau serio_raw i2c_i801 mxm_wmi lpc_ich ttm
mfd_core drm_kms_helper drm option huawei_cdc_ncm cdc_wdm usb_wwan
cdc_ncm usbnet joydev evdev mei_me mii usbserial shpchp mei
8250_fintek battery wmi video tpm_infineon button tpm_tis tpm
processor coretemp loop autofs4 ext4 crc16 mbcache jbd2 btrfs xor
raid6_pq dm_mod osst st ch sr_mod cdrom hid_generic usbhid hid sg
sd_mod uas usb_storage crc32c_intel
[  210.263497]  ahci libahci firewire_ohci mpt3sas raid_class
scsi_transport_sas igb i2c_algo_bit dca ptp ehci_pci pps_core ehci_hcd
firewire_core xhci_pci crc_itu_t 

Re: bad metadata crossing stripe boundary

2016-04-02 Thread Marc Haber
On Sat, Apr 02, 2016 at 11:03:53AM +0200, Kai Krakow wrote:
> On Fri, 1 Apr 2016 07:57:25 +0200, Marc Haber wrote:
> > On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:
> > > On Thu, 31 Mar 2016 23:00:04 +0200, Marc Haber wrote:
> > > > I find it somewhere between funny and disturbing that the first
> > > > call of btrfs check made my kernel log the following:
> > > > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31):
> > > > mounted filesystem with ordered data mode. Opts: (null) Mar 31
> > > > 22:45:38 fan kernel: [ 6255.361328] BTRFS: device label fanbtr
> > > > devid 1 transid 67526 /dev/dm-31
> > > > 
> > > > No, the filesystem was not converted, it was directly created as
> > > > btrfs, and no, I didn't try mounting it.  
> > > 
> > > I suggest that your partition contained ext4 before, and you didn't
> > > run wipefs before running mkfs.btrfs.  
> > 
> > I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed it.
> > That should do a much better job than wipefsing it, shouldn't it?
> 
> Not sure how luksFormat works. If it encrypts what is already on the
> device, it would also encrypt orphan superblocks.

It overwrites the LUKS metadata including the symmetric key that was
used to encrypt the existing data. Short of Shor's Algorithm and
Quantum Computers, after that operation it is no longer possible to
even guess what was on the disk before.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bad metadata crossing stripe boundary

2016-04-02 Thread Kai Krakow
On Fri, 1 Apr 2016 07:57:25 +0200, Marc Haber wrote:

> On Thu, Mar 31, 2016 at 11:16:30PM +0200, Kai Krakow wrote:
> > On Thu, 31 Mar 2016 23:00:04 +0200, Marc Haber wrote:
> > > I find it somewhere between funny and disturbing that the first
> > > call of btrfs check made my kernel log the following:
> > > Mar 31 22:45:36 fan kernel: [ 6253.178264] EXT4-fs (dm-31):
> > > mounted filesystem with ordered data mode. Opts: (null) Mar 31
> > > 22:45:38 fan kernel: [ 6255.361328] BTRFS: device label fanbtr
> > > devid 1 transid 67526 /dev/dm-31
> > > 
> > > No, the filesystem was not converted, it was directly created as
> > > btrfs, and no, I didn't try mounting it.  
> > 
> > I suggest that your partition contained ext4 before, and you didn't
> > run wipefs before running mkfs.btrfs.  
> 
> I cryptsetup luksFormat'ted the partition before I mkfs.btrfs'ed it.
> That should do a much better job than wipefsing it, shouldn't it?

Not sure how luksFormat works. If it encrypts what is already on the
device, it would also encrypt orphan superblocks.

If it actually wipes the device, the orphan superblock should be gone.

This suggests it does not wipe the device:

https://wiki.archlinux.org/index.php/Dm-crypt/Device_encryption:

> Encryption options for LUKS mode
> The cryptsetup action to set up a new dm-crypt device in LUKS
> encryption mode is luksFormat. Unlike the name implies, it does not
> format the device, but sets up the LUKS device header and encrypts
> the master-key with the desired cryptographic options.

Thus, I may very well have an orphan superblock lying around.
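[One way to check for that orphan superblock without touching anything, assuming util-linux's wipefs is available: --no-act only reports signatures. The erase step is destructive, so both commands are printed here rather than run; DEV is a placeholder:]

```shell
# Placeholder for the device that may carry an orphan superblock.
DEV=/dev/dm-31

# --no-act lists magic signatures found on the device; modifies nothing.
echo "wipefs --no-act $DEV"

# Destructive: erase all signatures, keeping backup files of the
# overwritten sectors.
echo "wipefs --all --backup $DEV"
```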

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Kai Krakow
On Fri, 1 Apr 2016 01:27:21 +0200, Henk Slager wrote:

> It is not clear to me what 'Gentoo patch-set r1' is and does. So just
> boot a vanilla v4.5 kernel from kernel.org and see if you get csum
> errors in dmesg.

It is the Gentoo patchset; I don't think anything in it relates to
btrfs:
https://dev.gentoo.org/~mpagano/genpatches/trunk/4.5/

> Also, where does 'duplicate object' come from? dmesg ? then please
> post its surroundings, straight from dmesg.

It was in dmesg. I already posted it in the other thread and Qu took
note of it. Apparently, I didn't manage to capture anything other than:

btrfs_run_delayed_refs:2927: errno=-17 Object already exists

It hit me unexpectedly. This was the first time btrfs went RO for me. It
was with kernel 4.4.5, I think.

I suspect this is the outcome of unnoticed corruptions that sneaked in
earlier over some period of time. The system had no problems until this
incident, and only then did I discover the huge pile of corruptions when
I ran btrfsck.

I'm also pretty convinced now that VirtualBox itself is not the problem
but only a victim of these corruptions; that's why it primarily shows up
in the VDI file.

However, I now found csum errors in unrelated files (see other post in
this thread), even for files not touched in a long time.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfsck: backpointer mismatch (and multiple other errors)

2016-04-02 Thread Kai Krakow
On Fri, 1 Apr 2016 09:10:44 +0800, Qu Wenruo wrote:

> The real problem is that the extent has a mismatched reference.
> Normally this can be fixed by the --init-extent-tree option, but it
> usually indicates a bigger problem, especially since it has already
> caused a kernel delayed-ref problem.
> 
> Not to mention the error "extent item 11271947091968 has multiple
> extent items", which makes the problem more serious.
> 
> 
> I assume some older kernel has already screwed up the extent tree;
> although delayed-ref handling is bug-prone, it has improved in recent
> years.
> 
> But the fs tree seems less damaged, so I assume the extent tree
> corruption could be fixed by "--init-extent-tree".
> 
> For the only fs tree error (a missing csum), if "btrfsck
> --init-extent-tree --repair" works without any problem, the simplest
> fix would be to just remove the file.
> Or you can spend a lot of CPU time and disk IO rebuilding the whole
> csum tree with the "--init-csum-tree" option.

Okay, so I'm going to inode-resolve the file with csum errors.
Actually, it's a file from Steam which has been there for ages and never
showed csum errors before, which makes me wonder whether csum errors can
sneak in on long-existing files through other corruptions.

I now removed this file and had to reboot because btrfs went RO. Here's
the backtrace:

https://gist.github.com/kakra/a7be40c23e08fc6e237f9108371afadf

[137619.835374] [ cut here ]
[137619.835385] WARNING: CPU: 1 PID: 4840 at fs/btrfs/extent-tree.c:1625 
lookup_inline_extent_backref+0x156/0x620()
[137619.835394] Modules linked in: nvidia_drm(PO) uas usb_storage vboxnetadp(O) 
vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO) nvidia(PO)
[137619.835405] CPU: 1 PID: 4840 Comm: rm Tainted: P   O
4.5.0-gentoo-r1 #1
[137619.835407] Hardware name: To Be Filled By O.E.M. To Be Filled By 
O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
[137619.835409]   8159eae9  
81ea1d08
[137619.835412]  810c6e37 8803d56a4d20 88040c7daa00 
0a4075114000
[137619.835415]  00201000  81489836 
001d
[137619.835418] Call Trace:
[137619.835423]  [] ? dump_stack+0x46/0x5d
[137619.835429]  [] ? warn_slowpath_common+0x77/0xb0
[137619.835432]  [] ? lookup_inline_extent_backref+0x156/0x620
[137619.835435]  [] ? btrfs_get_token_32+0xee/0x110
[137619.835440]  [] ? __set_page_dirty_nobuffers+0xf8/0x150
[137619.835443]  [] ? insert_inline_extent_backref+0x54/0xe0
[137619.835450]  [] ? __slab_free+0x98/0x220
[137619.835453]  [] ? kmem_cache_alloc+0x14d/0x160
[137619.835456]  [] ? 
__btrfs_inc_extent_ref.isra.64+0x99/0x270
[137619.835459]  [] ? __btrfs_run_delayed_refs+0x673/0x1020
[137619.835463]  [] ? 
btrfs_release_extent_buffer_page+0x71/0x120
[137619.835466]  [] ? release_extent_buffer+0x3f/0x90
[137619.835469]  [] ? btrfs_run_delayed_refs+0x8f/0x2b0
[137619.835473]  [] ? btrfs_truncate_inode_items+0x8b8/0xdc0
[137619.835477]  [] ? btrfs_evict_inode+0x3fe/0x550
[137619.835481]  [] ? evict+0xb7/0x180
[137619.835484]  [] ? do_unlinkat+0x12c/0x2d0
[137619.835488]  [] ? entry_SYSCALL_64_fastpath+0x12/0x6a
[137619.835491] ---[ end trace 6e8061336c42ff93 ]---
[137619.835494] [ cut here ]
[137619.835497] WARNING: CPU: 1 PID: 4840 at fs/btrfs/extent-tree.c:2946 
btrfs_run_delayed_refs+0x279/0x2b0()
[137619.835499] BTRFS: Transaction aborted (error -5)
[137619.835500] Modules linked in: nvidia_drm(PO) uas usb_storage vboxnetadp(O) 
vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO) nvidia(PO)
[137619.835506] CPU: 1 PID: 4840 Comm: rm Tainted: PW  O
4.5.0-gentoo-r1 #1
[137619.835508] Hardware name: To Be Filled By O.E.M. To Be Filled By 
O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
[137619.835509]   8159eae9 880255d1bc98 
81ea1d08
[137619.835512]  810c6e37 88040c7daa00 880255d1bce8 
01c6
[137619.835514]  8803211b4510 000b 810c6eb7 
81e8a0a0
[137619.835517] Call Trace:
[137619.835519]  [] ? dump_stack+0x46/0x5d
[137619.835522]  [] ? warn_slowpath_common+0x77/0xb0
[137619.835525]  [] ? warn_slowpath_fmt+0x47/0x50
[137619.835528]  [] ? btrfs_run_delayed_refs+0x279/0x2b0
[137619.835531]  [] ? btrfs_truncate_inode_items+0x8b8/0xdc0
[137619.835535]  [] ? btrfs_evict_inode+0x3fe/0x550
[137619.835538]  [] ? evict+0xb7/0x180
[137619.835541]  [] ? do_unlinkat+0x12c/0x2d0
[137619.835543]  [] ? entry_SYSCALL_64_fastpath+0x12/0x6a
[137619.835545] ---[ end trace 6e8061336c42ff94 ]---
[137619.835547] BTRFS: error (device bcache2) in btrfs_run_delayed_refs:2946: 
errno=-5 IO failure
[137619.835550] BTRFS info (device bcache2): forced readonly
[137619.886069] pending csums is 410705920

So it looks like fixing one error introduces other errors. Should I try
init-extent-tree after taking a backup?

BTW: "btrfsck --repair" does not work: it complains about unsupported
cases due to compression of 

Re: Another ENOSPC situation

2016-04-02 Thread Duncan
Chris Murphy posted on Fri, 01 Apr 2016 23:43:46 -0600 as excerpted:

> On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:
> 
>>> [4/502]mh@swivel:~$ sudo btrfs fi usage /
>>> Overall:
>>> Device size:   600.00GiB
>>> Device allocated:  600.00GiB
>>> Device unallocated:  1.00MiB
>>
>> That's the problem right there.  The admin didn't do his job and spot
>> the near full allocation issue
> 
> 
> I don't yet agree this is an admin problem. This is the 2nd or 3rd case
> we've seen only recently where there's plenty of space in all chunk
> types and yet ENOSPC happens, seemingly only because there's no
> unallocated space remaining. I don't know that this is a regression for
> sure, but it sure seems like one.

Notice that he said _balance_ failed with ENOSPC.  He did _NOT_ say he 
was getting it in ordinary usage, just yet.  Which would fit a 100% 
allocated situation, with plenty of space left in both data and metadata 
chunks.  The plenty of space left inside the chunks would keep ordinary 
usage from running into problems just yet, but balance really /does/ need 
room to allocate at least one new chunk in order to properly handle the 
chunk rewrite via COW.  (At least for data, metadata seems to work a bit 
differently.  See below.)

Balance has always failed with ENOSPC if there was no unallocated space 
left.  It used to happen all the time, before btrfs learned how to delete 
empty chunks in 3.17, but while that helps, it only works for literally 
/empty/ chunks.  Chunks with even a single block/node still in use don't 
get deleted automatically.

What I think is happening now is that while the empty-chunk deletion
introduced in 3.17 helped, enough time has passed since then that people
with particular usage patterns (I'd strongly suspect those with heavy
snapshotting) don't tend to fully empty their chunks to the extent that
those with other usage patterns do, and we're just now beginning to see
the problem reported again: deleting empty chunks helped, but they
weren't fully emptying enough chunks to keep up with things that way, in
their particular use-cases.
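[The usual way to reclaim those not-quite-empty chunks is a usage-filtered balance, stepping the threshold up gradually so the cheapest chunks are rewritten first. A sketch; MNT and the threshold steps are placeholders of my choosing, and the commands are printed for review rather than executed:]

```shell
# Placeholder mount point.
MNT=/mnt/vault

# -dusage=N / -musage=N relocate only chunks under N% used, so starting
# low means each pass needs little unallocated space and frees room for
# the next pass.
for pct in 0 5 10 25 50; do
    echo "btrfs balance start -dusage=$pct -musage=$pct $MNT"
done
```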

>>> Data,single: Size:553.93GiB, Used:405.73GiB
>>>/dev/mapper/swivelbtr 553.93GiB
>>>
>>> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>>>/dev/mapper/swivelbtr  46.00GiB
>>>
>>> System,DUP: Size:32.00MiB, Used:112.00KiB
>>>/dev/mapper/swivelbtr  64.00MiB
>>>
>>> Unallocated:
>>>/dev/mapper/swivelbtr   1.00MiB
>>> [5/503]mh@swivel:~$
>>
>> Both data and metadata have several GiB free, data ~140 GiB free, and
>> metadata isn't into global reserve, so the system isn't totally wedged,
>> only partially, due to the lack of unallocated space.
> 
> Unallocated space alone hasn't ever caused this that I can remember.
> It's most often been totally full metadata chunks, with free space in
> allocated data chunks, with no unallocated space out of which to create
> another metadata chunk to write out changes.

Unallocated space alone doesn't cause ENOSPC with normal operations; for 
those you're correct, running out of either data or metadata space is 
required as well.  (Normally it's metadata that runs out, but I recall 
seeing one post from someone who had metadata room but full data.  The 
behavior was.. "interesting", as he could do renames, etc, and even 
create small files as long as they were small enough to stay in 
metadata.  As soon as he tried to do anything that needed an actual data 
extent, however, ENOSPC.)

But balance has always required space to allocate at least one chunk, as 
COW means the existing chunk can't be released until everything is 
rewritten into the new one.

Tho it seems that btrfs can sometimes write very small metadata chunks, 
which (don't forget) are dup by default on a single device, as they are 
in this case.  He has 1 MiB unallocated; split in half, that's 512 
KiB.  I'm not sure if btrfs can go that small, but if it can, and it can 
find a low enough usage metadata chunk to write into it, freeing the 
larger metadata chunk...

Or maybe btrfs can actually use the global reserve for that, since global 
reserve is part of metadata.  If it can, a 512 MiB global reserve would 
be just large enough to write the two copies of a nominally 256 MiB 
metadata chunk.

Either way, I've seen a number of times now where btrfs was able to 
balance metadata, when it had less than the 256 (*2 if dup) MiB 
unallocated that would normally be required.  Maybe it /is/ able to use 
global reserve for that, which would allow it to work, as long as 
metadata isn't so tight that it's already using global reserve.  That's 
actually what I bet it's doing, now that I think about it.  Because as 
long as the global reserve isn't being used, 512 MiB of global reserve 
would be exactly 2*256 MiB metadata chunks, and if they're