Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-13 Thread Filipe David Manana
On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas a...@zadarastorage.com wrote:
 Greetings,
 Looking at the code of should_cow_block(), I see:

 if (btrfs_header_generation(buf) == trans->transid &&
     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
 ...
 So if the extent buffer has been written to disk, and now is changed again
 in the same transaction, we insist on COW'ing it. Can anybody explain why
 COW is needed in this case? The transaction has not committed yet, so what
 is the danger of rewriting to the same location on disk? My understanding
 was that a tree block needs to be COW'ed at most once in the same
 transaction. But I see that this is not the case.

That logic is there, as far as I can see, for at least 2 obvious reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree
have the same transaction id/generation, and you can have multiple
fsyncs (log transaction commits) per transaction so you need to ensure
consistency. If we skipped the COWing in the example below, you would
get an inconsistent log tree at log replay time when the fs is
mounted:

transaction N start

   fsync inode A start
   creates tree block X
   flush X to disk
   write a new superblock
   fsync inode A end

   fsync inode B start
   skip COW of X because its generation == current transaction id and
modify it in place
   flush X to disk

=== crash ===

   write a new superblock
   fsync inode B end

transaction N commit

2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
written to disk but instead when we trigger writeback for it. So while
the writeback is ongoing we want to make sure the block's content
isn't concurrently modified (we don't keep the eb write locked to
allow concurrent reads during the writeback).

All tree blocks that don't belong to a log tree are normally written
only at the end of a transaction commit. But often, due to memory
pressure, for example, the VM can call the writepages() callback of the
btree inode to force dirty tree blocks to be written to disk before
the transaction commit.
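
Roughly, the full check in ctree.c looks like this (a sketch quoted
from memory, so treat the exact relocation and force-COW conditions as
approximate):

static inline int should_cow_block(struct btrfs_trans_handle *trans,
                                   struct btrfs_root *root,
                                   struct extent_buffer *buf)
{
        /*
         * COW can only be skipped when the block was created in this
         * transaction, writeback has not yet been triggered for it,
         * and neither relocation nor a forced-COW root is involved.
         */
        if (btrfs_header_generation(buf) == trans->transid &&
            !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
            !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
              btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
            !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
                return 0;
        return 1;
}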


 I am asking because I am doing some profiling of btrfs metadata work under
 heavy loads, and I see that sometimes btrfs COWs almost twice as many
 tree blocks as the total metadata size.

 Thanks,
 Alex.




-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


[RFC] Add an option to disable automatic chunk reclamation.

2015-07-13 Thread Austin S Hemmelgarn
Since upgrading to a kernel with the automatic chunk reclamation 
patches, I've noticed a number of issues with BTRFS that all seem to 
either be caused by, or are further exacerbated by, this 'feature'.


The four big issues I've seen regarding it are:
1. TRIM/DISCARD support is broken as a (partial?) result of this.
2. It appears to expose underlying issues with the defrag code (stuff 
ending up more fragmented after defrag).
3. Since upgrading to a kernel with this patch, most of the BTRFS 
filesystems that I have that are very rewrite heavy have gotten very 
noticeably slower than they were beforehand (and this goes away when I 
run them on a kernel without auto-reclaim).
4. All of my filesystems are experiencing seemingly non-deterministic 
delays around most large-scale VFS-level operations (e.g., deleting or 
relocating lots of files).


While I understand that this feature does solve (at least partially) a 
very real issue with BTRFS, there were a number of people who never had 
this issue to begin with because we ran regular balance operations on 
our filesystems.


Based on this, I would like to propose that some method to disable 
auto-reclaim be added.  Personally, I would prefer to leave it enabled 
by default and have a mount option to disable it.






Re: slowdown after one week

2015-07-13 Thread Austin S Hemmelgarn

On 2015-07-11 02:46, Stefan Priebe wrote:

Hi,

While using a 40TB btrfs partition for VM backups, I see a massive
slowdown after around one week.

The backup task usually takes 2-3 hours. After one week it takes 20
hours. If I umount and remount the btrfs volume it takes 2-3 hours again.

Kernel 4.1.1

I've been seeing similar (although much less drastic) slowdowns over 
time myself pretty much since I started using BTRFS (IIRC, sometime 
around 3.16).  If you're not constantly writing to that backup volume, 
you might want to consider setting up automounting for it.






Re: Did btrfs filesystem defrag just make things worse?

2015-07-13 Thread Austin S Hemmelgarn

On 2015-07-11 11:24, Duncan wrote:

I'm not a coder, only a list regular and btrfs user, and I'm not sure on
this, but there have been several reports of this nature on the list
recently, and I have a theory.  Maybe the devs can step in and either
confirm or shoot it down.
While I am a coder, I'm not a BTRFS developer, so what I say below may 
still be incorrect.



[...trimmed for brevity...]

Of course during normal use, files get deleted as well, thereby clearing
space in existing chunks.  But this space will be fragmented, with a mix
of unallocated extents and still remaining files.  The allocator will I
/believe/ (this is where people who can actually read the code come in)
try to use up space in existing chunks before allocating additional
space, possibly subject to some reasonable extent minimum size, below
which btrfs will simply allocate another chunk.

AFAICT, this is in fact the case.


1) Prioritize reduced fragmentation, at the expense of higher data chunk
allocation.  In the extreme, this would mean always choosing to allocate
a new chunk and use it if the file (or remainder of the file not yet
defragged) was larger than the largest free extent in existing data
chunks.

The problem with this is that over time, the number of partially used
data chunks goes up as new ones are allocated to defrag into, but sub-1
GiB files that are already defragged are left where they are.  Of course
a balance can help here, by combining multiple partial chunks into fewer
full chunks, but unless a balance is run...

2) Prioritize chunk utilization, at the expense of leaving some
fragmentation, despite massive amounts of unallocated space.

This is what I've begun to suspect defrag does.  With a bunch of free but
fragmented space in existing chunks, defrag could actually increase
fragmentation, as the space in existing chunks is so fragmented a rewrite
is forced to use more, smaller extents, because that's all there is free,
until another chunk is allocated.

As I mentioned above for normal file allocation, it's quite possible that
there's some minimum extent size (greater than the bare minimum 4 KiB
block size) where the allocator will give up and allocate a new data
chunk, but if so, perhaps this size needs to be bumped upward, as it 
seems a bit low today.
If I'm reading the code correctly, defrag does indeed try to avoid 
allocating a new chunk if at all possible.



Meanwhile, there's a number of exacerbating factors to consider as well.

* Snapshots and other shared references lock extents in place.

Defrag doesn't touch anything but the subvolume it's actually pointed at
for the defrag.  Other subvolumes and shared-reference files will
continue to keep the extents they reference locked in place.  And COW
will rewrite blocks of a file, but the old reference extent remains
locked, until all references to it are cleared -- the entire file (or at
least all blocks that were in that extent) must be rewritten, and no
snapshots or other references to it remain, before it can be freed.

For a few kernel cycles btrfs had snapshot-aware-defrag, but that
implementation didn't scale well at all, so it was disabled until it
could be rewritten, and that rewrite hasn't occurred yet.  So snapshot-
aware-defrag remains disabled, and defrag only works on the subvolume
it's actually pointed at.

As a result, if defrag rewrites a snapshotted file, it actually doubles
the space that file takes, as it makes a new copy, breaking the reference
link between it and the copy in the snapshot.

Of course, with the space not freed up, this will, over time, tend to
fragment space that is freed even more heavily.
To mitigate this, one can run offline data deduplication (duperemove is 
the tool I'd suggest for this), although there are caveats to doing that 
as well.
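
For example, a run like duperemove -dr /path/to/subvolume (where -d
actually submits the dedupe requests and -r recurses into
subdirectories) can re-share the extents that defrag unshared; check
the man page first, though, as the options are still evolving.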


* Chunk reclamation.

This is the relatively new development that I think is triggering the
surge in "defrag not defragging" reports we're seeing now.

Until quite recently, btrfs could allocate new chunks, but it couldn't,
on its own, deallocate empty chunks.  What tended to happen over time was
that people would find all the filesystem space taken up by empty or
mostly empty data chunks, and btrfs would start spitting ENOSPC errors
when it needed to allocate new metadata chunks but couldn't, as all the
space was in empty data chunks.  A balance could fix it, often relatively
quickly with a -dusage=0 or -dusage=10 filter or the like, but it was a 
manual process, btrfs wouldn't do it on its own.
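
(For example, btrfs balance start -dusage=10 /mountpoint rewrites only 
the data chunks that are at most 10% used, so it normally completes 
far faster than an unfiltered balance.)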

Recently the devs (mostly) fixed that, and btrfs will automatically
reclaim entirely empty chunks on its own now.  It still doesn't reclaim
partially empty chunks automatically; a manual rebalance must still be
used to combine multiple partially empty chunks into fewer full chunks;
but it does well enough to make the previous problem pretty rare -- we
don't see the hundreds of GiB of empty data chunks allocated any more,
like we used to.

Which fixed the one problem, but if my theory is 

Re: Wiki suggestions

2015-07-13 Thread Marc Joliet
Am Mon, 13 Jul 2015 06:56:17 + (UTC)
schrieb Duncan 1i5t5.dun...@cox.net:

 Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
 
  I hope it's not out of place, but I have a few suggestions for the Wiki:
 
 Just in case it wasn't obvious...  The wiki is open to user editing.  You 
 can, if you like, get an account and make the changes yourself. =:^)
 
 Of course, it's understandable if your reaction to web and wiki 
 technologies is similar to mine, newsgroups and mailing lists (in my case 
 via gmane.org's list2news service, so they too are presented as 
 newsgroups) are your primary domain, and you tend to treat the web as 
 read-only so rarely reply on a web forum, let alone edit a wiki.  I've 
 never gotten a wiki account here for that reason, either, or I'd have 
 probably gone ahead and made the suggested changes...
 
 But with a bit of luck someone with an existing (or even new) account 
 will be along to make the changes...

It's partially a read-only habit, but it's also that I'm just not confident
in deciding whether those actually *are* good suggestions, or put differently:
it's the public face of btrfs, and I don't want to accidentally do something to
ruin it (to use some hyperbole).

However, if somebody gives me the go-ahead, I might just edit the wiki myself
(though I don't know enough to be able to edit the kernel news entry ;-) ).

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup




kernel crash - btrfs check shows extent buffer leak; Suggestions?

2015-07-13 Thread Donald Pearson
Last time something happened and I poked at it myself, I ended up
ruining the pool, so I thought I'd ask here before doing anything.

I'm not sure if this really indicates that anything needs doing or
not.  The filesystem will mount like normal.

It doesn't look like the core dump was written anywhere but I've never
actually looked for it before.  I'm still Googling where it might be.

[root@san01 ~]# btrfs check /dev/sdi
Checking filesystem on /dev/sdi
UUID: 6848df32-bd2a-49b8-b3b9-40038f98ef8a
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Counts for qgroup id: 352 are different
our:referenced 699122253824 referenced compressed 699122253824
disk:   referenced 699151855616 referenced compressed 699151855616
diff:   referenced -29601792 referenced compressed -29601792
our:exclusive 1279471616 exclusive compressed 1279471616
disk:   exclusive 1279471616 exclusive compressed 1279471616
Counts for qgroup id: 844 are different
our:referenced 699130273792 referenced compressed 699130273792
disk:   referenced 699159875584 referenced compressed 699159875584
diff:   referenced -29601792 referenced compressed -29601792
our:exclusive 81920 exclusive compressed 81920
disk:   exclusive 81920 exclusive compressed 81920
found 875663138891 bytes used err is 0
total csum bytes: 790806028
total tree bytes: 1950498816
total fs tree bytes: 436994048
total extent tree bytes: 526237696
btree space waste bytes: 439941098
file data blocks allocated: 981391929344
 referenced 1086835916800
btrfs-progs v4.1
extent buffer leak: start 3338550263808 len 16384
extent buffer leak: start 3338550165504 len 16384
extent buffer leak: start 3100998254592 len 16384
extent buffer leak: start 3100998270976 len 16384
extent buffer leak: start 3100998287360 len 16384
extent buffer leak: start 3100998303744 len 16384
extent buffer leak: start 3100998320128 len 16384
extent buffer leak: start 3100998336512 len 16384
extent buffer leak: start 3100998352896 len 16384
extent buffer leak: start 3100998369280 len 16384
extent buffer leak: start 3338550149120 len 16384
extent buffer leak: start 3338550132736 len 16384
extent buffer leak: start 2756246339584 len 16384
extent buffer leak: start 2756284366848 len 16384
extent buffer leak: start 3339485298688 len 16384
extent buffer leak: start 3339485347840 len 16384
extent buffer leak: start 3339485429760 len 16384
extent buffer leak: start 3339485446144 len 16384
extent buffer leak: start 3339485462528 len 16384
extent buffer leak: start 3339485528064 len 16384
extent buffer leak: start 333948558 len 16384
extent buffer leak: start 3339485560832 len 16384
extent buffer leak: start 3339488018432 len 16384
extent buffer leak: start 3339489361920 len 16384
extent buffer leak: start 3339504140288 len 16384
extent buffer leak: start 3339504156672 len 16384
extent buffer leak: start 3339504435200 len 16384
extent buffer leak: start 3339504467968 len 16384
extent buffer leak: start 3339504484352 len 16384
extent buffer leak: start 3339505778688 len 16384
extent buffer leak: start 3339505811456 len 16384
extent buffer leak: start 3339507105792 len 16384
extent buffer leak: start 3339507187712 len 16384
extent buffer leak: start 3339507204096 len 16384
extent buffer leak: start 3339518394368 len 16384


Re: slowdown after one week

2015-07-13 Thread Stefan Priebe - Profihost AG

Am 13.07.2015 um 13:20 schrieb Austin S Hemmelgarn:
 On 2015-07-11 02:46, Stefan Priebe wrote:
 Hi,

 While using a 40TB btrfs partition for VM backups, I see a massive
 slowdown after around one week.

 The backup task usually takes 2-3 hours. After one week it takes 20
 hours. If I umount and remount the btrfs volume it takes 2-3 hours again.

 Kernel 4.1.1

 I've been seeing similar (although much less drastic) slowdowns over
 time myself pretty much since I started using BTRFS (IIRC, sometime
 around 3.16).  If you're not constantly writing to that backup volume,
 you might want to consider setting up automounting for it.
 

Yes, but that's awful; it's a bug. It would be very nice if someone
involved in btrfs development could comment on that one.


Re: Btrfs read-only after btrfs-convert from Ext4 workaround

2015-07-13 Thread René Pfeiffer
On Jul 12, 2015 at 2026 -0600, Chris Murphy appeared and said:
 On Sun, Jul 12, 2015 at 7:23 PM, René Pfeiffer l...@luchs.at wrote:
 ...
  Output of uname, btrfs, and the dmesg log is attached. Let me know if you
  need anything else. The old Btrfs is still on another disk, and I can
  extract information from it.
 
 If you can run 'btrfs check' on it (without repair) using btrfs-progs
 4.0, and 3.19.1, and report the results of each, that would be really
 useful.

Here we go.

Best,
René.

-- 
  )\._.,--,'``.  fL  Let GNU/Linux work for you while you take a nap.
 /,   _.. \   _\  (`._ ,. R. Pfeiffer lynx at luchs.at + http://web.luchs.at/
`._.-(,_..'--(,_..'`-.;.'  - System administration + Consulting + Teaching -
Got mail delivery problems?  http://web.luchs.at/information/blockedmail.php
Checking filesystem on /dev/mapper/oldcrypt
UUID: 703fc8b4-b2b9-470b-af2f-9aae9536c2fb
checking extents
checking free space cache
There is no free space entry for 163242479616-163242483712
There is no free space entry for 163242479616-167537278976
cache appears valid but isnt 162168569856
found 358532382931 bytes used err is -22
total csum bytes: 346644792
total tree bytes: 3568107520
total fs tree bytes: 3061465088
total extent tree bytes: 50102272
btree space waste bytes: 987626752
file data blocks allocated: 355459522560
 referenced 354990657536
btrfs-progs v3.19.1
Checking filesystem on /dev/mapper/oldcrypt
UUID: 703fc8b4-b2b9-470b-af2f-9aae9536c2fb
checking extents
checking free space cache
block group 162168569856 has wrong amount of free space
failed to load free space cache for block group 162168569856
checking fs roots
root 5 inode 39321856 errors 200, dir isize wrong
root 5 inode 40898635 errors 200, dir isize wrong
found 358532382931 bytes used err is 1
total csum bytes: 346644792
total tree bytes: 3568107520
total fs tree bytes: 3061465088
total extent tree bytes: 50102272
btree space waste bytes: 987626752
file data blocks allocated: 355459522560
 referenced 354990657536
btrfs-progs v4.0




[GIT PULL] More btrfs bug fixes

2015-07-13 Thread fdmanana
From: Filipe Manana fdman...@suse.com

Hi Chris,

Please consider the following changes for the kernel 4.2 release. All
these patches have been available in the mailing list for some time.

One of the patches is a fix for a regression in the delayed references
code that landed in 4.2-rc1. Two of them are for issues reported by users
on the list and IRC recently (which I've cc'ed for stable) and the final
one is just a missing update of an inode's on disk size after truncating
a file if the no_holes feature is enabled, which I found some time ago.

I have rebased them on top of your current integration-4.2 branch,
re-tested them and incorporated any tags people have added through the
mailing list (Reviewed-by, Acked-by).

Thanks.

The following changes since commit 9689457b5b0a2b69874c421a489d3fb50ca76b7b:

  Btrfs: fix wrong check for btrfs_force_chunk_alloc() (2015-07-01 17:17:22 
-0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git 
integration-4.2

for you to fetch changes up to cffc3374e567ef42954f3c7070b3fa83f20f9684:

  Btrfs: fix order by which delayed references are run (2015-07-11 22:36:44 
+0100)


Filipe Manana (4):
  Btrfs: fix shrinking truncate when the no_holes feature is enabled
  Btrfs: fix memory leak in the extent_same ioctl
  Btrfs: fix list transaction->pending_ordered corruption
  Btrfs: fix order by which delayed references are run

 fs/btrfs/extent-tree.c | 13 +
 fs/btrfs/inode.c   |  5 ++---
 fs/btrfs/ioctl.c   |  4 +++-
 fs/btrfs/transaction.c |  4 ++--
 4 files changed, 20 insertions(+), 6 deletions(-)

-- 
2.1.3



Re: Disk failed while doing scrub

2015-07-13 Thread Dāvis Mosāns
2015-07-13 11:12 GMT+03:00 Duncan 1i5t5.dun...@cox.net:
 You say five disk, but nowhere in your post do you mention what raid mode
 you were using, neither do you post btrfs filesystem show and btrfs
 filesystem df, as suggested on the wiki and which list that information.

Sorry, I forgot. I'm running Arch Linux 4.0.7, with btrfs-progs v4.1
Using RAID1 for metadata and single for data, with features
big_metadata, extended_iref, mixed_backref, no_holes, skinny_metadata
and mounted with noatime,compress=zlib,space_cache,autodefrag

Label: 'Data'  uuid: 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
   Total devices 5 FS bytes used 7.16TiB
   devid 1 size 2.73TiB used 2.35TiB path /dev/sdc
   devid 2 size 1.82TiB used 1.44TiB path /dev/sdd
   devid 3 size 1.82TiB used 1.44TiB path /dev/sde
   devid 4 size 1.82TiB used 1.44TiB path /dev/sdg
   devid 5 size 931.51GiB used 539.01GiB path /dev/sdh

Data, single: total=7.15TiB, used=7.15TiB
System, RAID1: total=8.00MiB, used=784.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=16.00GiB, used=14.37GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


 Because filesystem still mounts, I assume I should do btrfs device
 delete /dev/sdd /mntpoint and then restore damaged files from backup.

 You can try a replace, but with a failing drive still connected, people
 report mixed results.  It's likely to fail as it can't read certain
 blocks to transfer them to the new device.

As I understand it, device delete will copy data from that disk and
distribute it across the rest of the disks, while btrfs replace will
copy to a new disk, which must be at least the size of the disk I'm
replacing. Assuming the other existing disks are good, why would
replace be preferable over delete? Because delete could fail, but
replace wouldn't?


 There's no such partial-file with null-fill tools shipped just yet.
 Those files normally simply trigger errors trying to read them, because
 btrfs won't let you at them if the checksum doesn't verify.

From the journal I have only 14 files mentioned where errors occurred.
Now 13 of those files don't throw any errors and their SHAs match my
backups, so they're fine. And btrfs does actually allow copying/reading
that one damaged file; I only get an I/O error when trying to read data
from those broken sectors:

kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [0] tag[0], task
[88011c8c9900]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0001,  slot [0].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x4000 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:33:a1/0f:00:ab:00:00/40 tag 14 ncq 1966080 in
 res 41/40:00:48:40:a1/00:0f:ab:00:00/00 Emask 0x409 (media error) F
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 33 00 00 0f 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1


but all other sectors can be copied fine

$ du -m ./damaged_file
6250 ./damaged_file

$ cp ./damaged_file /tmp/
cp: error reading ‘damaged_file’: Input/output error

$ du -m /tmp/damaged_file
4335    /tmp/damaged_file

cp copies first file part correctly, and I verified that both
start of file (first 4336M) and end of file (last 1890M) SHA's match backup

$ head -c 4336M ./damaged_file | sha256sum
e81b20bfa7358c9f5a0ed165bffe43185abc59e35246e52a7be1d43e6b7e040d  -
$ head -c 4337M ./damaged_file | sha256sum
head: error reading ‘./damaged_file’: Input/output error

$ tail -c 1890M ./damaged_file | sha256sum
941568f4b614077858cb8c8dd262bb431bf4c45eca936af728ecffc95619cb60  -
$ tail -c 1891M ./damaged_file  | sha256sum
tail: error reading ‘./damaged_file’: Input/output error

With dd I can also copy almost the whole file, but using only the
noerror option excludes those regions from the target file rather than
filling them with nulls, so this isn't good for recovery:

$ dd conv=noerror if=damaged_file of=/tmp/damaged_file
dd: error reading ‘damaged_file’: Input/output error
8880328+0 records in
8880328+0 records out
4546727936 bytes (4,5 GB) copied, 69,7282 s, 65,2 MB/s
dd: error reading ‘damaged_file’: Input/output error
8930824+0 records in
8930824+0 records out
4572581888 bytes (4,6 GB) copied, 113,648 s, 40,2 MB/s
12801720+0 records in
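
Using conv=noerror,sync instead should make dd pad each unreadable
block with nulls, keeping offsets in the copy correct (that's standard
GNU dd behavior, though I haven't tried it on this file yet):

$ dd conv=noerror,sync if=damaged_file of=/tmp/damaged_file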

[PATCH] Revert btrfs-progs: mkfs: create only desired block groups for single device

2015-07-13 Thread Qu Wenruo
This reverts commit 5f8232e5c8f0b0de0ef426274911385b0e877392.

This commit causes a regression:
---
$ mkfs.btrfs -f /dev/sda6
$ btrfsck /dev/sda6
Checking filesystem on /dev/sda6
UUID: 2ebb483c-1986-4610-802a-c6f3e6ab4b76
checking extents
Chunk[256, 228, 0]: length(4194304), offset(0), type(2) mismatch with
block group[0, 192, 4194304]: offset(4194304), objectid(0), flags(34)
Chunk[256, 228, 4194304]: length(8388608), offset(4194304), type(4)
mismatch with block group[4194304, 192, 8388608]: offset(8388608),
objectid(4194304), flags(36)
Block group[0, 4194304] (flags = 34) didn't find the relative chunk.
Block group[4194304, 8388608] (flags = 36) didn't find the relative
chunk.
..
---

The commit has the following bugs causing the problem:

1) A typo forgets to add meta/data_profile for alloc_chunk.
The meta/data profile is added only when allocating the block group,
but not the chunk.

2) The type of the first system chunk cannot be modified yet.
The type of the first chunk and its stripe is hard-coded into the
make_btrfs() function. So even if we try to modify the type of the
block group, we are unable to change the type of the first chunk,
causing the chunk type mismatch problem.

The first bug can be fixed quite easily, but the second cannot.
The good news is that the last patch, "btrfs-progs: mkfs: Cleanup
temporary chunk to avoid strange balance behavior.", from my patchset
can handle it quite well on its own.

So just revert the patch.
A new bug fix for btrfsck (err is 0 even when the chunk/extent tree is
corrupted) and new test cases for mkfs will follow soon.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 mkfs.c | 34 +++---
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index ee8a3cb..afecf00 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -59,9 +59,8 @@ struct mkfs_allocation {
 	u64 system;
 };
 
-static int create_metadata_block_groups(struct btrfs_root *root,
-			u64 metadata_profile, int mixed,
-			struct mkfs_allocation *allocation)
+static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
+			struct mkfs_allocation *allocation)
 {
 	struct btrfs_trans_handle *trans;
 	u64 bytes_used;
@@ -74,7 +73,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 
 	root->fs_info->system_allocs = 1;
 	ret = btrfs_make_block_group(trans, root, bytes_used,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_SYSTEM,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     0, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
@@ -93,7 +91,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_METADATA |
 				     BTRFS_BLOCK_GROUP_DATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
@@ -110,7 +107,6 @@ static int create_metadata_block_groups(struct btrfs_root *root,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     metadata_profile |
 				     BTRFS_BLOCK_GROUP_METADATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     chunk_start, chunk_size);
@@ -126,7 +122,7 @@ err:
 }
 
 static int create_data_block_groups(struct btrfs_trans_handle *trans,
-			struct btrfs_root *root, u64 data_profile, int mixed,
+			struct btrfs_root *root, int mixed,
 			struct mkfs_allocation *allocation)
 {
 	u64 chunk_start = 0;
@@ -143,7 +139,6 @@ static int create_data_block_groups(struct btrfs_trans_handle *trans,
 	}
 	BUG_ON(ret);
 	ret = btrfs_make_block_group(trans, root, 0,
-				     data_profile |
 				     BTRFS_BLOCK_GROUP_DATA,
 				     BTRFS_FIRST_CHUNK_TREE_OBJECTID,
 				     chunk_start, chunk_size);
@@ -1337,8 +1332,6 @@ int main(int ac, char **av)
 	u64 alloc_start = 0;
 	u64 metadata_profile = 0;
 	u64 data_profile = 0;
-	u64 default_metadata_profile = 0;
-	u64 default_data_profile = 0;
 	u32 nodesize = max_t(u32, sysconf(_SC_PAGESIZE),
 			BTRFS_MKFS_DEFAULT_NODE_SIZE);
 	u32 sectorsize = 4096;
@@ -1697,19 +1690,7 @@ int main(int ac, char **av)
 	}
 	root->fs_info->alloc_start = alloc_start;
 
-	if (dev_cnt == 0) {
-		default_metadata_profile = metadata_profile;
-

Re: Wiki suggestions

2015-07-13 Thread Marc Joliet
Am Mon, 13 Jul 2015 19:21:54 +0200
schrieb Marc Joliet mar...@gmx.de:

 OK, I'll make the changes then (sans kernel log).

Just a heads up: I accepted the terms of service, but the link goes to a
non-existent wiki page.

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup




Re: Wiki suggestions

2015-07-13 Thread Marc Joliet
Am Mon, 13 Jul 2015 18:30:09 +0200
schrieb David Sterba dste...@suse.com:

 On Mon, Jul 13, 2015 at 01:18:27PM +0200, Marc Joliet wrote:
  Am Mon, 13 Jul 2015 06:56:17 + (UTC)
  schrieb Duncan 1i5t5.dun...@cox.net:
  
   Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
   
I hope it's not out of place, but I have a few suggestions for the Wiki:
   
   Just in case it wasn't obvious...  The wiki is open to user editing.  You 
   can, if you like, get an account and make the changes yourself. =:^)
   
   Of course, it's understandable if your reaction to web and wiki 
   technologies is similar to mine, newsgroups and mailing lists (in my case 
   via gmane.org's list2news service, so they too are presented as 
   newsgroups) are your primary domain, and you tend to treat the web as 
   read-only so rarely reply on a web forum, let alone edit a wiki.  I've 
   never gotten a wiki account here for that reason, either, or I'd have 
   probably gone ahead and made the suggested changes...
   
   But with a bit of luck someone with an existing (or even new) account 
   will be along to make the changes...
  
  It's partially a read-only habit, but it's also that I'm just not
  confident in deciding whether those actually *are* good suggestions,
  or put differently: it's the public face of btrfs, and I don't want
  to accidentally do something to ruin it (to use some hyperbole).
 
 All your suggestions are good, adding more articles/videos/talks should
 be easy as there's a section for that already. The news section is
 mostly written by me but if you keep your entries consistent with the
 rest then it's ok.
 
 There are a few people who watch over new wiki edits and fix/enhance
 them if needed.  You can't do too much damage unless you really want to.

OK, I'll make the changes then (sans kernel log).

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup




Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-13 Thread Chris Mason
On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
 Filipe,
 Thanks for the explanation. Those reasons were not so obvious for me.
 
 Would it make sense not to COW the block in case-1, if we are mounted
 with notreelog? Or, perhaps, to check that the block does not belong
 to a log tree?
 

Hi Alex,

The crc rules are the most important; we have to make sure the block
isn't changed while it is in flight.  Also, think about something like
this:

transaction writes block A, puts a pointer to it in the btree, generation Y

hard disk properly completes the IO

transaction rewrites block A, same generation Y

hard disk drops the IO on the floor and never does it

Later on, we try to read block A again.  We find it has the correct crc
and the correct generation number, but the contents are actually wrong.

 The second case is more difficult. One problem is that
 BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
 due to memory pressure (this is what I see happening), we complete the
 writeback, release the extent buffer, and pages are evicted from the
 page cache of btree_inode. After some time we read the block again
 (because we want to modify it in the same transaction), but its header
 is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
 this point it should be safe to avoid COW, we will re-COW.
 
 Would it make sense to have some runtime-only mechanism to lock-out
 the write-back for an eb? I.e., if we know that eb is not under
 writeback, and writeback is locked out from starting, we can redirty
 the block without COW. Then we allow the writeback to start when it
 wants to.
 
 In one of my test runs, btrfs had 6.4GB of metadata (before
 raid-induced overhead), but during a particular transaction a total of
 10GB of metadata (again, before raid-induced overhead) was written to
 disk. (This is the total of all ebs having
 header->generation == curr_transid, not only during commit of the
 transaction). This particular run was with notreelog.
 
 Machine had 8GB of RAM. Linux allows the btree_inode to grow its
 page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
 But even though the used amount of metadata is less than that, this
 re-COW'ing of already-COW'ed blocks seems to cause page-cache
 thrashing...

Interesting.  We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping memory pressure
related writeback if there isn't that much dirty.  There is a lot of
room to improve those decisions, like preferring to write leaves over
nodes, especially full leaves that are not likely to change again.
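
For reference, the check lives in the btree inode's writepages
callback and looks roughly like this (a simplified sketch; the exact
threshold constants and helper names may differ between releases):

static int btree_writepages(struct address_space *mapping,
                            struct writeback_control *wbc)
{
        struct btrfs_fs_info *fs_info;
        int ret;

        if (wbc->sync_mode == WB_SYNC_NONE) {
                /* Background writeback, not a transaction commit. */
                if (wbc->for_kupdate)
                        return 0;

                fs_info = BTRFS_I(mapping->host)->root->fs_info;
                /*
                 * Skip writeback entirely if there isn't much dirty
                 * metadata; this is a bit racy, but that's ok.
                 */
                ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
                                             BTRFS_DIRTY_METADATA_THRESH);
                if (ret < 0)
                        return 0;
        }
        return btree_write_cache_pages(mapping, wbc);
}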

-chris


Re: Wiki suggestions

2015-07-13 Thread David Sterba
On Mon, Jul 13, 2015 at 01:18:27PM +0200, Marc Joliet wrote:
 Am Mon, 13 Jul 2015 06:56:17 + (UTC)
 schrieb Duncan 1i5t5.dun...@cox.net:
 
  Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:
  
   I hope it's not out of place, but I have a few suggestions for the Wiki:
  
  Just in case it wasn't obvious...  The wiki is open to user editing.  You 
  can, if you like, get an account and make the changes yourself. =:^)
  
  Of course, it's understandable if your reaction to web and wiki 
  technologies is similar to mine, newsgroups and mailing lists (in my case 
  via gmane.org's list2news service, so they too are presented as 
  newsgroups) are your primary domain, and you tend to treat the web as 
  read-only so rarely reply on a web forum, let alone edit a wiki.  I've 
  never gotten a wiki account here for that reason, either, or I'd have 
  probably gone ahead and made the suggested changes...
  
  But with a bit of luck someone with an existing (or even new) account 
  will be along to make the changes...
 
 It's partially a read-only habit, but it's also that I'm just not confident
 in deciding whether those actually *are* good suggestions, or put differently:
 it's the public face of btrfs, and I don't want to accidentally do
 something to ruin it (to use some hyperbole).

All your suggestions are good, adding more articles/videos/talks should
be easy as there's a section for that already. The news section is
mostly written by me but if you keep your entries consistent with the
rest then it's ok.

There are a few people who watch over new wiki edits and fix/enhance
them if needed.  You can't do too much damage unless you really want to.


Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-13 Thread Alex Lyakas
Filipe,
Thanks for the explanation. Those reasons were not so obvious for me.

Would it make sense not to COW the block in case-1, if we are mounted
with notreelog? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and pages are evicted from the
page cache of btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock-out
the write-back for an eb? I.e., if we know that eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to.

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction a total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation == curr_transid, not only during commit of the
transaction). This particular run was with notreelog.

Machine had 8GB of RAM. Linux allows the btree_inode to grow its
page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
But even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.


On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana
fdman...@gmail.com wrote:
 On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas a...@zadarastorage.com wrote:
 Greetings,
 Looking at the code of should_cow_block(), I see:

 if (btrfs_header_generation(buf) == trans->transid &&
     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
 ...
 So if the extent buffer has been written to disk, and now is changed again
 in the same transaction, we insist on COW'ing it. Can anybody explain why
 COW is needed in this case? The transaction has not committed yet, so what
 is the danger of rewriting to the same location on disk? My understanding
 was that a tree block needs to be COW'ed at most once in the same
 transaction. But I see that this is not the case.

 That logic is there, as far as I can see, for at least 2 obvious reasons:

 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
 have the same transaction id/generation, and you can have multiple
 fsyncs (log transaction commits) per transaction so you need to ensure
 consistency. If we skipped the COWing in the example below, you would
 get an inconsistent log tree at log replay time when the fs is
 mounted:

 transaction N start

fsync inode A start
creates tree block X
flush X to disk
write a new superblock
fsync inode A end

fsync inode B start
skip COW of X because its generation == current transaction id and
 modify it in place
flush X to disk

 === crash ===

write a new superblock
fsync inode B end

 transaction N commit

 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
 written to disk but instead when we trigger writeback for it. So while
 the writeback is ongoing we want to make sure the block's content
 isn't concurrently modified (we don't keep the eb write locked to
 allow concurrent reads during the writeback).

 All tree blocks that don't belong to a log tree are normally written
 only at the end of a transaction commit. But often, due to memory
 pressure, for example, the VM can call the writepages() callback of the
 btree inode to force dirty tree blocks to be written to disk before
 the transaction commit.


 I am asking because I am doing some profiling of btrfs metadata work under
 heavy loads, and I see that sometimes btrfs COWs almost twice as many
 tree blocks as the total metadata size.

 Thanks,
 Alex.




 --
 Filipe David Manana,

 Reasonable men adapt themselves to the world.
  Unreasonable men adapt the world to themselves.
  That's why all progress depends on unreasonable men.


Re: Can't mount btrfs volume on rbd

2015-07-13 Thread Steve Dainard
Hi Qu,

I ran into this issue again, without pacemaker involved, so I'm really
not sure what is triggering this.

There is no content at all on this disk; basically it was created with
a btrfs filesystem and mounted, and now, some reboots (and possibly
hard resets) later, it won't mount, giving a stale file handle error.

I've DD'd the 10G disk and tarballed it to 10MB, I'll send it to you
in another email so the attachment doesn't spam the list.

Thanks,
Steve

On Mon, Jun 15, 2015 at 6:27 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:


 Steve Dainard wrote on 2015/06/15 09:19 -0700:

 Hi Qu,

 # btrfs --version
 btrfs-progs v4.0.1
 # btrfs check /dev/rbd30
 Checking filesystem on /dev/rbd30
 UUID: 1bb22a03-bc25-466f-b078-c66c6f6a6d28
 checking extents
 cmds-check.c:3735: check_owner_ref: Assertion `rec->is_root` failed.
 btrfs[0x41aee6]
 btrfs[0x423f5d]
 btrfs[0x424c99]
 btrfs[0x4258f6]
 btrfs(cmd_check+0x14a3)[0x42893d]
 btrfs(main+0x15d)[0x409c71]
 /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f29ce437af5]
 btrfs[0x409829]

 # btrfs-image /dev/rbd30 rbd30.image -c9
 # btrfs-image -r rbd30.image rbd30.image.2
 # mount rbd30.image.2 temp
 mount: mount /dev/loop0 on /mnt/temp failed: Stale file handle

 OK, my assumptions are all wrong.

 I'd better check the debug-tree output more carefully.

 BTW, is rbd30 the block device from which you took the debug-tree output?

 If so, would you please do a dd dump of it and send it to me?
 If it contains important/secret info, just forget this.

 Maybe I can improve the btrfsck tool to fix it.


 I have a suspicion this was caused by pacemaker starting
 ceph/filesystem resources on two nodes at the same time; I haven't
 been able to replicate the issue after a hard poweroff if ceph/btrfs are
 not being controlled by pacemaker.

 Did you mean mounting the same device on different systems?

 Thanks,
 Qu


 Thanks for your help.



 On Mon, Jun 15, 2015 at 1:06 AM, Qu Wenruo quwen...@cn.fujitsu.com
 wrote:

 The debug result seems valid.
 So I'm afraid the problem is not in btrfs.

 Would you please try the following 2 things to eliminate btrfs problems?

 1) btrfsck from 4.0.1 on the rbd

 If the assert still happens, please upload the image of the volume (dd
 image) to help us improve btrfs-progs.

 2) btrfs-image dump and rebuilt the fs into other place.

 # btrfs-image RBD_DEV tmp_file1 -c9
 # btrfs-image -r tmp_file1 tmp_file2
 # mount tmp_file2 mnt

 This will dump all metadata from RBD_DEV to tmp_file1,
 and then use tmp_file1 to rebuild a image called tmp_file2.

 If tmp_file2 can be mounted, then the metadata in the RBD device is
 completely OK, and we can conclude the problem is not caused by
 btrfs (maybe ceph?).

 BTW, all the commands are best executed on the device which you got
 the debug info from. As it's a small and almost empty device, command
 execution should be quite fast on it.

 Thanks,
 Qu


 On 2015-06-13 00:09, Steve Dainard wrote:


 Hi Qu,

 I have another volume with the same error, btrfs-debug-tree output
 from btrfs-progs 4.0.1 is here: http://pastebin.com/k3R3bngE

 I'm not sure how to interpret the output, but the exit status is 0 so
 it looks like btrfs doesn't think there's an issue with the file
 system.

 I get the same mount error with options ro,recovery.

 On Fri, Jun 12, 2015 at 12:23 AM, Qu Wenruo quwen...@cn.fujitsu.com
 wrote:




  Original Message  
 Subject: Can't mount btrfs volume on rbd
 From: Steve Dainard sdain...@spd1.com
 To: linux-btrfs@vger.kernel.org
 Date: 2015-06-11 23:26

 Hello,

 I'm getting an error when attempting to mount a volume on a host that
 was forceably powered off:

 # mount /dev/rbd4 climate-downscale-CMIP5/
 mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale
 file
 handle

 /var/log/messages:
 Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table

 # parted /dev/rbd4 print
 Model: Unknown (unknown)
 Disk /dev/rbd4: 36.5TB
 Sector size (logical/physical): 512B/512B
 Partition Table: loop
 Disk Flags:

 Number  Start  End SizeFile system  Flags
 1  0.00B  36.5TB  36.5TB  btrfs

 # btrfs check --repair /dev/rbd4
 enabling repair mode
 Checking filesystem on /dev/rbd4
 UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
 checking extents
 cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
 btrfs[0x4175cc]
 btrfs[0x41b873]
 btrfs[0x41c3fe]
 btrfs[0x41dc1d]
 btrfs[0x406922]


 OS: CentOS 7.1
 btrfs-progs: 3.16.2



 The btrfs-progs seems quite old, and the above btrfsck error seems
 quite possibly related to the old version.

 Would you please upgrade btrfs-progs to 4.0 and see what happens?
 Hopefully it can give better info.

 BTW, it's a good idea to call btrfs-debug-tree /dev/rbd4 to see the
 output.

 Thanks
 Qu.



 Ceph: version: 0.94.1/CentOS 7.1

 I haven't found any references to 'stale file handle' on btrfs.

 The underlying block device is ceph rbd, so I've posted to both lists
 for any feedback. Also once I 

Re: Can't mount btrfs volume on rbd

2015-07-13 Thread Qu Wenruo

Thanks a lot Steve!

With this binary dump, we can find out the cause of your problem 
and make btrfsck handle and repair it.


Furthermore, this provides a good hint on what's going wrong in the kernel.

I'll start investigating this right now.

Thanks,
Qu

Steve Dainard wrote on 2015/07/13 13:22 -0700:

Hi Qu,

I ran into this issue again, without pacemaker involved, so I'm really
not sure what is triggering this.

There is no content at all on this disk; basically it was created with
a btrfs filesystem and mounted, and now, some reboots (and possibly
hard resets) later, it won't mount, giving a stale file handle error.

I've DD'd the 10G disk and tarballed it to 10MB, I'll send it to you
in another email so the attachment doesn't spam the list.

Thanks,
Steve

On Mon, Jun 15, 2015 at 6:27 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:



Steve Dainard wrote on 2015/06/15 09:19 -0700:


Hi Qu,

# btrfs --version
btrfs-progs v4.0.1
# btrfs check /dev/rbd30
Checking filesystem on /dev/rbd30
UUID: 1bb22a03-bc25-466f-b078-c66c6f6a6d28
checking extents
cmds-check.c:3735: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x41aee6]
btrfs[0x423f5d]
btrfs[0x424c99]
btrfs[0x4258f6]
btrfs(cmd_check+0x14a3)[0x42893d]
btrfs(main+0x15d)[0x409c71]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f29ce437af5]
btrfs[0x409829]

# btrfs-image /dev/rbd30 rbd30.image -c9
# btrfs-image -r rbd30.image rbd30.image.2
# mount rbd30.image.2 temp
mount: mount /dev/loop0 on /mnt/temp failed: Stale file handle


OK, my assumptions are all wrong.

I'd better check the debug-tree output more carefully.

BTW, is rbd30 the block device from which you took the debug-tree output?

If so, would you please do a dd dump of it and send it to me?
If it contains important/secret info, just forget this.

Maybe I can improve the btrfsck tool to fix it.



I have a suspicion this was caused by pacemaker starting
ceph/filesystem resources on two nodes at the same time; I haven't
been able to replicate the issue after a hard poweroff if ceph/btrfs are
not being controlled by pacemaker.


Did you mean mounting the same device on different systems?

Thanks,
Qu



Thanks for your help.



On Mon, Jun 15, 2015 at 1:06 AM, Qu Wenruo quwen...@cn.fujitsu.com
wrote:


The debug result seems valid.
So I'm afraid the problem is not in btrfs.

Would you please try the following 2 things to eliminate btrfs problems?

1) btrfsck from 4.0.1 on the rbd

If the assert still happens, please upload the image of the volume (dd
image) to help us improve btrfs-progs.

2) btrfs-image dump and rebuilt the fs into other place.

# btrfs-image RBD_DEV tmp_file1 -c9
# btrfs-image -r tmp_file1 tmp_file2
# mount tmp_file2 mnt

This will dump all metadata from RBD_DEV to tmp_file1,
and then use tmp_file1 to rebuild a image called tmp_file2.

If tmp_file2 can be mounted, then the metadata in the RBD device is
completely OK, and we can conclude the problem is not caused by
btrfs (maybe ceph?).

BTW, all the commands are best executed on the device which you got
the debug info from. As it's a small and almost empty device, command
execution should be quite fast on it.

Thanks,
Qu


On 2015-06-13 00:09, Steve Dainard wrote:



Hi Qu,

I have another volume with the same error, btrfs-debug-tree output
from btrfs-progs 4.0.1 is here: http://pastebin.com/k3R3bngE

I'm not sure how to interpret the output, but the exit status is 0 so
it looks like btrfs doesn't think there's an issue with the file
system.

I get the same mount error with options ro,recovery.

On Fri, Jun 12, 2015 at 12:23 AM, Qu Wenruo quwen...@cn.fujitsu.com
wrote:





 Original Message  
Subject: Can't mount btrfs volume on rbd
From: Steve Dainard sdain...@spd1.com
To: linux-btrfs@vger.kernel.org
Date: 2015-06-11 23:26


Hello,

I'm getting an error when attempting to mount a volume on a host that
was forceably powered off:

# mount /dev/rbd4 climate-downscale-CMIP5/
mount: mount /dev/rbd4 on /mnt/climate-downscale-CMIP5 failed: Stale
file
handle

/var/log/messages:
Jun 10 15:31:07 node1 kernel: rbd4: unknown partition table

# parted /dev/rbd4 print
Model: Unknown (unknown)
Disk /dev/rbd4: 36.5TB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags:

Number  Start  End SizeFile system  Flags
 1  0.00B  36.5TB  36.5TB  btrfs

# btrfs check --repair /dev/rbd4
enabling repair mode
Checking filesystem on /dev/rbd4
UUID: dfe6b0c8-2866-4318-abc2-e1e75c891a5e
checking extents
cmds-check.c:2274: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x4175cc]
btrfs[0x41b873]
btrfs[0x41c3fe]
btrfs[0x41dc1d]
btrfs[0x406922]


OS: CentOS 7.1
btrfs-progs: 3.16.2




The btrfs-progs seems quite old, and the above btrfsck error seems
quite possibly related to the old version.

Would you please upgrade btrfs-progs to 4.0 and see what happens?
Hopefully it can give better info.

BTW, it's a good idea to call btrfs-debug-tree /dev/rbd4 to see the
output.

Thanks
Qu.





Re: [GIT PULL] More btrfs bug fixes

2015-07-13 Thread Chris Mason
On Sun, Jul 12, 2015 at 02:50:47AM +0100, fdman...@kernel.org wrote:
 From: Filipe Manana fdman...@suse.com
 
 Hi Chris,
 
 Please consider the following changes for the kernel 4.2 release. All
 these patches have been available in the mailing list for some time.
 
 One of the patches is a fix for a regression in the delayed references
 code that landed in 4.2-rc1. Two of them are for issues reported by users
 on the list and IRC recently (which I've cc'ed for stable) and the final
 one is just a missing update of an inode's on disk size after truncating
 a file if the no_holes feature is enabled, which I found some time ago.
 
 I have rebased them on top of your current integration-4.2 branch,
 re-tested them and incorporated any tags people have added through the
 mailing list (Reviewed-by, Acked-by).
 

Thanks Filipe, I've pulled these in along with a few more.  I'll test
overnight and push out in the morning.

-chris


Disk failed while doing scrub

2015-07-13 Thread Dāvis Mosāns
Hello,

Short version: while doing a scrub on a 5-disk btrfs filesystem,
/dev/sdd failed, and there was also an error on another disk (/dev/sdh).

Because the filesystem still mounts, I assume I should do btrfs device
delete /dev/sdd /mntpoint and then restore damaged files from backup.
Are all affected files listed in the journal? There are messages about
"x callbacks suppressed" so I'm not sure, and if they aren't, how can I
get a full list of damaged files?
Also, I wonder if there are any tools to recover partial file fragments
and reconstruct a file (with missing fragments filled with nulls)?
I assume that there's no point in running btrfs check
--check-data-csum because scrub already checks that?

from journal:

kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [1] tag[1], task
[88007efb8800]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0002,  slot [1].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:3d:a1/04:00:ab:00:00/40 tag 11 ncq 524288 in
res 41/40:00:48:40:a1/00:04:ab:00:00/00 Emask 0x409 (media error) F
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 3d 00 00 04 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [1] tag[1], task
[88007efb9a00]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0003,  slot [1].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 2 failed: 2
kernel: sas: trying to find task 0x8801e0cadb00
kernel: sas: sas_scsi_find_task: aborting task 0x8801e0cadb00
kernel: sas: sas_scsi_find_task: task 0x8801e0cadb00 is aborted
kernel: sas: sas_eh_handle_sas_errors: task 0x8801e0cadb00 is aborted
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata8: end_device-7:1: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata8: end_device-7:1: dev error handler
kernel: ata8.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6 frozen
kernel: ata8.00: failed command: READ FPDMA QUEUED
kernel: ata8.00: cmd 60/00:00:00:1b:36/04:00:bf:00:00/40 tag 18 ncq 524288 in
res 40/00:08:00:58:11/00:00:a6:00:00/40 Emask 0x4 (timeout)
kernel: ata8.00: status: { DRDY }
kernel: ata8: hard resetting link
kernel: sas: ata9: end_device-7:2: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9: log page 10h reported inactive tag 26
kernel: ata9.00: exception Emask 0x1 SAct 0x40 SErr 0x0 action 0x6
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/08:00:48:40:a1/00:00:ab:00:00/40 tag 22 ncq 4096 in
res 01/04:a8:40:40:a1/00:00:ab:00:00/40 Emask 0x3 (HSM violation)
kernel: ata9.00: status: { ERR }
kernel: ata9.00: error: { ABRT }
kernel: ata9: hard resetting link
kernel: sas: sas_form_port: phy1 belongs to port1 already(1)!
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[1]:rc= 0
kernel: ata8.00: configured for UDMA/133
kernel: ata8.00: device reported invalid CHS sector 0
kernel: ata8: EH complete
kernel: ata9: hard resetting link
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: ata9: hard resetting link
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: ata9.00: disabled
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 40 48 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 45 00 00 06 00 00
kernel: BTRFS: unable to fixup (regular) error at logical 7390602616832 on dev /dev/sdd
kernel: BTRFS: unable to fixup (regular) error at 

Re: Wiki suggestions

2015-07-13 Thread Duncan
Marc Joliet posted on Sun, 12 Jul 2015 14:26:04 +0200 as excerpted:

 I hope it's not out of place, but I have a few suggestions for the Wiki:

Just in case it wasn't obvious...  The wiki is open to user editing.  You 
can, if you like, get an account and make the changes yourself. =:^)

Of course, it's understandable if your reaction to web and wiki 
technologies is similar to mine: newsgroups and mailing lists (in my case 
via gmane.org's list2news service, so they too are presented as 
newsgroups) are your primary domain, and you treat the web as read-only, 
rarely replying on a web forum, let alone editing a wiki.  I've never 
gotten a wiki account here for that reason either, or I'd probably have 
gone ahead and made the suggested changes...

But with a bit of luck someone with an existing (or even new) account 
will be along to make the changes...

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Can't remove missing device

2015-07-13 Thread Patrik Lundquist
On 10 July 2015 at 06:05, None None whocares0...@freemail.hu wrote:
 According to dmesg, sda returns bad data, but the SMART values for it seem fine.

 # smartctl -a /dev/sda
...
 SMART Self-test log structure revision number 1
 No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Run smartctl -t long /dev/sda
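
Once the test finishes (an extended self-test can take hours on a large
drive), the results show up in the self-test log. Standard smartmontools
usage, with the device name from your post:

  smartctl -l selftest /dev/sda  # self-test log, including any failing LBA
  smartctl -A /dev/sda           # re-check the attribute table afterwards

A failing LBA in the self-test log would corroborate the bad data dmesg
is reporting.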


Re: Disk failed while doing scrub

2015-07-13 Thread Duncan
Dāvis Mosāns posted on Mon, 13 Jul 2015 09:26:05 +0300 as excerpted:

 Short version: while doing a scrub on a 5-disk btrfs filesystem, /dev/sdd
 failed, and there were also some errors on another disk (/dev/sdh)

You say five disks, but nowhere in your post do you mention what raid 
mode you were using, nor do you post the output of btrfs filesystem show 
and btrfs filesystem df, as suggested on the wiki; those commands list 
that information.
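
For reference, those are (btrfs-progs commands; substitute your actual 
mount point):

  btrfs filesystem show          # member devices, sizes, per-device usage
  btrfs filesystem df /mntpoint  # data/metadata profiles (raid0, raid1, ...)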

FWIW, the btrfs defaults for a multi-device filesystem are raid1 
metadata, raid0 data.  If you didn't specify a raid level at mkfs time, 
it's very likely that's what you're using.  The scrub results seem to 
support this: if the data had been raid1 or raid10, nearly all the errors 
should have been correctable by pulling from the second copy.  And 
raid5/6 should have been able to recover from parity, tho that mode is 
new enough that it's still not recommended, as the chances of bugs, and 
thus of failures to work properly, are much higher.

So you really should have been using raid1/10 if you wanted device-
failure tolerance.  But you didn't say, and if you're using the defaults, 
as seems reasonably likely, your data was raid0, and thus it's likely 
many/most files are either gone or damaged beyond repair.
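
For illustration, this is roughly how the profiles would have been set 
explicitly at mkfs time (a sketch; the device names are hypothetical):

  mkfs.btrfs -d raid1 -m raid1 /dev/sd[b-f]
  # -d sets the data profile, -m the metadata profile; omitting both on a
  # multi-device mkfs gives the raid0 data / raid1 metadata defaults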

(As it happens, I have a number of btrfs raid1 data/metadata filesystems 
on a pair of partitioned ssds, each btrfs on a corresponding partition on 
both of them, with one of the ssds developing bad sectors and slowly 
failing.  But the other member of the raid1 pair is solid and I have 
backups, as well as a spare I can replace the failing one with when I 
decide it's time, so I've been letting the bad one stick around, as much 
as anything out of morbid curiosity, watching it slowly fail.  So I know 
exactly how scrub on btrfs raid1 behaves in a bad-sector case: it pulls 
the copy from the good device to overwrite the bad copy, triggering the 
device's sector remapping in the process.  Despite all the read errors, 
they've all been correctable, because I'm using raid1 for both data and 
metadata.)

 Because the filesystem still mounts, I assume I should do btrfs device
 delete /dev/sdd /mntpoint and then restore damaged files from backup.

You can try a replace, but with the failing drive still connected, people 
report mixed results.  It's likely to fail if it can't read certain 
blocks to transfer them to the new device.
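
For the record, a replace attempt would look something like this (a 
sketch; the new device name is hypothetical):

  btrfs replace start /dev/sdd /dev/sdX /mntpoint  # migrate sdd onto the new device
  btrfs replace status /mntpoint                   # watch progress

With unreadable blocks on sdd and no second copy to fall back on, expect 
it to error out partway through.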

With raid1 or better, physically disconnecting the failing device and 
doing a device delete missing can work.  (Replace missing should 
eventually work too, but AFAIK it doesn't in released versions and I'm 
not sure it's even in integration yet, tho there are patches on-list that 
should make it work.)  With raid0/single, you can mount with a missing 
device if you use degraded,ro, but obviously that'll only let you try to 
copy files off, and you'll likely not have a lot of luck with raid0, with 
files missing, tho a bit more luck with single.
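
A sketch of both paths, with the failing device physically disconnected 
(device names hypothetical):

  mount -o degraded /dev/sde /mntpoint     # raid1/10: degraded, read-write
  btrfs device delete missing /mntpoint    # then drop the absent device
  mount -o degraded,ro /dev/sde /mntpoint  # raid0/single: read-only salvage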

In the likely raid0/single case, your best bet is probably to copy off 
what you can, and/or restore from backups.  See the discussion below.

 Are all affected files listed in the journal? There are messages about x
 callbacks suppressed, so I'm not sure; and if they aren't all listed, how
 do I get a full list of damaged files?

 Also, I wonder if there are any tools to recover partial file fragments
 and reconstruct a file (with the missing fragments filled with nulls)?
 I assume there's no point in running btrfs check
 --check-data-csum, because scrub already checks that?

There are no such partial-file, null-fill tools shipped just yet.  Such 
files normally simply trigger errors when read, because btrfs won't let 
you at them if the checksum doesn't verify.

There /is/, however, a command that can be used to either regenerate or 
zero out the checksum tree.  See btrfs check --init-csum-tree.  Current 
versions recalculate the csums; older versions (btrfsck, as it was called 
before btrfs check) simply zeroed the tree out.  Then you can read the 
files despite bad checksums, tho you'll still get errors if a block 
physically cannot be read.
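
A sketch of that invocation (run it only on the unmounted filesystem; 
point it at any member device, here a hypothetical /dev/sde):

  btrfs check --init-csum-tree /dev/sde  # rebuild (or, with old btrfsck, zero) the csum tree

Since this rewrites metadata on a filesystem that's already damaged, 
treat it as a salvage step, not routine repair.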

There's also btrfs restore, which works on the unmounted filesystem 
without actually writing to it, copying the files it can read to a new 
location.  That location of course has to be a filesystem with enough 
room for the restored files, altho it's possible to tell restore to do 
only specific subdirs, for instance.
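
A sketch (the target directory is hypothetical and must be on a 
different, healthy filesystem; point restore at any member device):

  btrfs restore /dev/sde /mnt/recovery                      # copy out everything readable
  btrfs restore --path-regex '^/(|home(|/.*))$' /dev/sde /mnt/recovery  # one subtree only

The --path-regex syntax is picky (it has to match every parent directory 
along the path), so check the btrfs-restore documentation before relying 
on it.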

What I'd recommend depends on how complete and how recent your backup 
is.  If it's complete and recent enough, probably the easiest thing is to 
simply blow away the bad filesystem and start over, recovering from the 
backup to a new filesystem.

If there are files you'd like to get back that weren't backed up, or 
where the backup is old, then since the filesystem is mountable, I'd 
probably copy everything off it that I could.  Then I'd try restore, 
letting it restore to the same location I had copied to, but NOT using 
the --overwrite option, so it only writes any files it could restore that 
the copy wasn't able to 
get