Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Qu Wenruo



At 10/12/2016 12:37 PM, Zygo Blaxell wrote:

On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.


Totally true.

Original RAID5/6 design is only to handle missing device, not rotted bits.


Missing device is the _only_ thing the current design handles.  i.e. you
umount the filesystem cleanly, remove a disk, and mount it again degraded,
and then the only thing you can safely do with the filesystem is delete
or replace a device.  There is also a probability of being able to repair
bitrot under some circumstances.

If your disk failure looks any different from this, btrfs can't handle it.
If a disk fails while the array is running and the filesystem is writing,
the filesystem is likely to be severely damaged, possibly unrecoverably.

A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
snowball's chance in hell of surviving a disk failure on a live array
with only data losses.  This would work if mdadm and btrfs successfully
arrange to have each dup copy of metadata updated separately, and one
of the copies survives the raid5 write hole.  I've never tested this
configuration, and I'd test the heck out of it before considering
using it.


So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for problems we've seen with raid56. Not the parity
scrub bug which - yeah NOT good, not least of which is the data
integrity guarantees Btrfs is purported to make are substantially
negated by this bug. I think the bark is worse than the bite. It is
not the bark we'd like Btrfs to have though, for sure.



The current btrfs RAID5/6 scrub problem is that we don't make full use of the tree
and data checksums.

[snip]

This leads directly to a variety of problems with the diagnostic tools,
e.g.  scrub reports errors randomly across devices, and cannot report the
path of files containing corrupted blocks if it's the parity block that
gets corrupted.


At least that's better than screwing up good stripes.

The tool is just used to let the user know if there are any corrupted stripes,
like the kernel scrub does, but with better behavior: it won't reconstruct
stripes while ignoring checksums.



For a human-readable report, it's not that hard (compared to the complex
csum and parity checks) to implement, and it can be added later.
For a parity report, there is no way to output any human-readable result
anyway.




btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
to reconstruct any parity that was damaged by an incomplete data stripe
update.
 As long as all disks are working, the parity can be reconstructed
from the data disks.  If a disk fails prior to the completion of the
scrub, any data stripes that were written during previous crashes may
be destroyed.  And all that assumes the scrub bugs are fixed first.


This is true.
I didn't take this into account.

But this is not a *single* problem; it's 2 problems combined:
1) Power loss
2) Device crash

Before making things complex, why not focus on a single problem first?

Not to mention that the probability of both happening together is much smaller
than that of either single problem.



If writes occur after a disk fails, they all temporarily corrupt small
amounts of data in the filesystem.  btrfs cannot tolerate any metadata
corruption (it relies on redundant metadata to self-repair), so when a
write to metadata is interrupted, the filesystem is instantly doomed
(damaged beyond the current tools' ability to repair and mount
read-write).


That's why we use a higher duplication level for metadata by default.
And considering metadata size, it's much more acceptable to use RAID1 for
metadata rather than RAID5/6.




Currently the upper layers of the filesystem assume that once data
blocks are written to disk, they are stable.  This is not true in raid5/6
because the parity and data blocks within each stripe cannot be updated
atomically.


True, but if we ignore parity, we'd find that RAID5 is just RAID0.

COW ensures (cowed) data and metadata are all safe, and checksums will
ensure they are OK, so even for RAID0 a case like power loss is not a
problem.


So we should follow the csum first and then the parity.

If we follow this principle, RAID5 should be a RAID0 with a somewhat
higher possibility of recovering in some cases, like one missing device.


So, I'd like to fix RAID5 scrub to make it at least better than RAID0,
not worse than RAID0.




 btrfs doesn't avoid writing new data in the same RAID stripe
as old data (it provides a rmw function for raid56, which is simply a bug
in a CoW filesystem), so previously committed data can be lost.

Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Zygo Blaxell
On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:
> >But consider the identical scenario with md or LVM raid5, or any
> >conventional hardware raid5. A scrub check simply reports a mismatch.
> >It's unknown whether data or parity is bad, so the bad data strip is
> >propagated upward to user space without error. On a scrub repair, the
> >data strip is assumed to be good, and good parity is overwritten with
> >bad.
> 
> Totally true.
> 
> Original RAID5/6 design is only to handle missing device, not rotted bits.

Missing device is the _only_ thing the current design handles.  i.e. you
umount the filesystem cleanly, remove a disk, and mount it again degraded,
and then the only thing you can safely do with the filesystem is delete
or replace a device.  There is also a probability of being able to repair
bitrot under some circumstances.

If your disk failure looks any different from this, btrfs can't handle it.
If a disk fails while the array is running and the filesystem is writing,
the filesystem is likely to be severely damaged, possibly unrecoverably.

A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
snowball's chance in hell of surviving a disk failure on a live array
with only data losses.  This would work if mdadm and btrfs successfully
arrange to have each dup copy of metadata updated separately, and one
of the copies survives the raid5 write hole.  I've never tested this
configuration, and I'd test the heck out of it before considering
using it.

> >So while I agree in total that Btrfs raid56 isn't mature or tested
> >enough to consider it production ready, I think that's because of the
> >UNKNOWN causes for problems we've seen with raid56. Not the parity
> >scrub bug which - yeah NOT good, not least of which is the data
> >integrity guarantees Btrfs is purported to make are substantially
> >negated by this bug. I think the bark is worse than the bite. It is
> >not the bark we'd like Btrfs to have though, for sure.
> >
> 
> Current btrfs RAID5/6 scrub problem is, we don't take full usage of tree and
> data checksum.
[snip]

This leads directly to a variety of problems with the diagnostic tools,
e.g.  scrub reports errors randomly across devices, and cannot report the
path of files containing corrupted blocks if it's the parity block that
gets corrupted.

btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
to reconstruct any parity that was damaged by an incomplete data stripe
update.  As long as all disks are working, the parity can be reconstructed
from the data disks.  If a disk fails prior to the completion of the
scrub, any data stripes that were written during previous crashes may
be destroyed.  And all that assumes the scrub bugs are fixed first.

If writes occur after a disk fails, they all temporarily corrupt small
amounts of data in the filesystem.  btrfs cannot tolerate any metadata
corruption (it relies on redundant metadata to self-repair), so when a
write to metadata is interrupted, the filesystem is instantly doomed
(damaged beyond the current tools' ability to repair and mount
read-write).

Currently the upper layers of the filesystem assume that once data
blocks are written to disk, they are stable.  This is not true in raid5/6
because the parity and data blocks within each stripe cannot be updated
atomically.  btrfs doesn't avoid writing new data in the same RAID stripe
as old data (it provides a rmw function for raid56, which is simply a bug
in a CoW filesystem), so previously committed data can be lost.  If the
previously committed data is part of the metadata tree, the filesystem
is doomed; for ordinary data blocks there are just a few dozen to a few
thousand corrupted files for the admin to clean up after each crash.

It might be possible to hack up the allocator to pack writes into empty
stripes to avoid the write hole, but every time I think about this it
looks insanely hard to do (or insanely wasteful of space) for data
stripes.
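
To make that failure mode concrete, here is a toy user-space sketch of the write
hole (plain XOR parity over two single-byte data "devices"; not btrfs code, and
all values are invented for illustration):

/* write_hole_demo.c - toy illustration of the raid5 write hole.
 * Not btrfs code: two data "devices" and one parity "device",
 * parity = d0 ^ d1, all single bytes for clarity. */
#include <stdio.h>
#include <assert.h>

int main(void)
{
    unsigned char d0 = 0xAA;          /* previously committed data   */
    unsigned char d1 = 0x11;          /* data sharing the stripe     */
    unsigned char p  = d0 ^ d1;       /* parity is consistent        */

    /* RMW update of d1: the data block hits disk first, then the
     * machine crashes before the new parity is written.             */
    d1 = 0x22;                        /* new data written            */
    /* p = d0 ^ d1;  <-- never happens: crash here (write hole)      */

    /* Later, the disk holding d0 fails.  Reconstruction uses the
     * stale parity and returns garbage for the old, unrelated d0.   */
    unsigned char rebuilt_d0 = p ^ d1;
    printf("original d0 = 0x%02X, rebuilt d0 = 0x%02X\n",
           (unsigned)d0, (unsigned)rebuilt_d0);
    assert(rebuilt_d0 != d0);         /* previously committed data lost */
    return 0;
}

The only point is that the stale parity silently corrupts a block that was never
being written at the time of the crash.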





Re: [RFC] btrfs: make max inline data can be equal to sectorsize

2016-10-11 Thread Wang Xiaoguang

hi,

On 10/11/2016 11:49 PM, Chris Murphy wrote:

On Tue, Oct 11, 2016 at 12:47 AM, Wang Xiaoguang
 wrote:

If we use mount option "-o max_inline=sectorsize", say 4096, indeed
even for a fresh fs, say nodesize is 16k, we can not make the first
4k of data completely inline. I found this condition causing the issue:
   !compressed_size && (actual_end & (root->sectorsize - 1)) == 0

If it returns true, we'll not make data inline. For 4k sectorsize,
a 0~4094 data range can be made inline, but 0~4095 can not.
I don't think this limitation is useful, so remove it here, which will
allow the max inline data to be equal to the sectorsize.

Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/inode.c | 2 --
  1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ea15520..c0db393 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 if (start > 0 ||
 actual_end > root->sectorsize ||
 data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
-   (!compressed_size &&
-   (actual_end & (root->sectorsize - 1)) == 0) ||
 end + 1 < isize ||
 data_len > root->fs_info->max_inline) {
 return 1;
--
2.9.0


Before making any further changes to inline data, does it make sense
to find the source of corruption Zygo has been experiencing? That's in
the "btrfs rare silent data corruption with kernel data leak" thread.

Yes, agree.
Also Zygo has sent a patch to fix that bug this morning :)

Regards,
Xiaoguang Wang










Re: [PATCH 2/2] btrfs: fix false enospc for compression

2016-10-11 Thread Wang Xiaoguang

hi,

Stefan often reports enospc errors on his servers when btrfs compression is
enabled. Now he has applied these 2 patches, and no enospc error has occurred
for more than 6 days, so it seems they are useful :)

And these 2 patches are somewhat big, please check them, thanks.

Regards,
Xiaoguang Wang
On 10/06/2016 10:51 AM, Wang Xiaoguang wrote:

When testing btrfs compression, we sometimes got an ENOSPC error even though the
fs still had plenty of free space; xfstests generic/171, generic/172, generic/173,
generic/174, and generic/175 can reveal this bug in my test environment when
compression is enabled.

After some debugging work, we found that it's btrfs_delalloc_reserve_metadata()
which sometimes tries to reserve plenty of metadata space, even for a very small
data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes
we try to reserve is calculated from the difference between outstanding_extents
and reserved_extents. Please see the case below for how ENOSPC occurs:

   1, Buffered write 128MB of data in units of 128KB, so finally we'll have inode
outstanding extents be 1, and reserved_extents be 1024. Note it's
btrfs_merge_extent_hook() that merges these 128KB units into one big
outstanding extent, but does not change reserved_extents.

   2, When writing dirty pages, for compression, cow_file_range_async() will
split the above big extent in units of 128KB (the compression extent size is 128KB).
When the first split operation finishes, we'll have 2 outstanding extents and 1024
reserved extents, and just at that moment the currently generated ordered extent is
dispatched to run and completes; then btrfs_delalloc_release_metadata() (see
btrfs_finish_ordered_io()) will be called to release metadata, after which we
will have 1 outstanding extent and 1 reserved extent (also see the logic in
drop_outstanding_extent()). Later cow_file_range_async() continues to handle the
remaining data range [128KB, 128MB), and if no other ordered extent was dispatched
to run, there will be 1023 outstanding extents and 1 reserved extent.

   3, Now if another buffered write for this file enters, then
btrfs_delalloc_reserve_metadata() will at least try to reserve metadata
for 1023 outstanding extents' metadata; for a 16KB node size that is
1023*16384*2*8, about 255MB, and for a 64K node size it is 1023*65536*8*2,
about 1GB of metadata, so obviously it's not sane and can easily result in an
enospc error.
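
As a quick sanity check of the arithmetic in point 3 (a stand-alone sketch only;
the constants mirror the numbers quoted above, not the exact kernel reservation
code):

/* reserve_calc.c - back-of-the-envelope check of the metadata
 * reservation quoted above: outstanding_extents * nodesize * 2 * 8.
 * This mirrors the numbers in the commit message, not the exact
 * kernel code path. */
#include <stdio.h>

int main(void)
{
    unsigned long long outstanding = 1023;   /* extents left after step 2 */
    unsigned long long nodesizes[] = { 16384, 65536 };

    for (int i = 0; i < 2; i++) {
        unsigned long long bytes = outstanding * nodesizes[i] * 2 * 8;
        printf("nodesize %6llu -> reserve about %llu MiB\n",
               nodesizes[i], bytes >> 20);
    }
    return 0;
}

This prints roughly 255 MiB and 1023 MiB, matching the figures above.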

The root cause is that for compression, the max extent size will no longer be
BTRFS_MAX_EXTENT_SIZE (128MB), it'll be 128KB, so the current metadata reservation
method in btrfs is not appropriate or correct. Here we introduce:
enum btrfs_metadata_reserve_type {
BTRFS_RESERVE_NORMAL,
BTRFS_RESERVE_COMPRESS,
};
and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space()
by adding a new enum btrfs_metadata_reserve_type argument. When a data range will
go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metadata.
Meanwhile we introduce the EXTENT_COMPRESS flag to mark a data range that will go
through the compression path.

With this patch, we can fix these false enospc errors for compression.

Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/ctree.h |  31 ++--
  fs/btrfs/extent-tree.c   |  55 +
  fs/btrfs/extent_io.c |  59 +-
  fs/btrfs/extent_io.h |   2 +
  fs/btrfs/file.c  |  26 +--
  fs/btrfs/free-space-cache.c  |   6 +-
  fs/btrfs/inode-map.c |   5 +-
  fs/btrfs/inode.c | 181 ---
  fs/btrfs/ioctl.c |  12 ++-
  fs/btrfs/relocation.c|  14 +++-
  fs/btrfs/tests/inode-tests.c |  15 ++--
  11 files changed, 309 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 16885f6..fa6a19a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -97,6 +97,19 @@ static const int btrfs_csum_sizes[] = { 4 };
  
  #define BTRFS_DIRTY_METADATA_THRESH	SZ_32M
  
+/*

+ * for compression, max file extent size would be limited to 128K, so when
+ * reserving metadata for such delalloc writes, pass BTRFS_RESERVE_COMPRESS to
+ * btrfs_delalloc_reserve_metadata() or btrfs_delalloc_reserve_space() to
+ * calculate metadata, for none-compression, use BTRFS_RESERVE_NORMAL.
+ */
+enum btrfs_metadata_reserve_type {
+   BTRFS_RESERVE_NORMAL,
+   BTRFS_RESERVE_COMPRESS,
+};
+int inode_need_compress(struct inode *inode);
+u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+
  #define BTRFS_MAX_EXTENT_SIZE SZ_128M
  
  struct btrfs_mapping_tree {

@@ -2677,10 +2690,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
  void btrfs_subvolume_release_metadata(struct btrfs_root *root,
  struct btrfs_block_rsv *rsv,
  u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void 

[PATCH] btrfs: fix silent data corruption while reading compressed inline extents

2016-10-11 Thread Zygo Blaxell
rsync -S causes a large number of small writes separated by small seeks
to form sparse holes in files that contain runs of zero bytes.  Rarely,
this can lead btrfs to write a file with a compressed inline extent
followed by other data, like this:

Filesystem type is: 9123683e
File size of /try/./30/share/locale/nl/LC_MESSAGES/tar.mo is 61906 (16 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..4095:  0..  4095:   4096: encoded,not_aligned,inline
   1:1..  15: 331372..331386: 15:  1: last,encoded,eof
/try/./30/share/locale/nl/LC_MESSAGES/tar.mo: 2 extents found

The inline extent size is less than the page size, so the ram_bytes field
in the extent is smaller than 4096.  The difference between ram_bytes and
the end of the first page of the file forms a small hole.  Like any other
hole, the correct value of each byte within the hole is zero.

When the inline extent is not compressed, btrfs_get_extent copies the
inline extent data and then memsets the remainder of the page to zero.
There is no corruption in this case.

When the inline extent is compressed, uncompress_inline uses the
ram_bytes field from the extent ref as the size of the uncompressed data.
ram_bytes is smaller than the page size, so the remainder of the page
(i.e. the bytes in the small hole) is uninitialized memory.  Each time the
extent is read into the page cache, userspace may see different contents.

Fix this by zeroing out the difference between the size of the
uncompressed inline extent and PAGE_CACHE_SIZE in uncompress_inline.
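
As a stand-alone illustration of that zero-fill (a user-space sketch with
made-up sizes; the real change is the kernel patch below):

/* inline_tail_zero.c - user-space sketch of why the tail of the page
 * past ram_bytes must be zeroed.  PAGE_SIZE and ram_bytes here are
 * illustrative values, not taken from a real filesystem. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

int main(void)
{
    size_t ram_bytes = 3000;                 /* uncompressed inline size   */
    unsigned char *page = malloc(PAGE_SIZE); /* stale, uninitialized bytes */
    if (!page)
        return 1;

    memset(page, 'A', ram_bytes);            /* "decompressed" inline data */

    /* Without this memset, bytes [ram_bytes, PAGE_SIZE) are whatever the
     * allocator handed back, so the hole would read as leaked memory.    */
    memset(page + ram_bytes, 0, PAGE_SIZE - ram_bytes);

    printf("byte at offset %zu = 0x%02X (must be 0)\n",
           ram_bytes, (unsigned)page[ram_bytes]);
    free(page);
    return 0;
}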

Only bytes within the hole are affected, so affected files can be read
correctly with a fixed kernel.  The corruption happens after IO and
checksum validation, so the corruption is never reported in dmesg or
counted in dev stats.

The bug is at least as old as 3.5.7 (the oldest kernel I can conveniently
test), and possibly much older.

The code may not be correct if the extent is larger than a page, so add
a WARN_ON for that case.

To reproduce the bug, run this on a 3072M kvm VM:

#!/bin/sh
# Use your favorite block device here
blk=/dev/vdc

# Create test filesystem and mount point
mkdir -p /try
mkfs.btrfs -dsingle -mdup -O ^extref,^skinny-metadata,^no-holes -f "$blk" || exit 1
mount -ocompress-force,flushoncommit,max_inline=8192,noatime "$blk" /try || exit 1

# Create a few inline extents in larger files.
# Multiple processes seem to be necessary.
y=/usr; for x in $(seq 10 19); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
y=/usr; for x in $(seq 20 29); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
y=/usr; for x in $(seq 30 39); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
y=/usr; for x in $(seq 40 49); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
wait

# Make a list of the files with inline extents
touch /try/list
find -type f -size +4097c -exec sh -c 'for x; do if filefrag -v "$x" | sed -n "4p" | grep -q "inline"; then echo "$x" >> list; fi; done' -- {} +

# Check the inline extents to see if they change as they are read multiple times
while read -r x; do
sum="$(sha1sum "$x")"
for y in $(seq 0 99); do
sysctl vm.drop_caches=1
sum2="$(sha1sum "$x")"
if [ "$sum" != "$sum2" ]; then
echo "Inconsistent reads from '$x'"
exit 1
fi
done
done < list

The reproducer may need to run up to 20 times before it finds a
corruption.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/inode.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..34f9c80 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6791,6 +6791,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, page,
   extent_offset, inline_size, max_size);
+   WARN_ON(max_size > PAGE_SIZE);
+   if (max_size < PAGE_SIZE) {
+   char *map = kmap(page);
+   memset(map + max_size, 0, PAGE_SIZE - max_size);
+   kunmap(page);
+   }
kfree(tmp);
return ret;
 }
-- 
2.1.4



Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Dan Mons
Ignoring the RAID56 bugs for a moment, if you have mismatched drives,
BtrFS RAID1 is a pretty good way of utilising available space and
having redundancy.

My home array is BtrFS with a cobbled-together collection of disks
ranging from 500GB to 3TB (and 5 of them, so it's not an even number).
I have a grand total of 8TB of linear space, and with BtrFS RAID1 I
can use exactly 50% of this (4TB) even with the weird combination of
disks.  That's something other RAID1 implementations can't do (they're
limited to the size of the smallest disk of any pair, and need an even
number of disks all up), and I get free compression and snapshotting,
so yay for that.
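
As a rough sketch of where that 50% figure comes from (assuming the usual
two-copies rule for btrfs RAID1; the disk sizes below are an example set that
adds up to 8TB, not necessarily the exact drives):

/* raid1_space.c - rough estimate of usable space for btrfs raid1
 * with mixed disk sizes: every block is stored on two different
 * devices, so usable is about min(total/2, total - largest_disk).
 * The sizes below are an example set, not the actual array.        */
#include <stdio.h>

int main(void)
{
    double disks_tb[] = { 0.5, 1.0, 1.5, 2.0, 3.0 };   /* 8 TB linear */
    int n = sizeof(disks_tb) / sizeof(disks_tb[0]);
    double total = 0, largest = 0;

    for (int i = 0; i < n; i++) {
        total += disks_tb[i];
        if (disks_tb[i] > largest)
            largest = disks_tb[i];
    }

    double usable = total / 2;
    if (total - largest < usable)       /* one disk dominates the pool */
        usable = total - largest;

    printf("total %.1f TB, usable with raid1 about %.1f TB\n", total, usable);
    return 0;
}

With no single disk larger than the rest combined, the answer is simply half the
linear space, which is the 4TB Dan describes.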

As drives die of natural old age, I replace them ad-hoc with bigger
drives (whatever is the sane price-point at the time).  A replace
followed by a rebalance later, and I'm back to using all available
space (which grows every time I throw a bigger drive in the mix),
which again is incredibly handy when you're a home user looking for
sane long-term storage that doesn't require complete rebuilds of your
array.

-Dan


Dan Mons - VFX Sysadmin
Cutting Edge
http://cuttingedge.com.au


On 12 October 2016 at 01:14, Philip Louis Moetteli
 wrote:
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!
>


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Qu Wenruo



At 10/12/2016 07:58 AM, Chris Murphy wrote:

https://btrfs.wiki.kernel.org/index.php/Status
Scrub + RAID56 Unstable will verify but not repair

This doesn't seem quite accurate. It does repair the vast majority of
the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad
data strip results in a.) fixed up data strip from parity b.) wrong
recomputation of replacement parity c.) good parity is overwritten
with bad, silently, d.) if parity reconstruction is needed in the
future e.g. device or sector failure, it results in EIO, a kind of
data loss.

Bad bug. For sure.

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.


Totally true.

Original RAID5/6 design is only to handle missing device, not rotted bits.



So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for problems we've seen with raid56. Not the parity
scrub bug which - yeah NOT good, not least of which is the data
integrity guarantees Btrfs is purported to make are substantially
negated by this bug. I think the bark is worse than the bite. It is
not the bark we'd like Btrfs to have though, for sure.



The current btrfs RAID5/6 scrub problem is that we don't make full use of the tree
and data checksums.


In an ideal situation, btrfs should detect which stripe is corrupted, and
only try to recover data/parity if the recovered data checksum matches.


For example, for a very traditional RAID5 layout like the following:

  Disk 1|   Disk 2|  Disk 3 |
-
  Data 1|   Data 2|  Parity |

Scrub should check data stripes 1 and 2 against their checksums first.

[All data extents have csum]
1) All csums match
   Good, then check parity.
   1.1) Parity matches
Nothing wrong at all

   1.2) Parity mismatches
Just recalculate parity. The corruption may be in unused data
space or in the parity. Either way, recalculating the parity is good
enough.

2) One data stripe csum mismatches (or is missing), parity mismatches too
   We only know one data stripe mismatches, not sure if the parity is OK.
   Try to recover that data stripe from parity, and recheck the csum.

   2.1) Recovered data stripe matches csum
That data stripe was corrupted and the parity is OK.
Recoverable.

   2.2) Recovered data stripe mismatches csum
Both that data stripe and the parity are corrupted.

3) Two data stripes csum mismatch, no matter whether parity matches or not
   At least 2 stripes are screwed up. No fix anyway.

[Some data extents have no csum (nodatasum)]
4) Existing csums (or no csum at all) match, parity matches
   Good, nothing to worry about

5) Existing csum mismatches for one data stripe, parity mismatches
   Like 2), try to recover that data stripe, and re-check the csum.

   5.1) Recovered data stripe matches csum
At least we can recover the data covered by csum.
Corrupted no-csum data is not our concern.

   5.2) Recovered data stripe mismatches csum
Screwed up

6) No csum at all, parity mismatch
   We are screwed up, just like traditional RAID5.

And I'm coding for the above cases in btrfs-progs to implement an 
off-line scrub tool.
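
A minimal sketch of the decision logic for case 2) above (a toy checksum and
tiny three-byte stripes for illustration; this is not the actual btrfs-progs
code):

/* Sketch of case 2): one data stripe fails its checksum and the parity
 * doesn't match either.  Rebuild the suspect stripe from parity plus the
 * good stripe, and trust the result only if it passes the checksum.  The
 * additive csum and 3-byte "stripes" are stand-ins for illustration.     */
#include <stdio.h>
#include <string.h>

#define LEN 3

static unsigned int csum(const unsigned char *s)       /* toy checksum */
{
    unsigned int c = 0;
    for (int i = 0; i < LEN; i++)
        c = c * 31 + s[i];
    return c;
}

int main(void)
{
    unsigned char d1[LEN] = { 1, 2, 3 };     /* good data stripe           */
    unsigned char d2[LEN] = { 4, 5, 6 };     /* original second stripe     */
    unsigned char p[LEN];
    for (int i = 0; i < LEN; i++)
        p[i] = d1[i] ^ d2[i];                /* consistent parity          */
    unsigned int d2_csum = csum(d2);         /* csum stored in csum tree   */

    d2[1] ^= 0xFF;                           /* bitrot corrupts d2 on disk */

    /* Case 2: d2 fails its csum, parity no longer matches.  Rebuild d2
     * from parity and d1, and accept it only if the csum now matches.    */
    unsigned char rebuilt[LEN];
    for (int i = 0; i < LEN; i++)
        rebuilt[i] = p[i] ^ d1[i];

    if (csum(rebuilt) == d2_csum) {
        memcpy(d2, rebuilt, LEN);            /* case 2.1: recoverable      */
        printf("data stripe repaired from parity\n");
    } else {
        printf("both data and parity corrupted, unrecoverable\n");  /* 2.2 */
    }
    return 0;
}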


Currently it looks good, and it can already handle cases 1) to 3).
And I tend to ignore any full stripe that lacks checksums and whose parity
mismatches.


But as you can see, there are so many things (csum exists, csum matches, parity
matches, missing devices) involved in btrfs RAID5 (RAID6 will be more
complex) that it's already much more complex than traditional RAID5/6 or the
current scrub implementation.



So what the current kernel scrub lacks is:
1) Detection of good/bad stripes
2) Recheck of recovery attempts

But that's also what traditional RAID5/6 lacks, unless there is some hidden
checksum, like btrfs has, that they can use.


Thanks,
Qu




Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Chris Murphy
https://btrfs.wiki.kernel.org/index.php/Status
Scrub + RAID56 Unstable will verify but not repair

This doesn't seem quite accurate. It does repair the vast majority of
the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad
data strip results in a.) fixed up data strip from parity b.) wrong
recomputation of replacement parity c.) good parity is overwritten
with bad, silently, d.) if parity reconstruction is needed in the
future e.g. device or sector failure, it results in EIO, a kind of
data loss.

Bad bug. For sure.

But consider the identical scenario with md or LVM raid5, or any
conventional hardware raid5. A scrub check simply reports a mismatch.
It's unknown whether data or parity is bad, so the bad data strip is
propagated upward to user space without error. On a scrub repair, the
data strip is assumed to be good, and good parity is overwritten with
bad.
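
For illustration, a toy sketch of that checksum-less repair path (not md or
btrfs code; single-byte strips and invented values):

/* md_scrub_repair.c - toy model of a checksum-less raid5 "repair".
 * One data byte rots; scrub only sees that parity no longer matches,
 * assumes the data is right, and rewrites the (previously good) parity
 * from the bad data, so the corruption becomes permanent. */
#include <stdio.h>

int main(void)
{
    unsigned char d0 = 0x12, d1 = 0x34;   /* data strips                 */
    unsigned char p  = d0 ^ d1;           /* parity, currently correct   */

    unsigned char good_d0 = d0;
    d0 ^= 0x40;                           /* bitrot flips a bit in d0    */

    /* scrub check: mismatch detected, but with no checksum there is no
     * way to tell which strip is wrong; scrub repair trusts the data
     * and recomputes parity.                                            */
    if ((d0 ^ d1) != p)
        p = d0 ^ d1;                      /* good parity overwritten     */

    /* later the disk holding d0 fails; reconstruction just returns the
     * rotted value, silently.                                           */
    unsigned char rebuilt_d0 = p ^ d1;
    printf("original 0x%02X, after 'repair' and rebuild 0x%02X\n",
           (unsigned)good_d0, (unsigned)rebuilt_d0);
    return 0;
}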

So while I agree in total that Btrfs raid56 isn't mature or tested
enough to consider it production ready, I think that's because of the
UNKNOWN causes for problems we've seen with raid56. Not the parity
scrub bug which - yeah NOT good, not least of which is the data
integrity guarantees Btrfs is purported to make are substantially
negated by this bug. I think the bark is worse than the bite. It is
not the bark we'd like Btrfs to have though, for sure.


-- 
Chris Murphy


Re: raid levels and NAS drives

2016-10-11 Thread Nicholas D Steeves
On Mon, Oct 10, 2016 at 08:07:53AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-10-09 19:12, Charles Zeitler wrote:
> >Is there any advantage to using NAS drives
> >under RAID levels, as opposed to regular
> >'desktop' drives for BTRFS?
[...]
> So, as for what you should use in a RAID array, here's my specific advice:
> 1. Don't worry about enterprise drives unless you've already got a system
> that has them.  They're insanely overpriced for relatively minimal benefit
> when compared to NAS drives.
> 2. If you can afford NAS drives, use them, they'll get you the best
> combination of energy efficiency, performance, and error recovery.
> 3. If you can't get NAS drives, most desktop drives work fine, but you will
> want to bump up the scsi_command_timer attribute in the kernel for them (200
> seconds is reasonable, just make sure you have good cables and a good
> storage controller).
> 4. Avoid WD Green drives, without special effort, they will get worse
> performance and have shorter lifetimes than any other hard disk I've ever
> seen.
> 5. Generally avoid drives with a capacity over 1TB from manufacturers other
> than WD, HGST, and Seagate, most of them are not particularly good quality
> (especially if it's an odd non-power-of-two size like 5TB).

+1 !  Additionally, is it still the case that it is generally safer to
buy the largest capacity disks offered by the previous generation of
technology rather than the current largest capacity?  eg: right now
that would be 4TB or 6TB, and not 8TB or 10TB.

Cheers,
Nicholas


Re: raid6 file system in a bad state

2016-10-11 Thread Chris Murphy
readding btrfs

On Tue, Oct 11, 2016 at 1:00 PM, Jason D. Michaelson
 wrote:
>
>
>> -Original Message-
>> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
>> Behalf Of Chris Murphy
>> Sent: Tuesday, October 11, 2016 12:41 PM
>> To: Jason D. Michaelson
>> Cc: Chris Murphy; Btrfs BTRFS
>> Subject: Re: raid6 file system in a bad state
>>
>> On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson
>>  wrote:
>> > superblock: bytenr=65536, device=/dev/sda
>> > -
>> > generation  161562
>> > root5752616386560
>>
>>
>>
>> > superblock: bytenr=65536, device=/dev/sdh
>> > -
>> > generation  161474
>> > root4844272943104
>>
>> OK so most obvious is that the bad super is many generations back than
>> the good super. That's expected given all the write errors.
>>
>>
>
> Is there any chance/way of going back to use this generation/root as a source 
> for btrfs restore?

Yes with -t option and that root bytenr for the generation you want to
restore. Thing is, that's so far back the metadata may be gone
(overwritten) already. But worth a shot. I've recovered recently
deleted files this way.


OK at this point I'm thinking that fixing the super blocks won't
change anything because it sounds like it's using the new ones anyway
and maybe the thing to try is going back to a tree root that isn't in
any of the new supers. That means losing anything that was being
written when the lost writes happened. However, for all we know some
overwrites happened so this won't work. And also it does nothing to
deal with the fragile state of having at least two flaky devices, and
one of the system chunks with no redundancy.


Try 'btrfs check' without repair. And then also try it with -r flag
using the various tree roots we've seen so far. Try explicitly using
5752616386560, which is what it ought to use first anyway. And then
also 4844272943104.

That might go far enough back before the bad sectors were a factor.
Normally what you'd want is for it to use one of the backup roots, but
it's consistently running into a problem with all of them when using
recovery mount option.





-- 
Chris Murphy


Re: raid6 file system in a bad state

2016-10-11 Thread Chris Murphy
On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson
 wrote:
> superblock: bytenr=65536, device=/dev/sda
> -
> generation  161562
> root5752616386560



> superblock: bytenr=65536, device=/dev/sdh
> -
> generation  161474
> root4844272943104

OK so the most obvious thing is that the bad super is many generations behind
the good super. That's expected given all the write errors.


>root@castor:~/logs# btrfs-find-root /dev/sda
>parent transid verify failed on 5752357961728 wanted 161562 found 159746
>parent transid verify failed on 5752357961728 wanted 161562 found 159746
>Couldn't setup extent tree
>Superblock thinks the generation is 161562
>Superblock thinks the level is 1


This squares with the good super. So btrfs-find-root is using a good
super. I don't know what 5752357961728 is for, maybe it's possible to
read that with btrfs-debug-tree -b 5752357961728  and see what
comes back. This is not the tree root according to the super though.
So what do you get for btrfs-debug-tree -b 5752616386560 

Going back to your logs


[   38.810575] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state
recovery directory
[   38.810595] NFSD: starting 90-second grace period (net b12e5b80)
[  241.292816] INFO: task bfad_worker:234 blocked for more than 120 seconds.
[  241.299135]   Not tainted 4.7.0-0.bpo.1-amd64 #1
[  241.305645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

I don't know what this kernel is. I think you'd be better off with
stable 4.7.7 or 4.8.1 for this work, so you're not running into a
bunch of weird blocked task problems in addition to whatever is going
on with the fs.


[   20.552205] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 3 transid 161562 /dev/sdd
[   20.552372] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 5 transid 161562 /dev/sdf
[   20.552524] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 6 transid 161562 /dev/sde
[   20.552689] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 4 transid 161562 /dev/sdg
[   20.552858] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
devid 1 transid 161562 /dev/sda
[  669.843166] BTRFS warning (device sda): devid 2 uuid
dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c is missing
[232572.871243] sd 0:0:8:0: [sdh] tag#4 Sense Key : Medium Error [current]


Two items missing, in effect, for this failed read. One literally
missing, and the other one missing due to unrecoverable read error.
The fact it's not trying to fix anything suggests it hasn't really
finished mounting, there must be something wrong where it either just
gets confused and won't fix (because it might make things worse) or
there isn't redundancy.


[52799.495999] mce: [Hardware Error]: Machine check events logged
[53249.491975] mce: [Hardware Error]: Machine check events logged
[231298.005594] mce: [Hardware Error]: Machine check events logged

Bunch of other hardware issues...

I *really* think you need to get the hardware issues sorted out before
working on this file system unless you just don't care that much about
it. There are already enough unknowns without contributing who knows
what effect the hardware issues are having while trying to repair
things. Or even understand what's going on.



> sys_chunk_array[2048]:
> item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
> chunk length 4194304 owner 2 stripe_len 65536
> type SYSTEM num_stripes 1
> stripe 0 devid 1 offset 0
> dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
> item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520)
> chunk length 11010048 owner 2 stripe_len 65536
> type SYSTEM|RAID6 num_stripes 6
> stripe 0 devid 6 offset 1048576
> dev uuid: 390a1fd8-cc6c-40e7-b0b5-88ca7dcbcc32
> stripe 1 devid 5 offset 1048576
> dev uuid: 2df974c5-9dde-4062-81e9-c613db62
> stripe 2 devid 4 offset 1048576
> dev uuid: dce3d159-721d-4859-9955-37a03769bb0d
> stripe 3 devid 

Re: RAID system with adaption to changed number of disks

2016-10-11 Thread ronnie sahlberg
On Tue, Oct 11, 2016 at 8:14 AM, Philip Louis Moetteli
 wrote:
>
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:


You should under no circumstances use RAID5/6 for anything other than
test and throw-away data.
It has several known issues that will eat your data. Total data loss
is a real possibility.

(the capability to even create raid5/6 filesystems should imho be
removed from btrfs until this changes.)

>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Tomasz Kusmierz
I think you just described all the benefits of btrfs in that type of
configuration. Unfortunately, after btrfs RAID 5 & 6 was marked as
OK, it got marked as "it will eat your data" (and there is a ton of
people in random places popping up with raid 5 & 6 that just killed
their data).

On 11 October 2016 at 16:14, Philip Louis Moetteli
 wrote:
> Hello,
>
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
>
> Thanks a lot for your help!
>


Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Hugo Mills
On Tue, Oct 11, 2016 at 03:14:30PM +, Philip Louis Moetteli wrote:
> Hello,
> 
> 
> I have to build a RAID 6 with the following 3 requirements:
> 
>   • Use different kinds of disks with different sizes.
>   • When a disk fails and there's enough space, the RAID should be able 
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in 
> a non-degraded (!) RAID with 7 disks.
>   • Also the other way round: If I add a disk of what size ever, it 
> should redistribute the data, so that it becomes a RAID with 9 disks.
> 
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
> 
> Is BTrFS capable of doing that?

1) Take a look at http://carfax.org.uk/btrfs-usage/ which will tell
   you how much space you can get out of a btrfs array with different
   sized devices.

2) Btrfs's parity RAID implementation is not in good shape right
   now. It has known data corruption issues, and should not be used in
   production.

3) The redistribution of space is something that btrfs can do. It
   needs to be triggered manually at the moment, but it definitely
   works.

   Hugo.

-- 
Hugo Mills | We are all lying in the gutter, but some of us are
hugo@... carfax.org.uk | looking at the stars.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Oscar Wilde




Re: RAID system with adaption to changed number of disks

2016-10-11 Thread Austin S. Hemmelgarn

On 2016-10-11 11:14, Philip Louis Moetteli wrote:

Hello,


I have to build a RAID 6 with the following 3 requirements:

• Use different kinds of disks with different sizes.
• When a disk fails and there's enough space, the RAID should be able 
to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
RAID with 8 disks and 1 fails, I should be able to chose to transform this in a 
non-degraded (!) RAID with 7 disks.
• Also the other way round: If I add a disk of what size ever, it 
should redistribute the data, so that it becomes a RAID with 9 disks.

I don’t care, if I have to do it manually.
I don’t care so much about speed either.

Is BTrFS capable of doing that?
In theory yes.  In practice, BTRFS RAID5/6 mode should not be used in 
production due to a number of known serious issues relating to 
rebuilding and reshaping arrays.




Re: btrfs bio linked list corruption.

2016-10-11 Thread Chris Mason


On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
> 
> [ cut here ]
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (e8806648), but was 
> c967fcd8. (prev=880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
>  c9d87458 8d32007c c9d874a8 
>  c9d87498 8d07a6c1 00210246 88050388e880
>  880503878b80 e8806648 e8c06600 880502808008
> Call Trace:
> [] dump_stack+0x4f/0x73
> [] __warn+0xc1/0xe0
> [] warn_slowpath_fmt+0x5a/0x80
> [] __list_add+0x89/0xb0
> [] blk_sq_make_request+0x2f8/0x350

   /*  
 * A task plug currently exists. Since this is completely lockless, 
 * utilize that to temporarily store requests until the task is 
 * either done or scheduled away.   
 */ 
plug = current->plug;   
if (plug) { 
blk_mq_bio_to_request(rq, bio); 
if (!request_count) 
trace_block_plug(q);

blk_mq_put_ctx(data.ctx);   

if (request_count >= BLK_MAX_REQUEST_COUNT) {   
blk_flush_plug_list(plug, false);   
trace_block_plug(q);
}   

list_add_tail(&rq->queuelist, &plug->mq_list);
^^

Dave, is this where we're crashing?  This seems strange.

-chris


Re: btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 > 
 > 
 > On 10/11/2016 10:45 AM, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 > > 
 > > [ cut here ]
 > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 > > list_add corruption. prev->next should be next (e8806648), but was 
 > > c967fcd8. (prev=880503878b80).
 > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
 > >  c9d87458 8d32007c c9d874a8 
 > >  c9d87498 8d07a6c1 00210246 88050388e880
 > >  880503878b80 e8806648 e8c06600 880502808008
 > > Call Trace:
 > > [] dump_stack+0x4f/0x73
 > > [] __warn+0xc1/0xe0
 > > [] warn_slowpath_fmt+0x5a/0x80
 > > [] __list_add+0x89/0xb0
 > > [] blk_sq_make_request+0x2f8/0x350
 > 
 > /*
 >  * A task plug currently exists. Since this is completely lockless,
 >  * utilize that to temporarily store requests until the task is
 >  * either done or scheduled away.
 >  */
 > plug = current->plug;
 > if (plug) {
 > blk_mq_bio_to_request(rq, bio);
 > if (!request_count)
 > trace_block_plug(q);
 >
 > blk_mq_put_ctx(data.ctx);
 >
 > if (request_count >= BLK_MAX_REQUEST_COUNT) {
 > blk_flush_plug_list(plug, false);
 > trace_block_plug(q);
 > }
 >
 > list_add_tail(&rq->queuelist, &plug->mq_list);
 > ^^
 > 
 > Dave, is this where we're crashing?  This seems strange.

According to objdump -S ..


8130a1b7:   48 8b 70 50 mov    0x50(%rax),%rsi
list_add_tail(&rq->queuelist, &ctx->rq_list);
8130a1bb:   48 8d 50 48 lea    0x48(%rax),%rdx
8130a1bf:   48 89 45 a8 mov    %rax,-0x58(%rbp)
8130a1c3:   e8 38 44 03 00  callq  8133e600 <__list_add>
blk_mq_hctx_mark_pending(hctx, ctx);
8130a1c8:   48 8b 45 a8 mov-0x58(%rbp),%rax
8130a1cc:   4c 89 ffmov%r15,%rdi

That looks like the list_add_tail from __blk_mq_insert_req_list

Dave


RE: raid6 file system in a bad state

2016-10-11 Thread Jason D. Michaelson
> 
> 
> Bad superblocks can't be a good thing and would only cause confusion.
> I'd think that a known bad superblock would be ignored at mount time
> and even by btrfs-find-root, or maybe even replaced like any other kind
> of known bad metadata where good copies are available.
> 
> btrfs-show-super -f /dev/sda
> btrfs-show-super -f /dev/sdh
> 
> 
> Find out what the difference is between good and bad supers.
> 
root@castor:~# btrfs-show-super -f /dev/sda
superblock: bytenr=65536, device=/dev/sda
-
csum_type   0 (crc32c)
csum_size   4
csum0x45278835 [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid73ed01df-fb2a-4b27-b6fc-12a57da934bd
label
generation  161562
root5752616386560
sys_array_size  354
chunk_root_generation   156893
root_level  1
chunk_root  20971520
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 18003557892096
bytes_used  7107627130880
sectorsize  4096
nodesize16384
leafsize16384
stripesize  4096
root_dir6
num_devices 6
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0xe1
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  RAID56 )
cache_generation161562
uuid_tree_generation161562
dev_item.uuid   08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
dev_item.fsid   73ed01df-fb2a-4b27-b6fc-12a57da934bd [match]
dev_item.type   0
dev_item.total_bytes3000592982016
dev_item.bytes_used 1800957198336
dev_item.io_align   4096
dev_item.io_width   4096
dev_item.sector_size4096
dev_item.devid  1
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
chunk length 4194304 owner 2 stripe_len 65536
type SYSTEM num_stripes 1
stripe 0 devid 1 offset 0
dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520)
chunk length 11010048 owner 2 stripe_len 65536
type SYSTEM|RAID6 num_stripes 6
stripe 0 devid 6 offset 1048576
dev uuid: 390a1fd8-cc6c-40e7-b0b5-88ca7dcbcc32
stripe 1 devid 5 offset 1048576
dev uuid: 2df974c5-9dde-4062-81e9-c613db62
stripe 2 devid 4 offset 1048576
dev uuid: dce3d159-721d-4859-9955-37a03769bb0d
stripe 3 devid 3 offset 1048576
dev uuid: 6f7142db-824c-4791-a5b2-d6ce11c81c8f
stripe 4 devid 2 offset 1048576
dev uuid: dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c
stripe 5 devid 1 offset 20971520
dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
backup_roots[4]:
backup 0:
backup_tree_root:   5752437456896   gen: 161561 level: 1
backup_chunk_root:  20971520gen: 156893 level: 1
backup_extent_root: 5752385224704   gen: 161561 level: 2
backup_fs_root: 124387328   gen: 74008  level: 0
backup_dev_root:5752437587968   gen: 161561 level: 1
backup_csum_root:   5752389615616   gen: 161561 level: 3
backup_total_bytes: 18003557892096
backup_bytes_used:  7112579833856
backup_num_devices: 6

backup 1:
backup_tree_root:   5752616386560   gen: 161562 level: 1
backup_chunk_root:  20971520gen: 156893 level: 1
backup_extent_root: 5752649416704   gen: 161563 level: 2
backup_fs_root: 124387328   gen: 74008  level: 0
backup_dev_root:5752616501248   gen: 161562 level: 1
backup_csum_root:   5752650203136   gen: 161563 level: 3
backup_total_bytes: 18003557892096
backup_bytes_used:  7107602407424
backup_num_devices: 6

backup 2:
backup_tree_root:   5752112103424   gen: 161559 level: 1
backup_chunk_root:  20971520gen: 156893 level: 1
backup_extent_root: 5752207409152   gen: 161560 level: 2

Re: raid6 file system in a bad state

2016-10-11 Thread Chris Murphy
On Tue, Oct 11, 2016 at 9:52 AM, Jason D. Michaelson
 wrote:

>> btrfs rescue super-recover -v 
>
> root@castor:~/logs# btrfs rescue super-recover -v /dev/sda
> All Devices:
> Device: id = 2, name = /dev/sdh
> Device: id = 3, name = /dev/sdd
> Device: id = 5, name = /dev/sdf
> Device: id = 6, name = /dev/sde
> Device: id = 4, name = /dev/sdg
> Device: id = 1, name = /dev/sda
>
> Before Recovering:
> [All good supers]:
> device name = /dev/sdd
> superblock bytenr = 65536
>
> device name = /dev/sdd
> superblock bytenr = 67108864
>
> device name = /dev/sdd
> superblock bytenr = 274877906944
>
> device name = /dev/sdf
> superblock bytenr = 65536
>
> device name = /dev/sdf
> superblock bytenr = 67108864
>
> device name = /dev/sdf
> superblock bytenr = 274877906944
>
> device name = /dev/sde
> superblock bytenr = 65536
>
> device name = /dev/sde
> superblock bytenr = 67108864
>
> device name = /dev/sde
> superblock bytenr = 274877906944
>
> device name = /dev/sdg
> superblock bytenr = 65536
>
> device name = /dev/sdg
> superblock bytenr = 67108864
>
> device name = /dev/sdg
> superblock bytenr = 274877906944
>
> device name = /dev/sda
> superblock bytenr = 65536
>
> device name = /dev/sda
> superblock bytenr = 67108864
>
> device name = /dev/sda
> superblock bytenr = 274877906944
>
> [All bad supers]:
> device name = /dev/sdh
> superblock bytenr = 65536
>
> device name = /dev/sdh
> superblock bytenr = 67108864
>
> device name = /dev/sdh
> superblock bytenr = 274877906944
>
>
> Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are 
> you sure? [y/N]: n
> Aborted to recover bad superblocks
>
> I aborted this waiting for instructions on whether to proceed from the list.


Bad superblocks can't be a good thing and would only cause confusion.
I'd think that a known bad superblock would be ignored at mount time
and even by btrfs-find-root, or maybe even replaced like any other
kind of known bad metadata where good copies are available.

btrfs-show-super -f /dev/sda
btrfs-show-super -f /dev/sdh


Find out what the difference is between good and bad supers.


-- 
Chris Murphy


RAID system with adaption to changed number of disks

2016-10-11 Thread Philip Louis Moetteli
Hello,


I have to build a RAID 6 with the following 3 requirements:

• Use different kinds of disks with different sizes.
• When a disk fails and there's enough space, the RAID should be able 
to reconstruct itself out of the degraded state. Meaning, if I have e. g. a 
RAID with 8 disks and 1 fails, I should be able to chose to transform this in a 
non-degraded (!) RAID with 7 disks.
• Also the other way round: If I add a disk of what size ever, it 
should redistribute the data, so that it becomes a RAID with 9 disks.

I don’t care, if I have to do it manually.
I don’t care so much about speed either.

Is BTrFS capable of doing that?


Thanks a lot for your help!



RE: raid6 file system in a bad state

2016-10-11 Thread Jason D. Michaelson

> -Original Message-
> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
> Behalf Of Chris Murphy
> Sent: Monday, October 10, 2016 11:23 PM
> To: Jason D. Michaelson
> Cc: Chris Murphy; Btrfs BTRFS
> Subject: Re: raid6 file system in a bad state
> 
> What do you get for
> 
> btrfs-find-root 

root@castor:~/logs# btrfs-find-root /dev/sda
parent transid verify failed on 5752357961728 wanted 161562 found 159746
parent transid verify failed on 5752357961728 wanted 161562 found 159746
Couldn't setup extent tree
Superblock thinks the generation is 161562
Superblock thinks the level is 1

There's no further output, and btrfs-find-root is pegged at 100%.

At the moment, the perceived bad disk is connected. I received the same results
without it as well.

> btrfs rescue super-recover -v 

root@castor:~/logs# btrfs rescue super-recover -v /dev/sda
All Devices:
Device: id = 2, name = /dev/sdh
Device: id = 3, name = /dev/sdd
Device: id = 5, name = /dev/sdf
Device: id = 6, name = /dev/sde
Device: id = 4, name = /dev/sdg
Device: id = 1, name = /dev/sda

Before Recovering:
[All good supers]:
device name = /dev/sdd
superblock bytenr = 65536

device name = /dev/sdd
superblock bytenr = 67108864

device name = /dev/sdd
superblock bytenr = 274877906944

device name = /dev/sdf
superblock bytenr = 65536

device name = /dev/sdf
superblock bytenr = 67108864

device name = /dev/sdf
superblock bytenr = 274877906944

device name = /dev/sde
superblock bytenr = 65536

device name = /dev/sde
superblock bytenr = 67108864

device name = /dev/sde
superblock bytenr = 274877906944

device name = /dev/sdg
superblock bytenr = 65536

device name = /dev/sdg
superblock bytenr = 67108864

device name = /dev/sdg
superblock bytenr = 274877906944

device name = /dev/sda
superblock bytenr = 65536

device name = /dev/sda
superblock bytenr = 67108864

device name = /dev/sda
superblock bytenr = 274877906944

[All bad supers]:
device name = /dev/sdh
superblock bytenr = 65536

device name = /dev/sdh
superblock bytenr = 67108864

device name = /dev/sdh
superblock bytenr = 274877906944


Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are 
you sure? [y/N]: n
Aborted to recover bad superblocks

I aborted this waiting for instructions on whether to proceed from the list.

>   
> 
> 
> It shouldn't matter which dev you pick, unless it face plants, then try
> another.



Re: [RFC] btrfs: make max inline data can be equal to sectorsize

2016-10-11 Thread Chris Murphy
On Tue, Oct 11, 2016 at 12:47 AM, Wang Xiaoguang
 wrote:
> If we use mount option "-o max_inline=sectorsize", say 4096, indeed
> even for a fresh fs, say nodesize is 16k, we can not make the first
> 4k data completely inline, I found this conditon causing this issue:
>   !compressed_size && (actual_end & (root->sectorsize - 1)) == 0
>
> If it retuns true, we'll not make data inline. For 4k sectorsize,
> 0~4094 dara range, we can make it inline, but 0~4095, it can not.
> I don't think this limition is useful, so here remove it which will
> make max inline data can be equal to sectorsize.
>
> Signed-off-by: Wang Xiaoguang 
> ---
>  fs/btrfs/inode.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index ea15520..c0db393 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
> if (start > 0 ||
> actual_end > root->sectorsize ||
> data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
> -   (!compressed_size &&
> -   (actual_end & (root->sectorsize - 1)) == 0) ||
> end + 1 < isize ||
> data_len > root->fs_info->max_inline) {
> return 1;
> --
> 2.9.0


Before making any further changes to inline data, does it make sense
to find the source of corruption Zygo has been experiencing? That's in
the "btrfs rare silent data corruption with kernel data leak" thread.


-- 
Chris Murphy


Re: btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
On Tue, Oct 11, 2016 at 11:20:41AM -0400, Chris Mason wrote:
 > 
 > 
 > On 10/11/2016 11:19 AM, Dave Jones wrote:
 > > On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
 > >  > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
 > >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 > >  >
 > >  > Those iovec fixups are in the current tree...
 > >
 > > ah yeah, git quietly dropped my local copy when I rebased so I didn't 
 > > notice.
 > >
 > >  > TBH, I don't see anything
 > >  > in splice-related stuff that could come anywhere near that (short of
 > >  > some general memory corruption having random effects of that sort).
 > >  >
 > >  > Could you try to bisect that sucker, or is it too hard to reproduce?
 > >
 > > Only hit it the once overnight so far. Will see if I can find a better way 
 > > to
 > > reproduce today.
 > 
 > This call trace is reading metadata so we can finish the truncate.  I'd 
 > say adding more memory pressure would make it happen more often.

That story checks out. There were a bunch of oom's in the log before this.

Dave



[GIT PULL] Btrfs

2016-10-11 Thread Chris Mason
Hi Linus,

My for-linus-4.9 has our merge window pull:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.9

This is later than normal because I was tracking down a use-after-free
during btrfs/101 in xfstests.  I had hoped to fix up the offending
patch, but wasn't happy with the size of the changes at this point in
the merge window.

The use-after-free was enough of a corner case that I didn't want to
rebase things out at this point.  So instead the top of the pull is my
revert, and the rest of these were prepped by Dave Sterba (thanks Dave!).  

This is a big variety of fixes and cleanups.  Liu Bo continues to fixup
fuzzer related problems, and some of Josef's cleanups are prep for his
bigger extent buffer changes (slated for v4.10).

Liu Bo (13) commits (+207/-36):
Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf (+5/-1)
Btrfs: return gracefully from balance if fs tree is corrupted (+17/-6)
Btrfs: improve check_node to avoid reading corrupted nodes (+28/-4)
Btrfs: add error handling for extent buffer in print tree (+7/-0)
Btrfs: memset to avoid stale content in btree node block (+11/-0)
Btrfs: bail out if block group has different mixed flag (+14/-0)
Btrfs: memset to avoid stale content in btree leaf (+28/-19)
Btrfs: fix memory leak in reading btree blocks (+9/-0)
Btrfs: fix memory leak of block group cache (+75/-0)
Btrfs: kill BUG_ON in run_delayed_tree_ref (+7/-1)
Btrfs: remove BUG_ON in start_transaction (+1/-4)
Btrfs: fix memory leak in do_walk_down (+1/-0)
Btrfs: remove BUG() in raid56 (+4/-1)

Jeff Mahoney (7) commits (+849/-902):
btrfs: btrfs_debug should consume fs_info when DEBUG is not defined (+10/-4)
btrfs: clean the old superblocks before freeing the device (+11/-27)
btrfs: convert send's verbose_printk to btrfs_debug (+38/-27)
btrfs: convert printk(KERN_* to use pr_* calls (+205/-275)
btrfs: convert pr_* to btrfs_* where possible (+231/-177)
btrfs: unsplit printed strings (+324/-391)
btrfs: add dynamic debug support (+30/-1)

Josef Bacik (5) commits (+178/-156):
Btrfs: kill the start argument to read_extent_buffer_pages (+15/-28)
Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written (+33/-8)
Btrfs: add a flags field to btrfs_fs_info (+99/-109)
Btrfs: don't leak reloc root nodes on error (+4/-0)
Btrfs: don't BUG() during drop snapshot (+27/-11)

Goldwyn Rodrigues (3) commits (+3/-18):
btrfs: Do not reassign count in btrfs_run_delayed_refs (+0/-1)
btrfs: Remove already completed TODO comment (+0/-2)
btrfs: parent_start initialization cleanup (+3/-15)

Luis Henriques (2) commits (+0/-4):
btrfs: Fix warning "variable ‘blocksize’ set but not used" (+0/-2)
btrfs: Fix warning "variable ‘gen’ set but not used" (+0/-2)

Eric Sandeen (1) commits (+1/-1):
btrfs: fix perms on demonstration debugfs interface

Anand Jain (1) commits (+20/-6):
btrfs: fix a possible umount deadlock

Lu Fengqi (1) commits (+369/-10):
btrfs: fix check_shared for fiemap ioctl

Chris Mason (1) commits (+15/-11):
Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"

Masahiro Yamada (1) commits (+8/-28):
btrfs: squash lines for simple wrapper functions

Qu Wenruo (1) commits (+37/-25):
btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band 
dedupe and subpage size patchset

Arnd Bergmann (1) commits (+7/-10):
btrfs: fix btrfs_no_printk stub helper

David Sterba (1) commits (+9/-0):
btrfs: create example debugfs file only in debugging build

Naohiro Aota (1) commits (+11/-15):
btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs

Total: (39) commits (+1714/-1222)

 fs/btrfs/backref.c| 409 ++
 fs/btrfs/btrfs_inode.h|  11 --
 fs/btrfs/check-integrity.c| 342 +++
 fs/btrfs/compression.c|   6 +-
 fs/btrfs/ctree.c  |  56 ++
 fs/btrfs/ctree.h  | 116 
 fs/btrfs/delayed-inode.c  |  25 ++-
 fs/btrfs/delayed-ref.c|  15 +-
 fs/btrfs/dev-replace.c|  21 ++-
 fs/btrfs/dir-item.c   |   7 +-
 fs/btrfs/disk-io.c| 237 
 fs/btrfs/disk-io.h|   2 +
 fs/btrfs/extent-tree.c| 198 +++-
 fs/btrfs/extent_io.c  | 170 +++---
 fs/btrfs/extent_io.h  |   4 +-
 fs/btrfs/file.c   |  43 -
 fs/btrfs/free-space-cache.c   |  21 ++-
 fs/btrfs/free-space-cache.h   |   6 +-
 fs/btrfs/free-space-tree.c|  20 ++-
 fs/btrfs/inode-map.c  |  31 ++--
 fs/btrfs/inode.c  |  70 +---
 fs/btrfs/ioctl.c  |  14 +-
 fs/btrfs/lzo.c|   6 +-
 fs/btrfs/ordered-data.c   |   4 +-
 fs/btrfs/print-tree.c |  93 +-
 fs/btrfs/qgroup.c |  77 
 fs/btrfs/raid56.c |   5 +-

btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
This is from Linus' current tree, with Al's iovec fixups on top.

[ cut here ]
WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (e8806648), but was 
c967fcd8. (prev=880503878b80).
CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 
 c9d87458 8d32007c c9d874a8 
 c9d87498 8d07a6c1 00210246 88050388e880
 880503878b80 e8806648 e8c06600 880502808008
Call Trace:
[] dump_stack+0x4f/0x73
[] __warn+0xc1/0xe0
[] warn_slowpath_fmt+0x5a/0x80
[] __list_add+0x89/0xb0
[] blk_sq_make_request+0x2f8/0x350
[] ? generic_make_request+0xec/0x240
[] generic_make_request+0xf9/0x240
[] submit_bio+0x78/0x150
[] ? __percpu_counter_add+0x85/0xb0
[] btrfs_map_bio+0x19e/0x330 [btrfs]
[] btree_submit_bio_hook+0xfa/0x110 [btrfs]
[] submit_one_bio+0x65/0xa0 [btrfs]
[] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs]
[] ? free_root_pointers+0x60/0x60 [btrfs]
[] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 
[btrfs]
[] read_tree_block+0x2d/0x50 [btrfs]
[] read_block_for_search.isra.33+0x134/0x330 [btrfs]
[] ? _raw_write_unlock+0x2c/0x50
[] ? unlock_up+0x16c/0x1a0 [btrfs]
[] btrfs_search_slot+0x450/0xa40 [btrfs]
[] btrfs_del_csums+0xe3/0x2e0 [btrfs]
[] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs]
[] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs]
[] ? debug_smp_processor_id+0x17/0x20
[] ? get_lock_stats+0x19/0x50
[] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs]
[] btrfs_truncate_inode_items+0x888/0xda0 [btrfs]
[] btrfs_truncate+0xe5/0x2b0 [btrfs]
[] btrfs_setattr+0x249/0x360 [btrfs]
[] notify_change+0x252/0x440
[] do_truncate+0x6e/0xc0
[] do_sys_ftruncate.constprop.19+0x10c/0x170
[] ? __this_cpu_preempt_check+0x13/0x20
[] SyS_ftruncate+0x9/0x10
[] do_syscall_64+0x5c/0x170
[] entry_SYSCALL64_slow_path+0x25/0x25
--[ end trace 906673a2f703b373 ]---
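
For readers unfamiliar with this warning, the check that fired lives in
lib/list_debug.c and complains when the list links around the insertion point are
already inconsistent. A small user-space sketch modeled on that check (simplified,
an illustration only, not the kernel source):

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* simplified model of the prev->next consistency check in __list_add() */
static void check_list_add(struct list_head *prev, struct list_head *next)
{
	if (prev->next != next)
		printf("list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
		       (void *)next, (void *)prev->next, (void *)prev);
}

int main(void)
{
	struct list_head a, b, stray;

	/* a <-> b is a consistent two-node ring */
	a.next = &b; a.prev = &b;
	b.next = &a; b.prev = &a;
	check_list_add(&a, &b);		/* silent: links are consistent */

	a.next = &stray;		/* simulate the corruption */
	check_list_add(&a, &b);		/* prints the warning text seen above */
	return 0;
}

In the trace above the check fired from blk_sq_make_request(), i.e. on a
block-layer list, while btrfs was reading metadata for the truncate.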



Re: btrfs bio linked list corruption.

2016-10-11 Thread Chris Mason



On 10/11/2016 11:19 AM, Dave Jones wrote:

On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
 > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 >
 > Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

 > TBH, I don't see anything
 > in splice-related stuff that could come anywhere near that (short of
 > some general memory corruption having random effects of that sort).
 >
 > Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to
reproduce today.


This call trace is reading metadata so we can finish the truncate.  I'd 
say adding more memory pressure would make it happen more often.


I'll try to trigger.

-chris



Re: btrfs bio linked list corruption.

2016-10-11 Thread Dave Jones
On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote:
 > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
 > > This is from Linus' current tree, with Al's iovec fixups on top.
 > 
 > Those iovec fixups are in the current tree...

ah yeah, git quietly dropped my local copy when I rebased so I didn't notice.

 > TBH, I don't see anything
 > in splice-related stuff that could come anywhere near that (short of
 > some general memory corruption having random effects of that sort).
 > 
 > Could you try to bisect that sucker, or is it too hard to reproduce?

Only hit it the once overnight so far. Will see if I can find a better way to
reproduce today.

Dave



Re: btrfs bio linked list corruption.

2016-10-11 Thread Al Viro
On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.

Those iovec fixups are in the current tree...  TBH, I don't see anything
in splice-related stuff that could come anywhere near that (short of
some general memory corruption having random effects of that sort).

Could you try to bisect that sucker, or is it too hard to reproduce?


Re: [PATCH v2] Btrfs: kill BUG_ON in do_relocation

2016-10-11 Thread David Sterba
On Fri, Sep 23, 2016 at 02:05:04PM -0700, Liu Bo wrote:
> While updating btree, we try to push items between sibling
> nodes/leaves in order to keep height as low as possible.
> But we don't memset the original places with zero when
> pushing items so that we could end up leaving stale content
> in nodes/leaves.  One may read the above stale content by
> increasing btree blocks' @nritems.
> 
> One case I've come across is that in fs tree, a leaf has two
> parent nodes, hence running balance ends up with processing
> this leaf with two parent nodes, but it can only reach the
> valid parent node through btrfs_search_slot, so it'd be like,
> 
> do_relocation
> for P in all parent nodes of block A:
> if !P->eb:
> btrfs_search_slot(key);   --> get path from P to A.
> if lowest:
> BUG_ON(A->bytenr != bytenr of A recorded in P);
> btrfs_cow_block(P, A);   --> change A's bytenr in P.
> 
> After btrfs_cow_block, P has the new bytenr of A, but with the
> same @key, we get the same path again, and get panic by BUG_ON.
> 
> Note that this only happens on a corrupted fs; a regular fs has a
> correct @nritems, so we won't read stale content in any case.
> 
> Reviewed-by: Josef Bacik 
> Signed-off-by: Liu Bo 
> ---
> v2: - use new internal error EFSCORRUPTED as "Filesystem is corrupted",
>   suggested by David Sterba.

Sorry I steered it to EFSCORRUPTED, we should introduce the error code
separately and audit the call chains. I'll drop the parts and change it
back to EIO.


Re: corrupt leaf, slot offset bad

2016-10-11 Thread Liu Bo
On Tue, Oct 11, 2016 at 02:48:09PM +0200, David Sterba wrote:
> Hi,
> 
> looks like a lot of random bitflips.
> 
> On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote:
> > item 109 has a few strange chars in its name (and it's truncated): 
> > 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a
> > 
> > item 105 key (261 DIR_ITEM 54556048) itemoff 11723 itemsize 72
> > location key (606286 INODE_ITEM 0) type FILE
> > namelen 42 datalen 0 name: 
> > python2-gobject-3.20.1-1-x86_64.pkg.tar.xz
> > item 106 key (261 DIR_ITEM 56363628) itemoff 11660 itemsize 63
> > location key (894298 INODE_ITEM 0) type FILE
> > namelen 33 datalen 0 name: unrar-1:5.4.5-1-x86_64.pkg.tar.xz
> > item 107 key (261 DIR_ITEM 66963651) itemoff 11600 itemsize 60
> > location key (1178 INODE_ITEM 0) type FILE
> > namelen 30 datalen 0 name: glibc-2.23-5-x86_64.pkg.tar.xz
> > item 108 key (261 DIR_ITEM 68561395) itemoff 11532 itemsize 68
> > location key (660578 INODE_ITEM 0) type FILE
> > namelen 38 datalen 0 name: 
> > squashfs-tools-4.3-4-x86_64.pkg.tar.xz
> > item 109 key (261 DIR_ITEM 76859450) itemoff 11483 itemsize 65
> > location key (2397184 UNKNOWN.0 7091317839824617472) type 45
> > namelen 13102 datalen 13358 name: 1-x86_64.pkg.tar.xzb
> 
> namelen must be at most 255, but the number itself (0x332e) does not look
> like a bitflip, and the name looks like a fragment of a longer name.
> 
> The location key is random garbage, likely overwritten memory;
> 7091317839824617472 == 0x62696c0100230000 contains the ASCII 'bil', and the
> key type is unknown but should be INODE_ITEM.
> 
> > data
> > item 110 key (261 DIR_ITEM 9799832789237604651) itemoff 11405 itemsize 
> > 62
> > location key (388547 INODE_ITEM 0) type FILE
> > namelen 32 datalen 0 name: intltool-0.51.0-1-any.pkg.tar.xz
> > item 111 key (261 DIR_ITEM 81211850) itemoff 11344 itemsize 131133
> 
> itemsize 131133 == 0x2003d is a clear bitflip; 0x3d == 61 corresponds
> to the expected item size.
> 
> There's possibly other random bitflips in the keys or other structures.
> It's hard to estimate the damage and thus the scope of restorable data.

It makes sense: since this is an SSD, we may have only one copy of the metadata.

Thanks,

-liubo


Re: corrupt leaf, slot offset bad

2016-10-11 Thread David Sterba
Hi,

looks like a lot of random bitflips.

On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote:
> item 109 has a few strange chars in its name (and it's truncated): 
> 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a
> 
>   item 105 key (261 DIR_ITEM 54556048) itemoff 11723 itemsize 72
>   location key (606286 INODE_ITEM 0) type FILE
>   namelen 42 datalen 0 name: 
> python2-gobject-3.20.1-1-x86_64.pkg.tar.xz
>   item 106 key (261 DIR_ITEM 56363628) itemoff 11660 itemsize 63
>   location key (894298 INODE_ITEM 0) type FILE
>   namelen 33 datalen 0 name: unrar-1:5.4.5-1-x86_64.pkg.tar.xz
>   item 107 key (261 DIR_ITEM 66963651) itemoff 11600 itemsize 60
>   location key (1178 INODE_ITEM 0) type FILE
>   namelen 30 datalen 0 name: glibc-2.23-5-x86_64.pkg.tar.xz
>   item 108 key (261 DIR_ITEM 68561395) itemoff 11532 itemsize 68
>   location key (660578 INODE_ITEM 0) type FILE
>   namelen 38 datalen 0 name: 
> squashfs-tools-4.3-4-x86_64.pkg.tar.xz
>   item 109 key (261 DIR_ITEM 76859450) itemoff 11483 itemsize 65
>   location key (2397184 UNKNOWN.0 7091317839824617472) type 45
>   namelen 13102 datalen 13358 name: 1-x86_64.pkg.tar.xzb

namelen must be at most 255, but the number itself (0x332e) does not look
like a bitflip, and the name looks like a fragment of a longer name.

The location key is random garbage, likely overwritten memory;
7091317839824617472 == 0x62696c0100230000 contains the ASCII 'bil', and the
key type is unknown but should be INODE_ITEM.

>   data
>   item 110 key (261 DIR_ITEM 9799832789237604651) itemoff 11405 itemsize 
> 62
>   location key (388547 INODE_ITEM 0) type FILE
>   namelen 32 datalen 0 name: intltool-0.51.0-1-any.pkg.tar.xz
>   item 111 key (261 DIR_ITEM 81211850) itemoff 11344 itemsize 131133

itemsize 131133 == 0x2003d is a clear bitflip; 0x3d == 61 corresponds
to the expected item size.

There's possibly other random bitflips in the keys or other structures.
It's hard to estimate the damage and thus the scope of restorable data.
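
To make the single-bit reasoning concrete, here is a tiny standalone check (an
illustration only, not btrfs code) that XORs the on-disk value with the expected
one and tests whether exactly one bit differs:

#include <stdio.h>

int main(void)
{
	unsigned int on_disk  = 131133;	/* 0x2003d, itemsize found in the leaf */
	unsigned int expected = 61;	/* 0x3d, the plausible item size       */
	unsigned int diff = on_disk ^ expected;

	/* a non-zero power of two means exactly one bit differs */
	printf("xor = %#x, single bitflip: %s\n", diff,
	       (diff && !(diff & (diff - 1))) ? "yes" : "no");
	return 0;
}

The bogus namelen 13102 (0x332e), on the other hand, has several high bits set,
so it cannot be a single bitflip of any valid length (at most 255); that is why
it looks like overwritten memory rather than a flipped bit.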


Fwd: State of the fuzzer

2016-10-11 Thread Lukas Lueg
Hi,

I've now shut down all fuzzer nodes since they only cost money and
there is no progress on most of the aforementioned bugs.

Best regards
Lukas

-- Forwarded message --
From: Lukas Lueg 
Date: 2016-09-26 11:39 GMT+02:00
Subject: Re: State of the fuzzer
To: linux-btrfs@vger.kernel.org


Hi David,

do we have any chance of engagement on those 23 bugs which came out of
the last fuzzing round? The nodes have been basically idle for a week,
spewing duplicates and variants of what's already known...

Best regards
Lukas

2016-09-20 13:33 GMT+02:00 Lukas Lueg :
> There are now 21 bugs open on bko, most of them crashes and some
> undefined behavior. The nodes are now mostly running idle as no new
> paths are discovered (after around one billion images tested in the
> current run).
>
> My thoughts are to wait until the current bugs have been fixed, then
> restart the whole process from HEAD (together with the corpus of
> ~2,000 seed images discovered by now) and catch new bugs and aborts()
> - we need to get rid of the reachable ones so code coverage can
> improve. After those, I'll change the process to run btrfsck --repair,
> which is slower but has a lot of yet uncovered code.
>
> DigitalOcean has provided some funding for this undertaking so we are
> good on CPU power. Kudos to them.
>
> 2016-09-13 22:28 GMT+02:00 Lukas Lueg :
>> I've booted another instance with btrfs-progs checked out to 2b7c507
>> and collected some bugs which remained from the run before the current
>> one. The current run discovered what qgroups are just three days ago
>> and will spend some time on that. I've also added UBSAN- and
>> MSAN-logging to my setup and there were three bugs found so far (one
>> is already fixed). I will boot a third instance to run lowmem-mode
>> exclusively in the next few days.
>>
>> There are 11 bugs open at the moment, all have a reproducing image
>> attached to them. The whole list is at
>>
>> https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW_status=ASSIGNED_status=REOPENED=btrfs=lukas.lueg%40gmail.com=1=exact_id=858441_format=advanced
>>
>>
>> 2016-09-09 16:00 GMT+02:00 David Sterba :
>>> On Tue, Sep 06, 2016 at 10:32:28PM +0200, Lukas Lueg wrote:
 I'm currently fuzzing rev 2076992 and things start to slowly, slowly
 quiet down. We will probably run out of steam at the end of the week
 when a total of (roughly) half a billion BTRFS-images have passed by.
 I will switch revisions to current HEAD and restart the whole process
 then. A few things:

 * There are a couple of crashes (mostly segfaults) I have not reported
 yet. I'll report them if they show up again with the latest revision.
>>>
>>> Ok.
>>>
 * The coverage-analysis shows assertion failures which are currently
 silenced. An assertion failure is technically a worse disaster
 successfully prevented; it still constitutes unexpected/unusable
 behaviour, though. Do you want assertions to be enabled and images
 triggering those assertions reported? This is basically the same
 conundrum as with BUG_ON and abort().
>>>
>>> Yes please. I'd like to turn most bugons/assertions into a normal
>>> failure report if it would make sense.
>>>
 * A few endless loops entered into by btrfsck are currently
 unmitigated (see bugs 155621, 155571, 11 and 155151). It would be
 nice if those could be taken care of by next week, if possible.
>>>
>>> Two of them are fixed, the other two need more work, updating all
>>> callers of read_node_slot and the callchain. So you may still see that
>>> kind of looping in more images. I don't have an ETA for the fix, I won't
>>> be available during the next week.
>>>
>>> At the moment, the initial sanity checks should catch most of the
>>> corrupted values, so I'm expecting that you'll see different classes of
>>> problems in the next rounds.
>>>
>>> The testsuite now contains all images that you reported and we have a
>>> fix in git. There are more utilities run on the images, there may be
>>> more problems for us to fix.


Re: [PATCH 2/4] btrfs-progs: Make btrfs-debug-tree print all readable strings for inode flags

2016-10-11 Thread David Sterba
On Tue, Oct 11, 2016 at 10:18:51AM +0800, Qu Wenruo wrote:
> >> -/* Caller should ensure sizeof(*ret) >= 29 "NODATASUM|NODATACOW|READONLY" 
> >> */
> >> +#define copy_one_inode_flag(flags, name, empty, dst) ({   
> >> \
> >> +  if (flags & BTRFS_INODE_##name) {   \
> >> +  if (!empty) \
> >> +  strcat(dst, "|");   \
> >> +  strcat(dst, #name); \
> >> +  empty = 0;  \
> >> +  }   \
> >> +})
> >
> > Can you please avoid using the macro? Or at least make it uppercase so
> > it's visible. Similar in the next patch.
> >
> >
> OK, I'll change it to upper case.

Ok.

> The only reason I'm using macro is, inline function can't do 
> stringification, or I missed something?

No, that's where macros help. My concern was about the hidden use of a
local variable, so at least an all-caps macro name would make it more
visible. As this is not going to be used elsewhere, we can live with
that.
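
For readers who have not run into the stringification point before, here is a
minimal user-space demo of the pattern (flag values as I recall them from
ctree.h, treat them as illustrative): the #name operator is exactly what an
inline function cannot provide, while ##name does the token pasting. The
statement-expression form ({ ... }) is a GCC/Clang extension, as in the original.

#include <stdio.h>
#include <string.h>

/* illustrative flag bits in the style of the BTRFS_INODE_* defines */
#define BTRFS_INODE_NODATASUM	(1U << 0)
#define BTRFS_INODE_NODATACOW	(1U << 1)
#define BTRFS_INODE_READONLY	(1U << 2)

#define COPY_ONE_INODE_FLAG(flags, name, empty, dst) ({	\
	if ((flags) & BTRFS_INODE_##name) {		\
		if (!(empty))				\
			strcat(dst, "|");		\
		strcat(dst, #name);			\
		(empty) = 0;				\
	}						\
})

int main(void)
{
	char buf[64] = "";
	int empty = 1;
	unsigned int flags = BTRFS_INODE_NODATASUM | BTRFS_INODE_READONLY;

	COPY_ONE_INODE_FLAG(flags, NODATASUM, empty, buf);
	COPY_ONE_INODE_FLAG(flags, NODATACOW, empty, buf);
	COPY_ONE_INODE_FLAG(flags, READONLY, empty, buf);
	printf("%s\n", buf);	/* prints NODATASUM|READONLY */
	return 0;
}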


[PATCH] Btrfs: fix -EINVAL in tree log recovering

2016-10-11 Thread robbieko
From: Robbie Ko 

During tree log recovery, a rebuilt or dirtied space_cache may get written out.
Log replay then tries to replay an extent whose disk_bytenr and disk_num_bytes
may already have been used for the free space inode, which leads to -EINVAL:

BTRFS: error in btrfs_replay_log:2446: errno=-22 unknown (Failed to recover log 
tree)

Therefore, do not save the space cache while the tree log is being recovered.

Signed-off-by: Robbie Ko 
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 665da8f..38b932c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3434,6 +3434,7 @@ again:
 
spin_lock(_group->lock);
if (block_group->cached != BTRFS_CACHE_FINISHED ||
+   block_group->fs_info->log_root_recovering ||
!btrfs_test_opt(root->fs_info, SPACE_CACHE)) {
/*
 * don't bother trying to write stuff out _if_
-- 
1.9.1



Re: [PATCH] Btrfs: fix enospc in punch hole

2016-10-11 Thread robbieko

Hi Filipe:

btrfs_calc_trunc_metadata_size reserves leafsize + nodesize * (8 - 1); assuming
leafsize equals nodesize, that is 8 nodesizes in total. When splitting a leaf we
need 2 paths, so if the extent tree level is smaller than 4 this is fine: the
worst case is (leafsize + nodesize * 3) * 2, which is 8 nodesizes. But if the
extent tree is deeper than level 4, the worst case needs
(leafsize + nodesize * 7) * 2, which is bigger than the reserved size. So we
should use btrfs_calc_trans_metadata_size, which takes the leaf-split case into
account (see the sketch below).
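
A rough sketch of that arithmetic, assuming the 4.8-era definitions
(btrfs_calc_trunc_metadata_size reserving nodesize * BTRFS_MAX_LEVEL per item,
btrfs_calc_trans_metadata_size reserving twice that) and a 16K nodesize; this is
only an illustration of the reasoning above, not code from the patch:

#include <stdio.h>

#define BTRFS_MAX_LEVEL 8

int main(void)
{
	unsigned long long nodesize = 16384;

	/* assumed 4.8-era reservation formulas, for a single item */
	unsigned long long trunc = nodesize * BTRFS_MAX_LEVEL;		/* 8 nodes  */
	unsigned long long trans = nodesize * BTRFS_MAX_LEVEL * 2;	/* 16 nodes */

	/* worst case for punch hole with a leaf split: two paths to COW */
	unsigned long long shallow = (nodesize + nodesize * 3) * 2;	/* level < 4  */
	unsigned long long deep    = (nodesize + nodesize * 7) * 2;	/* full depth */

	printf("trunc reservation: %llu\n", trunc);
	printf("trans reservation: %llu\n", trans);
	printf("shallow tree worst case: %llu (fits trunc: %s)\n",
	       shallow, shallow <= trunc ? "yes" : "no");
	printf("deep tree worst case: %llu (fits trunc: %s, fits trans: %s)\n",
	       deep, deep <= trunc ? "yes" : "no", deep <= trans ? "yes" : "no");
	return 0;
}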

Thanks.
robbieko

Filipe Manana wrote on 2016-10-07 18:18:

On Fri, Oct 7, 2016 at 7:09 AM, robbieko  wrote:

From: Robbie Ko 

when extent-tree level > BTRFS_MAX_LEVEL / 2,
__btrfs_drop_extents -> btrfs_duplicate_item ->
setup_leaf_for_split -> split_leaf
maybe enospc, because min_size is too small,
need use btrfs_calc_trans_metadata_size.


This change log is terrible.
You should describe the problem and fix. That is, that hole punching
can result in adding new leafs (and as a consequence new nodes) to the
tree because when we find file extent items that span beyond the hole
range we may end up not deleting them (just adjusting them) and add
new file extent items representing holes.

And I don't see why this is exclusive for the case where the height of
the extent tree is greater than 4 (BTRFS_MAX_LEVEL / 2).

The code changes themselves look good to me.

thanks



Signed-off-by: Robbie Ko 
---
 fs/btrfs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fea31a4..809ca85 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2322,7 +2322,7 @@ static int btrfs_punch_hole(struct inode *inode, 
loff_t offset, loff_t len)

u64 tail_len;
u64 orig_start = offset;
u64 cur_offset;
-   u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
+   u64 min_size = btrfs_calc_trans_metadata_size(root, 1);
u64 drop_end;
int ret = 0;
int err = 0;
@@ -2469,7 +2469,7 @@ static int btrfs_punch_hole(struct inode *inode, 
loff_t offset, loff_t len)

ret = -ENOMEM;
goto out_free;
}
-   rsv->size = btrfs_calc_trunc_metadata_size(root, 1);
+   rsv->size = btrfs_calc_trans_metadata_size(root, 1);
rsv->failfast = 1;

/*
--
1.9.1





Re: [PATCH] Btrfs: fix fsync deadlock in log_new_dir_dentries

2016-10-11 Thread robbieko

Hi Filipe:

Why did I replace the continue statement with a break statement? Because the
path has been released before calling btrfs_iget, it can no longer be used to
continue the walk; we need to break out and then go back to "again".

As a supplement: we found an fsync deadlock, i.e.
32021->32020->32028->14431->14436->32021, where the numbers are pids.

extent_buffer: start:207060992, len:16384
locker pid: 32020 read lock
wait pid: 32021 write lock
extent_buffer: start:14730821632, len:16384
locker pid: 32028 read lock
wait pid: 32020 write lock
extent_buffer: start:446503813120, len:16384
locker pid: 14431 write lock
wait pid: 32028 read lock
extent_buffer: start:446503845888, len: 16384
locker pid: 14436 write lock
wait pid: 14431 write lock
extent_buffer: start: 446504386560, len: 16384
locker pid: 32021 write lock
wait pid: 14436 write lock

Thanks.
Robbie Ko

Filipe Manana wrote on 2016-10-07 18:46:
On Fri, Oct 7, 2016 at 11:43 AM, robbieko  
wrote:

Hi Filipe,

I am sorry, I did not express it clearly enough.
These numbers are pids, and the traces above are their respective call traces.


And why did you replace the continue statement with a break statement?

Also please avoid mixing inline replies with top posting, it just
breaks the thread.

thanks



Thanks.
robbieko

Filipe Manana wrote on 2016-10-07 18:23:

On Fri, Oct 7, 2016 at 11:05 AM, robbieko  
wrote:

From: Robbie Ko 

We found a fsync deadlock ie. 
32021->32020->32028->14431->14436->32021,
in log_new_dir_dentries, because btrfs_search_forward get path lock, 
then
call btrfs_iget will get another extent_buffer lock, maybe occur 
deadlock.


What are those numbers? Are they inode numbers?
If so you're suggesting a deadlock due to recursive logging of the 
same

inode.
However the trace below, and the code change, has nothing to do with 
that.

It's just about btrfs_iget trying to do a search on a btree and
attempting to read lock some node/leaf that already has a write lock
acquired previously by the same task.

Please be more clear on your change logs.




we can release path before call btrfs_iget, avoid deadlock occur.

some process call trace like below:
[ 4077.478852] kworker/u24:10  D 88107fc90640 0 14431  2
0x
[ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[btrfs]
[ 4077.494346]  880ffa56bad0 0046 9000
880ffa56bfd8
[ 4077.502629]  880ffa56bfd8 881016ce21c0 a06ecb26
88101a5d6138
[ 4077.510915]  880ebb5173b0 880ffa56baf8 880ebb517410
881016ce21c0
[ 4077.519202] Call Trace:
[ 4077.528752]  [] ? btrfs_tree_lock+0xdd/0x2f0 
[btrfs]

[ 4077.536049]  [] ? wake_up_atomic_t+0x30/0x30
[ 4077.542574]  [] ? btrfs_search_slot+0x79f/0xb10
[btrfs]
[ 4077.550171]  [] ? 
btrfs_lookup_file_extent+0x33/0x40

[btrfs]
[ 4077.558252]  [] ? 
__btrfs_drop_extents+0x13b/0xdf0

[btrfs]
[ 4077.566140]  [] ? 
add_delayed_data_ref+0xe2/0x150

[btrfs]
[ 4077.573928]  [] ?
btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
[ 4077.582399]  [] ? __set_extent_bit+0x4c0/0x5c0
[btrfs]
[ 4077.589896]  [] ?
insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
[ 4077.599632]  [] ? start_transaction+0x8d/0x470
[btrfs]
[ 4077.607134]  [] ? 
btrfs_finish_ordered_io+0x2e7/0x600

[btrfs]
[ 4077.615329]  [] ? process_one_work+0x142/0x3d0
[ 4077.622043]  [] ? worker_thread+0x109/0x3b0
[ 4077.628459]  [] ? 
manage_workers.isra.26+0x270/0x270

[ 4077.635759]  [] ? kthread+0xaf/0xc0
[ 4077.641404]  [] ? 
kthread_create_on_node+0x110/0x110

[ 4077.648696]  [] ? ret_from_fork+0x58/0x90
[ 4077.654926]  [] ? 
kthread_create_on_node+0x110/0x110


[ 4078.358087] kworker/u24:15  D 88107fcd0640 0 14436  2
0x
[ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[btrfs]
[ 4078.373574]  880ffa57fad0 0046 9000
880ffa57ffd8
[ 4078.381864]  880ffa57ffd8 88103004d0a0 a06ecb26
88101a5d6138
[ 4078.390163]  880fbeffc298 880ffa57faf8 880fbeffc2f8
88103004d0a0
[ 4078.398466] Call Trace:
[ 4078.408019]  [] ? btrfs_tree_lock+0xdd/0x2f0 
[btrfs]

[ 4078.415322]  [] ? wake_up_atomic_t+0x30/0x30
[ 4078.421844]  [] ? btrfs_search_slot+0x79f/0xb10
[btrfs]
[ 4078.429438]  [] ? 
btrfs_lookup_file_extent+0x33/0x40

[btrfs]
[ 4078.437518]  [] ? 
__btrfs_drop_extents+0x13b/0xdf0

[btrfs]
[ 4078.445404]  [] ? 
add_delayed_data_ref+0xe2/0x150

[btrfs]
[ 4078.453194]  [] ?
btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
[ 4078.461663]  [] ? __set_extent_bit+0x4c0/0x5c0
[btrfs]
[ 4078.469161]  [] ?
insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
[ 4078.478893]  [] ? start_transaction+0x8d/0x470
[btrfs]
[ 4078.486388]  [] ? 
btrfs_finish_ordered_io+0x2e7/0x600

[btrfs]
[ 4078.494561]  [] ? process_one_work+0x142/0x3d0
[ 4078.501278]  [] ? 
pwq_activate_delayed_work+0x27/0x40

[ 4078.508673]  [] ? worker_thread+0x109/0x3b0
[ 4078.515098]  [] ? 

[RFC] btrfs: make max inline data can be equal to sectorsize

2016-10-11 Thread Wang Xiaoguang
If we use mount option "-o max_inline=sectorsize", say 4096, indeed
even for a fresh fs, say nodesize is 16k, we can not make the first
4k of data completely inline. I found this condition causing the issue:
  !compressed_size && (actual_end & (root->sectorsize - 1)) == 0

If it returns true, we'll not make data inline. For 4k sectorsize,
a 0~4094 data range can be made inline, but 0~4095 cannot.
I don't think this limitation is useful, so remove it here, which will
allow the max inline data to be equal to the sectorsize.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ea15520..c0db393 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct btrfs_root 
*root,
if (start > 0 ||
actual_end > root->sectorsize ||
data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
-   (!compressed_size &&
-   (actual_end & (root->sectorsize - 1)) == 0) ||
end + 1 < isize ||
data_len > root->fs_info->max_inline) {
return 1;
-- 
2.9.0
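
To make the effect concrete, a tiny user-space illustration of the condition
being removed, assuming actual_end is the end offset of the written range,
sectorsize is 4096 and compressed_size is 0 (not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long long sectorsize = 4096;
	unsigned long long actual_end[] = { 4095, 4096 };	/* 0~4094 vs 0~4095 data */

	for (int i = 0; i < 2; i++) {
		/* the check removed by the patch, with compressed_size == 0 */
		int refused = (actual_end[i] & (sectorsize - 1)) == 0;
		printf("actual_end = %llu: %s\n", actual_end[i],
		       refused ? "inline refused" : "inline allowed (other checks permitting)");
	}
	return 0;
}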


