Re: RAID system with adaption to changed number of disks
At 10/12/2016 12:37 PM, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:
>>> But consider the identical scenario with md or LVM raid5, or any
>>> conventional hardware raid5. A scrub check simply reports a mismatch.
>>> It's unknown whether data or parity is bad, so the bad data strip is
>>> propagated upward to user space without error. On a scrub repair, the
>>> data strip is assumed to be good, and good parity is overwritten with bad.
>>
>> Totally true.
>>
>> Original RAID5/6 design is only to handle missing device, not rotted bits.
>
> Missing device is the _only_ thing the current design handles. i.e. you
> umount the filesystem cleanly, remove a disk, and mount it again degraded,
> and then the only thing you can safely do with the filesystem is delete or
> replace a device. There is also a probability of being able to repair
> bitrot under some circumstances.
>
> If your disk failure looks any different from this, btrfs can't handle it.
> If a disk fails while the array is running and the filesystem is writing,
> the filesystem is likely to be severely damaged, possibly unrecoverably.
>
> A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> snowball's chance in hell of surviving a disk failure on a live array
> with only data losses. This would work if mdadm and btrfs successfully
> arrange to have each dup copy of metadata updated separately, and one of
> the copies survives the raid5 write hole. I've never tested this
> configuration, and I'd test the heck out of it before considering using it.
>
>>> So while I agree in total that Btrfs raid56 isn't mature or tested
>>> enough to consider it production ready, I think that's because of the
>>> UNKNOWN causes for problems we've seen with raid56. Not the parity
>>> scrub bug which - yeah NOT good, not least of which is the data
>>> integrity guarantees Btrfs is purported to make are substantially
>>> negated by this bug. I think the bark is worse than the bite. It is
>>> not the bark we'd like Btrfs to have though, for sure.
>> Current btrfs RAID5/6 scrub problem is, we don't make full use of the
>> tree and data checksums.
>> [snip]
>
> This leads directly to a variety of problems with the diagnostic tools,
> e.g. scrub reports errors randomly across devices, and cannot report the
> path of files containing corrupted blocks if it's the parity block that
> gets corrupted.

At least that's better than screwing up good stripes.

The tool is just used to let the user know whether there are any corrupted stripes, like kernel scrub, but with better behavior: for example, it won't reconstruct stripes while ignoring checksums.

A human-readable report is not that hard to implement (compared to the complex csum and parity checks) and can be added later. For parity itself there is no way to output a human-readable result anyway.

> btrfs also doesn't avoid the raid5 write hole properly. After a crash,
> a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
> to reconstruct any parity that was damaged by an incomplete data stripe
> update. As long as all disks are working, the parity can be
> reconstructed from the data disks. If a disk fails prior to the
> completion of the scrub, any data stripes that were written during
> previous crashes may be destroyed. And all that assumes the scrub bugs
> are fixed first.

This is true; I didn't take it into account. But this is not a *single* problem, it's 2 problems combined:
1) Power loss
2) Device crash
Before making things complex, why not focus on each single problem first? Not to mention the possibility of the combination is much smaller than either single problem alone.

> If writes occur after a disk fails, they all temporarily corrupt small
> amounts of data in the filesystem. btrfs cannot tolerate any metadata
> corruption (it relies on redundant metadata to self-repair), so when a
> write to metadata is interrupted, the filesystem is instantly doomed
> (damaged beyond the current tools' ability to repair and mount
> read-write).

That's why we use a higher duplication level for metadata by default.
And considering metadata size, it's much more acceptable to use RAID1 for metadata rather than RAID5/6.

> Currently the upper layers of the filesystem assume that once data
> blocks are written to disk, they are stable. This is not true in
> raid5/6 because the parity and data blocks within each stripe cannot be
> updated atomically.

True, but if we ignore parity, we'd find that RAID5 is just RAID0. CoW ensures (cowed) data and metadata are all safe, and checksums ensure they are OK, so even for RAID0 a case like power loss is not a problem.

So we should follow csum first and then parity. If we follow this principle, RAID5 is a RAID0 with a somewhat higher possibility of recovering from some cases, like one missing device.

So, I'd like to fix RAID5 scrub to make it at least better than RAID0, not worse than RAID0.

> btrfs doesn't avoid writing new data in the same RAID stripe as old data
> (it provides a rmw function for raid56, which is simply a bug in a CoW
> filesystem), so previously committed data can be
Re: RAID system with adaption to changed number of disks
On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:
>> But consider the identical scenario with md or LVM raid5, or any
>> conventional hardware raid5. A scrub check simply reports a mismatch.
>> It's unknown whether data or parity is bad, so the bad data strip is
>> propagated upward to user space without error. On a scrub repair, the
>> data strip is assumed to be good, and good parity is overwritten with bad.
>
> Totally true.
>
> Original RAID5/6 design is only to handle missing device, not rotted bits.

Missing device is the _only_ thing the current design handles. i.e. you umount the filesystem cleanly, remove a disk, and mount it again degraded, and then the only thing you can safely do with the filesystem is delete or replace a device. There is also a probability of being able to repair bitrot under some circumstances.

If your disk failure looks any different from this, btrfs can't handle it. If a disk fails while the array is running and the filesystem is writing, the filesystem is likely to be severely damaged, possibly unrecoverably.

A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a snowball's chance in hell of surviving a disk failure on a live array with only data losses. This would work if mdadm and btrfs successfully arrange to have each dup copy of metadata updated separately, and one of the copies survives the raid5 write hole. I've never tested this configuration, and I'd test the heck out of it before considering using it.

>> So while I agree in total that Btrfs raid56 isn't mature or tested
>> enough to consider it production ready, I think that's because of the
>> UNKNOWN causes for problems we've seen with raid56. Not the parity
>> scrub bug which - yeah NOT good, not least of which is the data
>> integrity guarantees Btrfs is purported to make are substantially
>> negated by this bug. I think the bark is worse than the bite. It is
>> not the bark we'd like Btrfs to have though, for sure.
>> Current btrfs RAID5/6 scrub problem is, we don't take full usage of
>> tree and data checksum.
>> [snip]

This leads directly to a variety of problems with the diagnostic tools, e.g. scrub reports errors randomly across devices, and cannot report the path of files containing corrupted blocks if it's the parity block that gets corrupted.

btrfs also doesn't avoid the raid5 write hole properly. After a crash, a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced) to reconstruct any parity that was damaged by an incomplete data stripe update. As long as all disks are working, the parity can be reconstructed from the data disks. If a disk fails prior to the completion of the scrub, any data stripes that were written during previous crashes may be destroyed. And all that assumes the scrub bugs are fixed first.

If writes occur after a disk fails, they all temporarily corrupt small amounts of data in the filesystem. btrfs cannot tolerate any metadata corruption (it relies on redundant metadata to self-repair), so when a write to metadata is interrupted, the filesystem is instantly doomed (damaged beyond the current tools' ability to repair and mount read-write).

Currently the upper layers of the filesystem assume that once data blocks are written to disk, they are stable. This is not true in raid5/6 because the parity and data blocks within each stripe cannot be updated atomically.

btrfs doesn't avoid writing new data in the same RAID stripe as old data (it provides a rmw function for raid56, which is simply a bug in a CoW filesystem), so previously committed data can be lost. If the previously committed data is part of the metadata tree, the filesystem is doomed; for ordinary data blocks there are just a few dozen to a few thousand corrupted files for the admin to clean up after each crash.
It might be possible to hack up the allocator to pack writes into empty stripes to avoid the write hole, but every time I think about this it looks insanely hard to do (or insanely wasteful of space) for data stripes.
Re: [RFC] btrfs: make max inline data can be equal to sectorsize
hi,

On 10/11/2016 11:49 PM, Chris Murphy wrote:
> On Tue, Oct 11, 2016 at 12:47 AM, Wang Xiaoguang wrote:
>> If we use the mount option "-o max_inline=sectorsize", say 4096, indeed
>> even for a fresh fs, say with a 16k nodesize, we cannot make the first
>> 4k of data completely inline. I found this condition causing the issue:
>>
>>     !compressed_size && (actual_end & (root->sectorsize - 1)) == 0
>>
>> If it returns true, we'll not make data inline. For a 4k sectorsize,
>> the data range 0~4094 can be made inline, but 0~4095 cannot. I don't
>> think this limitation is useful, so remove it here, which makes the max
>> inline data size equal to the sectorsize.
>>
>> Signed-off-by: Wang Xiaoguang
>> ---
>>  fs/btrfs/inode.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index ea15520..c0db393 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
>>      if (start > 0 ||
>>          actual_end > root->sectorsize ||
>>          data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
>> -        (!compressed_size &&
>> -         (actual_end & (root->sectorsize - 1)) == 0) ||
>>          end + 1 < isize ||
>>          data_len > root->fs_info->max_inline) {
>>              return 1;
>> --
>> 2.9.0
>
> Before making any further changes to inline data, does it make sense to
> find the source of corruption Zygo has been experiencing? That's in the
> "btrfs rare silent data corruption with kernel data leak" thread.

Yes, agree.
Also Zygo has sent a patch to fix that bug this morning :)

Regards,
Xiaoguang Wang
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: fix false enospc for compression
hi,

Stefan often reports enospc errors on his servers when btrfs compression is enabled. Now he has run with these 2 patches applied for more than 6 days and no enospc error has occurred, so it seems they are useful :)
These 2 patches are somewhat big; please check them, thanks.

Regards,
Xiaoguang Wang

On 10/06/2016 10:51 AM, Wang Xiaoguang wrote:
When testing btrfs compression, sometimes we got an ENOSPC error even though the fs still had much free space. xfstests generic/171, generic/172, generic/173, generic/174, and generic/175 can reveal this bug in my test environment when compression is enabled.

After some debugging work, we found that it's btrfs_delalloc_reserve_metadata() which sometimes tries to reserve plenty of metadata space, even for a very small data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try to reserve is calculated from the difference between outstanding_extents and reserved_extents. Please see the case below for how ENOSPC occurs:

1. Buffered write 128MB of data in units of 128KB, so finally we'll have 1 inode outstanding extent and 1024 reserved extents. Note it's btrfs_merge_extent_hook() that merges these 128KB units into one big outstanding extent, but it does not change reserved_extents.

2. When writing dirty pages, for compression, cow_file_range_async() will split the big extent above in units of 128KB (the compression extent size is 128KB). When the first split operation finishes, we'll have 2 outstanding extents and 1024 reserved extents. If just at that moment the currently generated ordered extent is dispatched to run and completes, then btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be called to release metadata; after that we will have 1 outstanding extent and 1 reserved extent (also see the logic in drop_outstanding_extent()). Later cow_file_range_async() continues to handle the remaining data range [128KB, 128MB), and if no other ordered extent is dispatched to run, there will be 1023 outstanding extents and 1 reserved extent.
3. Now if another buffered write to this file comes in, btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least 1023 outstanding extents. For a 16KB node size that's 1023*16384*2*8, about 255MB; for a 64KB node size it's 1023*65536*2*8, about 1GB of metadata. Obviously that's not sane and can easily result in an enospc error.

The root cause is that for compression, the max extent size is no longer BTRFS_MAX_EXTENT_SIZE (128MB) but 128KB, so the current metadata reservation method in btrfs is not appropriate or correct. Here we introduce:

	enum btrfs_metadata_reserve_type {
		BTRFS_RESERVE_NORMAL,
		BTRFS_RESERVE_COMPRESS,
	};

and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space() by adding a new enum btrfs_metadata_reserve_type argument. When a data range will go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metadata. Meanwhile we introduce the EXTENT_COMPRESS flag to mark a data range that will go through the compression path.

With this patch, we can fix these false enospc errors for compression.
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h             |  31 ++--
 fs/btrfs/extent-tree.c       |  55 +
 fs/btrfs/extent_io.c         |  59 +-
 fs/btrfs/extent_io.h         |   2 +
 fs/btrfs/file.c              |  26 +--
 fs/btrfs/free-space-cache.c  |   6 +-
 fs/btrfs/inode-map.c         |   5 +-
 fs/btrfs/inode.c             | 181 ---
 fs/btrfs/ioctl.c             |  12 ++-
 fs/btrfs/relocation.c        |  14 +++-
 fs/btrfs/tests/inode-tests.c |  15 ++--
 11 files changed, 309 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 16885f6..fa6a19a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -97,6 +97,19 @@ static const int btrfs_csum_sizes[] = { 4 };

 #define BTRFS_DIRTY_METADATA_THRESH	SZ_32M

+/*
+ * for compression, max file extent size would be limited to 128K, so when
+ * reserving metadata for such delalloc writes, pass BTRFS_RESERVE_COMPRESS to
+ * btrfs_delalloc_reserve_metadata() or btrfs_delalloc_reserve_space() to
+ * calculate metadata, for none-compression, use BTRFS_RESERVE_NORMAL.
+ */
+enum btrfs_metadata_reserve_type {
+	BTRFS_RESERVE_NORMAL,
+	BTRFS_RESERVE_COMPRESS,
+};
+int inode_need_compress(struct inode *inode);
+u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+
 #define BTRFS_MAX_EXTENT_SIZE	SZ_128M

 struct btrfs_mapping_tree {
@@ -2677,10 +2690,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 void btrfs_subvolume_release_metadata(struct btrfs_root *root,
				      struct btrfs_block_rsv *rsv,
				      u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void
[PATCH] btrfs: fix silent data corruption while reading compressed inline extents
rsync -S causes a large number of small writes separated by small seeks to form sparse holes in files that contain runs of zero bytes. Rarely, this can lead btrfs to write a file with a compressed inline extent followed by other data, like this:

	Filesystem type is: 9123683e
	File size of /try/./30/share/locale/nl/LC_MESSAGES/tar.mo is 61906 (16 blocks of 4096 bytes)
	 ext: logical_offset: physical_offset: length: expected: flags:
	   0:       0..4095:        0..  4095:   4096:           encoded,not_aligned,inline
	   1:       1..  15:   331372..331386:     15:        1: last,encoded,eof
	/try/./30/share/locale/nl/LC_MESSAGES/tar.mo: 2 extents found

The inline extent size is less than the page size, so the ram_bytes field in the extent is smaller than 4096. The difference between ram_bytes and the end of the first page of the file forms a small hole. Like any other hole, the correct value of each byte within the hole is zero.

When the inline extent is not compressed, btrfs_get_extent copies the inline extent data and then memsets the remainder of the page to zero. There is no corruption in this case.

When the inline extent is compressed, uncompress_inline uses the ram_bytes field from the extent ref as the size of the uncompressed data. ram_bytes is smaller than the page size, so the remainder of the page (i.e. the bytes in the small hole) is uninitialized memory. Each time the extent is read into the page cache, userspace may see different contents.

Fix this by zeroing out the difference between the size of the uncompressed inline extent and PAGE_CACHE_SIZE in uncompress_inline. Only bytes within the hole are affected, so affected files can be read correctly with a fixed kernel.

The corruption happens after IO and checksum validation, so the corruption is never reported in dmesg or counted in dev stats. The bug is at least as old as 3.5.7 (the oldest kernel I can conveniently test), and possibly much older.

The code may not be correct if the extent is larger than a page, so add a WARN_ON for that case.
To reproduce the bug, run this on a 3072M kvm VM:

	#!/bin/sh

	# Use your favorite block device here
	blk=/dev/vdc

	# Create test filesystem and mount point
	mkdir -p /try
	mkfs.btrfs -dsingle -mdup -O ^extref,^skinny-metadata,^no-holes -f "$blk" || exit 1
	mount -ocompress-force,flushoncommit,max_inline=8192,noatime "$blk" /try || exit 1

	# Create a few inline extents in larger files.
	# Multiple processes seem to be necessary.
	y=/usr; for x in $(seq 10 19); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
	y=/usr; for x in $(seq 20 29); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
	y=/usr; for x in $(seq 30 39); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
	y=/usr; for x in $(seq 40 49); do rsync -axHSWI "$y/." "$x"; y="$x"; done &
	wait

	# Make a list of the files with inline extents
	touch /try/list
	find -type f -size +4097c -exec sh -c 'for x; do if filefrag -v "$x" | sed -n "4p" | grep -q "inline"; then echo "$x" >> list; fi; done' -- {} +

	# Check the inline extents to see if they change as they are read multiple times
	while read -r x; do
		sum="$(sha1sum "$x")"
		for y in $(seq 0 99); do
			sysctl vm.drop_caches=1
			sum2="$(sha1sum "$x")"
			if [ "$sum" != "$sum2" ]; then
				echo "Inconsistent reads from '$x'"
				exit 1
			fi
		done
	done < list

The reproducer may need to run up to 20 times before it finds a corruption.
Signed-off-by: Zygo Blaxell
---
 fs/btrfs/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..34f9c80 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6791,6 +6791,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
 	max_size = min_t(unsigned long, PAGE_SIZE, max_size);
 	ret = btrfs_decompress(compress_type, tmp, page,
 			       extent_offset, inline_size, max_size);
+	WARN_ON(max_size > PAGE_SIZE);
+	if (max_size < PAGE_SIZE) {
+		char *map = kmap(page);
+		memset(map + max_size, 0, PAGE_SIZE - max_size);
+		kunmap(page);
+	}
 	kfree(tmp);
 	return ret;
 }
--
2.1.4
Re: RAID system with adaption to changed number of disks
Ignoring the RAID56 bugs for a moment: if you have mismatched drives, BtrFS RAID1 is a pretty good way of utilising available space while still having redundancy.

My home array is BtrFS with a cobbled-together collection of disks ranging from 500GB to 3TB (and 5 of them, so it's not an even number). I have a grand total of 8TB of linear space, and with BtrFS RAID1 I can use exactly 50% of this (4TB) even with the weird combination of disks. That's something other RAID1 implementations can't do (they're limited to the size of the smallest disk of any pair, and need an even number of disks all up), and I get free compression and snapshotting, so yay for that.

As drives die of natural old age, I replace them ad hoc with bigger drives (whatever is the sane price-point at the time). A replace followed by a rebalance later, and I'm back to using all available space (which grows every time I throw a bigger drive in the mix), which again is incredibly handy when you're a home user looking for sane long-term storage that doesn't require complete rebuilds of your array.

-Dan

Dan Mons - VFX Sysadmin
Cutting Edge
http://cuttingedge.com.au

On 12 October 2016 at 01:14, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
>   to reconstruct itself out of the degraded state. Meaning, if I have e.g. a
>   RAID with 8 disks and 1 fails, I should be able to choose to transform this
>   into a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of whatever size, it
>   should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don't care if I have to do it manually.
> I don't care so much about speed either.
>
> Is BTrFS capable of doing that?
>
> Thanks a lot for your help!
Re: RAID system with adaption to changed number of disks
At 10/12/2016 07:58 AM, Chris Murphy wrote:
> https://btrfs.wiki.kernel.org/index.php/Status
>
>     Scrub + RAID56: Unstable - will verify but not repair
>
> This doesn't seem quite accurate. It does repair the vast majority of
> the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance a bad
> data strip results in a.) fixed up data strip from parity b.) wrong
> recomputation of replacement parity c.) good parity is overwritten with
> bad, silently, d.) if parity reconstruction is needed in the future,
> e.g. device or sector failure, it results in EIO, a kind of data loss.
> Bad bug. For sure.
>
> But consider the identical scenario with md or LVM raid5, or any
> conventional hardware raid5. A scrub check simply reports a mismatch.
> It's unknown whether data or parity is bad, so the bad data strip is
> propagated upward to user space without error. On a scrub repair, the
> data strip is assumed to be good, and good parity is overwritten with bad.

Totally true.

Original RAID5/6 design is only to handle missing devices, not rotted bits.

> So while I agree in total that Btrfs raid56 isn't mature or tested
> enough to consider it production ready, I think that's because of the
> UNKNOWN causes for problems we've seen with raid56. Not the parity
> scrub bug which - yeah NOT good, not least of which is the data
> integrity guarantees Btrfs is purported to make are substantially
> negated by this bug. I think the bark is worse than the bite. It is
> not the bark we'd like Btrfs to have though, for sure.

The current btrfs RAID5/6 scrub problem is that we don't make full use of the tree and data checksums.

In the ideal situation, btrfs should detect which stripe is corrupted, and only try to recover data/parity if the recovered data's checksum matches.

For example, for a very traditional RAID5 layout like the following:

	Disk 1 | Disk 2 | Disk 3
	------------------------
	Data 1 | Data 2 | Parity

Scrub should check data stripes 1 and 2 against their checksums first.

[All data extents have csum]
1) All csums match
   Good, then check parity.
   1.1) Parity matches
        Nothing wrong at all.
   1.2) Parity mismatches
        Just recalculate parity. Corruption may have happened in unused
        data space or in parity; either way, recalculating parity is
        good enough.
2) One data stripe's csum mismatches (or is missing), parity mismatches too
   We only know that one data stripe mismatches, not whether parity is OK.
   Try to recover that data stripe from parity, and recheck its csum.
   2.1) Recovered data stripe matches csum
        That data stripe was corrupted and parity is OK. Recoverable.
   2.2) Recovered data stripe mismatches csum
        Both that data stripe and parity are corrupted.
3) Two data stripes' csums mismatch, no matter whether parity matches
   At least 2 stripes are screwed up; no fix anyway.

[Some data extents have no csum (nodatasum)]
4) Existing csums (or no csum at all) match, parity matches
   Good, nothing to worry about.
5) Existing csum mismatches for one data stripe, parity mismatches
   Like 2), try to recover that data stripe, and re-check its csum.
   5.1) Recovered data stripe matches csum
        At least we can recover the data covered by csum. Corrupted
        no-csum data is not our concern.
   5.2) Recovered data stripe mismatches csum
        Screwed up.
6) No csum at all, parity mismatches
   We are screwed up, just like traditional RAID5.

And I'm coding for the above cases in btrfs-progs to implement an off-line scrub tool. Currently it looks good, and it can already handle cases 1) to 3). And I tend to ignore any full stripe which lacks checksums and whose parity mismatches.

But as you can see, there are so many things involved in btrfs RAID5 (csum existence, csum matches, parity matches, missing devices; RAID6 will be more complex) that it's already much more complex than traditional RAID5/6 or the current scrub implementation.

So what the current kernel scrub lacks is:
1) Detection of good/bad stripes
2) Recheck of recovery attempts

But that's also what traditional RAID5/6 lacks, unless they have some hidden checksum, like btrfs's, that they can use.
Thanks,
Qu
Re: RAID system with adaption to changed number of disks
https://btrfs.wiki.kernel.org/index.php/Status

    Scrub + RAID56: Unstable - will verify but not repair

This doesn't seem quite accurate. It does repair the vast majority of the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance a bad data strip results in a.) fixed up data strip from parity b.) wrong recomputation of replacement parity c.) good parity is overwritten with bad, silently, d.) if parity reconstruction is needed in the future, e.g. device or sector failure, it results in EIO, a kind of data loss. Bad bug. For sure.

But consider the identical scenario with md or LVM raid5, or any conventional hardware raid5. A scrub check simply reports a mismatch. It's unknown whether data or parity is bad, so the bad data strip is propagated upward to user space without error. On a scrub repair, the data strip is assumed to be good, and good parity is overwritten with bad.

So while I agree in total that Btrfs raid56 isn't mature or tested enough to consider it production ready, I think that's because of the UNKNOWN causes for problems we've seen with raid56. Not the parity scrub bug which - yeah NOT good, not least of which is the data integrity guarantees Btrfs is purported to make are substantially negated by this bug. I think the bark is worse than the bite. It is not the bark we'd like Btrfs to have though, for sure.

--
Chris Murphy
Re: raid levels and NAS drives
On Mon, Oct 10, 2016 at 08:07:53AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-10-09 19:12, Charles Zeitler wrote:
>> Is there any advantage to using NAS drives
>> under RAID levels, as opposed to regular
>> 'desktop' drives for BTRFS?
[...]
> So, as for what you should use in a RAID array, here's my specific advice:
> 1. Don't worry about enterprise drives unless you've already got a system
>    that has them. They're insanely overpriced for relatively minimal benefit
>    when compared to NAS drives.
> 2. If you can afford NAS drives, use them, they'll get you the best
>    combination of energy efficiency, performance, and error recovery.
> 3. If you can't get NAS drives, most desktop drives work fine, but you will
>    want to bump up the scsi_command_timer attribute in the kernel for them
>    (200 seconds is reasonable, just make sure you have good cables and a
>    good storage controller).
> 4. Avoid WD Green drives, without special effort, they will get worse
>    performance and have shorter lifetimes than any other hard disk I've
>    ever seen.
> 5. Generally avoid drives with a capacity over 1TB from manufacturers other
>    than WD, HGST, and Seagate, most of them are not particularly good
>    quality (especially if it's an odd non-power-of-two size like 5TB).

+1 !

Additionally, is it still the case that it is generally safer to buy the largest capacity disks offered by the previous generation of technology rather than the current largest capacity? e.g. right now that would be 4TB or 6TB, and not 8TB or 10TB.

Cheers,
Nicholas
Re: raid6 file system in a bad state
Re-adding the btrfs list.

On Tue, Oct 11, 2016 at 1:00 PM, Jason D. Michaelson wrote:
>>> -----Original Message-----
>>> From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On
>>> Behalf Of Chris Murphy
>>> Sent: Tuesday, October 11, 2016 12:41 PM
>>> To: Jason D. Michaelson
>>> Cc: Chris Murphy; Btrfs BTRFS
>>> Subject: Re: raid6 file system in a bad state
>>>
>>> On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson wrote:
>>>> superblock: bytenr=65536, device=/dev/sda
>>>> ---------------------------------------------------------
>>>> generation    161562
>>>> root          5752616386560
>>>>
>>>> superblock: bytenr=65536, device=/dev/sdh
>>>> ---------------------------------------------------------
>>>> generation    161474
>>>> root          4844272943104
>>>
>>> OK so most obvious is that the bad super is many generations behind
>>> the good super. That's expected given all the write errors.
>
> Is there any chance/way of going back to use this generation/root as a
> source for btrfs restore?

Yes, with the -t option and that root bytenr for the generation you want to restore. Thing is, that's so far back the metadata may be gone (overwritten) already. But worth a shot. I've recovered recently deleted files this way.

OK at this point I'm thinking that fixing the super blocks won't change anything, because it sounds like it's using the new ones anyway, and maybe the thing to try is going back to a tree root that isn't in any of the new supers. That means losing anything that was being written when the lost writes happened. However, for all we know some overwrites happened, so this won't work. And also it does nothing to deal with the fragile state of having at least two flaky devices, and one of the system chunks with no redundancy.

Try 'btrfs check' without repair. And then also try it with the -r flag, using the various tree roots we've seen so far. Try explicitly using 5752616386560, which is what it ought to use first anyway. And then also 4844272943104. That might go far enough back before the bad sectors were a factor.
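Using the two root bytenrs quoted above, the suggested attempts might look like the following sketch. The device and output directories are placeholders; verify the bytenrs against your own supers before running anything, and note that both restore and a plain check are read-only with respect to the damaged filesystem.

```shell
# Restore using the current tree root from the good super...
btrfs restore -t 5752616386560 /dev/sda /mnt/recovered

# ...and using the older root from the stale super, which may predate the
# bad sectors (though its metadata may already have been overwritten).
btrfs restore -t 4844272943104 /dev/sda /mnt/recovered-old

# Check without --repair first, then with explicit tree roots (-r).
btrfs check /dev/sda
btrfs check -r 5752616386560 /dev/sda
btrfs check -r 4844272943104 /dev/sda
```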
Normally what you'd want is for it to use one of the backup roots, but it's consistently running into a problem with all of them when using the recovery mount option.

--
Chris Murphy
Re: raid6 file system in a bad state
On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson wrote:
> superblock: bytenr=65536, device=/dev/sda
> ---------------------------------------------------------
> generation    161562
> root          5752616386560
>
> superblock: bytenr=65536, device=/dev/sdh
> ---------------------------------------------------------
> generation    161474
> root          4844272943104

OK so most obvious is that the bad super is many generations behind the good super. That's expected given all the write errors.

> root@castor:~/logs# btrfs-find-root /dev/sda
> parent transid verify failed on 5752357961728 wanted 161562 found 159746
> parent transid verify failed on 5752357961728 wanted 161562 found 159746
> Couldn't setup extent tree
> Superblock thinks the generation is 161562
> Superblock thinks the level is 1

This squares with the good super. So btrfs-find-root is using a good super. I don't know what 5752357961728 is for; maybe it's possible to read that with 'btrfs-debug-tree -b 5752357961728' and see what comes back. This is not the tree root according to the super though. So what do you get for 'btrfs-debug-tree -b 5752616386560'?

Going back to your logs:

[   38.810575] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[   38.810595] NFSD: starting 90-second grace period (net b12e5b80)
[  241.292816] INFO: task bfad_worker:234 blocked for more than 120 seconds.
[  241.299135]       Not tainted 4.7.0-0.bpo.1-amd64 #1
[  241.305645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I don't know what this kernel is. I think you'd be better off with stable 4.7.7 or 4.8.1 for this work, so you're not running into a bunch of weird blocked task problems in addition to whatever is going on with the fs.
[ 20.552205] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 3 transid 161562 /dev/sdd
[ 20.552372] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 5 transid 161562 /dev/sdf
[ 20.552524] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 6 transid 161562 /dev/sde
[ 20.552689] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 4 transid 161562 /dev/sdg
[ 20.552858] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 1 transid 161562 /dev/sda
[ 669.843166] BTRFS warning (device sda): devid 2 uuid dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c is missing
[232572.871243] sd 0:0:8:0: [sdh] tag#4 Sense Key : Medium Error [current]

Two items missing, in effect, for this failed read: one literally missing, and the other missing due to an unrecoverable read error. The fact it's not trying to fix anything suggests it hasn't really finished mounting; there must be something wrong where it either just gets confused and won't fix (because it might make things worse), or there isn't redundancy.

[52799.495999] mce: [Hardware Error]: Machine check events logged
[53249.491975] mce: [Hardware Error]: Machine check events logged
[231298.005594] mce: [Hardware Error]: Machine check events logged

Bunch of other hardware issues... I *really* think you need to get the hardware issues sorted out before working on this file system, unless you just don't care that much about it. There are already enough unknowns without adding whatever effect the hardware issues are having while trying to repair things, or even understand what's going on.
Re: RAID system with adaption to changed number of disks
On Tue, Oct 11, 2016 at 8:14 AM, Philip Louis Moetteli wrote:
>
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:

You should under no circumstances use RAID5/6 for anything other than test and throw-away data. It has several known issues that will eat your data. Total data loss is a real possibility. (The capability to even create raid5/6 filesystems should imho be removed from btrfs until this changes.)

> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
> Thanks a lot for your help!

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID system with adaption to changed number of disks
I think you just described all the benefits of btrfs in that type of configuration. Unfortunately, after btrfs RAID 5 & 6 was marked as OK, it got marked as "it will eat your data" (and there is a ton of people in random places popping up with RAID 5 & 6 arrays that just killed their data).

On 11 October 2016 at 16:14, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?
>
> Thanks a lot for your help!
Re: RAID system with adaption to changed number of disks
On Tue, Oct 11, 2016 at 03:14:30PM +, Philip Louis Moetteli wrote:
> Hello,
>
> I have to build a RAID 6 with the following 3 requirements:
>
> • Use different kinds of disks with different sizes.
> • When a disk fails and there's enough space, the RAID should be able
> to reconstruct itself out of the degraded state. Meaning, if I have e. g. a
> RAID with 8 disks and 1 fails, I should be able to chose to transform this in
> a non-degraded (!) RAID with 7 disks.
> • Also the other way round: If I add a disk of what size ever, it
> should redistribute the data, so that it becomes a RAID with 9 disks.
>
> I don’t care, if I have to do it manually.
> I don’t care so much about speed either.
>
> Is BTrFS capable of doing that?

1) Take a look at http://carfax.org.uk/btrfs-usage/ which will tell you how much space you can get out of a btrfs array with different sized devices.

2) Btrfs's parity RAID implementation is not in good shape right now. It has known data corruption issues, and should not be used in production.

3) The redistribution of space is something that btrfs can do. It needs to be triggered manually at the moment, but it definitely works.

Hugo.

-- Hugo Mills | We are all lying in the gutter, but some of us are hugo@... carfax.org.uk | looking at the stars. http://carfax.org.uk/ | PGP: E2AB1DE4 | Oscar Wilde
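The calculator linked in point 1 models btrfs's chunk allocator, which always allocates stripes across the devices with the most free space. A rough sketch of that greedy model for RAID6 (my own approximation, not the calculator's actual code; the minimum stripe width `min_devs` is an assumption):

```python
def raid6_usable(sizes, min_devs=3):
    """Approximate usable capacity of a btrfs RAID6 array of mixed-size devices.

    Greedy model: each chunk stripes across every device that still has
    free space, and two stripes of every chunk hold parity.
    """
    free = list(sizes)
    usable = 0
    while True:
        avail = [f for f in free if f > 0]
        n = len(avail)
        if n < min_devs:
            break  # too few devices left to form another RAID6 stripe
        step = min(avail)          # allocate until the smallest device fills up
        usable += step * (n - 2)   # n stripes per chunk, minus 2 for parity
        free = [f - step if f > 0 else f for f in free]
    return usable

# Six equal 4 TB disks: 4/6 of raw capacity is usable
print(raid6_usable([4000] * 6))               # 16000
# One 8 TB disk plus three 4 TB disks: the extra 4 TB is stranded
print(raid6_usable([8000, 4000, 4000, 4000]))  # 8000
```

This matches the second question in the original post: after a device disappears, the remaining free space (if any) determines whether the data can be restriped onto fewer disks.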
Re: RAID system with adaption to changed number of disks
On 2016-10-11 11:14, Philip Louis Moetteli wrote: Hello, I have to build a RAID 6 with the following 3 requirements: • Use different kinds of disks with different sizes. • When a disk fails and there's enough space, the RAID should be able to reconstruct itself out of the degraded state. Meaning, if I have e. g. a RAID with 8 disks and 1 fails, I should be able to chose to transform this in a non-degraded (!) RAID with 7 disks. • Also the other way round: If I add a disk of what size ever, it should redistribute the data, so that it becomes a RAID with 9 disks. I don’t care, if I have to do it manually. I don’t care so much about speed either. Is BTrFS capable of doing that? In theory yes. In practice, BTRFS RAID5/6 mode should not be used in production due to a number of known serious issues relating to rebuilding and reshaping arrays.
Re: btrfs bio linked list corruption.
On 10/11/2016 10:45 AM, Dave Jones wrote:
> This is from Linus' current tree, with Al's iovec fixups on top.
>
> [ cut here ]
> WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> c9d87458 8d32007c c9d874a8
> c9d87498 8d07a6c1 00210246 88050388e880
> 880503878b80 e8806648 e8c06600 880502808008
> Call Trace:
> [] dump_stack+0x4f/0x73
> [] __warn+0xc1/0xe0
> [] warn_slowpath_fmt+0x5a/0x80
> [] __list_add+0x89/0xb0
> [] blk_sq_make_request+0x2f8/0x350

	/*
	 * A task plug currently exists. Since this is completely lockless,
	 * utilize that to temporarily store requests until the task is
	 * either done or scheduled away.
	 */
	plug = current->plug;
	if (plug) {
		blk_mq_bio_to_request(rq, bio);
		if (!request_count)
			trace_block_plug(q);

		blk_mq_put_ctx(data.ctx);

		if (request_count >= BLK_MAX_REQUEST_COUNT) {
			blk_flush_plug_list(plug, false);
			trace_block_plug(q);
		}

		list_add_tail(&rq->queuelist, &plug->mq_list);
		^^^^^^^^^^^^^

Dave, is this where we're crashing? This seems strange.

-chris
Re: btrfs bio linked list corruption.
On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
> On 10/11/2016 10:45 AM, Dave Jones wrote:
> > This is from Linus' current tree, with Al's iovec fixups on top.
> >
> > [ cut here ]
> > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
> > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
> > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
> > c9d87458 8d32007c c9d874a8
> > c9d87498 8d07a6c1 00210246 88050388e880
> > 880503878b80 e8806648 e8c06600 880502808008
> > Call Trace:
> > [] dump_stack+0x4f/0x73
> > [] __warn+0xc1/0xe0
> > [] warn_slowpath_fmt+0x5a/0x80
> > [] __list_add+0x89/0xb0
> > [] blk_sq_make_request+0x2f8/0x350
>
>	/*
>	 * A task plug currently exists. Since this is completely lockless,
>	 * utilize that to temporarily store requests until the task is
>	 * either done or scheduled away.
>	 */
>	plug = current->plug;
>	if (plug) {
>		blk_mq_bio_to_request(rq, bio);
>		if (!request_count)
>			trace_block_plug(q);
>
>		blk_mq_put_ctx(data.ctx);
>
>		if (request_count >= BLK_MAX_REQUEST_COUNT) {
>			blk_flush_plug_list(plug, false);
>			trace_block_plug(q);
>		}
>
>		list_add_tail(&rq->queuelist, &plug->mq_list);
>		^^^^^^^^^^^^^
>
> Dave, is this where we're crashing? This seems strange.

According to objdump -S ..

8130a1b7:	48 8b 70 50	mov    0x50(%rax),%rsi
		list_add_tail(&rq->queuelist, &ctx->rq_list);
8130a1bb:	48 8d 50 48	lea    0x48(%rax),%rdx
8130a1bf:	48 89 45 a8	mov    %rax,-0x58(%rbp)
8130a1c3:	e8 38 44 03 00	callq  8133e600 <__list_add>
		blk_mq_hctx_mark_pending(hctx, ctx);
8130a1c8:	48 8b 45 a8	mov    -0x58(%rbp),%rax
8130a1cc:	4c 89 ff	mov    %r15,%rdi

That looks like the list_add_tail from __blk_mq_insert_req_list

Dave
RE: raid6 file system in a bad state
> > > Bad superblocks can't be a good thing and would only cause confusion. > I'd think that a known bad superblock would be ignored at mount time > and even by btrfs-find-root, or maybe even replaced like any other kind > of known bad metadata where good copies are available. > > btrfs-show-super -f /dev/sda > btrfs-show-super -f /dev/sdh > > > Find out what the difference is between good and bad supers. > root@castor:~# btrfs-show-super -f /dev/sda superblock: bytenr=65536, device=/dev/sda - csum_type 0 (crc32c) csum_size 4 csum0x45278835 [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] fsid73ed01df-fb2a-4b27-b6fc-12a57da934bd label generation 161562 root5752616386560 sys_array_size 354 chunk_root_generation 156893 root_level 1 chunk_root 20971520 chunk_root_level1 log_root0 log_root_transid0 log_root_level 0 total_bytes 18003557892096 bytes_used 7107627130880 sectorsize 4096 nodesize16384 leafsize16384 stripesize 4096 root_dir6 num_devices 6 compat_flags0x0 compat_ro_flags 0x0 incompat_flags 0xe1 ( MIXED_BACKREF | BIG_METADATA | EXTENDED_IREF | RAID56 ) cache_generation161562 uuid_tree_generation161562 dev_item.uuid 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4 dev_item.fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd [match] dev_item.type 0 dev_item.total_bytes3000592982016 dev_item.bytes_used 1800957198336 dev_item.io_align 4096 dev_item.io_width 4096 dev_item.sector_size4096 dev_item.devid 1 dev_item.dev_group 0 dev_item.seek_speed 0 dev_item.bandwidth 0 dev_item.generation 0 sys_chunk_array[2048]: item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0) chunk length 4194304 owner 2 stripe_len 65536 type SYSTEM num_stripes 1 stripe 0 devid 1 offset 0 dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4 item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) chunk length 11010048 owner 2 stripe_len 65536 type SYSTEM|RAID6 num_stripes 6 stripe 0 devid 6 offset 1048576 dev uuid: 390a1fd8-cc6c-40e7-b0b5-88ca7dcbcc32 stripe 1 devid 5 offset 1048576 dev uuid: 
2df974c5-9dde-4062-81e9-c613db62 stripe 2 devid 4 offset 1048576 dev uuid: dce3d159-721d-4859-9955-37a03769bb0d stripe 3 devid 3 offset 1048576 dev uuid: 6f7142db-824c-4791-a5b2-d6ce11c81c8f stripe 4 devid 2 offset 1048576 dev uuid: dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c stripe 5 devid 1 offset 20971520 dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4 backup_roots[4]: backup 0: backup_tree_root: 5752437456896 gen: 161561 level: 1 backup_chunk_root: 20971520gen: 156893 level: 1 backup_extent_root: 5752385224704 gen: 161561 level: 2 backup_fs_root: 124387328 gen: 74008 level: 0 backup_dev_root:5752437587968 gen: 161561 level: 1 backup_csum_root: 5752389615616 gen: 161561 level: 3 backup_total_bytes: 18003557892096 backup_bytes_used: 7112579833856 backup_num_devices: 6 backup 1: backup_tree_root: 5752616386560 gen: 161562 level: 1 backup_chunk_root: 20971520gen: 156893 level: 1 backup_extent_root: 5752649416704 gen: 161563 level: 2 backup_fs_root: 124387328 gen: 74008 level: 0 backup_dev_root:5752616501248 gen: 161562 level: 1 backup_csum_root: 5752650203136 gen: 161563 level: 3 backup_total_bytes: 18003557892096 backup_bytes_used: 7107602407424 backup_num_devices: 6 backup 2: backup_tree_root: 5752112103424 gen: 161559 level: 1 backup_chunk_root: 20971520gen: 156893 level: 1 backup_extent_root: 5752207409152 gen: 161560 level: 2
Re: raid6 file system in a bad state
On Tue, Oct 11, 2016 at 9:52 AM, Jason D. Michaelsonwrote: >> btrfs rescue super-recover -v > > root@castor:~/logs# btrfs rescue super-recover -v /dev/sda > All Devices: > Device: id = 2, name = /dev/sdh > Device: id = 3, name = /dev/sdd > Device: id = 5, name = /dev/sdf > Device: id = 6, name = /dev/sde > Device: id = 4, name = /dev/sdg > Device: id = 1, name = /dev/sda > > Before Recovering: > [All good supers]: > device name = /dev/sdd > superblock bytenr = 65536 > > device name = /dev/sdd > superblock bytenr = 67108864 > > device name = /dev/sdd > superblock bytenr = 274877906944 > > device name = /dev/sdf > superblock bytenr = 65536 > > device name = /dev/sdf > superblock bytenr = 67108864 > > device name = /dev/sdf > superblock bytenr = 274877906944 > > device name = /dev/sde > superblock bytenr = 65536 > > device name = /dev/sde > superblock bytenr = 67108864 > > device name = /dev/sde > superblock bytenr = 274877906944 > > device name = /dev/sdg > superblock bytenr = 65536 > > device name = /dev/sdg > superblock bytenr = 67108864 > > device name = /dev/sdg > superblock bytenr = 274877906944 > > device name = /dev/sda > superblock bytenr = 65536 > > device name = /dev/sda > superblock bytenr = 67108864 > > device name = /dev/sda > superblock bytenr = 274877906944 > > [All bad supers]: > device name = /dev/sdh > superblock bytenr = 65536 > > device name = /dev/sdh > superblock bytenr = 67108864 > > device name = /dev/sdh > superblock bytenr = 274877906944 > > > Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are > you sure? [y/N]: n > Aborted to recover bad superblocks > > I aborted this waiting for instructions on whether to proceed from the list. Bad superblocks can't be a good thing and would only cause confusion. I'd think that a known bad superblock would be ignored at mount time and even by btrfs-find-root, or maybe even replaced like any other kind of known bad metadata where good copies are available. 
btrfs-show-super -f /dev/sda
btrfs-show-super -f /dev/sdh

Find out what the difference is between good and bad supers.

-- Chris Murphy
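One illustrative way to run that comparison in a single step (device names are the ones from this thread; the temp-file paths are examples):

```shell
# Dump both superblocks and diff them. Fields like generation, root and
# the backup_roots entries are where the good and bad supers diverge.
btrfs-show-super -f /dev/sda > /tmp/super-good.txt
btrfs-show-super -f /dev/sdh > /tmp/super-bad.txt
diff -u /tmp/super-good.txt /tmp/super-bad.txt
```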
RAID system with adaption to changed number of disks
Hello,

I have to build a RAID 6 with the following 3 requirements:

• Use different kinds of disks with different sizes.
• When a disk fails and there's enough space, the RAID should be able to reconstruct itself out of the degraded state. Meaning, if I have e. g. a RAID with 8 disks and 1 fails, I should be able to choose to transform this into a non-degraded (!) RAID with 7 disks.
• Also the other way round: If I add a disk of whatever size, it should redistribute the data, so that it becomes a RAID with 9 disks.

I don’t care if I have to do it manually.
I don’t care so much about speed either.

Is BTrFS capable of doing that?

Thanks a lot for your help!
RE: raid6 file system in a bad state
> -Original Message- > From: ch...@colorremedies.com [mailto:ch...@colorremedies.com] On > Behalf Of Chris Murphy > Sent: Monday, October 10, 2016 11:23 PM > To: Jason D. Michaelson > Cc: Chris Murphy; Btrfs BTRFS > Subject: Re: raid6 file system in a bad state > > What do you get for > > btrfs-find-root root@castor:~/logs# btrfs-find-root /dev/sda parent transid verify failed on 5752357961728 wanted 161562 found 159746 parent transid verify failed on 5752357961728 wanted 161562 found 159746 Couldn't setup extent tree Superblock thinks the generation is 161562 Superblock thinks the level is 1 There's no further output, and btrfs-find-root is pegged at 100%. At the moment, the perceived bad disc is connected. I received the same results without as well. > btrfs rescue super-recover -v root@castor:~/logs# btrfs rescue super-recover -v /dev/sda All Devices: Device: id = 2, name = /dev/sdh Device: id = 3, name = /dev/sdd Device: id = 5, name = /dev/sdf Device: id = 6, name = /dev/sde Device: id = 4, name = /dev/sdg Device: id = 1, name = /dev/sda Before Recovering: [All good supers]: device name = /dev/sdd superblock bytenr = 65536 device name = /dev/sdd superblock bytenr = 67108864 device name = /dev/sdd superblock bytenr = 274877906944 device name = /dev/sdf superblock bytenr = 65536 device name = /dev/sdf superblock bytenr = 67108864 device name = /dev/sdf superblock bytenr = 274877906944 device name = /dev/sde superblock bytenr = 65536 device name = /dev/sde superblock bytenr = 67108864 device name = /dev/sde superblock bytenr = 274877906944 device name = /dev/sdg superblock bytenr = 65536 device name = /dev/sdg superblock bytenr = 67108864 device name = /dev/sdg superblock bytenr = 274877906944 device name = /dev/sda superblock bytenr = 65536 device name = /dev/sda superblock bytenr = 67108864 device name = /dev/sda superblock bytenr = 274877906944 [All bad supers]: device name = /dev/sdh superblock bytenr = 65536 device name = /dev/sdh superblock bytenr = 
67108864 device name = /dev/sdh superblock bytenr = 274877906944 Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are you sure? [y/N]: n Aborted to recover bad superblocks I aborted this waiting for instructions on whether to proceed from the list. > > > > It shouldn't matter which dev you pick, unless it face plants, then try > another. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] btrfs: make max inline data can be equal to sectorsize
On Tue, Oct 11, 2016 at 12:47 AM, Wang Xiaoguangwrote: > If we use mount option "-o max_inline=sectorsize", say 4096, indeed > even for a fresh fs, say nodesize is 16k, we can not make the first > 4k data completely inline, I found this conditon causing this issue: > !compressed_size && (actual_end & (root->sectorsize - 1)) == 0 > > If it retuns true, we'll not make data inline. For 4k sectorsize, > 0~4094 dara range, we can make it inline, but 0~4095, it can not. > I don't think this limition is useful, so here remove it which will > make max inline data can be equal to sectorsize. > > Signed-off-by: Wang Xiaoguang > --- > fs/btrfs/inode.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index ea15520..c0db393 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct > btrfs_root *root, > if (start > 0 || > actual_end > root->sectorsize || > data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) || > - (!compressed_size && > - (actual_end & (root->sectorsize - 1)) == 0) || > end + 1 < isize || > data_len > root->fs_info->max_inline) { > return 1; > -- > 2.9.0 Before making any further changes to inline data, does it make sense to find the source of corruption Zygo has been experiencing? That's in the "btrfs rare silent data corruption with kernel data leak" thread. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
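The effect of the removed clause is easier to see outside the kernel. Below is an illustrative Python stand-in for the disqualifying checks in cow_file_range_inline(), before and after the patch (argument names mirror the C variables; the BTRFS_MAX_INLINE_DATA_SIZE cap is omitted for brevity, so this is a sketch, not the actual btrfs logic):

```python
def rejects_inline_old(start, actual_end, data_len, compressed_size,
                       sectorsize, max_inline, end, isize):
    # pre-patch: any one of these conditions blocks inlining,
    # including the sector-alignment clause being removed
    return (start > 0 or
            actual_end > sectorsize or
            (not compressed_size and actual_end % sectorsize == 0) or
            end + 1 < isize or
            data_len > max_inline)

def rejects_inline_new(start, actual_end, data_len, compressed_size,
                       sectorsize, max_inline, end, isize):
    # post-patch: same checks minus the alignment clause
    return (start > 0 or
            actual_end > sectorsize or
            end + 1 < isize or
            data_len > max_inline)

# A 4096-byte uncompressed file, 4k sectors, mounted with -o max_inline=4096:
args = dict(start=0, actual_end=4096, data_len=4096, compressed_size=0,
            sectorsize=4096, max_inline=4096, end=4095, isize=4096)
print(rejects_inline_old(**args))  # True  -> could not be inlined before
print(rejects_inline_new(**args))  # False -> inline-able after the patch
```

A 4095-byte file passes both versions, which is exactly the asymmetry the commit message complains about.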
Re: btrfs bio linked list corruption.
On Tue, Oct 11, 2016 at 11:20:41AM -0400, Chris Mason wrote: > > > On 10/11/2016 11:19 AM, Dave Jones wrote: > > On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote: > > > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote: > > > > This is from Linus' current tree, with Al's iovec fixups on top. > > > > > > Those iovec fixups are in the current tree... > > > > ah yeah, git quietly dropped my local copy when I rebased so I didn't > > notice. > > > > > TBH, I don't see anything > > > in splice-related stuff that could come anywhere near that (short of > > > some general memory corruption having random effects of that sort). > > > > > > Could you try to bisect that sucker, or is it too hard to reproduce? > > > > Only hit it the once overnight so far. Will see if I can find a better way > > to > > reproduce today. > > This call trace is reading metadata so we can finish the truncate. I'd > say adding more memory pressure would make it happen more often. That story checks out. There were a bunch of oom's in the log before this. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] Btrfs
Hi Linus, My for-linus-4.9 has our merge window pull: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.9 This is later than normal because I was tracking down a use-after-free during btrfs/101 in xfstests. I had hoped to fix up the offending patch, but wasn't happy with the size of the changes at this point in the merge window. The use-after-free was enough of a corner case that I didn't want to rebase things out at this point. So instead the top of the pull is my revert, and the rest of these were prepped by Dave Sterba (thanks Dave!). This is a big variety of fixes and cleanups. Liu Bo continues to fixup fuzzer related problems, and some of Josef's cleanups are prep for his bigger extent buffer changes (slated for v4.10). Liu Bo (13) commits (+207/-36): Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf (+5/-1) Btrfs: return gracefully from balance if fs tree is corrupted (+17/-6) Btrfs: improve check_node to avoid reading corrupted nodes (+28/-4) Btrfs: add error handling for extent buffer in print tree (+7/-0) Btrfs: memset to avoid stale content in btree node block (+11/-0) Btrfs: bail out if block group has different mixed flag (+14/-0) Btrfs: memset to avoid stale content in btree leaf (+28/-19) Btrfs: fix memory leak in reading btree blocks (+9/-0) Btrfs: fix memory leak of block group cache (+75/-0) Btrfs: kill BUG_ON in run_delayed_tree_ref (+7/-1) Btrfs: remove BUG_ON in start_transaction (+1/-4) Btrfs: fix memory leak in do_walk_down (+1/-0) Btrfs: remove BUG() in raid56 (+4/-1) Jeff Mahoney (7) commits (+849/-902): btrfs: btrfs_debug should consume fs_info when DEBUG is not defined (+10/-4) btrfs: clean the old superblocks before freeing the device (+11/-27) btrfs: convert send's verbose_printk to btrfs_debug (+38/-27) btrfs: convert printk(KERN_* to use pr_* calls (+205/-275) btrfs: convert pr_* to btrfs_* where possible (+231/-177) btrfs: unsplit printed strings (+324/-391) btrfs: add dynamic debug support 
(+30/-1) Josef Bacik (5) commits (+178/-156): Btrfs: kill the start argument to read_extent_buffer_pages (+15/-28) Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written (+33/-8) Btrfs: add a flags field to btrfs_fs_info (+99/-109) Btrfs: don't leak reloc root nodes on error (+4/-0) Btrfs: don't BUG() during drop snapshot (+27/-11) Goldwyn Rodrigues (3) commits (+3/-18): btrfs: Do not reassign count in btrfs_run_delayed_refs (+0/-1) btrfs: Remove already completed TODO comment (+0/-2) btrfs: parent_start initialization cleanup (+3/-15) Luis Henriques (2) commits (+0/-4): btrfs: Fix warning "variable ‘blocksize’ set but not used" (+0/-2) btrfs: Fix warning "variable ‘gen’ set but not used" (+0/-2) Eric Sandeen (1) commits (+1/-1): btrfs: fix perms on demonstration debugfs interface Anand Jain (1) commits (+20/-6): btrfs: fix a possible umount deadlock Lu Fengqi (1) commits (+369/-10): btrfs: fix check_shared for fiemap ioctl Chris Mason (1) commits (+15/-11): Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" Masahiro Yamada (1) commits (+8/-28): btrfs: squash lines for simple wrapper functions Qu Wenruo (1) commits (+37/-25): btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset Arnd Bergmann (1) commits (+7/-10): btrfs: fix btrfs_no_printk stub helper David Sterba (1) commits (+9/-0): btrfs: create example debugfs file only in debugging build Naohiro Aota (1) commits (+11/-15): btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs Total: (39) commits (+1714/-1222) fs/btrfs/backref.c| 409 ++ fs/btrfs/btrfs_inode.h| 11 -- fs/btrfs/check-integrity.c| 342 +++ fs/btrfs/compression.c| 6 +- fs/btrfs/ctree.c | 56 ++ fs/btrfs/ctree.h | 116 fs/btrfs/delayed-inode.c | 25 ++- fs/btrfs/delayed-ref.c| 15 +- fs/btrfs/dev-replace.c| 21 ++- fs/btrfs/dir-item.c | 7 +- fs/btrfs/disk-io.c| 237 fs/btrfs/disk-io.h| 2 + fs/btrfs/extent-tree.c| 198 +++- fs/btrfs/extent_io.c | 170 +++--- fs/btrfs/extent_io.h | 4 +- 
fs/btrfs/file.c | 43 - fs/btrfs/free-space-cache.c | 21 ++- fs/btrfs/free-space-cache.h | 6 +- fs/btrfs/free-space-tree.c| 20 ++- fs/btrfs/inode-map.c | 31 ++-- fs/btrfs/inode.c | 70 +--- fs/btrfs/ioctl.c | 14 +- fs/btrfs/lzo.c| 6 +- fs/btrfs/ordered-data.c | 4 +- fs/btrfs/print-tree.c | 93 +- fs/btrfs/qgroup.c | 77 fs/btrfs/raid56.c | 5 +-
btrfs bio linked list corruption.
This is from Linus' current tree, with Al's iovec fixups on top. [ cut here ] WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80). CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 c9d87458 8d32007c c9d874a8 c9d87498 8d07a6c1 00210246 88050388e880 880503878b80 e8806648 e8c06600 880502808008 Call Trace: [] dump_stack+0x4f/0x73 [] __warn+0xc1/0xe0 [] warn_slowpath_fmt+0x5a/0x80 [] __list_add+0x89/0xb0 [] blk_sq_make_request+0x2f8/0x350 [] ? generic_make_request+0xec/0x240 [] generic_make_request+0xf9/0x240 [] submit_bio+0x78/0x150 [] ? __percpu_counter_add+0x85/0xb0 [] btrfs_map_bio+0x19e/0x330 [btrfs] [] btree_submit_bio_hook+0xfa/0x110 [btrfs] [] submit_one_bio+0x65/0xa0 [btrfs] [] read_extent_buffer_pages+0x2f0/0x3d0 [btrfs] [] ? free_root_pointers+0x60/0x60 [btrfs] [] btree_read_extent_buffer_pages.constprop.55+0xa8/0x110 [btrfs] [] read_tree_block+0x2d/0x50 [btrfs] [] read_block_for_search.isra.33+0x134/0x330 [btrfs] [] ? _raw_write_unlock+0x2c/0x50 [] ? unlock_up+0x16c/0x1a0 [btrfs] [] btrfs_search_slot+0x450/0xa40 [btrfs] [] btrfs_del_csums+0xe3/0x2e0 [btrfs] [] __btrfs_free_extent.isra.82+0x32d/0xc90 [btrfs] [] __btrfs_run_delayed_refs+0x4d3/0x1010 [btrfs] [] ? debug_smp_processor_id+0x17/0x20 [] ? get_lock_stats+0x19/0x50 [] btrfs_run_delayed_refs+0x9c/0x2d0 [btrfs] [] btrfs_truncate_inode_items+0x888/0xda0 [btrfs] [] btrfs_truncate+0xe5/0x2b0 [btrfs] [] btrfs_setattr+0x249/0x360 [btrfs] [] notify_change+0x252/0x440 [] do_truncate+0x6e/0xc0 [] do_sys_ftruncate.constprop.19+0x10c/0x170 [] ? 
__this_cpu_preempt_check+0x13/0x20 [] SyS_ftruncate+0x9/0x10 [] do_syscall_64+0x5c/0x170 [] entry_SYSCALL64_slow_path+0x25/0x25 --[ end trace 906673a2f703b373 ]---
Re: btrfs bio linked list corruption.
On 10/11/2016 11:19 AM, Dave Jones wrote: On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote: > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote: > > This is from Linus' current tree, with Al's iovec fixups on top. > > Those iovec fixups are in the current tree... ah yeah, git quietly dropped my local copy when I rebased so I didn't notice. > TBH, I don't see anything > in splice-related stuff that could come anywhere near that (short of > some general memory corruption having random effects of that sort). > > Could you try to bisect that sucker, or is it too hard to reproduce? Only hit it the once overnight so far. Will see if I can find a better way to reproduce today. This call trace is reading metadata so we can finish the truncate. I'd say adding more memory pressure would make it happen more often. I'll try to trigger. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs bio linked list corruption.
On Tue, Oct 11, 2016 at 04:11:39PM +0100, Al Viro wrote: > On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote: > > This is from Linus' current tree, with Al's iovec fixups on top. > > Those iovec fixups are in the current tree... ah yeah, git quietly dropped my local copy when I rebased so I didn't notice. > TBH, I don't see anything > in splice-related stuff that could come anywhere near that (short of > some general memory corruption having random effects of that sort). > > Could you try to bisect that sucker, or is it too hard to reproduce? Only hit it the once overnight so far. Will see if I can find a better way to reproduce today. Dave
Re: btrfs bio linked list corruption.
On Tue, Oct 11, 2016 at 10:45:08AM -0400, Dave Jones wrote: > This is from Linus' current tree, with Al's iovec fixups on top. Those iovec fixups are in the current tree... TBH, I don't see anything in splice-related stuff that could come anywhere near that (short of some general memory corruption having random effects of that sort). Could you try to bisect that sucker, or is it too hard to reproduce?
Re: [PATCH v2] Btrfs: kill BUG_ON in do_relocation
On Fri, Sep 23, 2016 at 02:05:04PM -0700, Liu Bo wrote: > While updating btree, we try to push items between sibling > nodes/leaves in order to keep height as low as possible. > But we don't memset the original places with zero when > pushing items so that we could end up leaving stale content > in nodes/leaves. One may read the above stale content by > increasing btree blocks' @nritems. > > One case I've come across is that in fs tree, a leaf has two > parent nodes, hence running balance ends up with processing > this leaf with two parent nodes, but it can only reach the > valid parent node through btrfs_search_slot, so it'd be like, > > do_relocation > for P in all parent nodes of block A: > if !P->eb: > btrfs_search_slot(key); --> get path from P to A. > if lowest: > BUG_ON(A->bytenr != bytenr of A recorded in P); > btrfs_cow_block(P, A); --> change A's bytenr in P. > > After btrfs_cow_block, P has the new bytenr of A, but with the > same @key, we get the same path again, and get panic by BUG_ON. > > Note that this is only happening in a corrupted fs, for a > regular fs in which we have correct @nritems so that we won't > read stale content in any case. > > Reviewed-by: Josef Bacik> Signed-off-by: Liu Bo > --- > v2: - use new internal error EFSCORRUPTED as "Filesystem is corrupted", > suggested by David Sterba. Sorry I steered it to EFSCORRUPTED, we should introduce the error code separately and audit the call chains. I'll drop the parts and change it back to EIO. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: corrupt leaf, slot offset bad
On Tue, Oct 11, 2016 at 02:48:09PM +0200, David Sterba wrote: > Hi, > > looks like a lot of random bitflips. > > On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote: > > item 109 has a few strange chars in its name (and it's truncated): > > 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a > > > > item 105 key (261 DIR_ITEM 54556048) itemoff 11723 itemsize 72 > > location key (606286 INODE_ITEM 0) type FILE > > namelen 42 datalen 0 name: > > python2-gobject-3.20.1-1-x86_64.pkg.tar.xz > > item 106 key (261 DIR_ITEM 56363628) itemoff 11660 itemsize 63 > > location key (894298 INODE_ITEM 0) type FILE > > namelen 33 datalen 0 name: unrar-1:5.4.5-1-x86_64.pkg.tar.xz > > item 107 key (261 DIR_ITEM 66963651) itemoff 11600 itemsize 60 > > location key (1178 INODE_ITEM 0) type FILE > > namelen 30 datalen 0 name: glibc-2.23-5-x86_64.pkg.tar.xz > > item 108 key (261 DIR_ITEM 68561395) itemoff 11532 itemsize 68 > > location key (660578 INODE_ITEM 0) type FILE > > namelen 38 datalen 0 name: > > squashfs-tools-4.3-4-x86_64.pkg.tar.xz > > item 109 key (261 DIR_ITEM 76859450) itemoff 11483 itemsize 65 > > location key (2397184 UNKNOWN.0 7091317839824617472) type 45 > > namelen 13102 datalen 13358 name: 1-x86_64.pkg.tar.xzb > > namelen must be smaller than 255, but the number itself does not look > like a bitflip (0x332e), the name looks like a fragment of. > > The location key is random garbage, likely an overwritten memory, > 7091317839824617472 == 0x62696c010023 contains ascii 'bil', the key > type is unknown but should be INODE_ITEM. > > > data > > item 110 key (261 DIR_ITEM 9799832789237604651) itemoff 11405 itemsize > > 62 > > location key (388547 INODE_ITEM 0) type FILE > > namelen 32 datalen 0 name: intltool-0.51.0-1-any.pkg.tar.xz > > item 111 key (261 DIR_ITEM 81211850) itemoff 11344 itemsize 131133 > > itemsize 131133 == 0x2003d is a clear bitflip, 0x3d == 61, corresponds > to the expected item size. 
> There's possibly other random bitflips in the keys or other structures.
> It's hard to estimate the damage and thus the scope of restorable data.

That makes sense; since this is an SSD, we may have only one copy of the metadata.

Thanks,
-liubo
Re: corrupt leaf, slot offset bad
Hi,

looks like a lot of random bitflips.

On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote:
> item 109 has a few strange chars in its name (and it's truncated):
> 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a
>
> item 105 key (261 DIR_ITEM 54556048) itemoff 11723 itemsize 72
>     location key (606286 INODE_ITEM 0) type FILE
>     namelen 42 datalen 0 name: python2-gobject-3.20.1-1-x86_64.pkg.tar.xz
> item 106 key (261 DIR_ITEM 56363628) itemoff 11660 itemsize 63
>     location key (894298 INODE_ITEM 0) type FILE
>     namelen 33 datalen 0 name: unrar-1:5.4.5-1-x86_64.pkg.tar.xz
> item 107 key (261 DIR_ITEM 66963651) itemoff 11600 itemsize 60
>     location key (1178 INODE_ITEM 0) type FILE
>     namelen 30 datalen 0 name: glibc-2.23-5-x86_64.pkg.tar.xz
> item 108 key (261 DIR_ITEM 68561395) itemoff 11532 itemsize 68
>     location key (660578 INODE_ITEM 0) type FILE
>     namelen 38 datalen 0 name: squashfs-tools-4.3-4-x86_64.pkg.tar.xz
> item 109 key (261 DIR_ITEM 76859450) itemoff 11483 itemsize 65
>     location key (2397184 UNKNOWN.0 7091317839824617472) type 45
>     namelen 13102 datalen 13358 name: 1-x86_64.pkg.tar.xzb

namelen must be smaller than 255, but the number itself does not look like a bitflip (0x332e), the name looks like a fragment of.

The location key is random garbage, likely an overwritten memory, 7091317839824617472 == 0x62696c010023 contains ascii 'bil', the key type is unknown but should be INODE_ITEM.

> data
> item 110 key (261 DIR_ITEM 9799832789237604651) itemoff 11405 itemsize 62
>     location key (388547 INODE_ITEM 0) type FILE
>     namelen 32 datalen 0 name: intltool-0.51.0-1-any.pkg.tar.xz
> item 111 key (261 DIR_ITEM 81211850) itemoff 11344 itemsize 131133

itemsize 131133 == 0x2003d is a clear bitflip, 0x3d == 61, corresponds to the expected item size.
Fwd: State of the fuzzer
Hi,

I've now shut down all fuzzer nodes since they only cost money and there is no progress on most of the aforementioned bugs.

Best regards
Lukas

-- Forwarded message --
From: Lukas Lueg
Date: 2016-09-26 11:39 GMT+02:00
Subject: Re: State of the fuzzer
To: linux-btrfs@vger.kernel.org

Hi David, do we have any chance of engagement on those 23 bugs which came out of the last fuzzing round? The nodes have been basically idle for a week, spewing duplicates and variants of what's already known...

Best regards
Lukas

2016-09-20 13:33 GMT+02:00 Lukas Lueg:
> There are now 21 bugs open on bko, most of them crashes and some
> undefined behavior. The nodes are now mostly running idle as no new
> paths are discovered (after around one billion images tested in the
> current run).
>
> My thoughts are to wait until the current bugs have been fixed, then
> restart the whole process from HEAD (together with the corpus of
> ~2,000 seed images discovered by now) and catch new bugs and aborts()
> - we need to get rid of the reachable ones so code coverage can
> improve. After those, I'll change the process to run btrfsck --repair,
> which is slower but has a lot of yet uncovered code.
>
> DigitalOcean has provided some funding for this undertaking so we are
> good on CPU power. Kudos to them.
>
> 2016-09-13 22:28 GMT+02:00 Lukas Lueg:
>> I've booted another instance with btrfs-progs checked out to 2b7c507
>> and collected some bugs which remained from the run before the current
>> one. The current run discovered what qgroups are just three days ago
>> and will spend some time on that. I've also added UBSAN- and
>> MSAN-logging to my setup and there were three bugs found so far (one
>> is already fixed). I will boot a third instance to run lowmem-mode
>> exclusively in the next few days.
>>
>> There are 11 bugs open at the moment, all have a reproducing image
>> attached to them.
The whole list is at >> >> https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW_status=ASSIGNED_status=REOPENED=btrfs=lukas.lueg%40gmail.com=1=exact_id=858441_format=advanced >> >> >> 2016-09-09 16:00 GMT+02:00 David Sterba : >>> On Tue, Sep 06, 2016 at 10:32:28PM +0200, Lukas Lueg wrote: I'm currently fuzzing rev 2076992 and things start to slowly, slowly quiet down. We will probably run out of steam at the end of the week when a total of (roughly) half a billion BTRFS-images have passed by. I will switch revisions to current HEAD and restart the whole process then. A few things: * There are a couple of crashes (mostly segfaults) I have not reported yet. I'll report them if they show up again with the latest revision. >>> >>> Ok. >>> * The coverage-analysis shows assertion failures which are currently silenced. An assertion failure is technically a worse disaster successfully prevented, it still constitutes unexpected/unusable behaviour, though. Do you want assertions to be enabled and images triggering those assertions reported? This is basically the same conundrum as with BUG_ON and abort(). >>> >>> Yes please. I'd like to turn most bugons/assertions into a normal >>> failure report if it would make sense. >>> * A few endless loops entered into by btrfsck are currently unmitigated (see bugs 155621, 155571, 11 and 155151). It would be nice if those had been taken care of by next week if possible. >>> >>> Two of them are fixed, the other two need more work, updating all >>> callers of read_node_slot and the callchain. So you may still see that >>> kind of looping in more images. I don't have an ETA for the fix, I won't >>> be available during the next week. >>> >>> At the moment, the initial sanity checks should catch most of the >>> corrupted values, so I'm expecting that you'll see different classes of >>> problems in the next rounds. >>> >>> The testsuite now contains all images that you reported and we have a >>> fix in git. 
>>> There are more utilities run on the images, there may be
>>> more problems for us to fix.
Re: [PATCH 2/4] btrfs-progs: Make btrfs-debug-tree print all readable strings for inode flags
On Tue, Oct 11, 2016 at 10:18:51AM +0800, Qu Wenruo wrote:
> >> -/* Caller should ensure sizeof(*ret) >= 29 "NODATASUM|NODATACOW|READONLY" */
> >> +#define copy_one_inode_flag(flags, name, empty, dst) ({	\
> >> +	if (flags & BTRFS_INODE_##name) {			\
> >> +		if (!empty)					\
> >> +			strcat(dst, "|");			\
> >> +		strcat(dst, #name);				\
> >> +		empty = 0;					\
> >> +	}							\
> >> +})
> >
> > Can you please avoid using the macro? Or at least make it uppercase so
> > it's visible. Similar in the next patch.
>
> OK, I'll change it to upper case.

Ok.

> The only reason I'm using macro is, inline function can't do
> stringification, or I missed something?

No, that's where macros help. My concern was about the hidden use of a local variable, so at least an all-caps macro name would make it more visible. As this is not going to be used elsewhere, we can live with that.
[PATCH] Btrfs: fix -EINVAL in tree log recovery
From: Robbie Ko

During tree log recovery, a space_cache rebuild or dirty cache writeback may save the free space cache. Extent replay then replays an extent with disk_bytenr and disk_num_bytes, but that range may already have been used for the free space inode, which leads to -EINVAL:

BTRFS: error in btrfs_replay_log:2446: errno=-22 unknown (Failed to recover log tree)

Therefore, do not save the cache while the tree log is being recovered.

Signed-off-by: Robbie Ko
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 665da8f..38b932c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3434,6 +3434,7 @@ again:
 	spin_lock(&block_group->lock);
 	if (block_group->cached != BTRFS_CACHE_FINISHED ||
+	    block_group->fs_info->log_root_recovering ||
 	    !btrfs_test_opt(root->fs_info, SPACE_CACHE)) {
 		/*
 		 * don't bother trying to write stuff out _if_
--
1.9.1
Re: [PATCH] Btrfs: fix enospc in punch hole
Hi Filipe:

btrfs_calc_trunc_metadata_size reserves leafsize + nodesize * (8 - 1); assuming leafsize is the same as nodesize, we reserve 8 nodesizes in total. When splitting a leaf we need 2 paths. If the extent tree level is smaller than 4 that is OK, because the worst case is (leafsize + nodesize * 3) * 2, which is 8 nodesizes. But if the extent tree is deeper than level 4, the worst case needs (leafsize + nodesize * 7) * 2, which is bigger than the reserved size. So we should use btrfs_calc_trans_metadata_size, which takes the leaf-split case into account.

Thanks.
robbieko

Filipe Manana wrote on 2016-10-07 18:18:
On Fri, Oct 7, 2016 at 7:09 AM, robbieko wrote:

From: Robbie Ko

when extent-tree level > BTRFS_MAX_LEVEL / 2, __btrfs_drop_extents -> btrfs_duplicate_item -> setup_leaf_for_split -> split_leaf maybe enospc, because min_size is too small, need use btrfs_calc_trans_metadata_size.

This change log is terrible. You should describe the problem and fix. That is, that hole punching can result in adding new leafs (and as a consequence new nodes) to the tree because when we find file extent items that span beyond the hole range we may end up not deleting them (just adjusting them) and add new file extent items representing holes. And I don't see why this is exclusive for the case where the height of the extent tree is greater than 4 (BTRFS_MAX_LEVEL / 2). The code changes themselves look good to me.
thanks

Signed-off-by: Robbie Ko
---
 fs/btrfs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fea31a4..809ca85 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2322,7 +2322,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	u64 tail_len;
 	u64 orig_start = offset;
 	u64 cur_offset;
-	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
+	u64 min_size = btrfs_calc_trans_metadata_size(root, 1);
 	u64 drop_end;
 	int ret = 0;
 	int err = 0;
@@ -2469,7 +2469,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		ret = -ENOMEM;
 		goto out_free;
 	}
-	rsv->size = btrfs_calc_trunc_metadata_size(root, 1);
+	rsv->size = btrfs_calc_trans_metadata_size(root, 1);
 	rsv->failfast = 1;

 	/*
--
1.9.1
Re: [PATCH] Btrfs: fix fsync deadlock in log_new_dir_dentries
Hi Filipe:

Why did I replace the continue statement with a break statement? Because the path has already been released at that point, it cannot be used any further; we need to jump out of the loop and then go back to the again label.

Supplement: we found an fsync deadlock, i.e. 32021->32020->32028->14431->14436->32021, where the numbers are PIDs.

extent_buffer: start:207060992, len:16384
  locker pid: 32020 read lock
  wait pid: 32021 write lock
extent_buffer: start:14730821632, len:16384
  locker pid: 32028 read lock
  wait pid: 32020 write lock
extent_buffer: start:446503813120, len:16384
  locker pid: 14431 write lock
  wait pid: 32028 read lock
extent_buffer: start:446503845888, len:16384
  locker pid: 14436 write lock
  wait pid: 14431 write lock
extent_buffer: start:446504386560, len:16384
  locker pid: 32021 write lock
  wait pid: 14436 write lock

Thanks.
Robbie Ko

Filipe Manana wrote on 2016-10-07 18:46:
On Fri, Oct 7, 2016 at 11:43 AM, robbieko wrote:

Hi Filipe, I am sorry, I did not express that clearly enough. The numbers are PIDs, and the above are their call traces respectively.

And why did you replace the continue statement with a break statement? Also please avoid mixing inline replies with top posting, it just breaks the thread.

thanks

Thanks.
robbieko

Filipe Manana wrote on 2016-10-07 18:23:
On Fri, Oct 7, 2016 at 11:05 AM, robbieko wrote:

From: Robbie Ko

We found a fsync deadlock ie. 32021->32020->32028->14431->14436->32021, in log_new_dir_dentries, because btrfs_search_forward get path lock, then call btrfs_iget will get another extent_buffer lock, maybe occur deadlock.

What are those numbers? Are they inode numbers? If so you're suggesting a deadlock due to recursive logging of the same inode. However the trace below, and the code change, has nothing to do with that. It's just about btrfs_iget trying to do a search on a btree and attempting to read lock some node/leaf that already has a write lock acquired previously by the same task. Please be more clear on your change logs.

we can release path before call btrfs_iget, avoid deadlock occur.
some process call trace like below:

[ 4077.478852] kworker/u24:10 D 88107fc90640 0 14431 2 0x
[ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 4077.494346] 880ffa56bad0 0046 9000 880ffa56bfd8
[ 4077.502629] 880ffa56bfd8 881016ce21c0 a06ecb26 88101a5d6138
[ 4077.510915] 880ebb5173b0 880ffa56baf8 880ebb517410 881016ce21c0
[ 4077.519202] Call Trace:
[ 4077.528752] [] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
[ 4077.536049] [] ? wake_up_atomic_t+0x30/0x30
[ 4077.542574] [] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
[ 4077.550171] [] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs]
[ 4077.558252] [] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs]
[ 4077.566140] [] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
[ 4077.573928] [] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
[ 4077.582399] [] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
[ 4077.589896] [] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
[ 4077.599632] [] ? start_transaction+0x8d/0x470 [btrfs]
[ 4077.607134] [] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs]
[ 4077.615329] [] ? process_one_work+0x142/0x3d0
[ 4077.622043] [] ? worker_thread+0x109/0x3b0
[ 4077.628459] [] ? manage_workers.isra.26+0x270/0x270
[ 4077.635759] [] ? kthread+0xaf/0xc0
[ 4077.641404] [] ? kthread_create_on_node+0x110/0x110
[ 4077.648696] [] ? ret_from_fork+0x58/0x90
[ 4077.654926] [] ? kthread_create_on_node+0x110/0x110
[ 4078.358087] kworker/u24:15 D 88107fcd0640 0 14436 2 0x
[ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 4078.373574] 880ffa57fad0 0046 9000 880ffa57ffd8
[ 4078.381864] 880ffa57ffd8 88103004d0a0 a06ecb26 88101a5d6138
[ 4078.390163] 880fbeffc298 880ffa57faf8 880fbeffc2f8 88103004d0a0
[ 4078.398466] Call Trace:
[ 4078.408019] [] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs]
[ 4078.415322] [] ? wake_up_atomic_t+0x30/0x30
[ 4078.421844] [] ? btrfs_search_slot+0x79f/0xb10 [btrfs]
[ 4078.429438] [] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs]
[ 4078.437518] [] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs]
[ 4078.445404] [] ? add_delayed_data_ref+0xe2/0x150 [btrfs]
[ 4078.453194] [] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs]
[ 4078.461663] [] ? __set_extent_bit+0x4c0/0x5c0 [btrfs]
[ 4078.469161] [] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs]
[ 4078.478893] [] ? start_transaction+0x8d/0x470 [btrfs]
[ 4078.486388] [] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs]
[ 4078.494561] [] ? process_one_work+0x142/0x3d0
[ 4078.501278] [] ? pwq_activate_delayed_work+0x27/0x40
[ 4078.508673] [] ? worker_thread+0x109/0x3b0
[ 4078.515098] [] ?
[RFC] btrfs: make max inline data can be equal to sectorsize
If we use the mount option "-o max_inline=sectorsize", say 4096, then even on a fresh fs, say with a 16k nodesize, we cannot make the first 4k of data completely inline. I found this condition causing the issue:

	!compressed_size && (actual_end & (root->sectorsize - 1)) == 0

If it returns true, we'll not make the data inline. For a 4k sectorsize, the 0~4094 data range can be made inline, but 0~4095 cannot. I don't think this limitation is useful, so remove it here, which allows the max inline data to be equal to the sectorsize.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ea15520..c0db393 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -267,8 +267,6 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 	if (start > 0 ||
 	    actual_end > root->sectorsize ||
 	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
-	    (!compressed_size &&
-	     (actual_end & (root->sectorsize - 1)) == 0) ||
 	    end + 1 < isize ||
 	    data_len > root->fs_info->max_inline) {
 		return 1;
--
2.9.0