Re: So, does btrfs check lowmem take days? weeks?
On 06/29/2018 01:28 PM, Marc MERLIN wrote: On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote: lowmem repair seems to be going still, but it's been days and -p seems to do absolutely nothing. I'm a afraid you hit a bug in lowmem repair code. By all means, --repair shouldn't really be used unless you're pretty sure the problem is something btrfs check can handle. That's also why --repair is still marked as dangerous. Especially when it's combined with experimental lowmem mode. Understood, but btrfs got corrupted (by itself or not, I don't know) I cannot mount the filesystem read/write I cannot btrfs check --repair it since that code will kill my machine What do I have left? My filesystem is "only" 10TB or so, albeit with a lot of files. Unless you have tons of snapshots and reflinked (deduped) files, it shouldn't take so long. I may have a fair amount. gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 enabling repair mode WARNING: low-memory mode repair support is only partial Checking filesystem on /dev/mapper/dshelf2 UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d Fixed 0 roots. 
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]

My bad. It's almost certainly a bug in the extent checking of lowmem mode, which was reported by Chris too. The extent check was wrong, so the repair did the wrong things.

I have figured out that the bug is that lowmem check can't deal with shared tree blocks in a reloc tree. The fix is simple; you can try the following repo:

https://github.com/Damenly/btrfs-progs/tree/tmp1

Please run lowmem check without --repair first, to be sure whether your filesystem is fine.

Though the bug and the phenomenon are clear enough, before sending my patch I have to make a test image. I have spent a week studying btrfs balance, but it seems a little
Re: So, does btrfs check lowmem take days? weeks?
On 2018年06月29日 13:28, Marc MERLIN wrote: > On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote: >>> lowmem repair seems to be going still, but it's been days and -p seems >>> to do absolutely nothing. >> >> I'm a afraid you hit a bug in lowmem repair code. >> By all means, --repair shouldn't really be used unless you're pretty >> sure the problem is something btrfs check can handle. >> >> That's also why --repair is still marked as dangerous. >> Especially when it's combined with experimental lowmem mode. > > Understood, but btrfs got corrupted (by itself or not, I don't know) > I cannot mount the filesystem read/write > I cannot btrfs check --repair it since that code will kill my machine > What do I have left? Just normal btrfs check, and post the output. If normal check eats up all your memory, btrfs check --mode=lowmem. --repair should be considered as the last method. > >>> My filesystem is "only" 10TB or so, albeit with a lot of files. >> >> Unless you have tons of snapshots and reflinked (deduped) files, it >> shouldn't take so long. > > I may have a fair amount. > gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 > enabling repair mode > WARNING: low-memory mode repair support is only partial > Checking filesystem on /dev/mapper/dshelf2 > UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d > Fixed 0 roots. 
> ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, > owner: 374857, offset: 3407872) wanted: 3, have: 4 > Created new chunk [18457780224000 1073741824] > Delete backref in extent [84302495744 69632] > ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, > owner: 374857, offset: 3407872) wanted: 3, have: 4 > Delete backref in extent [84302495744 69632] > ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, > owner: 374857, offset: 114540544) wanted: 181, have: 240 > Delete backref in extent [125712527360 12214272] > ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, > owner: 374857, offset: 126754816) wanted: 68, have: 115 > Delete backref in extent [125730848768 5111808] > ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, > owner: 374857, offset: 126754816) wanted: 68, have: 115 > Delete backref in extent [125730848768 5111808] > ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, > owner: 374857, offset: 131866624) wanted: 115, have: 143 > Delete backref in extent [125736914944 6037504] > ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, > owner: 374857, offset: 131866624) wanted: 115, have: 143 > Delete backref in extent [125736914944 6037504] > ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, > owner: 374857, offset: 148234240) wanted: 302, have: 431 > Delete backref in extent [129952120832 20242432] > ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, > owner: 374857, offset: 148234240) wanted: 356, have: 433 > Delete backref in extent [129952120832 20242432] > ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, > owner: 374857, offset: 180371456) wanted: 161, have: 240 > Delete backref in extent [134925357056 11829248] > ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, > owner: 
374857, offset: 180371456) wanted: 162, have: 240 > Delete backref in extent [134925357056 11829248] > ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, > owner: 374857, offset: 192200704) wanted: 170, have: 249 > Delete backref in extent [147895111680 12345344] > ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, > owner: 374857, offset: 192200704) wanted: 172, have: 251 > Delete backref in extent [147895111680 12345344] > ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, > owner: 374857, offset: 217653248) wanted: 348, have: 418 > Delete backref in extent [150850146304 17522688] > ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, > owner: 374857, offset: 235175936) wanted: 555, have: 1449 > Deleted root 2 item[156909494272, 178, 5476627808561673095] > ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, > owner: 374857, offset: 235175936) wanted: 556, have: 1452 > Deleted root 2 item[156909494272, 178, 7338474132555182983] > ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost > Add one extent data backref [156909494272 55320576] > ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost > Add one extent data backref [156909494272 55320576] > > The last two ERROR lines took over a day to get generated, so I'm not sure if > it's still working, but just slowly. OK, that explains something. One extent is referred hundreds times, no wonder it will take a long time. Just one tip here, there are really too many
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 01:35:06PM +0800, Su Yue wrote:
> > It's hard to estimate, especially when every cross check involves a lot
> > of disk IO.
> >
> > But at least, we could add such indicator to show we're doing something.
>
> Maybe we can account all roots in root tree first, before checking a
> tree, report i/num_roots. So users can see whether the check is doing
> something meaningful or is stuck in a dead loop.

Sounds reasonable. Do you want to submit something in git master for btrfs-progs, I pull it, and I just run my btrfs check again?

In the meantime, how sane does the output I just posted look?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
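Su Yue's proposal above — count the roots up front, then report i/num_roots while walking them — can be sketched in plain C. This is an illustration only: `check_root()` and `check_all_roots()` are hypothetical stand-ins, not btrfs-progs functions.

```c
#include <stdio.h>

/* Hypothetical stand-in for the per-root cross checks; in btrfs-progs
 * the roots would come from iterating ROOT_ITEMs in the root tree. */
static void check_root(long long rootid)
{
    (void)rootid;  /* real checks would walk the tree here */
}

/* Count first, then report progress as "i/num_roots" so the user can
 * tell slow-but-real progress from a dead loop. Returns the number of
 * roots checked. */
static int check_all_roots(const long long *roots, int num_roots)
{
    for (int i = 0; i < num_roots; i++) {
        check_root(roots[i]);
        fprintf(stderr, "\rchecked root %d/%d", i + 1, num_roots);
    }
    fprintf(stderr, "\n");
    return num_roots;
}
```

Even a coarse counter like this would have answered Marc's question of whether the days-long run was still making progress.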
Re: So, does btrfs check lowmem take days? weeks?
On 06/29/2018 01:07 PM, Qu Wenruo wrote:

On 2018年06月29日 12:27, Marc MERLIN wrote:
Regular btrfs check --repair has a nice progress option. It wasn't perfect, but it showed something.
But then it also takes all your memory quicker than the linux kernel can defend itself and reliably completely kills my 32GB server quicker than it can OOM anything.
lowmem repair seems to be going still, but it's been days and -p seems to do absolutely nothing.

I'm afraid you hit a bug in lowmem repair code. By all means, --repair shouldn't really be used unless you're pretty sure the problem is something btrfs check can handle.
That's also why --repair is still marked as dangerous. Especially when it's combined with experimental lowmem mode.

My filesystem is "only" 10TB or so, albeit with a lot of files.

Unless you have tons of snapshots and reflinked (deduped) files, it shouldn't take so long.

2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking at days, weeks, or even months before it will be done?

It's hard to estimate, especially when every cross check involves a lot of disk IO.
But at least, we could add such indicator to show we're doing something.

Maybe we can account all roots in root tree first, before checking a tree, report i/num_roots. So users can see whether the check is doing something meaningful or is stuck in a dead loop.

Thanks,
Su

2) non lowmem is more efficient obviously when it doesn't completely crash your machine, but could lowmem be given an amount of memory to use for caching, or maybe use some heuristics based on RAM free so that it's not so excruciatingly slow?

IIRC recent commit has added the ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
That's already included in btrfs-progs v4.13.2.
So it should be a dead loop which lowmem repair code can't handle.
Thanks, Qu

Thanks, Marc
Re: So, does btrfs check lowmem take days? weeks?
On Fri, Jun 29, 2018 at 01:07:20PM +0800, Qu Wenruo wrote: > > lowmem repair seems to be going still, but it's been days and -p seems > > to do absolutely nothing. > > I'm a afraid you hit a bug in lowmem repair code. > By all means, --repair shouldn't really be used unless you're pretty > sure the problem is something btrfs check can handle. > > That's also why --repair is still marked as dangerous. > Especially when it's combined with experimental lowmem mode. Understood, but btrfs got corrupted (by itself or not, I don't know) I cannot mount the filesystem read/write I cannot btrfs check --repair it since that code will kill my machine What do I have left? > > My filesystem is "only" 10TB or so, albeit with a lot of files. > > Unless you have tons of snapshots and reflinked (deduped) files, it > shouldn't take so long. I may have a fair amount. gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 enabling repair mode WARNING: low-memory mode repair support is only partial Checking filesystem on /dev/mapper/dshelf2 UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d Fixed 0 roots. 
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]
ERROR: file extent[374857 235184128] root 21872 owner 21872 backref lost
Add one extent data backref [156909494272 55320576]
ERROR: file extent[374857 235184128] root 22911 owner 22911 backref lost
Add one extent data backref [156909494272 55320576]

The last two ERROR lines took over a day to get generated, so I'm not sure if it's still working, but just slowly.
For what it's worth non lowmem check used to take 12 to 24H on that filesystem back when it still worked.

> > 2 things that come to mind
> > 1) can lowmem have some progress working so that I know if I'm looking
> > at days, weeks, or even months before it will be done?
>
> It's hard to estimate, especially when every cross check involves a lot
> of disk IO.
> But at least, we could add such indicator to show we're doing something.

Yes, anything to show that I should still wait is still good :)

> > 2) non lowmem
[PATCH v2] Revert "btrfs: fix a possible umount deadlock"
Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for device list traversal") btrfs_show_devname no longer takes device_list_mutex. As such the deadlock that 0ccd05285e7f ("btrfs: fix a possible umount deadlock") aimed to fix no longer exists. So remove the extra code that commit added.

This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.

Signed-off-by: Nikolay Borisov
---
V2:
 * Fixed build failure due to using old name of free_device_rcu function.

 fs/btrfs/volumes.c | 26 ++++++----------------------
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..011a19b7930f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 		blkdev_put(device->bdev, device->mode);
 }
 
-static void btrfs_prepare_close_one_device(struct btrfs_device *device)
+static void btrfs_close_one_device(struct btrfs_device *device)
 {
 	struct btrfs_fs_devices *fs_devices = device->fs_devices;
 	struct btrfs_device *new_device;
@@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
 		fs_devices->missing_devices--;
 
+	btrfs_close_bdev(device);
+
 	new_device = btrfs_alloc_device(NULL, &device->devid, device->uuid);
 	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
 
@@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	list_replace_rcu(&device->dev_list, &new_device->dev_list);
 	new_device->fs_devices = device->fs_devices;
+
+	call_rcu(&device->rcu, free_device_rcu);
 }
 
 static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_device *device, *tmp;
-	struct list_head pending_put;
-
-	INIT_LIST_HEAD(&pending_put);
 
 	if (--fs_devices->opened > 0)
 		return 0;
 
 	mutex_lock(&fs_devices->device_list_mutex);
 	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
-		btrfs_prepare_close_one_device(device);
-		list_add(&device->dev_list, &pending_put);
+		btrfs_close_one_device(device);
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
-	/*
-	 * btrfs_show_devname() is using the device_list_mutex,
-	 * sometimes call to blkdev_put() leads vfs calling
-	 * into this func. So do put outside of device_list_mutex,
-	 * as of now.
-	 */
-	while (!list_empty(&pending_put)) {
-		device = list_first_entry(&pending_put,
-				struct btrfs_device, dev_list);
-		list_del(&device->dev_list);
-		btrfs_close_bdev(device);
-		call_rcu(&device->rcu, free_device_rcu);
-	}
-
 	WARN_ON(fs_devices->open_devices);
 	WARN_ON(fs_devices->rw_devices);
 	fs_devices->opened = 0;
-- 
2.7.4
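The simplification the revert restores — closing each device's bdev directly while walking the list, instead of staging devices on a pending_put list for a second pass — can be modeled in userspace C. This is only a sketch of the control flow: the struct and function names are invented, and the mutex and RCU machinery that make this safe in the kernel are elided.

```c
#include <stddef.h>

/* Userspace model of close_fs_devices() after the revert: since
 * btrfs_show_devname() now traverses the device list under RCU rather
 * than device_list_mutex, the bdev can be closed inside the list walk
 * and the pending_put staging list becomes unnecessary. */
struct fake_device {
    struct fake_device *next;
    int bdev_open;
};

static void fake_close_bdev(struct fake_device *dev)
{
    dev->bdev_open = 0;  /* stands in for blkdev_put() */
}

static int fake_close_fs_devices(struct fake_device *head)
{
    int closed = 0;

    /* one pass, close as we go (was: prepare loop + second put loop) */
    for (struct fake_device *dev = head; dev; dev = dev->next) {
        fake_close_bdev(dev);
        closed++;
    }
    return closed;
}
```

The two-pass dance existed only because blkdev_put() could re-enter code that took device_list_mutex; once the reader side moved to RCU, one pass suffices.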
Re: So, does btrfs check lowmem take days? weeks?
On 2018年06月29日 12:27, Marc MERLIN wrote:
> Regular btrfs check --repair has a nice progress option. It wasn't
> perfect, but it showed something.
>
> But then it also takes all your memory quicker than the linux kernel can
> defend itself and reliably completely kills my 32GB server quicker than
> it can OOM anything.
>
> lowmem repair seems to be going still, but it's been days and -p seems
> to do absolutely nothing.

I'm afraid you hit a bug in lowmem repair code.
By all means, --repair shouldn't really be used unless you're pretty sure the problem is something btrfs check can handle.

That's also why --repair is still marked as dangerous.
Especially when it's combined with experimental lowmem mode.

> My filesystem is "only" 10TB or so, albeit with a lot of files.

Unless you have tons of snapshots and reflinked (deduped) files, it shouldn't take so long.

> 2 things that come to mind
> 1) can lowmem have some progress working so that I know if I'm looking
> at days, weeks, or even months before it will be done?

It's hard to estimate, especially when every cross check involves a lot of disk IO.
But at least, we could add such indicator to show we're doing something.

> 2) non lowmem is more efficient obviously when it doesn't completely
> crash your machine, but could lowmem be given an amount of memory to use
> for caching, or maybe use some heuristics based on RAM free so that it's
> not so excruciatingly slow?

IIRC recent commit has added the ability.
a5ce5d219822 ("btrfs-progs: extent-cache: actually cache extent buffers")
That's already included in btrfs-progs v4.13.2.

So it should be a dead loop which lowmem repair code can't handle.

Thanks,
Qu

> Thanks,
> Marc
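The commit Qu cites caches extent buffers so repeated cross checks stop rereading the same tree blocks from disk. The idea can be illustrated with a minimal direct-mapped cache keyed by logical block start — a sketch only, assuming a 16 KiB nodesize; the real btrfs-progs implementation uses reference-counted extent buffers, not this structure.

```c
#define CACHE_SLOTS 1024
#define NODESIZE    16384ULL   /* typical btrfs nodesize, assumed */

/* One slot per hashed logical address; a hit means one disk read saved. */
struct eb_slot {
    unsigned long long start;
    int valid;
};

static struct eb_slot eb_cache[CACHE_SLOTS];
static unsigned long eb_hits, eb_misses;

/* Returns 1 if the block at 'start' was served from cache, 0 if it
 * had to be "read" (the disk read itself is elided in this sketch). */
static int read_tree_block_cached(unsigned long long start)
{
    struct eb_slot *slot = &eb_cache[(start / NODESIZE) % CACHE_SLOTS];

    if (slot->valid && slot->start == start) {
        eb_hits++;
        return 1;
    }
    eb_misses++;               /* a real implementation reads the disk here */
    slot->start = start;
    slot->valid = 1;
    return 0;
}
```

With hundreds of references per extent, as in Marc's log, every cached lookup replaces a random-IO tree-block read, which is where the speedup comes from.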
So, does btrfs check lowmem take days? weeks?
Regular btrfs check --repair has a nice progress option. It wasn't perfect, but it showed something.

But then it also takes all your memory quicker than the linux kernel can defend itself and reliably completely kills my 32GB server quicker than it can OOM anything.

lowmem repair seems to be going still, but it's been days and -p seems to do absolutely nothing.

My filesystem is "only" 10TB or so, albeit with a lot of files.

2 things that come to mind
1) can lowmem have some progress working so that I know if I'm looking at days, weeks, or even months before it will be done?
2) non lowmem is more efficient obviously when it doesn't completely crash your machine, but could lowmem be given an amount of memory to use for caching, or maybe use some heuristics based on RAM free so that it's not so excruciatingly slow?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On Thu, Jun 28, 2018 at 3:15 AM, Qu Wenruo wrote:
> I'd like to make sure everyone, including developers and end-users, are
> fine with the strict error-out behavior.

Yes, please error out, as a start. Requesting this was on my btrfs-to-do list.

A device generation mismatch from a drive going offline until forced reboot was what ultimately sent me on the path of problems I had. Had parent transid errors, which I realized were due to this, and decided to replace the outdated lvm partition with a new one. (Hiding the bad one to prevent UUID problems of course.)
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On 2018年06月29日 09:05, Christoph Anton Mitterer wrote:
> Hey Qu and Nikolay.
>
> On Thu, 2018-06-28 at 22:58 +0800, Qu Wenruo wrote:
>> Nothing special. Btrfs-progs will handle it pretty well.
> Since this is a remote system where the ISP provides only a rescue image
> with pretty old kernel/btrfs-progs, I had to copy a current local
> binary and use that... but that seemed to have worked quite well

Glad that works.

Just forgot to mention, if it's only a device size misalignment while the super block total_bytes is still accurate, you could just modify it on-line, using "btrfs fi resize".

Just like:
# btrfs fi resize 1:-4K

Then it should fix the device size misalignment.

>> Because the WARN_ON() is newly added.
> Ah I see.
>
>> Yep, latest will warn about it, and --repair can also fix it too.
> Great.
>
> On Thu, 2018-06-28 at 17:25 +0300, Nikolay Borisov wrote:
>> Was this an old FS or a fresh one?
> You mean in terms of original fs creation? Probably rather oldish.. I'd
> guess at least a year or maybe even 2-3 or more.
>
>> Looking at the callstack this seems
>> to have occurred due to the "btrfs_set_device_total_bytes(leaf, dev_item,
>> btrfs_device_get_disk_total_bytes(device));" call. Meaning the total
>> bytes of the disk were unaligned. Perhaps this has been like that for
>> quite some time, then you did a couple of kernel upgrades (this WARN_ON
>> was added later than 4.11) and just now you happened to delete a chunk
>> which would trigger a device update on-disk?
> Could be...
>
> The following was however still a bit strange:
> sda2 and sdb2 are the partitions on the two HDDs forming the RAID1.
>
> root@rescue ~ # ./btrfs rescue fix-device-size /dev/sda2
> Fixed device size for devid 2, old size: 999131127296 new size: 999131123712
> Fixed super total bytes, old size: 1998262251008 new size: 1998262247424
> Fixed unaligned/mismatched total_bytes for super block and device items
> root@rescue ~ # ./btrfs rescue fix-device-size /dev/sdb2
> No device size related problem found
>
> As you can see, no alignment issues were found on sdb2.
>
> I've created these at the same time...
> I don't think (but cannot exclude for 100%) that this server ever lost
> a disk (in that case I could imagine that newer progs/kernel might have
> created sdb2 with proper alignment)
>
> Looking at the partitions:
>
> root@rescue ~ # gdisk -l /dev/sda
> GPT fdisk (gdisk) version 1.0.1
>
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
>
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sda: 1953525168 sectors, 931.5 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID):
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 1953525134
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
>
> Number  Start (sector)  End (sector)  Size        Code  Name
>    1    2048            2097151      1023.0 MiB  EF02  BIOS boot partition
>    2    2097152         1953525134    930.5 GiB  8300  Linux filesystem
>
> root@rescue ~ # gdisk -l /dev/sdb
> GPT fdisk (gdisk) version 1.0.1
>
> Partition table scan:
>   MBR: protective
>   BSD: not present
>   APM: not present
>   GPT: present
>
> Found valid GPT with protective MBR; using GPT.
> Disk /dev/sdb: 1953525168 sectors, 931.5 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID):
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 1953525134
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
>
> Number  Start (sector)  End (sector)  Size        Code  Name
>    1    2048            2097151      1023.0 MiB  EF02  BIOS boot partition
>    2    2097152         1953525134    930.5 GiB  8300  Linux filesystem
>
> Both the same... so if there was no device replace or so... then I
> wonder why only one device was affected.

Maybe it's the old mkfs causing the problem?

Although mkfs.btrfs added device size alignment much earlier than the kernel did, it's still possible that the old mkfs doesn't handle the initial device and the extra devices the same way (mkfs.btrfs always creates a temporary fs on the first device, then adds all the other devices to it).

Thanks,
Qu

> Cheers,
> Chris.
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
Hey Qu and Nikolay.

On Thu, 2018-06-28 at 22:58 +0800, Qu Wenruo wrote:
> Nothing special. Btrfs-progs will handle it pretty well.
Since this is a remote system where the ISP provides only a rescue image with pretty old kernel/btrfs-progs, I had to copy a current local binary and use that... but that seemed to have worked quite well

> Because the WARN_ON() is newly added.
Ah I see.

> Yep, latest will warn about it, and --repair can also fix it too.
Great.

On Thu, 2018-06-28 at 17:25 +0300, Nikolay Borisov wrote:
> Was this an old FS or a fresh one?
You mean in terms of original fs creation? Probably rather oldish.. I'd guess at least a year or maybe even 2-3 or more.

> Looking at the callstack this seems
> to have occurred due to the "btrfs_set_device_total_bytes(leaf, dev_item,
> btrfs_device_get_disk_total_bytes(device));" call. Meaning the total
> bytes of the disk were unaligned. Perhaps this has been like that for
> quite some time, then you did a couple of kernel upgrades (this WARN_ON
> was added later than 4.11) and just now you happened to delete a chunk
> which would trigger a device update on-disk?
Could be...

The following was however still a bit strange:
sda2 and sdb2 are the partitions on the two HDDs forming the RAID1.

root@rescue ~ # ./btrfs rescue fix-device-size /dev/sda2
Fixed device size for devid 2, old size: 999131127296 new size: 999131123712
Fixed super total bytes, old size: 1998262251008 new size: 1998262247424
Fixed unaligned/mismatched total_bytes for super block and device items
root@rescue ~ # ./btrfs rescue fix-device-size /dev/sdb2
No device size related problem found

As you can see, no alignment issues were found on sdb2.

I've created these at the same time...
I don't think (but cannot exclude for 100%) that this server ever lost a disk (in that case I could imagine that newer progs/kernel might have created sdb2 with proper alignment)

Looking at the partitions:

root@rescue ~ # gdisk -l /dev/sda
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sda: 1953525168 sectors, 931.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID):
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)  End (sector)  Size        Code  Name
   1    2048            2097151      1023.0 MiB  EF02  BIOS boot partition
   2    2097152         1953525134    930.5 GiB  8300  Linux filesystem

root@rescue ~ # gdisk -l /dev/sdb
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 1953525168 sectors, 931.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID):
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1953525134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)  End (sector)  Size        Code  Name
   1    2048            2097151      1023.0 MiB  EF02  BIOS boot partition
   2    2097152         1953525134    930.5 GiB  8300  Linux filesystem

Both the same... so if there was no device replace or so... then I wonder why only one device was affected.

Cheers,
Chris.
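For what it's worth, the numbers in the fix-device-size output above are consistent with simply rounding the device size down to a 4 KiB boundary: the old devid 2 size (999131127296) is 3584 bytes past a multiple of 4096, and flooring it yields exactly the reported new size. A minimal sketch of that rounding (this is an illustration of the arithmetic, not the btrfs-progs code):

```c
/* Round a byte count down to a power-of-two alignment, as
 * "btrfs rescue fix-device-size" effectively does for total_bytes. */
static unsigned long long align_down(unsigned long long n,
                                     unsigned long long align)
{
    return n & ~(align - 1);
}
```

The super total_bytes shrinks by the same 3584 bytes (1998262251008 - 3584 = 1998262247424), since it is the sum of the per-device sizes.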
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 2018年06月28日 22:36, Adam Borowski wrote:
> On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
>> There is a reporter considering btrfs raid1 has a major design flaw
>> which can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means if one devices missed and re-appeared, even its generation
>> no longer matches with the rest device pool, btrfs does nothing to it,
>> but treat it as normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case.
> The vast majority of data is fine, and extents that have been written to
> while a device is missing will be either placed elsewhere (if the
> filesystem knew it was degraded) or read one of the copies to notice a
> wrong checksum and automatically recover (if the device was still
> falsely believed to be good at write time).

Yes, for an fs without any nodatasum usage, the behavior is indeed overkill.
But sometimes such an overkill sanity check is really important, as long as nodatasum is a provided feature.

> We currently don't have selective scrub yet so resyncing such
> single-copy extents is costly, but 1. all will be fine if the data is
> read, 2. it's possible to add such a smart resync in the future, far
> better than a write-intent bitmap can do.

Well, an auto scrub for a device doesn't look that bad to me.
Since scrub is normally scheduled as routine maintenance work, it should not be super expensive.
We only need to teach btrfs to treat such a device as a kind of degraded, then we can reuse most of the scrub routine to fix it.

> To do the latter, we can note the last generation the filesystem was
> known to be fully coherent (ie, all devices were successfully flushed
> with no mysterious write failures), then run selective scrub (perhaps
> even automatically) when the filesystem is no longer degraded.
> There's some extra complexity with 3- or 4-way RAID (multiple levels of
> degradation) but a single number would help even there.
>
> But even currently, without the above not-yet-written recovery, it's
> reasonably safe to continue without scrub -- it's a case of running
> partially degraded when the bad copy is already known to be suspicious.
>
> For no-nodatacow data and metadata, that is.
>
>> Currently there is no automatic rebuild yet, which means if users find
>> device generation mismatch error message, they can only mount the fs
>> using "device" and "degraded" mount option (if possible), then replace
>> the offending device to manually "rebuild" the fs.
>
> As nodatacow already means "I don't care about this data, or have
> another way of recovering it", I don't quite get why we would drop
> existing auto-recovery for a common transient failure case.

Yep, exactly my understanding of nodatasum behavior.
However in the real world, btrfs is the only *linux* fs that supports data csum, while the most widely used filesystems like ext4 and xfs don't support data csum.
As the discussion about the behavior goes on, I find that LVM/mdraid + ext4/xfs can do better missing-device management than btrfs with nodatasum, which means we need to at least do something that LVM/mdraid can provide.

> If you're paranoid, perhaps some bit "this filesystem has some nodatacow
> data on it" could warrant such a block, but it would still need to be
> overridable _without_ a need for replace. There's also the problem that
> systemd marks its journal nodatacow (despite it having infamously bad
> handling of failures!), and too many distributions infect their default
> installs with systemd, meaning such a bit would be on in most cases.
>
> But why would I put all my other data at risk, just because there's a
> nodatacow file?
> There's a big difference between scrubbing when only a few
> transactions worth of data is suspicious and completely throwing away a
> mostly-good replica to replace it from the now fully degraded copy.
>
>> I totally understand that, generation based solution can't handle
>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>> at all, but at least let's handle what we can do.
>
> Generation can do well at least unless both devices were mounted
> elsewhere and got the exact same number of transactions, the problem is
> that nodatacow doesn't bump the generation number.

Generation is never a problem, as any metadata change will still bump the generation.

>
>> The best way to solve the problem is to make btrfs treat such lower gen
>> devices as some kind of missing device, and queue an automatic scrub for
>> that device.
>> But that's a lot of extra work, at least let's start from detecting such
>> problem first.
>
> I wonder if there's some way to treat problematic nodatacow files as
> degraded only?
>
> Nodatacow misses most of btrfs mechanisms, thus to get it done right
> you'd need to pretty much copy all of md's
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月29日 01:10, Andrei Borzenkov wrote:
> 28.06.2018 12:15, Qu Wenruo wrote:
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote:
>>>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>>> Please get yourself clear of what other raid1 is doing.
>>>>>
>>>>> A drive failure, where the drive is still there when the computer
>>>>> reboots, is a situation that *any* raid 1 (or for that matter, raid 5,
>>>>> raid 6, anything but raid 0) will recover from perfectly without
>>>>> raising a sweat. Some will rebuild the array automatically,
>>>>
>>>> WOW, that's black magic, at least for RAID1.
>>>> The whole RAID1 has no idea of which copy is correct, unlike btrfs
>>>> which has datasum.
>>>> Don't bother with other things, just tell me how to determine which
>>>> one is correct?
>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time the
>>> drive with older generation is not incorporated. Hardware controllers
>>> also keep this information in NVRAM and so do not even depend on
>>> scanning of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start from detecting and refusing to RW mount, to avoid possible
>> further corruption.
>> But anyway, if one really cares about such case, a hardware RAID
>> controller seems to be the only solution, as other software may have
>> the same problem.
>>
>> And the hardware solution looks pretty interesting: is the write to
>> NVRAM 100% atomic? Even at power loss?
>>
>>>> The only possibility is that the misbehaved device missed several
>>>> super block updates, so we have a chance to detect it's out-of-date.
>>>> But that's not always working.
>>>
>>> Why should it not work, as long as any write to the array is suspended
>>> until the superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in the device superblocks?
>
> Well, you use "generation" in strict btrfs sense, I use "generation"
> generically. That is exactly what btrfs apparently lacks currently -
> some monotonic counter that is used to record such event.

Indeed, btrfs doesn't have any way to record which device got degraded at all.
The usage of btrfs device generation is already kind of a workaround.

So to keep the same behavior as mdraid/lvm, each time btrfs detects a device missing or a fatal command (flush/fua) not executed correctly, btrfs needs to record it, maybe into its device item, and commit it to disk.

In short, the btrfs csum makes us a little conceited about such device-missing cases: normally csum will tell us which data is wrong, so we can avoid complex device status tracking.
But apparently, if nodatasum is involved, everything falls outside our expectations.

>
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device doesn't get data written, and neither of them reached
>> the super block update, there is no difference in the device
>> superblocks, thus no way to detect which is correct.
>
> Again, the very fact that device failed should have triggered update of
> superblock to record this information which presumably should increase
> some counter.

Indeed.

>
>>>> If you're talking about missing generation check for btrfs, that's
>>>> valid, but it's far from a "major design flaw", as there are a lot of
>>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>>> (the brain-split case).
>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for the VDI case (or any VM image file format other than raw),
>> older valid data normally means corruption.
>> Unless they have their own write-ahead log.
>> Some file formats may detect such problems by themselves if they have
>> internal checksums, but anyway, older data normally means corruption,
>> especially when partially new and partially old.
>
> Yes, that's true. But there is really nothing that can be done here,
> even theoretically; it is hardly a reason to not do what looks possible.

Well, theoretically, you can just use datasum and datacow :)

Thanks,
Qu

>
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for a single device).
>> So the title, especially the "major design flaw", couldn't be more wrong.
>>
>>>>> others will automatically kick out the misbehaving drive. *none* of
>>>>> them will take back the drive with old data and start commingling
>>>>> that data with the good copy.) This behaviour from BTRFS is
>>>>> completely abnormal... and
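The device-generation bookkeeping debated in this thread can already be inspected by hand with `btrfs inspect-internal dump-super`, which prints a `generation` line from each member's superblock. A minimal sketch (device paths are examples, not from the thread); a raid1 member whose generation lags the other missed committed transactions while it was absent:

```shell
# Compare superblock generations of two raid1 members (paths are examples).
# dump-super prints, among other fields, a "generation\t<N>" line.
gen_a=$(btrfs inspect-internal dump-super /dev/sdX2 | awk '$1 == "generation" { print $2; exit }')
gen_b=$(btrfs inspect-internal dump-super /dev/sdY2 | awk '$1 == "generation" { print $2; exit }')

if [ "$gen_a" != "$gen_b" ]; then
    echo "generation mismatch ($gen_a vs $gen_b): one member is stale; scrub or replace it"
fi
```

This is essentially the manual version of the check Qu's RFC patch performs at mount time.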
Re: [PATCH] Revert "btrfs: fix a possible umount deadlock"
Hi Nikolay,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.18-rc2]
[also build test ERROR on next-20180628]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Nikolay-Borisov/Revert-btrfs-fix-a-possible-umount-deadlock/20180629-044154
config: x86_64-randconfig-x015-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All errors (new ones prefixed by >>):

   fs/btrfs/volumes.c: In function 'btrfs_close_one_device':
>> fs/btrfs/volumes.c:1041:25: error: 'free_device' undeclared (first use in this function); did you mean 'new_device'?
      call_rcu(&device->rcu, free_device);
                             ^~~~~~~~~~~
                             new_device
   fs/btrfs/volumes.c:1041:25: note: each undeclared identifier is reported only once for each function it appears in

vim +1041 fs/btrfs/volumes.c

  1006	
  1007	static void btrfs_close_one_device(struct btrfs_device *device)
  1008	{
  1009		struct btrfs_fs_devices *fs_devices = device->fs_devices;
  1010		struct btrfs_device *new_device;
  1011		struct rcu_string *name;
  1012	
  1013		if (device->bdev)
  1014			fs_devices->open_devices--;
  1015	
  1016		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
  1017		    device->devid != BTRFS_DEV_REPLACE_DEVID) {
  1018			list_del_init(&device->dev_alloc_list);
  1019			fs_devices->rw_devices--;
  1020		}
  1021	
  1022		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
  1023			fs_devices->missing_devices--;
  1024	
  1025		btrfs_close_bdev(device);
  1026	
  1027		new_device = btrfs_alloc_device(NULL, &device->devid,
  1028						device->uuid);
  1029		BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
  1030	
  1031		/* Safe because we are under uuid_mutex */
  1032		if (device->name) {
  1033			name = rcu_string_strdup(device->name->str, GFP_NOFS);
  1034			BUG_ON(!name); /* -ENOMEM */
  1035			rcu_assign_pointer(new_device->name, name);
  1036		}
  1037	
  1038		list_replace_rcu(&device->dev_list, &new_device->dev_list);
  1039		new_device->fs_devices = device->fs_devices;
  1040	
> 1041		call_rcu(&device->rcu, free_device);
  1042	}
  1043	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 11:37 AM, Goffredo Baroncelli wrote:
> Regarding your point 3), it must be pointed out that in the case of
> NOCOW files, even having the same transid is not enough. It is still
> possible that a copy is updated before a power failure prevents the
> super-block update.
> I think that the only way to prevent this from happening is:
> 1) using a data journal (which means that each data block is written
> two times)
> OR
> 2) using a cow filesystem (with cow enabled, of course!)

There is no power failure in this example. So it's really off the table when considering whether Btrfs or mdadm/lvm raid do better in the same situation with a nodatacow file.

I think here is the problem in the Btrfs nodatacow case. Btrfs doesn't have a way of untrusting nodatacow files on a previously missing drive that hasn't been balanced. There is no such thing as nometadatacow, so no matter what, it figures out there's a problem and uses the good copy of metadata, but it never "marks" the previously missing device as suspicious. When it comes time to read a nodatacow file, Btrfs just blindly reads off one of the drives; it has no mechanism for questioning the formerly missing drive without csum.

That is actually a really weird and unique kind of write hole for Btrfs raid1 when the data is nodatacow. I have to agree with Remi: this is a flaw in the design or a bad bug, however you want to consider it, because mdadm/lvm do not behave this way in the exact same situation.

And an open question I have about scrub is whether it only ever checks csums, meaning nodatacow files are never scrubbed, or whether the copies are at least compared to each other?

As for fixes:

- During mount time, Btrfs sees from the supers that there is a transid mismatch, so it should not read nodatacow files from the lower-transid device until an auto balance has completed. Right now Btrfs doesn't have an abbreviated balance that "replays" the events between two transids.
Basically it would work like send/receive, but for balance, to catch up a previously missing device. Right now we have to do a full balance, which is a brutal penalty for a briefly missing drive. Again, mdadm and lvm do better here by default.

- Fix the performance issues of COW with disk images. ZFS doesn't even have a nodatacow option, and they're running VM images on ZFS; it doesn't sound like they're running into ridiculous performance penalties that make it impractical to use.

--
Chris Murphy
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 9:37 AM, Remi Gauvin wrote:
> On 2018-06-28 10:17 AM, Chris Murphy wrote:
>
>> 2. The new data goes in a single chunk; even if the user does a manual
>> balance (resync) their data isn't replicated. They must know to do a
>> -dconvert balance to replicate the new data. Again this is a net worse
>> behavior than mdadm out of the box, putting user data at risk.
>
> I'm not sure this is the case. Even though writes failed to the
> disconnected device, btrfs seemed to keep on going as though it *were*.

Yeah, in your case the failure happens during normal operation, and in that case there's no degraded state on Btrfs. So it keeps writing to the raid1 chunk on the working drive, with writes on the failed device going nowhere (with lots of write errors). When you stop using the volume, fix the problem with the missing drive, then remount the volume, it really should still use only the new copy on the never-missing drive, even though it won't necessarily notice the file is missing on the formerly missing drive. You have to balance manually to fix it.

> When the array was re-mounted with both devices (never mounted as
> degraded), and scrub was run, scrub took a *long* time fixing errors, at
> a whopping 3MB/s, and reported having fixed millions of them.

That's slow, but it's expected to fix a lot of problems. Even in a very short amount of time there are thousands of missing data and metadata extents that need to be replicated.

--
Chris Murphy
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
> Acceptable, but not really apply to software based RAID1.

Which completely disregards the minor detail that all the software RAIDs I know of can handle exactly this kind of situation without losing or corrupting a single byte of data (errors on the remaining hard drive notwithstanding). Exactly what methods they employ to do so I'm not an expert at, but it *does* work, contrary to your repeated assertions otherwise.

In any case, thank you for the patch you wrote. I will, however, propose a different solution. Given the reliance of BTRFS on csum, and the lack of any resynchronization (no matter how the drives got out of sync, it doesn't matter), I think nodatacow should just be ignored in the case of RAID, just like the data blocks would get copied if there was a snapshot. In the current implementation of RAID on btrfs, RAID and nodatacow are effectively mutually exclusive.

Consider the kinds of use cases nodatacow is usually recommended for: VM images and databases. Even though those files should have their own mechanisms for dealing with incomplete writes and data verification, BTRFS RAID creates a unique situation where parts of the file can be inconsistent, with different data being read depending on which device is doing the reading.

Regardless of which method, short term and long term, developers choose to address this, this next part I have to stress I consider very important: the status page really needs to be updated to reflect this gotcha. It *will* bite people in ways they do not expect, and disastrously.
Re: [PATCH] Revert "btrfs: fix a possible umount deadlock"
Hi Nikolay,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.18-rc2]
[also build test ERROR on next-20180628]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Nikolay-Borisov/Revert-btrfs-fix-a-possible-umount-deadlock/20180629-044154
config: i386-randconfig-a0-201825 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   fs//btrfs/volumes.c: In function 'btrfs_close_one_device':
>> fs//btrfs/volumes.c:1041:25: error: 'free_device' undeclared (first use in this function)
      call_rcu(&device->rcu, free_device);
                             ^
   fs//btrfs/volumes.c:1041:25: note: each undeclared identifier is reported only once for each function it appears in

vim +/free_device +1041 fs//btrfs/volumes.c

  1006	
  1007	static void btrfs_close_one_device(struct btrfs_device *device)
  1008	{
  1009		struct btrfs_fs_devices *fs_devices = device->fs_devices;
  1010		struct btrfs_device *new_device;
  1011		struct rcu_string *name;
  1012	
  1013		if (device->bdev)
  1014			fs_devices->open_devices--;
  1015	
  1016		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
  1017		    device->devid != BTRFS_DEV_REPLACE_DEVID) {
  1018			list_del_init(&device->dev_alloc_list);
  1019			fs_devices->rw_devices--;
  1020		}
  1021	
  1022		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
  1023			fs_devices->missing_devices--;
  1024	
  1025		btrfs_close_bdev(device);
  1026	
  1027		new_device = btrfs_alloc_device(NULL, &device->devid,
  1028						device->uuid);
  1029		BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
  1030	
  1031		/* Safe because we are under uuid_mutex */
  1032		if (device->name) {
  1033			name = rcu_string_strdup(device->name->str, GFP_NOFS);
  1034			BUG_ON(!name); /* -ENOMEM */
  1035			rcu_assign_pointer(new_device->name, name);
  1036		}
  1037	
  1038		list_replace_rcu(&device->dev_list, &new_device->dev_list);
  1039		new_device->fs_devices = device->fs_devices;
  1040	
> 1041		call_rcu(&device->rcu, free_device);
  1042	}
  1043	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Incremental send/receive broken after snapshot restore
Hi,

Here's my environment:

Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64 Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
btrfs-progs v4.17

Label: 'online'  uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
	Total devices 2 FS bytes used 3.16TiB
	devid    1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
	devid    2 size 1.82TiB used 1.58TiB path /dev/mapper/online1

Data, RAID0: total=3.16TiB, used=3.15TiB
System, RAID0: total=16.00MiB, used=240.00KiB
Metadata, RAID0: total=7.00GiB, used=4.91GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Label: 'offline'  uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
	Total devices 2 FS bytes used 3.52TiB
	devid    1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
	devid    2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1

Data, RAID1: total=3.52TiB, used=3.52TiB
System, RAID1: total=8.00MiB, used=512.00KiB
Metadata, RAID1: total=6.00GiB, used=5.11GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Label: 'external'  uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
	Total devices 1 FS bytes used 3.65TiB
	devid    1 size 5.46TiB used 3.66TiB path /dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc

Data, single: total=3.64TiB, used=3.64TiB
System, DUP: total=32.00MiB, used=448.00KiB
Metadata, DUP: total=11.00GiB, used=9.72GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

The following automatic backup scheme is in place:

hourly:
  btrfs sub snap -r online/root online/root.

daily:
  btrfs sub snap -r online/root online/root.
  btrfs send -c online/root. online/root. | btrfs receive offline
  btrfs sub del -c online/root.

monthly:
  btrfs sub snap -r online/root online/root.
  btrfs send -c online/root. online/root. | btrfs receive external
  btrfs sub del -c online/root.
Now here are the commands leading up to my problem. After the online filesystem suddenly went ro, and btrfs check showed massive problems, I decided to start the online array from scratch:

1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0 /dev/mapper/online1

As you can see from the backup commands above, the snapshots of offline and external are not related, so in order to at least keep the extensive backlog of the external snapshot set (including all reflinks), I decided to restore the latest snapshot from external.

2: btrfs send external/root. | btrfs receive online

I wanted to ensure I can restart the incremental backup flow from online to external, so I did this:

3: mv online/root. online/root
4: btrfs sub snap -r online/root online/root.
5: btrfs property set online/root ro false

Now, I naively expected a simple restart of my automatic backups for external to work. However, after running

6: btrfs sub snap -r online/root online/root.
7: btrfs send -c online/root. online/root. | btrfs receive external

I see the following error:

ERROR: unlink root/.ssh/agent-diablo-_dev_pts_3 failed. No such file or directory

Which is unfortunate, but the second problem is what actually encouraged me to post this message. As planned, I had to start the offline array from scratch as well, because I no longer had any reference snapshot for incremental backups on other devices:

8: mkfs.btrfs -f -d raid1 -m raid1 -L "offline" /dev/mapper/offline0 /dev/mapper/offline1

However, restarting the automatic daily backup flow bails out with a similar error, although no potentially problematic previous incremental snapshots should be involved here!

ERROR: unlink o925031-987-0/2139527549 failed. No such file or directory

I'm a bit lost now. The only thing I could imagine which might be confusing for btrfs is the residual "Received UUID" of online/root. after command 2.
What's the recommended way to restore snapshots with send/receive without breaking subsequent incremental backups (including reflinks of existing backups)? Any hints appreciated...
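The common way to avoid this trap is to never flip the received snapshot itself to read-write (which is what steps 3-5 above effectively do), because its "Received UUID" only stays valid while the snapshot remains the unmodified read-only copy. A sketch with hypothetical snapshot names (`root.LATEST`, `root.NEW` stand in for the timestamped names elided above):

```shell
# 1: receive the backup into the rebuilt filesystem; the result is a
#    read-only snapshot carrying a valid "Received UUID"
btrfs send external/root.LATEST | btrfs receive online

# 2: branch a writable snapshot to use as the live subvolume, leaving
#    the received snapshot untouched as the reference for future
#    incremental sends
btrfs subvolume snapshot online/root.LATEST online/root

# later: incremental backups resume against the intact received snapshot
btrfs subvolume snapshot -r online/root online/root.NEW
btrfs send -p online/root.LATEST online/root.NEW | btrfs receive external
```

The design choice here is that `btrfs receive` on the far end matches incremental streams by the parent's received/snapshot UUIDs, so the read-only received snapshot must survive unmodified on both sides.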
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On 28.06.2018 17:17, Christoph Anton Mitterer wrote:
> On Thu, 2018-06-28 at 22:09 +0800, Qu Wenruo wrote:
>>> [   72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-
>>> uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
>>> btrfs_update_device+0x1b2/0x1c0
>>
>> It looks like it's the old WARN_ON() for unaligned device size.
>> Would you please verify if it is the case?
>
> # blockdev --getsize64 /dev/sdb2 /dev/sda2
> 999131127296
> 999131127296
>
> Since getsize64 returns bytes and not sectors, I suppose it would need
> to be aligned to 1024 at the least?
>
> 999131127296 / 1024 = 975713991.5
>
> So it's not.
>
>> If so, "btrfs rescue fix-device-size" should handle it pretty well.
>
> I guess this needs to be done with the fs unmounted?
> Anything to consider since I have RAID1 (except for running it on both
> devices)?
>
> Also, it's a bit strange that this error never occurred before (though
> the btrfs-restore manpage says the kernel would check for this since
> 4.11).

Was this an old FS or a fresh one? Looking at the callstack this seems to have occurred due to the "btrfs_set_device_total_bytes(leaf, dev_item, btrfs_device_get_disk_total_bytes(device));" call. Meaning the total bytes of the disk were unaligned. Perhaps this has been like that for quite some time, then you did a couple of kernel upgrades (this WARN_ON was added later than 4.11) and just now you happened to delete a chunk which would trigger a device update on-disk?

> It would further be nice if btrfs-check would warn about this.
>
> Thanks,
> Chris.
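The alignment check Christoph does by hand can be scripted. A minimal sketch: the size is the `blockdev --getsize64` value from this thread, and 4096 is used because that is the sector size the device item's total_bytes ends up rounded down to here (the remainders below line up with the "old size"/"new size" pair printed by `btrfs rescue fix-device-size` earlier in the thread):

```shell
# Check whether a device size in bytes is sector-aligned; a non-zero
# remainder is what trips the WARN_ON in btrfs_update_device().
size=999131127296                     # blockdev --getsize64 output from the thread

echo "mod 1024: $(( size % 1024 ))"   # 512  -> not even 1 KiB aligned
echo "mod 4096: $(( size % 4096 ))"   # 3584 -> rounded down to 999131123712
```

Subtracting the 4096 remainder reproduces exactly the "new size: 999131123712" that fix-device-size reported.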
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 2018年06月28日 16:04, Nikolay Borisov wrote: > > > On 28.06.2018 10:04, Qu Wenruo wrote: >> There is a reporter considering btrfs raid1 has a major design flaw >> which can't handle nodatasum files. >> >> Despite his incorrect expectation, btrfs indeed doesn't handle device >> generation mismatch well. > > I think this is getting a bit personal, no need for that. Just say that > btrfs doesn't handle this particular case and that's it. Though I agree > the reporter's attitude wasn't exactly nice... > >> >> This means if one devices missed and re-appeared, even its generation >> no longer matches with the rest device pool, btrfs does nothing to it, >> but treat it as normal good device. >> >> At least let's detect such generation mismatch and avoid mounting the >> fs. >> Currently there is no automatic rebuild yet, which means if users find >> device generation mismatch error message, they can only mount the fs >> using "device" and "degraded" mount option (if possible), then replace >> the offending device to manually "rebuild" the fs. >> >> Signed-off-by: Qu Wenruo >> --- >> I totally understand that, generation based solution can't handle >> split-brain case (where 2 RAID1 devices get mounted degraded separately) >> at all, but at least let's handle what we can do. >> >> The best way to solve the problem is to make btrfs treat such lower gen >> devices as some kind of missing device, and queue an automatic scrub for >> that device. >> But that's a lot of extra work, at least let's start from detecting such >> problem first. 
>> ---
>>  fs/btrfs/volumes.c | 50 ++
>>  1 file changed, 50 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e034ad9e23b4..80a7c44993bc 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>>  			      devid, uuid);
>>  }
>>  
>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>> +				     struct btrfs_device *dev)
>> +{
>> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +	struct btrfs_device *cur;
>> +	bool warn_only = false;
>> +	int ret = 0;
>> +
>> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
>> +		return 0;
>> +
>> +	/*
>> +	 * If we're not replaying log, we're completely safe to allow
>> +	 * generation mismatch as it won't write anything to disks, nor
>> +	 * remount to rw.
>> +	 */
>> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>> +		warn_only = true;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>> +		if (cur->generation && cur->generation != dev->generation) {
>> +			if (warn_only) {
>> +				btrfs_warn_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						     dev->devid,
>> +						     dev->generation,
>> +						     cur->generation);
>> +			} else {
>> +				btrfs_err_rl_in_rcu(fs_info,
>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>> +						    dev->devid,
>> +						    dev->generation,
>> +						    cur->generation);
>> +				ret = -EINVAL;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}
>> +
>>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>  			  struct extent_buffer *leaf,
>>  			  struct btrfs_chunk *chunk)
>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>  				return PTR_ERR(map->stripes[i].dev);
>>  			}
>>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
>> +		} else {
>> +			ret = verify_devices_generation(fs_info,
>> +							map->stripes[i].dev);
>> +			if (ret
< 0) {
>> +				free_extent_map(em);
>> +				return ret;
>> +			}
>
> So this will be executed when doing mount. What about one device
> disappearing while the filesystem is still mounted and then later
> re-appearing? This code won't be executed in that case, no?

This depends on how the re-appearance happens.
If just several bios fail, it shouldn't be a big problem.
If it really re-appears, I'm not quite sure how btrfs will handle it, since there
Re: Enabling quota may not correctly rescan on 4.17
On 28.06.2018 11:10, Misono Tomohiro wrote:
> On 2018/06/28 16:12, Qu Wenruo wrote:
>> On 2018年06月27日 16:25, Misono Tomohiro wrote:
>>> On 2018/06/27 17:10, Qu Wenruo wrote:
>>>> On 2018年06月26日 14:00, Misono Tomohiro wrote:
>>>>> Hello Nikolay,
>>>>>
>>>>> I noticed that commit 5d23515be669 ("btrfs: Move qgroup rescan
>>>>> on quota enable to btrfs_quota_enable") in 4.17 sometimes causes
>>>>> a failure to correctly rescan quota when quota is enabled.
>>>>>
>>>>> Simple reproducer:
>>>>>
>>>>> $ mkfs.btrfs -f $DEV
>>>>> $ mount $DEV /mnt
>>>>> $ dd if=/dev/urandom of=/mnt/file bs=1000 count=1000
>>>>> $ btrfs quota enable /mnt
>>>>> $ umount /mnt
>>>>> $ btrfs check $DEV
>>>>> ...
>>
>> Unfortunately in my environment, btrfs/114 failed to reproduce it with
>> 1024 runs overnight, with a v4.18-rc1 kernel.
>>
>> Would you please provide the whole btrfs-image dump of the corrupted fs?
>
> Yes. The attached file is an image dump of the above reproducer (kernel
> 4.17.0, progs 4.17), as the dump of btrfs/114 is a bit large for mail.
>
> Though this does not always happen, I see the failure both on 4.17.0 and
> 4.18-rc2.
>
> Thanks,
> Tomohiro Misono

Misono,

Can you please try the attached patch?

>>
>> There are several different assumptions on how the bug happens; with
>> your btrfs-image dump, it would help a lot to rule out some of them.
>>
>> Thanks,
>> Qu

>From 10345e21bc2b4e61644da6b76ee4528710b2be25 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov
Date: Tue, 26 Jun 2018 10:03:58 +0300
Subject: [PATCH] btrfs: qgroups: Move transaction management inside
 btrfs_quota_enable/disable

Commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in easier to follow code but it
also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening - before the commit it would happen
after the transaction commit had occurred, but after the commit it might
happen before the transaction was committed.
This results in a failure to correctly rescan the quota, since there could be data which is still not committed on disk. This patch aims to fix that by moving the transaction creation/commit inside btrfs_quota_enable, which allows the quota rescan to be scheduled after the transaction has been committed. For the sake of symmetry, this patch also moves the transaction logic inside btrfs_quota_disable.

Fixes: 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov
---
 fs/btrfs/ioctl.c  | 15 ++-
 fs/btrfs/qgroup.c | 38 +++---
 fs/btrfs/qgroup.h |  6 ++
 3 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a399750b9e41..316fb1af15e2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5135,9 +5135,7 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_ioctl_quota_ctl_args *sa;
-	struct btrfs_trans_handle *trans = NULL;
 	int ret;
-	int err;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -5153,28 +5151,19 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 	}
 
 	down_write(&fs_info->subvol_sem);
-	trans = btrfs_start_transaction(fs_info->tree_root, 2);
-	if (IS_ERR(trans)) {
-		ret = PTR_ERR(trans);
-		goto out;
-	}
 
 	switch (sa->cmd) {
 	case BTRFS_QUOTA_CTL_ENABLE:
-		ret = btrfs_quota_enable(trans, fs_info);
+		ret = btrfs_quota_enable(fs_info);
 		break;
 	case BTRFS_QUOTA_CTL_DISABLE:
-		ret = btrfs_quota_disable(trans, fs_info);
+		ret = btrfs_quota_disable(fs_info);
 		break;
 	default:
 		ret = -EINVAL;
 		break;
 	}
 
-	err = btrfs_commit_transaction(trans);
-	if (err && !ret)
-		ret = err;
-out:
 	kfree(sa);
 	up_write(&fs_info->subvol_sem);
 drop_write:
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1874a6d2e6f5..1d84af0d053f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -875,8 +875,7 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-int btrfs_quota_enable(struct btrfs_trans_handle *trans,
-		       struct btrfs_fs_info *fs_info)
+int btrfs_quota_enable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *quota_root;
 	struct btrfs_root *tree_root = fs_info->tree_root;
@@ -886,6 +885,7 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
 	struct btrfs_key key;
 	struct btrfs_key found_key;
 	struct btrfs_qgroup *qgroup = NULL;
+	struct btrfs_trans_handle *trans = NULL;
 	int ret = 0;
 	int slot;
 
@@ -893,6 +893,12 @@ int btrfs_quota_enable(struct btrfs_trans_handle *trans,
 	if (fs_info->quota_root)
 		goto out;
 
+	trans = btrfs_start_transaction(tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
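The ordering bug the commit message describes can be shown with a toy model. This is not btrfs code (ToyFs and its methods are invented purely for illustration), but it captures why a rescan scheduled before the transaction commit sees none of the pending data:

```python
# Toy model (not btrfs code): why scheduling the qgroup rescan before the
# transaction commit can miss data. The "disk" only sees extents after
# commit(), and a rescan walks only what is on "disk".

class ToyFs:
    def __init__(self):
        self.committed = []   # extents visible on "disk"
        self.pending = []     # extents written in the open transaction

    def write(self, nbytes):
        self.pending.append(nbytes)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def rescan(self):
        # a rescan walks only committed on-disk extents
        return sum(self.committed)

def enable_quota_rescan_first(fs):
    # buggy ordering: rescan runs before the transaction commits
    usage = fs.rescan()
    fs.commit()
    return usage

def enable_quota_commit_first(fs):
    # fixed ordering: commit the transaction, then rescan
    fs.commit()
    return fs.rescan()

fs1 = ToyFs(); fs1.write(1000 * 1000)
fs2 = ToyFs(); fs2.write(1000 * 1000)
print(enable_quota_rescan_first(fs1))  # 0: pending data missed
print(enable_quota_commit_first(fs2))  # 1000000: correct accounting
```

The real fix above achieves the second ordering by owning the transaction inside btrfs_quota_enable itself.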
Re: [GIT PULL] Btrfs updates for 4.18
On Thu, Jun 28, 2018 at 07:22:59PM +0800, Anand Jain wrote:
> The circular locking dependency warning occurs at FSSTRESS_PROG,
> in particular at doproc() in xfstests/ltp/fsstress.c, randomly at
> any of the commands in
>   opdesc_t ops[] = { .. }
> which involve calling the mmap file operation and, if there is
> something to commit, a transaction commit.
>
> The transaction commit does need device_list_mutex, which is also
> used in btrfs_open_devices() as of commit 542c5908abfe84f7.
>
> But btrfs_open_devices() is only called at mount, and mmap() can
> only be established after the mount has completed. Given this, it's
> unclear to me why the circular locking dependency check is warning
> about this.
>
> I feel that until we have clarity about this, and also solve the other
> problems related to the streamlining of uuid_mutex, we should revert
> 542c5908abfe84f7. Sorry for the inconvenience.

Ok, the revert is one option. I'm considering adding both of the locks, as in https://patchwork.kernel.org/patch/10478443/ . This would have no effect, as btrfs_open_devices is called only from the mount path and the list_sort is done only the first time, when there are no other users of the list that would not also be under the uuid_mutex. This passed the syzbot and other tests, so it does not break things and goes towards making device_list_mutex the real protection mechanism for the fs_devices members.

Let me know what you think; the revert should be the last option if we don't have anything better.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
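For readers unfamiliar with what the "circular locking dependency check" actually checks: lockdep records the order in which lock classes are taken and warns as soon as the order graph contains a cycle, even if no deadlock has happened yet. The sketch below is a drastically simplified conceptual model of that idea (not the kernel implementation; the lock names are just labels taken from this thread):

```python
# Conceptual sketch of lockdep's circular-dependency check (greatly
# simplified): record an edge A -> B every time lock B is taken while A
# is held, and flag any cycle in the resulting directed graph.

from collections import defaultdict

class LockOrderGraph:
    def __init__(self):
        self.edges = defaultdict(set)

    def acquire(self, held, taken):
        """Record that `taken` was acquired while `held` was held."""
        self.edges[held].add(taken)

    def has_cycle(self):
        visited, stack = set(), set()
        def dfs(node):
            visited.add(node); stack.add(node)
            for nxt in self.edges[node]:
                if nxt in stack or (nxt not in visited and dfs(nxt)):
                    return True
            stack.discard(node)
            return False
        return any(dfs(n) for n in list(self.edges) if n not in visited)

g = LockOrderGraph()
# one path: uuid_mutex held, then device_list_mutex taken
g.acquire("uuid_mutex", "device_list_mutex")
print(g.has_cycle())   # False: one consistent order so far
# another path takes them in the opposite order -> ABBA
g.acquire("device_list_mutex", "uuid_mutex")
print(g.has_cycle())   # True: this is where lockdep would warn
```

This is why lockdep can warn even when, as Anand notes, the two paths cannot actually run concurrently: the check is over recorded orderings, not over reachable interleavings.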
Re: [PATCH v1] btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
On Wed, Jun 27, 2018 at 06:19:55PM +0800, Qu Wenruo wrote:
> Commit ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf
> of extent tree") added a new exit for rescan finish.
>
> However after finishing quota rescan, we set
> fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
> original exit path.
> While we missed that assignment of (u64)-1 in the new exit path.
>
> The end result is, the quota status item doesn't have the same value.
> (-1 vs the last bytenr + 1)
> Although it doesn't affect quota accounting, it's still better to keep
> the original behavior.
>
> Reported-by: Misono Tomohiro
> Fixes: ff3d27a048d9 ("btrfs: qgroup: Finish rescan when hit the last leaf
> of extent tree")
> Signed-off-by: Qu Wenruo

Thanks, added to 4.18 queue.
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On 28.06.2018 16:19, Christoph Anton Mitterer wrote:
> Hey.
>
> On a 4.16.16 kernel with a RAID 1 btrfs I got the following messages
> since today.
>
> Data seems still to be readable (correctly)... and there are no other
> errors (like SATA errors) in the kernel log.
>
> Any idea what these could mean?

It means you have a device whose size is not aligned to 4k. You can fix this either by resizing the device to be properly aligned to a 4k boundary (if you are on x86) or, better yet, just run:

btrfs rescue fix-device-size /dev/

> Thanks,
> Chris.
Re: [PATCH 2/2] Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker
On Wed, Jun 20, 2018 at 07:56:12AM -0700, Chris Mason wrote:
> For COW, btrfs expects dirty pages to have been through a few setup
> steps. This includes reserving space for the new block allocations and
> marking the range in the state tree for delayed allocation.
>
> A few places outside btrfs will dirty pages directly, especially when
> unmapping mmap'd pages. In order for these to properly go through COW,
> we run them through a fixup worker to wait for stable pages, and do the
> delalloc prep.
>
> 87826df0ec36 added a window where the dirty pages were cleaned, but
> pending more action from the fixup worker.

Can you please be more specific about the window, where it starts and ends?

> During this window, page migration can jump in and relocate the page.
> Once our fixup work actually starts, it finds page->mapping is NULL and
> we end up freeing the page without ever writing it.

AFAICS the old and new code do the same sequence of calls from the first mapping check: ClearPageChecked, unlock_page, put_page, kfree, extent_changeset_free

> This leads to crc errors and other exciting problems, since it screws
> up the whole state machine for waiting for ordered extents. The fix
> here is to keep the page dirty while we're waiting for the fixup worker
> to get to work. This also makes sure the error handling in
> btrfs_writepage_fixup_worker does the right thing with dirty bits when
> we run out of space.

So this would need to find the mapping first to be not NULL, go until btrfs_start_ordered_extent where the lock is dropped, come back to again:, and then check for the mapping that's now NULL? But I still don't see how this is making things different. In the remaining sequence btrfs_lookup_ordered_range, btrfs_delalloc_reserve_space, btrfs_set_extent_delalloc (without any errors), the ClearPageChecked comes after the extent is unlocked.
> Signed-off-by: Chris Mason
> ---
>  fs/btrfs/inode.c | 67 +---
>  1 file changed, 49 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0b86cf1..5538900 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2100,11 +2100,21 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
> 	page = fixup->page;
> again:
> 	lock_page(page);
> -	if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
> -		ClearPageChecked(page);
> +
> +	/*
> +	 * before we queued this fixup, we took a reference on the page.
> +	 * page->mapping may go NULL, but it shouldn't be moved to a
> +	 * different address space.
> +	 */
> +	if (!page->mapping || !PageDirty(page) || !PageChecked(page))
> 		goto out_page;
> -	}
>
> +	/*
> +	 * we keep the PageChecked() bit set until we're done with the
> +	 * btrfs_start_ordered_extent() dance that we do below. That
> +	 * drops and retakes the page lock, so we don't want new
> +	 * fixup workers queued for this page during the churn.
> +	 */
> 	inode = page->mapping->host;
> 	page_start = page_offset(page);
> 	page_end = page_offset(page) + PAGE_SIZE - 1;
> @@ -2129,33 +2139,46 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
>
> 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
> 					   PAGE_SIZE);
> -	if (ret) {
> -		mapping_set_error(page->mapping, ret);
> -		end_extent_writepage(page, ret, page_start, page_end);
> -		ClearPageChecked(page);
> -		goto out;
> -	}
> +	if (ret)
> +		goto out_error;
>
> 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
> 					&cached_state, 0);
> -	if (ret) {
> -		mapping_set_error(page->mapping, ret);
> -		end_extent_writepage(page, ret, page_start, page_end);
> -		ClearPageChecked(page);
> -		goto out;
> -	}
> +	if (ret)
> +		goto out_error;
>
> -	ClearPageChecked(page);
> -	set_page_dirty(page);

Hm, so previously the page was dirty, unconditionally calling down to set_page_dirty that could call btree_set_page_dirty and __set_page_dirty_nobuffers. If the dirty bit is set there, it'll do nothing. So this should be equivalent to the new code, but it looks strange to say the least.

> 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, false);
> +
> +	/*
> +	 * everything went as planned, we're now the proud owners of a
> +	 * Dirty page with delayed allocation bits set and space reserved
> +	 * for our COW destination.
> +	 *
> +	 * The page was dirty when we started, nothing should have cleaned it.
> +	 */
> +	BUG_ON(!PageDirty(page));
> +
> out:
> 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
> 			     &cached_state);
> out_page:
> +
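A toy model of the race being discussed may help. This is invented illustration, not kernel code: if the fixup worker clears the dirty bit before the delalloc setup completes, page migration can reclaim the page in that window and the data is never written back; keeping the page dirty closes the window.

```python
# Toy model (not kernel code) of the race described above. Migration only
# touches pages that look clean, so clearing the dirty bit early opens a
# window in which the page can lose its mapping and its data.

class Page:
    def __init__(self, data):
        self.data = data
        self.dirty = True
        self.mapping = "inode"   # None once migration moved/freed it

def migrate_if_clean(page):
    # migration skips dirty pages
    if not page.dirty:
        page.mapping = None

def fixup_worker(page, clear_dirty_early, migration_happens):
    if clear_dirty_early:
        page.dirty = False            # old behavior: window opens here
    if migration_happens:
        migrate_if_clean(page)
    if page.mapping is None:
        return None                   # page gone, data lost
    page.dirty = True                 # delalloc set up, page stays dirty
    return page.data                  # will reach disk via writeback

print(fixup_worker(Page("A"), clear_dirty_early=True,  migration_happens=True))   # None
print(fixup_worker(Page("A"), clear_dirty_early=False, migration_happens=True))   # A
```

The patch's approach corresponds to the second call: the dirty bit stays set across the whole fixup dance, so migration never sees a reclaimable page.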
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 28.06.2018 10:04, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
>
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.

I think this is getting a bit personal, no need for that. Just say that btrfs doesn't handle this particular case and that's it. Though I agree the reporter's attitude wasn't exactly nice...

> This means if one device went missing and re-appeared, even if its
> generation no longer matches the rest of the device pool, btrfs does
> nothing to it, but treats it as a normal good device.
>
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> a device generation mismatch error message, they can only mount the fs
> using the "device" and "degraded" mount options (if possible), then
> replace the offending device to manually "rebuild" the fs.
>
> Signed-off-by: Qu Wenruo
> ---
> I totally understand that a generation-based solution can't handle the
> split-brain case (where 2 RAID1 devices get mounted degraded
> separately) at all, but at least let's handle what we can.
>
> The best way to solve the problem is to make btrfs treat such lower-gen
> devices as some kind of missing device, and queue an automatic scrub
> for that device.
> But that's a lot of extra work, so at least let's start from detecting
> the problem first.
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
> 			      devid, uuid);
>  }
>
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +					dev->devid, dev->generation,
> +					cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +					dev->devid, dev->generation,
> +					cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
> 			  struct extent_buffer *leaf,
> 			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
> 				return PTR_ERR(map->stripes[i].dev);
> 			}
> 			btrfs_report_missing_device(fs_info, devid, uuid, false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +
> +				return ret;
> +			}

So this will be executed when doing mount. What about one device disappearing while the filesystem is still mounted, which then later re-appears? This code won't be executed in that case, no?

> 		}
> 		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> 			&(map->stripes[i].dev->dev_state));
>
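The verify_devices_generation() logic quoted above is compact enough to model outside the kernel. Here is a Python paraphrase for illustration only: the dict layout, the bare error constant, and the message strings are stand-ins for the C structures and the btrfs_err_rl_in_rcu output. It shows why a stale device is rejected on a normal mount but only warned about under nologreplay:

```python
# Illustrative paraphrase of the proposed verify_devices_generation():
# a device whose generation differs from the rest of the pool is rejected,
# unless we mount with nologreplay, in which case we only warn because
# nothing will be written to the disks.

EINVAL = 22

def verify_devices_generation(devices, dev, nologreplay=False):
    warnings = []
    if not dev["generation"]:
        return 0, warnings
    for cur in devices:
        if cur["generation"] and cur["generation"] != dev["generation"]:
            warnings.append(
                "devid %d has unexpected generation, has %d expected %d"
                % (dev["devid"], dev["generation"], cur["generation"]))
            if not nologreplay:
                return -EINVAL, warnings  # refuse the mount
    return 0, warnings

pool = [{"devid": 1, "generation": 100}, {"devid": 2, "generation": 97}]
ret, warns = verify_devices_generation(pool, pool[1])
print(ret)        # -22: stale device, refuse to mount
ret, warns = verify_devices_generation(pool, pool[1], nologreplay=True)
print(ret)        # 0: log-less read-only mount allowed, with a warning
```

Note that, exactly as Nikolay points out in the reply, this check only runs when chunks are read at mount time; a device that drops and re-appears on a mounted filesystem never passes through it.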
[PATCH] Revert "btrfs: fix a possible umount deadlock"
Since commit 88c14590cdd6 ("btrfs: use RCU in btrfs_show_devname for device list traversal") btrfs_show_devname no longer takes device_list_mutex. As such, the deadlock that 0ccd05285e7f ("btrfs: fix a possible umount deadlock") aimed to fix no longer exists. So remove the extra code that commit added. No functional changes.

This reverts commit 0ccd05285e7f5a8e297e1d6dfc41e7c65757d6fa.

Signed-off-by: Nikolay Borisov
---

Anand, I would like to use the opportunity to ask you about the peculiar sequence needed to close btrfs devices. Why do we first replace the closed device with a copy in btrfs_close_one_device, and then dispose of the copied devices in btrfs_close_devices only if fs_devices->seed is not NULL?

 fs/btrfs/volumes.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd6f3a40f9c..70b0ed2ba4df 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1004,7 +1004,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 	blkdev_put(device->bdev, device->mode);
 }
 
-static void btrfs_prepare_close_one_device(struct btrfs_device *device)
+static void btrfs_close_one_device(struct btrfs_device *device)
 {
 	struct btrfs_fs_devices *fs_devices = device->fs_devices;
 	struct btrfs_device *new_device;
@@ -1022,6 +1022,8 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
 		fs_devices->missing_devices--;
 
+	btrfs_close_bdev(device);
+
 	new_device = btrfs_alloc_device(NULL, &device->devid,
 					device->uuid);
 	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
@@ -1035,39 +1037,23 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device)
 	list_replace_rcu(&device->dev_list, &new_device->dev_list);
 	new_device->fs_devices = device->fs_devices;
+
+	call_rcu(&device->rcu, free_device);
 }
 
 static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
 {
 	struct btrfs_device *device, *tmp;
-	struct list_head pending_put;
-
-	INIT_LIST_HEAD(&pending_put);
 
 	if (--fs_devices->opened > 0)
 		return 0;
 
 	mutex_lock(&fs_devices->device_list_mutex);
 	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
-		btrfs_prepare_close_one_device(device);
-		list_add(&device->dev_list, &pending_put);
+		btrfs_close_one_device(device);
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
-	/*
-	 * btrfs_show_devname() is using the device_list_mutex,
-	 * sometimes call to blkdev_put() leads vfs calling
-	 * into this func. So do put outside of device_list_mutex,
-	 * as of now.
-	 */
-	while (!list_empty(&pending_put)) {
-		device = list_first_entry(&pending_put,
-					  struct btrfs_device, dev_list);
-		list_del(&device->dev_list);
-		btrfs_close_bdev(device);
-		call_rcu(&device->rcu, free_device_rcu);
-	}
-
 	WARN_ON(fs_devices->open_devices);
 	WARN_ON(fs_devices->rw_devices);
 	fs_devices->opened = 0;
-- 
2.7.4
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 06/28/2018 04:17 PM, Chris Murphy wrote:
> Btrfs does two, maybe three, bad things:
> 1. No automatic resync. This is a net worse behavior than mdadm and
> lvm, putting data at risk.
> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.
> 3. Apparently if nodatacow, given a file with two copies of different
> transid, Btrfs won't always pick the higher-transid copy? If true
> that's terrible, and again not at all what mdadm/lvm are doing.

All of these could be avoided simply by not allowing a multi-device filesystem to mount without ensuring that all the devices have the same generation. In the past I proposed a mount.btrfs helper; I still think that it would be the right place to
a) put all the checks before mounting the filesystem
b) print the correct information in order to help the user with what he has to do to solve the issues

Regarding your point 3), it must be pointed out that in the case of NOCOW files, even having the same transid is not enough. It is still possible that one copy is updated before a power failure prevents the super-block update. I think that the only way to prevent this from happening is:
1) using a data journal (which means that each piece of data is copied two times), OR
2) using a COW filesystem (with COW enabled, of course!)

I think that this is a good example of why a battery-backed HW RAID controller can be better than SW RAID. Of course the likelihood of a lot of these problems could be reduced using a backup power supply.

BR
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 28.06.2018 12:15, Qu Wenruo wrote:
> On 2018-06-28 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote:
>>> On 2018-06-28 11:14, r...@georgianit.com wrote:
>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>> Please get yourself clear of what other raid1 is doing.
>>>>
>>>> A drive failure, where the drive is still there when the computer
>>>> reboots, is a situation that *any* raid 1 (or for that matter, raid 5,
>>>> raid 6, anything but raid 0) will recover from perfectly without
>>>> raising a sweat. Some will rebuild the array automatically,
>>>
>>> WOW, that's black magic, at least for RAID1.
>>> The whole RAID1 has no idea of which copy is correct, unlike btrfs,
>>> which has datasum.
>>>
>>> Don't bother with other things, just tell me how to determine which
>>> one is correct?
>>
>> When one drive fails, it is recorded in metadata on the remaining
>> drives; probably a configuration generation number is increased. Next
>> time, the drive with the older generation is not incorporated.
>> Hardware controllers also keep this information in NVRAM and so do not
>> even depend on scanning of other disks.
>
> Yep, the only possible way to determine such a case is from external
> info.
>
> For device generation, it's possible to enhance btrfs, but at least we
> could start by detecting the mismatch and refusing to RW mount, to
> avoid possible further corruption.
> But anyway, if one really cares about such a case, a hardware RAID
> controller seems to be the only solution, as other software may have
> the same problem.
>
> And the hardware solution looks pretty interesting: is the write to
> NVRAM 100% atomic? Even at power loss?
>
>>> The only possibility is that the misbehaving device missed several
>>> super block updates, so we have a chance to detect that it's
>>> out-of-date.
>>> But that's not always working.
>>
>> Why should it not work, as long as any write to the array is suspended
>> until the superblock on the remaining devices is updated?
> What happens if there is no generation gap in device superblock?

Well, you use "generation" in the strict btrfs sense; I use "generation" generically. That is exactly what btrfs apparently lacks currently - some monotonic counter that is used to record such an event.

> If one device got some of its (nodatacow) data written to disk, while
> the other device doesn't get data written, and neither of them reached
> the super block update, there is no difference in the device
> superblocks, thus no way to detect which is correct.

Again, the very fact that a device failed should have triggered an update of the superblock to record this information, which presumably should increase some counter.

>> If you're talking about the missing generation check for btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>> (the brain-split case).

That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.

> While for the VDI case (or any VM image file format other than raw),
> older valid data normally means corruption.

Unless they have their own write-ahead log.

> Some file formats may detect such problems by themselves if they have
> an internal checksum, but anyway, older data normally means corruption,
> especially when it's partially new and partially old.

Yes, that's true. But there is really nothing that can be done here, even theoretically; it's hardly a reason to not do what looks possible.

> On the other hand, with data COW and csum, btrfs can ensure the whole
> filesystem update is atomic (at least for a single device).
> So the title, especially the "major design flaw", couldn't be more
> wrong.

>> others will automatically kick out the misbehaving drive.
>> *none* of them will take back the drive with old data and start
>> commingling that data with a good copy. This behaviour from BTRFS is
>> completely abnormal, and defeats even the most basic expectations of
>> RAID.
>
> RAID1 can only tolerate 1 missing device; it has nothing to do with
> error detection.
> And it's impossible to detect such a case without extra help.
>
> Your expectation is completely wrong.

>> Well ... somehow it is my experience as well ... :)

Acceptable, but it doesn't really apply to software-based RAID1.

Thanks,
Qu

>>> I'm not the one who has to clear his expectations here.
[PATCH RFC] btrfs: Do extra device generation check at mount time
On 2018-06-28 10:36 AM, Adam Borowski wrote:
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case.
> The vast majority of data is fine, and extents that have been written
> to while a device is missing will either be placed elsewhere (if the
> filesystem knew it was degraded), or a read of one of the copies will
> notice a wrong checksum and automatically recover (if the device was
> still falsely believed to be good at write time).
>
> We currently don't have selective scrub yet, so resyncing such
> single-copy

That might not be the case. Though I don't really know the numbers myself, and repeating this is hearsay: crc32 is not infallible. One in so many billion errors will go undetected by it. In the case of a dropped device with write failures, when you *know* the data supposedly written to the disk is bad, re-syncing from the believed-good copy (so long as it passes checksum verification, of course) is the only way to be certain that the data is good. Otherwise, you can be left with a Schroedinger's bit somewhere (it's not 0 or 1, but both, depending on which device the filesystem is reading from at the time).
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 10:17 AM, Chris Murphy wrote:
> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.

I'm not sure this is the case. Even though writes failed to the disconnected device, btrfs seemed to keep on going as though it *were*. When the array was re-mounted with both devices (never mounted as degraded) and scrub was run, scrub took a *long* time fixing errors, at a whopping 3MB/s, and reported having fixed millions of them.
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On 2018-06-28 22:17, Christoph Anton Mitterer wrote:
> On Thu, 2018-06-28 at 22:09 +0800, Qu Wenruo wrote:
>>> [   72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-
>>> uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
>>> btrfs_update_device+0x1b2/0x1c0
>>
>> It looks like it's the old WARN_ON() for unaligned device size.
>> Would you please verify if it is the case?
>
> # blockdev --getsize64 /dev/sdb2 /dev/sda2
> 999131127296
> 999131127296
>
> Since getsize64 returns bytes and not sectors, I suppose it would need
> to be aligned to 1024 at the least?
>
> 999131127296 / 1024 = 975713991.5
>
> So it's not.

So it's the case.

>> If so, "btrfs rescue fix-device-size" should handle it pretty well.
>
> I guess this needs to be done with the fs unmounted?

Yep.

> Anything to consider since I have RAID1 (except for running it on both
> devices)?

Nothing special. Btrfs-progs will handle it pretty well.

> Also, it's a bit strange that this error never occurred before (though
> the btrfs-restore manpage says the kernel would check for this since
> 4.11).

Because the WARN_ON() is newly added.

> It would further be nice if btrfs check would warn about this.

Yep, the latest version will warn about it, and --repair can fix it too.

Thanks,
Qu
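The arithmetic in this exchange, spelled out: the WARN_ON fires because the device byte size is not a multiple of the 4KiB sector size, and rounding it down to the nearest multiple is what the fix conceptually amounts to (treat the exact on-disk fields that `btrfs rescue fix-device-size` updates as a detail outside this sketch):

```python
# Check the device size from the thread against 4KiB alignment and
# compute the rounded-down, aligned size.

SECTOR = 4096
size = 999131127296            # blockdev --getsize64 output from above

print(size % SECTOR)           # 3584 -> not 4KiB-aligned (not even 1KiB)

aligned = size - (size % SECTOR)
print(aligned)                 # 999131123712, the rounded-down size
assert aligned % SECTOR == 0 and aligned <= size
```

The remainder modulo 1024 is 512, which matches Christoph's observation that the size divided by 1024 ends in .5.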
Re: unsolvable technical issues?
On Wed, Jun 27, 2018 at 08:50:11PM +0200, waxhead wrote:
> Chris Murphy wrote:
>> On Thu, Jun 21, 2018 at 5:13 PM, waxhead wrote:
>>> According to this:
>>>
>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf
>>> Page 4, section 1.2
>>>
>>> It claims that BTRFS still has significant technical issues that may
>>> never be resolved.
>>> Could someone shed some light on exactly what these technical issues
>>> might be?! What are BTRFS's biggest technical problems?
>>
>> I think it's appropriate to file an issue and ask what they're
>> referring to. It very well might be use-case specific to Red Hat.
>> https://github.com/stratis-storage/stratis-storage.github.io/issues
>
> https://github.com/stratis-storage/stratis-storage.github.io/issues/1
>
> Apparently the author has toned down the wording a bit; this confirms
> that the claim was without basis and probably based on "popular myth".
> The document the PDF links to is not yet updated.

It's a company whose profits rely on users choosing it over anything that competes. Adding propaganda to a public document is a natural thing for them to do.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious. It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
>
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
>
> This means if one device went missing and re-appeared, even if its
> generation no longer matches the rest of the device pool, btrfs does
> nothing to it, but treats it as a normal good device.
>
> At least let's detect such generation mismatch and avoid mounting the
> fs.

Uhm, that'd be a nasty regression for the regular (no-nodatacow) case. The vast majority of the data is fine, and extents that have been written to while a device is missing will either be placed elsewhere (if the filesystem knew it was degraded), or a read of one of the copies will notice a wrong checksum and automatically recover (if the device was still falsely believed to be good at write time).

We currently don't have selective scrub yet, so resyncing such single-copy extents is costly, but 1. all will be fine if the data is read, and 2. it's possible to add such a smart resync in the future, far better than a write-intent bitmap can do.

To do the latter, we can note the last generation at which the filesystem was known to be fully coherent (ie, all devices were successfully flushed with no mysterious write failures), then run a selective scrub (perhaps even automatically) when the filesystem is no longer degraded. There's some extra complexity with 3- or 4-way RAID (multiple levels of degradation), but a single number would help even there.

But even currently, without the above not-yet-written recovery, it's reasonably safe to continue without scrub -- it's a case of running partially degraded when the bad copy is already known to be suspicious. For no-nodatacow data and metadata, that is.
> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error messages, they can only mount the fs
> using the "device" and "degraded" mount options (if possible), then
> replace the offending device to manually "rebuild" the fs.

As nodatacow already means "I don't care about this data, or have another
way of recovering it", I don't quite get why we would drop existing
auto-recovery for a common transient failure case.

If you're paranoid, perhaps some bit "this filesystem has some nodatacow
data on it" could warrant such a block, but it would still need to be
overridable _without_ a need for replace. There's also the problem that
systemd marks its journal nodatacow (despite it having infamously bad
handling of failures!), and too many distributions infect their default
installs with systemd, meaning such a bit would be on in most cases.

But why would I put all my other data at risk just because there's a
nodatacow file? There's a big difference between scrubbing when only a
few transactions' worth of data is suspicious and completely throwing
away a mostly-good replica to replace it from the now fully degraded
copy.

> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.

A generation check works well unless both devices were mounted elsewhere
and went through the exact same number of transactions; the real problem
is that nodatacow doesn't bump the generation number.

> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.

I wonder if there's some way to treat problematic nodatacow files as
degraded only?
Nodatacow misses most of btrfs' mechanisms, thus to get it done right
you'd need to pretty much copy all of md's logic, with a write-intent
bitmap or an equivalent.

Meow!
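To illustrate what md's write-intent bitmap buys you for nodatacow-style data, here is a toy shell sketch. This is purely illustrative -- the file names are hypothetical and this is nothing like md's actual on-disk bitmap format -- but it shows why, after a crash, only the regions that were in flight need a resync:

```shell
# Toy write-intent bitmap: before writing a region, mark it dirty in a
# bitmap file; clear the mark once both mirrors are flushed. After a
# crash, only regions still marked dirty need to be resynced.
BITMAP=$(mktemp)
: > "$BITMAP"

mark_dirty() { echo "$1" >> "$BITMAP"; }
mark_clean() { grep -v "^$1\$" "$BITMAP" > "$BITMAP.tmp"; mv "$BITMAP.tmp" "$BITMAP"; }

mark_dirty 42   # about to write chunk 42 to both mirrors
mark_dirty 43   # about to write chunk 43 to both mirrors
mark_clean 42   # chunk 42 confirmed flushed on both mirrors
# "crash" happens here: chunk 43 was still in flight
echo "chunks needing resync: $(tr '\n' ' ' < "$BITMAP")"
```

The cost is one extra (journaled) write per region transition, which is exactly the overhead nodatacow was meant to avoid -- hence the tension described above.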
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 28/06/2018 09:04, Qu Wenruo wrote: > Despite his incorrect expectation, btrfs indeed doesn't handle device > generation mismatch well. > > This means if one devices missed and re-appeared, even its generation > no longer matches with the rest device pool, btrfs does nothing to it, > but treat it as normal good device. > > At least let's detect such generation mismatch and avoid mounting the > fs. > Currently there is no automatic rebuild yet, which means if users find > device generation mismatch error message, they can only mount the fs > using "device" and "degraded" mount option (if possible), then replace > the offending device to manually "rebuild" the fs. > Yes. This is a long-standing issue, handling it this way is similar to what mdadm (software raid) also does. Please get this merged fast, don't get bogged down too much with integrating with Anand Jain's branch as this is a big issue and should get at least this basic mitigation asap. -Alberto
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
The problems are known with Btrfs raid1, but I think they bear repeating
because they are really not OK.

In the exact same described scenario -- a simple, clear-cut drop-off of a
member device, which then later clearly reappears (no transient failure)
-- both mdadm- and LVM-based raid1 would have re-added the missing device
and resynced it, because an internal bitmap is the default (on >100G
arrays for mdadm, and always with lvm), and only the new data would need
to be resynced. Both mdadm and lvm have metadata to know which drive has
stale data in this common scenario.

Btrfs does two, maybe three, bad things:

1. No automatic resync. This is net worse behavior than mdadm and lvm,
putting data at risk.

2. The new data goes into single chunks; even if the user does a manual
balance (resync), their data isn't replicated. They must know to do a
-dconvert balance to replicate the new data. Again, this is net worse
behavior than mdadm out of the box, putting user data at risk.

3. Apparently, if nodatacow, given a file with two copies of different
transid, Btrfs won't always pick the higher-transid copy? If true,
that's terrible, and again not at all what mdadm/lvm are doing.

Btrfs can do better because it has more information available to make
unambiguous decisions about data. But it needs to always do at least as
good a job as mdadm/lvm, and as reported, that didn't happen. So some
testing is needed, in particular of case #3 above with nodatacow. That's
a huge bug, if it's true.

Chris Murphy
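For reference, the manual recovery implied above boils down to two steps once the dropped device is back: a scrub (repairs stale datacow extents via csums) and a -dconvert balance (re-replicates the single chunks written while degraded). A dry-run sketch -- the mount point is hypothetical, and run() only echoes the commands, so nothing is executed:

```shell
# Dry-run sketch of manual resync after a temporary raid1 device drop.
# Swap the body of run() for "$@" to actually execute the commands.
run() { echo "+ $*"; }

MNT=/mnt/array   # hypothetical mount point of the re-assembled raid1

# 1. Rewrite stale/bad copies where csums exist (datacow files only):
run btrfs scrub start -B "$MNT"

# 2. Convert the single chunks written while degraded back to raid1
#    ("soft" skips chunks that already have the target profile):
run btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft "$MNT"
```

Note that neither step helps the nodatacow case #3: scrub has no csums to compare there, which is the whole point of this thread.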
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On Thu, 2018-06-28 at 22:09 +0800, Qu Wenruo wrote:
> > [ 72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-
> > uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
> > btrfs_update_device+0x1b2/0x1c0
>
> It looks like it's the old WARN_ON() for unaligned device size.
> Would you please verify if it is the case?

# blockdev --getsize64 /dev/sdb2 /dev/sda2
999131127296
999131127296

Since getsize64 returns bytes and not sectors, I suppose it would need to
be aligned to at least 1024?

999131127296 / 1024 = 975713991.5

So it's not.

> If so, "btrfs rescue fix-device-size" should handle it pretty well.

I guess this needs to be done with the fs unmounted? Anything to consider
since I have RAID1 (other than running it on both devices)?

Also, it's a bit strange that this error never occurred before (though
the btrfs-restore manpage says the kernel would check for this since
4.11). It would further be nice if btrfs-check would warn about this.

Thanks,
Chris.
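For what it's worth, the reported size also fails 4 KiB alignment, which is presumably what the WARN_ON checks (a quick arithmetic sketch; the 4096 figure is an assumption based on the default btrfs sectorsize, not something stated in this thread):

```shell
# Alignment check for the device size reported by blockdev --getsize64.
size=999131127296
echo "mod 1024: $((size % 1024))"   # remainder 512, matching the .5 above
echo "mod 4096: $((size % 4096))"   # remainder 3584, so not 4K-aligned
```

If that is indeed the trigger, "btrfs rescue fix-device-size" on the unmounted filesystem, as Qu suggests, is expected to round the recorded device size down to an aligned value.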
Re: call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
On 2018年06月28日 21:19, Christoph Anton Mitterer wrote:
> Hey.
>
> On a 4.16.16 kernel with a RAID 1 btrfs I got the following messages
> since today.
>
> Data seems still to be readable (correctly)... and there are no other
> errors (like SATA errors) in the kernel log.
>
> Any idea what these could mean?
>
> Thanks,
> Chris.
>
> [ 72.168662] WARNING: CPU: 0 PID: 242 at
> /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565
> btrfs_update_device+0x1b2/0x1c0

It looks like it's the old WARN_ON() for unaligned device size.
Would you please verify if it is the case?

If so, "btrfs rescue fix-device-size" should handle it pretty well.

Thanks,
Qu
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 2018年06月28日 21:26, Anand Jain wrote:
>
> On 06/28/2018 04:02 PM, Qu Wenruo wrote:
>> Also CC Anand Jain, since he is also working on various device related
>> work, and an expert on this.
>>
>> Especially I'm not pretty sure if such enhancement is already on his
>> agenda or there are already some unmerged patches for this.
>
> Right, some of the patches are already on the ML and probably need a
> revival. Unless there are new challenges, my target is to consolidate
> them and get them integrated this year. I am trying harder.
>
> I think it's a good idea to report the write-hole status to user-land
> instead of failing the mount, so that a script can trigger the
> re-silver as required by the use case. I introduced
> fs_devices::volume_flags, which is under review as of now, and have
> plans to add a volume degraded state bit which can be accessed via
> sysfs. So yes, this has been taken care of.

Glad to hear that!

However, I find it pretty hard to trace your latest work. Would you mind
sharing the git repo and branch you're working on? Maybe I could take
some time to do some review.

Thanks,
Qu

> Thanks, Anand
>
>> Thanks,
>> Qu
>>
>> On 2018年06月28日 15:04, Qu Wenruo wrote:
>>> There is a reporter considering btrfs raid1 has a major design flaw
>>> which can't handle nodatasum files.
>>>
>>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>>> generation mismatch well.
>>>
>>> This means if one device went missing and re-appeared, even if its
>>> generation no longer matches the rest of the device pool, btrfs does
>>> nothing about it, but treats it as a normal good device.
>>>
>>> At least let's detect such generation mismatch and avoid mounting the
>>> fs.
>>> Currently there is no automatic rebuild yet, which means if users find
>>> device generation mismatch error messages, they can only mount the fs
>>> using the "device" and "degraded" mount options (if possible), then
>>> replace the offending device to manually "rebuild" the fs.
>>>
>>> Signed-off-by: Qu Wenruo
>>> ---
>>> I totally understand that, generation based solution can't handle
>>> split-brain case (where 2 RAID1 devices get mounted degraded separately)
>>> at all, but at least let's handle what we can do.
>>>
>>> The best way to solve the problem is to make btrfs treat such lower gen
>>> devices as some kind of missing device, and queue an automatic scrub for
>>> that device.
>>> But that's a lot of extra work, at least let's start from detecting such
>>> problem first.
>>> ---
>>>  fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 50 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index e034ad9e23b4..80a7c44993bc 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>>>  					devid, uuid);
>>>  }
>>> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
>>> +				     struct btrfs_device *dev)
>>> +{
>>> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>>> +	struct btrfs_device *cur;
>>> +	bool warn_only = false;
>>> +	int ret = 0;
>>> +
>>> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * If we're not replaying log, we're completely safe to allow
>>> +	 * generation mismatch as it won't write anything to disks, nor
>>> +	 * remount to rw.
>>> +	 */
>>> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
>>> +		warn_only = true;
>>> +
>>> +	rcu_read_lock();
>>> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
>>> +		if (cur->generation && cur->generation != dev->generation) {
>>> +			if (warn_only) {
>>> +				btrfs_warn_rl_in_rcu(fs_info,
>>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>>> +						     dev->devid,
>>> +						     dev->generation,
>>> +						     cur->generation);
>>> +			} else {
>>> +				btrfs_err_rl_in_rcu(fs_info,
>>> +	"devid %llu has unexpected generation, has %llu expected %llu",
>>> +						    dev->devid,
>>> +						    dev->generation,
>>> +						    cur->generation);
>>> +				ret = -EINVAL;
>>> +				break;
>>> +			}
>>> +		}
>>> +	}
>>> +	rcu_read_unlock();
>>> +	return ret;
>>> +}
>>> +
>>>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>>  			  struct extent_buffer *leaf,
>>>  			  struct btrfs_chunk *chunk)
>>> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>>>  				return PTR_ERR(map->stripes[i].dev);
>>>  			}
>>>
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 06/28/2018 04:02 PM, Qu Wenruo wrote:

Also CC Anand Jain, since he is also working on various device related
work, and an expert on this. Especially I'm not pretty sure if such
enhancement is already on his agenda or there are already some unmerged
patches for this.

Right, some of the patches are already on the ML and probably need a
revival. Unless there are new challenges, my target is to consolidate
them and get them integrated this year. I am trying harder.

I think it's a good idea to report the write-hole status to user-land
instead of failing the mount, so that a script can trigger the re-silver
as required by the use case. I introduced fs_devices::volume_flags,
which is under review as of now, and have plans to add a volume degraded
state bit which can be accessed via sysfs. So yes, this has been taken
care of.

Thanks, Anand

Thanks,
Qu

On 2018年06月28日 15:04, Qu Wenruo wrote:

There is a reporter considering btrfs raid1 has a major design flaw
which can't handle nodatasum files.

Despite his incorrect expectation, btrfs indeed doesn't handle device
generation mismatch well.

This means if one device went missing and re-appeared, even if its
generation no longer matches the rest of the device pool, btrfs does
nothing about it, but treats it as a normal good device.

At least let's detect such generation mismatch and avoid mounting the
fs. Currently there is no automatic rebuild yet, which means if users
find device generation mismatch error messages, they can only mount the
fs using the "device" and "degraded" mount options (if possible), then
replace the offending device to manually "rebuild" the fs.

Signed-off-by: Qu Wenruo
---
I totally understand that, generation based solution can't handle
split-brain case (where 2 RAID1 devices get mounted degraded separately)
at all, but at least let's handle what we can do.

The best way to solve the problem is to make btrfs treat such lower gen
devices as some kind of missing device, and queue an automatic scrub for
that device.
But that's a lot of extra work, at least let's start from detecting such
problem first.
---
 fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..80a7c44993bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
 					devid, uuid);
 }
 
+static int verify_devices_generation(struct btrfs_fs_info *fs_info,
+				     struct btrfs_device *dev)
+{
+	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+	struct btrfs_device *cur;
+	bool warn_only = false;
+	int ret = 0;
+
+	if (!fs_devices || fs_devices->seeding || !dev->generation)
+		return 0;
+
+	/*
+	 * If we're not replaying log, we're completely safe to allow
+	 * generation mismatch as it won't write anything to disks, nor
+	 * remount to rw.
+	 */
+	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+		warn_only = true;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
+		if (cur->generation && cur->generation != dev->generation) {
+			if (warn_only) {
+				btrfs_warn_rl_in_rcu(fs_info,
+	"devid %llu has unexpected generation, has %llu expected %llu",
+						     dev->devid,
+						     dev->generation,
+						     cur->generation);
+			} else {
+				btrfs_err_rl_in_rcu(fs_info,
+	"devid %llu has unexpected generation, has %llu expected %llu",
+						    dev->devid,
+						    dev->generation,
+						    cur->generation);
+				ret = -EINVAL;
+				break;
+			}
+		}
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			  struct extent_buffer *leaf,
 			  struct btrfs_chunk *chunk)
@@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 				return PTR_ERR(map->stripes[i].dev);
 			}
 			btrfs_report_missing_device(fs_info, devid, uuid, false);
+		} else {
+			ret = verify_devices_generation(fs_info,
+							map->stripes[i].dev);
+			if (ret < 0) {
+
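Outside the kernel, the logic of verify_devices_generation() above can be mimicked with a toy shell model. This is only a sketch: the files standing in for per-device superblocks are hypothetical, and "strict" vs "warn" mode corresponds to a normal mount vs the NOLOGREPLAY case in the patch:

```shell
#!/bin/bash
# Toy model of the generation check: each "device" is a file whose
# content stands in for the superblock generation number.
check_generations() {
    local mode=$1; shift
    local expected="" gen ret=0
    for dev in "$@"; do
        gen=$(cat "$dev")
        [ -z "$expected" ] && expected=$gen
        if [ "$gen" != "$expected" ]; then
            if [ "$mode" = "warn" ]; then
                echo "warning: $dev has generation $gen, expected $expected"
            else
                echo "error: $dev has generation $gen, expected $expected"
                ret=1
            fi
        fi
    done
    return $ret
}

d=$(mktemp -d)
echo 100 > "$d/devA"   # up-to-date device
echo 97  > "$d/devB"   # stale device that re-appeared
check_generations strict "$d/devA" "$d/devB" || echo "refusing to mount"
```

As the thread notes, this catches the common re-appearing-device case but not a split brain where both halves accumulated the same number of transactions, nor nodatacow writes that never bump the generation.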
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 06/28/2018 09:42 AM, Remi Gauvin wrote:

There seems to be a major design flaw with BTRFS that needs to be better
documented, to avoid massive data loss. Tested with Raid 1 on Ubuntu,
kernel 4.15. The use case being tested was a Virtualbox VDI file created
with the NODATACOW attribute (as is often suggested, due to the painful
performance penalty of COW on these files).

However, if a device is temporarily dropped (in this case, tested by
disconnecting drives) and re-connects automatically on the next boot,
BTRFS does not in any way synchronize the VDI file, or have any means to
know that one of the copies is out of date and bad. The result of trying
to use said VDI file is interestingly insane. Scrub did not do anything
to rectify the situation.

Please use balance to rectify, as it's RAID1: while one of the devices
was missing, we wrote single chunks.

Thanks, Anand
call trace: WARNING: at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device
Hey. On a 4.16.16 kernel with a RAID 1 btrfs I got the following messages since today. Data seems still to be readable (correctly)... and there are no other errors (like SATA errors) in the kernel log. Any idea what these could mean? Thanks, Chris. [ 72.168662] WARNING: CPU: 0 PID: 242 at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device+0x1b2/0x1c0 [btrfs] [ 72.168701] Modules linked in: cpufreq_userspace cpufreq_powersave cpufreq_conservative snd_hda_codec_hdmi ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment xt_tcpudp nf_conntrack_ipv4 powernow_k8 nf_defrag_ipv4 edac_mce_amd snd_hda_intel kvm_amd snd_hda_codec ccp rng_core snd_hda_core kvm snd_hwdep irqbypass snd_pcm wmi_bmof radeon snd_timer ttm xt_multiport snd pcspkr soundcore drm_kms_helper k8temp ohci_pci ata_generic pata_atiixp ohci_hcd ehci_pci sg wmi xt_conntrack drm nf_conntrack i2c_algo_bit ehci_hcd usbcore button sp5100_tco usb_common shpchp i2c_piix4 iptable_filter binfmt_misc sunrpc hwmon_vid ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash sd_mod raid10 raid456 async_raid6_recov [ 72.168776] async_memcpy async_pq async_xor async_tx libcrc32c crc32c_generic xor raid6_pq raid1 raid0 multipath linear md_mod evdev ahci libahci serio_raw libata r8169 mii scsi_mod [ 72.168820] CPU: 0 PID: 242 Comm: btrfs-cleaner Not tainted 4.16.0-2-amd64 #1 Debian 4.16.16-2 [ 72.168852] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7551/KA780G (MS-7551), BIOS V16.6 05/12/2010 [ 72.168907] RIP: 0010:btrfs_update_device+0x1b2/0x1c0 [btrfs] [ 72.168939] RSP: 0018:bd5a810a3d60 EFLAGS: 00010206 [ 72.168973] RAX: 0fff RBX: 938e847f8000 RCX: 00e8a0db1e00 [ 72.169006] RDX: 1000 RSI: 3f5c RDI: 938e7a8015e0 [ 72.169040] RBP: 938e8fb97a00 R08: bd5a810a3d10 R09: bd5a810a3d18 [ 72.169073] R10: 0003 R11: 3000 R12: [ 72.169106] R13: 3f3c R14: 938e7a8015e0 R15: 938e8f0c6328 [ 72.169140] FS: () 
GS:938e9dc0() knlGS: [ 72.169177] CS: 0010 DS: ES: CR0: 80050033 [ 72.169210] CR2: 7fcff92ce000 CR3: 00020575e000 CR4: 06f0 [ 72.169243] Call Trace: [ 72.169304] btrfs_remove_chunk+0x2a9/0x8c0 [btrfs] [ 72.169359] btrfs_delete_unused_bgs+0x323/0x3f0 [btrfs] [ 72.169415] ? __btree_submit_bio_start+0x20/0x20 [btrfs] [ 72.169469] cleaner_kthread+0x152/0x160 [btrfs] [ 72.169506] kthread+0x113/0x130 [ 72.169540] ? kthread_create_worker_on_cpu+0x70/0x70 [ 72.169575] ? SyS_exit_group+0x10/0x10 [ 72.169610] ret_from_fork+0x35/0x40 [ 72.169643] Code: 4c 89 f7 45 31 c0 ba 10 00 00 00 4c 89 ee e8 16 23 ff ff 4c 89 f7 e8 9e ef fc ff e9 de fe ff ff 41 bc f4 ff ff ff e9 db fe ff ff <0f> 0b eb b7 e8 85 4c 1a c5 0f 1f 44 00 00 66 66 66 66 90 41 55 [ 72.169705] ---[ end trace ed549af9d9cf6190 ]--- [ 72.170009] WARNING: CPU: 0 PID: 242 at /build/linux-uwVqDp/linux-4.16.16/fs/btrfs/ctree.h:1565 btrfs_update_device+0x1b2/0x1c0 [btrfs] [ 72.170050] Modules linked in: cpufreq_userspace cpufreq_powersave cpufreq_conservative snd_hda_codec_hdmi ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment xt_tcpudp nf_conntrack_ipv4 powernow_k8 nf_defrag_ipv4 edac_mce_amd snd_hda_intel kvm_amd snd_hda_codec ccp rng_core snd_hda_core kvm snd_hwdep irqbypass snd_pcm wmi_bmof radeon snd_timer ttm xt_multiport snd pcspkr soundcore drm_kms_helper k8temp ohci_pci ata_generic pata_atiixp ohci_hcd ehci_pci sg wmi xt_conntrack drm nf_conntrack i2c_algo_bit ehci_hcd usbcore button sp5100_tco usb_common shpchp i2c_piix4 iptable_filter binfmt_misc sunrpc hwmon_vid ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash sd_mod raid10 raid456 async_raid6_recov [ 72.170152] async_memcpy async_pq async_xor async_tx libcrc32c crc32c_generic xor raid6_pq raid1 raid0 multipath linear md_mod evdev ahci libahci serio_raw libata r8169 mii scsi_mod [ 72.170204] CPU: 0 PID: 242 Comm: btrfs-cleaner Tainted: GW 4.16.0-2-amd64 #1 
Debian 4.16.16-2 [ 72.170241] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7551/KA780G (MS-7551), BIOS V16.6 05/12/2010 [ 72.170300] RIP: 0010:btrfs_update_device+0x1b2/0x1c0 [btrfs] [ 72.170333] RSP: 0018:bd5a810a3d60 EFLAGS: 00010206 [ 72.170367] RAX: 0fff RBX: 938e847f8000 RCX: 00e8a0db1e00 [ 72.170401] RDX: 1000 RSI: 3f5c RDI: 938e7a8015e0 [ 72.170434] RBP: 938e8fb97a00 R08: bd5a810a3d10 R09: bd5a810a3d18 [ 72.170468] R10: 0003 R11: 3000 R12: [ 72.170501] R13: 3f3c R14: 938e7a8015e0 R15: 938e8f0c6328 [
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 07:46, Qu Wenruo wrote: On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote: On 2018-06-28 05:15, Qu Wenruo wrote: On 2018年06月28日 16:16, Andrei Borzenkov wrote: On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: On 2018年06月28日 11:14, r...@georgianit.com wrote: On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: Please get yourself clear of what other raid1 is doing. A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically, WOW, that's black magic, at least for RAID1. The whole RAID1 has no idea of which copy is correct unlike btrfs who has datasum. Don't bother other things, just tell me how to determine which one is correct? When one drive fails, it is recorded in meta-data on remaining drives; probably configuration generation number is increased. Next time drive with older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks. Yep, the only possible way to determine such case is from external info. For device generation, it's possible to enhance btrfs, but at least we could start from detect and refuse to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, hardware RAID controller seems to be the only solution as other software may have the same problem. LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too. And the hardware solution looks pretty interesting, is the write to NVRAM 100% atomic? Even at power loss? 
On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss. Well, that explains a lot of thing. So hardware RAID controller can be considered having a special battery (always atomic) journal device. If we can't provide UPS for the whole system, a battery powered journal device indeed makes sense. The only possibility is that, the misbehaved device missed several super block update so we have a chance to detect it's out-of-date. But that's not always working. Why it should not work as long as any write to array is suspended until superblock on remaining devices is updated? What happens if there is no generation gap in device superblock? If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached super block update, there is no difference in device superblock, thus no way to detect which is correct. Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this). If you're talking about missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the brain-split case). That's different. Yes, with software-based raid there is usually no way to detect outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data. While for VDI case (or any VM image file format other than raw), older valid data normally means corruption. Unless they have their own write-ahead log. 
Some file formats may detect such problems by themselves if they have an
internal checksum, but anyway, older data normally means corruption,
especially when it's partially new and partially old. On the other hand,
with data COW and csum, btrfs can ensure the whole filesystem update is
atomic (at least for a single device). So the title, especially the
"major design flaw" part, couldn't be more wrong.

The title is excessive, but I'd agree it's a design flaw that BTRFS
doesn't at least notice that the generation IDs are different and
preferentially trust the device with the newer generation ID.

Well, a design flaw should be something that can't be easily fixed
without a *huge* on-disk format or behavior change. A flaw in btrfs'
one-subvolume-per-tree metadata design or the current extent booking
behavior could be called a design flaw.

That would be a structural design flaw; it's a result of how the
software is structured. There are other types of design flaws though.

While for things like this, just as with the submitted RFC patch, less
than 100 lines can change the behavior. I would still consider this case
a design flaw (a purely behavioral one not tied to how
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote: > On 2018-06-28 05:15, Qu Wenruo wrote: >> >> >> On 2018年06月28日 16:16, Andrei Borzenkov wrote: >>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo >>> wrote: On 2018年06月28日 11:14, r...@georgianit.com wrote: > > > On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: > >> >> Please get yourself clear of what other raid1 is doing. > > A drive failure, where the drive is still there when the computer > reboots, is a situation that *any* raid 1, (or for that matter, > raid 5, raid 6, anything but raid 0) will recover from perfectly > without raising a sweat. Some will rebuild the array automatically, WOW, that's black magic, at least for RAID1. The whole RAID1 has no idea of which copy is correct unlike btrfs who has datasum. Don't bother other things, just tell me how to determine which one is correct? >>> >>> When one drive fails, it is recorded in meta-data on remaining drives; >>> probably configuration generation number is increased. Next time drive >>> with older generation is not incorporated. Hardware controllers also >>> keep this information in NVRAM and so do not even depend on scanning >>> of other disks. >> >> Yep, the only possible way to determine such case is from external info. >> >> For device generation, it's possible to enhance btrfs, but at least we >> could start from detect and refuse to RW mount to avoid possible further >> corruption. >> But anyway, if one really cares about such case, hardware RAID >> controller seems to be the only solution as other software may have the >> same problem. > LVM doesn't. It detects that one of the devices was gone for some > period of time and marks the volume as degraded (and _might_, depending > on how you have things configured, automatically re-sync). Not sure > about MD, but I am willing to bet it properly detects this type of > situation too. >> >> And the hardware solution looks pretty interesting, is the write to >> NVRAM 100% atomic? Even at power loss? 
> On a proper RAID controller, it's battery backed, and that battery > backing provides enough power to also make sure that the state change is > properly recorded in the event of power loss. Well, that explains a lot of thing. So hardware RAID controller can be considered having a special battery (always atomic) journal device. If we can't provide UPS for the whole system, a battery powered journal device indeed makes sense. >> >>> The only possibility is that, the misbehaved device missed several super block update so we have a chance to detect it's out-of-date. But that's not always working. >>> >>> Why it should not work as long as any write to array is suspended >>> until superblock on remaining devices is updated? >> >> What happens if there is no generation gap in device superblock? >> >> If one device got some of its (nodatacow) data written to disk, while >> the other device doesn't get data written, and neither of them reached >> super block update, there is no difference in device superblock, thus no >> way to detect which is correct. > Yes, but that should be a very small window (at least, once we finally > quit serializing writes across devices), and it's a problem on existing > RAID1 implementations too (and therefore isn't something we should be > using as an excuse for not doing this). >> >>> If you're talking about missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the brain-split case). >>> >>> That's different. Yes, with software-based raid there is usually no >>> way to detect outdated copy if no other copies are present. Having >>> older valid data is still very different from corrupting newer data. >> >> While for VDI case (or any VM image file format other than raw), older >> valid data normally means corruption. >> Unless they have their own write-ahead log. 
>>
>> Some file formats may detect such problems by themselves if they have
>> an internal checksum, but anyway, older data normally means corruption,
>> especially when it's partially new and partially old.
>>
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for a single device).
>> So the title, especially the "major design flaw" part, couldn't be
>> more wrong.
> The title is excessive, but I'd agree it's a design flaw that BTRFS
> doesn't at least notice that the generation ID's are different and
> preferentially trust the device with the newer generation ID.

Well, a design flaw should be something that can't be easily fixed
without a *huge* on-disk format or behavior change.

A flaw in btrfs' one-subvolume-per-tree metadata design or the current
extent booking behavior could be called a design flaw.

While for things like this, just as with the submitted RFC
Re: [GIT PULL] Btrfs updates for 4.18
On 06/12/2018 12:16 AM, David Sterba wrote:

On Mon, Jun 11, 2018 at 10:50:54AM +0100, Filipe Manana wrote:

btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices

 * the mutex can be very coarse and can cover long-running operations
 * protects: updates to fs_devices counters like missing devices, rw
   devices, seeding, structure cloning, openning/closing devices at
   mount/umount time

generates some confusion, since btrfs_open_devices(), after that commit,
no longer takes the uuid_mutex yet updates some fs_devices counters
(opened, open_devices, etc.). As uuid_mutex is a global fs_uuids lock,
using it for the per-fsid operations doesn't make any sense.

This problem is reproducible only on for-4.18; misc-next is fine. I am
looking deeper.

What about the unprotected updates (increments) to fs_devices->opened
and fs_devices->open_devices? Other functions access/update them while
holding the uuid_mutex.

The goal is to reduce usage of uuid_mutex to only protect search or
update of the fs_uuids list; everything else should be protected by the
device_list_mutex. The commit 542c5908abfe84f7 (use device_list_mutex in
btrfs_open_devices) implements that, but then the access to the ->opened
member is not protected consistently. There are patches that convert the
use to device_list_mutex but they haven't been merged due to refinements
or pending review.

At this point I think we should revert the one commit 542c5908abfe84f7,
as it introduces the locking problems, and revisit the whole fs_devices
locking scheme again in the next dev cycle. That will be post rc1, as
there might be more to revert.

I tried to narrow this down; it appears some of the things that the
circular locking dependency check reports don't make sense. Below is
what I've found so far.

The test case btrfs/004 can be simplified to the following, which also
reproduces the problem.

-8<-
$ cat 165
#! /bin/bash
# FS QA Test No. btrfs/165
#
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"

here=`pwd`
tmp=/tmp/$$
status=1
noise_pid=0

_cleanup()
{
	wait
	rm -f $tmp.*
}
trap "_cleanup; exit \$status" 0 1 2 3 15

# get standard environment, filters and checks
. ./common/rc
. ./common/filter

# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch

rm -f $seqres.full

run_check _scratch_mkfs_sized $((2000 * 1024 * 1024))
run_check _scratch_mount
run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 1 -n 2000 $FSSTRESS_AVOID
run_check _scratch_unmount

echo "done"
status=0
exit
-8<-

The circular locking dependency warning occurs at FSSTRESS_PROG, in
particular at doproc() in xfstests/ltp/fsstress.c, randomly at any of
the commands in opdesc_tops[] = { .. } which involve calling the mmap
file operation when there is something to commit. The commit transaction
does need the device_list_mutex, which is also what commit
542c5908abfe84f7 uses for btrfs_open_devices(). But btrfs_open_devices()
is only called at mount, and an mmap() mapping can only be established
after the mount has completed. Given this, it's unclear to me why the
circular locking dependency check is warning about this.

I feel that until we have clarity about this, and also solve the other
problems related to the streamlining of uuid_mutex, we should revert
542c5908abfe84f7. Sorry for the inconvenience.

Thanks, Anand
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 05:15, Qu Wenruo wrote: On 2018年06月28日 16:16, Andrei Borzenkov wrote: On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: On 2018年06月28日 11:14, r...@georgianit.com wrote: On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: Please get yourself clear of what other raid1 is doing. A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically, WOW, that's black magic, at least for RAID1. The whole RAID1 has no idea of which copy is correct unlike btrfs who has datasum. Don't bother other things, just tell me how to determine which one is correct? When one drive fails, it is recorded in meta-data on remaining drives; probably configuration generation number is increased. Next time drive with older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks. Yep, the only possible way to determine such case is from external info. For device generation, it's possible to enhance btrfs, but at least we could start from detect and refuse to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, hardware RAID controller seems to be the only solution as other software may have the same problem. LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too. And the hardware solution looks pretty interesting, is the write to NVRAM 100% atomic? Even at power loss? 
On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss. The only possibility is that, the misbehaved device missed several super block update so we have a chance to detect it's out-of-date. But that's not always working. Why it should not work as long as any write to array is suspended until superblock on remaining devices is updated? What happens if there is no generation gap in device superblock? If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached super block update, there is no difference in device superblock, thus no way to detect which is correct. Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this). If you're talking about missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the brain-split case). That's different. Yes, with software-based raid there is usually no way to detect outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data. While for VDI case (or any VM image file format other than raw), older valid data normally means corruption. Unless they have their own write-ahead log. Some file format may detect such problem by themselves if they have internal checksum, but anyway, older data normally means corruption, especially when partial new and partial old. On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for single device). 
So the title, especially the "major design flaw", couldn't be more wrong. The title is excessive, but I'd agree it's a design flaw that BTRFS doesn't at least notice that the generation IDs are different and preferentially trust the device with the newer generation ID. The only special handling I can see that would be needed is around volumes mounted with the `nodatacow` option, which may not see generation changes for a very long time otherwise. others will automatically kick out the misbehaving drive. *none* of them will take back the drive with old data and start commingling that data with the good copy. This behaviour from BTRFS is completely abnormal.. and defeats even the most basic expectations of RAID. RAID1 can only tolerate 1 missing device, it has nothing to do with error detection. And it's impossible to detect such case without extra help. Your expectation is completely wrong. Well ... somehow it is my experience as well ... :) Acceptable, but it does not really apply to software-based RAID1. Thanks, Qu I'm not the one who has to clear his expectations here.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月28日 16:16, Andrei Borzenkov wrote: > On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: >> >> >> On 2018年06月28日 11:14, r...@georgianit.com wrote: >>> >>> >>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: >>> Please get yourself clear of what other raid1 is doing. >>> >>> A drive failure, where the drive is still there when the computer reboots, >>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, >>> anything but raid 0) will recover from perfectly without raising a sweat. >>> Some will rebuild the array automatically, >> >> WOW, that's black magic, at least for RAID1. >> The whole RAID1 has no idea of which copy is correct unlike btrfs who >> has datasum. >> >> Don't bother other things, just tell me how to determine which one is >> correct? >> > > When one drive fails, it is recorded in meta-data on remaining drives; > probably configuration generation number is increased. Next time drive > with older generation is not incorporated. Hardware controllers also > keep this information in NVRAM and so do not even depend on scanning > of other disks. Yep, the only possible way to determine such case is from external info. For device generation, it's possible to enhance btrfs, but at least we could start from detect and refuse to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, hardware RAID controller seems to be the only solution as other software may have the same problem. And the hardware solution looks pretty interesting, is the write to NVRAM 100% atomic? Even at power loss? > >> The only possibility is that, the misbehaved device missed several super >> block update so we have a chance to detect it's out-of-date. >> But that's not always working. >> > > Why it should not work as long as any write to array is suspended > until superblock on remaining devices is updated? What happens if there is no generation gap in device superblock? 
If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached super block update, there is no difference in device superblock, thus no way to detect which is correct. > >> If you're talking about missing generation check for btrfs, that's >> valid, but it's far from a "major design flaw", as there are a lot of >> cases where other RAID1 (mdraid or LVM mirrored) can also be affected >> (the brain-split case). >> > > That's different. Yes, with software-based raid there is usually no > way to detect outdated copy if no other copies are present. Having > older valid data is still very different from corrupting newer data. While for VDI case (or any VM image file format other than raw), older valid data normally means corruption. Unless they have their own write-ahead log. Some file format may detect such problem by themselves if they have internal checksum, but anyway, older data normally means corruption, especially when partial new and partial old. On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for single device). So the title, especially the "major design flaw", couldn't be more wrong. > >>> others will automatically kick out the misbehaving drive. *none* of them >>> will take back the drive with old data and start commingling that data >>> with the good copy. This behaviour from BTRFS is completely abnormal.. and >>> defeats even the most basic expectations of RAID. >> >> RAID1 can only tolerate 1 missing device, it has nothing to do with >> error detection. >> And it's impossible to detect such case without extra help. >> >> Your expectation is completely wrong. >> > > Well ... somehow it is my experience as well ... :) Acceptable, but it does not really apply to software-based RAID1. Thanks, Qu > >>> >>> I'm not the one who has to clear his expectations here. 
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 11:16 AM, Andrei Borzenkov wrote: > On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: >> >> >> On 2018年06月28日 11:14, r...@georgianit.com wrote: >>> >>> >>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: >>> Please get yourself clear of what other raid1 is doing. >>> >>> A drive failure, where the drive is still there when the computer reboots, >>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, >>> anything but raid 0) will recover from perfectly without raising a sweat. >>> Some will rebuild the array automatically, >> >> WOW, that's black magic, at least for RAID1. >> The whole RAID1 has no idea of which copy is correct unlike btrfs who >> has datasum. >> >> Don't bother other things, just tell me how to determine which one is >> correct? >> > > When one drive fails, it is recorded in meta-data on remaining drives; > probably configuration generation number is increased. Next time drive > with older generation is not incorporated. Hardware controllers also > keep this information in NVRAM and so do not even depend on scanning > of other disks. > >> The only possibility is that, the misbehaved device missed several super >> block update so we have a chance to detect it's out-of-date. >> But that's not always working. >> > > Why it should not work as long as any write to array is suspended > until superblock on remaining devices is updated? > >> If you're talking about missing generation check for btrfs, that's >> valid, but it's far from a "major design flaw", as there are a lot of >> cases where other RAID1 (mdraid or LVM mirrored) can also be affected >> (the brain-split case). >> > > That's different. Yes, with software-based raid there is usually no > way to detect outdated copy if no other copies are present. Having > older valid data is still very different from corrupting newer data. > >>> others will automatically kick out the misbehaving drive. 
*none* of them >>> will take back the drive with old data and start commingling that data >>> with the good copy. This behaviour from BTRFS is completely abnormal.. and >>> defeats even the most basic expectations of RAID. >> >> RAID1 can only tolerate 1 missing device, it has nothing to do with >> error detection. >> And it's impossible to detect such case without extra help. >> >> Your expectation is completely wrong. >> > > Well ... somehow it is my experience as well ... :) s/experience/expectation/ sorry. > >>> >>> I'm not the one who has to clear his expectations here.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote: > > > On 2018年06月28日 11:14, r...@georgianit.com wrote: >> >> >> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote: >> >>> >>> Please get yourself clear of what other raid1 is doing. >> >> A drive failure, where the drive is still there when the computer reboots, >> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, >> anything but raid 0) will recover from perfectly without raising a sweat. >> Some will rebuild the array automatically, > > WOW, that's black magic, at least for RAID1. > The whole RAID1 has no idea of which copy is correct unlike btrfs who > has datasum. > > Don't bother other things, just tell me how to determine which one is > correct? > When one drive fails, it is recorded in meta-data on remaining drives; probably configuration generation number is increased. Next time drive with older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks. > The only possibility is that, the misbehaved device missed several super > block update so we have a chance to detect it's out-of-date. > But that's not always working. > Why it should not work as long as any write to array is suspended until superblock on remaining devices is updated? > If you're talking about missing generation check for btrfs, that's > valid, but it's far from a "major design flaw", as there are a lot of > cases where other RAID1 (mdraid or LVM mirrored) can also be affected > (the brain-split case). > That's different. Yes, with software-based raid there is usually no way to detect outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data. >> others will automatically kick out the misbehaving drive. *none* of them >> will take back the the drive with old data and start commingling that data >> with good copy.)\ This behaviour from BTRFS is completely abnormal.. 
and >> defeats even the most basic expectations of RAID. > > RAID1 can only tolerate 1 missing device, it has nothing to do with > error detection. > And it's impossible to detect such case without extra help. > > Your expectation is completely wrong. > Well ... somehow it is my experience as well ... :) >> >> I'm not the one who has to clear his expectations here.
Re: Enabling quota may not correctly rescan on 4.17
On 2018/06/28 16:12, Qu Wenruo wrote: > On 2018年06月27日 16:25, Misono Tomohiro wrote: >> On 2018/06/27 17:10, Qu Wenruo wrote: >>> On 2018年06月26日 14:00, Misono Tomohiro wrote: Hello Nikolay, I noticed that commit 5d23515be669 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable") in 4.17 sometimes causes quota to not be rescanned correctly when quota is enabled.

Simple reproducer:

$ mkfs.btrfs -f $DEV
$ mount $DEV /mnt
$ dd if=/dev/urandom of=/mnt/file bs=1000 count=1000
$ btrfs quota enable /mnt
$ umount /mnt
$ btrfs check $DEV
...

> Unfortunately in my environment, btrfs/114 failed to reproduce it in 1024 runs overnight, with a v4.18-rc1 kernel.
> Would you please provide the whole btrfs-image dump of the corrupted fs?

Yes. The attached file is an image dump of the above reproducer (kernel 4.17.0, progs 4.17), as the dump of btrfs/114 is a bit large for mail. Though this does not always happen, I see the failure on both 4.17.0 and 4.18-rc2. Thanks, Tomohiro Misono

> There are several different assumptions on how the bug happens; with your btrfs-image dump, it would help a lot to rule out some of them.
> Thanks, Qu

btrfs-image Description: Binary data
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
Also CC Anand Jain, since he is also working on various device related work, and an expert on this. In particular, I'm not sure whether such an enhancement is already on his agenda, or whether there are already unmerged patches for this. Thanks, Qu On 2018年06月28日 15:04, Qu Wenruo wrote: > There is a reporter considering btrfs raid1 has a major design flaw > which can't handle nodatasum files. > > Despite his incorrect expectation, btrfs indeed doesn't handle device > generation mismatch well. > > This means if one devices missed and re-appeared, even its generation > no longer matches with the rest device pool, btrfs does nothing to it, > but treat it as normal good device. > > At least let's detect such generation mismatch and avoid mounting the > fs. > Currently there is no automatic rebuild yet, which means if users find > device generation mismatch error message, they can only mount the fs > using "device" and "degraded" mount option (if possible), then replace > the offending device to manually "rebuild" the fs. > > Signed-off-by: Qu Wenruo > --- > I totally understand that, generation based solution can't handle > split-brain case (where 2 RAID1 devices get mounted degraded separately) > at all, but at least let's handle what we can do. > > The best way to solve the problem is to make btrfs treat such lower gen > devices as some kind of missing device, and queue an automatic scrub for > that device. > But that's a lot of extra work, at least let's start from detecting such > problem first. 
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>  			devid, uuid);
>  }
>
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +		"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +		"devid %llu has unexpected generation, has %llu expected %llu",
> +						    dev->devid,
> +						    dev->generation,
> +						    cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			  struct extent_buffer *leaf,
>  			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  				return PTR_ERR(map->stripes[i].dev);
>  			}
>  			btrfs_report_missing_device(fs_info, devid, uuid, false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}
>  		}
>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>  			&(map->stripes[i].dev->dev_state));
RE: [PATCH RFC] btrfs: Do extra device generation check at mount time
> -Original Message- > From: linux-btrfs-ow...@vger.kernel.org On Behalf Of Qu Wenruo > Sent: Thursday, 28 June 2018 5:16 PM > To: Nikolay Borisov ; Qu Wenruo ; > linux-btrfs@vger.kernel.org > Subject: Re: [PATCH RFC] btrfs: Do extra device generation check at mount > time > > > > On 2018年06月28日 15:06, Nikolay Borisov wrote: > > > > > > On 28.06.2018 10:04, Qu Wenruo wrote: > >> There is a reporter considering btrfs raid1 has a major design flaw > >> which can't handle nodatasum files. > >> > >> Despite his incorrect expectation, btrfs indeed doesn't handle device > >> generation mismatch well. > >> > >> This means if one devices missed and re-appeared, even its generation > >> no longer matches with the rest device pool, btrfs does nothing to > >> it, but treat it as normal good device. > >> > >> At least let's detect such generation mismatch and avoid mounting the > >> fs. > >> Currently there is no automatic rebuild yet, which means if users > >> find device generation mismatch error message, they can only mount > >> the fs using "device" and "degraded" mount option (if possible), then > >> replace the offending device to manually "rebuild" the fs. > >> > >> Signed-off-by: Qu Wenruo > > > > I think a testcase of this functionality is important as well. > > It's currently an RFC patch, a test case will come along with the final version. > > I'd like to make sure everyone, including developers and end-users, is fine > with the strict error-out behavior. I've been bitten by this before and was most surprised the first time it happened. I had assumed that of course btrfs would check such a thing before mounting. Refusing to mount is a great first step, auto scrub is even better, and only "scrubbing" files with incorrect generation is better yet. Paul.
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 2018年06月28日 15:17, Su Yue wrote: > > > On 06/28/2018 03:04 PM, Qu Wenruo wrote: >> There is a reporter considering btrfs raid1 has a major design flaw >> which can't handle nodatasum files. >> >> Despite his incorrect expectation, btrfs indeed doesn't handle device >> generation mismatch well. >> > Just say " btrfs indeed doesn't handle device generation mismatch well." > is enough. > Otherwise on someday someone may be confused about who is the reporter > and what was reported. I'm just a little over-reacting to that rude reporter. Will definitely change that part of commit message in final version. Thanks, Qu > > Thanks, > Su >> This means if one devices missed and re-appeared, even its generation >> no longer matches with the rest device pool, btrfs does nothing to it, >> but treat it as normal good device. >> >> At least let's detect such generation mismatch and avoid mounting the >> fs. >> Currently there is no automatic rebuild yet, which means if users find >> device generation mismatch error message, they can only mount the fs >> using "device" and "degraded" mount option (if possible), then replace >> the offending device to manually "rebuild" the fs. >> >> Signed-off-by: Qu Wenruo >> --- >> I totally understand that, generation based solution can't handle >> split-brain case (where 2 RAID1 devices get mounted degraded separately) >> at all, but at least let's handle what we can do. >> >> The best way to solve the problem is to make btrfs treat such lower gen >> devices as some kind of missing device, and queue an automatic scrub for >> that device. >> But that's a lot of extra work, at least let's start from detecting such >> problem first. 
>> --- >> fs/btrfs/volumes.c | 50 ++ >> 1 file changed, 50 insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index e034ad9e23b4..80a7c44993bc 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct >> btrfs_fs_info *fs_info, >> devid, uuid); >> } >> +static int verify_devices_generation(struct btrfs_fs_info *fs_info, >> + struct btrfs_device *dev) >> +{ >> + struct btrfs_fs_devices *fs_devices = dev->fs_devices; >> + struct btrfs_device *cur; >> + bool warn_only = false; >> + int ret = 0; >> + >> + if (!fs_devices || fs_devices->seeding || !dev->generation) >> + return 0; >> + >> + /* >> + * If we're not replaying log, we're completely safe to allow >> + * generation mismatch as it won't write anything to disks, nor >> + * remount to rw. >> + */ >> + if (btrfs_test_opt(fs_info, NOLOGREPLAY)) >> + warn_only = true; >> + >> + rcu_read_lock(); >> + list_for_each_entry_rcu(cur, _devices->devices, dev_list) { >> + if (cur->generation && cur->generation != dev->generation) { >> + if (warn_only) { >> + btrfs_warn_rl_in_rcu(fs_info, >> + "devid %llu has unexpected generation, has %llu expected %llu", >> + dev->devid, >> + dev->generation, >> + cur->generation); >> + } else { >> + btrfs_err_rl_in_rcu(fs_info, >> + "devid %llu has unexpected generation, has %llu expected %llu", >> + dev->devid, >> + dev->generation, >> + cur->generation); >> + ret = -EINVAL; >> + break; >> + } >> + } >> + } >> + rcu_read_unlock(); >> + return ret; >> +} >> + >> static int read_one_chunk(struct btrfs_fs_info *fs_info, struct >> btrfs_key *key, >> struct extent_buffer *leaf, >> struct btrfs_chunk *chunk) >> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info >> *fs_info, struct btrfs_key *key, >> return PTR_ERR(map->stripes[i].dev); >> } >> btrfs_report_missing_device(fs_info, devid, uuid, false); >> + } else { >> + ret = verify_devices_generation(fs_info, >> + 
map->stripes[i].dev); >> + if (ret < 0) { >> + free_extent_map(em); >> + return ret; >> + } >> } >> set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >> &(map->stripes[i].dev->dev_state)); >>
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 2018年06月28日 15:06, Nikolay Borisov wrote: > > > On 28.06.2018 10:04, Qu Wenruo wrote: >> There is a reporter considering btrfs raid1 has a major design flaw >> which can't handle nodatasum files. >> >> Despite his incorrect expectation, btrfs indeed doesn't handle device >> generation mismatch well. >> >> This means if one devices missed and re-appeared, even its generation >> no longer matches with the rest device pool, btrfs does nothing to it, >> but treat it as normal good device. >> >> At least let's detect such generation mismatch and avoid mounting the >> fs. >> Currently there is no automatic rebuild yet, which means if users find >> device generation mismatch error message, they can only mount the fs >> using "device" and "degraded" mount option (if possible), then replace >> the offending device to manually "rebuild" the fs. >> >> Signed-off-by: Qu Wenruo > > I think a testcase of this functionality is important as well. It's currently an RFC patch, a test case will come along with the final version. I'd like to make sure everyone, including developers and end-users, is fine with the strict error-out behavior. Thanks, Qu >> --- >> I totally understand that, generation based solution can't handle >> split-brain case (where 2 RAID1 devices get mounted degraded separately) >> at all, but at least let's handle what we can do. >> >> The best way to solve the problem is to make btrfs treat such lower gen >> devices as some kind of missing device, and queue an automatic scrub for >> that device. >> But that's a lot of extra work, at least let's start from detecting such >> problem first. 
>> --- >> fs/btrfs/volumes.c | 50 ++ >> 1 file changed, 50 insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index e034ad9e23b4..80a7c44993bc 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct >> btrfs_fs_info *fs_info, >>devid, uuid); >> } >> >> +static int verify_devices_generation(struct btrfs_fs_info *fs_info, >> + struct btrfs_device *dev) >> +{ >> +struct btrfs_fs_devices *fs_devices = dev->fs_devices; >> +struct btrfs_device *cur; >> +bool warn_only = false; >> +int ret = 0; >> + >> +if (!fs_devices || fs_devices->seeding || !dev->generation) >> +return 0; >> + >> +/* >> + * If we're not replaying log, we're completely safe to allow >> + * generation mismatch as it won't write anything to disks, nor >> + * remount to rw. >> + */ >> +if (btrfs_test_opt(fs_info, NOLOGREPLAY)) >> +warn_only = true; >> + >> +rcu_read_lock(); >> +list_for_each_entry_rcu(cur, _devices->devices, dev_list) { >> +if (cur->generation && cur->generation != dev->generation) { >> +if (warn_only) { >> +btrfs_warn_rl_in_rcu(fs_info, >> +"devid %llu has unexpected generation, has %llu expected %llu", >> + dev->devid, >> + dev->generation, >> + cur->generation); >> +} else { >> +btrfs_err_rl_in_rcu(fs_info, >> +"devid %llu has unexpected generation, has %llu expected %llu", >> + dev->devid, >> + dev->generation, >> + cur->generation); >> +ret = -EINVAL; >> +break; >> +} >> +} >> +} >> +rcu_read_unlock(); >> +return ret; >> +} >> + >> static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key >> *key, >>struct extent_buffer *leaf, >>struct btrfs_chunk *chunk) >> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info >> *fs_info, struct btrfs_key *key, >> return PTR_ERR(map->stripes[i].dev); >> } >> btrfs_report_missing_device(fs_info, devid, uuid, >> false); >> +} else { >> +ret = verify_devices_generation(fs_info, >> +map->stripes[i].dev); >> +if (ret < 
0) { >> +free_extent_map(em); >> +return ret; >> +} >> } >> set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >> &(map->stripes[i].dev->dev_state)); >>
Re: Enabling quota may not correctly rescan on 4.17
On 2018年06月27日 16:25, Misono Tomohiro wrote:
> On 2018/06/27 17:10, Qu Wenruo wrote:
>>
>> On 2018年06月26日 14:00, Misono Tomohiro wrote:
>>> Hello Nikolay,
>>>
>>> I noticed that commit 5d23515be669 ("btrfs: Move qgroup rescan
>>> on quota enable to btrfs_quota_enable") in 4.17 sometimes causes
>>> quota to be rescanned incorrectly when quota is enabled.
>>>
>>> Simple reproducer:
>>>
>>> $ mkfs.btrfs -f $DEV
>>> $ mount $DEV /mnt
>>> $ dd if=/dev/urandom of=/mnt/file bs=1000 count=1000
>>> $ btrfs quota enable /mnt
>>> $ umount /mnt
>>> $ btrfs check $DEV
>>> ...
>>> checking quota groups
>>> Counts for qgroup id: 0/5 are different
>>> our:	referenced 1019904 referenced compressed 1019904
>>> disk:	referenced 16384 referenced compressed 16384
>>> diff:	referenced 1003520 referenced compressed 1003520
>>> our:	exclusive 1019904 exclusive compressed 1019904
>>> disk:	exclusive 16384 exclusive compressed 16384
>>> diff:	exclusive 1003520 exclusive compressed 1003520
>>> found 1413120 bytes used, error(s) found
>>> ...
>>>
>>> This can also be observed in btrfs/114. (Note that progs < 4.17
>>> return error code 0 even if quota is not consistent, and therefore
>>> the test will incorrectly pass.)
>>
>> BTW, would you please try to dump the quota tree for such a mismatch case?
>>
>> It could be a btrfs-progs bug: it should skip quota checking if it
>> finds the quota status item has the RESCAN flag.
>
> Yes, this is what I see after running btrfs/114 (/dev/sdh1 is scratch dev):
>
> $ sudo btrfs check -Q /dev/sdh1
> Checking filesystem on /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Print quota groups for /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Counts for qgroup id: 0/5 are different
> our:	referenced 170999808 referenced compressed 170999808
> disk:	referenced 16384 referenced compressed 16384
> diff:	referenced 170983424 referenced compressed 170983424
> our:	exclusive 170999808 exclusive compressed 170999808
> disk:	exclusive 16384 exclusive compressed 16384
> diff:	exclusive 170983424 exclusive compressed 170983424

Unfortunately in my environment, btrfs/114 failed to reproduce it in 1024
runs overnight, with a v4.18-rc1 kernel.

Would you please provide the whole btrfs-image dump of the corrupted fs?

There are several different assumptions about how the bug happens; your
btrfs-image dump would help a lot to rule some of them out.

Thanks,
Qu

> $ sudo btrfs inspect-internal dump-tree -t quota /dev/sdh1
> btrfs-progs v4.17
> quota tree key (QUOTA_TREE ROOT_ITEM 0)
> leaf 213958656 items 3 free space 16096 generation 9 owner QUOTA_TREE
> leaf 213958656 flags 0x1(WRITTEN) backref revision 1
> fs uuid d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> chunk uuid 78d753d0-eeb7-4c3e-b825-b6c2c5de5c7a
> 	item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
> 		version 1 generation 9 flags ON scan 30572545
> 	item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
> 		generation 7
> 		referenced 16384 referenced_compressed 16384
> 		exclusive 16384 exclusive_compressed 16384
> 	item 2 key (0 QGROUP_LIMIT 0/5) itemoff 16171 itemsize 40
> 		flags 0
> 		max_referenced 0 max_exclusive 0
> 		rsv_referenced 0 rsv_exclusive 0
> total bytes 26843545600
> bytes used 171769856
> uuid d07f6028-0ae7-40d4-ac45-01a4505ddcfb
>
> And if I mount+rescan again:
>
> $ sudo mount /dev/sdh1 /mnt
> $ sudo btrfs quota rescan -w /mnt
> $ sudo umount /mnt
>
> $ sudo btrfs check -Q /dev/sdh1
> Checking filesystem on /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Print quota groups for /dev/sdh1
> UUID: d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> Counts for qgroup id: 0/5
> our:	referenced 170999808 referenced compressed 170999808
> disk:	referenced 170999808 referenced compressed 170999808
> our:	exclusive 170999808 exclusive compressed 170999808
> disk:	exclusive 170999808 exclusive compressed 170999808
>
> $ sudo btrfs inspect-internal dump-tree -t quota /dev/sdh1
> btrfs-progs v4.17
> quota tree key (QUOTA_TREE ROOT_ITEM 0)
> leaf 31309824 items 3 free space 16096 generation 13 owner QUOTA_TREE
> leaf 31309824 flags 0x1(WRITTEN) backref revision 1
> fs uuid d07f6028-0ae7-40d4-ac45-01a4505ddcfb
> chunk uuid 78d753d0-eeb7-4c3e-b825-b6c2c5de5c7a
> 	item 0 key (0 QGROUP_STATUS 0) itemoff 16251 itemsize 32
> 		version 1 generation 13 flags ON scan 213827585
> 	item 1 key (0 QGROUP_INFO 0/5) itemoff 16211 itemsize 40
> 		generation 11
> 		referenced 170999808 referenced_compressed 170999808
> 		exclusive 170999808 exclusive_compressed 170999808
> 	item 2 key (0 QGROUP_LIMIT 0/5) itemoff 16171 itemsize 40
> 		flags 0
>
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 06/28/2018 03:04 PM, Qu Wenruo wrote:
> There is a reporter who considers btrfs raid1 to have a major design
> flaw because it can't handle nodatasum files.
>
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.

Just saying "btrfs indeed doesn't handle device generation mismatch well."
is enough.
Otherwise, someday someone may be confused about who the reporter is and
what was reported.

Thanks,
Su

> This means if one device went missing and re-appeared, even if its
> generation no longer matches the rest of the device pool, btrfs does
> nothing about it, but treats it as a normal good device.
>
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> the device generation mismatch error message, they can only mount the
> fs using the "device" and "degraded" mount options (if possible), then
> replace the offending device to manually "rebuild" the fs.
>
> Signed-off-by: Qu Wenruo
> ---
> I totally understand that a generation-based solution can't handle the
> split-brain case (where 2 RAID1 devices get mounted degraded
> separately) at all, but at least let's handle what we can.
>
> The best way to solve the problem is to make btrfs treat such lower-gen
> devices as a kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work; at least let's start by detecting such
> problems first.
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>  			      devid, uuid);
>  }
>
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						    dev->devid,
> +						    dev->generation,
> +						    cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			  struct extent_buffer *leaf,
>  			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  				return PTR_ERR(map->stripes[i].dev);
>  			}
>  			btrfs_report_missing_device(fs_info, devid, uuid,
>  						    false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}
>  		}
>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>  			&(map->stripes[i].dev->dev_state));
Re: [PATCH RFC] btrfs: Do extra device generation check at mount time
On 28.06.2018 10:04, Qu Wenruo wrote:
> There is a reporter who considers btrfs raid1 to have a major design
> flaw because it can't handle nodatasum files.
>
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
>
> This means if one device went missing and re-appeared, even if its
> generation no longer matches the rest of the device pool, btrfs does
> nothing about it, but treats it as a normal good device.
>
> At least let's detect such generation mismatch and avoid mounting the
> fs.
> Currently there is no automatic rebuild yet, which means if users find
> the device generation mismatch error message, they can only mount the
> fs using the "device" and "degraded" mount options (if possible), then
> replace the offending device to manually "rebuild" the fs.
>
> Signed-off-by: Qu Wenruo

I think a testcase for this functionality is important as well.

> ---
> I totally understand that a generation-based solution can't handle the
> split-brain case (where 2 RAID1 devices get mounted degraded
> separately) at all, but at least let's handle what we can.
>
> The best way to solve the problem is to make btrfs treat such lower-gen
> devices as a kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work; at least let's start by detecting such
> problems first.
> ---
>  fs/btrfs/volumes.c | 50 ++
>  1 file changed, 50 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e034ad9e23b4..80a7c44993bc 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
>  			      devid, uuid);
>  }
>
> +static int verify_devices_generation(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_device *dev)
> +{
> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
> +	struct btrfs_device *cur;
> +	bool warn_only = false;
> +	int ret = 0;
> +
> +	if (!fs_devices || fs_devices->seeding || !dev->generation)
> +		return 0;
> +
> +	/*
> +	 * If we're not replaying log, we're completely safe to allow
> +	 * generation mismatch as it won't write anything to disks, nor
> +	 * remount to rw.
> +	 */
> +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> +		warn_only = true;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
> +		if (cur->generation && cur->generation != dev->generation) {
> +			if (warn_only) {
> +				btrfs_warn_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						     dev->devid,
> +						     dev->generation,
> +						     cur->generation);
> +			} else {
> +				btrfs_err_rl_in_rcu(fs_info,
> +	"devid %llu has unexpected generation, has %llu expected %llu",
> +						    dev->devid,
> +						    dev->generation,
> +						    cur->generation);
> +				ret = -EINVAL;
> +				break;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  			  struct extent_buffer *leaf,
>  			  struct btrfs_chunk *chunk)
> @@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
>  				return PTR_ERR(map->stripes[i].dev);
>  			}
>  			btrfs_report_missing_device(fs_info, devid, uuid,
>  						    false);
> +		} else {
> +			ret = verify_devices_generation(fs_info,
> +							map->stripes[i].dev);
> +			if (ret < 0) {
> +				free_extent_map(em);
> +				return ret;
> +			}
>  		}
>  		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>  			&(map->stripes[i].dev->dev_state));
[PATCH RFC] btrfs: Do extra device generation check at mount time
There is a reporter who considers btrfs raid1 to have a major design
flaw because it can't handle nodatasum files.

Despite his incorrect expectation, btrfs indeed doesn't handle device
generation mismatch well.

This means if one device went missing and re-appeared, even if its
generation no longer matches the rest of the device pool, btrfs does
nothing about it, but treats it as a normal good device.

At least let's detect such generation mismatch and avoid mounting the
fs.
Currently there is no automatic rebuild yet, which means if users find
the device generation mismatch error message, they can only mount the fs
using the "device" and "degraded" mount options (if possible), then
replace the offending device to manually "rebuild" the fs.

Signed-off-by: Qu Wenruo
---
I totally understand that a generation-based solution can't handle the
split-brain case (where 2 RAID1 devices get mounted degraded separately)
at all, but at least let's handle what we can.

The best way to solve the problem is to make btrfs treat such lower-gen
devices as a kind of missing device, and queue an automatic scrub for
that device.
But that's a lot of extra work; at least let's start by detecting such
problems first.
---
 fs/btrfs/volumes.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..80a7c44993bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6467,6 +6467,49 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
 			      devid, uuid);
 }
 
+static int verify_devices_generation(struct btrfs_fs_info *fs_info,
+				     struct btrfs_device *dev)
+{
+	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+	struct btrfs_device *cur;
+	bool warn_only = false;
+	int ret = 0;
+
+	if (!fs_devices || fs_devices->seeding || !dev->generation)
+		return 0;
+
+	/*
+	 * If we're not replaying log, we're completely safe to allow
+	 * generation mismatch as it won't write anything to disks, nor
+	 * remount to rw.
+	 */
+	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+		warn_only = true;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(cur, &fs_devices->devices, dev_list) {
+		if (cur->generation && cur->generation != dev->generation) {
+			if (warn_only) {
+				btrfs_warn_rl_in_rcu(fs_info,
+	"devid %llu has unexpected generation, has %llu expected %llu",
+						     dev->devid,
+						     dev->generation,
+						     cur->generation);
+			} else {
+				btrfs_err_rl_in_rcu(fs_info,
+	"devid %llu has unexpected generation, has %llu expected %llu",
+						    dev->devid,
+						    dev->generation,
+						    cur->generation);
+				ret = -EINVAL;
+				break;
+			}
+		}
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 			  struct extent_buffer *leaf,
 			  struct btrfs_chunk *chunk)
@@ -6552,6 +6595,13 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 				return PTR_ERR(map->stripes[i].dev);
 			}
 			btrfs_report_missing_device(fs_info, devid, uuid,
 						    false);
+		} else {
+			ret = verify_devices_generation(fs_info,
+							map->stripes[i].dev);
+			if (ret < 0) {
+				free_extent_map(em);
+				return ret;
+			}
 		}
 		set_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
 			&(map->stripes[i].dev->dev_state));
-- 
2.18.0
Re: [PATCH] fstests: btrfs: Test if btrfs will corrupt nodatasum compressed extent when replacing device
On 28.06.2018 08:34, Eryu Guan wrote:
> On Thu, Jun 28, 2018 at 08:11:00AM +0300, Nikolay Borisov wrote:
>>
>> On 1.06.2018 04:34, Qu Wenruo wrote:
>>> This is a long-existing bug (from 2012) but exposed by a reporter
>>> recently: when a compressed extent without data csum gets written to
>>> a device-replace target device, the written data is in fact
>>> uncompressed data rather than the original compressed data.
>>>
>>> And since btrfs still considers the data compressed and will try to
>>> read it as compressed, it can cause read errors.
>>>
>>> The root cause is located, and one RFC patch has already been sent to
>>> fix it, titled "[PATCH RFC] btrfs: scrub: Don't use inode pages for
>>> device replace".
>>> (The RFC is only for the extra possible way to fix the bug; the fix
>>> itself should work without problem.)
>>>
>>> Reported-by: James Harvey
>>> Signed-off-by: Qu Wenruo
>>
>> Reviewed-by: Nikolay Borisov
>
> Thanks for the review! I assume the v3 patch also passes your review :)

Yes, I just saw that you requested an ack from a btrfs developer some
time ago and this test didn't move forward, hence I replied. But yes,
it's a valid test for btrfs.

> Eryu