Re: how to best segment a big block device in resizeable btrfs filesystems?
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan writes:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>>
>>> 02.07.2018 21:35, Austin S. Hemmelgarn writes:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>>
>> No.
>>
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.
>>
>> But other than simply by odds not using them again immediately, btrfs
>> has no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have stopped". How soon is "immediately", and does the writes-stopped condition account for data that has reached the device hardware's write buffer (so is no longer being transmitted to the device across the bus) but has not yet actually been written to media?
On a reasonably quiescent system, multiple empty write cycles are likely to have occurred since the last write barrier, and anything in-process is likely to have made it to media even if software is missing a write barrier it needs (software bug) or the hardware lies about honoring the write barrier (hardware bug, allegedly sometimes deliberate on hardware willing to gamble with your data that a crash won't happen at a critical moment, in order to improve normal-operation performance metrics; a somewhat rare occurrence).

On an IO-maxed system, data and write-barriers are coming down as fast as the system can handle them, and write-barriers become critical. If the system crashes after something was supposed to get to media but didn't, either because of a missing write barrier or because the hardware/firmware lied about the barrier and claimed the data it was supposed to ensure was on-media when it wasn't, then the btrfs atomic-cow commit guarantee of a consistent state at each commit goes out the window. At this point it becomes useful to have a number of previous "guaranteed consistent state" roots to fall back on, with the /hope/ being that at least /one/ of them is usably consistent. If all but the last one are wiped due to trim...

When the system isn't write-maxed, the write will almost certainly have made it regardless of whether the barrier is there or not, because there's enough idle time to finish the current write before another one comes down the pipe, so the last-written root is almost certain to be fine regardless of barriers, and the history of past roots doesn't matter even if there's a crash.
If "immediately after writes have stopped" is strictly defined as a condition in which all writes, including the btrfs commit updating the current root and the superblock pointers to the current root, have completed, with no new writes coming down the pipe in the meantime that might have delayed a critical update if a barrier was missed, then trimming old roots in this state should be entirely safe, and the distinction between that state and "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, then the distinction between that state and "while writes are happening" remains blurry at best, and having old roots around to fall back on in case a write-barrier was missed (for whatever reason, hardware or software) becomes a very good thing.

Of course the fact that trim/discard itself is an instruction written to the device in the combined command/data stream complicates the picture substantially. If those write barriers get missed, who knows what state the new root is in, and if the old ones got erased...

But again, on a mostly idle system, it'll probably all "just work", because the writes will likely all make it to media regardless, since there's not a bunch of other writes competing for limited write bandwidth and making ordering critical.

>> In the context of the discard mount option, that can mean there's never
>> any old roots available
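Given the concern above about trimming while writes may still be in flight, one conservative approach is to force a btrfs commit immediately before trimming. A minimal sketch, with a hypothetical mountpoint; the commands are echoed rather than executed, so treat this as a dry run, not a definitive procedure:

```shell
# Hypothetical mountpoint; adjust to taste.
MNT=/mnt/backup

# Flush the current transaction to media first, so the trim runs against
# a filesystem in a known-committed state...
SYNC_CMD="btrfs filesystem sync $MNT"
# ...then discard unused blocks (which includes old root trees).
TRIM_CMD="fstrim -v $MNT"

# Dry run: show what would be executed.
echo "$SYNC_CMD && $TRIM_CMD"
```

This only narrows the window discussed above; it cannot close it if other writers remain active on the filesystem.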
Re: how to best segment a big block device in resizeable btrfs filesystems?
03.07.2018 10:15, Duncan writes:
> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
>
>> 02.07.2018 21:35, Austin S. Hemmelgarn writes:
>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>> bit dangerous to do it while writes are happening).
>>
>> Could you please elaborate? Do you mean btrfs can trim data before new
>> writes are actually committed to disk?
>
> No.
>
> But normally old roots aren't rewritten for some time simply due to odds
> (fuller filesystems will of course recycle them sooner), and the btrfs
> mount option usebackuproot (formerly recovery, until the norecovery mount
> option that parallels that of other filesystems was added and this option
> was renamed to avoid confusion) can be used to try an older root if the
> current root is too damaged to successfully mount.
>
> But other than simply by odds not using them again immediately, btrfs has
> no special protection for those old roots, and trim/discard will recover
> them to hardware-unused as it does any other unused space, tho whether it
> simply marks them for later processing or actually processes them
> immediately is up to the individual implementation -- some do it
> immediately, killing all chances at using the backup root because it's
> already zeroed out, some don't.

How is it relevant to "while writes are happening"? Will trimming old trees immediately after writes have stopped be any different? Why?

> In the context of the discard mount option, that can mean there's never
> any old roots available ever, as they've already been cleaned up by the
> hardware due to the discard option telling the hardware to do it.
>
> But even not using that mount option, and simply doing the trims
> periodically, as done weekly by for instance the systemd fstrim timer and
> service units, or done manually if you prefer, obviously potentially
> wipes the old roots at that point.
If the system's effectively idle at > the time, not much risk as the current commit is likely to represent a > filesystem in full stasis, but if there's lots of writes going on at that > moment *AND* the system happens to crash at just the wrong time, before > additional commits have recreated at least a bit of root history, again, > you'll potentially be left without any old roots for the usebackuproot > mount option to try to fall back to, should it actually be necessary. > Sorry? You are just saying that "previous state can be discarded before new state is committed", just more verbosely. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to best segment a big block device in resizeable btrfs filesystems?
On 2018-07-03 17:55, Paul Jones wrote:
>> -Original Message-
>> From: linux-btrfs-ow...@vger.kernel.org ow...@vger.kernel.org> On Behalf Of Marc MERLIN
>> Sent: Tuesday, 3 July 2018 2:16 PM
>> To: Qu Wenruo
>> Cc: Su Yue ; linux-btrfs@vger.kernel.org
>> Subject: Re: how to best segment a big block device in resizeable btrfs
>> filesystems?
>>
>> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
>>>> If I do this, I would have
>>>> software raid 5 < dmcrypt < bcache < lvm < btrfs
>>>> That's a lot of layers, and that's also starting to make me nervous :)
>>>
>>> If you could keep the number of snapshots to minimal (less than 10)
>>> for each btrfs (and the number of send source is less than 5), one big
>>> btrfs may work in that case.
>>
>> Well, we kind of discussed this already. If btrfs falls over if you reach
>> 100 snapshots or so, and it sure seems to in my case, I won't be much
>> better off.
>> Having btrfs check --repair fail because 32GB of RAM is not enough, and
>> it's unable to use swap, is a big deal in my case. You also confirmed
>> that btrfs check lowmem does not scale to filesystems like mine, so this
>> translates into "if regular btrfs check repair can't fit in 32GB, I am
>> completely out of luck if anything happens to the filesystem"
>
> Just out of curiosity I had a look at my backup filesystem.
> vm-server /media/backup # btrfs fi us /media/backup/
> Overall:
>     Device size:           5.46TiB
>     Device allocated:      3.42TiB
>     Device unallocated:    2.04TiB
>     Device missing:          0.00B
>     Used:                  1.80TiB
>     Free (estimated):      1.83TiB  (min: 1.83TiB)
>     Data ratio:               2.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB  (used: 0.00B)
>
> Data,RAID1: Size:1.69TiB, Used:906.26GiB

It doesn't affect how fast check runs at all.

Unless --check-data-csum is specified. And even if --check-data-csum is specified, most reads will still be sequential, and deduped/reflinked extents won't affect the csum verification speed.
> /dev/mapper/a-backup--a    1.69TiB
> /dev/mapper/b-backup--b    1.69TiB
>
> Metadata,RAID1: Size:19.00GiB, Used:16.90GiB

This is the main factor contributing to btrfs check time. Just consider it as the minimal amount of data btrfs check needs to read.

> /dev/mapper/a-backup--a   19.00GiB
> /dev/mapper/b-backup--b   19.00GiB
>
> System,RAID1: Size:64.00MiB, Used:336.00KiB
> /dev/mapper/a-backup--a   64.00MiB
> /dev/mapper/b-backup--b   64.00MiB
>
> Unallocated:
> /dev/mapper/a-backup--a    1.02TiB
> /dev/mapper/b-backup--b    1.02TiB
>
> compress=zstd,space_cache=v2
> 202 snapshots, heavily de-duplicated
> 551G / 361,000 files in latest snapshot

No wonder it's so slow for lowmem mode.

> Btrfs check normal mode took 12 mins and 11.5G ram
> Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G

For lowmem, btrfs check will use 25% of your total memory as cache to speed it up a little (but as you can see, it's still slow). Maybe we could add some option to modify how many bytes we could use for lowmem mode.

Thanks,
Qu
RE: how to best segment a big block device in resizeable btrfs filesystems?
> -Original Message- > From: linux-btrfs-ow...@vger.kernel.org ow...@vger.kernel.org> On Behalf Of Marc MERLIN > Sent: Tuesday, 3 July 2018 2:16 PM > To: Qu Wenruo > Cc: Su Yue ; linux-btrfs@vger.kernel.org > Subject: Re: how to best segment a big block device in resizeable btrfs > filesystems? > > On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote: > > > If I do this, I would have > > > software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of > > > layers, and that's also starting to make me nervous :) > > > > If you could keep the number of snapshots to minimal (less than 10) > > for each btrfs (and the number of send source is less than 5), one big > > btrfs may work in that case. > > Well, we kind of discussed this already. If btrfs falls over if you reach > 100 snapshots or so, and it sure seems to in my case, I won't be much better > off. > Having btrfs check --repair fail because 32GB of RAM is not enough, and it's > unable to use swap, is a big deal in my case. You also confirmed that btrfs > check lowmem does not scale to filesystems like mine, so this translates into > "if regular btrfs check repair can't fit in 32GB, I am completely out of luck > if > anything happens to the filesystem" Just out of curiosity I had a look at my backup filesystem. 
vm-server /media/backup # btrfs fi us /media/backup/
Overall:
    Device size:           5.46TiB
    Device allocated:      3.42TiB
    Device unallocated:    2.04TiB
    Device missing:          0.00B
    Used:                  1.80TiB
    Free (estimated):      1.83TiB  (min: 1.83TiB)
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB  (used: 0.00B)

Data,RAID1: Size:1.69TiB, Used:906.26GiB
   /dev/mapper/a-backup--a    1.69TiB
   /dev/mapper/b-backup--b    1.69TiB

Metadata,RAID1: Size:19.00GiB, Used:16.90GiB
   /dev/mapper/a-backup--a   19.00GiB
   /dev/mapper/b-backup--b   19.00GiB

System,RAID1: Size:64.00MiB, Used:336.00KiB
   /dev/mapper/a-backup--a   64.00MiB
   /dev/mapper/b-backup--b   64.00MiB

Unallocated:
   /dev/mapper/a-backup--a    1.02TiB
   /dev/mapper/b-backup--b    1.02TiB

compress=zstd,space_cache=v2
202 snapshots, heavily de-duplicated
551G / 361,000 files in latest snapshot

Btrfs check normal mode took 12 mins and 11.5G ram
Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G
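For reference, the two check modes being compared above can be invoked as follows. A minimal sketch against a hypothetical device; the commands are echoed rather than executed:

```shell
# Hypothetical device node (mirrors the LV names in the report above).
DEV=/dev/mapper/a-backup--a

# Normal mode: loads metadata into RAM (fast, but memory-hungry).
NORMAL_CMD="btrfs check $DEV"
# Lowmem mode: trades RAM for repeated reads (slow, small footprint).
LOWMEM_CMD="btrfs check --mode=lowmem $DEV"

# Dry run: show what would be executed (both are read-only by default).
echo "$NORMAL_CMD"
echo "$LOWMEM_CMD"
```

The filesystem must be unmounted in either case; only the memory/speed trade-off differs.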
Re: how to best segment a big block device in resizeable btrfs filesystems?
Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn writes:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
>
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds (fuller filesystems will of course recycle them sooner), and the btrfs mount option usebackuproot (formerly recovery, until the norecovery mount option that parallels that of other filesystems was added and this option was renamed to avoid confusion) can be used to try an older root if the current root is too damaged to successfully mount.

But other than simply by odds not using them again immediately, btrfs has no special protection for those old roots, and trim/discard will recover them to hardware-unused as it does any other unused space, tho whether it simply marks them for later processing or actually processes them immediately is up to the individual implementation -- some do it immediately, killing all chances at using the backup root because it's already zeroed out, some don't.

In the context of the discard mount option, that can mean there's never any old roots available ever, as they've already been cleaned up by the hardware due to the discard option telling the hardware to do it.

But even not using that mount option, and simply doing the trims periodically, as done weekly by for instance the systemd fstrim timer and service units, or done manually if you prefer, obviously potentially wipes the old roots at that point.
If the system's effectively idle at the time, not much risk as the current commit is likely to represent a filesystem in full stasis, but if there's lots of writes going on at that moment *AND* the system happens to crash at just the wrong time, before additional commits have recreated at least a bit of root history, again, you'll potentially be left without any old roots for the usebackuproot mount option to try to fall back to, should it actually be necessary.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
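The usebackuproot fallback described above is tried at mount time. A minimal sketch of how it might look after a crash, with hypothetical device and mountpoint names; read-only first so a failed attempt rewrites nothing, and the commands are echoed rather than executed:

```shell
# Hypothetical device and recovery mountpoint.
DEV=/dev/sdX1
MNT=/mnt/recovery

# Try a plain read-only mount first; if the current root is too damaged,
# retry with usebackuproot so btrfs walks the older backup roots.
for OPTS in "ro" "ro,usebackuproot"; do
    echo "mount -o $OPTS $DEV $MNT"
done
```

If none of the backup roots mount either (e.g. they were already trimmed), the remaining options are btrfs-restore and btrfs check, not mount flags.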
Re: how to best segment a big block device in resizeable btrfs filesystems?
On Tue, Jul 03, 2018 at 04:26:37AM +, Paul Jones wrote:
> I don't have any experience with this, but since it's the internet let me
> tell you how I'd do it anyway

That's the spirit :)

> raid5
> dm-crypt
> lvm (using thin provisioning + cache)
> btrfs
>
> The cache mode on lvm requires you to set up all your volumes first, then
> add caching to those volumes last. If you need to modify the volume then
> you have to remove the cache, make your changes, then re-add the cache. It
> sounds like a pain, but having the cache separate from the data is quite
> handy.

I'm ok enough with that.

> Given you are running a backup server I don't think the cache would
> really do much unless you enable writeback mode. If you can split up your
> filesystem a bit to the point that btrfs check doesn't OOM that will
> seriously help performance as well. Rsync might be feasible again.

I'm a bit wary of write caching with the issues I've had. I may do write-through, but not writeback :)

But caching helps indeed for my older filesystems that are still backed up via rsync because the source fs is ext4 and not btrfs.

Thanks for the suggestions
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
RE: how to best segment a big block device in resizeable btrfs filesystems?
> -Original Message-
> From: Marc MERLIN
> Sent: Tuesday, 3 July 2018 2:07 PM
> To: Paul Jones
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
>
> On Tue, Jul 03, 2018 at 12:51:30AM +, Paul Jones wrote:
>> You could combine bcache and lvm if you are happy to use dm-cache
>> instead (which lvm uses).
>> I use it myself (but without thin provisioning) and it works well.
>
> Interesting point. So, I used to use lvm and then lvm2 many years ago
> until I got tired with its performance, especially as soon as I took even
> a single snapshot.
> But that was a long time ago now, just saying that I'm a bit rusty on LVM
> itself.
>
> That being said, if I have
> raid5
> dm-cache
> dm-crypt
> dm-thin
>
> That's still 4 block layers under btrfs.
> Am I any better off using dm-cache instead of bcache, my understanding is
> that it only replaces one block layer with another one and one codebase
> with another.

True, I didn't think of it like that.

> Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
> might change things, or not.
> I'll admit that setting up and maintaining bcache is a bit of a pain, I
> only used it at the time because it seemed more ready then, but we're a
> few years later now.
>
> So, what do you recommend nowadays, assuming you've used both?
> (given that it's literally going to take days to recreate my array, I'd
> rather do it once and the right way the first time :) )

I don't have any experience with this, but since it's the internet let me tell you how I'd do it anyway

raid5
dm-crypt
lvm (using thin provisioning + cache)
btrfs

The cache mode on lvm requires you to set up all your volumes first, then add caching to those volumes last. If you need to modify the volume then you have to remove the cache, make your changes, then re-add the cache. It sounds like a pain, but having the cache separate from the data is quite handy.
Given you are running a backup server I don't think the cache would really do much unless you enable writeback mode. If you can split up your filesystem a bit to the point that btrfs check doesn't OOM that will seriously help performance as well. Rsync might be feasible again.

Paul.
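The detach/modify/re-attach cycle Paul describes maps onto lvconvert. A minimal sketch with hypothetical VG/LV/cache-pool names; commands are echoed rather than executed:

```shell
# Hypothetical names: volume group, cached origin LV, cache pool LV.
VG=vg0; LV=backup; POOL=fastcache

# 1. Detach the cache from the origin (the cache pool is kept).
SPLIT_CMD="lvconvert --splitcache $VG/$LV"
# 3. Re-attach the cache pool to the (now modified) origin.
ATTACH_CMD="lvconvert --type cache --cachepool $VG/$POOL $VG/$LV"

echo "$SPLIT_CMD"
echo "lvresize -L +1T $VG/$LV"    # 2. the change you wanted to make
echo "$ATTACH_CMD"
```

Whether step 1 is required depends on the LVM version and the operation; newer releases can resize some cached LVs directly.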
Re: how to best segment a big block device in resizeable btrfs filesystems?
02.07.2018 21:35, Austin S. Hemmelgarn writes:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Could you please elaborate? Do you mean btrfs can trim data before new writes are actually committed to disk?
Re: how to best segment a big block device in resizeable btrfs filesystems?
03.07.2018 04:37, Qu Wenruo writes:
>
> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.
>

There is a new writecache target which may help in this case.
Re: how to best segment a big block device in resizeable btrfs filesystems?
On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
>> If I do this, I would have
>> software raid 5 < dmcrypt < bcache < lvm < btrfs
>> That's a lot of layers, and that's also starting to make me nervous :)
>
> If you could keep the number of snapshots to minimal (less than 10) for
> each btrfs (and the number of send source is less than 5), one big btrfs
> may work in that case.

Well, we kind of discussed this already. If btrfs falls over if you reach 100 snapshots or so, and it sure seems to in my case, I won't be much better off.
Having btrfs check --repair fail because 32GB of RAM is not enough, and it's unable to use swap, is a big deal in my case. You also confirmed that btrfs check lowmem does not scale to filesystems like mine, so this translates into "if regular btrfs check repair can't fit in 32GB, I am completely out of luck if anything happens to the filesystem"

You're correct that I could tweak my backups and snapshot rotation to get from 250 or so down to 100, but it seems that I'll just be hoping to avoid the problem by being just under the limit, until I'm not, again, and it'll be too late to do anything about it the next time I'm in trouble, putting me back right in the same spot I'm in now.
Is all this fair to say, or did I misunderstand?

> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.

That's a good point. So, what I didn't explain is that I still have some old filesystems that do get backed up with rsync instead of btrfs send (going into the same filesystem, but not the same subvolume). Because rsync is so painfully slow when it needs to scan both sides before it'll even start doing any work, bcache helps there.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: how to best segment a big block device in resizeable btrfs filesystems?
On Tue, Jul 03, 2018 at 12:51:30AM +, Paul Jones wrote:
> You could combine bcache and lvm if you are happy to use dm-cache instead
> (which lvm uses).
> I use it myself (but without thin provisioning) and it works well.

Interesting point. So, I used to use lvm and then lvm2 many years ago until I got tired with its performance, especially as soon as I took even a single snapshot.
But that was a long time ago now, just saying that I'm a bit rusty on LVM itself.

That being said, if I have
raid5
dm-cache
dm-crypt
dm-thin

That's still 4 block layers under btrfs.
Am I any better off using dm-cache instead of bcache, my understanding is that it only replaces one block layer with another one and one codebase with another.
Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which might change things, or not.
I'll admit that setting up and maintaining bcache is a bit of a pain, I only used it at the time because it seemed more ready then, but we're a few years later now.

So, what do you recommend nowadays, assuming you've used both?
(given that it's literally going to take days to recreate my array, I'd rather do it once and the right way the first time :) )

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: how to best segment a big block device in resizeable btrfs filesystems?
On 2018-07-02 23:18, Marc MERLIN wrote:
> Hi Qu,
>
> I'll split this part into a new thread:
>
>> 2) Don't keep unrelated snapshots in one btrfs.
>>    I totally understand that maintain different btrfs would hugely add
>>    maintenance pressure, but as explains, all snapshots share one
>>    fragile extent tree.
>
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
>
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
>
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)

If you could keep the number of snapshots to minimal (less than 10) for each btrfs (and the number of send source is less than 5), one big btrfs may work in that case.

BTW, IMHO the bcache is not really helping for backup system, which is more write oriented.

Thanks,
Qu

> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
>
> Thanks,
> Marc
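Checking how a filesystem compares against the snapshot budget Qu suggests is straightforward. A minimal sketch with a hypothetical mountpoint; the counting command is echoed rather than executed:

```shell
# Hypothetical mountpoint and Qu's suggested per-filesystem budget.
MNT=/mnt/backup
LIMIT=10

# `btrfs subvolume list -s` lists only snapshots (not plain subvolumes),
# so piping it through wc -l gives the current snapshot count.
COUNT_CMD="btrfs subvolume list -s $MNT | wc -l"

# Dry run: show the command and what to compare its result against.
echo "$COUNT_CMD    # compare the result against $LIMIT"
```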
RE: how to best segment a big block device in resizeable btrfs filesystems?
> -Original Message- > From: linux-btrfs-ow...@vger.kernel.org ow...@vger.kernel.org> On Behalf Of Marc MERLIN > Sent: Tuesday, 3 July 2018 1:19 AM > To: Qu Wenruo > Cc: Su Yue ; linux-btrfs@vger.kernel.org > Subject: Re: how to best segment a big block device in resizeable btrfs > filesystems? > > Hi Qu, > > I'll split this part into a new thread: > > > 2) Don't keep unrelated snapshots in one btrfs. > >I totally understand that maintain different btrfs would hugely add > >maintenance pressure, but as explains, all snapshots share one > >fragile extent tree. > > Yes, I understand that this is what I should do given what you explained. > My main problem is knowing how to segment things so I don't end up with > filesystems that are full while others are almost empty :) > > Am I supposed to put LVM thin volumes underneath so that I can share the > same single 10TB raid5? > > If I do this, I would have > software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of layers, and > that's also starting to make me nervous :) You could combine bcache and lvm if you are happy to use dm-cache instead (which lvm uses). I use it myself (but without thin provisioning) and it works well. > > Is there any other way that does not involve me creating smaller block > devices for multiple btrfs filesystems and hope that they are the right size > because I won't be able to change it later? > > Thanks, > Marc > -- > "A mouse is a device used to point at the xterm you want to type in" - A.S.R. 
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: how to best segment a big block device in resizeable btrfs filesystems?
On Mon, Jul 02, 2018 at 02:35:19PM -0400, Austin S. Hemmelgarn wrote:
>> I kind of liked the thin provisioning idea because it's hands off,
>> which is appealing. Any reason against it?
> No, not currently, except that it adds a whole lot more stuff between
> BTRFS and whatever layer is below it. That increase in what's being
> done adds some overhead (it's noticeable on 7200 RPM consumer SATA
> drives, but not on decent consumer SATA SSD's).
>
> There used to be issues running BTRFS on top of LVM thin targets which
> had zero mode turned off, but AFAIK, all of those problems were fixed
> long ago (before 4.0).

I see, thanks for the heads up.

>> Does LVM do built-in raid5 now? Is it as good/trustworthy as mdadm
>> raid5?
> Actually, it uses MD's RAID5 implementation as a back-end. Same for
> RAID6, and optionally for RAID0, RAID1, and RAID10.

Ok, that makes me feel a bit better :)

>> But yeah, if it's incompatible with thin provisioning, it's not that
>> useful.
> It's technically not incompatible, just a bit of a pain. Last time I
> tried to use it, you had to jump through hoops to repair a damaged RAID
> volume that was serving as an underlying volume in a thin pool, and it
> required keeping the thin pool offline for the entire duration of the
> rebuild.

Argh, not good :( / thanks for the heads up.

> If you do go with thin provisioning, I would encourage you to make
> certain to call fstrim on the BTRFS volumes on a semi regular basis so
> that the thin pool doesn't get filled up with old unused blocks,

That's a very good point/reminder, thanks for that. I guess it's like running on an ssd :)

> preferably when you are 100% certain that there are no ongoing writes on
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Argh, that will be harder, but I'll try.

Given what you said, it sounds like I'll still be best off with separate layers to avoid the rebuild problem you mentioned.
So it'll be swraid5 / dmcrypt / bcache / lvm dm thin / btrfs
Hopefully that will work well enough.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
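The swraid5 / dmcrypt / bcache / lvm-thin / btrfs stack settled on here can be sketched end to end. All device names, sizes, and VG/LV names below are hypothetical, and the commands are echoed rather than executed; treat it as an ordering sketch, not a build script:

```shell
# The five layers of the stack, bottom to top.
N_LAYERS=5   # raid5, dmcrypt, bcache, lvm thin, btrfs

echo "mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[a-e]1"
echo "cryptsetup luksFormat /dev/md0 && cryptsetup open /dev/md0 cr_pool"
echo "make-bcache -C /dev/nvme0n1 -B /dev/mapper/cr_pool"
echo "pvcreate /dev/bcache0 && vgcreate vg0 /dev/bcache0"
echo "lvcreate --type thin-pool -L 9T -n pool vg0"
echo "lvcreate --thin -V 2T -n backup1 vg0/pool"
echo "mkfs.btrfs /dev/vg0/backup1"
```

One thin LV per backup target (repeating the last two commands) gives the independently-checkable filesystems discussed earlier, all drawing from the same 9T pool.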
Re: how to best segment a big block device in resizeable btrfs filesystems?
On 2018-07-02 13:34, Marc MERLIN wrote: On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote: Am I supposed to put LVM thin volumes underneath so that I can share the same single 10TB raid5? Actually, because of the online resize ability in BTRFS, you don't technically _need_ to use thin provisioning here. It makes the maintenance a bit easier, but it also adds a much more complicated layer of indirection than just doing regular volumes. You're right that I can use btrfs resize, but then I still need an LVM device underneath, correct? So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10% each of the full size available (as a guess), and then I'd have to - btrfs resize down one that's bigger than I need - LVM shrink the LV - LVM grow the other LV - LVM resize up the other btrfs and I think LVM resize and btrfs resize are not linked so I have to do them separately and hope to type the right numbers each time, correct? (or is that easier now?) I kind of linked the thin provisioning idea because it's hands off, which is appealing. Any reason against it? No, not currently, except that it adds a whole lot more stuff between BTRFS and whatever layer is below it. That increase in what's being done adds some overhead (it's noticeable on 7200 RPM consumer SATA drives, but not on decent consumer SATA SSD's). There used to be issues running BTRFS on top of LVM thin targets which had zero mode turned off, but AFAIK, all of those problems were fixed long ago (before 4.0). You could (in theory) merge the LVM and software RAID5 layers, though that may make handling of the RAID5 layer a bit complicated if you choose to use thin provisioning (for some reason, LVM is unable to do on-line checks and rebuilds of RAID arrays that are acting as thin pool data or metadata). Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm radi5? Actually, it uses MD's RAID5 implementation as a back-end. 
Same for RAID6, and optionally for RAID0, RAID1, and RAID10.
> But yeah, if it's incompatible with thin provisioning, it's not that
> useful.
It's technically not incompatible, just a bit of a pain. Last time I
tried to use it, you had to jump through hoops to repair a damaged RAID
volume that was serving as an underlying volume in a thin pool, and it
required keeping the thin pool offline for the entire duration of the
rebuild.
>> Alternatively, you could increase your array size, remove the
>> software RAID layer, and switch to using BTRFS in raid10 mode so that
>> you could eliminate one of the layers, though that would probably
>> reduce the effectiveness of bcache (you might want to get a bigger
>> cache device if you do this).
> Sadly that won't work. I have more data than will fit on raid10.
> Thanks for your suggestions though.
> Still need to read up on whether I should do thin provisioning, or
> not.
If you do go with thin provisioning, I would encourage you to make
certain to call fstrim on the BTRFS volumes on a semi-regular basis so
that the thin pool doesn't get filled up with old unused blocks,
preferably when you are 100% certain that there are no ongoing writes
on them (trimming blocks on BTRFS gets rid of old root trees, so it's a
bit dangerous to do it while writes are happening).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
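[Editor's note: the fstrim discipline Austin recommends could be scripted
roughly like this. Mount points and the schedule are hypothetical, and
DRYRUN=1 makes the script only print what it would run -- flip it to 0
on a real system, at a quiet hour when no writes are in flight.]

```shell
#!/bin/sh
# Periodic trim pass over thin-provisioned btrfs mounts, per the advice
# above: run only when you're confident no writes are happening, since
# trim discards old root trees. Mount points are made-up examples.
DRYRUN=${DRYRUN:-1}

trim_one() {
    if [ "$DRYRUN" = "1" ]; then
        echo "fstrim -v $1"        # dry run: print instead of trimming
    else
        fstrim -v "$1"             # actually return unused blocks to the pool
    fi
}

for m in /mnt/backup1 /mnt/backup2; do
    trim_one "$m"
done
```

Hooked into cron at, say, 4am on Sundays, this keeps deleted data from
permanently occupying thin-pool space.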
Re: how to best segment a big block device in resizeable btrfs filesystems?
On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
> > Am I supposed to put LVM thin volumes underneath so that I can share
> > the same single 10TB raid5?
>
> Actually, because of the online resize ability in BTRFS, you don't
> technically _need_ to use thin provisioning here. It makes the
> maintenance a bit easier, but it also adds a much more complicated
> layer of indirection than just doing regular volumes.

You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- btrfs resize up the other btrfs
and I think LVM resize and btrfs resize are not linked, so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)
I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?

> You could (in theory) merge the LVM and software RAID5 layers, though
> that may make handling of the RAID5 layer a bit complicated if you
> choose to use thin provisioning (for some reason, LVM is unable to do
> on-line checks and rebuilds of RAID arrays that are acting as thin
> pool data or metadata).

Does LVM do built-in raid5 now? Is it as good/trustworthy as mdadm
raid5?
But yeah, if it's incompatible with thin provisioning, it's not that
useful.

> Alternatively, you could increase your array size, remove the software
> RAID layer, and switch to using BTRFS in raid10 mode so that you could
> eliminate one of the layers, though that would probably reduce the
> effectiveness of bcache (you might want to get a bigger cache device
> if you do this).

Sadly that won't work. I have more data than will fit on raid10.
Thanks for your suggestions though.
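[Editor's note: for reference, the shrink-one/grow-the-other dance Marc
lists looks roughly like the sketch below. VG/LV names, mount points,
and sizes are made-up; run() just echoes each command so this is a dry
run. The ordering is the critical part: shrink the filesystem before
its LV, grow the LV before its filesystem, and keep the filesystem no
larger than its LV at every step.]

```shell
#!/bin/sh
# Dry-run sketch of the four steps listed above. Remove the run()
# wrapper to do it live -- at your own risk, after double-checking sizes.
run() { echo "$@"; }

# 1. Shrink the btrfs that's bigger than needed (filesystem first,
#    to a size safely below the LV's intended new size).
run btrfs filesystem resize 900G /mnt/backup1

# 2. Shrink its LV, but never below the filesystem size from step 1.
run lvreduce -L 950G vg0/backup1

# 3. Grow the other LV with the freed space.
run lvextend -L +50G vg0/backup2

# 4. Grow the other btrfs to fill its enlarged LV; "max" avoids typing
#    a matching number by hand.
run btrfs filesystem resize max /mnt/backup2
```

Since btrfs accepts "max" on grow, only the shrink side needs a
hand-typed size in two places -- which is exactly where the "hope to
type the right numbers" risk lives.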
Still need to read up on whether I should do thin provisioning, or not.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: how to best segment a big block device in resizeable btrfs filesystems?
On 2018-07-02 11:18, Marc MERLIN wrote:
> Hi Qu,
>
> I'll split this part into a new thread:
>
>> 2) Don't keep unrelated snapshots in one btrfs.
>>    I totally understand that maintaining different btrfs would hugely
>>    add maintenance pressure, but as explained, all snapshots share
>>    one fragile extent tree.
>
> Yes, I understand that this is what I should do given what you
> explained. My main problem is knowing how to segment things so I don't
> end up with filesystems that are full while others are almost empty :)
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't
technically _need_ to use thin provisioning here. It makes the
maintenance a bit easier, but it also adds a much more complicated
layer of indirection than just doing regular volumes.
> If I do this, I would have software raid 5 < dmcrypt < bcache < lvm <
> btrfs
> That's a lot of layers, and that's also starting to make me nervous :)
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hoping that they are the
> right size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though
that may make handling of the RAID5 layer a bit complicated if you
choose to use thin provisioning (for some reason, LVM is unable to do
on-line checks and rebuilds of RAID arrays that are acting as thin pool
data or metadata).
Alternatively, you could increase your array size, remove the software
RAID layer, and switch to using BTRFS in raid10 mode so that you could
eliminate one of the layers, though that would probably reduce the
effectiveness of bcache (you might want to get a bigger cache device if
you do this).
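[Editor's note: the thin-provisioned alternative under discussion would
be set up roughly as sketched below. All VG/pool/LV names and sizes are
hypothetical, and run() echoes each command instead of executing it.]

```shell
#!/bin/sh
# Dry-run sketch: one LVM thin pool backing several btrfs filesystems.
run() { echo "$@"; }

# One pool spanning most of the array. Each thin LV can be declared
# larger than its "fair share" because pool blocks are only allocated
# when actually written.
run lvcreate --type thin-pool -L 9T -n pool0 vg0
run lvcreate -V 3T --thinpool vg0/pool0 -n backup1
run lvcreate -V 3T --thinpool vg0/pool0 -n backup2
run mkfs.btrfs /dev/vg0/backup1
run mkfs.btrfs /dev/vg0/backup2
```

The pool can be over-committed, which is why regular fstrim matters in
this layout: without it, blocks freed inside btrfs are never returned
to the pool, and it slowly fills with dead data.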
Re: how to best segment a big block device in resizeable btrfs filesystems?
Hi Qu,

I'll split this part into a new thread:

> 2) Don't keep unrelated snapshots in one btrfs.
>    I totally understand that maintaining different btrfs would hugely
>    add maintenance pressure, but as explained, all snapshots share one
>    fragile extent tree.

Yes, I understand that this is what I should do given what you
explained. My main problem is knowing how to segment things so I don't
end up with filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?
If I do this, I would have software raid 5 < dmcrypt < bcache < lvm <
btrfs
That's a lot of layers, and that's also starting to make me nervous :)
Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hoping that they are the
right size because I won't be able to change it later?

Thanks,
Marc
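[Editor's note: spelled out as commands, the stack Marc describes would
be assembled roughly as below. Every device name is a made-up example,
run() makes this a dry run, and each layer sits on the device node the
previous layer created.]

```shell
#!/bin/sh
# Dry-run sketch of: software raid5 < dmcrypt < bcache < lvm < btrfs.
run() { echo "$@"; }

run mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b-f]
run cryptsetup luksFormat /dev/md0
run cryptsetup open /dev/md0 cryptraid
run make-bcache -B /dev/mapper/cryptraid -C /dev/nvme0n1
run pvcreate /dev/bcache0
run vgcreate vg0 /dev/bcache0
# ...then LVs (regular or thin) and btrfs on top, as discussed in the
# thread.
```

Five layers means five places a resize, a rebuild, or a failure can
need attention, which is the nervousness Marc is voicing above.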