Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-08 Thread Duncan
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan пишет:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>> 
>>> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
 them (trimming blocks on BTRFS gets rid of old root trees, so it's a
 bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>> 
>> No.
>> 
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.

>> But other than simply by odds not using them again immediately, btrfs
>> has
>> no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>> 
>> 
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have 
stopped".  How soon is "immediately", and does the writes stopped 
condition account for data that has reached the device-hardware write 
buffer (so is no longer being transmitted to the device across the bus) 
but not been actually written to media, or not?

On a reasonably quiescent system, multiple empty write cycles are likely 
to have occurred since the last write barrier, and anything in-process is 
likely to have made it to media even if software is missing a write 
barrier it needs (a software bug) or the hardware lies about honoring the 
write barrier (a hardware bug, allegedly sometimes deliberate on hardware 
willing to gamble with your data that a crash won't happen at a critical 
moment, a somewhat rare occurrence, in order to improve normal-operation 
performance metrics).

On an IO-maxed system, data and write barriers are coming down as fast as 
the system can handle them, and the barriers become critical.  If the 
system crashes after something was supposed to reach media but didn't, 
either because a write barrier was missing or because the hardware/firmware 
lied and reported data as safely on-media when it wasn't, the btrfs 
atomic-COW guarantee of a consistent state at each commit goes out the 
window.

At this point it becomes useful to have a number of previous "guaranteed 
consistent state" roots to fall back on, with the /hope/ being that at 
least /one/ of them is usably consistent.  If all but the last one are 
wiped due to trim...

When the system isn't write-maxed the write will have almost certainly 
made it regardless of whether the barrier is there or not, because 
there's enough idle time to finish the current write before another one 
comes down the pipe, so the last-written root is almost certain to be 
fine regardless of barriers, and the history of past roots doesn't matter 
even if there's a crash.

If "immediately after writes have stopped" is strictly defined as a 
condition when all writes including the btrfs commit updating the current 
root and the superblock pointers to the current root have completed, with 
no new writes coming down the pipe in the meantime that might have 
delayed a critical update if a barrier was missed, then trimming old 
roots in this state should be entirely safe, and the distinction between 
that state and the "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, 
then the distinction between that state and "while writes are happening" 
remains blurry at best, and having old roots around to fall back on in 
case a write-barrier was missed (for whatever reason, hardware or 
software) becomes a very good thing.
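
For illustration, a minimal sketch of how one might reach for those older 
roots if the current one won't mount (read-only first; the device and 
mount point here are hypothetical placeholders):

  # /dev/sdX1 and /mnt/recovery are placeholders
  # try a plain read-only mount first
  mount -o ro /dev/sdX1 /mnt/recovery
  # if that fails, let btrfs walk the backup roots
  mount -o ro,usebackuproot /dev/sdX1 /mnt/recovery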

Of course the fact that trim/discard itself is an instruction written to 
the device in the combined command/data stream complicates the picture 
substantially.  If those write barriers get missed, who knows what state 
the new root is in, and if the old ones got erased...  But again, on a 
mostly idle system it'll probably all "just work", because the writes 
will likely all make it to media regardless, since there's not a bunch of 
other writes competing for limited write bandwidth and making ordering 
critical.

>> In the context of the discard mount option, that can mean there's never
>> any old roots available ever, as they've already been cleaned up by the
>> hardware due to the discard option telling the hardware to do it.

Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-05 Thread Andrei Borzenkov
03.07.2018 10:15, Duncan пишет:
> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
> 
>> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>> bit dangerous to do it while writes are happening).
>>
>> Could you please elaborate? Do you mean btrfs can trim data before new
>> writes are actually committed to disk?
> 
> No.
> 
> But normally old roots aren't rewritten for some time simply due to odds 
> (fuller filesystems will of course recycle them sooner), and the btrfs 
> mount option usebackuproot (formerly recovery, until the norecovery mount 
> option that parallels that of other filesystems was added and this option 
> was renamed to avoid confusion) can be used to try an older root if the 
> current root is too damaged to successfully mount.
> 
> But other than simply by odds not using them again immediately, btrfs has
> no special protection for those old roots, and trim/discard will recover 
> them to hardware-unused as it does any other unused space, tho whether it 
> simply marks them for later processing or actually processes them 
> immediately is up to the individual implementation -- some do it 
> immediately, killing all chances at using the backup root because it's 
> already zeroed out, some don't.
> 

How is it relevant to "while writes are happening"? Will trimming old
trees immediately after writes have stopped be any different? Why?

> In the context of the discard mount option, that can mean there's never 
> any old roots available ever, as they've already been cleaned up by the 
> hardware due to the discard option telling the hardware to do it.
> 
> But even not using that mount option, and simply doing the trims 
> periodically, as done weekly by for instance the systemd fstrim timer and 
> service units, or done manually if you prefer, obviously potentially 
> wipes the old roots at that point.  If the system's effectively idle at 
> the time, not much risk as the current commit is likely to represent a 
> filesystem in full stasis, but if there's lots of writes going on at that 
> moment *AND* the system happens to crash at just the wrong time, before 
> additional commits have recreated at least a bit of root history, again, 
> you'll potentially be left without any old roots for the usebackuproot 
> mount option to try to fall back to, should it actually be necessary.
> 

Sorry? You are just saying that "previous state can be discarded before
new state is committed", just more verbosely.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-03 Thread Qu Wenruo



On 2018年07月03日 17:55, Paul Jones wrote:
>> -Original Message-
>> From: linux-btrfs-ow...@vger.kernel.org <linux-btrfs-ow...@vger.kernel.org> On Behalf Of Marc MERLIN
>> Sent: Tuesday, 3 July 2018 2:16 PM
>> To: Qu Wenruo 
>> Cc: Su Yue ; linux-btrfs@vger.kernel.org
>> Subject: Re: how to best segment a big block device in resizeable btrfs
>> filesystems?
>>
>> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
 If I do this, I would have
 software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of
 layers, and that's also starting to make me nervous :)
>>>
>>> If you could keep the number of snapshots to minimal (less than 10)
>>> for each btrfs (and the number of send source is less than 5), one big
>>> btrfs may work in that case.
>>
>> Well, we kind of discussed this already. If btrfs falls over if you reach
>> 100 snapshots or so, and it sure seems to in my case, I won't be much better
>> off.
>> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
>> unable to use swap, is a big deal in my case. You also confirmed that btrfs
>> check lowmem does not scale to filesystems like mine, so this translates into
>> "if regular btrfs check repair can't fit in 32GB, I am completely out of 
>> luck if
>> anything happens to the filesystem"
> 
> Just out of curiosity I had a look at my backup filesystem.
> vm-server /media/backup # btrfs fi us /media/backup/
> Overall:
> Device size:   5.46TiB
> Device allocated:  3.42TiB
> Device unallocated:2.04TiB
> Device missing:  0.00B
> Used:  1.80TiB
> Free (estimated):  1.83TiB  (min: 1.83TiB)
> Data ratio:   2.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
> Data,RAID1: Size:1.69TiB, Used:906.26GiB

It doesn't affect how fast check runs at all, unless --check-data-csum is
specified.

And even if --check-data-csum is specified, most reads will still be
sequential, and deduped/reflinked extents won't affect the csum
verification speed.

>/dev/mapper/a-backup--a 1.69TiB
>/dev/mapper/b-backup--b 1.69TiB
> 
> Metadata,RAID1: Size:19.00GiB, Used:16.90GiB

This is the main factor contributing to btrfs check time.
Just consider it as the minimal amount of data btrfs check needs to read.
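
For reference, that figure is the one already shown in the output above; a 
quick way to pull just it out of the same command:

  btrfs filesystem usage /media/backup/ | grep -A1 '^Metadata'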

>/dev/mapper/a-backup--a 19.00GiB
>/dev/mapper/b-backup--b 19.00GiB
> 
> System,RAID1: Size:64.00MiB, Used:336.00KiB
>/dev/mapper/a-backup--a 64.00MiB
>/dev/mapper/b-backup--b 64.00MiB
> 
> Unallocated:
>/dev/mapper/a-backup--a 1.02TiB
>/dev/mapper/b-backup--b 1.02TiB
> 
> compress=zstd,space_cache=v2
> 202 snapshots, heavily de-duplicated
> 551G / 361,000 files in latest snapshot

No wonder it's so slow for lowmem mode.

> 
> Btrfs check normal mode took 12 mins and 11.5G ram
> Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G

For lowmem, btrfs check will use 25% of your total memory as cache to
speed it up a little (but as you can see, it's still slow).
Maybe we could add some option to control how much memory lowmem mode is
allowed to use.
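
For illustration, the two modes as they would be invoked (on an unmounted 
filesystem; the device path is simply taken from the example above):

  # original mode: fast, but needs lots of RAM
  btrfs check /dev/mapper/a-backup--a
  # lowmem mode: bounded memory use, much slower
  btrfs check --mode=lowmem /dev/mapper/a-backup--a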

Thanks,
Qu

> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-03 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org <linux-btrfs-ow...@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 2:16 PM
> To: Qu Wenruo 
> Cc: Su Yue ; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > > If I do this, I would have
> > > software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of
> > > layers, and that's also starting to make me nervous :)
> >
> > If you could keep the number of snapshots to minimal (less than 10)
> > for each btrfs (and the number of send source is less than 5), one big
> > btrfs may work in that case.
> 
> Well, we kind of discussed this already. If btrfs falls over if you reach
> 100 snapshots or so, and it sure seems to in my case, I won't be much better
> off.
> Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
> unable to use swap, is a big deal in my case. You also confirmed that btrfs
> check lowmem does not scale to filesystems like mine, so this translates into
> "if regular btrfs check repair can't fit in 32GB, I am completely out of luck 
> if
> anything happens to the filesystem"

Just out of curiosity I had a look at my backup filesystem.
vm-server /media/backup # btrfs fi us /media/backup/
Overall:
Device size:   5.46TiB
Device allocated:  3.42TiB
Device unallocated:2.04TiB
Device missing:  0.00B
Used:  1.80TiB
Free (estimated):  1.83TiB  (min: 1.83TiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID1: Size:1.69TiB, Used:906.26GiB
   /dev/mapper/a-backup--a 1.69TiB
   /dev/mapper/b-backup--b 1.69TiB

Metadata,RAID1: Size:19.00GiB, Used:16.90GiB
   /dev/mapper/a-backup--a 19.00GiB
   /dev/mapper/b-backup--b 19.00GiB

System,RAID1: Size:64.00MiB, Used:336.00KiB
   /dev/mapper/a-backup--a 64.00MiB
   /dev/mapper/b-backup--b 64.00MiB

Unallocated:
   /dev/mapper/a-backup--a 1.02TiB
   /dev/mapper/b-backup--b 1.02TiB

compress=zstd,space_cache=v2
202 snapshots, heavily de-duplicated
551G / 361,000 files in latest snapshot

Btrfs check normal mode took 12 mins and 11.5G ram
Lowmem mode I stopped after 4 hours, max memory usage was around 3.9G
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-03 Thread Duncan
Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
> 
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds 
(fuller filesystems will of course recycle them sooner), and the btrfs 
mount option usebackuproot (formerly recovery, until the norecovery mount 
option that parallels that of other filesystems was added and this option 
was renamed to avoid confusion) can be used to try an older root if the 
current root is too damaged to successfully mount.

But other than simply by odds not using them again immediately, btrfs has 
no special protection for those old roots, and trim/discard will recover 
them to hardware-unused as it does any other unused space, tho whether it 
simply marks them for later processing or actually processes them 
immediately is up to the individual implementation -- some do it 
immediately, killing all chances at using the backup root because it's 
already zeroed out, some don't.

In the context of the discard mount option, that can mean there's never 
any old roots available ever, as they've already been cleaned up by the 
hardware due to the discard option telling the hardware to do it.

But even not using that mount option, and simply doing the trims 
periodically, as done weekly by for instance the systemd fstrim timer and 
service units, or done manually if you prefer, obviously potentially 
wipes the old roots at that point.  If the system's effectively idle at 
the time, not much risk as the current commit is likely to represent a 
filesystem in full stasis, but if there's lots of writes going on at that 
moment *AND* the system happens to crash at just the wrong time, before 
additional commits have recreated at least a bit of root history, again, 
you'll potentially be left without any old roots for the usebackuproot 
mount option to try to fall back to, should it actually be necessary.
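
For illustration, the two periodic-trim approaches mentioned above (the 
timer unit is the one systemd ships; the mount point is a hypothetical 
placeholder):

  # let systemd run its weekly trim
  systemctl enable --now fstrim.timer
  # or trim manually, ideally while the filesystem is quiescent
  # (/mnt/backup is a placeholder)
  sync
  fstrim -v /mnt/backup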

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Tue, Jul 03, 2018 at 04:26:37AM +, Paul Jones wrote:
> I don't have any experience with this, but since it's the internet let me 
> tell you how I'd do it anyway 

That's the spirit :)

> raid5
> dm-crypt
> lvm (using thin provisioning + cache)
> btrfs
> 
> The cache mode on lvm requires you to set up all your volumes first, then
> add caching to those volumes last. If you need to modify the volume then
> you have to remove the cache, make your changes, then re-add the cache. It
> sounds like a pain, but having the cache separate from the data is quite
> handy.

I'm ok enough with that.

> Given you are running a backup server I don't think the cache would
> really do much unless you enable writeback mode. If you can split up your
> filesystem a bit to the point that btrfs check doesn't OOM that will
> seriously help performance as well. Rsync might be feasible again.

I'm a bit wary of write caching with the issues I've had. I may do
write-through, but not writeback :)

But caching helps indeed for my older filesystems that are still backed up
via rsync because the source fs is ext4 and not btrfs.

Thanks for the suggestions
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Paul Jones

> -Original Message-
> From: Marc MERLIN 
> Sent: Tuesday, 3 July 2018 2:07 PM
> To: Paul Jones 
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> On Tue, Jul 03, 2018 at 12:51:30AM +, Paul Jones wrote:
> > You could combine bcache and lvm if you are happy to use dm-cache
> instead (which lvm uses).
> > I use it myself (but without thin provisioning) and it works well.
> 
> Interesting point. So, I used to use lvm and then lvm2 many years ago until I
> got tired of its performance, especially as soon as I took even a single
> snapshot.
> But that was a long time ago now, just saying that I'm a bit rusty on LVM
> itself.
> 
> That being said, if I have
> raid5
> dm-cache
> dm-crypt
> dm-thin
> 
> That's still 4 block layers under btrfs.
> Am I any better off using dm-cache instead of bcache? My understanding is
> that it only replaces one block layer with another one and one codebase with
> another.

True, I didn't think of it like that.

> Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
> might change things, or not.
> I'll admit that setting up and maintaining bcache is a bit of a pain, I only
> used it at the time because it seemed more ready then, but we're a few years
> later now.
> 
> So, what do you recommend nowadays, assuming you've used both?
> (given that it's literally going to take days to recreate my array, I'd
> rather do it once and the right way the first time :) )

I don't have any experience with this, but since it's the internet let me tell 
you how I'd do it anyway 
raid5
dm-crypt
lvm (using thin provisioning + cache)
btrfs

The cache mode on lvm requires you to set up all your volumes first, then add 
caching to those volumes last. If you need to modify the volume then you have 
to remove the cache, make your changes, then re-add the cache. It sounds like a 
pain, but having the cache separate from the data is quite handy.
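
For illustration, a rough sketch of that detach/re-attach cycle (the VG, LV 
and cache-pool names are hypothetical, and exact options depend on the LVM 
version):

  # vg0/backup and vg0/cachepool are placeholder names
  # attach an existing cache pool to a volume
  lvconvert --type cache --cachepool vg0/cachepool vg0/backup
  # detach it again before reshaping the volume (dirty blocks are written back)
  lvconvert --uncache vg0/backup
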
Given you are running a backup server I don't think the cache would really do 
much unless you enable writeback mode. If you can split up your filesystem a 
bit to the point that btrfs check doesn't OOM that will seriously help 
performance as well. Rsync might be feasible again.

Paul.


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Andrei Borzenkov
02.07.2018 21:35, Austin S. Hemmelgarn пишет:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Could you please elaborate? Do you mean btrfs can trim data before new
writes are actually committed to disk?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Andrei Borzenkov
03.07.2018 04:37, Qu Wenruo пишет:
> 
> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.
> 

There is new writecache target which may help in this case.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Tue, Jul 03, 2018 at 09:37:47AM +0800, Qu Wenruo wrote:
> > If I do this, I would have
> > software raid 5 < dmcrypt < bcache < lvm < btrfs
> > That's a lot of layers, and that's also starting to make me nervous :)
> 
> If you could keep the number of snapshots to minimal (less than 10) for
> each btrfs (and the number of send source is less than 5), one big btrfs
> may work in that case.
 
Well, we kind of discussed this already. If btrfs falls over if you reach
100 snapshots or so, and it sure seems to in my case, I won't be much better
off.
Having btrfs check --repair fail because 32GB of RAM is not enough, and it's
unable to use swap, is a big deal in my case. You also confirmed that btrfs
check lowmem does not scale to filesystems like mine, so this translates
into "if regular btrfs check repair can't fit in 32GB, I am completely out
of luck if anything happens to the filesystem"

You're correct that I could tweak my backups and snapshot rotation to get
from 250 or so down to 100, but it seems that I'll just be hoping to avoid
the problem by being just under the limit, until I'm not, again, and it'll
be too late to do anything about it next time I'm in trouble, putting me
right back in the same spot I'm in now.
Is all this fair to say, or did I misunderstand?

> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.

That's a good point. So, what I didn't explain is that I still have some old
filesystems that do get backed up with rsync instead of btrfs send (going
into the same filesystem, but not same subvolume).
Because rsync is so painfully slow when it needs to scan both sides before
it'll even start doing any work, bcache helps there.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Tue, Jul 03, 2018 at 12:51:30AM +, Paul Jones wrote:
> You could combine bcache and lvm if you are happy to use dm-cache instead 
> (which lvm uses).
> I use it myself (but without thin provisioning) and it works well.

Interesting point. So, I used to use lvm and then lvm2 many years ago until
I got tired of its performance, especially as soon as I took even a
single snapshot.
But that was a long time ago now, just saying that I'm a bit rusty on LVM
itself.

That being said, if I have
raid5
dm-cache
dm-crypt
dm-thin

That's still 4 block layers under btrfs.
Am I any better off using dm-cache instead of bcache? My understanding is
that it only replaces one block layer with another one and one codebase with
another.

Mmmh, a bit of reading shows that dm-cache is now used as lvmcache, which
might change things, or not.
I'll admit that setting up and maintaining bcache is a bit of a pain, I only
used it at the time because it seemed more ready then, but we're a few years
later now.

So, what do you recommend nowadays, assuming you've used both?
(given that it's literally going to take days to recreate my array, I'd
rather do it once and the right way the first time :) )

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Qu Wenruo



On 2018年07月02日 23:18, Marc MERLIN wrote:
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>I totally understand that maintain different btrfs would hugely add
>>maintenance pressure, but as explains, all snapshots share one
>>fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)

If you could keep the number of snapshots to minimal (less than 10) for
each btrfs (and the number of send source is less than 5), one big btrfs
may work in that case.
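
For illustration, a quick way to see how many snapshots a given btrfs 
currently carries (the mount point is a placeholder):

  # /mnt/backup is a placeholder; -s lists only snapshot subvolumes
  btrfs subvolume list -s /mnt/backup | wc -l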

BTW, IMHO the bcache is not really helping for backup system, which is
more write oriented.

Thanks,
Qu

> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
> 
> Thanks,
> Marc
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org <linux-btrfs-ow...@vger.kernel.org> On Behalf Of Marc MERLIN
> Sent: Tuesday, 3 July 2018 1:19 AM
> To: Qu Wenruo 
> Cc: Su Yue ; linux-btrfs@vger.kernel.org
> Subject: Re: how to best segment a big block device in resizeable btrfs
> filesystems?
> 
> Hi Qu,
> 
> I'll split this part into a new thread:
> 
> > 2) Don't keep unrelated snapshots in one btrfs.
> >I totally understand that maintain different btrfs would hugely add
> >maintenance pressure, but as explains, all snapshots share one
> >fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share the
> same single 10TB raid5?
> 
> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs That's a lot of layers, and
> that's also starting to make me nervous :)

You could combine bcache and lvm if you are happy to use dm-cache instead 
(which lvm uses).
I use it myself (but without thin provisioning) and it works well.


> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right size
> because I won't be able to change it later?
> 
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems 
>    what McDonalds is to gourmet 
> cooking
> Home page: http://marc.merlins.org/   | PGP 
> 7F55D5F27AAF9D08
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 02:35:19PM -0400, Austin S. Hemmelgarn wrote:
> >I kind of liked the thin provisioning idea because it's hands off,
> >which is appealing. Any reason against it?
> No, not currently, except that it adds a whole lot more stuff between 
> BTRFS and whatever layer is below it.  That increase in what's being 
> done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
> drives, but not on decent consumer SATA SSD's).
> 
> There used to be issues running BTRFS on top of LVM thin targets which 
> had zero mode turned off, but AFAIK, all of those problems were fixed 
> long ago (before 4.0).

I see, thanks for the heads up.

> >Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> >raid5?
> Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
> RAID6, and optionally for RAID0, RAID1, and RAID10.
 
Ok, that makes me feel a bit better :)

> >But yeah, if it's incompatible with thin provisioning, it's not that
> >useful.
> It's technically not incompatible, just a bit of a pain.  Last time I 
> tried to use it, you had to jump through hoops to repair a damaged RAID 
> volume that was serving as an underlying volume in a thin pool, and it 
> required keeping the thin pool offline for the entire duration of the 
> rebuild.

Argh, not good :( / thanks for the heads up.

> If you do go with thin provisioning, I would encourage you to make 
> certain to call fstrim on the BTRFS volumes on a semi regular basis so 
> that the thin pool doesn't get filled up with old unused blocks, 

That's a very good point/reminder, thanks for that. I guess it's like
running on an ssd :)

> preferably when you are 100% certain that there are no ongoing writes on 
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
> dangerous to do it while writes are happening).
 
Argh, that will be harder, but I'll try.

Given what you said, it sounds like I'll still be best off with separate
layers to avoid the rebuild problem you mentioned.
So it'll be
swraid5 / dmcrypt / bcache / lvm dm thin / btrfs

Hopefully that will work well enough.
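
For illustration, a very rough sketch of assembling that stack from the 
bottom up (all device, VG and LV names and the sizes are hypothetical, and 
each layer has options not shown here):

  # everything below uses placeholder names and sizes
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
  cryptsetup luksFormat /dev/md0
  cryptsetup open /dev/md0 cryptback
  make-bcache -B /dev/mapper/cryptback      # plus -C on the SSD, then attach
  pvcreate /dev/bcache0
  vgcreate vgback /dev/bcache0
  lvcreate --type thin-pool -l 90%FREE -n pool vgback
  lvcreate -V 2T --thinpool vgback/pool -n backup1
  mkfs.btrfs /dev/vgback/backup1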

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 13:34, Marc MERLIN wrote:

> On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
>>> Am I supposed to put LVM thin volumes underneath so that I can share
>>> the same single 10TB raid5?
>>
>> Actually, because of the online resize ability in BTRFS, you don't
>> technically _need_ to use thin provisioning here.  It makes the maintenance
>> a bit easier, but it also adds a much more complicated layer of indirection
>> than just doing regular volumes.
> 
> You're right that I can use btrfs resize, but then I still need an LVM
> device underneath, correct?
> So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
> each of the full size available (as a guess), and then I'd have to
> - btrfs resize down one that's bigger than I need
> - LVM shrink the LV
> - LVM grow the other LV
> - LVM resize up the other btrfs
> 
> and I think LVM resize and btrfs resize are not linked so I have to do
> them separately and hope to type the right numbers each time, correct?
> (or is that easier now?)
> 
> I kind of liked the thin provisioning idea because it's hands off,
> which is appealing. Any reason against it?
No, not currently, except that it adds a whole lot more stuff between 
BTRFS and whatever layer is below it.  That increase in what's being 
done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
drives, but not on decent consumer SATA SSD's).

There used to be issues running BTRFS on top of LVM thin targets which 
had zero mode turned off, but AFAIK, all of those problems were fixed 
long ago (before 4.0).

>> You could (in theory) merge the LVM and software RAID5 layers, though that
>> may make handling of the RAID5 layer a bit complicated if you choose to use
>> thin provisioning (for some reason, LVM is unable to do on-line checks and
>> rebuilds of RAID arrays that are acting as thin pool data or metadata).
> 
> Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
> raid5?
Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
RAID6, and optionally for RAID0, RAID1, and RAID10.

> But yeah, if it's incompatible with thin provisioning, it's not that
> useful.
It's technically not incompatible, just a bit of a pain.  Last time I 
tried to use it, you had to jump through hoops to repair a damaged RAID 
volume that was serving as an underlying volume in a thin pool, and it 
required keeping the thin pool offline for the entire duration of the 
rebuild.

>> Alternatively, you could increase your array size, remove the software RAID
>> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
>> one of the layers, though that would probably reduce the effectiveness of
>> bcache (you might want to get a bigger cache device if you do this).
> 
> Sadly that won't work. I have more data than will fit on raid10
> 
> Thanks for your suggestions though.
> Still need to read up on whether I should do thin provisioning, or not.
If you do go with thin provisioning, I would encourage you to make 
certain to call fstrim on the BTRFS volumes on a semi regular basis so 
that the thin pool doesn't get filled up with old unused blocks, 
preferably when you are 100% certain that there are no ongoing writes on 
them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
dangerous to do it while writes are happening).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:
> > Am I supposed to put LVM thin volumes underneath so that I can share
> > the same single 10TB raid5?
>
> Actually, because of the online resize ability in BTRFS, you don't
> technically _need_ to use thin provisioning here.  It makes the maintenance
> a bit easier, but it also adds a much more complicated layer of indirection
> than just doing regular volumes.

You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to 
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs

and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)
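
For illustration, a hedged sketch of that dance (all names and sizes are 
hypothetical; shrink the filesystem first and by at least as much as the 
LV, then grow in the opposite order):

  # vg0/backup-a, vg0/backup-b and the mount points are placeholders
  btrfs filesystem resize -50G /mnt/backup-a    # shrink the fs...
  lvreduce -L -50G vg0/backup-a                 # ...then the LV under it
  lvextend -L +50G vg0/backup-b                 # grow the other LV...
  btrfs filesystem resize max /mnt/backup-b     # ...then its fs to fill it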

I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?

> You could (in theory) merge the LVM and software RAID5 layers, though that
> may make handling of the RAID5 layer a bit complicated if you choose to use
> thin provisioning (for some reason, LVM is unable to do on-line checks and
> rebuilds of RAID arrays that are acting as thin pool data or metadata).
 
Does LVM do built in raid5 now? Is it as good/trustworthy as mdadm
raid5?
But yeah, if it's incompatible with thin provisioning, it's not that
useful.

> Alternatively, you could increase your array size, remove the software RAID
> layer, and switch to using BTRFS in raid10 mode so that you could eliminate
> one of the layers, though that would probably reduce the effectiveness of
> bcache (you might want to get a bigger cache device if you do this).

Sadly that won't work. I have more data than will fit on raid10

Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 11:18, Marc MERLIN wrote:

> Hi Qu,
> 
> I'll split this part into a new thread:
> 
>> 2) Don't keep unrelated snapshots in one btrfs.
>>    I totally understand that maintain different btrfs would hugely add
>>    maintenance pressure, but as explains, all snapshots share one
>>    fragile extent tree.
> 
> Yes, I understand that this is what I should do given what you
> explained.
> My main problem is knowing how to segment things so I don't end up with
> filesystems that are full while others are almost empty :)
> 
> Am I supposed to put LVM thin volumes underneath so that I can share
> the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't 
technically _need_ to use thin provisioning here.  It makes the 
maintenance a bit easier, but it also adds a much more complicated layer 
of indirection than just doing regular volumes.

> If I do this, I would have
> software raid 5 < dmcrypt < bcache < lvm < btrfs
> That's a lot of layers, and that's also starting to make me nervous :)
> 
> Is there any other way that does not involve me creating smaller block
> devices for multiple btrfs filesystems and hope that they are the right
> size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though 
that may make handling of the RAID5 layer a bit complicated if you 
choose to use thin provisioning (for some reason, LVM is unable to do 
on-line checks and rebuilds of RAID arrays that are acting as thin pool 
data or metadata).

Alternatively, you could increase your array size, remove the software 
RAID layer, and switch to using BTRFS in raid10 mode so that you could 
eliminate one of the layers, though that would probably reduce the 
effectiveness of bcache (you might want to get a bigger cache device if 
you do this).
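
For illustration, what that single-layer alternative might look like (the 
devices are hypothetical placeholders):

  # /dev/sd[b-e] are placeholders
  mkfs.btrfs -d raid10 -m raid10 /dev/sd[b-e]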

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Marc MERLIN
Hi Qu,

I'll split this part into a new thread:

> 2) Don't keep unrelated snapshots in one btrfs.
>I totally understand that maintain different btrfs would hugely add
>maintenance pressure, but as explains, all snapshots share one
>fragile extent tree.

Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?

If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)

Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html