Re: Is metadata redundant over more than one drive with raid0 too?
Hi, Marc

Raid0 is not redundant in any way. See inline below.

On 2014/05/04 01:27 AM, Marc MERLIN wrote:
> So, I was thinking. In the past, I've done this:
> mkfs.btrfs -d raid0 -m raid1 -L btrfs_raid0 /dev/mapper/raid0d*
> My rationale at the time was that if I lose a drive, I'll still have full
> metadata for the entire filesystem and only missing files.
> If I have raid1 with 2 drives, I should end up with 4 copies of each
> file's metadata, right?
>
> But now I have 2 questions
> 1) btrfs has two copies of all metadata on even a single drive, correct?

Only when *specifically* using -m dup (which is the default on a single
non-SSD device) will there be two copies of the metadata stored on a single
device. This is not recommended when using multiple devices, as it means one
device failure will likely cause critical loss of metadata.

When using -m raid1 (as in your first example above, and as is the default
with multiple devices), two copies of the metadata are distributed across
two devices - each of those devices holds only a single copy.

> If so, and I have a -d raid0 -m raid0 filesystem, are both copies of the
> metadata on the same drive, or is btrfs smart enough to spread out
> metadata copies so that they're not on the same drive?

With -m raid0 there is only a single copy, albeit striped across the drives.

> 2) does btrfs lay out files on raid0 so that files aren't striped across
> more than one drive, so that if I lose a drive, I only lose whole files,
> but not little chunks of all my files, making my entire FS toast?

raid0 currently allocates a chunk on each device and then uses RAID0-like
stripes across these chunks until a new chunk needs to be allocated. This is
good for performance but not good for redundancy. A total failure of a
single device will mean any large files will be lost; only files smaller
than the default per-disk stripe width (I believe this used to be 4K and is
now 16K - I could be wrong) that happen to be stored entirely on the
remaining disk will be available.

The scenario you mentioned at the beginning - "if I lose a drive, I'll
still have full metadata for the entire filesystem and only missing files"
- is more applicable to using -m raid1 -d single. Single is not geared
towards performance and, though it doesn't guarantee a file is only on a
single disk, the allocation does mean that the majority of files smaller
than a chunk will be stored on only one disk or the other - not both.

> Thanks,
> Marc

I hope the above is helpful.

--
Brendan Hide
http://swiftspirit.co.za/
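To see which profiles a filesystem actually ended up with, btrfs fi df
prints one line per allocation profile. A minimal sketch, assuming two
spare devices (/dev/sdb and /dev/sdc are placeholders):

  $ mkfs.btrfs -d raid0 -m raid1 -L btrfs_raid0 /dev/sdb /dev/sdc
  $ mount /dev/sdb /mnt
  # Expect lines like "Data, RAID0: ..." and "Metadata, RAID1: ..."
  $ btrfs fi df /mnt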
Re: Using mount -o bind vs mount -o subvol=vol
On 2014/05/04 02:47 AM, Marc MERLIN wrote:
> Is there any functional difference between
> mount -o subvol=usr /dev/sda1 /usr
> and
> mount /dev/sda1 /mnt/btrfs_pool
> mount -o bind /mnt/btrfs_pool/usr /usr
> ?
> Thanks,
> Marc

There are two issues with the second approach:

1) There will be a *very* small performance penalty (negligible, really).

2) Old snapshots and other supposedly-hidden subvolumes will be accessible
under /mnt/btrfs_pool. This is a minor security concern (which of course
may not concern you, depending on your use-case). There are a few similar
minor security concerns - the recently-highlighted issue with old snapshots
is the potential that old vulnerable binaries within a snapshot are still
accessible and/or executable.

--
Brendan Hide
http://swiftspirit.co.za/
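For reference, the same two setups expressed as fstab entries - a sketch
using the device and paths from the question:

  # Option 1: mount the subvolume directly
  /dev/sda1            /usr             btrfs  subvol=usr  0  0

  # Option 2: mount the pool, then bind-mount the subvolume out of it
  /dev/sda1            /mnt/btrfs_pool  btrfs  defaults    0  0
  /mnt/btrfs_pool/usr  /usr             none   bind        0  0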
Re: Copying related snapshots to another server with btrfs send/receive?
On 2014/05/04 05:12 AM, Marc MERLIN wrote:
> Another question I just came up with.
> If I have historical snapshots like so:
> backup
> backup.sav1
> backup.sav2
> backup.sav3
> If I want to copy them to another server, can btrfs send/receive let me
> copy all of them to another btrfs pool while keeping the duplicated-block
> relationship between all of them?
> Note that the backup.sav dirs will never change, so I won't need
> incremental backups on those, just a one-time send.
> I believe this is supposed to work, correct?
> The only part I'm not clear about is: am I supposed to copy them all at
> once in the same send command, or one by one?
> If they had to be copied together, and if I create a new snapshot of
> backup:
> backup.sav4
> and use btrfs send to that same destination, is btrfs send/receive indeed
> able to keep the shared-block relationship?
> Thanks,
> Marc

I'm not sure if they can be sent in one go. :-/

Sending one at a time, the shared-data relationship will be kept by using
the -p (parent) parameter. Send will only send the differences, and receive
will create a new snapshot, adjusting for those differences, even when the
receive is run on a remote server:

$ btrfs send backup | btrfs receive $path/
$ btrfs send -p backup backup.sav1 | btrfs receive $path/
$ btrfs send -p backup.sav1 backup.sav2 | btrfs receive $path/
$ btrfs send -p backup.sav2 backup.sav3 | btrfs receive $path/
$ btrfs send -p backup.sav3 backup.sav4 | btrfs receive $path/

--
Brendan Hide
http://swiftspirit.co.za/
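Since the destination is another server, the same chain is typically piped
through ssh; a sketch (the hostname and destination path are placeholders):

  $ btrfs send backup | ssh backuphost btrfs receive /mnt/pool/
  $ btrfs send -p backup backup.sav1 | ssh backuphost btrfs receive /mnt/pool/
  # ...and so on for each subsequent snapshot, always passing the
  # previously-transferred snapshot as -p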
Re: Is metadata redundant over more than one drive with raid0 too?
On Sun, May 04, 2014 at 08:57:19AM +0200, Brendan Hide wrote:
> Hi, Marc
> Raid0 is not redundant in any way. See inline below.

Thanks for clearing things up.

> > But now I have 2 questions
> > 1) btrfs has two copies of all metadata on even a single drive, correct?
>
> Only when *specifically* using -m dup (which is the default on a single
> non-SSD device), will there be two copies of the metadata stored on a
> single device. This is not recommended when using

Ah, so -m dup is the default like I thought, but not on SSD? Oops, that
means that my laptop does not have redundant metadata on its SSD like I
thought. Thanks for the heads-up.

Ah, I see the man page now:
"This is because SSDs can remap blocks internally so duplicate blocks could
end up in the same erase block which negates the benefits of doing metadata
duplication."

> multiple devices as it means one device failure will likely cause
> critical loss of metadata.

That's the part where I'm not clear: what's the difference between -m dup
and -m raid1? Don't they both mean 2 copies of the metadata? Is -m dup only
valid for a single drive, while -m raid1 is for 2+ drives?

> > If so, and I have a -d raid0 -m raid0 filesystem, are both copies of
> > the metadata on the same drive or is btrfs smart enough to spread out
> > metadata copies so that they're not on the same drive?
>
> This will mean there is only a single copy, albeit striped across the
> drives.

Ok, so -m raid0 only means a single copy of metadata, thanks for explaining.

> good for redundancy. A total failure of a single device will mean any
> large files will be lost and only files smaller than the default per-disk
> stripe width (I believe this used to be 4K and is now 16K - I could be
> wrong) stored only on the remaining disk will be available.

Gotcha, thanks for confirming. So -m raid1 -d raid0 really only protects
against metadata corruption or a single block loss, but otherwise if you
lost a drive in a 2-drive raid0, you'll have lost more than just half your
files.

> The scenario you mentioned at the beginning - "if I lose a drive, I'll
> still have full metadata for the entire filesystem and only missing
> files" - is more applicable to using -m raid1 -d single. Single is not
> geared towards performance and, though it doesn't guarantee a file is
> only on a single disk, the allocation does mean that the majority of
> files smaller than a chunk will be stored on only one disk or the other
> - not both.

Ok, so in other words:
-d raid0: if you lose 1 drive out of 2, you may end up with some small
files left and the rest will be lost.
-d single: you're more likely to have files be on one drive or the other,
although there is no guarantee there either.
Correct?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: copies= option
On 2014/05/04 05:27 AM, Duncan wrote:
> Russell Coker posted on Sun, 04 May 2014 12:16:54 +1000 as excerpted:
> > Are there any plans for a feature like the ZFS copies= option? I'd like
> > to be able to set copies= separately for data and metadata. In most
> > cases RAID-1 provides adequate data protection, but I'd like to have
> > RAID-1 and copies=2 for metadata so that if one disk dies and another
> > has some bad sectors during recovery, I'm unlikely to lose metadata.
>
> Hugo's the guy with the better info on this one, but until he answers...
>
> The zfs license issues mean it's not an option for me, and I'm thus not
> familiar with its options in any detail, but if I understand the question
> correctly, yes. And of course since btrfs treats data and metadata
> separately, it's extremely unlikely that any sort of copies= option
> wouldn't be separately configurable for each.
>
> There was a discussion of a very nice multi-way-configuration schema that
> I deliberately stayed out of, as both a bit above my head and far enough
> in the future that I didn't want to get my hopes up too high about it
> yet. I already want N-way-mirroring so bad I can taste it, and this was
> that and way more... if/when it ever actually gets coded and committed to
> the mainline kernel btrfs. As I said, Hugo should have more on it, as he
> was active in that discussion and it seemed to line up perfectly with
> his area of interest.

The simple answer is yes, this is planned. As Duncan implied, however, it
is not on the immediate roadmap. Internally we appear to be referring to
this feature as "N-way redundancy" or "N-way mirroring".

My understanding is that the biggest hurdle before the primary devs will
look into N-way redundancy is to finish the Raid5/6 implementation to
include self-healing/scrubbing support - a critical issue before it can be
adopted further.

--
Brendan Hide
http://swiftspirit.co.za/
Re: Copying related snapshots to another server with btrfs send/receive?
On Sun, May 04, 2014 at 09:16:02AM +0200, Brendan Hide wrote:
> Sending one at a time, the shared-data relationship will be kept by using
> the -p (parent) parameter. Send will only send the differences and
> receive will create a new snapshot, adjusting for those differences, even
> when the receive is run on a remote server.
>
> $ btrfs send backup | btrfs receive $path/
> $ btrfs send -p backup backup.sav1 | btrfs receive $path/
> $ btrfs send -p backup.sav1 backup.sav2 | btrfs receive $path/
> $ btrfs send -p backup.sav2 backup.sav3 | btrfs receive $path/
> $ btrfs send -p backup.sav3 backup.sav4 | btrfs receive $path/

So this is exactly the same as what I do for incremental backups with btrfs
send, but -p only works if the snapshot is read-only, does it not?

I do use that for my incremental syncs and don't mind read-only snapshots
there, but if I have read/write snapshots that are there for reasons other
than btrfs send incrementals, can I still send them that way with -p?
(I thought that wouldn't work.)

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Is metadata redundant over more than one drive with raid0 too?
On 2014/05/04 09:24 AM, Marc MERLIN wrote:
> On Sun, May 04, 2014 at 08:57:19AM +0200, Brendan Hide wrote:
> > Only when *specifically* using -m dup (which is the default on a single
> > non-SSD device), will there be two copies of the metadata stored on a
> > single device. This is not recommended when using
>
> Ah, so -m dup is the default like I thought, but not on SSD? Oops, that
> means that my laptop does not have redundant metadata on its SSD like I
> thought. Thanks for the heads-up.
>
> Ah, I see the man page now:
> "This is because SSDs can remap blocks internally so duplicate blocks
> could end up in the same erase block which negates the benefits of doing
> metadata duplication."

You can force dup but, per the man page, whether or not that is beneficial
is questionable.

> > multiple devices as it means one device failure will likely cause
> > critical loss of metadata.
>
> That's the part where I'm not clear: what's the difference between -m dup
> and -m raid1? Don't they both mean 2 copies of the metadata? Is -m dup
> only valid for a single drive, while -m raid1 is for 2+ drives?

The issue is that -m dup will always put both copies on a single device. If
you lose that device, you've lost both (all) copies of that metadata. With
-m raid1 the second copy is on a *different* device.

I believe dup *can* be used with multiple devices, but mkfs.btrfs might not
let you do it from the get-go. The way most have gotten there is by having
dup on a single device and then, after adding another device, not
converting the metadata to raid1.

> > > If so, and I have a -d raid0 -m raid0 filesystem, are both copies of
> > > the metadata on the same drive or is btrfs smart enough to spread out
> > > metadata copies so that they're not on the same drive?
> >
> > This will mean there is only a single copy, albeit striped across the
> > drives.
>
> Ok, so -m raid0 only means a single copy of metadata, thanks for
> explaining.
>
> > good for redundancy. A total failure of a single device will mean any
> > large files will be lost and only files smaller than the default
> > per-disk stripe width (I believe this used to be 4K and is now 16K - I
> > could be wrong) stored only on the remaining disk will be available.
>
> Gotcha, thanks for confirming. So -m raid1 -d raid0 really only protects
> against metadata corruption or a single block loss, but otherwise if you
> lost a drive in a 2-drive raid0, you'll have lost more than just half
> your files.
>
> Ok, so in other words:
> -d raid0: if you lose 1 drive out of 2, you may end up with some small
> files left and the rest will be lost.
> -d single: you're more likely to have files be on one drive or the other,
> although there is no guarantee there either.
> Correct?

Correct.

> Thanks,
> Marc

--
Brendan Hide
http://swiftspirit.co.za/
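Following on from Brendan's point about dup metadata surviving a device
add: converting it afterwards is a one-liner. A sketch, assuming the
filesystem is mounted at /mnt and the new device is /dev/sdc (both
placeholders), and that the btrfs-progs in use supports balance convert
filters:

  $ btrfs device add /dev/sdc /mnt
  # Rewrite metadata chunks with the raid1 profile so the second copy
  # lands on the other device
  $ btrfs balance start -mconvert=raid1 /mnt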
btrfs on top of multiple dmcrypted devices howto
I've just updated
https://btrfs.wiki.kernel.org/index.php/FAQ#Does_Btrfs_work_on_top_of_dm-crypt.3F
to point to
http://marc.merlins.org/perso/btrfs/post_2014-04-27_Btrfs-Multi-Device-Dmcrypt.html
where I give this script:
http://marc.merlins.org/linux/scripts/start-btrfs-dmcrypt
which shows one way to bring up a btrfs filesystem based off multiple
dm-crypted devices.

Hope this helps someone.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
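For readers who don't want to chase the links, the general shape of such a
bring-up is sketched below. This is a minimal sketch, not Marc's actual
script; the device names, mapper names, label and key handling are all
placeholders:

  #!/bin/sh
  # Unlock each underlying device so that btrfs can see all the members.
  for dev in sdb1 sdc1 sdd1; do
      cryptsetup luksOpen /dev/"$dev" crypt_"$dev"   # prompts for passphrase
  done
  # Let btrfs re-scan for multi-device filesystems, then mount by label.
  btrfs device scan
  mount LABEL=btrfs_pool /mnt/btrfs_pool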
Re: Copying related snapshots to another server with btrfs send/receive?
On 2014/05/04 09:28 AM, Marc MERLIN wrote:
> So this is exactly the same as what I do for incremental backups with
> btrfs send, but -p only works if the snapshot is read-only, does it not?
>
> I do use that for my incremental syncs and don't mind read-only snapshots
> there, but if I have read/write snapshots that are there for reasons
> other than btrfs send incrementals, can I still send them that way with
> -p? (I thought that wouldn't work.)
> Thanks,
> Marc

Yes: -p (parent) and -c (clone source) are the only ways I'm aware of to
push subvolumes across while ensuring the data-sharing relationship remains
intact. This will end up being much the same as doing incremental backups.

From the man page section on -c:
"You must not specify clone sources unless you guarantee that these
snapshots are exactly in the same state on both sides, the sender and the
receiver. It is allowed to omit the '-p parent' option when '-c clone-src'
options are given, in which case 'btrfs send' will determine a suitable
parent among the clone sources itself."

-p does require that the sources be read-only. I suspect -c does as well.
This means it won't be so simple, as you want your sources to be
read-write. Probably the only way then would be to make read-only snapshots
whenever you want to sync these over, while also ensuring that you keep at
least one read-only snapshot intact - again, much like incremental backups.

--
Brendan Hide
http://swiftspirit.co.za/
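A sketch of that workflow - the snapshot names are examples, and the
read-only snapshots exist purely so that send has a stable parent:

  # Take a read-only snapshot of the read-write subvolume
  $ btrfs subvolume snapshot -r backup backup.ro_new
  # Send the difference against the previous read-only snapshot
  $ btrfs send -p backup.ro_old backup.ro_new | btrfs receive $path/
  # Keep backup.ro_new as the parent for the next sync; the older
  # read-only snapshot can now be deleted on the sender
  $ btrfs subvolume delete backup.ro_old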
btrfs scrub will not start nor cancel howto
This has been asked a few times, so I ended up writing a blog entry on it:
http://marc.merlins.org/perso/btrfs/post_2014-04-26_Btrfs-Tips_-Cancel-A-Btrfs-Scrub-That-Is-Already-Stopped.html
and in the end pasted all of it in the main wiki:
https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#btrfs_scrub_will_not_start_nor_cancel

Of course, this is really a stopgap until the cancel tool can realize that
the scrub isn't really running anymore, and update the state file on its
own.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
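For context, the symptom and the commands involved look like this (the
mountpoint is an example; the wiki entry above describes fixing the stale
per-filesystem scrub status file, which btrfs-progs normally keeps under
/var/lib/btrfs/ - treat that path as an assumption and check your
distribution):

  # Status claims a scrub is running even though none is...
  $ btrfs scrub status /mnt/btrfs_pool
  # ...so a new scrub refuses to start, and cancel refuses too
  $ btrfs scrub start /mnt/btrfs_pool
  $ btrfs scrub cancel /mnt/btrfs_pool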
Re: copies= option
On Sun, May 04, 2014 at 11:12:38AM -0700, Duncan wrote:
> On Sun, 04 May 2014 09:27:10 +0200 Brendan Hide bren...@swiftspirit.co.za wrote:
> > [snip Russell's original question and Duncan's first reply]
> >
> > The simple answer is yes, this is planned. As Duncan implied, however,
> > it is not on the immediate roadmap. Internally we appear to be
> > referring to this feature as "N-way redundancy" or "N-way mirroring".
> >
> > My understanding is that the biggest hurdle before the primary devs
> > will look into N-way redundancy is to finish the Raid5/6 implementation
> > to include self-healing/scrubbing support - a critical issue before it
> > can be adopted further.
>
> Well, there's N-way-mirroring, which /is/ on the roadmap for fairly soon
> (after raid56 completion), and which is the feature I've been heavily
> anticipating ever since I first looked into btrfs and realized that raid1
> didn't include it already, but what I was referring to above was
> something much nicer than that.
>
> As I said I don't understand the full details, Hugo's the one that can
> properly answer there, but the general idea (I think) is the ability to
> three-way specify N-copies, M-parity, S-stripe, possibly with
> near/far-layout specification like md/raid's raid10, as well. But Hugo
> refers to it with three different letters, cps, copies/parity/stripes,
> perhaps? That doesn't look quite correct...

My proposal was simply a description mechanism, not an implementation. The
description is N-copies, M-device-stripe, P-parity-devices (NcMsPp), and
(more or less comfortably) covers at minimum all of the current and
currently-proposed replication levels. There's a couple of tweaks covering
description of allocation rules (DUP vs RAID-1).

I think, as you say below, that it's going to be hard to make this
completely general in terms of application, but we've already seen code
that extends the available replication capabilities beyond the current
terminology (to RAID-6.3, ... -6.6), which we can cope with in the proposed
nomenclature - NsP3 to NsP6. There are other things in the pipeline, such
as the N-way mirroring, which also aren't describable in traditional RAID
terms, but which the csp notation will handle nicely. It doesn't deal with
complex nested configurations (e.g. the difference between RAID-10 and
RAID-0+1), but given btrfs's more freewheeling chunk allocation decisions,
those distinctions tend to go away.

So: don't expect to see completely general usability of csp notation, but
do expect it to be used in the future to describe the increasing complexity
of replication strategies in btrfs. There may even be a shift internally to
csp-style description of replication; I'd probably expect that to arrive
with per-object RAID levels, since if there's going to be a big overhaul of
that area, it would make sense to do that change at the same time.

[It's worth noting that when I mooted extending the current RAID-level
bit-field to pack in csp-style notation, Chris was mildly horrified at the
concept. The next best implementation would be to use the xattrs for
per-object RAID for this.]

Hugo.
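To make the NcMsPp notation concrete, here are a few illustrative readings.
These mappings are my inference from Hugo's description above, not a
committed specification:

  2c    - two copies: what btrfs currently calls RAID-1 (DUP being the
          allocation-rule tweak that lets both copies share a device)
  1c2s  - one copy striped across two devices: 2-disk RAID-0
  2c2s  - two copies, each striped across two devices: RAID-10-like
  3s1p  - a three-device stripe plus one parity device: RAID-5-like
  3s2p  - a three-device stripe plus two parity devices: RAID-6-like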
Re: copies= option
On Sun, 04 May 2014 09:27:10 +0200 Brendan Hide bren...@swiftspirit.co.za wrote:
> On 2014/05/04 05:27 AM, Duncan wrote:
> > Russell Coker posted on Sun, 04 May 2014 12:16:54 +1000 as excerpted:
> > > Are there any plans for a feature like the ZFS copies= option? I'd
> > > like to be able to set copies= separately for data and metadata. In
> > > most cases RAID-1 provides adequate data protection, but I'd like to
> > > have RAID-1 and copies=2 for metadata so that if one disk dies and
> > > another has some bad sectors during recovery, I'm unlikely to lose
> > > metadata.
> >
> > Hugo's the guy with the better info on this one, but until he
> > answers... [snip]
>
> The simple answer is yes, this is planned. As Duncan implied, however,
> it is not on the immediate roadmap. Internally we appear to be referring
> to this feature as "N-way redundancy" or "N-way mirroring".
>
> My understanding is that the biggest hurdle before the primary devs will
> look into N-way redundancy is to finish the Raid5/6 implementation to
> include self-healing/scrubbing support - a critical issue before it can
> be adopted further.

Well, there's N-way-mirroring, which /is/ on the roadmap for fairly soon
(after raid56 completion), and which is the feature I've been heavily
anticipating ever since I first looked into btrfs and realized that raid1
didn't include it already. But what I was referring to above was something
much nicer than that.

As I said, I don't understand the full details; Hugo's the one that can
properly answer there. But the general idea (I think) is the ability to
three-way specify N-copies, M-parity, S-stripe, possibly with
near/far-layout specification like md/raid's raid10, as well. But Hugo
refers to it with three different letters, cps, copies/parity/stripes,
perhaps? That doesn't look quite correct...

But that at least has the potential to be /so/ nice, and possibly also /so/
complicated, that I'm deliberately avoiding looking too much at the
details, as it's far enough out - and may in fact never get fully
implemented - that I don't want to spoil my enjoyment of (relatively,
compared to that) simple N-way-mirroring when it comes.

And more particularly, I really /really/ hope they don't put off a
reasonably simple and (hopefully) fast implementation of N-way-mirroring as
soon as possible after raid56 completion, because I really /really/ want
N-way-mirroring. This other thing would certainly be extremely nice, but
I'm quite fearful that it could also be a case of the perfect being the
enemy of the good-enough, and btrfs already has a long history of features
repeatedly taking far longer to implement than originally predicted, which,
with something that potentially complex, I'm very afraid could mean a 2-5
year wait before it's actually usable.

And given how long I've been waiting for the simple-compared-to-that
N-way-mirroring thing and how much I anticipate it, I just don't know what
I'd do if I were to find out that they were going to work on this perfect
thing instead, with N-way-mirroring being one possible option of it, but
that as a result, given the btrfs history to date, it'd very likely be a
good five years before I could get the comparatively simple N-way-mirroring
(or even, for me, just a specific 3-way-mirroring to complement the
specific 2-way-mirroring that's already there) that's all I'm really asking
for.

So I guess you can see why I don't want to get into the details of the more
fancy solution too much, both as a means of protecting my own sanity, and
to hopefully avoid throwing the 3-way-mirroring that's my own personal
focal point off the track. So Hugo's the one with the details, to the
extent they've been discussed, at least.

--
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master -- and if you use the program,
he is your master."  Richard Stallman
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sat, 03 May 2014 16:27:02 -0700 as excerpted:
> So, I was thinking. In the past, I've done this:
> mkfs.btrfs -d raid0 -m raid1 -L btrfs_raid0 /dev/mapper/raid0d*
> My rationale at the time was that if I lose a drive, I'll still have full
> metadata for the entire filesystem and only missing files.
> If I have raid1 with 2 drives, I should end up with 4 copies of each
> file's metadata, right?

Brendan has answered well, but sometimes a second way of putting things
helps, especially when there was originally some misconception to clear up,
as seems to be the case here. So let me try to be that rewording. =:^)

No. Btrfs raid1 (the multi-device metadata default) is (still) only two
copies, as is btrfs dup (which is the single-device metadata default,
except for SSDs). The distinction is that dup is designed for the
single-device case and puts both copies on that single device, while raid1
is designed for the multi-device case and ensures that the two copies
always go to different devices, so loss of a single device won't kill the
metadata.

Additional details:

I am not aware of any current possibility of having more than two copies,
no matter the mode, with a possible exception during mode conversion (say
between raid1 and raid6), altho even then, there should be only two
/active/ copies.

Dup mode being designed for single-device usage only, it's normally not
available on multi-device filesystems. As Brendan mentions, the way people
sometimes get it is starting with a single-device filesystem in dup mode
and adding devices. If they then fail to balance-convert, old metadata
chunks will be dup mode on the original device, while new ones should be
created as raid1 by default. Of course a partial balance-convert will be
just that, partial, with whatever failed to convert still dup mode on the
original single device.

As a result, originally (and I believe still) it was impossible to
configure dup mode on a multi-device filesystem at all. However, someone
did post a request that dup mode on multi-device be added as a (normally
still heavily discouraged) option, to allow a conversion back to
single-device without at any point dropping to non-redundant
single-copy-only.

Using the two-device raid1 to single-device dup conversion as an example:
currently you can't btrfs device delete below two devices, as that's no
longer raid1. Of course if both data and metadata are raid1, it's possible
to physically disconnect one device, leaving the other as the only online
copy but keeping the disconnected one in reserve, but that's not possible
when the data is single mode - and even if it were, that physical
disconnection will trigger read-only mode on the filesystem as it's no
longer raid1, thereby making the balance-conversion back to dup impossible.
And you can't balance-convert to dup on a multi-device filesystem, so
balance-converting to single, thereby losing the protection of the second
copy, then doing the btrfs device delete, becomes the only option. Thus the
request to allow balance-convert to dup mode on a multi-device filesystem,
for the sole purpose of then allowing btrfs device delete of the second
device, converting it back to a single-device filesystem without ever
losing second-copy redundancy protection.

Finally, for the single-device-filesystem case, dup mode is normally only
allowed for metadata (where it is again the default, except on ssd), *NOT*
for data.

However, someone noticed and posted that one of the side-effects of
mixed-block-group mode - used by default on filesystems under 1 GiB, but
normally discouraged on filesystems above 32-64 GiB for performance
reasons, because in mixed-bg mode data and metadata share the same chunks -
is that mixed-bg mode actually allows (and defaults to, except on SSD) dup
for data as well as metadata.

There was some discussion in that thread as to whether that was a
deliberate feature or simply an accidental result of the sharing. Chris
Mason confirmed it was the latter. The intention has been that dup mode is
a special case for rather critical metadata on a single device in order to
provide better protection for it, and the fact that mixed-bg mode allows
(indeed, even defaults to) dup mode for data was entirely an accident of
mixed-bg mode implementation - albeit one that's pretty much impossible to
remove.

But given that accident, and the fact that some users do appreciate the
ability to do dup-mode data via mixed-bg mode on larger single-device
filesystems even if it reduces performance and effectively halves storage
space, I expect/predict that at some point dup mode for data will be added
as an option as well, thereby eliminating the performance impact of
mixed-bg mode while offering single-device duplicate-data redundancy on
large filesystems, for those that value the protection such duplication
provides, particularly given btrfs' data checksumming and integrity
features.
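For anyone wanting to experiment with the accidental dup-data behaviour
Duncan describes, mixed block groups are selected at mkfs time. A sketch
(the device is a placeholder, and whether mkfs accepts this exact
combination can depend on the btrfs-progs version):

  # --mixed shares chunks between data and metadata, so the two profiles
  # must match; dup on both duplicates data as a side effect
  $ mkfs.btrfs --mixed -d dup -m dup /dev/sdb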
Re: How does Suse do live filesystem revert with btrfs?
Actually, never mind Suse: does someone know whether you can revert to an
older snapshot in place?

The only way I can think of is to mount the snapshot on top of the other
filesystem. This gets around the "umounting a filesystem with open
filehandles" problem, but it also means that you have to keep track of
daemons that are still accessing filehandles on the overlaid filesystem.

My one concern with this approach is that you can't free up the
subvolume/snapshot of the underlying filesystem if it's mounted, and even
after you free up the filehandles pointing to it, I don't think you can
umount it.

In other words, you can play this trick to delay a reboot a bit, but
ultimately you'll have to reboot to free up the mountpoints and old
subvolumes, and be able to delete them.

Somehow I'm thinking Suse came up with a better method.
Even if you don't know Suse, can you think of a better way to do this?

Thanks,
Marc

On Sat, May 03, 2014 at 05:52:57PM -0700, Marc MERLIN wrote:
> (more questions I'm asking myself while writing my talk slides)
> I know Suse uses btrfs to roll back filesystem changes.
> So I understand how you can take a snapshot before making a change, but
> not how you revert to that snapshot without rebooting or using rsync.
> How do you do a pivot-root-like mountpoint swap to an older snapshot,
> especially if you have filehandles opened on the current snapshot?
> Is that what Suse manages, or are they doing something simpler?
> Thanks,
> Marc

--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: How does Suse do live filesystem revert with btrfs?
On Sun, May 04, 2014 at 04:26:45PM -0700, Marc MERLIN wrote:
> Actually, never mind Suse: does someone know whether you can revert to an
> older snapshot in place?

Not while the system's running useful services, no.

> The only way I can think of is to mount the snapshot on top of the other
> filesystem. This gets around the "umounting a filesystem with open
> filehandles" problem, but it also means that you have to keep track of
> daemons that are still accessing filehandles on the overlaid filesystem.

You have a good handle on the problems.

> My one concern with this approach is that you can't free up the
> subvolume/snapshot of the underlying filesystem if it's mounted, and even
> after you free up the filehandles pointing to it, I don't think you can
> umount it. In other words, you can play this trick to delay a reboot a
> bit, but ultimately you'll have to reboot to free up the mountpoints and
> old subvolumes, and be able to delete them.

Yup.

> Somehow I'm thinking Suse came up with a better method.

I'm guessing it involves reflink copies of files from the snapshot back to
the original, and then restarting affected services. That's about the only
other thing that I can think of, but it's got loads of race conditions in
it (albeit difficult to hit in most cases, I suspect).

Hugo.

--
Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
"That's not rain, that's a lake with slots in it."
Re: Is metadata redundant over more than one drive with raid0 too?
On 05/04/2014 12:24 AM, Marc MERLIN wrote:
> Gotcha, thanks for confirming. So -m raid1 -d raid0 really only protects
> against metadata corruption or a single block loss, but otherwise if you
> lost a drive in a 2-drive raid0, you'll have lost more than just half
> your files.
>
> > The scenario you mentioned at the beginning - "if I lose a drive, I'll
> > still have full metadata for the entire filesystem and only missing
> > files" - is more applicable to using -m raid1 -d single. Single is not
> > geared towards performance and, though it doesn't guarantee a file is
> > only on a single disk, the allocation does mean that the majority of
> > files smaller than a chunk will be stored on only one disk or the
> > other - not both.
>
> Ok, so in other words:
> -d raid0: if you lose 1 drive out of 2, you may end up with some small
> files left and the rest will be lost.
> -d single: you're more likely to have files be on one drive or the other,
> although there is no guarantee there either.
> Correct?
> Thanks,
> Marc

This often seems to confuse people, and I think there is a common
misconception that the btrfs raid/single/dup features work at the file
level, when in reality they work at a level closer to lvm/md.

If someone told you that they lost a device out of a jbod or multi-disk lvm
group (somewhat analogous to -d single) with ext on top, you would expect
them to lose data in any file that had a fragment in the lost region (let's
ignore metadata for a moment). This is potentially up to 100% of the files,
but that should not be a surprising result. Similarly, someone who has lost
a disk out of a md/lvm raid0 volume should not be surprised to have a hard
time recovering any data at all from it.
Re: Copying related snapshots to another server with btrfs send/receive?
On Sun, May 04, 2014 at 09:54:38AM +0200, Brendan Hide wrote:
> Yes: -p (parent) and -c (clone source) are the only ways I'm aware of to
> push subvolumes across while ensuring the data-sharing relationship
> remains intact. This will end up being much the same as doing incremental
> backups.
>
> From the man page section on -c:
> "You must not specify clone sources unless you guarantee that these
> snapshots are exactly in the same state on both sides, the sender and the
> receiver. It is allowed to omit the '-p parent' option when '-c
> clone-src' options are given, in which case 'btrfs send' will determine a
> suitable parent among the clone sources itself."

Right. I had read that, but it was not super clear to me how it can be
useful, especially if it's supposed to find the source clone by itself.
From what you said and what I read, I think the source might be allowed to
be read-write; otherwise it would be simpler for btrfs send to know that
the source has not changed.

I think I'll have to do more testing with this when I get some time.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
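For reference, a sketch of how -c is typically invoked - the listed clone
sources must already exist identically on both sides, and the snapshot
names and $path are examples carried over from earlier in the thread:

  $ btrfs send -c backup.sav1 -c backup.sav2 backup.sav3 | btrfs receive $path/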
How does btrfs fi show show full?
More slides, more questions, sorry :)
(thanks for the other answers, I'm still going through them)

If I have:

gandalfthegreat:~# btrfs fi show
Label: 'btrfs_pool1'  uuid: 873d526c-e911-4234-af1b-239889cd143d
        Total devices 1 FS bytes used 214.44GB
        devid 1 size 231.02GB used 231.02GB path /dev/dm-0

I'm a bit confused. It tells me:
1) FS uses 214GB out of 231GB
2) Device uses 231GB out of 231GB

I understand how the device can use less than the FS if you have multiple
devices that share a filesystem. But I'm not sure how a filesystem can use
less than what's being used on a single device.

Similarly, my current laptop shows:

legolas:~# btrfs fi show
Label: btrfs_pool1  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
        Total devices 1 FS bytes used 442.17GiB
        devid 1 size 865.01GiB used 751.04GiB path /dev/mapper/cryptroot

So, am I 100GB from being full, or am I really only using 442GB out of
865GB? If so, what does the device "used" value really mean if it can be
that much higher than the filesystem "used" value?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Is metadata redundant over more than one drive with raid0 too?
On Sun, May 04, 2014 at 09:44:41AM +0200, Brendan Hide wrote:
> > Ah, I see the man page now:
> > "This is because SSDs can remap blocks internally so duplicate blocks
> > could end up in the same erase block which negates the benefits of
> > doing metadata duplication."
>
> You can force dup but, per the man page, whether or not that is
> beneficial is questionable.

So the reason I was confused originally was this:

legolas:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=734.01GiB, used=435.39GiB
System, DUP: total=8.00MiB, used=96.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=8.50GiB, used=6.74GiB
Metadata, single: total=8.00MiB, used=0.00

This is on my laptop with an SSD. Clearly btrfs is using duplicate metadata
on an SSD, and I did not ask it to do so. Note that I'm still generally
happy with the idea of duplicate metadata on an SSD even if it's not
bulletproof.

> > What's the difference between -m dup and -m raid1? Don't they both mean
> > 2 copies of the metadata? Is -m dup only valid for a single drive,
> > while -m raid1 is for 2+ drives?
>
> The issue is that -m dup will always put both copies on a single device.
> If you lose that device, you've lost both (all) copies of that metadata.
> With -m raid1 the second copy is on a *different* device.

Aaah, that explains it now, thanks. So -m dup is indeed kind of stupid if
you have more than one drive.

> I believe dup *can* be used with multiple devices, but mkfs.btrfs might
> not let you do it from the get-go. The way most have gotten there is by
> having dup on a single device and then, after adding another device, not
> converting the metadata to raid1.

Right, that also makes sense.

> > -d raid0: if you lose 1 drive out of 2, you may end up with some small
> > files left and the rest will be lost.
> > -d single: you're more likely to have files be on one drive or the
> > other, although there is no guarantee there either.
> > Correct?
>
> Correct

Thanks :)

On Sun, May 04, 2014 at 09:49:24PM +0000, Duncan wrote:
> Brendan has answered well, but sometimes a second way of putting things
> helps, especially when there was originally some misconception to clear
> up, as seems to be the case here. So let me try to be that rewording.
> =:^)

Sure, that can always help.

> No. Btrfs raid1 (the multi-device metadata default) is (still) only two
> copies, as is btrfs dup (which is the single-device metadata default,
> except for SSDs). The distinction is that dup is designed for the
> single-device case and puts both copies on that single device, while
> raid1 is designed for the multi-device case and ensures that the two
> copies always go to different devices, so loss of a single device won't
> kill the metadata.

Yep, I got that now.

> Dup mode being designed for single-device usage only, it's normally not
> available on multi-device filesystems. As Brendan mentions, the way
> people sometimes get it is starting with a single-device filesystem in
> dup mode and adding devices. If they then fail to balance-convert, old
> metadata chunks will be dup mode on the original device, while new ones
> should be created as raid1 by default. Of course a partial
> balance-convert will be just that, partial, with whatever failed to
> convert still dup mode on the original single device.

Yes, that makes sense too.

> Finally, for the single-device-filesystem case, dup mode is normally
> only allowed for metadata (where it is again the default, except on
> ssd), *NOT* for data. However, someone noticed and posted that one of
> the side-effects of mixed-block-group mode [snip] is that mixed-bg mode
> actually allows (and defaults to, except on SSD) dup for data as well as
> metadata.

Yes, I read that. That's an interesting side effect which could be used in
some cases.

> There was some discussion in that thread as to whether that was a
> deliberate feature or simply an accidental result of the sharing. Chris
> Mason confirmed it was the latter. [snip] But given that accident, and
> the fact that some users do appreciate the ability to do dup-mode data
> via mixed-bg mode on larger single-device filesystems even if it reduces
> performance and effectively halves storage space, I expect/predict that
> at some point dup mode for data will be added as an option as well,
> thereby eliminating the performance impact of mixed-bg mode while
> offering single-device duplicate-data redundancy on large filesystems,
> for those that value the protection such duplication provides.
Re: How does Suse do live filesystem revert with btrfs?
On May 4, 2014, at 5:26 PM, Marc MERLIN m...@merlins.org wrote:
> Actually, never mind Suse: does someone know whether you can revert to an
> older snapshot in place?

They are using snapper. Updates are not atomic; that is, they are applied
to the currently mounted fs, not the snapshot, and after an update the
system is rebooted using the same (now updated) subvolumes. The rollback, I
think, creates another snapshot, and an earlier snapshot is moved into
place, because they are using the top level (subvolume id 5) for rootfs.

> The only way I can think of is to mount the snapshot on top of the other
> filesystem. This gets around the "umounting a filesystem with open
> filehandles" problem, but it also means that you have to keep track of
> daemons that are still accessing filehandles on the overlaid filesystem.

Production baremetal systems need well-tested and safe update strategies
that avoid update-related problems, so that rollbacks aren't even
necessary. Or such systems can tolerate rebooting. If the use case
considers rebooting a big problem, then either a heavyweight virtual
machine should be used, or something lighter weight like LXC containers.
systemd-nspawn containers I think are still not considered for production
use, but for testing and proof of concept you could see if it can boot
arbitrary subvolumes - I think it can. And they boot really fast, like
maybe a few seconds fast.

For user-space applications needing rollbacks, that's where application
containers come in handy - you could have two application icons available
(current and previous), and if on Btrfs the previous version could be a
reflink copy.

Maybe there's some way to quit everything but the kernel and PID 1,
switching back to an initrd, and then at switch-root time use a new root
with all new daemons and libraries. It'd be faster than a warm reboot. It
probably takes a special initrd to do this. The other thing you can
consider is kexec, but going forward, realize this isn't compatible with a
UEFI Secure Boot world.

> My one concern with this approach is that you can't free up the
> subvolume/snapshot of the underlying filesystem if it's mounted, and even
> after you free up the filehandles pointing to it, I don't think you can
> umount it. In other words, you can play this trick to delay a reboot a
> bit, but ultimately you'll have to reboot to free up the mountpoints, old
> subvolumes, and be able to delete them.

Well, I think the bigger issue with system updates is the fact that they're
not atomic right now. The running system has a bunch of libraries yanked
out from under it during the update process; things are either partially
updated or wholly replaced, and it's just a matter of time before something
up in user space really doesn't like that. This was a major motivation for
offline updates in GNOME, where certain updates require reboot/poweroff.

To take advantage of Btrfs (and LVM thinp snapshots, for that matter), what
we ought to do is take a snapshot of rootfs and update the snapshot in a
chroot or a container. Then the user can reboot whenever it's convenient
for them, and instead of a much, much longer reboot as the updates are
applied, they get a normal boot. Plus there could be some metric to test
whether the update process was even successful, or likely to result in an
unbootable system; at that point the snapshot could just be obliterated and
the reasons logged.

Of course this "update the snapshot" idea poses some problems with the FHS,
because there are things in /var that the current system needs to continue
to write to, and yet so does the new system, and they shouldn't necessarily
be separate, e.g. logs. /usr is a given, /boot is a given, and then /home
should be dealt with differently, because we probably shouldn't ever have
rollbacks of /home, but rather retrieval of deleted files from a snapshot
into the current /home using reflink.

So we either need some FHS re-evaluation with atomic system updates and
system rollbacks in mind, or we end up needing a lot of subvolumes to carve
out the necessary snapshotting/rollback granularity. And that makes for a
less well-understood system: how it functions, how to troubleshoot it, etc.
So I'm more in favor of changes to the FHS.

Already look at how Fedora does this: the file system at the top level of a
Btrfs volume is not FHS. It's its own thing, and only via fstab do the
subvolumes at the top level get mounted in accordance with the FHS. So that
means you get to look at fstab to figure out how a system is put together
when troubleshooting it, if you're not already familiar with the layout.
Will every distribution end up doing their own thing? Almost certainly yes.
SUSE does it differently still, as a consequence of installing the whole OS
to the top level, making every snapshot navigable from the always-mounted
top level. *shrug*

Chris Murphy
Re: How does btrfs fi show show full?
On 2014/05/05 02:54 AM, Marc MERLIN wrote:
> If I have:
> gandalfthegreat:~# btrfs fi show
> Label: 'btrfs_pool1'  uuid: 873d526c-e911-4234-af1b-239889cd143d
>         Total devices 1 FS bytes used 214.44GB
>         devid 1 size 231.02GB used 231.02GB path /dev/dm-0
>
> I'm a bit confused. It tells me:
> 1) FS uses 214GB out of 231GB
> 2) Device uses 231GB out of 231GB
>
> So, am I 100GB from being full, or am I really only using 442GB out of
> 865GB? If so, what does the device "used" value really mean if it can be
> that much higher than the filesystem "used" value?

The per-device "used" amount refers to the amount of space that has been
allocated to chunks. The "FS bytes used" figure refers to the total actual
data that is stored.

That first filesystem probably needs a balance. Btrfs doesn't behave very
well when unallocated disk space is that low, because it cannot allocate
any new chunks: an attempt to allocate a new chunk will result in ENOSPC
errors.

--
Brendan Hide
http://swiftspirit.co.za/
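A sketch of the kind of balance that helps in the first case - compacting
partially-used chunks so that the allocator gets unallocated space back
(the mountpoint and the usage threshold are example values):

  # Rewrite only data chunks that are at most 55% full, returning the
  # reclaimed space to the unallocated pool
  $ btrfs balance start -dusage=55 /mnt/btrfs_pool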
Re: Using mount -o bind vs mount -o subvol=vol
On 2014/05/05 02:56 AM, Marc MERLIN wrote:
> On Sun, May 04, 2014 at 09:07:55AM +0200, Brendan Hide wrote:
> > There are two issues with this.
> > 1) There will be a *very* small performance penalty (negligible,
> > really)
>
> Oh, really, it's slower to mount the device directly? Not that I really
> care, but that's unexpected.

Um... the penalty is if you're mounting indirectly. ;)

> > 2) Old snapshots and other supposedly-hidden subvolumes will be
> > accessible under /mnt/btrfs_pool. This is a minor security concern
> > (which of course may not concern you, depending on your use-case).
> > There are a few similar minor security concerns - the
> > recently-highlighted issue with old snapshots is the potential that
> > old vulnerable binaries within a snapshot are still accessible and/or
> > executable.
>
> That's a fair point. I can of course make that mountpoint 0700, but it's
> a valid concern in some cases (not for me though).
>
> So thanks for confirming my understanding. It sounds like both are
> valid, and if you're already mounting the main pool like I am, that's
> the easiest way.
> Thanks,
> Marc

All good. :)

--
Brendan Hide
http://swiftspirit.co.za/
Re: Using mount -o bind vs mount -o subvol=vol
On Mon, 05 May 2014 06:13:30 +0200 Brendan Hide bren...@swiftspirit.co.za wrote:
> > > 1) There will be a *very* small performance penalty (negligible,
> > > really)
> >
> > Oh, really, it's slower to mount the device directly? Not that I
> > really care, but that's unexpected.
>
> Um... the penalty is if you're mounting indirectly. ;)

I feel that's on about the same scale as giving your files shorter
filenames so that they open faster. Or have you looked at the actual kernel
code with regard to how it's handled, or maybe even have any benchmarks,
other than a general thought of "it's indirect, so it probably must be
slower"?

--
With respect,
Roman
Re: Using mount -o bind vs mount -o subvol=vol
On Mon, May 05, 2014 at 06:13:30AM +0200, Brendan Hide wrote:
> > Oh, really, it's slower to mount the device directly? Not that I
> > really care, but that's unexpected.
>
> Um... the penalty is if you're mounting indirectly. ;)

I'd be willing to believe that one more readily :)
(But indeed, if there is a slowdown, it must be pretty irrelevant in the
big picture.)

Cheers,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901