Re: BTRFS messes up snapshot LV with origin
On Fri, Nov 28, 2014 at 11:55:07PM -0800, Robert White wrote: On 11/28/2014 08:59 PM, Zygo Blaxell wrote: On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote: On 11/27/2014 05:15 AM, Zygo Blaxell wrote: This is a weakness of the current udev and asynchronous device hotplug concept: there is no notion of "bus enumeration in progress", so we can be trying to assemble multi-device storage before we have all the devices visible. Assembly of aggregate storage (whatever it is--btrfs, md, lvm2...) has to wait until all known storage buses are fully enumerated in order to detect if there are duplicates. It is more complex than that. Some devices may appear after the 1st bus enumeration. That case is well handled already--a new enumeration will start with the second (and all later) hotplug events. The problem arises when we try to assemble disk arrays before the known end of the 1st (or any) enumeration. There is no way for an enumerating agent to tell other agents "this is definitely not the complete list of devices yet, other devices may be inserted imminently" and defer all the multi-device assembly until the address space of the enumerating bus is fully covered. MDADM has an "attached but not started" state for arrays that handles this condition during incremental assembly. (see mdadm --incremental /dev/whatever), [...very complicated mdadm-architecture-invades-the-filesystem-layer thing snipped...] I don't see why it can't all be done in user-space more or less the same way LVM does. Scan all the partitions known to be available, build a table of devices with UUIDs matching the target filesystem, check for sufficiency, check for uniqueness, and if the configuration passes all the sanity checks (or we have hints from the user that resolve ambiguity), submit the entire list of devices to the kernel as a BTRFS filesystem. If there are UUID duplicates or missing devices, submit nothing to the kernel at all.
initramfs-less multi-disk configurations can calculate all that in advance and generate a rootflags parameter for the kernel command line. It's not necessary to resolve every possible situation in the kernel.
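[Editor's sketch] The user-space procedure described in this message (scan, build a table by UUID, check sufficiency, check uniqueness, then submit everything or nothing) might look roughly like the following. This is only an illustration under stated assumptions: the `Member` record, its field names, and `plan_assembly` are hypothetical and are not taken from btrfs-progs or any patch in this thread; a real helper would fill the records from the on-disk superblocks.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One scanned block device carrying a btrfs superblock (hypothetical record)."""
    path: str         # e.g. "/dev/mapper/test-lv01"
    fs_uuid: str      # filesystem UUID shared by all member devices
    dev_uuid: str     # per-device UUID, expected to be unique within the filesystem
    num_devices: int  # device count recorded in the superblock


def plan_assembly(scanned, target_fs_uuid, hints=()):
    """Return the device paths to submit to the kernel, or None to submit nothing."""
    # Build the table of candidate devices for the target filesystem.
    by_dev_uuid = defaultdict(list)
    for member in scanned:
        if member.fs_uuid == target_fs_uuid:
            by_dev_uuid[member.dev_uuid].append(member)
    if not by_dev_uuid:
        return None

    chosen = []
    for members in by_dev_uuid.values():
        if len(members) > 1:
            # Uniqueness check failed: only an explicit user hint may resolve it.
            hinted = [m for m in members if m.path in hints]
            if len(hinted) != 1:
                return None
            members = hinted
        chosen.append(members[0])

    # Sufficiency check: every device the superblock expects must be present.
    if len(chosen) != chosen[0].num_devices:
        return None
    return sorted(m.path for m in chosen)
```

The point of the design is that ambiguity never produces a guess: a duplicate device UUID without exactly one matching hint, or a missing device, yields `None`, and nothing is handed to the kernel.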
Re: BTRFS messes up snapshot LV with origin
On 11/28/2014 11:35 PM, Goffredo Baroncelli wrote: I agree with you; but I have to find a default so during the boot a system can start even if snapshots are present. No, you really _don't_ need to find such a default. Better a system that doesn't boot than one that boots based on a guess. I've been spending a lot of time thinking about booting while writing underdog (http://underdog.sourceforge.net) and while booting is fragile, an even partially incorrect boot is a system and _security_ nightmare. If you start making preferential guesses then an intruder could trick the system into booting from a thumb-drive or other alternate media by coercing a UUID collision in a way that the system picks the new media. Conflicts should _never_ be guessed at during boot. Ever. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
On 11/28/2014 11:29 PM, Duncan wrote: Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do with UUIDs. They are type codes. They aren't short form of anything else at all. In fact 0700 is the _long_ _form_ of the original code of 7, but in big-endian order now that it went from one byte to two. Microsoft started using pre-assigned UUIDs as classes, e.g. type codes they could cram into their various registry files. If you actually read the registry you'll find a lot of places where a rational word is defined as {some_uuid_here} and then elsewhere {some_uuid_here} has a bunch of data items attached to it. So gparted didn't reuse microsoft UUIDs. In some/many of the older formats there was a code for operating system data (which I think is what 7 was originally). Others came by and said since we're going to put in a type code for linux swap (82) then let's put in a code for linux data as well (83), and all this before the whole byte expansion to turn these things from bytes into two-byte words. Once everybody else picked their own type codes for their data partitions, everybody just started calling 7 microsoft data. And linux doesn't care at all since it's noise since every partition just ends up as /dev/[sh]d? anyway. All this stuff has historical reasons. GNU/Linux attempts to be an egalitarian actor so it adapts to whatever you do.
Re: BTRFS messes up snapshot LV with origin
Robert White posted on Sat, 29 Nov 2014 00:20:11 -0800 as excerpted: On 11/28/2014 11:29 PM, Duncan wrote: (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do with UUIDs. They are type codes. They aren't short form of anything else at all. In fact 0700 is the _long_ _form_ of the original code of 7, but in big-endian order now that it went from one byte to two. You obviously know where the short forms originated (MBR type codes), but you haven't the foggiest what you're talking about in relation to gdisk, where they're used as 4-hex-char entry shortcuts for the similar GPT/EFI GUIDs. Now that's what I expected with the mention of a different partition editor, thus my mention that they were shortcuts for GUIDs, apparently gdisk specific, but in gdisk they certainly ARE shortcuts to the various GUIDs and you certainly do *NOT* know what you're talking about saying they are not even related. From the gdisk (8) manpage entry for the l/list action: l Display a summary of partition types. GPT uses a GUID to identify partition types for particular OSes and purposes. For ease of data entry, gdisk compresses these into two-byte (four-digit hexadecimal) values that are related to their equivalent MBR codes. Specifically, the MBR code is multiplied by hexadecimal 0x0100. For instance, the code for Linux swap space in MBR is 0x82, and it's 0x8200 in gdisk. A one-to-one correspondence is impossible, though. Most notably, the codes for all varieties of FAT and NTFS partition correspond to a single GPT code (entered as 0x0700 in sgdisk). Some OSes use a single MBR code but employ many more codes in GPT. 
For these, gdisk adds code numbers sequentially, such as 0xa500 for a FreeBSD disklabel, 0xa501 for FreeBSD boot, 0xa502 for FreeBSD swap, and so on. Note that these two-byte codes are unique to gdisk. See also the gdisk home page: http://www.rodsbooks.com/gdisk/ In particular, see the gdisk walkthru here: http://www.rodsbooks.com/gdisk/walkthrough.html ... and the gdisk manpage I quoted above here: http://www.rodsbooks.com/gdisk/gdisk.html So as I said, gdisk uses a 4-hexit short code based on the legacy MBR type-code as an easy entry and display form referencing the longer and much less human readable GUIDs, just like I said, and such usage is gdisk specific, just like I said I thought it was. And you might have known the legacy MBR type-codes from which they were derived, but obviously you had no idea what I was talking about here, and despite my saying it was gdisk specific you decided to simply claim I didn't know what I was talking about without actually checking the situation, despite my telling you exactly what app I was referring to and that I thought those references were app-specific, giving you plenty of chance to actually look it up yourself if you decided to, or simply not argue that point if you weren't interested in checking out the app-specific stuff. =:^( Microsoft started using pre-assigned UUIDs as classes, e.g. type codes they could cram into their various registry files. If you actually read the registry you'll find a lot of places where a rational word is defined as {some_uuid_here} and then elsewhere {some_uuid_here} has a bunch of data items attached to it. FWIW I know about the MS registry stuff from actually doing MS-registry and API related programming (hobbyist/VB level but using the regular API not just the VB exposed stuff) back before the turn of the century.
I've not touched it in nearing a decade and a half now and my knowledge is consequently dated 9x vintage, but it obviously had the registry and I used to be /quite/ familiar with it, including of course the UUIDs. So gparted didn't reuse microsoft UUIDs. In some/many of the older formats there was a code for operating system data (which I think is what 7 was originally). Others came by and said since we're going to put in a type code for linux swap (82) then let's put in a code for linux data as well (83), and all this before the whole byte expansion to turn these things from bytes into two-byte words. Once everybody else picked their own type codes for their data partitions, everybody just started calling 7 microsoft data. And linux doesn't care at all since it's noise since every partition just ends up as /dev/[sh]d? anyway. All this stuff has historical reasons. GNU/Linux attempts to be an egalitarian actor so it adapts to whatever you do. This part I have no disagreement with... -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
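[Editor's sketch] The manpage passage quoted in this message states that gdisk derives its four-digit codes by multiplying the MBR code by 0x0100, and adds sequential numbers where one MBR code fans out into several GPT types. That arithmetic can be written out as below. The function name and the `variant` parameter are made up for illustration; only the multiplication rule and the example values (0x82 to 0x8200, 0xa500 to 0xa502) come from the quoted text.

```python
def mbr_to_gdisk_code(mbr_code: int, variant: int = 0) -> str:
    """Derive a gdisk-style four-hex-digit short code from a one-byte MBR type code.

    `variant` models the sequential numbering the manpage describes for OSes
    that use one MBR code but several GPT types (0xa500, 0xa501, 0xa502, ...).
    """
    if not 0 <= mbr_code <= 0xFF:
        raise ValueError("MBR type codes are a single byte")
    if not 0 <= variant <= 0xFF:
        raise ValueError("variant must fit in the low byte")
    return f"{mbr_code * 0x0100 + variant:04X}"


print(mbr_to_gdisk_code(0x82))     # Linux swap: MBR 0x82
print(mbr_to_gdisk_code(0xA5, 2))  # FreeBSD swap in the manpage's example
```

Note that this only produces the short code a user types; it says nothing about which GUID ends up on disk, which is a separate table lookup.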
Re: BTRFS messes up snapshot LV with origin
On 11/29/2014 01:41 AM, Duncan wrote: Robert White posted on Sat, 29 Nov 2014 00:20:11 -0800 as excerpted: l Display a summary of partition types. GPT uses a GUID to identify partition types for particular OSes and purposes. For ease of data entry, gdisk compresses these into two-byte (four-digit hexadecimal) values that are related to their equivalent MBR codes. Specifically, the MBR code is multiplied by hexadecimal 0x0100. That EFI uses GUIDs is one thing. That the standard allows these to be selected based on type codes originally derived from ms-dos partition type codes ("compressed" is the wrong word) is something else. If they were compressed then it would be a relationship that could represent any GUID at all. It's marginally hashed, in that there is a table lookup, but it's not properly a hash, as the hash function is undefined for virtually all possible input values. The other partition GUID is actually more interesting. So as I said, gdisk uses a 4-hexit short code based on the legacy MBR type-code as an easy entry and display form referencing the longer and much less human readable GUIDs, just like I said, and such usage is gdisk specific, just like I said I thought it was. Which is not what you said. None of the above was mentioned in the email to which I responded. What you actually said :: [QUOTE] Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) [/QUOTE] None of which is gdisk specific, and all of which is based on EFI and the GUID partition table.
What I mistakenly attributed to you and was key to my initial response was your extension of Chris Murphy: Chris Murphy posted on Fri, 28 Nov 2014 00:10:40 -0700 as excerpted: A very good example of WTF reusage of a UUID that irks me to no end is GNU parted devs decided to recycle the Microsoft Windows Basic Data partition type GUID for Linux partitions. It's like watching someone get run over by a zamboni with 50 feet of advance notice... [So my bad there on the quoting...] The irking there being dumb because the universally used type GUID has nothing to do with the second GUID that universally identifies the partition regardless of type. But here is the thing... for all the screed about open and closed source... (and I am an open source guy myself) The actual EFI standard dictates these partition numbers and whatnot so if you used the microsoft tools you'd get the same results. http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs AND microsoft was one of several principal players in the EFI and its GUID partition subparts. So his being "irked to no end" and your agreement and "that's why I used gdisk" response are both completely misplaced, and potentially misleading to others. I just went a little off the rails while trying to explain. /D'oh.
Re: BTRFS messes up snapshot LV with origin
To those reading along who don't already know. My explanation below is factually inadequate or wrong in various places... The type codes as presented in the various EFI/GUID disk partitioning tools as 0700, 8200, 8300, EF02, and so on are never written to disk as such. They are short-hand values (chosen to be deliberately similar to the MS-DOS partitioning type codes of 07, 82, 83, etc) to select standardized GUIDs for the partition type field. So there is the two-digit code from the ms-dos partitioning scheme, then there are the four-digit codes that let you select which type GUID will be written in an EFI partition scheme. The question of reuse is still improper as the type codes were assigned by the EFI standard for specific use as type codes. The EFI tool used (gdisk, or windows disk partitioning tool, etc) is immaterial as the result codes are selected by standard. I could have, and should have, been _way_ more clear, and/or less wrong. 8-) http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs On 11/29/2014 12:20 AM, Robert White wrote: On 11/28/2014 11:29 PM, Duncan wrote: Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do with UUIDs. They are type codes. They aren't short form of anything else at all. In fact 0700 is the _long_ _form_ of the original code of 7, but in big-endian order now that it went from one byte to two. Microsoft started using pre-assigned UUIDs as classes, e.g. type codes they could cram into their various registry files.
If you actually read the registry you'll find a lot of places where a rational word is defined as {some_uuid_here} and then elsewhere {some_uuid_here} has a bunch of data items attached to it. So gparted didn't reuse microsoft UUIDs. In some/many of the older formats there was a code for operating system data (which I think is what 7 was originally). Others came by and said since we're going to put in a type code for linux swap (82) then let's put in a code for linux data as well (83), and all this before the whole byte expansion to turn these things from bytes into two-byte words. Once everybody else picked their own type codes for their data partitions, everybody just started calling 7 microsoft data. And linux doesn't care at all since it's noise since every partition just ends up as /dev/[sh]d? anyway. All this stuff has historical reasons. GNU/Linux attempts to be an egalitarian actor so it adapts to whatever you do.
Re: BTRFS messes up snapshot LV with origin
On Sat, Nov 29, 2014 at 1:20 AM, Robert White rwh...@pobox.com wrote: On 11/28/2014 11:29 PM, Duncan wrote: Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do with UUIDs. They are type codes. They aren't short form of anything else at all. In fact 0700 is the _long_ _form_ of the original code of 7, but in big-endian order now that it went from one byte to two. No that's not correct. These four-digit type codes are a user-facing, friendly type code; the actual on-disk partitiontype GUID is a UUID in that at the time of creation that UUID followed RFC 4122 so it was unique: no one else was using the UUID. That UUID in the context of a partitiontype GUID is intended to describe the purpose of that partition: what OS, what file system, where it should mount or be used for, etc. This is elaborately detailed in the GPT (GUID partition table) portion of the UEFI specification. A 128-bit type code is rather difficult for humans to remember and interact with, hence gdisk and recently fdisk now use a four-digit type code as a front end for the partitiontype GUID. The selection of four digits was to account for the fact there are many many many more type codes now possible, essentially unlimited. This is a case where UUIDs are reused effectively. Microsoft started using pre-assigned UUIDs as classes, e.g. type codes they could cram into their various registry files. If you actually read the registry you'll find a lot of places where a rational word is defined as {some_uuid_here} and then elsewhere {some_uuid_here} has a bunch of data items attached to it.
So gparted didn't reuse microsoft UUIDs. GNU parted absolutely re-used partitiontype GUID EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 for Linux, by default. This you know as gdisk (and friends) type code 0700. It's the same thing as using type code 07 on an MBR partitioned disk instead of 83. It's ridiculous that this happened considering we had distinction on MBR with limited type code availability, and on GPT with unlimited type codes the decision was to use an already existing type code, EBD0A0A2-B9E5-4433-87C0-68B6B72699C7. http://www.rodsbooks.com/linux-fs-code/ The Linux partitiontype GUID is now 0FC63DAF-8483-4772-8E79-3D69D8477DE4. And actually some others have been created also for encryption, RAID, LVM, swap, and a pile of GUIDs from the 'discoverable partitions spec' hosted at freedesktop.org for autodiscovery by systemd. Only very recent versions of parted support code 0FC63DAF-8483-4772-8E79-3D69D8477DE4. All this stuff has historical reasons. GNU/Linux attempts to be an egalitarian actor so it adapts to whatever you do. With respect to this particular reuse of a Windows type code, it did a total face plant on adaptation. The very decision to reuse that GUID was a huge, weird mistake that we'll live with for years to come. Data loss will result from it. And then it was made worse, upon recognition that the conflict was probably not a good idea, to undermine patching GNU parted in a timely manner. The patch to fix the problem, from the gdisk author, sat around for two years before parted upstream merged it. There really isn't good diplomatic language to use for this. Some people flat out dropped the ball, and just didn't give a crap. -- Chris Murphy
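[Editor's sketch] The relationship argued over in this subthread, a four-digit short code acting as a front end for a 128-bit partition type GUID, is just a table lookup. A small excerpt of such a table is sketched below. The dictionary and function names are invented for illustration; the GUID values are the well-known ones from the GPT partition type list (the basic data GUID is written with its full 32 hex digits), and the table is nowhere near complete.

```python
import uuid

# Short code -> (partition type GUID, description). A tiny excerpt only.
GDISK_TYPE_CODES = {
    "0700": ("EBD0A0A2-B9E5-4433-87C0-68B6B72699C7", "Microsoft basic data"),
    "8200": ("0657FD6D-A4AB-43C4-84E5-0933C84B4F4F", "Linux swap"),
    "8300": ("0FC63DAF-8483-4772-8E79-3D69D8477DE4", "Linux filesystem"),
    "EF00": ("C12A7328-F81F-11D2-BA4B-00A0C93EC93B", "EFI System"),
}


def type_guid(short_code: str) -> uuid.UUID:
    """Resolve a four-hex-digit short code to the type GUID it stands for."""
    guid_text, _description = GDISK_TYPE_CODES[short_code.upper()]
    return uuid.UUID(guid_text)


def describe(guid_text: str) -> str:
    """Reverse lookup: what would a tool assume about a partition with this type GUID?"""
    wanted = uuid.UUID(guid_text)
    for known, description in GDISK_TYPE_CODES.values():
        if uuid.UUID(known) == wanted:
            return description
    return "unknown"


# A Linux partition tagged with the basic data GUID is indistinguishable,
# by type alone, from a Windows data partition.
print(describe("EBD0A0A2-B9E5-4433-87C0-68B6B72699C7"))
```

Only the GUID is stored in the partition table entry; the short code exists purely in the tool's user interface.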
Re: BTRFS messes up snapshot LV with origin
Robert White posted on Sat, 29 Nov 2014 08:50:57 -0800 as excerpted: To those reading along who don't already know. My explanation below is factually inadequate or wrong in various places... The type codes as presented in the various EFI/GUID disk partitioning tools as 0700, 8200, 8300, EF02, and so on are never written to disk as such. They are short-hand values (chosen to be deliberately similar to the MS-DOS partitioning type codes of 07, 82, 83, etc) to select standardized GUIDs for the partition type field. I could have, and should have, been _way_ more clear, and/or less wrong. 8-) http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs Thanks. While I guess we all end up eating humble pie occasionally, you handled it with rather more grace than I often do, and by taking such a hard line myself I didn't make it as easy as I might have. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On 11/27/2014 05:15 AM, Zygo Blaxell wrote: On Wed, Nov 26, 2014 at 06:19:05PM +0100, Goffredo Baroncelli wrote: On 11/25/2014 11:21 PM, Zygo Blaxell wrote: However, I still don't understand why you want btrfs with multiple disks over LVM? I want to split a few disks into partitions, but I want to create, move, and resize the partitions from time to time. Only LVM can do that without taking the machine down, reducing RAID integrity levels, hotplugging drives, or leaving installed drives idle most of the time. I want btrfs-raid1 because of its ability to replace corrupted or lost data from one disk using the other. If I run a single-volume btrfs on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage stack), I can detect lost data, but not replace it automatically from the other mirror. OK, now I understand. Anyway, as a workaround, take into account that you can pass the devices explicitly as: mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt (supposing that the filesystem is on /dev/sda.../dev/sdd) I am working on a mount.btrfs helper. The aim of this helper is to manage the assembling of multiple devices; the main points will be: - wait until all the devices appeared ...and make sure there are no duplicate UUIDs. Yes, in the end I implemented the snapshot detection this way: if two autodetected devices have the same DISK_UUID (reported as UUID_SUB by blkid), the mount process stops. I also check the num_device field of the superblock. - allow (if required) to mount in degraded mode after a timeout This is a terrible idea with current btrfs, at least for read-write degraded mounting (fallback to read-only degraded would be OK).
Mounting a filesystem read-write and degraded is something you only want to do immediately before you replace all the missing disks and bring the filesystem up to a non-degraded state and after you've ensured that the missing disks can never, ever come back; otherwise, btrfs eats your data in a slightly different way than we have discussed so far... I don't care. If the user passes degraded in the mount options, he gets it. Anyway, I hope that this (wrong) btrfs behavior will be fixed. - at this point it could/should also skip the lvm-snapshotted devices (but first I have to know how to recognize these) You don't have to recognize them as snapshots (and it's probably better not to treat snapshots specially anyway--how do you know whether the snapshot or the origin LVs are wanted for mounting?). You just have to detect duplicate UUIDs at the btrfs subdevice level, and if any are found, stop immediately (or get a hint from the admin). For the disk autodetection, I am still convinced that it is a sane default to skip the lvm-snapshot This is a weakness of the current udev and asynchronous device hotplug concept: there is no notion of "bus enumeration in progress", so we can be trying to assemble multi-device storage before we have all the devices visible. Assembly of aggregate storage (whatever it is--btrfs, md, lvm2...) has to wait until all known storage buses are fully enumerated in order to detect if there are duplicates. It is more complex than that. Some devices may appear after the 1st bus enumeration.
I hope to issue the patches in the next week BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS messes up snapshot LV with origin
On 11/28/2014 09:05 AM, Goffredo Baroncelli wrote: For the disk autodetection, I am still convinced that it is a sane default to skip the lvm-snapshot No... please don't... Maybe offer an option to select between snapshots or no-snapshots but in much the same way there is no _functional_ difference between a subvolume and a snapshot in btrfs, there is no degenerate status to an LVM snapshot. It would be way more useful if the helper dumped a message via stderr or syslog that said something like "UUID= ambiguous, must select between /dev/AA and /dev/BB using device= to mount filesystem".
Re: BTRFS messes up snapshot LV with origin
On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote: On 11/27/2014 05:15 AM, Zygo Blaxell wrote: This is a weakness of the current udev and asynchronous device hotplug concept: there is no notion of "bus enumeration in progress", so we can be trying to assemble multi-device storage before we have all the devices visible. Assembly of aggregate storage (whatever it is--btrfs, md, lvm2...) has to wait until all known storage buses are fully enumerated in order to detect if there are duplicates. It is more complex than that. Some devices may appear after the 1st bus enumeration. That case is well handled already--a new enumeration will start with the second (and all later) hotplug events. The problem arises when we try to assemble disk arrays before the known end of the 1st (or any) enumeration. There is no way for an enumerating agent to tell other agents "this is definitely not the complete list of devices yet, other devices may be inserted imminently" and defer all the multi-device assembly until the address space of the enumerating bus is fully covered.
Re: BTRFS messes up snapshot LV with origin
Chris Murphy posted on Fri, 28 Nov 2014 00:10:40 -0700 as excerpted: On Thu, Nov 27, 2014 at 2:08 AM, Duncan 1i5t5.dun...@cox.net wrote: So, umm... kinda late now, but read that copy as if it had a footnote attached, saying Yes, I know it's not actual copy, it's two views of the same thing using COW, but my point is, from the btrfs perspective it's a copy, the universally UNIQUE ID no longer looks unique and thus no longer can be properly called a UUID at all. The copy is sort of a misnomer anyway because up until the computer age the copy was a derivative, a facsimile, like a photocopy. But a copy of a digital file is actually another original. Therein lies the problem with the LVM snapshot in this context, we don't want another original. We want a copy, as in we want something we know has been derived from something else, and therefore can be discriminated. Very good point. I had all the pieces but hadn't put them together yet, so thanks. =:^) Well RFC 4122 I don't think would say it's not a UUID, the uniqueness is only guaranteed at the time of UUID creation. And duplication isn't creation so it's not going to say these things are no longer UUIDs, they're just UUIDs that have been recycled. That RFC doesn't specify workflow, but if it did, I think it'd basically say "oh crap, why'd you go and do that?" After all a major point of UUIDs is that they are effectively unlimited in quantity, therefore a.) we don't need a central registry to avoid (unintended) collisions because they're so uncommon, b.) we're encouraged to not be attached to specific UUIDs; when in doubt just create another one. Another good point. One common and less RFC/technical way of putting it, that I had thought about a few times but hadn't actually posted yet IIRC, is the old "If it hurts when you bang your head against the wall, quit banging!" =:^) IOW, LVM could change the UUIDs in its copies, COWing that bit in order to do so.
While that wouldn't change the same UUIDs embedded in for instance btrfs internals it would provide a mechanism to keep initial scans from confusing things, and filesystems or other UUID applications that duplicated the number for their own internals would then need to provide tools that rewrote them to match the LVM-changed master location UUID. Those that failed to do so would fail to function unless/until the master location version was changed back, but the tools likely would eventually be provided, as I expect they will be here, and the difference would be that at least it'd keep mixups like this from happening. A very good example of WTF reusage of a UUID that irks me to no end is GNU parted devs decided to recycle the Microsoft Windows Basic Data partition type GUID for Linux partitions. It's like watching someone get run over by a zamboni with 50 feet of advance notice... At least I don't have to worry about that one, since I no longer agree to WE REFUSE TO TELL YOU SPECIFICALLY WHAT THIS SOFTWARE DOES AS WE DON'T SUPPLY THE SOURCES, BUT YOU ARE STILL REQUIRED TO ACCEPT ALL RESPONSIBILITY FOR IT, REGARDLESS OF WHAT IT DOES AND REGARDLESS OF WHETHER WE'VE BEEN WARNED style EULAs, which is basically all of them, which means I have no legal way to run that software, so I don't. Note that the GPL among others has similar liability disclaimer wording (and to be fair it'd be hard not to, since the sources are there and the original author can hardly be held responsible for later modifications to them), but because it actually gives you the sources too, it allows you to fairly make your own decision about the responsibility you're about to take on. Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID.
(0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On 11/29/2014 02:25 AM, Robert White wrote:
On 11/28/2014 09:05 AM, Goffredo Baroncelli wrote:

For the disk autodetection, I am still convinced that it is a sane default to skip the lvm-snapshot

No... please don't... Maybe offer an option to select between snapshots or no-snapshots, but in much the same way there is no _functional_ difference between a subvolume and a snapshot in btrfs, there is no degenerate status to an LVM snapshot.

I agree with you; but I have to find a default so that during boot a system can start even if snapshots are present. And pay attention that there would be cases where multiple snapshots are present: how to group these? Maybe by generation number? Anyway, for the moment my helper simply refuses to mount if there is a conflict of dev_uuid.

It would be way more useful if the helper dumped a message via stderr or syslog that said something like UUID= ambiguous, must select between /dev/AA and /dev/BB using device= to mount filesystem.

This is what is printed when the helper finds a duplicate uuid:

  ghigo@emulato:~$ sudo lvdisplay | grep "LV Path"
    LV Path    /dev/test/lv01
    LV Path    /dev/test/lv02
    LV Path    /dev/test/lv02_snap
    LV Path    /dev/test/lv01_snap
  ghigo@emulato:~$ sudo mount /dev/test/lv01 /mnt/btrfs1/
  ERROR: disk '/dev/mapper/test-lv01' and '/dev/mapper/test-lv01_snap' have the same disk uuid
  ERROR: disk '/dev/mapper/test-lv02_snap' and '/dev/mapper/test-lv02' have the same disk uuid

But anyway I can force the disk to mount:

  ghigo@emulato:~$ sudo mount /dev/test/lv01_snap -o device=/dev/test/lv02_snap /mnt/btrfs1/

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
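The duplicate check and the message wording proposed in this message can be sketched as follows. This is only a minimal illustration, not the actual mount.btrfs helper: the function names are hypothetical, and the (path, dev_uuid) pairs are assumed to have been read from each device's btrfs superblock by some other means.

```python
from collections import defaultdict


def find_duplicate_dev_uuids(devices):
    """Group device paths by dev_uuid and keep only the ambiguous groups.

    `devices` is an iterable of (path, dev_uuid) pairs (hypothetical input).
    """
    by_uuid = defaultdict(list)
    for path, dev_uuid in devices:
        by_uuid[dev_uuid].append(path)
    return {u: sorted(paths) for u, paths in by_uuid.items() if len(paths) > 1}


def conflict_messages(devices):
    """Build one message per ambiguous dev_uuid, in the suggested wording."""
    messages = []
    for dev_uuid, paths in sorted(find_duplicate_dev_uuids(devices).items()):
        messages.append(
            "UUID=%s ambiguous, must select between %s using device= "
            "to mount filesystem" % (dev_uuid, " and ".join(paths))
        )
    return messages


# Made-up UUID labels; the device names mirror the lvdisplay output above.
devices = [
    ("/dev/mapper/test-lv01", "uuid-a"),
    ("/dev/mapper/test-lv01_snap", "uuid-a"),
    ("/dev/mapper/test-lv02", "uuid-b"),
    ("/dev/mapper/test-lv02_snap", "uuid-b"),
    ("/dev/sdz1", "uuid-c"),
]
for line in conflict_messages(devices):
    print(line)
```

A helper built this way would refuse to mount whenever the list of messages is non-empty, unless the admin resolved the ambiguity with explicit device= options.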
Re: BTRFS messes up snapshot LV with origin
2014-11-29 2:25 GMT+01:00 Robert White rwh...@pobox.com:
On 11/28/2014 09:05 AM, Goffredo Baroncelli wrote:

For the disk autodetection, I am still convinced that it is a sane default to skip the lvm-snapshot

No... please don't... Maybe offer an option to select between snapshots or no-snapshots, but in much the same way there is no _functional_ difference between a subvolume and a snapshot in btrfs, there is no degenerate status to an LVM snapshot.

It would be way more useful if the helper dumped a message via stderr or syslog that said something like UUID= ambiguous, must select between /dev/AA and /dev/BB using device= to mount filesystem.

I agree with this. Sometimes people will want to do exactly that: mount the snapshot devices and not the origins. Listing devices in the device= mount option sounds perfectly sane.
Re: BTRFS messes up snapshot LV with origin
On 11/28/2014 08:59 PM, Zygo Blaxell wrote: On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote: On 11/27/2014 05:15 AM, Zygo Blaxell wrote: This is a weakness of the current udev and asynchronous device hotplug concept: there is no notion of bus enumeration in progress, so we can be trying to assemble multi-device storage before we have all the devices visible. Assembly of aggregate storage (whatever it is--btrfs, md, lvm2...) has to wait until all known storage buses are fully enumerated in order to detect if there are duplicates. It is more complex than that. Some devices may appear after the 1st bus enumeration. That case is well handled already--a new enumeration will start with the second (and all later) hotplug events. The problem arises when we try to assemble disk arrays before the known end of the 1st (or any) enumeration. There is no way for an enumerating agent to tell other agents this is definitely not the complete list of devices yet, other devices may be inserted imminently and defer all the multi-device assembly until the address space of the enumering bus is fully covered. MDADM has an attached but not started state for arrays that handles this condition during incremental assembly. (see mdadm --incremental /dev/whatever), To slightly misuse the vocabulary, as each partition is encountered and submitted to the system it's checked for a superblock. If one is found then it has the identity of an array encoded on it and if that array doesn't exist it is allocated, otherwise the device is added to the existent array. The array is only started if all the devices are accounted for unless an option is added to allow earlier starts, and even then enough of the devices must be present to make sense (e.g. only one device missing from a RAID5, or a correct pair of devices for a RAID10 etc.) 
So we'd need a partially-assembled-but-not-started state and some ioctls to do things like force-start or force-disown a filesystem that cannot be finished automatically.

That sort of thing is very easy to do with devices because devices don't have to be opened and can reject an open attempt, or at least the reads/writes after an open and such. Unfortunately a filesystem can really only exist as a mounted thing, and can really only be controlled by remounting thereafter.

The most efficient way to do this would be to have an alternate file system operations structure that was filled mostly with dummy operations that would return ENOENT and friends. Then the remount that finally fulfilled the file system's requirements would switch out that struct for the fully functional one. That remount would need an adddev= and some other such options (much like AUFS adds layers).

It's all doable. But it stretches the mount paradigm to near breaking. You would need an operation that looked like

  mount -t btrfs -o do_we_need_this /dev/whatever /this/datum/means/nothing

to match and attach a device wherever it goes, or you might end up needing to do the Cartesian product of trial attachments of each new device to all active filesystems to match it up, which is an ugly external scripting requirement.

As far as waiting for the address space to be fully covered: meh. If a ready-or-not, or ready-enough, status is established in the file system, it would be undesirable for it to know anything about any other subsystem. We don't care if enumeration is done; we only care if we have a rational set of storage, and whether that rational set is enough to be fully ready, enough to be only read-ready, or just plain not enough.
In theory, the idempotent mount command could be

  mount -t btrfs some-uuid-instead-of-device /mount/point
  mount -t btrfs some-other-uuid-here /other/mount/point

to create the zero-devices-involved entity, followed by

  mount -t btrfs -o trydev /dev/something /this/bit/is/ignored

repeated for all possible somethings. /mount/point and /other/mount/point would be returning ENOENT for their contents until they were ready-enough.

In practice this is very impure compared to how mdadm has the /dev/md- namespace in which to build its devices before any actual mount is possible.
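The "attached but not started" bookkeeping described in this message can be modelled roughly as below. It is a toy sketch and not how mdadm or btrfs is implemented: the class, the state names and the superblock fields (array_uuid, total, minimum) are all assumptions made for the illustration.

```python
class IncrementalAssembler:
    """Toy model of incremental assembly: devices arrive one at a time."""

    def __init__(self):
        self.arrays = {}  # array_uuid -> bookkeeping for that array

    def submit(self, path, superblock):
        """Attach a device to the array named in its superblock.

        `superblock` is None for devices that carry no array identity.
        The array record is allocated the first time its UUID is seen.
        """
        if superblock is None:
            return None
        array = self.arrays.setdefault(
            superblock["array_uuid"],
            {
                "total": superblock["total"],
                "minimum": superblock["minimum"],
                "members": set(),
            },
        )
        array["members"].add(path)
        return superblock["array_uuid"]

    def state(self, array_uuid, allow_degraded=False):
        """Report whether the array is complete, usable degraded, or waiting."""
        array = self.arrays[array_uuid]
        present = len(array["members"])
        if present >= array["total"]:
            return "started"
        if allow_degraded and present >= array["minimum"]:
            return "started-degraded"
        return "attached"


# A three-device RAID5 that tolerates one missing member.
raid5 = {"array_uuid": "r5", "total": 3, "minimum": 2}
assembler = IncrementalAssembler()
for dev in ("/dev/sda1", "/dev/sdb1"):
    assembler.submit(dev, raid5)
print(assembler.state("r5"))
print(assembler.state("r5", allow_degraded=True))
```

The point of the model is that readiness is decided only from what the array itself knows (how many members it expects and how many it has), without any knowledge of whether bus enumeration has finished.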
Re: BTRFS messes up snapshot LV with origin
Robert White posted on Wed, 26 Nov 2014 14:08:14 -0800 as excerpted: On 11/25/2014 07:22 PM, Duncan wrote: From my perspective, however, btrfs is simply incompatible with lvm snapshots, because the basic assumptions are incompatible. Btrfs assumes UUIDs will be exactly what they say on the label, /unique/, while lvm's snapshot feature directly breaks that uniqueness by copying the (former) UUID, thus making the former UUID no longer unique and thus no longer truly UUID. Thus, part of the lvm /feature/ of snapshots is in direct contradiction to a basic assumption of btrfs, that UUIDs are exactly that, unique, making that feature directly incompatible with btrfs on a very basic level. A finer point here. LVM doesn't copy the UUID. AN LVM snapshot is a copy-on-write entity so it _exposes_ the single sector(s) of the superblock(s) in both views of the underlying storage. I /hate/ it when this happens, which is why my posts often end up so long. People keep saying shorten them, but when I try, invariably I end up shortcutting something like this and get called on it! =:^( So, umm... kinda late now, but read that copy as if it had a footnote attached, saying Yes, I know it's not actual copy, it's two views of the same thing using COW, but my point is, from the btrfs perspective it's a copy, the universally UNIQUE ID no longer looks unique and thus no longer can be properly called a UUID at all. Which kinda makes most of the rest of what you said, which I agree with in general were it the case that I actually thought of it as a literal copy, unnecessary... Tho I can't fault you for catching and pointing out my shortcut as an error, because you're absolutely correct in that case, and I'd almost certainly be doing the same thing were the situation reversed. So while you may have a point about btrfs being unprepared for LVM, neither party is particularly at fault in any way. 
The damn you photocopier for making photocopies so identically nature of your problem with LVM seems to be leading you to misplaced conclusions. Well, to the extent that I tried to take an unwarranted logical shortcut and didn't properly describe it... But... I'd still say LVM is at fault to the extent that anyone is, as it /knows/ it's dealing with UUIDs because after all that's part of what's /on/ what it's snapshotting, and it doesn't make any effort to deal with the situation, despite the at least theoretical (and now in fact) confusion that may occur when former UUIDs are no longer unique and thus no longer UUIDs. However, the point remains, they are pretty much incompatible, in that one assumes unique means that a second one won't pop up elsewhere and depends on exactly that, while the functionality of the other is exactly that, to make another view of the same thing, including the otherwise unique ID, pop up elsewhere, with COW semantics. If you are waiting for someone to code it up perhaps you should do so. I'm not sure if that was the singular or plural you, but in any case, it won't be /me/, because I'm not a coder, simply another sysadmin willing to guinea-pig this fascinating new filesystem toy. =:^) As previously stated XFS solved this problem by providing a tool that would change the UUID of a file system. This tool cold then be pointed at either (or both) the original and/or snapshot volumes as needed. I think that'll eventually happen. Actually, I see it's on the wiki project ideas page, now (see 1.2.25 and 1.2.26, online/offline UUID changes, respectively): https://btrfs.wiki.kernel.org/index.php/Project_ideas There's even POC code. =:^) Wiki page history says Kdave added that on 06 Oct. 2014, so the entry is reasonably new, and the POC's encouraging, but will it go anywhere from there? 
Given that BTRFS wants to play in the same level of abstraction as LVM, it's kind of a given that they'll butt heads over things like conflicting definitions of what it means to take a snapshot.

Agreed. Actually, given btrfs is already doing much of it, it'd be interesting if it eventually got the ability to specify where subvolumes went and limit them in size (ideally more directly than the existing btrfs quotas related functionality does, etc), thus avoiding having to rely on LVM for that and eliminating the need for it in scenarios where that's desired. Couple that with the better snapshot handling that is already in the works, and would there /still/ be a need for LVM under btrfs then; for what if so, and could it too be integrated into btrfs?

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On Thu, Nov 27, 2014 at 2:08 AM, Duncan 1i5t5.dun...@cox.net wrote: So, umm... kinda late now, but read that copy as if it had a footnote attached, saying Yes, I know it's not actual copy, it's two views of the same thing using COW, but my point is, from the btrfs perspective it's a copy, the universally UNIQUE ID no longer looks unique and thus no longer can be properly called a UUID at all. The copy is sort of a misnomer anyway because up until the computer age the copy was a derivative, a facsimile, like a photocopy. But a copy of a digital file is actually another original. Therein lies the problem with the LVM snapshot in this context, we don't want another original. We want a copy, as in we want something we know has been derived from something else, and therefore can be discriminated. And that's the same problem with subvolume UUIDs being reused when creating new Btrfs volumes, which have new volume UUIDs, from a Btrfs seed device. There are now multiple originals of those subvolumes, there's no distinguishing them by their UUID alone. But... I'd still say LVM is at fault to the extent that anyone is, as it /knows/ it's dealing with UUIDs because after all that's part of what's /on/ what it's snapshotting, and it doesn't make any effort to deal with the situation, despite the at least theoretical (and now in fact) confusion that may occur when former UUIDs are no longer unique and thus no longer UUIDs. Well RFC 4122 I don't think would say it's not a UUID, the uniqueness is only guaranteed at the time of UUID creation. And duplication isn't creation so it's not going to say these things are no longer UUIDs, they're just UUIDs that have been recycled. That RFC doesn't specify workflow, but if it did, I think it'd basically say oh crap, why'd you go and do that? After all a major point of UUIDs is that they are effectively unlimited in quantity, therefore a.) we don't need central registry to avoid (unintended) collisions because they're so uncommon, b.) 
we're encouraged to not be attached to specific UUIDs: when in doubt, just create another one.

A very good example of WTF reusage of a UUID that irks me to no end is GNU parted devs decided to recycle the Microsoft Windows Basic Data partition type GUID for Linux partitions. It's like watching someone get run over by a zamboni with 50 feet of advance notice...

--
Chris Murphy
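The "just create another one" point is easy to see with any RFC 4122 implementation. A small sketch using Python's standard uuid module (the variable names are only for illustration): freshly generated UUIDs differ, while a UUID reconstructed from copied bytes is indistinguishable from its source, which is the situation a snapshot creates.

```python
import uuid

# Two freshly generated random (version 4) UUIDs.
first = uuid.uuid4()
second = uuid.uuid4()

# Duplicating the bytes, as any block-level copy or snapshot effectively does,
# yields an identifier that compares equal to the original.
duplicate = uuid.UUID(bytes=first.bytes)

print(first.version)
print(first != second)
print(duplicate == first)
```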
Re: BTRFS messes up snapshot LV with origin
On 11/25/2014 11:21 PM, Zygo Blaxell wrote:

However I still don't understand why you want btrfs-w/multiple disk over LVM ?

I want to split a few disks into partitions, but I want to create, move, and resize the partitions from time to time. Only LVM can do that without taking the machine down, reducing RAID integrity levels, hotplugging drives, or leaving installed drives idle most of the time. I want btrfs-raid1 because of its ability to replace corrupted or lost data from one disk using the other. If I run a single-volume btrfs on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage stack), I can detect lost data, but not replace it automatically from the other mirror.

OK, now I understand. Anyway, as a workaround, take into account that you can pass the devices explicitly as:

  mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt

(supposing that the filesystem is on /dev/sda.../dev/sdd)

I am working on a mount.btrfs helper. The aim of this helper is to manage the assembling of multiple devices; the main points will be:

- wait until all the devices have appeared
- allow (if required) mounting in degraded mode after a timeout
- at this point it could/should also skip the lvm-snapshotted devices (but first I have to know how to recognize these)

I hope to issue the patches in the next week

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
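For scripting the workaround above, the explicit device list can be turned into a mount command line mechanically. A minimal sketch, assuming a hypothetical helper function btrfs_mount_argv; only the device= option and the command shape come from the message itself.

```python
def btrfs_mount_argv(devices, mountpoint, extra_opts=()):
    """Build an argv list that names every member device explicitly.

    The last device is used as the mount source and the others are passed
    as device= options, mirroring the command shown in the message.
    """
    if not devices:
        raise ValueError("at least one device is required")
    source = devices[-1]
    opts = ["device=%s" % d for d in devices[:-1]] + list(extra_opts)
    argv = ["mount"]
    if opts:
        argv += ["-o", ",".join(opts)]
    argv += [source, mountpoint]
    return argv


argv = btrfs_mount_argv(["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"], "/mnt")
print(" ".join(argv))
```

The resulting list could be handed to subprocess.run by a caller with the needed privileges; that step is left out here because it would actually mount a filesystem.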
Re: BTRFS messes up snapshot LV with origin
On 11/25/2014 07:22 PM, Duncan wrote:

From my perspective, however, btrfs is simply incompatible with lvm snapshots, because the basic assumptions are incompatible. Btrfs assumes UUIDs will be exactly what they say on the label, /unique/, while lvm's snapshot feature directly breaks that uniqueness by copying the (former) UUID, thus making the former UUID no longer unique and thus no longer truly UUID. Thus, part of the lvm /feature/ of snapshots is in direct contradiction to a basic assumption of btrfs, that UUIDs are exactly that, unique, making that feature directly incompatible with btrfs on a very basic level.

A finer point here. LVM doesn't copy the UUID. An LVM snapshot is a copy-on-write entity, so it _exposes_ the single sector(s) of the superblock(s) in both views of the underlying storage.

This is universal to the idea of a snapshot. Just as a btrfs subvol snap /old /new exposes all the unique elements of /old under the name /new (in preparation for the user to implement subsequent divergence), lvcreate --snapshot Old New causes every block-N of Old to be identically available as block-N of New (in preparation for the user to implement subsequent divergence).

In point of fact the LVM snapshot operation is a zero-copy operation at its heart. After the snapshot is established, when a block is modified in Old, its original content is saved in New. When blocks are written in New, they are written in place and the reference to the block content in Old is overwritten.

This is the reason that fsfreeze is unnecessary for things above LVM snapshots, as the instant-in-time divergence is _instant_. It's not that LVM goes out and does an fsfreeze-equivalent action; it's that the switch to write-divergence is essentially atomic. A bunch of metadata is set up and then, all at once, one write behavior is switched with another by re-mapping the device access routines.
So while you may have a point about btrfs being unprepared for LVM, neither party is particularly at fault in any way. The "damn you photocopier for making photocopies so identically" nature of your problem with LVM seems to be leading you to misplaced conclusions.

If you need to harmonize these sorts of things, you need to be able to re-write the blocks in question with disambiguating information (like new UUIDs) or restrict your accesses in some other manner. If you are waiting for someone to code it up, perhaps you should do so. But it will _never_ be automatic, because the use cases that don't match your expectations may need the founding assumptions to be as they are today.

In other words, your belief that your position is entirely logical may be a little off, particularly if you think LVM is Copying things when it does a snapshot.

As previously stated, XFS solved this problem by providing a tool that would change the UUID of a file system. This tool could then be pointed at either (or both) the original and/or snapshot volumes as needed. I don't see a re-make the btrfs option for changing UUIDs, and LVM doesn't care _at_ _all_ about what is actually in its volumes (okay, lvresize has some fsck nonsense, but that's just messy).

It might even be wrong to try to harmonize those features, like trying to put a manual clutch into a car with an automatic transmission... it may just not fit.

Given that BTRFS wants to play in the same level of abstraction as LVM, it's kind of a given that they'll butt heads over things like conflicting definitions of what it means to take a snapshot.
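The "exposes, doesn't copy" behaviour described in this message can be illustrated with a deliberately simplified copy-on-write model. This is a sketch under assumptions, not LVM's actual implementation: the Origin and Snapshot classes and the list-of-strings block store are invented for the example.

```python
class Snapshot:
    """A view of an origin that stores only the blocks that have diverged."""

    def __init__(self, origin):
        self.origin = origin
        self.exceptions = {}  # block number -> content private to this view

    def preserve(self, n, old_content):
        # Called before the origin overwrites block n; keeps the first
        # preserved (or snapshot-written) content and ignores later calls.
        self.exceptions.setdefault(n, old_content)

    def read(self, n):
        if n in self.exceptions:
            return self.exceptions[n]
        return self.origin.read(n)  # untouched blocks are shared, not copied

    def write(self, n, data):
        # From now on this view no longer refers to the origin's block n.
        self.exceptions[n] = data


class Origin:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        # Zero-copy: creating the snapshot moves no data at all.
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        for snap in self.snapshots:
            snap.preserve(n, self.blocks[n])
        self.blocks[n] = data


old = Origin(["superblock UUID=1234", "data-A", "data-B"])
new = old.snapshot()
old.write(1, "data-A2")
print(old.read(0) == new.read(0))  # both views expose the same superblock
print(new.read(1))                 # the snapshot still sees the old content
```

Block 0, standing in for the filesystem superblock, is never written after the snapshot, so both views present the same UUID without any copy having been made.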
Re: BTRFS messes up snapshot LV with origin
On Wed, Nov 26, 2014 at 06:19:05PM +0100, Goffredo Baroncelli wrote: On 11/25/2014 11:21 PM, Zygo Blaxell wrote: However I still doesn't understood why you want btrfs-w/multiple disk over LVM ? I want to split a few disks into partitions, but I want to create, move, and resize the partitions from time to time. Only LVM can do that without taking the machine down, reducing RAID integrity levels, hotplugging drives, or leaving installed drives idle most of the time. I want btrfs-raid1 because of its ability to replace corrupted or lost data from one disk using the other. If I run a single-volume btrfs on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage stack), I can detect lost data, but not replace it automatically from the other mirror. OK, now I have understood. Anyway as workaround, take in account that you can pass explicitly the devices as: mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt (supposing that the filesystem is on /dev/sda.../dev/sdd) I am working to a mount.btrfs helper. The aim of this helper is to manage the assembling of multiple devices; the main points will be: - wait until all the devices appeared ...and make sure there are no duplicate UUIDs. - allow (if required) to mount in degraded mode after a timeout This is a terrible idea with current btrfs, at least for read-write degraded mounting (fallback to read-only degraded would be OK). Mounting a filesystem read-write and degraded is something you only want to do immediately before you replace all the missing disks and bring the filesystem up to a non-degraded space and after you've ensured that the missing disks can never, ever come back; otherwise, btrfs eats your data in a slightly different way than we have discussed so far... 
- at this point it could/should also skip the lvm-snapshotted devices (but first I have to know how to recognize these)

You don't have to recognize them as snapshots (and it's probably better not to treat snapshots specially anyway--how do you know whether the snapshot or the origin LVs are wanted for mounting?). You just have to detect duplicate UUIDs at the btrfs subdevice level, and if any are found, stop immediately (or get a hint from the admin).

This is a weakness of the current udev and asynchronous device hotplug concept: there is no notion of "bus enumeration in progress", so we can be trying to assemble multi-device storage before we have all the devices visible. Assembly of aggregate storage (whatever it is--btrfs, md, lvm2...) has to wait until all known storage buses are fully enumerated in order to detect if there are duplicates.

I hope to issue the patches in the next week

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS messes up snapshot LV with origin
On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
[...]

md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case.

I fully agree; but you are still considering a *multi-device* btrfs over lvm... This is like a dm over lvm... which doesn't make sense at all (as you already wrote)

and mounting the filesystem fails at 3.

Are you sure ?

Yes, I'm sure. I've had to replace filesystems destroyed this way.

[working instance snipped]

On the basis of the example above, in case you want to mount a single-disk filesystem, BTRFS seems to me to work properly. You only have to pay attention not to mount the two filesystems at the same time.

The problem is btrfs stops searching when it sees one disk with each UUID,

BTRFS doesn't search anything. It is udev which pushes the information to the kernel module. The btrfs module groups this information by UUID. When a new disk is inserted, it overwrites the information of the old one.

so the set of disks (snapshot vs origin) that you get is *random*. For a pair of origin + snapshots, there's a 50% chance it works, 50% chance it eats your data.

Sorry but I have to disagree: the code is quite clear (see fs/btrfs/volume.c, near line 512):

  [...]
  } else if (!device->name || strcmp(device->name->str, path)) {
          /*
           * When FS is already mounted.
           * 1. If you are here and if the device->name is NULL that
           *    means this device was missing at time of FS mount.
           * 2. If you are here and if the device->name is different
           *    from 'path' that means either
           *      a. The same device disappeared and reappeared with
           *         different name. or
           *      b. The missing-disk-which-was-replaced, has
           *         reappeared now.
           *
           * We must allow 1 and 2a above. But 2b would be a spurious
           * and unintentional.
  [...]

The case is the 2a; in this case btrfs stores the new name and mounts it.
Anyway I made a small test: I created 1 btrfs filesystem, and made an lvm-snapshot. Then I created two different files, one in the snapshot and one in the original. I ran a program which randomly mounts the first or the latter and checks if the correct file is present; after more than 130 tests I never saw your "50% chance it works": it always works.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
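The "group by UUID, a newer disk overwrites the older one" behaviour described in this message can be pictured with a toy registry. It is an illustration only, not the kernel's data structure: the (fsid, devid) keying, the function names and the scan orders are assumptions chosen for the example.

```python
def register(registry, fsid, devid, path):
    """Record `path` for a device slot; a later scan replaces an earlier one."""
    slots = registry.setdefault(fsid, {})
    previous = slots.get(devid)
    slots[devid] = path
    return previous


def assemble(scan_order):
    """Feed scan events in order and return the paths that end up selected."""
    registry = {}
    for fsid, devid, path in scan_order:
        register(registry, fsid, devid, path)
    return registry


# The same four block devices, reported in two different orders.
order_a = [
    ("fs1", 1, "lv00"), ("fs1", 1, "lv00snap"),
    ("fs1", 2, "lv01snap"), ("fs1", 2, "lv01"),
]
order_b = [
    ("fs1", 1, "lv00"), ("fs1", 2, "lv01"),
    ("fs1", 1, "lv00snap"), ("fs1", 2, "lv01snap"),
]
print(assemble(order_a)["fs1"])
print(assemble(order_b)["fs1"])
```

With a single-device filesystem whichever device was scanned last is used as a whole, which fits the test result above; with two or more member devices the scan order decides whether the selected set is consistent.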
Re: BTRFS messes up snapshot LV with origin
On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote: On 11/23/2014 01:19 AM, Zygo Blaxell wrote: [...] md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case. I fully agree; but you still consider a *multi-device* btrfs over lvm... This is like a dm over lvm... which doesn't make sense at all (as you already wrote) It makes sense for btrfs because btrfs can productively use LVs on different PVs (e.g. btrfs-raid1 on two LVs, one on each PV). LVM is the bottom layer because not everything in the world is btrfs--things like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs (e.g. before running btrfsck) have to live on the same physical drives as the btrfs filesystems. and mounting the filesystem fails at 3. Are you sure ? Yes, I'm sure. I've had to replace filesystems destroyed this way. [working instance snipped] On the basis of the example above, in case you want to mount a single-disk, BTRFS seems me to work properly. You have to pay attention only to not mount the two filesystem at the same time. The problem is btrfs stops searching when it sees one disk with each UUID, BTRFS doens't search anything. It is udev which push the information on the kernel module. The btrfs module groups these information by UUID. When a new disk is inserted, overwrite the information of the old one. Same result: when presented with multiple devices with the same UUID, one is chosen arbitrarily instead of rejecting all of them. so the set of disks (snapshot vs origin) that you get is *random*. For a pair of origin + snapshots, there's a 50% chance it works, 50% chance it eats your data. Sorry but I have to disagree: the code is quite clear (see fs/btrfs/volume.c, near line 512): [...] 
  } else if (!device->name || strcmp(device->name->str, path)) {
          /*
           * When FS is already mounted.
           * 1. If you are here and if the device->name is NULL that
           *    means this device was missing at time of FS mount.
           * 2. If you are here and if the device->name is different
           *    from 'path' that means either
           *      a. The same device disappeared and reappeared with
           *         different name. or
           *      b. The missing-disk-which-was-replaced, has
           *         reappeared now.

If the FS is already mounted then there is no issue. It's when you're trying to mount the FS that the fun occurs.

           *
           * We must allow 1 and 2a above. But 2b would be a spurious
           * and unintentional.
  [...]

The case is the 2a; in this case btrfs stores the new name and mounts it. Anyway I made a small test: I created 1 btrfs filesystem, and made an lvm-snapshot. Then I created two different files, one in the snapshot and one in the original. I ran a program which randomly mounts the first or the latter and checks if the correct file is present; after more than 130 tests I never saw your "50% chance it works": it always works.

One btrfs filesystem on two LVs with a snapshot of each LV also present. So you'd have:

  lv00     - btrfs device 1
  lv01     - btrfs device 2
  lv00snap - snapshot of lv00
  lv01snap - snapshot of lv01

If you mount by device UUID then you get one of these results at random:

  lv00 + lv01         - OK
  lv00snap + lv01snap - also OK
  lv00 + lv01snap     - failure
  lv00snap + lv01     - failure

2 failures, 2 successes = 50% failure rate.

If you mount by the name of one of the devices then you only get the two rows of the above table that match the device you named, but you still get one success row and one failure row.

Which result you get seems to depend on the order in which LVM enumerates the LVs, so if you are doing a mount/umount loop then you won't see any problems, as btrfs will consistently make the same choice of LVs over and over again. Rebooting or creating other LVs in between mounts will definitely cause problems.
BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
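The 50% figure in this message follows from simply enumerating the possible selections. A short sketch, assuming the LV names from the table and a hypothetical naming rule (snapshots end in "snap") to tell the two generations apart:

```python
from itertools import product

# Candidate block devices for each btrfs device slot, as in the table above.
candidates = {
    1: ["lv00", "lv00snap"],
    2: ["lv01", "lv01snap"],
}


def is_snapshot(name):
    # Naming convention used only for this illustration.
    return name.endswith("snap")


def outcomes(candidates):
    """List every possible selection and whether it is a consistent set."""
    rows = []
    for combo in product(*(candidates[slot] for slot in sorted(candidates))):
        kinds = {is_snapshot(name) for name in combo}
        rows.append((combo, "OK" if len(kinds) == 1 else "failure"))
    return rows


rows = outcomes(candidates)
failures = sum(1 for _, result in rows if result == "failure")
for combo, result in rows:
    print(" + ".join(combo), "-", result)
print("%d of %d selections are inconsistent" % (failures, len(rows)))
```

Naming one device on the mount command line pins one slot, which leaves two rows of the table: one consistent and one mixed.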
Re: BTRFS messes up snapshot LV with origin
On 11/25/2014 09:29 PM, Zygo Blaxell wrote:
On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
[...]

md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case.

I fully agree; but you are still considering a *multi-device* btrfs over lvm... This is like a dm over lvm... which doesn't make sense at all (as you already wrote)

It makes sense for btrfs because btrfs can productively use LVs on different PVs (e.g. btrfs-raid1 on two LVs, one on each PV). LVM is the bottom layer because not everything in the world is btrfs--things like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs (e.g. before running btrfsck) have to live on the same physical drives as the btrfs filesystems.

Let me summarize:

1) btrfs-single-disk on lvm works fine
2) btrfs-w/multiple-disk on lvm works fine
3) btrfs-single-disk on lvm works fine even with snapshot
4) btrfs-w/multiple-disk doesn't work with lvm AND snapshot

However I still don't understand why you want btrfs-w/multiple disk over LVM ?

and mounting the filesystem fails at 3.

Are you sure ?

Yes, I'm sure. I've had to replace filesystems destroyed this way.

In a previous email you wrote: "Multi-device btrfs fails at 2", so I assumed that points 3 onwards were related to a single-disk btrfs.

[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS messes up snapshot LV with origin
On Tue, Nov 25, 2014 at 10:59:53PM +0100, Goffredo Baroncelli wrote:
On 11/25/2014 09:29 PM, Zygo Blaxell wrote:
On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
[...]

md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case.

I fully agree; but you still consider a *multi-device* btrfs over lvm... This is like a dm over lvm... which doesn't make sense at all (as you already wrote)

It makes sense for btrfs because btrfs can productively use LVs on different PVs (e.g. btrfs-raid1 on two LVs, one on each PV). LVM is the bottom layer because not everything in the world is btrfs--things like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs (e.g. before running btrfsck) have to live on the same physical drives as the btrfs filesystems.

Let me summarize:

1) btrfs-single-disk on lvm works fine
2) btrfs-w/multiple-disk on lvm works fine
3) btrfs-single-disk on lvm works fine even with snapshot
4) btrfs-w/multiple-disk doesn't work with lvm AND snapshot

However, I still don't understand why you want btrfs-w/multiple-disk over LVM?

I want to split a few disks into partitions, but I want to create, move, and resize the partitions from time to time. Only LVM can do that without taking the machine down, reducing RAID integrity levels, hotplugging drives, or leaving installed drives idle most of the time.

I want btrfs-raid1 because of its ability to replace corrupted or lost data from one disk using the other. If I run a single-volume btrfs on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage stack), I can detect lost data, but not replace it automatically from the other mirror.

Since I want both things at the same time, I have btrfs w/multiple disks on LVM.

The LVM snapshots are for providing an 'undo' capability when I experiment with some btrfs or btrfsck feature that destroys the filesystem.

and mounting the filesystem fails at 3.

Are you sure?

Yes, I'm sure. I've had to replace filesystems destroyed this way.

In a previous email you wrote:

Multi-device btrfs fails at 2,

So I assumed that points 3 onwards were related to a single-disk btrfs.

[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS messes up snapshot LV with origin
What happens when all btrfs LVs are unmounted, and you lvchange -an the LVs (the pair) you do not want mounted; and then btrfs dev scan; and then mount one of the devices? It should only find the matching LV because the others are deactivated.

I know this isn't ideal, but it's better than corruption.

Chris Murphy
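The sequence Chris asks about can be written down as a dry-run planner. This is only a sketch under stated assumptions: the function name `plan_isolated_mount` and the VG/LV names are hypothetical, the code builds the command lines (`lvchange -an`, `btrfs device scan`, `mount`) without running them, and it does not check whether a rescan actually drops devices the kernel has already registered.

```python
def plan_isolated_mount(unwanted_lvs, device, mountpoint):
    """Return, in order, the commands for mounting `device` after
    deactivating the LVs that carry a duplicate btrfs UUID.

    Nothing is executed here; the caller decides whether to run them.
    Assumes every btrfs LV involved is already unmounted.
    """
    commands = []
    # Deactivate each LV that btrfs should not be able to see.
    for lv in unwanted_lvs:
        commands.append(["lvchange", "-an", lv])
    # Rescan block devices, as suggested in the message above.
    commands.append(["btrfs", "device", "scan"])
    # Mount one of the remaining (active) member devices.
    commands.append(["mount", device, mountpoint])
    return commands


# Hypothetical names, for illustration only.
for cmd in plan_isolated_mount(["vgtest/lvone_snap"], "/dev/vgtest/lvone", "/mnt"):
    print(" ".join(cmd))
```

Keeping the plan as data makes it easy to review the exact commands before anything touches a real volume group.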
Re: BTRFS messes up snapshot LV with origin
Goffredo Baroncelli posted on Tue, 25 Nov 2014 22:59:53 +0100 as excerpted:

However, I still don't understand why you want btrfs-w/multiple-disk over LVM?

While I'm not an LVM person here, and he already replied with essentially the same point, I think it's worth repeating...

Btrfs' checksummed error detection and automatic rewrite from a different copy isn't a small thing, and simply isn't available at all with most would-be alternatives (zfs being the only similar thing I know of for Linux, and of course it has its own issues both technical and social/legal/license). That alone is worth running multi-device btrfs to get.

That makes btrfs a near-mandatory part of the picture, whatever it's on. And for people wanting LVM's volume management (including partitioning without many of the limitations), the direct result is multi-device btrfs on lvm.

From my perspective, however, btrfs is simply incompatible with lvm snapshots, because the basic assumptions are incompatible. Btrfs assumes UUIDs will be exactly what they say on the label, /unique/, while lvm's snapshot feature directly breaks that uniqueness by copying the (former) UUID, thus making the former UUID no longer unique and thus no longer truly a UUID.

Thus, part of the lvm /feature/ of snapshots is in direct contradiction to a basic assumption of btrfs, that UUIDs are exactly that, unique, making that feature directly incompatible with btrfs on a very basic level.

So people can have their btrfs on lvm, but if they do, they have to forgo LVM snapshots because btrfs isn't compatible with their usage. To me it's as simple as that, and people can choose either btrfs or lvm snapshots, but not both, it's one XOR the other.

So for me it's simply choose the one you will have the most difficulty doing without and forgo the other one. Not a problem, just make your choice and move on.

OTOH, there's that common signature about the reasonable man folding to the circumstance while the unreasonable man insists on folding the circumstance to his wishes instead, so progress depends on the unreasonable man... But that's exactly what I see here, an unreasonable man insisting that entirely logical circumstance bend to his will. Which, given someone to actually code it up, it might well do. =:^)

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On Tue, Nov 25, 2014 at 7:11 PM, Zygo Blaxell zblax...@furryterror.org wrote:
On Tue, Nov 25, 2014 at 03:46:32PM -0700, Chris Murphy wrote:

What happens when all btrfs LVs are unmounted, and you lvchange -an the LVs (the pair) you do not want mounted; and then btrfs dev scan; and then mount one of the devices? It should only find the matching LV because the others are deactivated. I know this isn't ideal, but it's better than corruption.

This is one of two possible ways to assemble the btrfs correctly. The other is to explicitly name all of the devices when mounting.

OK, I didn't realize it was possible to explicitly name all of them. The last time I tried this (about 9 epochs ago), mount didn't understand being passed two devices before the mount point.

The challenge for the poor end-user (or inexperienced sysadmin) is to defeat all the defaults in system installers, initramfs-tools, lvm2, udev, etc. to prevent btrfs from destroying a filesystem accidentally.

I agree: if it finds two identical volumes it should fail to mount with some coherent error.

--
Chris Murphy
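Naming every member device explicitly is done with the btrfs `device=` mount option, mentioned elsewhere in this thread. A minimal sketch of how such a command line could be put together follows; the helper name `btrfs_mount_command` and the device paths are hypothetical, and the code only builds the argument list. Whether explicit naming is enough to keep a stale duplicate out of the assembled filesystem is exactly what the thread is debating, so treat this as an illustration of the syntax, not a guarantee.

```python
def btrfs_mount_command(devices, mountpoint):
    """Build a mount command that names every member device explicitly.

    `devices` is the full list of member block devices; the first one is
    also used as the device argument to mount.
    """
    if not devices:
        raise ValueError("at least one device is required")
    # One device=<path> entry per member, joined into a single -o string.
    options = ",".join(f"device={dev}" for dev in devices)
    return ["mount", "-t", "btrfs", "-o", options, devices[0], mountpoint]


# Hypothetical two-device filesystem on two LVs.
print(" ".join(btrfs_mount_command(["/dev/vg0/lv_a", "/dev/vg1/lv_b"], "/mnt")))
```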
Re: BTRFS messes up snapshot LV with origin
On 11/21/2014 05:28 AM, Zygo Blaxell wrote:

e.g. if an ext4 filesystem explodes, I can:

1. make a LVM snapshot of the broken filesystem
2. run e2fsck on the snapshot
3. mount and repair the snapshot, e.g. rsync any missing files from backups, salvage anything that survived
4. LVM merge the snapshot to its origin volume
5. umount the origin volume and mount the merged volume (or just reboot)

...and I can do all of this on a running system, in-place, with only a few minutes of downtime in the must-reboot case.

None of the above works with btrfs at all. Multi-device btrfs fails at 2,

You can't compare ext4 with btrfs if you are talking about a multi-device filesystem: ext4 doesn't have this capability. Try to make an md-raid over snapshotted logical volume(s); I never tried that, but I suppose that there will be the same problems...

and mounting the filesystem fails at 3.

Are you sure?

ghigo@venice:/tmp$ # create a btrfs filesystem in a logical volume
ghigo@venice:/tmp$ sudo truncate -s +10G disk.img
ghigo@venice:/tmp$ sudo losetup -f disk.img
ghigo@venice:/tmp$ sudo pvcreate /dev/loop0
ghigo@venice:/tmp$ sudo vgcreate vgtest /dev/loop0
ghigo@venice:/tmp$ sudo lvcreate -n lvone -L 3G vgtest
ghigo@venice:/tmp$ sudo mkfs.btrfs /dev/vgtest/lvone
ghigo@venice:/tmp$ mkdir t
ghigo@venice:/tmp$ # create a file inside a btrfs fs
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-orig bs=1M count=1
ghigo@venice:/tmp$ sudo umount t
ghigo@venice:/tmp$ # make a lvm snapshot and add a 2nd file
ghigo@venice:/tmp$ sudo lvcreate -s -n lvone_snap -L 3G vgtest/lvone
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-snap bs=1M count=1
ghigo@venice:/tmp$ sudo umount t
ghigo@venice:/tmp$ # mount the first lv, and check the file
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ ls -l t
total 1024
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
ghigo@venice:/tmp$ sudo umount t
ghigo@venice:/tmp$ # mount the snapshot lv, and check the files
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ ls -l t
total 2048
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
-rw-r--r-- 1 root root 1048576 Nov 22 18:12 disk-snap

On the basis of the example above, if you want to mount a single-disk filesystem, BTRFS seems to me to work properly. You only have to take care not to mount the two filesystems at the same time.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Volume/subvolume UUID uniqueness, was: BTRFS messes up snapshot LV with origin
I don't know how to fix this but I've convinced myself there's at least a small problem. And not just with LVM snapshots as in the originating thread.

- Via the seed device method of creating a Btrfs volume, the resulting volume gets a new UUID. The volume UUID from the seed device doesn't pass through; it is not inherited or copied. Therefore there's already recognition that snapshotting a Btrfs volume, which is what volume creation from a seed device effectively is, should result in the new volume getting a new UUID. Therefore it seems reasonable that a mechanism is needed to support new volume UUIDs when LVM snapshots are taken. Maybe leveraging existing seed code can help: consider the existing volume data a virtual seed device, and the remaining free space a virtual added device, to enable changing the volume UUID rather than rewriting possibly piles of UUIDs.

- While the seed device method of creating a Btrfs volume results in a new volume UUID, subvolume UUIDs from the seed pass through to the new volume. Since I can create many new volumes from one seed device, in effect I'm creating many instances of subvolumes with identical UUIDs, which can no longer be differentiated, locally or remotely. This seems to be a much bigger problem than the LVM case, since it occurs with only Btrfs tools being used.

The grandiose idea of UUIDs is persistence in identifying a specific object/resource for all time, anywhere in the universe. Reducing this to something practical, it should enable a way to identify an object or resource within one or two human lifetimes, within our solar system. Yet the current implementation has broken this on a much shorter time scale, on a single computer.

Since we recognize subvolume snapshots should get new subvolume UUIDs, and volume snapshots via seed device method creation of new volumes get new volume UUIDs; a volume snapshot of course is also snapshotting the subvolumes too, so the subvolume UUIDs can't pass through the way they do right now. It's not correct behavior.

Another matter is what to do with parent uuid and snapshot relationship metadata in the new volume. Assuming all subvolumes get new UUIDs on the new volume, there are three possibilities:

1. parent uuid is always blank, no relationships between subvolumes are preserved
2. parent uuid is the uuid of its identical mirror (the original) in the seed device.
3. parent uuid is the new uuid of its relative parent on the current new volume, preserving relationships between subvolumes and snapshots.

I think any of those three is better than UUID duplication (recycling, actually). Maybe I'm not thinking of a use case for preserving these UUIDs but at the moment I think it's specious. We can't be attached to specific UUIDs; the instant a subvolume is effectively snapshotted by LVM or a Btrfs seed device, it's a unique object/resource, and should have its own URN. After all, by default these objects are read/write. Maybe if by default they were readonly I could be convinced of the validity of UUID preservation.

--
Chris Murphy
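The three parent-uuid options listed above can be compared with a small model. This is a sketch only: the `Subvol` record, the policy names `blank`, `seed` and `remap`, and the function `reissue_uuids` are all hypothetical and do not correspond to any btrfs data structure or tool. It assumes every subvolume copied from the seed gets a fresh UUID, and differs only in what is written to the parent field.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subvol:
    name: str
    uuid: str
    parent_uuid: Optional[str]  # None models a blank parent uuid


def reissue_uuids(subvols, policy):
    """Give every subvolume a new UUID and fill parent_uuid per `policy`.

    blank: option 1, parent uuid is always blank.
    seed:  option 2, parent uuid is the subvolume's own original UUID.
    remap: option 3, parent uuid is the *new* UUID of the original parent.
    """
    if policy not in ("blank", "seed", "remap"):
        raise ValueError(f"unknown policy: {policy}")
    # Map each original UUID to a freshly generated one.
    new_ids = {sv.uuid: str(uuid.uuid4()) for sv in subvols}
    result = []
    for sv in subvols:
        if policy == "blank":
            parent = None
        elif policy == "seed":
            parent = sv.uuid
        else:
            # Stays None when the subvolume had no parent on the seed.
            parent = new_ids.get(sv.parent_uuid)
        result.append(Subvol(sv.name, new_ids[sv.uuid], parent))
    return result


seed = [Subvol("root", "u-root", None), Subvol("snap", "u-snap", "u-root")]
for policy in ("blank", "seed", "remap"):
    print(policy, reissue_uuids(seed, policy))
```

Only the `remap` policy keeps the snapshot relationship usable on the new volume without referring back to the seed; the other two trade that for simplicity.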
Re: Volume/subvolume UUID uniqueness, was: BTRFS messes up snapshot LV with origin
On 11/22/2014 02:50 PM, Robert White wrote:

Take a couple snapshots of a subvolume, and then send those subvolumes to another file system with send/receive, and then do btrfs subvolume list -u -q on the two filesystems and tell me that mess makes sense. Or try to recreate a subvolume from its snapshot in a way that doesn't shatter the relationships in your backup scheme. (I'm researching for a couple patches but I'm not expecting a warm reception given the silence to date).

(ASIDE: In particular use btrfs sub send -c SNAP1 SNAP2 and then btrfs sub send -c SNAP2 SNAP3 etc. before doing the btrfs sub list -u -q to view the mess I speak of.)
Re: BTRFS messes up snapshot LV with origin
On Sat, Nov 22, 2014 at 06:34:38PM +0100, Goffredo Baroncelli wrote:
On 11/21/2014 05:28 AM, Zygo Blaxell wrote:

e.g. if an ext4 filesystem explodes, I can:

1. make a LVM snapshot of the broken filesystem
2. run e2fsck on the snapshot
3. mount and repair the snapshot, e.g. rsync any missing files from backups, salvage anything that survived
4. LVM merge the snapshot to its origin volume
5. umount the origin volume and mount the merged volume (or just reboot)

...and I can do all of this on a running system, in-place, with only a few minutes of downtime in the must-reboot case.

None of the above works with btrfs at all. Multi-device btrfs fails at 2,

You can't compare ext4 with btrfs if you are talking about a multi-device filesystem: ext4 doesn't have this capability.

btrfs fails this comparison as a single-device filesystem.

Try to make an md-raid over snapshotted logical volume(s); I never tried that, but I suppose that there will be the same problems...

md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case.

and mounting the filesystem fails at 3.

Are you sure?

Yes, I'm sure. I've had to replace filesystems destroyed this way.

[working instance snipped]

On the basis of the example above, if you want to mount a single-disk filesystem, BTRFS seems to me to work properly. You only have to take care not to mount the two filesystems at the same time.

The problem is btrfs stops searching when it sees one disk with each UUID, so the set of disks (snapshot vs origin) that you get is *random*. For a pair of origin + snapshots, there's a 50% chance it works, 50% chance it eats your data.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
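The 50% figure can be reproduced by enumeration, and the same count shows how the odds change with more member devices. This is a sketch under an assumption the thread does not confirm: that for each member the origin and each snapshot copy are equally likely to be picked, independently of the other members. The function name `chance_all_origin` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def chance_all_origin(member_count, copies_per_member=2):
    """Probability that every member device resolves to its origin LV.

    Each member has `copies_per_member` candidates with the same UUID
    (index 0 is the origin, the rest are snapshots). Assumes a uniform,
    independent pick per member.
    """
    candidates = range(copies_per_member)
    outcomes = list(product(candidates, repeat=member_count))
    # An outcome is good only if the origin (index 0) was picked everywhere.
    good = sum(1 for picked in outcomes if all(c == 0 for c in picked))
    return Fraction(good, len(outcomes))


print(chance_all_origin(1))  # 1/2, the single-device case from the message
print(chance_all_origin(2))  # 1/4, a two-device filesystem with one snapshot each
```

Under that assumption, each extra member device or extra snapshot makes a clean assembly less likely, which fits the reports of multi-device filesystems being hit hardest.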
Re: BTRFS messes up snapshot LV with origin
On 11/20/2014 10:22 PM, Duncan wrote:

But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name.

You take Us away, not add them:

UID = unique ID
GUID = globally unique ID
UUID = universally unique ID

And other file systems have the same issues. XFS, for example, uses UUIDs in the same way. It just has a command to re-brand the filesystem's UUID, which you apply to the LVM snapshot immediately after taking the snapshot. (The problem has been established and understood since 2009 or so.) I don't know if this approach would work for Btrfs with subvolumes.

Example Citation :: http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/

XFS also has the nouuid mount option. btrfs has the device= mount option.

But any system with unique ids will have this identical issue when block-snapshot support is added underneath.

-- Rob.
Re: BTRFS messes up snapshot LV with origin
Robert White posted on Fri, 21 Nov 2014 03:35:05 -0800 as excerpted:
On 11/20/2014 10:22 PM, Duncan wrote:

But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name.

You take Us away, not add them:

UID = unique ID
GUID = globally unique ID
UUID = universally unique ID

I was making a joke, as I happened to notice un-UUID = 3 U-s just as I was writing that. Universally unique ID = UUID, un-UUID (not universally unique ID) = UUUID = U^3ID. =:^)

Of course formally it'd be NUID (not/non-unique) or some such, but un-UUID served my purpose well enough, including the joke once I noticed it, so...

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On Fri, Nov 21, 2014 at 06:22:57AM +, Duncan wrote:

After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints.

That isn't true at all. A repairing fsck can take less than 1% of the overall volume size, and a full conversion from another filesystem type can take less than 10%. Usually I can find enough space by blowing away the swap LV for a few hours.

I do NOT usually have 13TB of slack space lying around in a 26TB disk array, nor do I have enough bandwidth to move those 13TB to another machine without great inconvenience.

But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name.

It's not a bug as long as I can completely control which devices are searched for UUIDs, and the system behaves sanely when multiple UUIDs are found through automatic discovery; otherwise, it's not only a bug, it's a DoS attack security vulnerability.

Consider what happens if someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a fake filesystem with those UUIDs, puts the fake filesystem on a USB stick, and plugs it back into the victim machine...

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On Thu, Nov 20, 2014 at 11:22 PM, Duncan 1i5t5.dun...@cox.net wrote:

When I have such a filesystem level problem, I simply dd from the backing device to some other location, generally to a file that's on a different filesystem (preferably non-btrfs, I use reiserfs as I've found it very resilient, here), in which case btrfs device scan won't see the UUID on the copy as it scans block devices, not inside non-device files.

That's hours of dd and you have to find space to do it.

After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints.

That's not true for thin volume snapshots. They take up next to no space upon creation, and they don't need space reserved in advance. They're more like a qcow2 snapshot than a conventional LVM snapshot; a big difference being that if you delete the snapshot, or you delete a bunch of files in a thin volume and follow it with fstrim, the unused extents are returned to the thin pool.

There has been a fragmentation problem with thin volumes; I don't know if that's solved yet. And I don't know if it exacerbates things with Btrfs fragmentation.

--
Chris Murphy
Re: BTRFS messes up snapshot LV with origin
Chris Murphy posted on Fri, 21 Nov 2014 11:23:45 -0700 as excerpted:
On Thu, Nov 20, 2014 at 11:22 PM, Duncan 1i5t5.dun...@cox.net wrote:

When I have such a filesystem level problem, I simply dd from the backing device to some other location, generally to a file that's on a different filesystem (preferably non-btrfs, I use reiserfs as I've found it very resilient, here), in which case btrfs device scan won't see the UUID on the copy as it scans block devices, not inside non-device files.

That's hours of dd and you have to find space to do it.

I did it recently here. There's a method to my sub-100-GiB partition madness! =:^) The partitions in question were on SSD, and were small enough I could simply dd them to files on my media filesystem, which was after all designed to be able to take full ISO images, etc. Additionally, due to size and reasonably consistent linear intra-file access patterns, the media filesystem's still on much cheaper spinning rust, while most of the system's on much faster to random-access but far more expensive SSD, so in this case one side was SSD, the other spinning rust.

Tho granted, if you're doing single-partition/filesystem multi-TiB filesystems, it does get to be a problem. As there would have been if the filesystem in question was the media filesystem, altho that one's not yet btrfs for a reason.

But still, if there's room enough for an LVM snapshot in the first place, with a different layout, there'd be room for the same data as a file. That's pretty basic.

After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints.

That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance.

Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^)

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
Zygo Blaxell posted on Fri, 21 Nov 2014 12:56:23 -0500 as excerpted:

It's not a bug as long as I can completely control which devices are searched for UUIDs, and the system behaves sanely when multiple UUIDs are found through automatic discovery; otherwise, it's not only a bug, it's a DoS attack security vulnerability.

Consider what happens if someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a fake filesystem with those UUIDs, puts the fake filesystem on a USB stick, and plugs it back into the victim machine...

With the current state of USB vulnerability (firmware reprogrammed as an input device, etc; the vuln has been all over the tech news for some months now), anyone with USB access to the machine is simply another case of anyone with physical access to the machine. They're normally assumed to be able to at minimum take down the machine, the ultimate DoS, in any case, and often to have effective root, tho that can be mitigated to some extent with encryption, etc.

It's generally assumed that if you have physical access, as required to plug in that USB, game over, the machine is effectively p40wn3d. At the /very/ least, with physical access it's vulnerable to the sledgehammer DoS, and there's little to be done about that but prevent physical access by all means necessary (armed guards, nuclear silo hosting, etc) in the first place.

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
Duncan posted on Fri, 21 Nov 2014 22:49:06 + as excerpted:
Chris Murphy posted...

That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance.

Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^)

Oops, I mis-parsed thin. Good point and thanks, Chris.

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
Duncan posted on Fri, 21 Nov 2014 23:41:49 + as excerpted:
Duncan posted on Fri, 21 Nov 2014 22:49:06 + as excerpted:
Chris Murphy posted...

That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance.

Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^)

Oops, I mis-parsed thin. Good point and thanks, Chris.

... And Zygo, who pointed out my error as well. =:^)

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On Mon, Nov 17, 2014 at 08:04:05PM +0100, Goffredo Baroncelli wrote:
On 2014-11-17 07:59, Brendan Hide wrote:

That leaves two aspects of this issue which I view as two separate bugs:

a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all.
b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID.

I feel a) is a btrfs bug. I feel b) is a bug that is more about ecosystem design than grub being silly.

Regarding a): IIRC, btrfs collects the filesystem information by UUID; if two filesystems have the same UUID (like the LVM-snapshot case), the last filesystem discovered overwrites the first one. The filesystem discovery is done in user-space, so it should be simple to skip a filesystem on an LVM snapshot.

Regarding b): I am a bit confused: if I understood correctly, the root filesystem was picked from an LVM snapshot, so grub-probe *correctly* reported that the root device is the snapshot. The problem was in the filesystem discovery during boot: it first scanned the *real* device, then the LVM snapshot; the latter overwrote the former, so the system booted from the LVM snapshot.

IMHO if the device UUID search finds multiple devices with the same device UUID, it should ignore _all_ of them as the identification problem is unsolvable without further user input. This is what the 'device=' mount option is for.

My conclusion is that we should improve the btrfs scan so that:

- in udev rules, a partition that is an LVM snapshot should by default not be scanned by btrfs dev scan
- btrfs dev scan, during partition discovery, should skip the LVM snapshot.

That would mean I can't do this:

1. lvm snapshot of ext4 filesystem
2. btrfs-convert the snapshot
3. mount the snapshot, make sure it's OK
4. merge LVM snapshot to overwrite original ext4 filesystem

which would be a shame since that's the only way I ever convert ext3/4 filesystems to btrfs (btrfs-convert is a little buggy still).

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
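The "ignore all of them unless the user says otherwise" rule proposed above can be sketched in a few lines. This is not how btrfs device scanning is implemented; the function name `resolve_devices`, the `(path, uuid)` input shape, and the `explicit` set (standing in for devices named via `device=`) are assumptions made for illustration.

```python
from collections import defaultdict


def resolve_devices(scanned, explicit=None):
    """Split scan results into usable and ambiguous device UUIDs.

    scanned:  iterable of (device_path, device_uuid) pairs.
    explicit: optional set of paths the user named, which is the only
              input allowed to break a tie between duplicates.
    Returns (usable, ambiguous): usable maps uuid -> single path,
    ambiguous maps uuid -> sorted list of conflicting paths.
    """
    by_uuid = defaultdict(list)
    for path, dev_uuid in scanned:
        by_uuid[dev_uuid].append(path)

    usable, ambiguous = {}, {}
    for dev_uuid, paths in by_uuid.items():
        named = [p for p in paths if explicit and p in explicit]
        candidates = named if named else paths
        if len(candidates) == 1:
            usable[dev_uuid] = candidates[0]
        else:
            # Duplicates with no user hint: refuse to pick any of them.
            ambiguous[dev_uuid] = sorted(candidates)
    return usable, ambiguous


# Hypothetical scan: an origin LV, its snapshot, and an unrelated disk.
scan = [("/dev/vg/lvone", "aaa"), ("/dev/vg/lvone_snap", "aaa"), ("/dev/sdb1", "bbb")]
print(resolve_devices(scan))
print(resolve_devices(scan, explicit={"/dev/vg/lvone"}))
```

Because snapshots are not skipped outright, a snapshot can still be selected by naming it, which keeps workflows such as running btrfs-convert on an LVM snapshot possible.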
Re: BTRFS messes up snapshot LV with origin
On Wed, Nov 19, 2014 at 10:20:17AM -0500, Phillip Susi wrote:
On 11/18/2014 9:54 PM, Chris Murphy wrote:

Why is it silly? Btrfs on a thin volume has a practical use case aside from just being thinly provisioned: its snapshots are block device based, not merely that of an fs tree.

Umm... because one of the big selling points of btrfs is that it is in a much better position to make snapshots being aware of the fs tree rather than doing it in the block layer.

One of the big selling points of LVM is that it is in a much better position to make snapshots so you can run btrfsck on the shattered remains of your broken btrfs filesystem.

The UUID-driven behavior of btrfs is _really extremely annoying_. No other filesystem forces me to jump through the hoops btrfs does to get routine admin tasks done.

e.g. if an ext4 filesystem explodes, I can:

1. make a LVM snapshot of the broken filesystem
2. run e2fsck on the snapshot
3. mount and repair the snapshot, e.g. rsync any missing files from backups, salvage anything that survived
4. LVM merge the snapshot to its origin volume
5. umount the origin volume and mount the merged volume (or just reboot)

...and I can do all of this on a running system, in-place, with only a few minutes of downtime in the must-reboot case.

None of the above works with btrfs at all. Multi-device btrfs fails at 2, and mounting the filesystem fails at 3.

The closest I've gotten to this workflow is to set up a kvm instance that can see only the LVM snapshots, and run the btrfsck or rsync there--and hope that the system doesn't crash and reboot during that time, or the filesystem will be more or less destroyed by the random combination of origin and snapshot LVs.

I've also learned the hard way to always make an LVM snapshot before running btrfsck, just in case you discover a new btrfsck bug with your filesystem. That at least works for single-device btrfs filesystems.

So it is kind of silly in the first place to be using lvm snapshots under btrfs, but it is doubly silly to use lvm for snapshots, and btrfs for the mirroring rather than lvm. Pick one layer and use it for both functions. Even if that is lvm, then it should also be handling the mirroring.
Re: BTRFS messes up snapshot LV with origin
On 11/18/2014 9:54 PM, Chris Murphy wrote:
> Why is it silly? Btrfs on a thin volume has practical use case aside from just being thinly provisioned, its snapshots are block device based, not merely that of an fs tree.

Umm... because one of the big selling points of btrfs is that it is in a much better position to make snapshots being aware of the fs tree rather than doing it in the block layer.

So it is kind of silly in the first place to be using lvm snapshots under btrfs, but it is doubly silly to use lvm for snapshots, and btrfs for the mirroring rather than lvm. Pick one layer and use it for both functions. Even if that is lvm, then it should also be handling the mirroring.
Re: BTRFS messes up snapshot LV with origin
On Wed, Nov 19, 2014 at 8:20 AM, Phillip Susi ps...@ubuntu.com wrote:
> On 11/18/2014 9:54 PM, Chris Murphy wrote:
>> Why is it silly? Btrfs on a thin volume has practical use case aside from just being thinly provisioned, its snapshots are block device based, not merely that of an fs tree.
>
> Umm... because one of the big selling points of btrfs is that it is in a much better position to make snapshots being aware of the fs tree rather than doing it in the block layer.

This is why we have fsfreeze before taking block level snapshots. And I point out that consistent snapshots with Btrfs have posed challenges too; there's a recent fstest for snapshotting after file write + truncate for this reason.

A block layer snapshot will snapshot the entire file system, not just one tree. We don't have a way in Btrfs to snapshot the entire volume. Considering how things still aren't exactly stable yet, in particular with many snapshots, it's not unreasonable to want to freeze then snapshot the entire volume before doing some possibly risky testing or usage where even a Btrfs snapshot doesn't protect your entire volume should things go wrong.

> So it is kind of silly in the first place to be using lvm snapshots under btrfs, but it is doubly silly to use lvm for snapshots, and btrfs for the mirroring rather than lvm. Pick one layer and use it for both functions. Even if that is lvm, then it should also be handling the mirroring.

Thin volumes are more efficient. And the user creating them doesn't have to mess around with locating physical devices or possibly partitioning them. Plus in enterprise environments with lots of storage and many different kinds of use cases, even knowledgeable users aren't always granted full access to the physical storage anyway. They get a VG to play with, or now they can have a thin pool and only consume on storage what is actually used, and not what they've reserved. You can mkfs a 4TB virtual size volume, while it only uses 1MB of physical extents on storage. And all of that is orthogonal to using XFS or Btrfs, which again comes down to use case.

And whether I'd have LVM mirror or Btrfs mirror is again a question of use case. Maybe I'm OK with LVM mirroring and I just get the rare corrupt file warning and that's OK. In another use case, corruption isn't OK, I need higher availability of known good data, therefore I need Btrfs doing the mirroring.

So I find your argument thus far uncompelling.

Chris Murphy
Re: BTRFS messes up snapshot LV with origin
On 11/19/2014 1:33 PM, Chris Murphy wrote:
> Thin volumes are more efficient. And the user creating them doesn't have to mess around with locating physical devices or possibly partitioning them. Plus in enterprise environments with lots of storage and many different kinds of use cases, even knowledgeable users aren't always granted full access to the physical storage anyway. They get a VG to play with, or now they can have a thin pool and only consume on storage what is actually used, and not what they've reserved. You can mkfs a 4TB virtual size volume, while it only uses 1MB of physical extents on storage. And all of that is orthogonal to using XFS or Btrfs, which again comes down to use case.
>
> And whether I'd have LVM mirror or Btrfs mirror is again a question of use case. Maybe I'm OK with LVM mirroring and I just get the rare corrupt file warning and that's OK. In another use case, corruption isn't OK, I need higher availability of known good data, therefore I need Btrfs doing the mirroring.

Correct me if I'm wrong, but this kind of setup is basically where you have a provider running an lvm thin pool volume on their hardware, and exposing it to the customer's vm as a virtual disk. In that case, the provider can do their snapshots and it won't cause this problem, since the snapshots aren't visible to the vm.

Also, in these cases the provider is normally already providing data protection by having the vg on a raid6 or raid60 or something, so having the client vm mirror the data in btrfs is a bit redundant.
Re: BTRFS messes up snapshot LV with origin
Chris Murphy posted on Mon, 17 Nov 2014 23:21:57 -0700 as excerpted:
> I think we’re well past the expiration date on grub.cfg, a line should be drawn in the sand to deprecate routine use of os-prober + grub-mkconfig, and move to drop-in scripts by whatever the distro presumes will be responsible for managing what “tree” will be booted or will be offered as a boot option, all GRUB needs to learn is how to use that drop in script file format.
>
> Ergo just because I’ve snapshot my root does not mean grub-mkconfig should be creating boot entries for it. But whatever userspace tool I’m using to do those snapshots (ostree, snapper, whatever the GNOME folks might come up with) should be the thing that creates the boot entry script; or as simple as this 2-4 line script should be, even hand done by a user, unlike the current grub.cfg file format.

FWIW, I hand-edit my grub.cfg here. grub-probe was taking /forever/ on my system back when I upgraded to grub2, and editing grub.cfg directly was /far/ more flexible, or at least it was /far/ easier to learn how to do what I wanted that way than to figure out how to do it through the translation layer.

The configuration is advanced enough that it has individual choices to set standard init and init=/bin/bash, current/fallback/stable kernels, current/backup/second-backup roots, etc., plus a choice to interactively type in additional kernel commandline options, loading those choices into grub variables as I change them, then another choice to boot using the loaded variables to select the kernel and set up the kernel commandline. The initial grub.cfg has the default boot option, plus others that load either a troubleshooting menu or the backups choices menu, from separate included config files, as necessary.

Just /thinking/ about trying to do that via the cumbersome translation layer gives me a headache, and since I had to learn the grub scripting language to set it up anyway, I might as well write and troubleshoot it in that directly rather than trying to figure out how to get the translation layer to write it, and then have to troubleshoot BOTH the translation layer and the lower level script.

Then I deleted grub-probe and grub-mkconfig so they couldn't be run accidentally with unconfigured/default translation-level options to undo all my hard work, and set a mask on them so updating the package wouldn't reinstall them.

So deprecate/kill os-prober and grub-mkconfig if you want, but grub.cfg needs to stay working!

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
On 11/18/2014 1:16 AM, Chris Murphy wrote:
> If fstab specifies rootfs as UUID, and there are two volumes with the same UUID, it’s now ambiguous which one at boot time is the intended rootfs. It’s no different than the days of /dev/sdXY where X would change designations between boots = ambiguity and why we went to UUID.

He already said he has NOT rebooted, so there is no way that the snapshot has actually been mounted, even if it were UUID confusion.

> So we kinda need a way to distinguish derivative volumes. Maybe XFS and ext4 could easily change the volume UUID, but my vague recollection is this is difficult on Btrfs? So that led me to the idea of a way to create an on-the-fly (but consistent) “virtual volume UUID” maybe based on a hash of both the LVM LV and fs volume UUID.

When using LVM, you should be referring to the volume by the LVM name rather than UUID. LVM names are stable, and don't have the duplicate uuid problem.
Re: BTRFS messes up snapshot LV with origin
On Nov 18, 2014, at 8:42 AM, Phillip Susi ps...@ubuntu.com wrote:
> On 11/18/2014 1:16 AM, Chris Murphy wrote:
>> If fstab specifies rootfs as UUID, and there are two volumes with the same UUID, it’s now ambiguous which one at boot time is the intended rootfs. It’s no different than the days of /dev/sdXY where X would change designations between boots = ambiguity and why we went to UUID.
>
> He already said he has NOT rebooted, so there is no way that the snapshot has actually been mounted, even if it were UUID confusion.
>
>> So we kinda need a way to distinguish derivative volumes. Maybe XFS and ext4 could easily change the volume UUID, but my vague recollection is this is difficult on Btrfs? So that led me to the idea of a way to create an on-the-fly (but consistent) “virtual volume UUID” maybe based on a hash of both the LVM LV and fs volume UUID.
>
> When using LVM, you should be referring to the volume by the LVM name rather than UUID. LVM names are stable, and don't have the duplicate uuid problem.

What if you have a Btrfs raid1 volume using two LV’s and then snapshot both LV’s? Of course I’d specify one of the devices by VG-LV name. But Btrfs finds additional devices itself; it doesn’t support explicitly naming additional member devices. And in this example, there are two identical candidates, so it’s ambiguous to Btrfs which one to use. Further, it’s unknown to the user which one Btrfs chose, because neither mount nor /proc/mounts right now shows anything other than the first device that’s mounted. So it’s using one of those two VG-LV’s automatically but not informing us which one.

I think there’s some metadata that can be set on each LV controlling whether it’s automatically activated (e.g. at boot time), so I think the thing to do would be to make sure the snapshot LV’s are not activated; then their UUID’s shouldn’t be visible to Btrfs and it won’t automatically discover and use the wrong LV. But I haven’t tested this.

Chris Murphy
Re: BTRFS messes up snapshot LV with origin
On 2014-11-18 07:21, Chris Murphy wrote:
> Ergo just because I’ve snapshot my root does not mean grub-mkconfig should be creating boot entries for it.

I find this a useful feature: a snapshot of / is made in order to roll back some changes, so why not let grub start (the kernel) from it?

Anyway, I find grub-mkconfig quite useful for a standard user. For more advanced use cases, editing grub.cfg by hand may be possible.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS messes up snapshot LV with origin
2014-11-18 16:42 GMT+01:00 Phillip Susi ps...@ubuntu.com:
> On 11/18/2014 1:16 AM, Chris Murphy wrote:
>> If fstab specifies rootfs as UUID, and there are two volumes with the same UUID, it’s now ambiguous which one at boot time is the intended rootfs. It’s no different than the days of /dev/sdXY where X would change designations between boots = ambiguity and why we went to UUID.
>
> He already said he has NOT rebooted, so there is no way that the snapshot has actually been mounted, even if it were UUID confusion.

That's right. Anyway, I've built a system to reproduce the bug. You can download the image and run it with KVM or other virtualization technology. Instructions are straightforward – if you start the VM, you'll know what to do, and you'll see what I was talking about.

http://undead.megabrutal.com/kvm-reproduce-1391429.img.xz

Download size: 113 MB; Unpacked image size: 2 GB.

>> So we kinda need a way to distinguish derivative volumes. Maybe XFS and ext4 could easily change the volume UUID, but my vague recollection is this is difficult on Btrfs? So that led me to the idea of a way to create an on-the-fly (but consistent) “virtual volume UUID” maybe based on a hash of both the LVM LV and fs volume UUID.
>
> When using LVM, you should be referring to the volume by the LVM name rather than UUID. LVM names are stable, and don't have the duplicate uuid problem.

I use LVM names to identify volumes. I initially suspected a UUID confusion, because I thought grub-probe looks for the volume by UUID. But now I think the problem has nothing to do with UUIDs. I probably should have looked deeper into the problem before I hypothesized.
Re: BTRFS messes up snapshot LV with origin
On 11/18/2014 07:42 AM, Phillip Susi wrote:
> On 11/18/2014 1:16 AM, Chris Murphy wrote:
>> (stuff about UUIDs and LVM snapshots).
> (suggestion to use LVM paths instead).

This is also an XFS+LVM+LVM_Snapshot problem going back to at least 2009. It's inherent to the block-device-level snapshot phenomenon.

q.v. http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/ et al

In XFS you attack the snapshot with a command to regenerate the UUID as soon as you take the snapshot.

I don't think there is a "regenerate all my UUIDs" command for BTRFS.

There are other places this can bone you, like old-format mdadm mirrors, where the metadata was only at the end of the partition, so you could accidentally see two copies of your RAID1 file system if you hadn't built/started the array.

There is no really good way to prevent this other than being really careful or not doing that at all.

Sorry. Cost of doing business.

Cheers... Rob.
Re: BTRFS messes up snapshot LV with origin
On Nov 18, 2014, at 1:17 PM, Phillip Susi ps...@ubuntu.com wrote:
> On 11/18/2014 2:17 PM, Chris Murphy wrote:
>> What if you have a Btrfs raid1 volume using two LV’s and then snapshot both LV’s?
>
> That's even more silly than a single lvm snapshot under btrfs. Just don't do it.

Why is it silly? Btrfs on a thin volume has practical use case aside from just being thinly provisioned, its snapshots are block device based, not merely that of an fs tree.

Looks like lvm.conf does have a way to affect LV autoactivation, and there may be another way to achieve this also. Right after the snapshot(s) are taken they’d need to have their autoactivation disabled to avoid UUID confusion.

Chris Murphy
Re: BTRFS messes up snapshot LV with origin
Robert White posted on Tue, 18 Nov 2014 17:29:12 -0800 as excerpted:
> On 11/18/2014 07:42 AM, Phillip Susi wrote:
>> On 11/18/2014 1:16 AM, Chris Murphy wrote:
>>> (stuff about UUIDs and LVM snapshots).
>> (suggestion to use LVM paths instead).
>
> This is also an XFS+LVM+LVM_Snapshot problem going back to at least 2009. It's inherent to the block-device-level snapshot phenomenon.
>
> q.v. http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/ et al
>
> In XFS you attack the snapshot with a command to regenerate the UUID as soon as you take the snapshot.
>
> I don't think there is a "regenerate all my UUIDs" command for BTRFS.

Which was part of my point in my reply. Btrfs embeds the UUID in the metadata deeply enough that it's no simple task to change it to something else and be done. It's quite a complicated operation for any (future, none current) tool that attempts it, with the most likely candidate being an option to btrfs balance or the like, but even then, we're looking at a timescale of hours for spinning rust.

So while it's possible in theory, in practice such a regenerate-all-UUIDs command for btrfs isn't available yet. Given the time involved in rewriting all those metadata UUIDs to something else, during which the filesystem is in a critically unstable state, and the limited use-case when other alternatives exist, such a tool isn't all /that/ practical in any case.

Making an entirely new btrfs and doing a btrfs send/receive for the duplicate, or using btrfs snapshots, is a more practical way to go. (Tho watch out for the implications of btrfs snapshots on nocow files!)

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS messes up snapshot LV with origin
2014-11-17 7:59 GMT+01:00 Brendan Hide bren...@swiftspirit.co.za:
> Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code.

Yesterday, when I reproduced the phenomenon on a VM, I found something rather interesting: even /proc/mounts incorrectly reports that the snapshot is mounted instead of the root FS. Note, there was no reboot. Just create an LVM snapshot and then check /proc/mounts.

I couldn't reproduce the same with non-root file systems. It seems this only appears when the device in question is mounted as root FS.

> Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying LVM snapshots are incompatible with btrfs is the right way to go either.

Before I did a release upgrade, just to be safe, I made both (LVM and btrfs snapshot).

> That leaves two aspects of this issue which I view as two separate bugs:
> a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all.
> b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID.
>
> I feel a) is a btrfs bug.
> I feel b) is a bug that is more about ecosystem design than grub being silly.
>
> I imagine a couple of aspects that could help fix a):
> - Utilise a unique drive identifier in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. This will provide a sensible mechanism for defined behaviour, preventing corruption - even if that defined behaviour is to simply give out lots of PEBKAC errors and panic.
> - Utilise a drive list to ensure that two unrelated filesystems with the same UUID cannot get mixed up. Yes, the user/admin would likely be the culprit here (perhaps a VM rollout process that always gives out the same UUID in all its filesystems). Again, does btrfs not already have something like this built-in that we're simply not utilising fully?
>
> I'm not exactly sure of the correct way to fix b) except that I imagine it would be trivial to fix once a) is fixed.

Note that everything that is written into the file system's metadata gets duplicated with an LVM snapshot. So a unique drive identifier wouldn't solve the problem, as it would also get replicated, and BTRFS would still see two identical devices. But devices on Linux have major and minor numbers that uniquely identify devices while they are attached. The original and the snapshot device have different major/minor numbers, and that would be quite enough to differentiate the devices while they are being opened/mounted.

By the way, I actually made an entire release upgrade with the snapshot being there and being reported incorrectly. This would have caused enough corruption in the file system that I would have surely noticed it. But I didn't perceive any data corruption. BTRFS didn't actually write to the snapshot device. It seems the device is only mixed up in /proc/mounts, so the problem is probably not as severe as we think, and wouldn't require fundamental changes to BTRFS to fix it.
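
To illustrate the major/minor point above: a minimal Python sketch, not anything taken from btrfs or LVM. The helper names `device_number` and `same_device` are made up for this example, and the LV paths in the usage note are hypothetical. It reads the device number with `os.stat`, which follows symlinks such as the ones LVM creates under /dev, so an origin LV and its snapshot should come back with different (major, minor) pairs even though their on-disk metadata is identical.

```python
import os
import stat


def device_number(path):
    """Return (major, minor) of the block device that `path` refers to."""
    st = os.stat(path)  # follows symlinks to the underlying device node
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError(f"{path} is not a block device")
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def same_device(path_a, path_b):
    """True only if both paths name the same attached block device."""
    return device_number(path_a) == device_number(path_b)
```

On a system with LVM one would expect something like `same_device("/dev/vg0/root", "/dev/vg0/root-snap")` to be False, assuming those two hypothetical LVs exist.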
Re: BTRFS messes up snapshot LV with origin
On 2014/11/17 09:35, Daniel Dressler top-posted:
> If a UUID is not unique enough how will adding a second UUID or "unique drive identifier" help?

A UUID is *supposed* to be unique by design. Isolated, the design is adequate. But the bigger picture clearly shows the design is naive. And broken.

A second per-disk id (note I said "unique" - but I never said universal as in UUID) would allow for better-defined behaviour where, presently, we're simply saying "current behaviour is undefined and you're likely to get corruption".

On the other hand, I asked already if we have IDs of some sort (how else do we know which disk a chunk is stored on?), thus I don't think we need to add anything to the format.

A simple scenario similar to the one the OP introduced:

Disk sda - says it is UUID Z with diskid 0
Disk sdb - says it is UUID Z with diskid 0

If we're ignoring the fact that there are two disks with the same UUID and diskid and it causes corruption, then the kernel is doing something stupid but fixable. We have some choices:
- give a clear warning and ignore one of the disks (could just pick the first one - or be a little smarter and pick one based on some heuristic - for example extent generation number)
- give a clear error and panic

Normal multi-disk scenario:

Disk sda - UUID Z with diskid 1
Disk sdb - UUID Z with diskid 2

These two disks are in the same filesystem and are supposed to work together - no issues.

My second suggestion covers another scenario as well:

Disk sda - UUID Z with diskid 1; root block indicates that only diskid 1 is recorded as being part of the filesystem
Disk sdb - UUID Z with diskid 3; root block indicates that only diskid 3 is recorded as being part of the filesystem

Again, based on the existing featureset, it seems reasonable that this information should already be recorded in the fs metadata. If the behaviour is undefined and causing corruption, again the kernel is currently doing something stupid but fixable. Again, we have similar choices:
- give a clear warning and ignore bad disk(s)
- give a clear error and panic

2014-11-17 15:59 GMT+09:00 Brendan Hide bren...@swiftspirit.co.za:
> cc'd bug-g...@gnu.org for FYI
>
> On 2014/11/17 03:42, Duncan wrote:
>> MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted:
>>> Hello guys,
>>>
>>> I think you'll like this...
>>> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429
>>
>> UUID is an initialism for "Universally Unique IDentifier".[1]
>>
>> If the UUID isn't unique, by definition, then, it can't be a UUID, and that's a bug in whatever is making the non-unique would-be UUID that isn't unique and thus cannot be a universally unique ID. In this case that would appear to be LVM.
>
> Perhaps the right question to ask is "Where should this bug be fixed?".
>
> TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being "out of scope". The correct fix probably lies in the ecosystem design, which requires co-operation from btrfs.
>
> Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job "exactly as expected".
>
> Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old missing disk re-introduced, etc.
>
> That leaves two places where this can be fixed: grub and btrfs
>
> Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code.
>
> Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying "LVM snapshots are incompatible with btrfs" is the right way to go either.
>
> That leaves two aspects of this issue which I view as two separate bugs:
> a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all.
> b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID.
>
> I feel a) is a btrfs bug.
> I feel b) is a bug that is more about ecosystem design than grub being silly.
>
> I imagine a couple of aspects that could help fix a):
> - Utilise a unique drive identifier in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. This will provide a sensible mechanism for defined behaviour, preventing corruption - even if that defined behaviour is to simply give out lots of PEBKAC errors and panic.
> - Utilise a drive list to ensure that two unrelated filesystems with the same UUID cannot get mixed up. Yes, the user/admin would likely be the culprit here (perhaps a VM rollout process that always gives out the same UUID in all its filesystems). Again, does btrfs not already have something like this built-in that we're simply not utilising
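
The three disk scenarios in Brendan's reply (same UUID and same diskid; same UUID with distinct diskids; same UUID but disagreeing member lists) can be written down as a small check. This is only a sketch in Python under stated assumptions: the `Device` record, its `members` and `generation` fields, and the `check_assembly` function are hypothetical stand-ins for whatever btrfs actually records, and the "pick the newest generation" rule is just the heuristic floated in the message, not existing kernel behaviour.

```python
from collections import namedtuple

# Hypothetical per-device view of the metadata discussed in the thread.
Device = namedtuple("Device", "path fs_uuid diskid members generation")


def check_assembly(devices, on_duplicate="error"):
    """Validate devices that all claim the same fs_uuid.

    Returns (usable_devices, warnings). Raises ValueError where the
    thread suggests "give a clear error".
    """
    warnings = []
    by_id = {}
    for dev in devices:
        other = by_id.get(dev.diskid)
        if other is None:
            by_id[dev.diskid] = dev
            continue
        # Scenario 1: two devices with identical UUID and diskid.
        if on_duplicate == "error":
            raise ValueError(
                f"duplicate diskid {dev.diskid}: {other.path} and {dev.path}")
        keep = max(other, dev, key=lambda d: d.generation)
        dropped = dev if keep is other else other
        warnings.append(f"ignoring {dropped.path}, duplicate of {keep.path}")
        by_id[dev.diskid] = keep

    usable = list(by_id.values())
    if not usable:
        return [], warnings
    # Scenario 3: every device must record the same member list and be in it.
    expected = set(usable[0].members)
    for dev in usable:
        if set(dev.members) != expected or dev.diskid not in expected:
            raise ValueError(
                f"{dev.path} belongs to a different filesystem with the same UUID")
    return usable, warnings
```

With `on_duplicate="pick_newest"` the function takes the "clear warning and ignore one of the disks" route; any other value keeps the "clear error" route.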
Re: BTRFS messes up snapshot LV with origin
On 2014-11-17 07:59, Brendan Hide wrote:
> That leaves two aspects of this issue which I view as two separate bugs:
> a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all.
> b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID.
>
> I feel a) is a btrfs bug.
> I feel b) is a bug that is more about ecosystem design than grub being silly.

Regarding a): IIRC, btrfs collects the filesystem information by UUID; if two filesystems have the same UUID (like the LVM-snapshot case), the last filesystem discovered overwrites the first one. The filesystem discovery is done in user-space, so it should be simple to skip a filesystem on an LVM snapshot.

Regarding b): I am a bit confused. If I understood correctly, the root filesystem was picked from an LVM snapshot, so grub-probe *correctly* reported that the root device is the snapshot. The problem was in the filesystem discovery during boot: first the *real* device was scanned, then the LVM snapshot; the latter overwrote the former, so the system booted from the LVM snapshot.

My conclusion is that we should improve the btrfs scan so that:
- in the udev rules, a partition that is an LVM snapshot is by default not scanned by "btrfs dev scan"
- "btrfs dev scan", during the partition discovery, skips LVM snapshots.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
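
The "last discovered overwrites the first" behaviour recalled above, and the proposed fix of skipping snapshots during the scan, can be modelled in a few lines. This is a toy Python model, not btrfs code: the function names are invented here, and how the set of snapshot device paths would be obtained from LVM is deliberately left out as an assumption.

```python
from collections import defaultdict


def scan_last_wins(devices):
    """Naive scan: one registry slot per filesystem UUID."""
    registry = {}
    for path, fs_uuid in devices:
        # A device seen later silently replaces one seen earlier.
        registry[fs_uuid] = path
    return registry


def scan_skip_snapshots(devices, snapshot_paths):
    """Scan that ignores known snapshot devices and reports other duplicates.

    `snapshot_paths` is assumed to be supplied by the caller.
    Returns (registry, duplicates).
    """
    registry = {}
    duplicates = defaultdict(list)
    for path, fs_uuid in devices:
        if path in snapshot_paths:
            continue  # never register a block-level snapshot
        if fs_uuid in registry:
            duplicates[fs_uuid].append(path)  # keep the first, report the rest
            continue
        registry[fs_uuid] = path
    return registry, dict(duplicates)
```

Given an origin and its snapshot scanned in that order, `scan_last_wins` ends up pointing the UUID at the snapshot, which matches the mix-up described, while `scan_skip_snapshots` keeps the origin.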
Fwd: BTRFS messes up snapshot LV with origin
2014-11-17 20:04 GMT+01:00 Goffredo Baroncelli kreij...@inwind.it:
> Regarding b): I am a bit confused. If I understood correctly, the root filesystem was picked from an LVM snapshot, so grub-probe *correctly* reported that the root device is the snapshot.

This is not what happens. The system doesn't even get a reboot when the mix-up happens. You boot from the original device, create an LVM snapshot*, and mount starts to report the snapshot as the root device, while in fact it isn't.

I know my initial descriptions of the bug were misleading, as I myself didn't know what the heck was going on. From this point, please take these comments as reference:

https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429/comments/2
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429/comments/4

* I know I shouldn't make an LVM snapshot of a mounted file system, but this is not the point.

P.S.: E-mail sent twice, as the lists didn't accept it in HTML. Plus I'm not on the GRUB list, and can't post there.
Re: Fwd: BTRFS messes up snapshot LV with origin
On 2014-11-17 20:45, MegaBrutal wrote:
> * I know I shouldn't make an LVM-snapshot of a mounted file system, but this is not the point.

This should be supported for filesystems that support freezing. See
http://stackoverflow.com/questions/1940093/lvm-snapshot-of-mounted-filesystem

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: BTRFS messes up snapshot LV with origin
On Nov 17, 2014, at 12:45 PM, MegaBrutal megabru...@gmail.com wrote:
> 2014-11-17 20:04 GMT+01:00 Goffredo Baroncelli kreij...@inwind.it:
>> Regarding b) I am bit confused: if I understood correctly, the root filesystem was picked from a LVM-snapshot, so grub-probe *correctly* reported that the root device is the snapshot.
> This is not what happens. The system doesn't even get a reboot when the mix-up happens. You boot from the original device, create an LVM-snapshot*, and mount starts to report the snapshot as the root device, while in fact it isn’t.

If fstab specifies the rootfs by UUID, and there are two volumes with the same UUID, it’s now ambiguous at boot time which one is the intended rootfs. It’s no different from the days of /dev/sdXY, where X would change designations between boots: that ambiguity is why we went to UUIDs.

So we kinda need a way to distinguish derivative volumes. Maybe XFS and ext4 could easily change the volume UUID, but my vague recollection is that this is difficult on Btrfs? So that led me to the idea of a way to create an on-the-fly (but consistent) “virtual volume UUID”, maybe based on a hash of both the LVM LV and fs volume UUID.

Chris Murphy
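A rough sketch of what such a "virtual volume UUID" could look like, assuming the filesystem UUID is in standard RFC 4122 text form and the LV identifier is just an opaque string. The function name is hypothetical, and using a name-based (SHA-1) UUID via Python's uuid.uuid5 is only one possible choice of hash, not something proposed in the thread.

```python
import uuid


def virtual_volume_uuid(fs_uuid: str, lv_uuid: str) -> uuid.UUID:
    """Derive a stable identifier from the filesystem UUID and the LV identifier.

    The filesystem UUID serves as the namespace and the LV identifier as the
    name, so the same (filesystem, LV) pair always yields the same result,
    while a snapshot LV of the same filesystem yields a different one.
    """
    namespace = uuid.UUID(fs_uuid)        # raises ValueError if malformed
    return uuid.uuid5(namespace, lv_uuid)
```

Because the result is computed rather than stored, nothing on disk has to change at snapshot time; the cost is that every consumer (fstab, the bootloader, udev) would have to agree on the same derivation.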
Re: BTRFS messes up snapshot LV with origin
On Nov 16, 2014, at 11:59 PM, Brendan Hide bren...@swiftspirit.co.za wrote:
> cc'd bug-g...@gnu.org for FYI
>
> On 2014/11/17 03:42, Duncan wrote:
>> MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted:
>>> Hello guys, I think you'll like this...
>>> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429
>> UUID is an initialism for Universally Unique IDentifier.[1] If the UUID isn't unique, by definition, then, it can't be a UUID, and that's a bug in whatever is making the non-unique would-be UUID that isn't unique and thus cannot be a universally unique ID. In this case that would appear to be LVM.
>
> Perhaps the right question to ask is “Where should this bug be fixed?”.
>
> TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being out of scope. The correct fix probably lies in the ecosystem design, which requires co-operation from btrfs.

I think the libblkid folks should be brought into this discussion, to see what their take on this is.

LVM conventional snapshots causing this problem is rare / self-limiting, as they’re short lived. LVM thinp snapshots mean there can be dozens, and they can sanely endure for the life of the thin pool. Effectively we have derivative volumes. At snapshot time, should
a.) the fs volume UUID be changed;
b.) each fs add an additional/secondary volume UUID;
c.) each fs add a derivative/version indicator, i.e. 0 at mkfs time and maybe an epoch time stamp at snapshot time;
d.) the fs UUID not be used for identifying volume uniqueness at all, with a virtual volume UUID used instead, externally determined based on whether the fs is on an LV snapshot?

> Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job exactly as expected. Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old missing disk re-introduced, etc.

Sure, and those are likewise a self-limiting problem.
LVM thinp snapshots actually do make this confusion of multiple instances of the same volume UUID much, much more likely.

> That leaves two places where this can be fixed: grub and btrfs

The GRUB os-prober and grub-mkconfig paradigm, I think, needs to come to an end. The grub.cfg is not supposed to be externally modified; the design is that os-prober + grub-mkconfig obliterate it and generate a whole new one from scratch any time the system boot state changes, i.e. any time a new kernel is added. GRUB isn’t good at OS discovery now; I think that should just be abandoned. It can have its grub.cfg generated to do whatever complex things are needed, but the individual boot menu entries should exist as drop-in scripts managed by whatever is changing the OS boot state. This is the fundamental part of the two bootloaderspecs:
http://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/
http://www.freedesktop.org/wiki/MatthewGarrett/BootLoaderSpec/

And it’s a fundamental part of OSTree, which supports multiple bootable trees on any filesystem, and currently uses a variation on bootloaderspec drop-in scripts to inform GRUB how to boot such a system:
https://wiki.gnome.org/action/show/Projects/OSTree?action=showredirect=OSTree

> Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code.
>
> Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying LVM snapshots are incompatible with btrfs is the right way to go either.
>
> That leaves two aspects of this issue which I view as two separate bugs:
> a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all.
> b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID.
> I feel a) is a btrfs bug. I feel b) is a bug that is more about ecosystem design than grub being silly.

I think we’re well past the expiration date on grub.cfg. A line should be drawn in the sand to deprecate routine use of os-prober + grub-mkconfig and move to drop-in scripts written by whatever the distro presumes will be responsible for managing which “tree” will be booted or offered as a boot option. All GRUB needs to learn is how to use that drop-in script file format.

Ergo, just because I’ve snapshotted my root does not mean grub-mkconfig should be creating boot entries for it. Whatever userspace tool I’m using to make those snapshots (ostree, snapper, whatever the GNOME folks might come up with) should be the thing that creates the boot entry script; and since this 2-4 line script should be so simple, it could even be written by hand by a user, unlike the current grub.cfg file format.

Further I’d like to get more traction from the syslinux/extlinux
BTRFS messes up snapshot LV with origin
Hello guys,

I think you'll like this...
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429

MegaBrutal
Re: BTRFS messes up snapshot LV with origin
MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted:
> Hello guys, I think you'll like this...
> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429

UUID is an initialism for Universally Unique IDentifier.[1] If the UUID isn't unique then, by definition, it can't be a UUID, and that's a bug in whatever is producing the non-unique would-be UUID. In this case that would appear to be LVM.

Meanwhile, if two or more devices are btrfs and have the same UUID, btrfs considers them part of the same filesystem, since btrfs /can/ be a multi-device filesystem. That's not a bug; that's the way btrfs IDs multiple devices as part of the same filesystem, because a UUID, by definition, can be relied upon to be unique, or it's no longer a UUID.

Additionally, the UUID is actually written into the metadata of the filesystem in such a way that it's /not/ a simple task to change the UUID. Put simply, it's ingrained into the filesystem so deeply it cannot be changed, at least not without rewriting pretty much all the metadata. (FWIW, a btrfs balance does just that: rewrite the data, metadata, or both. However, I don't believe a balance plugin to change the UUID is yet available. You're simply not supposed to change the UUID once the filesystem is created.)

So if LVM snapshots duplicate a UUID, as I believe they do, then there's your bug, because they're breaking the definition of Universally *UNIQUE* ID. That being the case, using them with btrfs is pretty much essentially broken, because btrfs depends on UUIDs to be what they say on the label, actually unique, and UUIDs are deeply enough ingrained into the very fabric of btrfs that it's simply not possible to change that on the btrfs side.

Meanwhile, since btrfs *DOES* depend on UUIDs being unique, if there are multiple btrfs filesystems that accidentally have the same UUID, btrfs will not distinguish between them and will very possibly be writing into both of them.
If I found myself in that situation, I'd very carefully copy all the data I wanted to save off the filesystem and do a new mkfs as soon as possible, because I would not consider the filesystem as it was at all stable, and I'd count myself very lucky if I got everything off the filesystem without damage.

In actuality, since the second device was a snapshot of the first, if you catch it reasonably quickly you likely won't have too many issues. However, a btrfs in that condition is in an undefined state, and the longer it exists in that state, the more likely things are to go wrong, possibly VERY VERY wrong. So if you don't already have backups for anything you consider valuable on that thing, get it off there as soon as you possibly can, and consider yourself very lucky if nothing's damaged as a result.

---
[1] http://en.wiktionary.org/wiki/UUID

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: BTRFS messes up snapshot LV with origin
cc'd bug-g...@gnu.org for FYI

On 2014/11/17 03:42, Duncan wrote:
> MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted:
>> Hello guys, I think you'll like this...
>> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429
> UUID is an initialism for Universally Unique IDentifier.[1] If the UUID isn't unique, by definition, then, it can't be a UUID, and that's a bug in whatever is making the non-unique would-be UUID that isn't unique and thus cannot be a universally unique ID. In this case that would appear to be LVM.

Perhaps the right question to ask is "Where should this bug be fixed?".

TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being out of scope. The correct fix probably lies in the ecosystem design, which requires co-operation from btrfs.

Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job exactly as expected. Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old missing disk re-introduced, etc.

That leaves two places where this can be fixed: grub and btrfs

Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code.

Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying LVM snapshots are incompatible with btrfs is the right way to go either.

That leaves two aspects of this issue which I view as two separate bugs:
a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all.
b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID.

I feel a) is a btrfs bug. I feel b) is a bug that is more about ecosystem design than grub being silly.
I imagine a couple of aspects that could help fix a):
- Utilise a unique drive identifier in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. This will provide a sensible mechanism for defined behaviour, preventing corruption - even if that defined behaviour is to simply give out lots of PEBKAC errors and panic.
- Utilise a drive list to ensure that two unrelated filesystems with the same UUID cannot get mixed up. Yes, the user/admin would likely be the culprit here (perhaps a VM rollout process that always gives out the same UUID in all its filesystems). Again, does btrfs not already have something like this built-in that we're simply not utilising fully?

I'm not exactly sure of the correct way to fix b) except that I imagine it would be trivial to fix once a) is fixed.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
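As an illustration of the first suggestion above, here is a minimal sketch of how a scanner could use a per-drive identifier to get defined behaviour: two block devices that report the same filesystem UUID *and* the same drive identifier must be copies of each other (snapshot, ddrescue image), not two members of one multi-device filesystem. The Member type, the dev_id field, and find_conflicts are hypothetical names; this does not describe what btrfs actually stores or checks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Member(NamedTuple):
    """Hypothetical scan result for one block device."""
    path: str     # block device path
    fs_uuid: str  # filesystem UUID shared by all members of one filesystem
    dev_id: str   # per-drive identifier, assumed unique within a filesystem


def find_conflicts(members: Iterable[Member]) -> Dict[Tuple[str, str], List[str]]:
    """Return every (fs_uuid, dev_id) pair that is claimed by more than one path."""
    by_identity: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for m in members:
        by_identity[(m.fs_uuid, m.dev_id)].append(m.path)
    # Keep only the identities seen on two or more block devices.
    return {key: paths for key, paths in by_identity.items() if len(paths) > 1}
```

A caller could refuse to assemble or mount anything listed in the result until the user says which copy is meant, which is the "lots of PEBKAC errors" outcome rather than silent corruption.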
Re: BTRFS messes up snapshot LV with origin
If a UUID is not unique enough, how will adding a second UUID or unique drive identifier help?

A UUID only serves any purpose when it is unique, so duplicate UUIDs are themselves a failure state. The solution should be to make it harder to get into this failure state, not to make all programs resilient against running in it. It isn't a btrfs bug that it requires Universally Unique IDs to be universally unique.

Daniel