Re: [PATCH] Fix typos
On 11/28/18 1:23 PM, Nikolay Borisov wrote:
> On 28.11.18 г. 13:05 ч., Andrea Gelmini wrote:
>> Signed-off-by: Andrea Gelmini
>> ---
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index bab2f1983c07..babbd75d91d2 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -104,7 +104,7 @@ static void __endio_write_update_ordered(struct inode *inode,
>>  /*
>>   * Cleanup all submitted ordered extents in specified range to handle errors
>> - * from the fill_dellaloc() callback.
>> + * from the fill_delalloc() callback.
>
> This is a pure whitespace fix which is generally frowned upon. What you can do though, is replace 'fill_delalloc callback' with 'btrfs_run_delalloc_range' since the callback is gone already.
>
>>   *
>>   * NOTE: caller must ensure that when an error happens, it can not call
>>   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>> @@ -1831,7 +1831,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 410c7e007ba8..d7b6c2b09a0c 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -892,7 +892,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>   * 7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
>>   * 8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
>>   * 9. We can't remove a root or mountpoint.
>> - * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
>> + * 10. We don't allow removal of NFS silly renamed files; it's handled by
>>   *     nfs_async_unlink().
>>   */
>> @@ -3522,7 +3522,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 olen,
>>  			false);
>>  	/*
>>  	 * If one of the inodes has dirty pages in the respective range or
>> -	 * ordered extents, we need to flush dellaloc and wait for all ordered
>> +	 * ordered extents, we need to flush delalloc and wait for all ordered
>
> Just whitespace fix, drop it.

If the spelling is changed, surely that is not a whitespace fix?
Re: NVMe SSD + compression - benchmarking
On 04/28/2018 04:05 AM, Qu Wenruo wrote:
> On 2018年04月28日 01:41, Brendan Hide wrote:
>> Hey, all
>>
>> I'm following up on the queries I had last week since I have installed the NVMe SSD into the PCI-e adapter. I'm having difficulty knowing whether or not I'm doing these benchmarks correctly.
>>
>> As a first test, I put together a 4.7GB .tar containing mostly duplicated copies of the kernel source code (rather compressible). Writing this to the SSD I was seeing repeatable numbers - but noted that the new (supposedly faster) zstd compression is noticeably slower than all other methods. Perhaps this is partly due to lack of multi-threading? No matter, I did also notice a supposedly impossible stat when there is no compression, in that it seems to be faster than the PCI-E 2.0 bus theoretically can deliver:
>
> I'd say the test method is more like real-world usage than a benchmark.
>
> Moreover, copying the kernel source is not that good for compression, as most of the files are smaller than 128K, which means they can't take much advantage of the multi-thread split based on 128K. And the kernel source consists of many small files, and btrfs is really slow for metadata-heavy workloads.
>
> I'd recommend starting with a simpler workload, then going step by step towards more complex workloads. A large-file sequential write with a large block size would be a nice starting point, as it could take full advantage of multi-threaded compression.
>
> Thanks,
> Qu

I did also test the folder tree, where I realised it is intense / far from a regular use-case. It gives far slower results, with zlib being the slowest. The source's average file size is near 13KiB. However, in the test where I gave some results below, the .tar is a large (4.7GB) singular file - I'm not unpacking it at all.

Average results from source tree:
compression type / write speed / read speed
no / 0.29 GBps / 0.20 GBps
lzo / 0.21 GBps / 0.17 GBps
zstd / 0.13 GBps / 0.14 GBps
zlib / 0.06 GBps / 0.10 GBps

Average results from .tar:
compression type / write speed / read speed
no / 1.42 GBps / 2.79 GBps
lzo / 1.17 GBps / 2.04 GBps
zstd / 0.75 GBps / 1.97 GBps
zlib / 1.24 GBps / 2.07 GBps

> Another piece of advice here: if you really want super-fast storage, and there is plenty of memory, the brd module will be your best friend. And on modern mainstream hardware, brd can provide performance over 1GiB/s:
>
> $ sudo modprobe brd rd_nr=1 rd_size=2097152
> $ LANG=C dd if=/dev/zero bs=1M of=/dev/ram0 count=2048
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.45593 s, 1.5 GB/s

My real worry is that I'm currently reading at 2.79GB/s (see result above and below) without compression when my hardware *should* limit it to 2.0GB/s. This tells me either `sync` is not working or my benchmark method is flawed.

> Thanks,
> Qu

>> compression type / write speed / read speed (in GBps)
>> zlib / 1.24 / 2.07
>> lzo / 1.17 / 2.04
>> zstd / 0.75 / 1.97
>> no / 1.42 / 2.79
>>
>> The SSD is PCI-E 3.0 4-lane capable and is connected to a PCI-E 2.0 16-lane slot. lspci -vv confirms it is using 4 lanes. This means its peak throughput *should* be 2.0 GBps - but above you can see the average read benchmark is 2.79GBps.
>> :-/
>>
>> The crude timing script I've put together does the following:
>> - Format the SSD anew with btrfs and no custom settings
>> - wait 180 seconds for possible hardware TRIM to settle (possibly overkill since the SSD is new)
>> - Mount the fs using all defaults except for compression, which could be of zlib, lzo, zstd, or no
>> - sync
>> - Drop all caches
>> - Time the following
>>   - Copy the file to the test fs (source is a ramdisk)
>>   - sync
>> - Drop all caches
>> - Time the following
>>   - Copy back from the test fs to ramdisk
>>   - sync
>> - unmount
>>
>> I can see how, with compression, it *can* be faster than 2 GBps (though it isn't). But I cannot see how having no compression could possibly be faster than 2 GBps. :-/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
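For what it's worth, a minimal sketch of the kind of timing script described above might look like the following. This is not the original script - the device, mount point and source paths are placeholders, and the compression option is passed in as an argument:

#!/bin/bash
# Rough benchmark sketch - adjust DEV/MNT/SRC for the system under test.
DEV=/dev/nvme0n1p1        # test SSD (placeholder)
MNT=/mnt/benchmark        # mount point (placeholder)
SRC=/dev/shm/test.tar     # source file on a ramdisk (placeholder)
COMPRESS=${1:-no}         # zlib | lzo | zstd | no

mkfs.btrfs -f "$DEV"
sleep 180                                 # allow any background TRIM to settle
mount -o compress="$COMPRESS" "$DEV" "$MNT"
sync
echo 3 > /proc/sys/vm/drop_caches

time ( cp "$SRC" "$MNT/" && sync )        # timed write
echo 3 > /proc/sys/vm/drop_caches
time ( cp "$MNT/$(basename "$SRC")" /dev/shm/readback.tar && sync )   # timed read

umount "$MNT"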
NVMe SSD + compression - benchmarking
Hey, all I'm following up on the queries I had last week since I have installed the NVMe SSD into the PCI-e adapter. I'm having difficulty knowing whether or not I'm doing these benchmarks correctly. As a first test, I put together a 4.7GB .tar containing mostly duplicated copies of the kernel source code (rather compressible). Writing this to the SSD I was seeing repeatable numbers - but noted that the new (supposedly faster) zstd compression is noticeably slower than all other methods. Perhaps this is partly due to lack of multi-threading? No matter, I did also notice a supposedly impossible stat when there is no compression, in that it seems to be faster than the PCI-E 2.0 bus theoretically can deliver: compression type / write speed / read speed (in GBps) zlib / 1.24 / 2.07 lzo / 1.17 / 2.04 zstd / 0.75 / 1.97 no / 1.42 / 2.79 The SSD is PCI-E 3.0 4-lane capable and is connected to a PCI-E 2.0 16-lane slot. lspci -vv confirms it is using 4 lanes. This means it's peak throughput *should* be 2.0 GBps - but above you can see the average read benchmark is 2.79GBps. :-/ The crude timing script I've put together does the following: - Format the SSD anew with btrfs and no custom settings - wait 180 seconds for possible hardware TRIM to settle (possibly overkill since the SSD is new) - Mount the fs using all defaults except for compression, which could be of zlib, lzo, zstd, or no - sync - Drop all caches - Time the following - Copy the file to the test fs (source is a ramdisk) - sync - Drop all caches - Time the following - Copy back from the test fs to ramdisk - sync - unmount I can see how, with compression, it *can* be faster than 2 GBps (though it isn't). But I cannot see how having no compression could possibly be faster than 2 GBps. :-/ I can of course get more info if it'd help figure out this puzzle: Kernel info: Linux localhost.localdomain 4.16.3-1-vfio #1 SMP PREEMPT Sun Apr 22 12:35:45 SAST 2018 x86_64 GNU/Linux ^ Close to the regular ArchLinux kernel - but with vfio, and compiled with -arch=native. See https://aur.archlinux.org/pkgbase/linux-vfio/ CPU model: model name: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz Motherboard model: Product Name: Z68MA-G45 (MS-7676) lspci output for the slot: 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 ^ The disk id sans serial is Samsung_SSD_960_EVO_1TB dmidecode output for the slot: Handle 0x001E, DMI type 9, 17 bytes System Slot Information Designation: J8B4 Type: x16 PCI Express Current Usage: In Use Length: Long ID: 4 Characteristics: 3.3 V is provided Opening is shared PME signal is supported Bus Address: :02:01.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
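As an aside, the negotiated link speed (not just the width) can be double-checked against the slot: lspci's verbose output reports both what the device is capable of and what was actually negotiated. The bus address below is the one from this post and will differ elsewhere:

# lspci -s 02:00.0 -vv | grep -E 'LnkCap|LnkSta'
# LnkCap lists the device's capability (e.g. 8GT/s x4 for PCIe 3.0),
# while LnkSta shows the link as actually negotiated (e.g. 5GT/s x4 on a
# PCIe 2.0 slot, i.e. roughly 2.0 GB/s of usable bandwidth).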
Re: nvme+btrfs+compression sensibility and benchmark
Thank you, all

Though the info is useful, there's not a clear consensus on what I should expect. For interest's sake, I'll post benchmarks from the device itself when it arrives. I'm expecting at least that I'll be blown away :)

On 04/18/2018 09:23 PM, Chris Murphy wrote:
> On Wed, Apr 18, 2018 at 10:38 AM, Austin S. Hemmelgarn wrote:
>> For reference, the zstd compression in BTRFS uses level 3 by default (as does zlib compression IIRC), though I'm not sure about lzop (I think it uses the lowest compression setting).
>
> The user space tool, zstd, does default to 3, according to its man page.
>
>   -#     # compression level [1-19] (default: 3)
>
> However, the kernel is claiming it's level 0, which doesn't exist in the man page. So I have no idea what we're using. This is what I get with mount option compress=zstd:
>
> [    4.097858] BTRFS info (device nvme0n1p9): use zstd compression, level 0
>
> --
> Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
nvme+btrfs+compression sensibility and benchmark
Hi, all I'm looking for some advice re compression with NVME. Compression helps performance with a minor CPU hit - but is it still worth it with the far higher throughputs offered by newer PCI and NVME-type SSDs? I've ordered a PCIe-to-M.2 adapter along with a 1TB 960 Evo drive for my home desktop. I previously used compression on an older SATA-based Intel 520 SSD, where compression made sense. However, the wisdom isn't so clear-cut if the SSD is potentially faster than the compression algorithm with my CPU (aging i7 3770). Testing using a copy of the kernel source tarball in tmpfs it seems my system can compress/decompress at about 670MB/s using zstd with 8 threads. lzop isn't that far behind. But I'm not sure if the benchmark I'm running is the same as how btrfs would be using it internally. Given these numbers I'm inclined to believe compression will make things slower - but can't be sure without knowing if I'm testing correctly. What is the best practice with benchmarking and with NVME/PCI storage? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
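For a rough idea of raw compressor throughput, independent of how btrfs chops data into small extents internally, zstd's built-in benchmark mode (and a simple timed pipe) can be used against a file sitting in tmpfs - the paths below are placeholders:

$ zstd -b3 -T8 /dev/shm/linux-src.tar                       # benchmark level 3 with 8 worker threads
$ time zstd -3 -T8 -c /dev/shm/linux-src.tar > /dev/null    # compression only
$ time zstd -d -c /dev/shm/linux-src.tar.zst > /dev/null    # decompression only

Bear in mind btrfs compresses extents in chunks of up to 128KiB in-kernel, so userspace numbers like these are really only an upper bound.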
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote: On 2017-08-03 14:29, Christoph Anton Mitterer wrote: On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote: There are no higher-level management tools (e.g. RAID management/monitoring, etc.)... [snip] As far as 'higher-level' management tools, you're using your system wrong if you _need_ them. There is no need for there to be a GUI, or a web interface, or a DBus interface, or any other such bloat in the main management tools, they work just fine as is and are mostly on par with the interfaces provided by LVM, MD, and ZFS (other than the lack of machine parseable output). I'd also argue that if you can't reassemble your storage stack by hand without using 'higher-level' tools, you should not be using that storage stack as you don't properly understand it. On the subject of monitoring specifically, part of the issue there is kernel side, any monitoring system currently needs to be polling-based, not event-based, and as a result monitoring tends to be a very system specific affair based on how much overhead you're willing to tolerate. The limited stuff that does exist is also trivial to integrate with many pieces of existing monitoring infrastructure (like Nagios or monit), and therefore the people who care about it a lot (like me) are either monitoring by hand, or are just using the tools with their existing infrastructure (for example, I use monit already on all my systems, so I just make sure to have entries in the config for that to check error counters and scrub results), so there's not much in the way of incentive for the concerned parties to reinvent the wheel. To counter, I think this is a big problem with btrfs, especially in terms of user attrition. We don't need "GUI" tools. At all. But we do need that btrfs is self-sufficient enough that regular users don't get burnt by what they would view as unexpected behaviour. We have currently a situation where btrfs is too demanding on inexperienced users. I feel we need better worst-case behaviours. For example, if *I* have a btrfs on its second-to-last-available chunk, it means I'm not micro-managing properly. But users shouldn't have to micro-manage in the first place. Btrfs (or a management tool) should just know to balance the least-used chunk and/or delete the lowest-priority snapshot, etc. It shouldn't cause my services/apps to give diskspace errors when, clearly, there is free space available. The other "high-level" aspect would be along the lines of better guidance and standardisation for distros on how best to configure btrfs. This would include guidance/best practices for things like appropriate subvolume mountpoints and snapshot paths, sensible schedules or logic (or perhaps even example tools/scripts) for balancing and scrubbing the filesystem. I don't have all the answers. But I also don't want to have to tell people they can't adopt it because a) they don't (or never will) understand it; and b) they're going to resent me for their irresponsibly losing their own data. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
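As an illustration of the kind of polling-based check being described (something monit, Nagios or plain cron can run), a minimal sketch might simply flag non-zero device error counters - the list of mount points is a placeholder:

#!/bin/bash
# Report btrfs device error counters; only produce output when non-zero.
status=0
for mnt in / /home /data; do
    # "btrfs device stats" prints write_io_errs, read_io_errs, flush_io_errs,
    # corruption_errs and generation_errs per device.
    if btrfs device stats "$mnt" | awk '$2 != 0 { found = 1 } END { exit !found }'; then
        echo "Non-zero error counters on $mnt:"
        btrfs device stats "$mnt" | awk '$2 != 0'
        status=1
    fi
done
exit $status

Cron only sends mail when something is printed, which turns a crude polling check into a workable alert.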
RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
The title seems alarmist to me - and I suspect it is going to be misconstrued. :-/ From the release notes at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html "Btrfs has been deprecated The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux. The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature. Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use. We encourage feedback through your Red Hat representative on features and requirements you have for file systems and storage technology." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID56 status?
Hey, all Long-time lurker/commenter here. Production-ready RAID5/6 and N-way mirroring are the two features I've been anticipating most, so I've commented regularly when this sort of thing pops up. :) I'm only addressing some of the RAID-types queries as Qu already has a handle on the rest. Small-yet-important hint: If you don't have a backup of it, it isn't important. On 01/23/2017 02:25 AM, Jan Vales wrote: [ snip ] Correct me, if im wrong... * It seems, raid1(btrfs) is actually raid10, as there are no more than 2 copies of data, regardless of the count of devices. The original "definition" of raid1 is two mirrored devices. The *nix industry standard implementation (mdadm) extends this to any number of mirrored devices. Thus confusion here is understandable. ** Is there a way to duplicate data n-times? This is a planned feature, especially in lieu of feature-parity with mdadm, though the priority isn't particularly high right now. This has been referred to as "N-way mirroring". The last time I recall discussion over this, it was hoped to get work started on it after raid5/6 was stable. ** If there are only 3 devices and the wrong device dies... is it dead? Qu has the right answers. Generally if you're using anything other than dup, raid0, or single, one disk failure is "okay". More than one failure is closer to "undefined". Except with RAID6, where you need to have more than two disk failures before you have lost data. * Whats the diffrence of raid1(btrfs) and raid10(btrfs)? Some nice illustrations from Qu there. :) ** After reading like 5 diffrent wiki pages, I understood, that there are diffrences ... but not what they are and how they affect me :/ * Whats the diffrence of raid0(btrfs) and "normal" multi-device operation which seems like a traditional raid0 to me? raid0 stripes data in 64k chunks (I think this size is tunable) across all devices, which is generally far faster in terms of throughput in both writing and reading data. By '"normal" multi-device' I will assume this means "single" with multiple devices. New writes with "single" will use a 1GB chunk on one device until the chunk is full, at which point it allocates a new chunk, which will usually be put on the disk with the most available free space. There is no particular optimisation in place comparable to raid0 here. Maybe rename/alias raid-levels that do not match traditional raid-levels, so one cannot expect some behavior that is not there. The extreme example is imho raid1(btrfs) vs raid1. I would expect that if i have 5 btrfs-raid1-devices, 4 may die and btrfs should be able to fully recover, which, if i understand correctly, by far does not hold. If you named that raid-level say "george" ... I would need to consult the docs and I obviously would not expect any behavior. :) We've discussed this a couple of times. Hugo came up with a notation since dubbed "csp" notation: c->Copies, s->Stripes, and p->Parities. Examples of this would be: raid1: 2c 3-way mirroring across 3 (or more*) devices: 3c raid0 (2-or-more-devices): 2s raid0 (3-or-more): 3s raid5 (5-or-more): 4s1p raid16 (12-or-more): 2c4s2p * note the "or more": Mdadm *cannot* mirror less mirrors or stripes than devices, whereas there is no particular reason why btrfs won't be able to do this. A minor problem with csp notation is that it implies a complete implementation of *any* combination of these, whereas the idea was simply to create a way to refer to the "raid" levels in a consistent way. I hope this brings some clarity. 
:) regards, Jan Vales -- I only read plaintext emails. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
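To make the profile discussion above a little more concrete, here are hedged examples of creating filesystems with explicit data/metadata profiles (device names are placeholders; -f overwrites whatever is on them):

# raid1 (2c): exactly two copies of each chunk, regardless of device count
$ mkfs.btrfs -f -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
# raid10 (2c plus striping): two copies, each copy striped across devices
$ mkfs.btrfs -f -d raid10 -m raid10 /dev/sd[b-e]
# raid6 (striping plus 2 parities): survives two device failures
$ mkfs.btrfs -f -d raid6 -m raid6 /dev/sd[b-f]
# "single" data concatenation (no redundancy for data), mirrored metadata
$ mkfs.btrfs -f -d single -m raid1 /dev/sdb /dev/sdc

An existing filesystem can be moved between profiles later with btrfs balance start -dconvert=... -mconvert=..., which is also how N-way mirroring would presumably be adopted if/when it lands.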
Re: [RFC] Preliminary BTRFS Encryption
For the most part, I agree with you, especially about the strategy being backward - and file encryption being a viable more-easily-implementable direction. However, you are doing yourself a disservice to compare btrfs' features as a "re-implementation" of existing tools. The existing tools cannot do what btrfs' devs want to implement. See below inline. On 09/16/2016 03:12 AM, Dave Chinner wrote: On Tue, Sep 13, 2016 at 09:39:46PM +0800, Anand Jain wrote: This patchset adds btrfs encryption support. The main objective of this series is to have bugs fixed and stability. I have verified with fstests to confirm that there is no regression. A design write-up is coming next, however here below is the quick example on the cli usage. Please try out, let me know if I have missed something. Yup, that best practices say "do not roll your own encryption infrastructure". 100% agreed This is just my 2c worth - take it or leave it, don't other flaming. Keep in mind that I'm not picking on btrfs here - I asked similar hard questions about the proposed f2fs encryption implementation. That was a "copy and snowflake" version of the ext4 encryption code - they made changes and now we have generic code and common functionality between ext4 and f2fs. Also would like to mention that a review from the security experts is due, which is important and I believe those review comments can be accommodated without major changes from here. That's a fairly significant red flag to me - security reviews need to be done at the design phase against specific threat models - security review is not a code/implementation review... Also agreed. This is a bit backward. The ext4 developers got this right by publishing threat models and design docs, which got quite a lot of review and feedback before code was published for review. https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/edit#heading=h.qmnirp22ipew [small reorder of comments] As of now these patch set supports encryption on per subvolume, as managing properties on per subvolume is a kind of core to btrfs, which is easier for data center solution-ing, seamlessly persistent and easy to manage. We've got dmcrypt for this sort of transparent "device level" encryption. Do we really need another btrfs layer that re-implements ... [snip] Woah, woah. This is partly addressed by Roman's reply - but ... Subvolumes: Subvolumes are not comparable to block devices. This thinking is flawed at best; cancerous at worst. As a user I tend to think of subvolumes simply as directly-mountable folders. As a sysadmin I also think of them as snapshottable/send-receiveable folders. And as a dev I know they're actually not that different from regular folders. They have some extra metadata so aren't as lightweight - but of course they expose very useful flexibility not available in a regular folder. MD/raid comparison: In much the same way, comparing btrfs' raid features to md directly is also flawed. Btrfs even re-uses code in md to implement raid-type features in ways that md cannot. I can't answer for the current raid5/6 stability issues - but I am confident that the overall design is good, and that it will be fixed. The generic file encryption code is solid, reviewed, tested and already widely deployed via two separate filesystems. There is a much wider pool of developers who will maintain it, reveiw changes and know all the traps that a new implementation might fall into. 
There's a much bigger safety net here, which significantly lowers the risk of zero-day fatal flaws in a new implementation and of flaws in future modifications and enhancements. Hence, IMO, the first thing to do is implement and make the generic file encryption support solid and robust, not tack it on as an afterthought for the magic btrfs encryption pixies to take care of. Indeed, with the generic file encryption, btrfs may not even need the special subvolume encryption pixies. i.e. you can effectively implement subvolume encryption via configuration of a multi-user encryption key for each subvolume and apply it to the subvolume tree root at creation time. Then only users with permission to unlock the subvolume key can access it. Once the generic file encryption is solid and fulfils the needs of most users, then you can look to solving the less common threat models that neither dmcrypt or per-file encryption address. Only if the generic code cannot be expanded to address specific threat models should you then implement something that is unique to btrfs Agreed, this sounds like a far safer and achievable implementation process. Cheers, Dave. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Allocator behaviour during device delete
On 06/09/2016 03:07 PM, Austin S. Hemmelgarn wrote: On 2016-06-09 08:34, Brendan Hide wrote: Hey, all I noticed this odd behaviour while migrating from a 1TB spindle to SSD (in this case on a LUKS-encrypted 200GB partition) - and am curious if this behaviour I've noted below is expected or known. I figure it is a bug. Depending on the situation, it *could* be severe. In my case it was simply annoying. --- Steps After having added the new device (btrfs dev add), I deleted the old device (btrfs dev del) Then, whilst waiting for that to complete, I started a watch of "btrfs fi show /". Note that the below is very close to the output at the time - but is not actually copy/pasted from the output. Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 Total devices 2 FS bytes used 115.03GiB devid1 size 0.00GiB used 298.06GiB path /dev/sda2 devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot devid1 is the old disk while devid2 is the new SSD After a few minutes, I saw that the numbers have changed - but that the SSD still had no data: Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 Total devices 2 FS bytes used 115.03GiB devid1 size 0.00GiB used 284.06GiB path /dev/sda2 devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot The "FS bytes used" amount was changing a lot - but mostly stayed near the original total, which is expected since there was very little happening other than the "migration". I'm not certain of the exact point where it started using the new disk's space. I figure that may have been helpful to pinpoint. :-/ OK, I'm pretty sure I know what was going on in this case. Your assumption that device delete uses the balance code is correct, and that is why you see what's happening happening. There are two key bits that are missing though: 1. Balance will never allocate chunks when it doesn't need to. 2. The space usage listed in fi show is how much space is allocated to chunks, not how much is used in those chunks. In this case, based on what you've said, you had a lot of empty or mostly empty chunks. As a result of this, the device delete was both copying data, and consolidating free space. If you have a lot of empty or mostly empty chunks, it's not unusual for a device delete to look like this until you start hitting chunks that have actual data in them. The pri8mary point of this behavior is that it makes it possible to directly switch to a smaller device without having to run a balance and then a resize before replacing the device, and then resize again afterwards. Thanks, Austin. Your explanation is along the lines of my thinking though. The new disk should have had *some* data written to it at that point, as it started out at over 600GiB in allocation (should have probably mentioned that already). Consolidating or not, I would consider data being written to the old disk to be a bug, even if it is considered minor. I'll set up a reproducible test later today to prove/disprove the theory. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
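For anyone wanting to test the same thing without spare hardware, a rough reproduction along these lines should work with loop devices - sizes and paths are arbitrary, and the behaviour may differ across kernel versions:

#!/bin/bash
# Sketch: watch where chunks are allocated while a device is being deleted.
truncate -s 10G /tmp/big.img
truncate -s 4G /tmp/small.img
BIG=$(losetup --show -f /tmp/big.img)
SMALL=$(losetup --show -f /tmp/small.img)

mkfs.btrfs -f "$BIG"
mkdir -p /mnt/looptest
mount "$BIG" /mnt/looptest
dd if=/dev/urandom of=/mnt/looptest/fill bs=1M count=2000   # spread data over several chunks
sync

btrfs device add "$SMALL" /mnt/looptest
btrfs device delete "$BIG" /mnt/looptest &
watch -n 5 "btrfs fi show /mnt/looptest"

If the theory above is right, the "used" figure on the old device should only ever shrink during the delete; any growth would point at the allocator still writing new chunks to it.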
Allocator behaviour during device delete
Hey, all I noticed this odd behaviour while migrating from a 1TB spindle to SSD (in this case on a LUKS-encrypted 200GB partition) - and am curious if this behaviour I've noted below is expected or known. I figure it is a bug. Depending on the situation, it *could* be severe. In my case it was simply annoying. --- Steps After having added the new device (btrfs dev add), I deleted the old device (btrfs dev del) Then, whilst waiting for that to complete, I started a watch of "btrfs fi show /". Note that the below is very close to the output at the time - but is not actually copy/pasted from the output. > Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 > Total devices 2 FS bytes used 115.03GiB > devid1 size 0.00GiB used 298.06GiB path /dev/sda2 > devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot devid1 is the old disk while devid2 is the new SSD After a few minutes, I saw that the numbers have changed - but that the SSD still had no data: > Label: 'tricky-root' uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42 > Total devices 2 FS bytes used 115.03GiB > devid1 size 0.00GiB used 284.06GiB path /dev/sda2 > devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot The "FS bytes used" amount was changing a lot - but mostly stayed near the original total, which is expected since there was very little happening other than the "migration". I'm not certain of the exact point where it started using the new disk's space. I figure that may have been helpful to pinpoint. :-/ --- Educated guess as to what was happening: Key: Though the available space on devid1 is displayed as 0 GiB, internally the allocator still sees most of the device's space as available. The allocator will continue writing to the old disk even though the intention is to remove it. The dev delete operation goes through the chunks in sequence and does a "normal" balance operation on each, which the kernel simply sends to the "normal" single allocator. At the start of the operation, the allocator will see that the device of 1TB has more space available than the 200GB device, thus it writes the data to a new chunk on the 1TB spindle. Only after the chunk is balanced away, does the operation mark *only* that "source" chunk as being unavailable. As each chunk is subsequently balanced away, eventually the allocator will see that there is more space available on the new device than on the old device (1:199/2:200), thus the next chunk gets allocated to the new device. The same occurs for the next chunk (1:198/2:199) and so on, until the device finally has zero usage and is removed completely. --- Naive approach for a fix (assuming my assessment above is correct) At the start: 1. "Balance away"/Mark-as-Unavailable empty space 2. Balance away the *current* chunks (data+metadata) that would otherwise be written to if the device was still available 3. As before, balance in whatever order is applicable. --- Severity I figure that, for my use-case, this isn't a severe issue. However, in the case where you want quickly to remove a potentially failing disk (common use case for dev delete), I'd much rather that btrfs does *not* write data to the disk I'm trying to remove, making this a potentially severe bug. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 vs RAID10 and best way to set up 6 disks
On 06/03/16 20:59, Christoph Anton Mitterer wrote: On Fri, 2016-06-03 at 13:42 -0500, Mitchell Fossen wrote: Thanks for pointing that out, so if I'm thinking correctly, with RAID1 it's just that there is a copy of the data somewhere on some other drive. With RAID10, there's still only 1 other copy, but the entire "original" disk is mirrored to another one, right? As Justin mentioned, btrfs doesn't raid whole disks/devices. Instead, it works with chunks. To be honest, I couldn't tell you for sure :-/ ... IMHO the btrfs documentation has some "issues". mkfs.btrfs(8) says: 2 copies for RAID10, so I'd assume it's just the striped version of what btrfs - for whichever questionable reason - calls "RAID1". The "questionable reason" is simply the fact that it is, now as well as at the time the features were added, the closest existing terminology that best describes what it does. Even now, it would be difficult on the spot adequately to explain what it means for redundancy without also mentioning "RAID". Btrfs does not raid disks/devices. It works with chunks that are allocated to devices when the previous chunk/chunk-set is full. We're all very aware of the inherent problem of language - and have discussed various ways to address it. You will find that some on the list (but not everyone) are very careful to never call it "RAID" - but instead raid (very small difference, I know). Hugo Mills previously made headway in getting discussion and consensus of proper nomenclature. * Especially, when you have an odd number devices (or devices with different sizes), its not clear to me, personally, at all how far that redundancy actually goes respectively what btrfs actually does... could be that you have your 2 copies, but maybe on the same device then? No, btrfs' raid1 naively guarantees that the two copies will *never* be on the same device. raid10 does the same thing - but in stripes on as many devices as possible. The reason I say "naively" is that there is little to stop you from creating a 2-device "raid1" using two partitions on the same physical device. This is especially difficult to detect if you add abstraction layers (lvm, dm-crypt, etc). This same problem does apply to mdadm however. Though it won't necessarily answer all questions about allocation, I strongly suggest checking out Hugo's btrfs calculator ** I hope this is helpful. * http://comments.gmane.org/gmane.comp.file-systems.btrfs/34717 / https://www.spinics.net/lists/linux-btrfs/msg33742.html * http://comments.gmane.org/gmane.comp.file-systems.btrfs/34792 ** http://carfax.org.uk/btrfs-usage/ Cheers, Chris. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: USB memory sticks wear & speed: btrfs vs f2fs?
On 2/9/2016 1:13 PM, Martin wrote: How does btrfs compare to f2fs for use on (128GByte) USB memory sticks? Particularly for wearing out certain storage blocks? Does btrfs heavily use particular storage blocks that will prematurely "wear out"? (That is, could the whole 128GBytes be lost due to one 4kByte block having been re-written excessively too many times due to a fixed repeatedly used filesystem block?) Any other comparisons/thoughts for btrfs vs f2fs? Copy-on-write (CoW) designs tend naturally to work well with flash media. F2fs is *specifically* designed to work well with flash, whereas for btrfs it is a natural consequence of the copy-on-write design. With both filesystems, if you randomly generate a 1GB file and delete it 1000 times, onto a 1TB flash, you are *very* likely to get exactly one write to *every* block on the flash (possibly two writes to <1% of the blocks) rather than, as would be the case with non-CoW filesystems, 1000 writes to a small chunk of blocks. I haven't found much reference or comparison information online wrt wear leveling - mostly performance benchmarks that don't really address your request. Personally I will likely never bother with f2fs unless I somehow end up working on a project requiring relatively small storage in Flash (as that is what f2fs was designed for). If someone can provide or link to some proper comparison data, that would be nice. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Btrfs device and pool management (wip)
On 12/1/2015 12:05 PM, Brendan Hide wrote: On 11/30/2015 11:09 PM, Chris Murphy wrote: On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn wrote: I've had multiple cases of disks that got one write error then were fine for more than a year before any further issues. My thought is add an option to retry that single write after some short delay (1-2s maybe), and if it still fails, then mark the disk as failed. Seems reasonable. I think I added this to the Project Ideas page on the wiki a *very* long time ago https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation "After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors." Related, a separate section on that same page mentions a Jeff Mahoney. Perhaps he should be consulted or his work should be looked into: Take device with heavy IO errors offline or mark as "unreliable" "Devices should be taken offline after they reach a given threshold of IO errors. Jeff Mahoney works on handling EIO errors (among others), this project can build on top of it." Agreed. Maybe it would be an error rate (set by ratio)? I was thinking of either: a. A running count, using the current error counting mechanisms, with some max number allowed before the device gets kicked. b. A count that decays over time, this would need two tunables (how long an error is considered, and how many are allowed). OK. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Btrfs device and pool management (wip)
On 11/30/2015 11:09 PM, Chris Murphy wrote: On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn wrote: I've had multiple cases of disks that got one write error then were fine for more than a year before any further issues. My thought is add an option to retry that single write after some short delay (1-2s maybe), and if it still fails, then mark the disk as failed. Seems reasonable. I think I added this to the Project Ideas page on the wiki a *very* long time ago https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation "After a device is marked as unreliable, maintain the device within the FS in order to confirm the issue persists. The device will still contribute toward fs performance but will not be treated as if contributing towards replication/reliability. If the device shows that the given errors were a once-off issue then the device can be marked as reliable once again. This will mitigate further unnecessary rebalance. See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive Resurrection]" as an example of where this is a significant feature for storage vendors." Agreed. Maybe it would be an error rate (set by ratio)? I was thinking of either: a. A running count, using the current error counting mechanisms, with some max number allowed before the device gets kicked. b. A count that decays over time, this would need two tunables (how long an error is considered, and how many are allowed). OK. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid6 + hot spare question
Things can be a little more nuanced. First off, I'm not even sure btrfs supports a hot spare currently. I haven't seen anything along those lines recently in the list - and don't recall anything along those lines before either. The current mention of it in the Project Ideas page on the wiki implies it hasn't been looked at yet. Also, depending on your experience with btrfs, some of the tasks involved in fixing up a missing/dead disk might be daunting. See further (queries for btrfs-devs too) inline below: On 2015-09-08 14:12, Hugo Mills wrote: On Tue, Sep 08, 2015 at 01:59:19PM +0200, Peter Keše wrote: However I'd like to be prepared for a disk failure. Because my server is not easily accessible and disk replacement times can be long, I'm considering the idea of making a 5-drive raid6, thus getting 12TB useable space + parity. In this case, the extra 4TB drive would serve as some sort of a hot spare. From the above I'm reading one of two situations: a) 6 drives, raid6 across 5 drives and 1 unused/hot spare b) 5 drives, raid6 across 5 drives and zero unused/hot spare My assumption is that if one hard drive fails before the volume is more than 8TB full, I can just rebalance and resize the volume from 12 TB back to 8 TB essentially going from 5-drive raid6 to 4-drive raid6). Can anyone confirm my assumption? Can I indeed rebalance from 5-drive raid6 to 4-drive raid6 if the volume is not too big? Yes, you can, provided, as you say, the data is small enough to fit into the reduced filesystem. Hugo. This is true - however, I'd be hesitant to build this up due to the current process not being very "smooth" depending on how unlucky you are. If you have scenario b above, will the filesystem still be read/write or read-only post-reboot? Will it "just work" with the only requirement being free space on the four working disks? RAID6 is intended to be tolerant of two disk failures. In the case of there being a double failure and only 5 disks, the ease with which the user can balance/convert to a 3-disk raid5 is also important. Please shoot down my concerns. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
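To make the recovery path concrete, the 5-to-4 device scenario being discussed would look roughly like this - a sketch only, with placeholder device names, and assuming the remaining data fits on four devices:

# One of the five raid6 members has died; mount the survivors degraded:
# mount -o degraded /dev/sdb /mnt
# Drop the dead device; its chunks are rebuilt onto the four remaining
# devices, effectively leaving a 4-device raid6:
# btrfs device delete missing /mnt
# Sanity-check the result:
# btrfs fi show /mnt
# btrfs fi df /mnt

The questions above essentially boil down to whether the filesystem stays writable long enough, and predictably enough, for those two commands to succeed after a reboot.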
Re: [PATCH 2/3] btrfs: add replace missing and replace RAID 5/6 to profile configs
On 2015/07/24 07:50 PM, Omar Sandoval wrote:
> On Fri, Jul 24, 2015 at 02:09:46PM +0200, David Sterba wrote:
>> On Thu, Jul 23, 2015 at 01:51:50PM -0700, Omar Sandoval wrote:
>>> +	# We can't do replace with these profiles because they
>>> +	# imply only one device ($SCRATCH_DEV), and we need to
>>> +	# keep $SCRATCH_DEV around for _scratch_mount
>>> +	# and _check_scratch_fs.
>>> +	local unsupported=(
>>> +		"single"
>>> +		"dup"
>>
>> DUP does imply single device, but why does 'single' ?
>
> It does not, I apparently forgot that you could use single to concatenate multiple devices. I'll fix that in v2.
>
> Thanks for reviewing!

Late to the party. DUP *implies* single device but there are cases where dup is used on a multi-device fs. Even if the use-cases aren't good or intended to be long-term, they are still valid, right?

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OT raid-1 from raid-10?
On 2015/02/22 03:02, Dave Stevens wrote: If there's a better list please say so. Either way, there is definitely not enough information here for us to give any practical advice. This is a btrfs-related mailing list and there's no indication even that you are using btrfs as your filesystem. Typically, "linux raid autodetect" refers to mdraid, which is not btrfs - and changing the partition's "type" won't change the underlying issue of the data being unavailable. I have a raid-10 array with two dirty drives and (according to the kernel) not enough mirrors to repair the raid-10. But I think drives sda and sdb are mirrored and maybe I could read the data off them if I changed the fs type from linux raid autodetect to ext3. is that reasonable? D If at least 3 disks are still in good condition, there's a chance that you can recover all your data. If not, my advice is to restore from backup. If you don't have backups ... well ... we have a couple of sayings about that, mostly along the lines of: If you don't have backups of it, it wasn't that important to begin with. Run the following commands (as root) and send us the output - then maybe someone will be kind enough to point you in the right direction: uname -a cat /etc/*release btrfs --version btrfs fi show cat /proc/mdstat fdisk -l gdisk -l -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem
On 2015/02/09 10:30 PM, Kai Krakow wrote:
> Brendan Hide schrieb:
>
>> I have the following two lines in
>> /etc/udev/rules.d/61-persistent-storage.rules for two old 250GB
[snip]
>
> Wouldn't it be easier and more efficient to use this:
>
> ACTION=="add|change", KERNEL=="sd[a-z]", ENV{ID_SERIAL}=="...",
>   ATTR{device/timeout}="120"
>
> Otherwise you always spawn a shell and additional file descriptors, and you could spare a variable interpolation. Tho it probably depends on your udev version...
>
> I'm using this and it works setting the attributes (set deadline on SSD):
>
> ACTION=="add|change", KERNEL=="sd[a-z]",
>   ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"
>
> And, I think you missed the double-equal "==" behind ENV{}... Right? Otherwise you just assign a value. Tho, you could probably match on ATTR{devices/model} instead to be more generic (the serial is probably too specific). You can get those from the /sys/block/sd* subtree.

It is certainly possible that it isn't 100% the right way - but it has been working. Your suggestions certainly sound more efficient/canonical. I was following what I found online until it "worked". :)

I'll make the appropriate adjustments and test. Thanks!

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Replacing a (or two?) failed drive(s) in RAID-1 btrfs filesystem
On 2015/02/09 01:58, constantine wrote:
>> Second, SMART is only saying its internal test is good. The errors are related to data transfer, so that implicates the enclosure (bridge chipset or electronics), the cable, or the controller interface. Actually it could also be a flaky controller or RAM on the drive itself too which I don't think get checked with SMART tests.
>
> Which test should I do from now on (on a weekly basis?) so as to prevent similar things from happening?

Chris has given some very good info so far. I've also had to learn some of this stuff the hard way (failed/unreliable drives, data unavailable/lost, etc). The info below will be of help to you in following some of the advice already given. Unfortunately, the best course of action I see so far is to follow Chris' advice and to purchase more disks so you can make a backup ASAP.

I have the following two lines in /etc/udev/rules.d/61-persistent-storage.rules for two old 250GB spindles. It sets the timeout to 120 seconds because these two disks don't support SCT ERC. This may very well apply without modification to other distros - but this is only tested in Arch:

ACTION=="add", KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL}="ST3250410AS_6RYF5NP7" RUN+="/bin/sh -c 'echo 120 > /sys$devpath/device/timeout'"
ACTION=="add", KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL}="ST3250820AS_9QE2CQWC" RUN+="/bin/sh -c 'echo 120 > /sys$devpath/device/timeout'"

I have a "smart_scan" script* that does a check of all disks using smartctl. The meat of the script is in main(). The rest of the script is from a template of mine. The script, with no parameters, will do a short and then a long test on all drives. It does not give any output - however if you have smartd running and configured appropriately, smartd will pick up on any issues found and send appropriate alerts (email/system log/etc). It is configured in /etc/cron.d/smart. It runs a short test every morning and a long test every Saturday evening:

25 5 * * *    root    /usr/local/sbin/smart_scan short
25 18 * * 6   root    /usr/local/sbin/smart_scan long

Then, scrubbing**: This relatively simple script runs a scrub on all disks and prints the results *only* if there were errors. I've scheduled this in a cron as well to execute *every* morning shortly after 2am. Cron is configured to send me an email if there is any output - so I only get an email if there's something to look into.

And finally, I have btsync configured to synchronise my Arch desktop's system journal to a couple of local and remote servers of mine. A much cleaner way to do this would be to use an external syslog server - I haven't yet looked into doing that properly, however.

* http://swiftspirit.co.za/down/smart_scan
** http://swiftspirit.co.za/down/btrfs-scrub-all

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
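The linked script isn't reproduced here, but the core of a smart_scan-style helper is small enough to sketch. This is an approximation, not the script at the URL above, and it assumes smartd is configured separately to pick up and report failed self-tests:

#!/bin/bash
# Kick off a SMART self-test on every whole disk. Deliberately silent:
# smartd does the alerting when a test logs a failure.
TYPE=${1:-short}              # "short" or "long"
for disk in /dev/sd[a-z]; do
    [ -b "$disk" ] || continue
    smartctl -t "$TYPE" "$disk" > /dev/null 2>&1
done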
Re: "btrfs send" and interrupts
On 2015/01/23 11:58, Matthias Urlichs wrote: Hi, root@data:/daten/backup/email/xmit# btrfs send foo | ssh wherever btrfs receive /mnt/mail/xmit/ At subvol foo At subvol foo [ some time passes and I need to do something else on that volume ] ^Z [1]+ Stopped btrfs send foo | ssh -p50022 surf btrfs receive /mnt/mail/xmit/ root@data:/daten/backup/email/xmit# bg [1]+ btrfs send foo | ssh -p50022 surf btrfs receive /mnt/mail/xmit/ & root@data:/daten/backup/email/xmit# [ Immediately afterwards, this happens: ] ERROR: crc32 mismatch in command. At subvol foo At subvol foo ERROR: creating subvolume foo failed. File exists [1]+ Exit 1 btrfs send foo | ssh -p50022 surf btrfs receive /mnt/mail/xmit/ root@data:/daten/backup/email/xmit# Yowch. Please make sure that the simple act of backgrounding a data transfer doesn't abort it. That was ten hours in, now I have to repeat the whole thing. :-/ Thank you. Interesting case. I'm not sure of the merits/workaround needed to do this. It appears even using cat into netcat (nc) causes netcat to quit if you background the operation. A workaround for future: I *strongly* recommend using screen for long-lived operations. This would have avoided the problem. Perhaps you were sitting in front of the server and it wasn't much of a concern at the time - but most admins work remotely. Never mind ^z, what about other occurrences such as if the power/internet goes out at your office/home and the server is on another continent? Your session dies and you lose 10 hours of work/waiting. With a screen session, that is no longer true. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
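For anyone else hitting this, the screen-based workaround looks something like the following (the session name is a placeholder; the pipeline is the one from the report above):

# Start a named session before kicking off the transfer:
$ screen -S send-foo
# Inside the session, run the long-lived pipeline as usual:
# btrfs send foo | ssh wherever btrfs receive /mnt/mail/xmit/
# Detach with Ctrl-a d; the pipeline keeps running even if your own
# connection to the machine drops. Reattach later with:
$ screen -r send-foo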
Re: Recovery Operation With Multiple Devices
On 2015/01/23 09:53, Brett King wrote: Hi All, Just wondering how 'btrfs recovery' operates I'm assuming you're referring to a different set of commands or general scrub/recovery processes. AFAIK there is no "btrfs recovery" command. , when the source device given is one of many in an MD array - I can't find anything documentation beyond a single device use case. btrfs doesn't know what an md array or member is, therefore your results aren't going to be well-defined. Depending on the type of md array the member was in, your data may be mostly readable (RAID1) or completely/mostly non-interpretable (RAID5/6/10/0) until md fixes the array. Does it automatically include all devices in the relevant MD array as occurs when mounting, or does it only restore the data which happened to be written to the specific, single device given ? As above, btrfs is not md-aware. It will attempt to work with what it is given. It might not understand anything it sees as it will not have a good description of what it is looking at. Imagine being given instructions on how to get somewhere only to find that the first 20 instructions and every second instruction thereafter was skipped and there's a 50% chance the destination doesn't exist. From an inverse perspective, how can I restore all data including snapshots, which are spread across a damaged MD FS to a new (MD) FS ? Your best bet is to restore the md array. More details are needed for anyone to assist - for example what RAID-type was the array set up with, how many disks were in the array, and how it failed. Also, technically this is the wrong place to ask for advice about restoring md arrays. ;) Can send / receive do this perhaps ? Send/receive is for sending good data to a destination that can accept it. This, as above, depends on the data being readable/available. Very likely the data will be unreadable from a single disk unless the md array was RAID1. Thanks in advance ! -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: I need to P. are we almost there yet?
On 2015/01/02 15:42, Austin S Hemmelgarn wrote:
> On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
>> I see this as a CRITICAL design flaw. The reason for calling it CRITICAL is that System Administrators have been trained for >20 years that RAID-10 can usually handle a dual-disk failure, but the BTRFS implementation has effectively ZERO chance of doing so.
>
> No, some rather simple math

That's the problem. The math isn't as simple as you'd expect. The example below is probably a pathological case - but here goes.

Let's say in this 4-disk example that chunks are striped as d1,d2,d1,d2, where d1 is the first bit of data and d2 is the second:

Chunk 1 might be striped across disks A,B,C,D as d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D as d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B as d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D as d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B as d9,d10,d9,d10

Lose any two disks and you have a one-in-three chance, on *each* chunk, of having lost that chunk. With traditional RAID10 a two-disk failure has a one-in-three chance of losing the array entirely. With btrfs, the more data you have stored, the closer the chance gets to 100% of losing *some* data in a 2-disk failure. In the above example:

Losing A and B means you lose d3, d6, and d7 (60% of all chunks).
Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4, d5, and d8 (60% of all chunks).

Across the six possible two-disk failures, this skewed example loses a third of all chunks on average. As you add more data and randomise the allocation, that average stays around one third - BUT, the chance of losing *some* data is already clearly shown to be very close to 100%.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
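To put a rough number on "very close to 100%": if each chunk's mirror pairing is treated as an independent, uniformly random pairing of the four disks (an idealising assumption, not something the allocator guarantees), then for any particular pair of failed disks:

    P(\text{a given chunk is lost}) = 2 / \binom{4}{2} = 1/3
    P(\text{some data lost across } n \text{ chunks}) = 1 - (2/3)^n

With even a hundred or so data chunks (each typically 1GiB), that second probability is indistinguishable from 1 - which is the point being made above. A conventional 4-disk RAID10, by contrast, is only lost when the failed pair happens to be one of its two fixed mirror pairs.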
Another "No space left" balance failure with plenty of space available
Hey, guys This is on my ArchLinux desktop. Current values as follows and the exact error is currently reproducible. Let me know if you want me to run any tests/etc. I've made an image (76MB) and can send the link to interested parties. I have come across this once before in the last few weeks. The workaround at that time was to run multiple balances with incrementing -musage and -dusage values. Whether or not that was a real, imaginary, or temporary fix is another story. I have backups but the issue doesn't yet appear to cause any symptoms other than these errors. The drive is a second-hand 60GB Intel 330 recycled from a decommissioned server. The mkfs was run on Nov 4th before I started a migration from spinning rust. According to my pacman logs btrfs-progs was on 3.17-1 and kernel was 3.17.1-1 root ~ $ uname -a Linux watricky.invalid.co.za 3.17.4-1-ARCH #1 SMP PREEMPT Fri Nov 21 21:14:42 CET 2014 x86_64 GNU/Linux root ~ $ btrfs fi show / Label: 'arch-btrfs-root' uuid: 782a0edc-1848-42ea-91cb-de8334f0c248 Total devices 1 FS bytes used 17.44GiB devid1 size 40.00GiB used 20.31GiB path /dev/sdc1 Btrfs v3.17.2 root ~ $ btrfs fi df / Data, single: total=18.00GiB, used=16.72GiB System, DUP: total=32.00MiB, used=16.00KiB Metadata, DUP: total=1.12GiB, used=738.67MiB GlobalReserve, single: total=256.00MiB, used=0.00B Relevant kernel lines: root ~ $ journalctl -k | grep ^Dec\ 01\ 21\: Dec 01 21:10:01 watricky.invalid.co.za kernel: BTRFS info (device sdc1): relocating block group 46166704128 flags 36 Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): found 2194 extents Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): relocating block group 45059407872 flags 34 Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): found 1 extents Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 1 enospc errors during balance Original Message Subject: Cron /usr/bin/btrfs balance start -musage=90 / 2>&1 > /dev/null Date: Mon, 01 Dec 2014 21:10:03 +0200 From: (Cron Daemon) To: bren...@swiftspirit.co.za ERROR: error during balancing '/' - No space left on device There may be more info in syslog - try dmesg | tail -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
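For reference, the incrementing -musage/-dusage workaround mentioned above is easy to script so that each pass only touches chunks up to the given fill level; the mount point is a placeholder and the percentages are just a reasonable progression:

$ for pct in 0 5 10 20 30 40 55; do
      btrfs balance start -dusage=$pct -musage=$pct / || break
  done

Starting at 0 reclaims completely empty chunks almost for free; the later, more expensive passes only run if the earlier ones succeeded.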
Re: BTRFS equivalent for tune2fs?
On 2014/12/02 09:31, Brendan Hide wrote: On 2014/12/02 07:54, MegaBrutal wrote: Hi all, I know there is a btrfstune, but it doesn't provide all the functionality I'm thinking of. For ext2/3/4 file systems I can get a bunch of useful data with "tune2fs -l". How can I retrieve the same type of information about a BTRFS file system? (E.g., last mount time, last checked time, blocks reserved for superuser*, etc.) * Anyway, does BTRFS even have an option to reserve X% for the superuser? Btrfs does not yet have this option. I'm certain that specific feature is in mind for the future however. As regards other equivalents, the same/similar answer applies. There simply aren't a lot of tuneables available "right now". Almost forgot about this: btrfs property (get|set) Again, there are a lot of features still to be added. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
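For completeness, the property interface mentioned above looks like this (the paths are just examples):

$ btrfs property list /mnt
$ btrfs property get /mnt label
$ btrfs property set /mnt label "arch-btrfs-root"
$ btrfs property get /mnt/some-subvolume ro

It only exposes a handful of properties at this point (ro, label, compression, if memory serves), which rather underlines the point about how few tuneables exist "right now".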
Re: BTRFS equivalent for tune2fs?
On 2014/12/02 07:54, MegaBrutal wrote: Hi all, I know there is a btrfstune, but it doesn't provide all the functionality I'm thinking of. For ext2/3/4 file systems I can get a bunch of useful data with "tune2fs -l". How can I retrieve the same type of information about a BTRFS file system? (E.g., last mount time, last checked time, blocks reserved for superuser*, etc.) * Anyway, does BTRFS even have an option to reserve X% for the superuser? Btrfs does not yet have this option. I'm certain that specific feature is in mind for the future however. As regards other equivalents, the same/similar answer applies. There simply aren't a lot of tuneables available "right now". -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On 2014/11/25 18:47, David Sterba wrote: We could provide an interface for external applications that would make use of the strong checksums. Eg. external dedup, integrity db. The benefit here is that the checksum is always up to date, so there's no need to compute the checksums again. At the obvious cost. I can imagine some use-cases where you might even want more than one algorithm to be used and stored. Not sure if that makes me a madman, though. ;) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: add sha256 checksum option
On 2014/11/25 13:30, Liu Bo wrote: This is actually inspired by ZFS, which offers checksum functions ranging from the simple-and-fast fletcher2 to the slower-but-secure sha256. Back to btrfs, crc32c is the only choice. As for the slowness of sha256, Intel has a set of instructions for it, "Intel SHA Extensions", that may help a lot. I think the advantage will be in giving a choice with some strong suggestions. An example of such a suggestion: if sha256 is chosen on an old or "low-power" CPU, detect that the CPU doesn't support the appropriate acceleration instructions and print a warning at mount time or a warning-and-prompt at mkfs time. The default could even be changed based on the architecture - though I suspect crc32c is already a good default on most architectures. Allowing the choice gives flexibility where admins know it can be used optimally - and David's suggestion (separate thread) would be able to take advantage of that. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
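As a rough sketch of the detection idea - assuming the usual /proc/cpuinfo flag name for the Intel SHA Extensions is sha_ni - the userspace side of the warning would need little more than:

$ grep -m1 -o sha_ni /proc/cpuinfo || echo "no SHA extensions: sha256 checksums will fall back to a slower software implementation"

The in-kernel check would obviously use the cpufeature infrastructure rather than parsing /proc, but the principle is the same.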
Re: [PATCH v2 5/5] btrfs: enable swap file support
On 2014/11/25 00:03, Omar Sandoval wrote: [snip] The snapshot issue is a little tricker to resolve. I see a few options: 1. Just do the COW and hope for the best 2. As part of btrfs_swap_activate, COW any shared extents. If a snapshot happens while a swap file is active, we'll fall back to 1. 3. Clobber any swap file extents which are in a snapshot, i.e., always use the existing extent. I'm partial to 3, as it's the simplest approach, and I don't think it makes much sense for a swap file to be in a snapshot anyways. I'd appreciate any comments that anyone might have. Personally, 3 seems pragmatic - but not necessarily "correct". :-/ -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fixing Btrfs Filesystem Full Problems typo?
On 2014/11/23 03:07, Marc MERLIN wrote: On Sun, Nov 23, 2014 at 12:05:04AM +, Hugo Mills wrote: Which is correct? Less than or equal to 55% full. This confuses me. Does that mean that the fullest blocks do not get rebalanced? "Balance has three primary benefits: - free up some space for new allocations - change storage profile - balance/migrate data to or away from new or failing disks (the original purpose of balance) and one fringe benefit: - force a data re-write (good if you think your spinning-rust needs to re-allocate sectors) In the regular case where you're not changing the storage profile or migrating data between disks, there isn't much to gain from balancing full chunks - and it involves a lot of work. For SSDs, it is particularly bad for wear. For spinning rust it is merely a lot of unnecessary work. I guess I was under the mistaken impression that the more data you had the more you could be out of balance. A chunk is the part of a block group that lives on one device, so in RAID-1, every block group is precisely two chunks; in RAID-0, every block group is 2 or more chunks, up to the number of devices in the FS. A chunk is usually 1 GiB in size for data and 250 MiB for metadata, but can be smaller under some circumstances. Right. So, why would you rebalance empty chunks or near empty chunks? Don't you want to rebalance almost full chunks first, and work you way to less and less full as needed? Balancing empty chunks makes them available for re-allocation - so that is directly useful and light on workload. -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014/11/21 06:58, Zygo Blaxell wrote: You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes so the data that was "lost" was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it. All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough. WARNING: errors detected during scrubbing, corrected. [snip] scrub device /dev/sdb2 (id 2) done scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors error details: read=5 csum=5415 corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures. The difference here is that the issue only affects the one drive. This leaves the probable cause at: - the drive itself - the cable/ports with a negligibly-possible cause at the motherboard chipset. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
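Following on from that, a sketch of what scheduled self-tests could look like in /etc/smartd.conf (the schedule syntax is the standard smartd.conf regexp form, the device name is a placeholder), plus the manual equivalent:

# short self-test every day at 02:00, long self-test every Sunday at 03:00, mail on trouble
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../7/03) -m root

$ smartctl -t long /dev/sdb      # kick off a long self-test by hand
$ smartctl -l selftest /dev/sdb  # check the self-test log afterwards

That way smartd has recent self-test data to compare against instead of attributes that are 18 months stale.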
Re: systemd.setenv and a mount.unit
On 2014/11/20 00:48, Jakob Schürz wrote: Hi there! Another challenge... I'm using btrfs. So i make snapshots from my system. And in a script, I make a symlink (for example: @system.CURRENT and @system.LAST) for the current and the last snapshot. So i want to add 2 entries in grub2 from which i can boot into the current and the last snapshot. I tried to pass an environmental variable with systemd.setenv=BOOTSNAP=@system.CURRENT, and i have a mount-unit containing the option Options=defaults,nofail,subvol=archive-local/@system.$BOOTSNAP Perhaps I'm reading it incorrectly. I interpret this as: Options=defaults,nofail,subvol=archive-local/@system.@system.CURRENT -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
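If the variable genuinely isn't being expanded inside the mount unit (I don't believe Options= does shell-style substitution of kernel-command-line variables), one way to sidestep the problem entirely is to let the two grub entries differ only in the rootflags they pass and drop the mount-unit trickery. Hypothetical kernel lines, with device and kernel path as placeholders:

linux /vmlinuz-linux root=/dev/sdXn rw rootflags=subvol=archive-local/@system.CURRENT
linux /vmlinuz-linux root=/dev/sdXn rw rootflags=subvol=archive-local/@system.LAST

rootflags= hands the subvol option straight to the root mount, so the symlink-style switching happens in grub rather than in systemd.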
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014/11/18 14:08, Austin S Hemmelgarn wrote: [snip] there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write-caching. It's an old and replaceable disk - but if the cable replacement doesn't work I'll try this for kicks. :) The other thing I would suggest trying is a different data cable to the drive itself; I've had issues with some SATA cables (the cheap red ones you get in the retail packaging for some hard disks in particular) having either bad connectors, or bad strain-reliefs, and failing after only a few hundred hours of use. Thanks. I'll try this first. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014/11/18 09:36, Roman Mamedov wrote: On Tue, 18 Nov 2014 09:29:54 +0200 Brendan Hide wrote: Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output? Not necessarily the disk's fault, could be a SATA controller issue. How are your disks connected, which controller brand and chip? Add lspci output, at least if something other than the ordinary "to the motherboard chipset's built-in ports". In this case, yup, its directly to the motherboard chipset's built-in ports. This is a very old desktop, and the other 3 disks don't have any issues. I'm checking out the alternative pointed out by Austin. SATA-relevant lspci output: 00:1f.2 SATA controller: Intel Corporation 82801JD/DO (ICH10 Family) SATA AHCI Controller (rev 02) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
scrub implies failing drive - smartctl blissfully unaware
Hey, guys See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable but smartctl's output implies there are no issues. Is this somehow standard faire for S.M.A.R.T. output? Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached): ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 100 253 006Pre-fail Always - 0 5 Reallocated_Sector_Ct 0x0033 100 100 036Pre-fail Always - 1 7 Seek_Error_Rate 0x000f 086 060 030Pre-fail Always - 440801014 197 Current_Pending_Sector 0x0012 100 100 000Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000Old_age Offline - 0 199 UDMA_CRC_Error_Count0x003e 200 200 000Old_age Always - 0 200 Multi_Zone_Error_Rate 0x 100 253 000Old_age Offline - 0 202 Data_Address_Mark_Errs 0x0032 100 253 000Old_age Always - 0 Original Message Subject:Cron /usr/local/sbin/btrfs-scrub-all Date: Tue, 18 Nov 2014 04:19:12 +0200 From: (Cron Daemon) To: brendan@watricky WARNING: errors detected during scrubbing, corrected. [snip] scrub device /dev/sdb2 (id 2) done scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors error details: read=5 csum=5415 corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 [snip] smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.2-1-ARCH] (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.10 Device Model: ST3250410AS Serial Number:6RYF5NP7 Firmware Version: 4.AAA User Capacity:250,059,350,016 bytes [250 GB] Sector Size: 512 bytes logical/physical Device is:In smartctl database [for details use: -P show] ATA Version is: ATA/ATAPI-7 (minor revision not indicated) Local Time is:Tue Nov 18 09:16:03 2014 SAST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED See vendor-specific Attribute list for marginal Attributes. General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection:( 430) seconds. Offline data collection capabilities:(0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities:(0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability:(0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time:( 1) minutes. Extended self-test routine recommended polling time:( 64) minutes. SCT capabilities: (0x0001) SCT Status supported. 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   086   060   030    Pre-fail  Always       -       440801057
  9 Power_On_Hours
Re: BTRFS messes up snapshot LV with origin
On 2014/11/17 09:35, Daniel Dressler top-posted: If a UUID is not unique enough how will adding a second UUID or "unique drive identifier" help? A UUID is *supposed* to be unique by design. Isolated, the design is adequate. But the bigger picture clearly shows the design is naive. And broken. A second per-disk id (note I said "unique" - but I never said universal as in "UUID") would allow for better-defined behaviour where, presently, we're simply saying "current behaviour is undefined and you're likely to get corruption". On the other hand, I asked already if we have IDs of some sort (how else do we know which disk a chunk is stored on?), thus I don't think we need to add anything to the format. A simple scenario similar to the one the OP introduced: Disk sda -> says it is UUID Z with diskid 0 Disk sdb -> says it is UUID Z with diskid 0 If we're ignoring the fact that there are two disks with the same UUID and diskid and it causes corruption, then the kernel is doing something "stupid but fixable". We have some choices: - give a clear warning and ignore one of the disks (could just pick the first one - or be a little smarter and pick one based on some heuristic - for example extent generation number) - give a clear error and panic Normal multi-disk scenario: Disk sda -> UUID Z with diskid 1 Disk sdb -> UUID Z with diskid 2 These two disks are in the same filesystem and are supposed to work together - no issues. My second suggestion covers another scenario as well: Disk sda -> UUID Z with diskid 1; root block indicates that only diskid 1 is recorded as being part of the filesystem Disk sdb -> UUID Z with diskid 3; root block indicates that only diskid 3 is recorded as being part of the filesystem Again, based on the existing featureset, it seems reasonable that this information should already be recorded in the fs metadata. If the behaviour is "undefined" and causing corruption, again the kernel is currently doing something "stupid but fixable". Again, we have similar choices: - give a clear warning and ignore bad disk(s) - give a clear error and panic 2014-11-17 15:59 GMT+09:00 Brendan Hide : cc'd bug-g...@gnu.org for FYI On 2014/11/17 03:42, Duncan wrote: MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted: Hello guys, I think you'll like this... https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429 UUID is an initialism for "Universally Unique IDentifier".[1] If the UUID isn't unique, by definition, then, it can't be a UUID, and that's a bug in whatever is making the non-unique would-be UUID that isn't unique and thus cannot be a universally unique ID. In this case that would appear to be LVM. Perhaps the right question to ask is "Where should this bug be fixed?". TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being "out of scope". The "correct" fix probably lies in the ecosystem design, which requires co-operation from btrfs. Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job "exactly as expected". Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old "missing" disk re-introduced, etc. That leaves two places where this can be fixed: grub and btrfs Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. 
So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code. Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying "LVM snapshots are incompatible with btrfs" is the right way to go either. That leaves two aspects of this issue which I view as two separate bugs: a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. I feel a) is a btrfs bug. I feel b) is a bug that is more about "ecosystem design" than grub being silly. I imagine a couple of aspects that could help fix a): - Utilise a "unique drive identifier" in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. This will provide a sensible mechanism for "defined behaviour", preventing corruption - even if that "defined behaviour" is to simply give out lots of "PEBKAC" errors and panic. - Utilise a "drive list" to ensure tha
Re: BTRFS messes up snapshot LV with origin
cc'd bug-g...@gnu.org for FYI On 2014/11/17 03:42, Duncan wrote: MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted: Hello guys, I think you'll like this... https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429 UUID is an initialism for "Universally Unique IDentifier".[1] If the UUID isn't unique, by definition, then, it can't be a UUID, and that's a bug in whatever is making the non-unique would-be UUID that isn't unique and thus cannot be a universally unique ID. In this case that would appear to be LVM. Perhaps the right question to ask is "Where should this bug be fixed?". TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being "out of scope". The "correct" fix probably lies in the ecosystem design, which requires co-operation from btrfs. Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job "exactly as expected". Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old "missing" disk re-introduced, etc. That leaves two places where this can be fixed: grub and btrfs Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code. Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying "LVM snapshots are incompatible with btrfs" is the right way to go either. That leaves two aspects of this issue which I view as two separate bugs: a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. I feel a) is a btrfs bug. I feel b) is a bug that is more about "ecosystem design" than grub being silly. I imagine a couple of aspects that could help fix a): - Utilise a "unique drive identifier" in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. This will provide a sensible mechanism for "defined behaviour", preventing corruption - even if that "defined behaviour" is to simply give out lots of "PEBKAC" errors and panic. - Utilise a "drive list" to ensure that two unrelated filesystems with the same UUID cannot get "mixed up". Yes, the user/admin would likely be the culprit here (perhaps a VM rollout process that always gives out the same UUID in all its filesystems). Again, does btrfs not already have something like this built-in that we're simply not utilising fully? I'm not exactly sure of the "correct" way to fix b) except that I imagine it would be trivial to fix once a) is fixed. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
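As a defensive measure in the meantime, it is at least easy to check for the duplicate-UUID situation before activating snapshot LVs or re-running grub-mkconfig; the device names below are placeholders:

$ UUID=$(blkid -s UUID -o value /dev/vg0/root)
$ blkid -o device -t UUID="$UUID"
/dev/vg0/root
/dev/vg0/root-snap    # a second hit like this is exactly the ambiguous case described above

Anything beyond a single device in that output means grub (and the kernel's device scan) may latch onto either one.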
Re: btrfs check segfaults after flipping 2 Bytes
On 2014/10/02 07:51, Brendan Hide wrote: On 2014/10/02 01:31, Duncan wrote: [snip] I'm not sure if there is a mount option for this use case however. The option descriptions for "nodatasum" and "nodatacow" imply that *new* checksums are not generated. In this case the checksums already exist. Looks like btrfsck has a relevant option, albeit likely more destructive than absolutely necessary: --init-csum-tree create a new CRC tree. ^ Also, mail was sent as HTML 12 hours ago thus was never delivered. Thunderbird has been disciplined. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs check segfaults after flipping 2 Bytes
On 2014/10/02 01:31, Duncan wrote: Niklas Fischer posted on Wed, 01 Oct 2014 22:29:55 +0200 as excerpted: I was trying to determine how btrfs reacts to disk errors, when I discovered, that flipping two Bytes, supposedly inside of a file can render the filesystem unusable. Here is what I did: 1. dd if=/dev/zero of=/dev/sdg2 bs=1M 2. mkfs.btrfs /dev/sdg2 3. mount /dev/sdg2 /tmp/btrfs 4. echo "hello world this is some text" > /tmp/btrfs/hello 5. umount /dev/sdg2 Keep in mind that on btrfs, small enough files will not be written to file extents but instead will be written directly into the metadata. That's a small enough file I guess that's what you were seeing, which would explain the two instances of the string, since on a single device btrfs, metadata is dup mode by default. That metadata block would then fail checksum, and an attempt would be made to use the second copy, which of course would fail it the same way. At least a very unlikely scenario in production. And that being the only file in the filesystem, I'd /guess/ (not being a developer myself, just a btrfs testing admin and list regular) that metadata block is still the original one, which very likely contains critical filesystem information as well, thus explaining the mount failure when the block failed checksum verify. This is a possible use-case for an equivalent to ZFS's ditto blocks. An alternative strategy would be to purposefully "sparsify" early metadata blocks (this is thinking out loud - whether or not that is a viable or easy strategy is debatable). In theory at least, with a less synthetic test case there'd be enough more metadata on the filesystem that the affected metadata block would be further down the chain, and corrupting it wouldn't corrupt critical filesystem information as it wouldn't be in the same block. That might explain the problem, but I don't know enough about btrfs to know how reasonable a solution would be. [snip] A reasonable workaround to get the filesystem back into a usable or recoverable state might be to mount read-only and ignore checksums. That would keep the filesystem intact, though the system has no way to know whether or not the folder structures are also corrupt. I'm not sure if there is a mount option for this use case however. The option descriptions for "nodatasum" and "nodatacow" imply that *new* checksums are not generated. In this case the checksums already exist. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ideas for a feature implementation
On 2014/08/12 17:52, David Pottage wrote: [snip] ... if it does not then the file-system has broken the contract to secure delete a file when you asked it to. This is a technicality - and it has not necessarily "broken the contract". I think the correct thing to do would be to securely delete the metadata referring to that file. That would satisfy the concept that there is no evidence that the file ever existed in that location. The fact that it actually does still legitimately exist elsewhere is not a caveat - it is simply acting within standard behaviour. -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ENOSPC errors during balance
On 20/07/14 14:59, Duncan wrote: Marc Joliet posted on Sun, 20 Jul 2014 12:22:33 +0200 as excerpted: On the other hand, the wiki [0] says that defragmentation (and balancing) is optional, and the only reason stated for doing either is because they "will have impact on performance". Yes. That's what threw off the other guy as well. He decided to skip it for the same reason. If I had a wiki account I'd change it, but for whatever reason I tend to be far more comfortable writing list replies, sometimes repeatedly, than writing anything on the web, which I tend to treat as read-only. So I've never gotten a wiki account and thus haven't changed it, and apparently the other guy with the problem and anyone else that knows hasn't changed it either, so the conversion page still continues to underemphasize the importance of completing the conversion steps, including the defrag, in proper order. I've inserted information specific to this in the wiki. Others with wiki accounts, feel free to review: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3#Before_first_use -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
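For reference, the post-conversion sequence I documented boils down to something like the following sketch (device name is a placeholder; the saved-image subvolume is normally called ext2_saved, and should only be deleted once you are sure you won't roll back):

$ btrfs-convert /dev/sdX1
$ mount /dev/sdX1 /mnt
$ btrfs subvolume delete /mnt/ext2_saved
$ btrfs balance start /mnt
$ btrfs filesystem defragment -r /mnt

The wiki page spells out the reasoning behind each step and the correct ordering in more detail.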
Re: [PATCH] generic/017: skip invalid block sizes for btrfs
Not subscribed to fstests so not sure if this will reach that mailing list... I feel Takeuchi's instincts are right, even if the analysis *may* be wrong. As is it looks like there should be a btrfs) selector inside the case. On 23/06/14 12:48, Satoru Takeuchi wrote: Hi Filipe, (2014/06/23 19:28), Filipe David Borba Manana wrote: In btrfs the block size (called sector size in btrfs) can not be smaller then the page size. Therefore skip block sizes smaller then page size if the fs is btrfs, so that the test can succeed on btrfs (testing only with block sizes of 4kb on systems with a page size of 4Kb). Signed-off-by: Filipe David Borba Manana I consider it doesn't work since this test is not for Btrfs. Please see the following code. tests/generic/017: === for (( BSIZE = 1024; BSIZE <= 4096; BSIZE *= 2 )); do length=$(($BLOCKS * $BSIZE)) case $FSTYP in xfs) _scratch_mkfs -b size=$BSIZE >> $seqres.full 2>&1 ;; ext4) _scratch_mkfs -b $BSIZE >> $seqres.full 2>&1 ;; esac _scratch_mount >> $seqres.full 2>&1 === There is no btrfs here. This test was moved to shared/005 to generic/017 at 21723cdbf303e031d6429f67fec9768750a5db7d. Original supported fs is here. === supported_fs xfs ext4 === I suspect that Lukas moved this test to generic/ by mistake or forgot to add "$FSTYP == btrfs" case. Thanks, Satoru --- tests/generic/017 | 8 1 file changed, 8 insertions(+) diff --git a/tests/generic/017 b/tests/generic/017 index 13b7254..6495be5 100755 --- a/tests/generic/017 +++ b/tests/generic/017 @@ -51,6 +51,14 @@ BLOCKS=10240 for (( BSIZE = 1024; BSIZE <= 4096; BSIZE *= 2 )); do + # btrfs doesn't support block size smaller then page size + if [ "$FSTYP" == "btrfs" ]; then + if (( $BSIZE < `getconf PAGE_SIZE` )); then + echo "80" + continue + fi + fi + length=$(($BLOCKS * $BSIZE)) case $FSTYP in xfs) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
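Something along these lines is presumably what that selector would look like - an untested sketch, assuming _scratch_mkfs passes the option through to mkfs.btrfs (whose sector size switch is -s), and only sensible once Filipe's page-size skip is in place so $BSIZE is never below the page size:

case $FSTYP in
xfs)
	_scratch_mkfs -b size=$BSIZE >> $seqres.full 2>&1
	;;
ext4)
	_scratch_mkfs -b $BSIZE >> $seqres.full 2>&1
	;;
btrfs)
	_scratch_mkfs -s $BSIZE >> $seqres.full 2>&1
	;;
esac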
Re: [PATCH v4] lib: add size unit t/p/e to memparse
On 13/06/14 03:42, Gui Hecheng wrote: For modern filesystems such as btrfs, t/p/e size level operations are common. add size unit t/p/e parsing to memparse Signed-off-by: Gui Hecheng --- changelog v1->v2: replace kilobyte with kibibyte, and others v2->v3: add missing unit "bytes" in comment v3->v4: remove idiotic name for K,M,G,P,T,E --- lib/cmdline.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index d4932f7..76a712e 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -121,11 +121,7 @@ EXPORT_SYMBOL(get_options); *@retptr: (output) Optional pointer to next char after parse completes * *Parses a string into a number. The number stored at @ptr is - * potentially suffixed with %K (for kilobytes, or 1024 bytes), - * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or - * 1073741824). If the number is suffixed with K, M, or G, then - * the return value is the number multiplied by one kilobyte, one - * megabyte, or one gigabyte, respectively. + * potentially suffixed with K, M, G, T, P, E. */ unsigned long long memparse(const char *ptr, char **retptr) @@ -135,6 +131,15 @@ unsigned long long memparse(const char *ptr, char **retptr) unsigned long long ret = simple_strtoull(ptr, &endptr, 0); switch (*endptr) { + case 'E': + case 'e': + ret <<= 10; + case 'P': + case 'p': + ret <<= 10; + case 'T': + case 't': + ret <<= 10; case 'G': case 'g': ret <<= 10; Ah, I see - you've removed all reference to their names. That's good too. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] lib: add size unit t/p/e to memparse
On 12/06/14 23:15, Andrew Morton wrote: On Wed, 2 Apr 2014 16:54:37 +0800 Gui Hecheng wrote: For modern filesystems such as btrfs, t/p/e size level operations are common. add size unit t/p/e parsing to memparse Signed-off-by: Gui Hecheng --- changelog v1->v2: replace kilobyte with kibibyte, and others v2->v3: add missing unit "bytes" in comment --- lib/cmdline.c | 25 - 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/lib/cmdline.c b/lib/cmdline.c index eb67911..511b9be 100644 --- a/lib/cmdline.c +++ b/lib/cmdline.c @@ -119,11 +119,17 @@ char *get_options(const char *str, int nints, int *ints) *@retptr: (output) Optional pointer to next char after parse completes * *Parses a string into a number. The number stored at @ptr is - * potentially suffixed with %K (for kilobytes, or 1024 bytes), - * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or - * 1073741824). If the number is suffixed with K, M, or G, then - * the return value is the number multiplied by one kilobyte, one - * megabyte, or one gigabyte, respectively. + * potentially suffixed with + * %K (for kibibytes, or 1024 bytes), + * %M (for mebibytes, or 1048576 bytes), + * %G (for gibibytes, or 1073741824 bytes), + * %T (for tebibytes, or 1099511627776 bytes), + * %P (for pebibytes, or 1125899906842624 bytes), + * %E (for exbibytes, or 1152921504606846976 bytes). I'm afraid I find these names quite idiotic - we all know what the traditional terms mean so why go and muck with it. Also, kibibytes sounds like cat food. Hi, Andrew While I agree it sounds like cat food, it seemed like a good opportunity to fix a minor issue that is otherwise unlikely to be fixed for a very long time. Should we feel uncomfortable with the patch, as is, because of language/correctness friction? Pedantry included, the patch is correct. ;) Thanks -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What to do about snapshot-aware defrag
On 2014/05/31 12:00 AM, Martin wrote: OK... I'll jump in... On 30/05/14 21:43, Josef Bacik wrote: [snip] Option 1: Only relink inodes that haven't changed since the snapshot was taken. Pros: -Faster -Simpler -Less duplicated code, uses existing functions for tricky operations so less likely to introduce weird bugs. Cons: -Could possibly lost some of the snapshot-awareness of the defrag. If you just touch a file we would not do the relinking and you'd end up with twice the space usage. [...] Obvious way to go for fast KISS. I second this - KISS is better. Would in-band dedupe resolve the issue with losing the "snapshot-awareness of the defrag"? I figure that if someone absolutely wants everything deduped efficiently they'd put in the necessary resources (memory/dedicated SSD/etc) to have in-band dedupe work well. One question: Will option one mean that we always need to mount with noatime or read-only to allow snapshot defragging to do anything? That is a very good question. I very rarely have mounts without noatime - and usually only because I hadn't thought of it. Regards, Martin -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl
On 2014/05/21 06:15 AM, Qu Wenruo wrote: [snip] Further on top of your check_missing patch I am writing code to handle disk reappear. I should be sending them all soon. The disk-reappear problem is also reproduced here. I am interested in how your patch will deal with it. Is your patch going to check the superblock generation to determine the previously missing device and wipe the reappeared superblock? (Wang mentioned it in the mail in Jan.) With md we have the bitmap feature that helps prevent resynchronising the entire disk when doing a "re-add". Wiping the superblock is *better* than what we currently have (corruption) - but hopefully the end goal is to be able to have it re-add *without* introducing corruption. IMO the reappeared-disk problem can also be resolved by not swapping tgtdev->uuid and srcdev->uuid, which means tgtdev will not use the same uuid as srcdev. Thanks, Qu Thanks, Anand -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
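Until the kernel handles this, the manual way to defuse a reappeared device is to clear its stale signature before it can be scanned back in - a sketch with a hypothetical device and mount point:

$ wipefs --all --types btrfs /dev/sdf   # clear the stale (primary) btrfs superblock signature
$ btrfs device add /dev/sdf /mnt/pool   # only if the disk should actually rejoin the pool as a fresh device

That avoids the situation where two devices both claim the same fsid and devid, at the cost of a full rebuild onto the returning disk.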
Re: ditto blocks on ZFS
On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote: On 2014-05-19 22:07, Russell Coker wrote: [snip] As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data. But the proposed changes for n-way replication would already handle this. [snip] Russell's specific request above is probably best handled by being able to change replication levels per subvolume - this won't be handled by N-way replication. Extra replication on leaf nodes will make relatively little difference in the scenarios laid out in this thread - but on "trunk" nodes (folders or subvolumes closer to the filesystem root) it makes a significant difference. "Plain" N-way replication doesn't flexibly treat these two nodes differently. As an example, Russell might have a server with two disks - yet he wants 6 copies of all metadata for subvolumes and their immediate subfolders. At three folders deep he "only" wants to have 4 copies. At six folders deep, only 2. Ditto blocks add an attractive safety net without unnecessarily doubling or tripling the size of *all* metadata. It is a good idea. The next question to me is whether or not it is something that can be implemented elegantly and whether or not a talented *dev* thinks it is a good idea. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ditto blocks on ZFS
On 2014/05/19 10:36 PM, Martin wrote: On 18/05/14 17:09, Russell Coker wrote: On Sat, 17 May 2014 13:50:52 Martin wrote: [...] Do you see or measure any real advantage? [snip] This is extremely difficult to measure objectively. Subjectively ... see below. [snip] *What other failure modes* should we guard against? I know I'd sleep a /little/ better at night knowing that a double disk failure on a "raid5/1/10" configuration might ruin a ton of data along with an obscure set of metadata in some "long" tree paths - but not the entire filesystem. The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: send/receive and bedup
On 19/05/14 15:00, Scott Middleton wrote: On 19 May 2014 09:07, Marc MERLIN wrote: On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote: I read so much about BtrFS that I mistaked Bedup with Duperemove. Duperemove is actually what I am testing. I'm currently using programs that find files that are the same, and hardlink them together: http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html hardlink.py actually seems to be the faster (memory and CPU) one event though it's in python. I can get others to run out of RAM on my 8GB server easily :( Interesting app. An issue with hardlinking (with the backups use-case, this problem isn't likely to happen), is that if you modify a file, all the hardlinks get changed along with it - including the ones that you don't want changed. @Marc: Since you've been using btrfs for a while now I'm sure you've already considered whether or not a reflink copy is the better/worse option. Bedup should be better, but last I tried I couldn't get it to work. It's been updated since then, I just haven't had the chance to try it again since then. Please post what you find out, or if you have a hardlink maker that's better than the ones I found :) Thanks for that. I may be completely wrong in my approach. I am not looking for a file level comparison. Bedup worked fine for that. I have a lot of virtual images and shadow protect images where only a few megabytes may be the difference. So a file level hash and comparison doesn't really achieve my goals. I thought duperemove may be on a lower level. https://github.com/markfasheh/duperemove "Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on a block by block basis and compare those hashes to each other, finding and categorizing extents that match each other. When given the -d option, duperemove will submit those extents for deduplication using the btrfs-extent-same ioctl." It defaults to 128k but you can make it smaller. I hit a hurdle though. The 3TB HDD I used seemed OK when I did a long SMART test but seems to die every few hours. Admittedly it was part of a failed mdadm RAID array that I pulled out of a clients machine. The only other copy I have of the data is the original mdadm array that was recently replaced with a new server, so I am loathe to use that HDD yet. At least for another couple of weeks! I am still hopeful duperemove will work. Duperemove does look exactly like what you are looking for. The last traffic on the mailing list regarding that was in August last year. It looks like it was pulled into the main kernel repository on September 1st. The last commit to the duperemove application was on April 20th this year. Maybe Mark (cc'd) can provide further insight on its current status. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
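For the archives, the invocation being tested boils down to something like this (path is a placeholder): -r walks the tree recursively and -d actually submits the matching extents through the extent-same ioctl described above rather than just reporting them, and there is a block-size option (-b) if the 128k default is too coarse.

$ duperemove -dr /mnt/images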
Re: RAID-1 - suboptimal write performance?
On 2014/05/16 11:36 PM, Austin S Hemmelgarn wrote: On 05/16/2014 04:41 PM, Tomasz Chmielewski wrote: On Fri, 16 May 2014 14:06:24 -0400 Calvin Walton wrote: No comment on the performance issue, other than to say that I've seen similar on RAID-10 before, I think. Also, what happens when the system crashes, and one drive has several hundred megabytes data more than the other one? This shouldn't be an issue as long as you occasionally run a scrub or balance. The scrub should find it and fix the missing data, and a balance would just rewrite it as proper RAID-1 as a matter of course. It's similar (writes to just one drive, while the other is idle) when removing (many) snapshots. Not sure if that's optimal behaviour. [snip] Ideally, BTRFS should dispatch the first write for a block in a round-robin fashion among available devices. This won't fix the underlying issue, but it will make it less of an issue for BTRFS. More ideally, btrfs should dispatch them in parallel. This will likely be looked into for N-way mirroring. Having 3 or more copies and working in the current way would be far from optimal. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/27] Replace the old man page with asciidoc and man page for each btrfs subcommand.
On 2014/05/18 02:05 PM, Hugo Mills wrote: On Sun, May 18, 2014 at 03:04:33PM +0800, Qu Wenruo wrote: I don't have any real suggestions for alternatives coming from my experience, other than "not this". I've used docbook for man pages briefly, many years ago. Looking around on the web, reStructuredText might be a good option. Personally, I'd like to write docs in LaTeX, but I'm not sure how easy it is to convert that to man pages. Hugo. What I have read so far indicates that LaTeX is the simplest most beautiful way to create portable documentation - and that exporting to a man page is simple. I can't vouch for it except to say that it is worth investigating. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: staggered stripes
On 2014/05/15 04:38 PM, Russell Coker wrote: On Thu, 15 May 2014 09:31:42 Duncan wrote: Does the BTRFS RAID functionality do such staggered stripes? If not could it be added? AFAIK nothing like that yet, but it's reasonably likely to be implemented later. N-way-mirroring is roadmapped for next up after raid56 completion, however. It's RAID-5/6 when we really need such staggering. It's a reasonably common configuration choice to use two different brands of disk for a RAID-1 array. As the correlation between parts of the disks with errors only applied to disks of the same make and model (and this is expected due to firmware/manufacturing issues) the people who care about such things on RAID-1 have probably already dealt with the issue. You do mention the partition alternative, but not as I'd do it for such a case. Instead of doing a different sized buffer partition (or using the mkfs.btrfs option to start at some offset into the device) on each device, I'd simply do multiple partitions and reorder them on each device. If there are multiple partitions on a device then that will probably make performance suck. Also does BTRFS even allow special treatment of them or will it put two copies from a RAID-10 on the same disk? I suspect the approach is similar to the following: sd[abcd][1234] each configured as LVM PVs sda[1234] as an LVM VG sdb[2345] as an LVM VG sdc[3456] as an LVM VG sdd[4567] as an LVM VG btrfs across all four VGs ^ Um - the above is ignoring "DOS"-style partition limitations Tho N-way-mirroring would sure help here too, since if a given area around the same address is assumed to be weak on each device, I'd sure like greater than the current 2-way-mirroring, even if if I had a different filesystem/partition at that spot on each one, since with only two-way-mirroring if one copy is assumed to be weak, guess what, you're down to only one reasonably reliable copy now, and that's not a good spot to be in if that one copy happens to be hit by a cosmic ray or otherwise fail checksum, without another reliable copy to fix it since that other copy is in the weak area already. Another alternative would be using something like mdraid's raid10 "far" layout, with btrfs on top of that... In the "copies= option" thread Brendan Hide stated that this sort of thing is planned. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
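To make that sketch concrete, here is one possible reading of it - hypothetical throughout: each disk carved into seven equal GPT partitions (sidestepping the "DOS"-style limit), one VG per physical disk but built from a staggered window of partitions, one LV per VG, and btrfs across the four LVs:

$ pvcreate /dev/sda[1-4] /dev/sdb[2-5] /dev/sdc[3-6] /dev/sdd[4-7]
$ vgcreate vg_a /dev/sda[1-4]
$ vgcreate vg_b /dev/sdb[2-5]
$ vgcreate vg_c /dev/sdc[3-6]
$ vgcreate vg_d /dev/sdd[4-7]
$ for vg in vg_a vg_b vg_c vg_d; do lvcreate -l 100%FREE -n stripe "$vg"; done
$ mkfs.btrfs -m raid10 -d raid10 /dev/vg_a/stripe /dev/vg_b/stripe /dev/vg_c/stripe /dev/vg_d/stripe

Each VG still maps to exactly one spindle, so btrfs's redundancy assumptions hold, while the same logical offset lands on a different physical region of every disk - the unused partitions are the price of the stagger.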
Re: [PATCH] mkfs.btrfs: allow UUID specification at mkfs time
On 14/05/14 09:31, Wang Shilong wrote: On 05/14/2014 09:18 AM, Eric Sandeen wrote: Allow the specification of the filesystem UUID at mkfs time. (Implemented only for mkfs.btrfs, not btrfs-convert). Just out of curiosity, this option is used for what kind of use case? I notice Ext4 also has this option.:-) Personally I can't think of any "average" or "normal" use case. The simplest case however is in using predictable/predetermined UUIDs. Certain things, such as testing or perhaps even large-scale automation, are likely simpler to implement with a predictable UUID. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
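A trivial sketch of the predictable-UUID case - placeholder device, and assuming the option ends up spelled -U as it is for mke2fs - where provisioning wants to reference the UUID before the filesystem even exists:

$ FS_UUID=$(uuidgen)
$ mkfs.btrfs -U "$FS_UUID" /dev/sdX1
$ echo "UUID=$FS_UUID /data btrfs defaults 0 0" >> /etc/fstab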
Re: -musage=>0 means always reporting relocation
On 2014/05/11 11:52 AM, Russell Coker wrote: On Sun, 11 May 2014, Russell Coker wrote: Below is the output of running a balance a few times on a 120G SSD. Sorry forgot to mention that's kernel 3.14.1 Debian package. Please send the output of the following command: btrfs fi df / This will give more information on your current chunk situation. I suspect this is a case where a system chunk (which is included when specifying metadata) is being reported for relocation but not actually relocated. This is a bug that I believe was already fixed, though I'm not sure in which version. The pathological case is where you have a chunk that is 1% full and *every* other in-use chunk on the device is 100% full. In that situation, a balance will simply move that data into a new chunk (which will only ever reach 1% full). Thus, all subsequent balances will relocate that same data again to another new chunk. -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How does "btrfs fi show" show full?
On 2014/05/07 09:59 AM, Marc MERLIN wrote: [snip] Did I get this right? I'm not sure I did, since it seems the bigger the -dusage number, the more work balance has to do. If I asked -dsuage=85, it would do all chunks that are more than 15% full? -dusage=85 balances all chunks that up to 85% full. The higher the number, the more work that needs to be done. So, do I need to change the text above to say "more than 45% full" ? More generally, does it not make sense to just use the same percentage in -dusage than the percentage of total filesytem full? Thanks, Marc Separately, Duncan has made me realise my "halfway up" algorithm is not very good - it was probably just "good enough" at the time and worked "well enough" that I wasn't prompted to analyse it further. Doing a simulation with randomly-semi-filled chunks, "df" at 55%, and chunk utilisation at 86%, -dusage=55 balances 30% of the chunks, almost perfectly bringing chunk utilisation down to 56%. In my algorithm I would have used -dusage=70 which in my simulation would have balanced 34% of the chunks - but bringing chunk utilisation down to 55% - a bit of wasted effort and unnecessary SSD wear. I think now that I need to experiment with a much lower -dusage value and perhaps to repeat the balance with the df value (55 in the example) if the chunk usage is still too high. Getting an optimal first value algorithmically might prove a challenge - I might just end up picking some arbitrary percentage point below the df value. Pathological use-cases still apply however (for example if all chunks except one are exactly 54% full). The up-side is that if the algorithm is applied regularly (as in scripted and scheduled) then the situation will always be that the majority of chunks are going to be relatively full, avoiding the pathological use-case. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please review and comment, dealing with btrfs full issues
Hi, Marc. Inline below. :) On 2014/05/06 02:19 PM, Marc MERLIN wrote: On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: "In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1" "-dusage=50" will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? As usual, it depends on what end-result you want. Paranoid rebalancing - always ensuring there are as many free chunks as possible - is totally unnecessary. There may be more good reasons to rebalance - but I'm only aware of two: a) to avoid ENOSPC due to running out of free chunks; and b) to change allocation type. If you want all chunks either full or empty (except for that last chunk which will be somewhere inbetween), -dusage=55 will get you 99% there. In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% and less used, which would by necessity get about ~160GB worth chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. On Mon, May 05, 2014 at 07:09:22PM +0200, Brendan Hide wrote: Forgot this part: Also in your last example, you used "-dusage=0" and it balanced 91 chunks. That means you had 91 empty or very-close-to-empty chunks. ;) Correct. That FS was very mis-balanced. On Mon, May 05, 2014 at 02:36:09PM -0400, Calvin Walton wrote: The standard response on the mailing list for this issue is to temporarily add an additional device to the filesystem (even e.g. a 4GB USB flash drive is often enough) - this will add space to allocate a few new chunks, allowing the balance to proceed. You can remove the extra device after the balance completes. I just added that tip, thank you. On Tue, May 06, 2014 at 02:41:16PM +1000, Russell Coker wrote: Recently kernel 3.14 allowed fixing a metadata space error that seemed to be impossible to solve with 3.13. So it's possible that some of my other problems with a lack of metadata space could have been solved with kernel 3.14 too. Good point. I added that tip too. Thanks, Marc -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
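On the ENOSPC point, the temporary-device workaround Calvin mentions looks roughly like this (device name and mount point are placeholders; any spare device of a few GB will do):
$ sudo btrfs device add /dev/sdX /mnt/btrfs_pool1
$ sudo btrfs balance start -dusage=0 /mnt/btrfs_pool1
$ sudo btrfs balance start -dusage=55 /mnt/btrfs_pool1
$ sudo btrfs device delete /dev/sdX /mnt/btrfs_pool1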
Re: Thoughts on RAID nomenclature
On 2014/05/05 11:17 PM, Hugo Mills wrote: A passing remark I made on this list a day or two ago set me to thinking. You may all want to hide behind your desks or in a similar safe place away from the danger zone (say, Vladivostok) at this point... I feel like I can brave some "mild horrors". Of course, my C skills aren't up to scratch so its all just bravado. ;) If we switch to the NcMsPp notation for replication, that comfortably describes most of the plausible replication methods, and I'm happy with that. But, there's a wart in the previous proposition, which is putting "d" for 2cd to indicate that there's a DUP where replicated chunks can go on the same device. This was the jumping-off point to consider chunk allocation strategies in general. At the moment, we have two chunk allocation strategies: "dup" and "spread" (for want of a better word; not to be confused with the ssd_spread mount option, which is a whole different kettle of borscht). The dup allocation strategy is currently only available for 2c replication, and only on single-device filesystems. When a filesystem with dup allocation has a second device added to it, it's automatically upgraded to spread. I thought this step was manual - but okay! :) The general operation of the chunk allocator is that it's asked for locations for n chunks for a block group, and makes a decision about where those chunks go. In the case of spread, it sorts the devices in decreasing order of unchunked space, and allocates the n chunks in that order. For dup, it allocates both chunks on the same device (or, generalising, may allocate the chunks on the same device if it has to). Now, there are other variations we could consider. For example: - linear, which allocates on the n smallest-numbered devices with free space. This goes halfway towards some people's goal of minimising the file fragments damaged in a device failure on a 1c FS (again, see (*)). [There's an open question on this one about what happens when holes open up through, say, a balance.] - grouped, which allows the administrator to assign groups to the devices, and allocates each chunk from a different group. [There's a variation here -- we could look instead at ensuring that different _copies_ go in different groups.] Given these four (spread, dup, linear, grouped), I think it's fairly obvious that spread is a special case of grouped, where each device is its own group. Then dup is the opposite of grouped (i.e. you must have one or the other but not both). Finally, linear is a modifier that changes the sort order. All of these options run completely independently of the actual replication level selected, so we could have 3c:spread,linear (allocates on the first three devices only, until one fills up and then it moves to the fourth device), or 2c2s:grouped, with a device mapping {sda:1, sdb:1, sdc:1, sdd:2, sde:2, sdf:2} which puts different copies on different device controllers. Does this all make sense? Are there any other options or features that we might consider for chunk allocation at this point? Having had a look at the chunk allocator, I think most if not all of this is fairly easily implementable, given a sufficiently good method of describing it all, which is what I'm trying to get to the bottom of in this discussion. I think I get most of what you're saying. If its not too difficult, perhaps you could update (or duplicate to another URL) your /btrfs-usage/ calculator to reflect the idea. 
It'd definitely make it easier for everyone (including myself) to know we're on the same page. I like the idea that the administrator would have more granular control over where data gets allocated first or where copies "belong". "Splicing" data to different controllers as you mentioned can help with both redundancy and performance. Note: I've always thought of dup as a special form of "spread" where we just write things out twice - but yes, there's no need for it to be compatible with any other allocation type. Hugo. (*) The missing piece here is to deal with extent allocation in a similar way, which would offer better odds again on the number of files damaged in a device-loss situation on a 1c FS. This is in general a much harder problem, though. The only change we have in this area at the moment is ssd_spread, which doesn't do very much. It also has the potential for really killing performance and/or file fragmentation. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please review and comment, dealing with btrfs full issues
On 05/05/14 19:07, Brendan Hide wrote: On 05/05/14 14:16, Marc MERLIN wrote: I've just written this new page: http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html First, are there problems in it? Second, are there other FS full issues I should mention in it? Thanks, Marc "In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1" "-dusage=50" will balance all chunks that are 50% *or less* used, not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% and less used, which would by necessity get about ~160GB worth chunks back out of data and available for re-use. The issue I'm not sure of how to get through is if you can't balance *because* of ENOSPC errors. I'd probably start scouring the mailing list archives if I ever come across that. Forgot this part: Also in your last example, you used "-dusage=0" and it balanced 91 chunks. That means you had 91 empty or very-close-to-empty chunks. ;) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How does "btrfs fi show" show full?
On 05/05/14 07:50, Marc MERLIN wrote: On Mon, May 05, 2014 at 06:11:28AM +0200, Brendan Hide wrote: The "per-device" used amount refers to the amount of space that has been allocated to chunks. That first one probably needs a balance. Btrfs doesn't behave very well when available diskspace is so low due to the fact that it cannot allocate any new chunks. An attempt to allocate a new chunk will result in ENOSPC errors. The "Total" bytes used refers to the total actual data that is stored. Right. So 'Total used' is what I'm really using, whereas 'devid used' is actually what is being used due to the way btrfs doesn't seem to reclaim chunks after they're not used anymore, or some similar problem. In the second FS: Label: btrfs_pool1 uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6 Total devices 1 FS bytes used 442.17GiB devid1 size 865.01GiB used 751.04GiB path /dev/mapper/cryptroot The difference is huge between 'Total used' and 'devid used'. Is btrfs going to fix this on its own, or likely not and I'm stuck doing a full balance (without filters since I'm balancing data and not metadata)? If that helps. legolas:~# btrfs fi df /mnt/btrfs_pool1 Data, single: total=734.01GiB, used=435.29GiB System, DUP: total=8.00MiB, used=96.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=8.50GiB, used=6.74GiB Metadata, single: total=8.00MiB, used=0.00 Thanks, Marc What I typically do in snapshot cleanup scripts is to use the usage= filter with the percentage being a simple algorithm of "halfway between actual data and maximum chunk allocation I'm comfortable with". As an example, one of my servers' diskspace was at 65% and last night the chunk allocation reached the 90% mark. So it automatically ran the balance with "dusage=77". This would have cleared out about half of the "unnecessary" chunks. Balance causes a lot of load with spinning rust - of course, after-hours, nobody really cares. With SSDs it causes wear. This is just one method I felt was a sensible way for me to avoid ENOSPC issues while also ensuring I'm not rebalancing the entire system unnecessarily. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
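A minimal sketch of that "halfway up" rule (run as root; in a real script both percentages would be parsed from btrfs fi show / btrfs fi df output rather than hard-coded - the numbers here are just the 65%/90% example above):
#!/bin/bash
data_pct=65     # actual data, as a percentage of the filesystem size
alloc_pct=90    # chunk allocation, as a percentage of the filesystem size
dusage=$(( (data_pct + alloc_pct) / 2 ))    # 77 in this example
ionice -c3 btrfs balance start -dusage=$dusage /mnt/btrfs_pool1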
Re: Using mount -o bind vs mount -o subvol=vol
On 05/05/14 06:36, Roman Mamedov wrote: On Mon, 05 May 2014 06:13:30 +0200 Brendan Hide wrote: 1) There will be a *very* small performance penalty (negligible, really) Oh, really, it's slower to mount the device directly? Not that I really care, but that's unexpected. Um ... the penalty is if you're mounting indirectly. ;) I feel that's on about the same scale as giving your files shorter filenames, "so that they open faster". Or have you looked at the actual kernel code with regard to how it's handled, or maybe even have any benchmarks, other than a general thought of "it's indirect, so it probably must be slower"? My apologies - not everyone here is a native English-speaker. You are 100% right, though. The scale is very small. By negligible, the "penalty" is at most a few CPU cycles. When compared to the wait time on a spindle, it really doesn't matter much. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Using mount -o bind vs mount -o subvol=vol
On 2014/05/05 02:56 AM, Marc MERLIN wrote: On Sun, May 04, 2014 at 09:07:55AM +0200, Brendan Hide wrote: On 2014/05/04 02:47 AM, Marc MERLIN wrote: Is there any functional difference between mount -o subvol=usr /dev/sda1 /usr and mount /dev/sda1 /mnt/btrfs_pool mount -o bind /mnt/btrfs_pool/usr /usr ? Thanks, Marc There are two "issues" with this. 1) There will be a *very* small performance penalty (negligible, really) Oh, really, it's slower to mount the device directly? Not that I really care, but that's unexpected. Um ... the penalty is if you're mounting indirectly. ;) 2) Old snapshots and other supposedly-hidden subvolumes will be accessible under /mnt/btrfs_pool. This is a minor security concern (which of course may not concern you, depending on your use-case). There are a few similar minor security concerns - the recently-highlighted issue with old snapshots is the potential that old vulnerable binaries within a snapshot are still accessible and/or executable. That's a fair point. I can of course make that mountpoint 0700, but it's a valid concern in some cases (not for me though). So thanks for confirming my understanding, it sounds like both are valid and if you're already mounting the main pool like I am, that's the easiest way. Thanks, Marc All good. :) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How does "btrfs fi show" show full?
On 2014/05/05 02:54 AM, Marc MERLIN wrote: More slides, more questions, sorry :) (thanks for the other answers, I'm still going through them) If I have: gandalfthegreat:~# btrfs fi show Label: 'btrfs_pool1' uuid: 873d526c-e911-4234-af1b-239889cd143d Total devices 1 FS bytes used 214.44GB devid1 size 231.02GB used 231.02GB path /dev/dm-0 I'm a bit confused. It tells me 1) FS uses 214GB out of 231GB 2) Device uses 231GB out of 231GB I understand how the device can use less than the FS if you have multiple devices that share a filesystem. But I'm not sure how a filesystem can use less than what's being used on a single device. Similarly, my current laptop shows: legolas:~# btrfs fi show Label: btrfs_pool1 uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6 Total devices 1 FS bytes used 442.17GiB devid1 size 865.01GiB used 751.04GiB path /dev/mapper/cryptroot So, am I 100GB from being full, or am I really only using 442GB out of 865GB? If so, what does the device used value really mean if it can be that much higher than the filesystem used value? Thanks, Marc The "per-device" used amount refers to the amount of space that has been allocated to chunks. That first one probably needs a balance. Btrfs doesn't behave very well when available diskspace is so low due to the fact that it cannot allocate any new chunks. An attempt to allocate a new chunk will result in ENOSPC errors. The "Total" bytes used refers to the total actual data that is stored. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
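Reading the second filesystem's numbers as a rough worked example (figures are from the output quoted above; the arithmetic ignores metadata DUP overhead, so it is only approximate):
unallocated space              = devid size - devid used   = 865.01GiB - 751.04GiB ~= 114GiB
allocated but not holding data = devid used - FS bytes used = 751.04GiB - 442.17GiB ~= 309GiB
A filtered balance can hand most of that ~309GiB back as unallocated space; the ~114GiB was never allocated to chunks in the first place.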
Re: Copying related snapshots to another server with btrfs send/receive?
On 2014/05/04 09:28 AM, Marc MERLIN wrote: On Sun, May 04, 2014 at 09:16:02AM +0200, Brendan Hide wrote: Sending one-at-a-time, the shared-data relationship will be kept by using the -p (parent) parameter. Send will only send the differences and receive will create a new snapshot, adjusting for those differences, even when the receive is run on a remote server. $ btrfs send backup | btrfs receive $path/ $ btrfs send -p backup backup.sav1 | btrfs receive $path/ $ btrfs send -p backup.sav1 backup.sav2 | btrfs receive $path/ $ btrfs send -p backup.sav2 backup.sav3 | btrfs receive $path/ $ btrfs send -p backup.sav3 backup.sav4 | btrfs receive $path/ So this is exactly the same than what I do incremental backups with brrfs send, but -p only works if the snapshot is read only, does it not? I do use that for my incremental syncs and don't mind read only snapshots there, but if I have read/write snapshots that are there for other reasons than btrfs send incrementals, can I still send them that way with -p? (I thought that wouldn't work) Thanks, Marc Yes, -p (parent) and -c (clone source) are the only ways I'm aware of to push subvolumes across while ensuring data-sharing relationship remains intact. This will end up being much the same as doing incremental backups: From the man page section on -c: "You must not specify clone sources unless you guarantee that these snapshots are exactly in the same state on both sides, the sender and the receiver. It is allowed to omit the '-p ' option when '-c ' options are given, in which case 'btrfs send' will determine a suitable parent among the clone sources itself." -p does require that the sources be read-only. I suspect -c does as well. This means that it won't be so simple as you want your sources to be read-write. Probably the only way then would be to make read-only snapshots whenever you want to sync these over while also ensuring that you keep at least one read-only snapshot intact - again, much like incremental backups. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
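As a minimal sketch of that workflow (the "_ro" names and the remote path are placeholders, and backup.sav4_ro must already exist, unmodified, on both sides):
$ btrfs subvolume snapshot -r backup backup.sav5_ro      # read-only copy kept purely for sending
$ btrfs send -p backup.sav4_ro backup.sav5_ro | ssh backuphost btrfs receive /pool/
On the receiving side, a writable snapshot can then be taken from /pool/backup.sav5_ro if one is wanted there too.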
Re: Is metadata redundant over more than one drive with raid0 too?
On 2014/05/04 09:24 AM, Marc MERLIN wrote: On Sun, May 04, 2014 at 08:57:19AM +0200, Brendan Hide wrote: Hi, Marc Raid0 is not redundant in any way. See inline below. Thanks for clearing things up. But now I have 2 questions 1) btrfs has two copies of all metadata on even a single drive, correct? Only when *specifically* using -m dup (which is the default on a single non-SSD device), will there be two copies of the metadata stored on a single device. This is not recommended when using Ah, so -m dup is default like I thought, but not on SSD? Ooops, that means that my laptop does not have redundant metadata on its SSD like I thought. Thanks for the heads up. Ah, I see the man page now "This is because SSDs can remap blocks internally so duplicate blocks could end up in the same erase block which negates the benefits of doing metadata duplication." You can force dup but, per the man page, whether or not that is beneficial is questionable. multiple devices as it means one device failure will likely cause critical loss of metadata. That's the part where I'm not clear: What's the difference between -m dup and -m raid1 Don't they both say 2 copies of the metadata? Is -m dup only valid for a single drive, while -m raid1 for 2+ drives? The issue is that -m dup will always put both copies on a single device. If you lose that device, you've lost both (all) copies of that metadata. With -m raid1 the second copy is on a *different* device. I believe dup *can* be used with multiple devices but mkfs.btrfs might not let you do it from the get-go. The way most have gotten there is by having dup on a single device and then, after adding another device, they didn't convert the metadata to raid1. If so, and I have a -d raid0 -m raid0 filesystem, are both copies of the metadata on the same drive or is btrfs smart enough to spread out metadata copies so that they're not on the same drive? This will mean there is only a single copy, albeit striped across the drives. Ok, so -m raid0 only means a single copy of metadata, thanks for explaining. good for redundancy. A total failure of a single device will mean any large files will be lost and only files smaller than the default per-disk stripe width (I believe this used to be 4K and is now 16K - I could be wrong) stored only on the remaining disk will be available. Gotcha, thanks for confirming, so -m raid1 -d raid0 really only protects against metadata corruption or a single block loss, but otherwise if you lost a drive in a 2 drive raid0, you'll have lost more than just half your files. The scenario you mentioned at the beginning, "if I lose a drive, I'll still have full metadata for the entire filesystem and only missing files" is more applicable to using "-m raid1 -d single". Single is not geared towards performance and, though it doesn't guarantee a file is only on a single disk, the allocation does mean that the majority of all files smaller than a chunk will be stored on only one disk or the other - not both. Ok, so in other words: -d raid0: if you one 1 drive out of 2, you may end up with small files and the rest will be lost -d single: you're more likely to have files be on one drive or the other, although there is no guarantee there either. Correct? Correct Thanks, Marc -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
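If a second device was added to a filesystem that still carries dup metadata, the profile can be converted afterwards; a hedged example (the mount point is a placeholder):
$ sudo btrfs balance start -mconvert=raid1 /mnt/btrfs_pool
$ sudo btrfs fi df /mnt/btrfs_pool    (should now list "Metadata, RAID1")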
Re: copies= option
On 2014/05/04 05:27 AM, Duncan wrote: Russell Coker posted on Sun, 04 May 2014 12:16:54 +1000 as excerpted: Are there any plans for a feature like the ZFS copies= option? I'd like to be able to set copies= separately for data and metadata. In most cases RAID-1 provides adequate data protection but I'd like to have RAID-1 and copies=2 for metadata so that if one disk dies and another has some bad sectors during recovery I'm unlikely to lose metadata. Hugo's the guy with the better info on this one, but until he answers... The zfs license issues mean it's not an option for me and I'm thus not familiar with its options in any detail, but if I understand the question correctly, yes. And of course since btrfs treats data and metadata separately, it's extremely unlikely that any sort of copies= option wouldn't be separately configurable for each. There was a discussion of a very nice multi-way-configuration schema that I deliberately stayed out of as both a bit above my head and far enough in the future that I didn't want to get my hopes up too high about it yet. I already want N-way-mirroring so bad I can taste it, and this was that and way more... if/when it ever actually gets coded and committed to the mainline kernel btrfs. As I said, Hugo should have more on it, as he was active in that discussion as it seemed to line up perfectly with his area of interest. The simple answer is yes, this is planned. As Duncan implied, however, it is not on the immediate roadmap. Internally we appear to be referring to this feature as "N-way redundancy" or "N-way mirroring". My understanding is that the biggest hurdle before the primary devs will look into N-way redundancy is to finish the Raid5/6 implementation to include self-healing/scrubbing support - a critical issue before it can be adopted further. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Copying related snapshots to another server with btrfs send/receive?
On 2014/05/04 05:12 AM, Marc MERLIN wrote: Another question I just came up with. If I have historical snapshots like so: backup backup.sav1 backup.sav2 backup.sav3 If I want to copy them up to another server, can btrfs send/receive let me copy all of the to another btrfs pool while keeping the duplicated block relationship between all of them? Note that the backup.sav dirs will never change, so I won't need incremental backups on those, just a one time send. I believe this is supposed to work, correct? The only part I'm not clear about is am I supposed to copy them all at once in the same send command, or one by one? If they had to be copied together and if I create a new snapshot of backup: backup.sav4 If I use btrfs send to that same destination, is btrfs send/receive indeed able to keep the shared block relationship? Thanks, Marc I'm not sure if they can be sent in one go. :-/ Sending one-at-a-time, the shared-data relationship will be kept by using the -p (parent) parameter. Send will only send the differences and receive will create a new snapshot, adjusting for those differences, even when the receive is run on a remote server. $ btrfs send backup | btrfs receive $path/ $ btrfs send -p backup backup.sav1 | btrfs receive $path/ $ btrfs send -p backup.sav1 backup.sav2 | btrfs receive $path/ $ btrfs send -p backup.sav2 backup.sav3 | btrfs receive $path/ $ btrfs send -p backup.sav3 backup.sav4 | btrfs receive $path/ -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Using mount -o bind vs mount -o subvol=vol
On 2014/05/04 02:47 AM, Marc MERLIN wrote: Is there any functional difference between mount -o subvol=usr /dev/sda1 /usr and mount /dev/sda1 /mnt/btrfs_pool mount -o bind /mnt/btrfs_pool/usr /usr ? Thanks, Marc There are two "issues" with this. 1) There will be a *very* small performance penalty (negligible, really) 2) Old snapshots and other supposedly-hidden subvolumes will be accessible under /mnt/btrfs_pool. This is a minor security concern (which of course may not concern you, depending on your use-case). There are a few similar minor security concerns - the recently-highlighted issue with old snapshots is the potential that old vulnerable binaries within a snapshot are still accessible and/or executable. -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
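For comparison, the two approaches expressed as /etc/fstab entries (the UUID is a placeholder for the filesystem's own UUID):
UUID=<fs-uuid>       /usr              btrfs   subvol=usr,defaults   0 0
versus
UUID=<fs-uuid>       /mnt/btrfs_pool   btrfs   defaults              0 0
/mnt/btrfs_pool/usr  /usr              none    bind                  0 0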
Re: Is metadata redundant over more than one drive with raid0 too?
Hi, Marc Raid0 is not redundant in any way. See inline below. On 2014/05/04 01:27 AM, Marc MERLIN wrote: So, I was thinking. In the past, I've done this: mkfs.btrfs -d raid0 -m raid1 -L btrfs_raid0 /dev/mapper/raid0d* My rationale at the time was that if I lose a drive, I'll still have full metadata for the entire filesystem and only missing files. If I have raid1 with 2 drives, I should end up with 4 copies of each file's metadata, right? But now I have 2 questions 1) btrfs has two copies of all metadata on even a single drive, correct? Only when *specifically* using -m dup (which is the default on a single non-SSD device), will there be two copies of the metadata stored on a single device. This is not recommended when using multiple devices as it means one device failure will likely cause critical loss of metadata. When using -m raid1 (as is the case in your first example above and as is the default with multiple devices), two copies of the metadata are distributed across two devices (each of those devices with a copy has only a single copy). If so, and I have a -d raid0 -m raid0 filesystem, are both copies of the metadata on the same drive or is btrfs smart enough to spread out metadata copies so that they're not on the same drive? This will mean there is only a single copy, albeit striped across the drives. 2) does btrfs lay out files on raid0 so that files aren't striped across more than one drive, so that if I lose a drive, I only lose whole files, but not little chunks of all my files, making my entire FS toast? "raid0" currently allocates a single chunk on each device and then makes use of "RAID0-like" stripes across these chunks until a new chunk needs to be allocated. This is good for performance but not good for redundancy. A total failure of a single device will mean any large files will be lost and only files smaller than the default per-disk stripe width (I believe this used to be 4K and is now 16K - I could be wrong) stored only on the remaining disk will be available. The scenario you mentioned at the beginning, "if I lose a drive, I'll still have full metadata for the entire filesystem and only missing files" is more applicable to using "-m raid1 -d single". Single is not geared towards performance and, though it doesn't guarantee a file is only on a single disk, the allocation does mean that the majority of all files smaller than a chunk will be stored on only one disk or the other - not both. Thanks, Marc I hope the above is helpful. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
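A hedged sketch of the "-m raid1 -d single" layout mentioned above, either at mkfs time or as an in-place conversion of an existing "-d raid0" filesystem (device names and mount point are placeholders):
$ sudo mkfs.btrfs -d single -m raid1 -L btrfs_pool /dev/sdX /dev/sdY
$ sudo btrfs balance start -dconvert=single /mnt/btrfs_pool    # convert an existing filesystem instead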
Re: Help with space
On 02/05/14 10:23, Duncan wrote: Russell Coker posted on Fri, 02 May 2014 11:48:07 +1000 as excerpted: On Thu, 1 May 2014, Duncan <1i5t5.dun...@cox.net> wrote: [snip] http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf Whether a true RAID-1 means just 2 copies or N copies is a matter of opinion. Papers such as the above seem to clearly imply that RAID-1 is strictly 2 copies of data. Thanks for that link. =:^) My position would be that reflects the original, but not the modern, definition. The paper seems to describe as raid1 what would later come to be called raid1+0, which quickly morphed into raid10, leaving the raid1 description only covering pure mirror-raid. Personally I'm flexible on using the terminology in day-to-day operations and discussion due to the fact that the end-result is "close enough". But ... The definition of "RAID 1" is still only a mirror of two devices. As far as I'm aware, Linux's mdraid is the only raid system in the world that allows N-way mirroring while still referring to it as "RAID1". Due to the way it handles data in chunks, and also due to its "rampant layering violations", *technically* btrfs's "RAID-like" features are not "RAID". To differentiate from "RAID", we're already using lowercase "raid" and, in the long term, some of us are also looking to do away with "raid{x}" terms altogether with what Hugo and I last termed as "csp notation". Changing the terminology is important - but it is particularly non-urgent. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs: protect snapshots from deleting during send
On 2014/04/16 05:22 PM, David Sterba wrote: On Wed, Apr 16, 2014 at 04:59:09PM +0200, Brendan Hide wrote: On 2014/04/16 03:40 PM, Chris Mason wrote: So in my example with the automated tool, the tool really shouldn't be deleting a snapshot where send is in progress. The tool should be told that snapshot is busy and try to delete it again later. It makes more sense now, 'll queue this up for 3.16 and we can try it out in -next. -chris So ... does this mean the plan is to a) have userland tool give an error; or b) a deletion would be "scheduled" in the background for as soon as the send has completed? b) is current state, a) is the plan with the patch, 'btrfs subvol delete' would return EPERM/EBUSY My apologies, I should have followed up on this a while ago already. :-/ Would having something closer to b) be more desirable if the resource simply disappears but continues in the background? This would be as in a lazy umount, where presently-open files are left open and writable but the directory tree has "disappeared". I submit that, with a), the actual status is more obvious/concrete whereas with b+lazy), current issues would flow smoothly with no errors and no foreseeable future issues. I reserve the right to be wrong, of course. ;) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Cycle of send/receive for backup/restore is incomplete...
Replied inline: On 2014/04/24 12:30 AM, Robert White wrote: So the backup/restore system described using snapshots is incomplete because the final restore is a copy operation. As such, the act of restoring from the backup will require restarting the entire backup cycle because the copy operation will scramble the metadata consanguinity. The real choice is to restore by sending the snapshot back via send and receive so that all the UIDs and metadata continue to match up. But there's no way to "promote" the final snapshot to a non-snapshot subvolume identical to the one made by the original btrfs subvolume create operation. btrfs doesn't differentiate snapshots and subvolumes. They're the same first-class citizen. A snapshot is a subvolume that just happens to have some data (automagically/naturally) deduplicated with another subvolume. Consider a file system with __System as the default mount (e.g. btrfs subvolume create /__System). You make a snapshot (btrfs sub snap -r /__System /__System_BACKUP). Then you send the backup to another file system with send receive. Nothing new here. The thing is, if you want to restore from that backup, you'd send/receive /__System_BACKUP to the new/restore drive. But that snapshot is _forced_ to be read only. So then your only choice is to make a writable snapshot called /__System. At this point you have a tiny problem, the three drives aren't really the same. The __System and __System_BACKUP on the final drive are subvolumes of /, while on the original system / and /__System were full subvolumes. There's no such thing as a "full" subvolume. Again, they're all first-class citizens. The "real" root of a btrfs is always treated as a subvolume, as are the subvolumes inside it too. Just because other subvolumes are contained therein it doesn't mean they're diminished somehow. You cannot have multiple subvolumes *without* having them be a "sub" volume of the real "root" subvolume. It's dumb, it's a tiny difference, but it's annoying. There needs to be a way to promote /__System to a non-snapshot status. If you look at the output of "btrfs subvolume list -s /" on the various drives it's not possible to end up with the exact same system as the original. From a user application perspective, the system *is* identical to the original. That's the important part. If you want the disk to be identical bit for bit then you want a different backup system entirely, one that backs up the hard disk, not the files/content. On the other hand if you just want to have all your snapshots restored as well, that's not too difficult. Its pointless from most perspectives - but not difficult. There needs to be either an option to btrfs subvolume create that takes a snapshot as an argument to base the new device on, or an option to receive that will make a read-write non-snapshot subvolume. This feature already exists. This is a very important aspect of how snapshots work with send / receive and why it makes things very efficient. They work just as well for a restore as they do for a backup. The flag you are looking for is "-p" for "parent", which you should already be using for the backups in the first place: From backup host: $ btrfs send -p /backup/path/yesterday /backup/path/last_backup | From restored host: $ | btrfs receive /tmp/btrfs_root/ Then you make the non-read-only snapshot of the restored subvolume. 
[snip] -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
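Putting that together, a minimal restore sketch under the same assumptions (hostname and paths are placeholders; /mnt/restore is the new disk's btrfs root):
$ btrfs send /backup/__System_BACKUP | ssh newhost btrfs receive /mnt/restore/
$ btrfs subvolume snapshot /mnt/restore/__System_BACKUP /mnt/restore/__System    # on the restored machine: writable again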
Re: [PATCH 1/2] btrfs: protect snapshots from deleting during send
On 2014/04/16 03:40 PM, Chris Mason wrote: So in my example with the automated tool, the tool really shouldn't be deleting a snapshot where send is in progress. The tool should be told that snapshot is busy and try to delete it again later. It makes more sense now, 'll queue this up for 3.16 and we can try it out in -next. -chris So ... does this mean the plan is to a) have userland tool give an error; or b) a deletion would be "scheduled" in the background for as soon as the send has completed? -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Copying a disk containing a btrfs filesystem
Hi, Michael Btrfs send/receive can transfer incremental snapshots as well - you're looking for the "-p" or "parent" parameter. On the other hand, it might not be the right tool for the job. If you're 100% happy with your old disk's *content*/layout/etc (just not happy with the disk's reliability), try an overnight/over-weekend ddrescue instead: http://www.forensicswiki.org/wiki/Ddrescue What I've done in the past is scripted ddrescue to recover as much data as possible. Its like using dd between two disks except that it can keep a log of bad sectors that can be retried later. The log also helps by ensuring that if you cancel the operation, you can start it again and it will continue where it left off. Additionally, you can have it skip big sections of the disk when it comes across bad sectors - and it can "trim" these sections on subsequent runs. Btrfs send/receive has the advantage that you can run it while your system is still active. DDrescue has the advantage that it is very good at recovering 99% of your data where a disk has lots of bad sectors. For btrfs, send/receive the main subvolumes then, afterward, send/receive the snapshots using the "parent" parameter, "-p". There *is* the possibility that this needs to be reversed - as in, the backup should be treated as the parent instead of the other way around: btrfs send /home | btrfs receive /mnt/new-disk/home btrfs send -p /home /backups/home-2014-04-08 | btrfs receive /mnt/new-disk/backups/. Below is from the last scriptlet I made when I last did the ddrescue route (in that case I was recovering a failing NTFS drive). It was particularly bad and took a whole weekend to recover. The new disk worked flawlessly however. :) How I've used ddrescue in the past is to connect the failing and new disk to a server. Alternatively, using USB, you could boot from a rescue CD/flash drive and do the rescue there. a) Identify the disks in /dev/disk/by- and put those values into the bash script below. ensure that it refers to the disk as a whole (not a partition for example). This ensures that re-ordering of the drives after a reboot won't affect the process. b) Set up a log file location on a separate filesystem - a flash drive is ideal unless you've gone the "server" route, where I normally just put the log into a path on the the server as so: /root/brendan-laptop.recovery.log #!/bin/bash src_disk=/dev/disk/by-id/ata-ST3250410AS_6RYF5NP7 dst_disk=/dev/disk/by-id/ata-ST3500418AS_9VM2ZZQS #log=/path/to/log log=~/brendan-laptop.recovery.log #Sector size (default is 512 - newer disks should probably be 4096) sector_size=4096 #Force writing to a block device - disable if you're backing up to an image file #force="" force="-f" # We want to skip bigger chunks to get as much data as possible before the source disk really dies # For the same reason, we also want to start with (attempting) to maintain a high read rate #Minimum read rate in MB/s before skipping min_readrate=10485760 #Default skip size is 64k. for skip_size in 65536 16384; do #Going forward ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size $src_disk $dst_disk $log #Going in reverse ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size -R $src_disk $dst_disk $log done #Default skip size is 64k. 
#This time re-trying all failed/skipped sections for skip_size in 4096; do #Going forward ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size -A $src_disk $dst_disk $log #Going in reverse ddrescue -r 1 -a $min_readrate -b $sector_size -d $force -K $skip_size -R -A $src_disk $dst_disk $log done #Default skip size is 64k. for skip_size in 1024 256 64 ; do #Going forward ddrescue -r 1 -b $sector_size -d $force -K $skip_size $src_disk $dst_disk $log #Going in reverse ddrescue -r 1 -b $sector_size -d $force -K $skip_size -R $src_disk $dst_disk $log done echo "Done. Run an chkdsk/fsck/whatever-might-be appropriate for the new disk's filesystem(s)" On 10/04/14 15:21, Michael Schuerig wrote: SMART indicates that my notebook disk may soon be failing (an unreadable/uncorrectable sector), therefore I intend to exchange it. The disk contains a single btrfs filesystem with several nested(!) subvolumes, each with several read-only snapshots in a .snapshots subdirectory. As far as I can tell, btrfs currently does not offer a sensible way to duplicate the entire contents of the old disk onto a new one. I can use cp, rsync, or send/receive to copy the "main" subvolumes. But unless I'm missing something obvious, the snapshots are effectively lost. btrfs send optionally takes multiple clone sources, but I've never seen an example of its usage. If that's what "experimental" means, I'm willing
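Since the quoted mail asks about clone sources: a hedged sketch of copying several read-only snapshots of /home while keeping their shared extents (paths and snapshot names are placeholders, and each -c source must already exist, identical, on both sides before it is referenced):
$ btrfs send /home/.snapshots/2014-04-06 | btrfs receive /mnt/new-disk/home/.snapshots/
$ btrfs send -c /home/.snapshots/2014-04-06 /home/.snapshots/2014-04-07 | btrfs receive /mnt/new-disk/home/.snapshots/
$ btrfs send -c /home/.snapshots/2014-04-06 -c /home/.snapshots/2014-04-07 /home/.snapshots/2014-04-08 | btrfs receive /mnt/new-disk/home/.snapshots/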
Re: [PATCH] lib: add size unit t/p/e to memparse
On 31/03/14 12:03, Gui Hecheng wrote: - * potentially suffixed with %K (for kilobytes, or 1024 bytes), - * %M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or - * 1073741824). If the number is suffixed with K, M, or G, then + * potentially suffixed with + * %K (for kilobytes, or 1024 bytes), + * %M (for megabytes, or 1048576 bytes), + * %G (for gigabytes, or 1073741824), + * %T (for terabytes, or 1099511627776), + * %P (for petabytes, or 1125899906842624 bytes), + * %E (for exabytes, or 1152921504606846976 bytes). My apologies, I should have noticed this in your earlier mail. This could be updated to specifically refer to the "bi"nary prefixes rather than the old SI-conflicting names: kibibyte, mebibyte, gibibyte, tebibyte, pebibyte, and exbibyte -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: add btrfs resize unit t/p/e support
On 2014/03/27 04:51 AM, Gui Hecheng wrote: [snip] We add t/p/e support by replacing lib/cmdline.c:memparse with btrfs_memparse. The btrfs_memparse copies memparse's code and add unit t/p/e parsing. Is there a conflict preventing adding this to memparse directly? -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs and raid5 status with kernel 3.14, documentation, and howto
On 25/03/14 03:29, Marc MERLIN wrote: On Tue, Mar 25, 2014 at 01:11:43AM, Martin wrote: There was a big thread a short while ago about using parity across n devices where the parity is spread such that you can have 1, 2, and up to 6 redundant devices. Well beyond just raid5 and raid6: http://lwn.net/Articles/579034/ Aah, ok. I didn't understand you meant that. I know nothing about that, but to be honest, raid6 feels like it's enough for me :) There are a few of us who are very much looking forward to these special/flexible RAID types - for example RAID15 (very good performance, very high redundancy, less than 50% diskspace efficiency). The csp notation will probably make it easier to develop these flexible raid types, and is very much needed in order to manage them properly. A typical RAID15 with 12 disks would, in csp notation, be written as: 2c5s1p And some would like to be able to use the exact same redundancy scheme even with extra disks: 2c5s1p on 16 disks (note, the example is not 2c7s1p, though that would also be a valid scheme, with 16 disks being the minimum number of disks required). The last thread on this (I think) can be viewed here, http://www.spinics.net/lists/linux-btrfs/msg23137.html where Hugo also explains and lists the notation for the existing schemes. -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
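To make the "less than 50% diskspace efficiency" figure concrete, a rough worked example (reading c as copies, s as data stripes and p as parity devices per copy, as in Hugo's description of the notation):
2c5s1p: each block group spans 2 x (5 + 1) = 12 devices, of which 5 hold unique data,
so the usable fraction is 5/12, roughly 42%.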
Re: "Asymmetric" RAID0
On 25/03/14 07:15, Slava Barinov wrote: Hello, I've been using a single drive btrfs for some time and when free space became too low I've added an additional drive and rebalanced FS with RAID0 data and RAID1 System and Metadata storage. Now I have the following configuration: # btrfs fi show /btr Label: none uuid: f9d78880-10a7-439b-8ebd-14d815edbc19 Total devices 2 FS bytes used 415.45GiB devid1 size 931.51GiB used 222.03GiB path /dev/sdc devid2 size 431.51GiB used 222.03GiB path /dev/sdb # btrfs fi df /btr Data, RAID0: total=424.00GiB, used=406.81GiB System, RAID1: total=32.00MiB, used=40.00KiB Metadata, RAID1: total=10.00GiB, used=8.64GiB # df -h Filesystem Size Used Avail Use% Mounted on /dev/sdb 1.4T 424G 437G 50% /btr I suppose I should trust to btrfs fi df, not df utility. So the main question is if such "asymmetric" RAID0 configuration possible at all and why does btrfs ignore ~500 GB of free space on /dev/sdc drive? Also it's interesting what will happen when I add 20 GB more data to my FS. Should I be prepared to usual btrfs low-space problems? Best regards, Slava Barinov. The "raid0" will always distribute data to each disk relatively equally. There are exceptions of course. The way to have it better utilise the diskspace is to use either "single" (which won't get the same performance as raid0) or to add a third disk. In any raided configuration, the largest disk won't be fully utilised unless the other disks add up to be equal to or more than that largest disk. Play around with Hugo's disk usage calculator to get a better idea of what the different configurations will do: http://carfax.org.uk/btrfs-usage/ -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
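If the goal is to use the remaining ~500GB on the larger device rather than to keep raid0's striping, the data profile can be converted in place; a sketch using the mount point above:
$ sudo btrfs balance start -dconvert=single /btr
$ sudo btrfs fi show /btr    # the larger device can now be filled past the smaller one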
Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)
On 2014/03/22 11:11 PM, Marc MERLIN wrote: Please consider adding a blank line between quotes, it makes them just a bit more readable :) Np. On Sat, Mar 22, 2014 at 11:02:24PM +0200, Brendan Hide wrote: - it doesn't create writeable snapshots on the destination in case you want to use the copy as a live filesystem One of the issues with doing writeable snapshots by default is that the backup is (ever so slightly) less safe from "fat-finger syndrome". If I want a writeable snapshot, I'll make it from the read-only snapshot, thereby reducing the chances of accidentally "tainting" or deleting data in the backup. I actually *did* accidentally delete my entire filesystem (hence the paranoid umounts). But, of course, my script *first* created read-only snapshots from which recovery took only a few minutes. ;) The writeable snapshot I create is on top of the read only one used by btrfs receive. So, I can play with it, but it won't upset/break anything for the backup. The historical snapshots I keep give me cheap backups to go back to do get a file I may have deleted 3 days ago and want back now even though my btrfs send/receive runs hourly. Ah. In that case my comment is moot. I could add support for something like this but I'm unlikely to use it. [snip] - Your comments say shlock isn't safe and that's documented. I don't see that in the man page http://manpages.ubuntu.com/manpages/trusty/man1/shlock.1.html That man page looks newer than the one I last looked at - specifically the part saying "improved by Berend Reitsma to solve a race condition." The previous documentation on shlock indicated it was safe for hourly crons - but not in the case where a cron might be executed twice simultaneously. Shlock was recommended by a colleague until I realised this potential issue, thus my template doesn't use it. I should update the comment with some updated information. It's not super important, it was more my curiosity. If a simple lock program in C isn't atomic, what's the point of it? I never looked at the source code, but maybe I should... Likely the INN devs needed something outside of a shell environment. Based on the man page, shlock should be atomic now. I'd love to have details on this if I shouldn't be using it - Is set -o noclobber; echo $$ > $lockfile really atomic and safer than shlock? If so, great, although I would then wonder why shlock even exists :) The part that brings about an atomic lock is "noclobber", which sets it so that we are not allowed to "clobber"/"overwrite" an existing file. Thus, if the file exists, the command fails. If it successfully creates the new file, the command returns true. I understand how it's supposed to work, I just wondered if it was really atomic as it should be since there would be no reason for shlock to even exist with that line of code you wrote. When I originally came across the feature I wasn't sure it would work and did extensive testing: For example, spawn 30 000 processes, each of which tried to take the lock. After the machine became responsive again ;) only 1 lock ever turned out to have succeeded. Since then its been in production use across various scripts on hundreds of servers. My guess (see above) is that the INN devs couldn't or didn't want to use it. The original page where I learned about noclobber: http://www.davidpashley.com/articles/writing-robust-shell-scripts/ Thanks for the info Marc No problem - and thanks. 
:) -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
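For reference, a minimal sketch of the noclobber locking pattern being discussed (the lock file path is arbitrary):
#!/bin/bash
lockfile=/var/run/btrfs-backup.lock
if ( set -o noclobber; echo $$ > "$lockfile" ) 2>/dev/null; then
    trap 'rm -f "$lockfile"' EXIT    # release the lock on any exit
    # ... the actual send/receive work goes here ...
    :
else
    echo "Lock held by PID $(cat "$lockfile"), exiting" >&2
    exit 1
fi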
Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)
On 2014/03/22 10:00 PM, Marc MERLIN wrote: On Sat, Mar 22, 2014 at 09:44:05PM +0200, Brendan Hide wrote: Hi, Marc Feel free to use ideas from my own script. Some aspects in my script are more mature and others are frankly pathetic. ;) There are also quite a lot of TODOs throughout my script that aren't likely to get the urgent attention they deserve. It has been slowly evolving over the last two weeks. http://swiftspirit.co.za/scripts/btrfs-snd-rcv-backup I figured I likely wasn't the only one working on a script like this :) From a quick read, it looks even more complex than mine :) but Well ... I did say some things are pathetic my side. ;) I also use a template (its about 3 years old now) when I make a new script, hence the options such as being able to ignore the mutex checks and also having a random delay at start. These obviously add some unnecessary complexity. - it doesn't do ssh to a destination for a remote backup There should be a TODO for this on my side. Presently in testing I'm only using it for device-local backup to a separate disk and not to a "proper" remote backup. - it doesn't seem to keep a list of configurable snapshots not necessary for send/restore but useful for getting historical data I'm not sure what this is useful for. :-/ If related, I plan on creating a separate script to move snapshots around into _var_daily_$date, _var_weekly_$date, etc. - it doesn't seem to use a symlink to keep track of the last complete snapshot on the source and destination, and does more work to compensate when recovering from an incomplete backup/restore. Yes, a symlink would make this smoother. - it doesn't create writeable snapshots on the destination in case you want to use the copy as a live filesystem One of the issues with doing writeable snapshots by default is that the backup is (ever so slightly) less safe from "fat-finger syndrome". If I want a writeable snapshot, I'll make it from the read-only snapshot, thereby reducing the chances of accidentally "tainting" or deleting data in the backup. I actually *did* accidentally delete my entire filesystem (hence the paranoid umounts). But, of course, my script *first* created read-only snapshots from which recovery took only a few minutes. ;) Things I noticed: - I don't use ionice, maybe I should. Did you find that it actually made a difference with send/receive? This is just a habit I've developed over time in all my scripts. I figure that if I'm using the machine at the time and the snapshot has a large churn, I'd prefer the ionice. That said, the main test system is a desktop which is likely to have much less churn than a server. In the last two weeks the longest daily incremental backup took about 5 minutes to complete, while it typically takes about 30 seconds only. - Your comments say shlock isn't safe and that's documented. I don't see that in the man page http://manpages.ubuntu.com/manpages/trusty/man1/shlock.1.html That man page looks newer than the one I last looked at - specifically the part saying "improved by Berend Reitsma to solve a race condition." The previous documentation on shlock indicated it was safe for hourly crons - but not in the case where a cron might be executed twice simultaneously. Shlock was recommended by a colleague until I realised this potential issue, thus my template doesn't use it. I should update the comment with some updated information. My only two worries then would be a) if it is outdated on other distros and b) that it appears that it is not installed by default. 
On my Arch desktop it seems to be available with "inn"[1] (Usenet server and related software) and nowhere else. It seems the same on Ubuntu (Google pointed me to inn2-dev). Do you have INN installed? If not, where did you get shlock from? I'd love to have details on this if I shouldn't be using it - Is set -o noclobber; echo $$ > $lockfile really atomic and safer than shlock? If so, great, although I would then wonder why shlock even exists :) The part that brings about an atomic lock is "noclobber", which sets it so that we are not allowed to "clobber"/"overwrite" an existing file. Thus, if the file exists, the command fails. If it successfully creates the new file, the command returns true. I'd consider changing this mostly for the fact that depending on INN is a very big dependency. There are other options as well, though I don't think they're as portable as noclobber. Thanks, Marc Thanks for your input. It has already given me some direction. :) [1] https://www.archlinux.org/packages/community/x86_64/inn/files/ -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
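Since the noclobber trick came up, here is a minimal sketch of that style of mutex; the lockfile path and the trap handling are illustrative, not lifted from either script:

#!/bin/bash
lockfile=/var/lock/btrfs-backup.lock
# With noclobber set, the redirection fails if the file already exists,
# so the test-and-create happens as a single atomic step.
if ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
	# drop the lock on any exit so a failed run doesn't wedge the next one
	trap 'rm -f "$lockfile"' EXIT
	echo "lock acquired by PID $$"
	# ... snapshot/send work goes here ...
else
	echo "another instance holds $lockfile (PID $(cat "$lockfile"))" >&2
	exit 1
fi

Unlike shlock this needs nothing beyond a POSIX shell; the one caveat is that a lock left behind by a hard crash (where the EXIT trap never ran) has to be cleared by hand, or by checking whether the recorded PID is still alive.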
Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)
On 2014/03/22 09:44 PM, Brendan Hide wrote: On 2014/03/21 07:29 PM, Marc MERLIN wrote: Hi, Marc Feel free to use ideas from my own script. Some aspects in my script are more mature and others are frankly pathetic. ;) There are also quite a lot of TODOs throughout my script that aren't likely to get the urgent attention they deserve. It has been slowly evolving over the last two weeks. http://swiftspirit.co.za/scripts/btrfs-snd-rcv-backup I forgot to include some notes: The script depends on a "config" file at /etc/btrfs-backup/paths.conf which is supposed to contain the paths as well as some parameters. At present the file consists solely of the following paths as these are all separate subvolumes in my test system: "/ /home /usr /var " The snapshot names on source and backup are formatted as below. This way, daylight savings doesn't need any special treatment: __2014-03-17-23h00m01s+0200 __2014-03-18-23h00m01s+0200 __2014-03-19-23h00m01s+0200 __2014-03-20-23h00m02s+0200 __2014-03-21-23h00m01s+0200 _home_2014-03-17-23h00m01s+0200 _home_2014-03-18-23h00m01s+0200 _home_2014-03-19-23h00m01s+0200 _home_2014-03-20-23h00m02s+0200 _home_2014-03-21-23h00m01s+0200 _usr_2014-03-17-23h00m01s+0200 _usr_2014-03-18-23h00m01s+0200 _usr_2014-03-19-23h00m01s+0200 _usr_2014-03-20-23h00m02s+0200 _usr_2014-03-21-23h00m01s+0200 _var_2014-03-17-23h00m01s+0200 _var_2014-03-18-23h00m01s+0200 _var_2014-03-19-23h00m01s+0200 _var_2014-03-20-23h00m02s+0200 _var_2014-03-21-23h00m01s+0200 -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
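A small sketch of how names in that format can be generated; the prefix handling is illustrative and the snapshot destination is hypothetical, so the real script may differ:

#!/bin/bash
vol=/home                                  # one of the paths from paths.conf
# "/" becomes "__", "/home" becomes "_home_", and so on
prefix=$(echo "${vol}_" | tr '/' '_')
# the timestamp carries the UTC offset, so DST changes never produce
# ambiguous or colliding names
stamp=$(date +%Y-%m-%d-%Hh%Mm%Ss%z)
echo "${prefix}${stamp}"                   # e.g. _home_2014-03-17-23h00m01s+0200
btrfs subvolume snapshot -r "$vol" "/mnt/snapshots/${prefix}${stamp}"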
Re: Send/Receive howto and script for others to use (was Re: Is anyone using btrfs send/receive)
prw" # Keep track of the last snapshot to send a diff against. ln -snf $src_newsnap ${vol}_last # The rw version can be used for mounting with subvol=vol_last_rw ln -snf $src_newsnaprw ${vol}_last_rw $ssh ln -snf $src_newsnaprw $dest_pool/${vol}_last_rw # How many snapshots to keep on the source btrfs pool (both read # only and read-write). ls -rd ${vol}_ro* | tail -n +$(( $keep + 1 ))| while read snap do btrfs subvolume delete "$snap" done ls -rd ${vol}_rw* | tail -n +$(( $keep + 1 ))| while read snap do btrfs subvolume delete "$snap" done # Same thing for destination (assume the same number of snapshots to keep, # you can change this if you really want). $ssh ls -rd $dest_pool/${vol}_ro* | tail -n +$(( $keep + 1 ))| while read snap do $ssh btrfs subvolume delete "$snap" done $ssh ls -rd $dest_pool/${vol}_rw* | tail -n +$(( $keep + 1 ))| while read snap do $ssh btrfs subvolume delete "$snap" done rm $lock Hi, Marc Feel free to use ideas from my own script. Some aspects in my script are more mature and others are frankly pathetic. ;) There are also quite a lot of TODOs throughout my script that aren't likely to get the urgent attention they deserve. It has been slowly evolving over the last two weeks. http://swiftspirit.co.za/scripts/btrfs-snd-rcv-backup -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Incremental backup for a raid1
On 2014/03/13 09:48 PM, Andrew Skretvedt wrote: On 2014-Mar-13 14:28, Hugo Mills wrote: On Thu, Mar 13, 2014 at 08:12:44PM +0100, Michael Schuerig wrote: I have a btrfs raid1 filesystem spread over two disks. I want to back up this filesystem regularly and efficiently to an external disk (same model as the ones in the raid) in such a way that * when one disk in the raid fails, I can substitute the backup and rebalancing from the surviving disk to the substitute only applies the missing changes. For point 1, not really. It's a different filesystem [snip] Hugo. I'm new We all start somewhere. ;) Could you, at the time you wanted to back up the filesystem:
1) in the filesystem, break RAID1: /dev/A /dev/B <-- remove /dev/B
2) reestablish RAID1 to the backup device: /dev/A /dev/C <-- added
It's this step that won't work "as is" and, from an outsider's perspective, it is not obvious why: As Hugo mentioned, "It's a different filesystem". The two disks don't have any "co-ordinating" record of data and don't have any record indicating that the other disk even exists. The files they store might even be identical - but there's a lot of missing information that would be necessary to tell them they can work together. All this will do is reformat /dev/C, and then it will be rewritten again by the balance operation in step 3) below.
3) balance to effect the backup (i.e. rebuilding the RAID1 onto /dev/C)
4) break/reconnect the original devices: remove /dev/C; re-add /dev/B to the fs
Again, as with 2), /dev/A is now synchronised with (for all intents and purposes) a new disk. If you want to re-add /dev/B, you're going to lose any data on /dev/B (view this in the sense that, if you wiped the disk, the end result would be the same) and then you would be re-balancing new data onto it from scratch.
Before removing /dev/B:
Disk A: abdeg__cf__
Disk B: abc_df_ge__ <- note that data is *not* necessarily stored in the exact same position on both disks
Disk C: gbfc_d__a_e
All data is available on all disks. Disk C has no record indicating that disks A and B exist. Disks A and B have a record indicating that the other disk is part of the same FS. These two disks have no record indicating disk C exists.
1. Remove /dev/B:
Disk A: abdeg__cf__
Disk C: gbfc_d__a_e
2. Add /dev/C to /dev/A as RAID1:
Disk A: abdeg__cf__
Disk C: _## <- system reformats /dev/C and treats the old data as garbage
3. Balance /dev/{A,C}:
Disk A: abdeg__cf__
Disk C: abcdefg
Both disks now have a full record of where the data is supposed to be and have a record indicating that the other disk is part of the FS. Notice that, though Disk C has the exact same files as it did before step 1, the on-disk filesystem looks very different.
4. Follow steps 1, 2, and 3 above - but with different disks - for a similar end result.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
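If you really did want to rotate a backup disk in and out of the mirror this way, the order of operations matters: btrfs will refuse to drop a two-device RAID1 below two devices, so the incoming disk has to be added before the outgoing one is removed. A rough sketch follows; the device names are placeholders, and note that this still rewrites the whole of the incoming disk on every cycle, so it is not the incremental scheme Michael was hoping for:

# bring the backup disk into the filesystem first
btrfs device add /dev/C /mnt
# deleting /dev/B now migrates its chunks onto A+C (an implicit rebalance)
btrfs device delete /dev/B /mnt
# ... and later, swap back the same way
btrfs device add /dev/B /mnt
btrfs device delete /dev/C /mnt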
Re: [systemd-devel] [HEADS-UP] Discoverable Partitions Spec
On 2014/03/12 09:31 PM, Chris Murphy wrote: On Mar 12, 2014, at 1:12 PM, Goffredo Baroncelli wrote: On 03/12/2014 06:24 PM, Chris Mason wrote: Your suggestion also sounds like it places snapshots outside of their parent subvolume? If so it mitigates a possible security concern if the snapshot contains (old) binaries with vulnerabilities. I asked about how to go about assessing this on the Fedora security list: https://lists.fedoraproject.org/pipermail/security/2014-February/001748.html There aren't many replies but the consensus is that it's a legitimate concern, so either the snapshots shouldn't be persistently available (which is typical with e.g. snapper, and also yum-plugin-fs-snapshot), and/or when the subvolume containing snapshots is mounted, it's done with either mount option noexec or nosuid (no consensus on which one, although Gnome Shell uses nosuid by default when automounting removable media). This is exactly the same result if following the previously-recommended subvolume layout given on the Arch wiki. It seems this wiki advice has "disappeared" so I can't give a link for it ... My apologies if the rest of my mail is off-topic. Though not specifically for rollback, my snapshots prior to a btrfs send/receive backup are done via a temporary mountpoint. Until two days ago I was still using rsync to a secondary btrfs volume and the __snapshots folder had been sitting empty for about a year. The performance difference with send/receive is orders of magnitude: a daily backup to the secondary disk now takes between 30 and 40 seconds, whereas it took 20 to 30 minutes with rsync. Here are my current subvolumes:
__active
__active/home
__active/usr
__active/var
__snapshots/__2014-03-12-23h00m01s+0200
__snapshots/_home_2014-03-12-23h00m01s+0200
__snapshots/_usr_2014-03-12-23h00m01s+0200
__snapshots/_var_2014-03-12-23h00m01s+0200
I hadn't thought of noexec or nosuid. On a single-user system you don't really expect that type of incursion. I will put up my work after I've properly automated cleanup. The only minor gripe I have with the temporary mount is that I feel it should be possible to perform snapshots and use send/receive without the requirement of having the subvolumes be "visible" in userspace.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
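For reference, mounting the snapshot container with those restrictions is a one-liner; the subvolume name matches the layout above, but the device and mountpoint are only examples:

# ad hoc
mount -o subvol=__snapshots,nosuid,noexec /dev/sdX2 /mnt/snapshots
# or persistently via /etc/fstab:
# UUID=<fs-uuid>  /mnt/snapshots  btrfs  subvol=__snapshots,nosuid,noexec  0 0

Old binaries inside the snapshots then cannot be executed from that mountpoint (or at least cannot regain privileges via setuid), which addresses the concern Chris raised.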
Re: Understanding btrfs and backups
On 2014/03/06 09:27 PM, Eric Mesa wrote: Brian Wong wrote: a snapshot is different than a backup [snip] ... Three hard drives: A, B, and C. Hard drives A and B - btrfs RAID-1 so that if one drive dies I can keep using my system until the replacement for the raid arrives. Hard drive C - gets (hourly/daily/weekly/or some combination of the above) snapshots from the RAID. (Starting with the initial state snapshot) Each timepoint another snapshot is copied to hard drive C. [snip]... So if that's what I'm doing, do snapshots become a way to do backups? An important distinction for anyone joining the conversation is that snapshots are *not* backups, in a similar way that you mentioned that RAID is not a backup. If a hard drive implodes, its snapshots go with it. Snapshots can (and should) be used as part of a backup methodology - and your example is almost exactly the same as previous good backup examples. I think most of the time there's mention of an external "backup server" keeping the backups, which is the only major difference compared to the process you're looking at. Btrfs send/receive with snapshots can make the process far more efficient compared to rsync. Rsync doesn't have any record as to what information has changed so it has to compare all the data (causing heavy I/O). Btrfs keeps a record and can skip to the part of sending the data. I do something similar to what you have described on my Archlinux desktop - however I haven't updated my (very old) backup script to take advantage of btrfs' send/receive functionality. I'm still using rsync. :-/ / and /home are on btrfs-raid1 on two smallish disks /mnt/btrfs-backup is on btrfs single/dup on a single larger disk See https://btrfs.wiki.kernel.org/index.php/Incremental_Backup for a basic incremental methodology using btrfs send/receive -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
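The wiki's method condensed into a sketch, using the mountpoints mentioned above; the snapshot names are placeholders, and the wiki page covers the full procedure and error handling:

# one-time: take a baseline read-only snapshot and transfer it in full
btrfs subvolume snapshot -r /home /home/snap-0
sync
btrfs send /home/snap-0 | btrfs receive /mnt/btrfs-backup
# every later run: snapshot again and send only the delta
btrfs subvolume snapshot -r /home /home/snap-1
sync
btrfs send -p /home/snap-0 /home/snap-1 | btrfs receive /mnt/btrfs-backup
# once snap-1 exists on both sides it becomes the parent for the next
# run, and snap-0 can be deleted on the source
btrfs subvolume delete /home/snap-0

This is what makes it so much cheaper than rsync: the send stream is generated from btrfs' own knowledge of what changed between the two snapshots, with no directory-tree walk or checksum comparison.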
Re: Issue with btrfs balance
On 14/02/14 05:42, Austin S. Hemmelgarn wrote: On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote: Do you happen to know which git repository and branch is preferred to base patches on? I'm getting ready to write one to fix this, and would like to make it as easy as possible for the developers to merge. A list of the "main" repositories is maintained at https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories I'd suggest David Sterba's branch as he maintains it for userspace-tools integration. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Possible to wait for snapshot deletion?
On 2014/02/13 09:02 PM, Kai Krakow wrote: Hi! Is it technically possible to wait for a snapshot to be completely purged from disk? I imagine an option like "--wait" for btrfs subvolume delete. This would fit some purposes I'm planning to implement: * In a backup scenario
I have a similar use-case for this, also involving backups. In my case I have a script that uses a btrfs filesystem for the backup store using snapshots. At the end of each run, if diskspace usage is above a predefined threshold, it will delete old snapshots until the diskspace usage is below that threshold again. Of course, the first time I added the automatic deletion, it deleted far more than was necessary due to the fact that the actual freeing of diskspace is asynchronous from the command completion. I ended up setting a small delay (of about 60 seconds) between each iteration and also set it to monitor system load. If load is not low enough after the delay then it waits another 60 seconds. This complicated (frankly broken) workaround would be completely unnecessary with a --wait switch. Alternatively, perhaps a knob where we can see if a subvolume deletion is in progress could help.
-- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
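A minimal sketch of the workaround described above, to make the poll-and-sleep dance concrete; the paths, retention count and thresholds are invented for the example:

backup_fs=/mnt/btrfs-backup
threshold=90    # percent used
# oldest snapshots first, always keeping the newest 8
for snap in $(ls -d $backup_fs/snapshots/* | sort | head -n -8); do
	used=$(df --output=pcent $backup_fs | tail -1 | tr -dc '0-9')
	[ "$used" -le "$threshold" ] && break
	btrfs subvolume delete "$snap"
	# space is reclaimed asynchronously by the cleaner thread, so pause
	# before re-checking df (and back off while the box is busy)
	sleep 60
	while [ "$(cut -d. -f1 /proc/loadavg)" -ge 2 ]; do
		sleep 60
	done
done

A --wait switch (or a flag showing that a deletion is still being cleaned up) would replace everything from the sleep downwards.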
Re: Issue with btrfs balance
On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote: Apparently, trying to use -mconvert=dup or -sconvert=dup on a multi-device filesystem using one of the RAID profiles for metadata fails with a statement to look at the kernel log, which doesn't show anything at all about the failure.
^ If this is the case then it is definitely a bug. Can you provide some version info? Specifically kernel, btrfs-tools, and distro.
it appears that the kernel stops you from converting to a dup profile for metadata in this case because it thinks that such a profile doesn't work on multiple devices, despite the fact that you can take a single-device filesystem, add a device, and it will still work fine even without converting the metadata/system profiles.
I believe dup used to work on multiple devices but the facility was removed. In the standard case it doesn't make sense to use dup with multiple devices: it uses the same amount of diskspace but is more vulnerable than the RAID1 alternative.
Ideally, this should be changed to allow converting to dup so that when converting a multi-device filesystem to single-device, you never have to have metadata or system chunks use a single profile.
This is a good use-case for having the facility. I'm thinking that, if it is brought back in, the only caveat is that appropriate warnings should be put in place to indicate when dup is a poor choice. My guess on how you'd like to migrate from raid1/raid1 to single/dup, assuming sda and sdb:
btrfs balance start -dconvert=single -mconvert=dup /
btrfs device delete /dev/sdb /
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Provide a better free space estimate on RAID1
On 2014/02/05 10:15 PM, Roman Mamedov wrote: Hello, On a freshly-created RAID1 filesystem of two 1TB disks:
# df -h /mnt/p2/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  1.1M  1.8T   1% /mnt/p2
I cannot write 2TB of user data to that RAID1, so this estimate is clearly misleading. I got tired of looking at the bogus disk free space on all my RAID1 btrfs systems, so today I decided to do something about this: ... After:
# df -h /mnt/p2/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       1.8T  1.1M  912G   1% /mnt/p2
Until per-subvolume RAID profiles are implemented, this estimate will be correct, and even after, it should be closer to the truth than assuming the user will fill their RAID1 FS only with subvolumes of single or raid0 profiles.
This is a known issue: https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F Btrfs is still considered experimental - this is just one of those caveats we've learned to adjust to. The change could work well for now and I'm sure it has been considered. I guess the biggest end-user issue is that you can, at a whim, change the model for new blocks - raid0/5/6, single, etc. - and your value from 5 minutes ago is far out from your new value without having written anything or taken up any space. Not a show-stopper problem, really. The biggest dev issue is that future features will break this behaviour, such as the "per-subvolume RAID profiles" you mentioned. It is difficult to motivate including code (for which there's a known workaround) where we know it will be obsoleted.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
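In the meantime, the usual way to get an honest picture is to ask btrfs itself rather than df. The output below is only illustrative of the shape, not taken from Roman's system:

# plain df counts raw bytes across both devices
df -h /mnt/p2
# btrfs' own accounting shows allocation per profile instead
btrfs filesystem df /mnt/p2
#   Data, RAID1: total=1.00GiB, used=512.00KiB
#   System, RAID1: total=32.00MiB, used=16.00KiB
#   Metadata, RAID1: total=1.00GiB, used=112.00KiB

Since every RAID1 chunk exists on two devices, roughly half of the raw figure df reports is what is actually usable for data.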
Re: btrfs-transaction blocked for more than 120 seconds
On 2014/01/06 12:57 AM, Roman Mamedov wrote: Did you align your partitions to accommodate the 4K sectors of the EARS?
I had, yes. I had to do a lot of research to get the array working "optimally". I didn't need to repartition the spare so this carried over to its being used as an OS disk. I actually lost the "Green" array twice - and learned some valuable lessons:
1. I had an 8-port SCSI card which was dropping the disks due to the timeout issue mentioned by Chris. That caused the first array failure. Technically all the data was on the disks - but temporarily irrecoverable as disks were constantly being dropped. I made a mistake during ddrescue which simultaneously destroyed two disks' data, meaning that the recovery operation was ultimately for nought. The only consolation was that I had very little data at the time and none of it was irreplaceable.
2. After replacing the SCSI card with two 4-port SATA cards, a few months later I still had a double failure (the second failure being during the RAID5 rebuild). This time it was only due to bad disks and a lack of scrubbing/early warning - clearly my own fault.
Having learnt these lessons, I'm now a big fan of scrubbing and backups. ;) I'm also pushing for RAID15 wherever data is mission-critical. I simply don't "trust" the reliability of disks any more, and I also better understand how, by having more and/or larger disks in a RAID5/6 array, the overall reliability of that array plummets.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
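Since scrubbing came up: a periodic scrub is cheap insurance and trivial to automate. The mountpoint and schedule here are only an example:

# e.g. from /etc/cron.weekly/btrfs-scrub
btrfs scrub start -Bd /mnt/data    # -B: run in the foreground, -d: per-device stats
btrfs scrub status /mnt/data       # summary, including any unrecoverable errors

With a redundant profile (RAID1/10, and eventually RAID5/6) a scrub doesn't just detect checksum failures, it repairs them from the good copy.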
Re: btrfs-transaction blocked for more than 120 seconds
On 2014/01/05 11:17 PM, Sulla wrote: Certainly: I have 3 HDDs, all of which WD20EARS. Maybe/maybe-not off-topic: Poor hardware performance, though not necessarily the root cause, can be a major factor with these errors. WD Greens (Reds too, for that matter) have poor non-sequential performance. An educated guess I'd say there's a 15% chance this is a major factor to the problem and, perhaps, a 60% chance it is merely a "small contributor" to the problem. Greens are aimed at consumers wanting high capacity and a low pricepoint. The result is poor performance. See footnote * re my experience. My general recommendation (use cases vary of course) is to install a tiny SSD (60GB, for example) just for the OS. It is typically cheaper than the larger drives and will be *much* faster. WD Greens and Reds have good *sequential* throughput but comparatively abysmal random throughput even in comparison to regular non-SSD consumer drives. * I had 8x 1.5TB WD1500EARS drives in an mdRAID5 array. With it I had a single 250GB IDE disk for the OS. When the very old IDE disk inevitably died, I decided to use a spare 1.5TB drive for the OS. Performance was bad enough that I simply bought my first SSD the same week. -- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Total fs size does not match with the actual size of the setup
On 2013/10/27 10:27 PM, Lester B wrote: 2013/10/28 Hugo Mills : On Mon, Oct 28, 2013 at 04:09:18AM +0800, Lester B wrote: The btrfs setup only has one device of size 7 GiB but when I run df, the total size shown is 15 GiB. Running btrfs check --repair
I'd recommend not running btrfs check --repair unless you really know what you're doing, or you've checked with someone knowledgeable and they say you should try it. On a non-broken filesystem (as here), it's probably OK, though.
displays an error "cache and super generation don't match, space cache will be invalidated."
This is harmless.
How can I correct the total fs size as shown in df?
You can't. It's an artefact of the fact that you've got a RAID-1 (or RAID-10, or --mixed and DUP) filesystem, and that the standard kernel interface for df doesn't allow us to report the correct figures -- see [1] (and the subsequent entry as well) for a more detailed description. Hugo. [1] https://btrfs.wiki.kernel.org/index.php/FAQ#Why_does_df_show_incorrect_free_space_for_my_RAID_volume.3F
-- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Nothing right in my left brain. Nothing left in --- my right brain.
But my setup is a simple one without any RAID levels or other things, so at least the df size column should show the actual size of my setup.
Could you send us the output of the following?:
btrfs fi df <path>
(where <path> is the path where the btrfs is mounted.)
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Non-intelligent behaviour on device delete with multiple devices?
On 2013/10/27 07:33 PM, Hans-Kristian Bakke wrote: Hi Today I tried removing two devices from a multi-device btrfs RAID10 volume using the following command:
---
btrfs device delete /dev/sdl /dev/sdk /btrfs
---
It first removed device sdl and then sdk. What I did not expect, however, was that btrfs didn't remove sdk from the available drives when removing and rebalancing data from the first device. This resulted in over 300GB of data actually being added to sdk during the removal of sdl, only to make the removal process of sdk longer. This seems to me a rather non-intelligent way to do this. I would expect all drives given as input to the btrfs device delete command to be removed from the list of drives available for rebalancing of the data during removal of the drives.
This is a known issue that I'm sure will be addressed. It has annoyed me in the past as well. Perhaps add it to the wiki: https://btrfs.wiki.kernel.org/index.php/Project_ideas
-- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: No space left on device, problem
On 2013/10/27 10:50 AM, Igor M wrote: On Sun, Oct 27, 2013 at 2:00 AM, Tomasz Chmielewski wrote: Still no messages. The parameter seems to be active as /sys/module/printk/parameters/ignore_loglevel is Y, but there are no messages in the log files or dmesg. Maybe I need to turn on some kernel debugging option and recompile the kernel? Also I should mention that ca. 230G+ of data was copied before this error started to occur.
I think I saw a similar issue before. Can you try using rsync with the "--bwlimit XY" option to copy the files? The option will limit the speed, in kB, at which the file is being copied; it will work even when source and destination files are on a local machine.
Also I ran strace cp -a .. ...
read(3, "350348f07$0$24520$c3e8da3$fb4835"..., 65536) = 65536
write(4, "350348f07$0$24520$c3e8da3$fb4835"..., 65536) = 65536
read(3, "62.76C52BF412E849CB86D4FF3898B94"..., 65536) = 65536
write(4, "62.76C52BF412E849CB86D4FF3898B94"..., 65536) = -1 ENOSPC (No space left on device)
The last two write calls take a lot more time, and then the last one returns ENOSPC. But if this write is retried, then it succeeds. I tried with midnight commander and when the error occurs, if I retry the operation then it finishes copying this file until the error occurs again at the next file. With --bwlimit it seems to be better: the lower the speed, the later the error occurs, and if it's slow enough the copy is successful. But now I'm not sure anymore. I copied a few files with bwlimit, and now suddenly the error doesn't occur anymore, even with no bwlimit. I'll do some more tests.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
This sounds to me like the problem is related to read performance causing a bork. This would explain why bwlimit helps, as well as why cp works the second time around (since it is cached).
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
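For anyone wanting to try Tomasz's suggestion as given, the invocation is simply the following; the limit value is arbitrary, so tune it until the ENOSPC stops appearing:

# cap the copy at roughly 20 MB/s (--bwlimit takes KB/s), local-to-local included
rsync -a --bwlimit=20000 /source/dir/ /mnt/btrfs/dest/dir/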
Re: btrfs raid0 unable to mount
On 2013/10/26 06:19 AM, lilofile wrote: When I use two disks to create a btrfs raid0, after rebooting the system one disk is unable to mount; the error is as follows:
mount: wrong fs type, bad option, bad superblock on /dev/md0, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so
in /var/log/kern.log:
kernel: [ 480.962487] btrfs: failed to read the system array on md0
kernel: [ 480.988400] btrfs: open_ctree failed
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
At first glance, this looks confusing since it is referring to md0, which would be md's raid. Is it a btrfs-raid0 using an md-raid-something-else? Or is it simply btrfs on top of an md-raid0? cat /proc/mdstat and btrfs fi show will answer those queries.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
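The two checks spelled out, to be run on the affected machine (device names will of course differ):

cat /proc/mdstat        # lists any md arrays and their member disks
btrfs filesystem show   # lists the block devices btrfs believes belong to each filesystem

If /dev/md0 shows up in both, btrfs is sitting on top of md; if btrfs fi show lists the raw disks instead, the raid0 is btrfs' own.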
Re: btrfs raid5
On 2013/10/22 07:18 PM, Alexandre Oliva wrote: ... and it is surely an improvement over the current state of raid56 in btrfs, so it might be a good idea to put it in.
I suspect the issue is that, while it sort of works, we don't really want to push people to use it half-baked. This is reassuring work, however. Maybe it would be nice to have some half-baked code *anyway*, even if Chris doesn't put it in his pull requests juuust yet. ;)
So far, I've put more than 1TB of data on that failing disk with 16 partitions on raid6, and somehow I got all the data back successfully: every file passed an md5sum check, in spite of tons of I/O errors in the process.
Is this all on a single disk? If so it must be seeking like mad! haha
-- ______ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] [RFC] RAID-level terminology change
Bit late - but that could be explored in future. The main downside I see with "automatic" redundancy/optimisation is the complexity it introduces. Likely this would be best served with user-space tools.
On 11/03/13 02:21, Roger Binns wrote: On 10/03/13 15:04, Hugo Mills wrote: Given that this is going to end up rewriting *all* of the data on the FS, Why does all data have to be rewritten? Why does every piece of data have to have exactly the same storage parameters in terms of non-redundancy/performance/striping options?
This is a good point. You don't necessarily have to rewrite everything all at once, so the performance penalty is not necessarily that bad. More importantly, some "restripe" operations actually don't need much change on-disk (in theory). Let's say we have disks with 3c chunk allocation and this needs to be reallocated into 2c chunks. In practice at present what would actually happen is that it would first create a *new* 2c chunk and migrate the data over from the 3c chunk. Once the data is moved across we finally mark the space taken up by the original 3c chunk as available for use. Rinse; repeat. In theory we can skip this rebalance/migration step by "thinning" out the chunks in-place: relabel the chunk as 2c and mark the unneeded copies as available diskspace.
A similar situation applies to other types of "conversions" in that they could be converted in-place with much less I/O, or the I/O could be optimised (for example sequential I/O between disks with minimal buffering needed vs moving data between two locations on the same disk). I'm sure there are other possibilities for "in-place" conversions too, such as moving from 4c to 2c2s or 2c to 2s:
xC -> (x-1)C
xCmS -> (x/2)C(m*2)S
The complexity of the different types of conversions hasn't escaped me, and I do see another downside as well. With the 3C->2C conversion there is the inevitability of "macro" fragmentation. Again, there could be long-term performance implications, or it might even be negligible.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
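For contrast, this is what a restripe looks like with the current tools - the whole-filesystem rewrite being discussed - expressed with today's profile names rather than the proposed nCmS notation; the mountpoint is an example:

# rewrite every data and metadata chunk with the new profile
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
# balance filters can at least bound how much is rewritten in one pass,
# e.g. only data chunks that are at most half full:
btrfs balance start -dusage=50 /mnt

An in-place xC -> (x-1)C "thinning" of the kind sketched above would avoid most of that I/O, but nothing like it exists in the balance code today.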
Re: btrfs send & receive produces "Too many open files in system"
On 2013/02/18 12:37 PM, Adam Ryczkowski wrote: ... to migrate btrfs from one partition layout to another. ... sits on top of an lvm2 logical volume, which sits on top of a cryptsetup LUKS device which subsequently sits on top of mdadm RAID-6 spanning a partition on each of 4 hard drives ... is a read-only snapshot which I estimate contains ca. 100GB of data. ... is a btrfs multi-device raid10 filesystem, which is based on 4 cryptsetup LUKS devices, each living on a separate partition on the same 4 physical hard drives ... ... about 8MB/sec read (and the same speed of write) from each of all 4 hard drives).
I hope you've solved this already - but if not:
The unnecessarily complex setup aside, a 4-disk RAID6 is going to be slow - most would have gone for a RAID10 configuration, albeit with less redundancy. Another real problem here is that you are copying data from these disks to themselves. This means that for every read and write, all four disks have to do two seeks. This is time-consuming, on the order of 7ms per seek, depending on the disks you have. The way to avoid these unnecessary seeks is to first copy the data to a separate, unrelated device and then to copy from that device to your final destination device.
To increase RAID6 write performance (perhaps irrelevant here) you can try optimising the stripe_cache_size value. It can use a ton of memory depending on how large a stripe cache setting you end up with. Search online for "mdraid stripe_cache_size". To increase the read performance you can try optimising the md array's readahead. As above, search online for "blockdev setra". This should hopefully make a noticeable difference. Good luck.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
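The two tunables mentioned, for reference; md0 and the values are examples only, and both settings trade RAM for throughput, so the sweet spot depends on the hardware:

# md RAID5/6 stripe cache, in entries per device (default 256)
echo 4096 > /sys/block/md0/md/stripe_cache_size
# readahead for the array, in 512-byte sectors (65536 = 32 MiB)
blockdev --setra 65536 /dev/md0
blockdev --getra /dev/md0    # confirm the current value

Memory used by the stripe cache is roughly stripe_cache_size x 4KiB x number-of-devices, so 4096 on a 4-disk array is about 64MiB.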
Re: [PATCH v3] Btrfs-progs: check out if the swap device
On 2013/02/14 09:53 AM, Tsutomu Itoh wrote:
+	if (ret < 0) {
+		fprintf(stderr, "error checking %s status: %s\n", file,
+			strerror(-ret));
+		exit(1);
+	}
...
+	/* check if the device is busy */
+	fd = open(file, O_RDWR|O_EXCL);
+	if (fd < 0) {
+		fprintf(stderr, "unable to open %s: %s\n", file,
+			strerror(errno));
+		exit(1);
+	}
This is fine and works (as tested by David) - but I'm not sure if the below suggestions from Zach were taken into account.
1. If the check with "open(file, O_RDWR|O_EXCL)" shows that the device is available, there's no point in checking if it is mounted as a swap device. A preliminary check using this could precede all other checks, which should be skipped if it shows success.
2. If there's an error checking the status (for example, let's say /proc/swaps is deprecated), we should print the informational message but not error out.
On 2013/02/13 11:58 AM, Zach Brown wrote:
- First always open with O_EXCL. If it succeeds then there's no reason to check /proc/swaps at all. (Maybe it doesn't need to try check_mounted() there either? Not sure if it's protecting against accidentally mounting mounted shared storage or not.) ...
- At no point is failure of any of the /proc/swaps parsing fatal. It'd carry on ignoring errors until it doesn't have work to do. It'd only ever print the nice message when it finds a match.
-- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [btrfs] Periodic write spikes while idling, on btrfs root
On 2013/02/14 12:15 PM, Vedant Kumar wrote: Hello, I'm experiencing periodic write spikes while my system is idle. ... turned out to be some systemd log in /var/log/journal. I turned off journald and rebooted, but the write spike behavior remained. ... best, -vk I believe btrfs syncs every 30 seconds (if anything's changed). This sounds like systemd's journal is not actually disabled and that it is simply logging new information every few seconds and forcing it to be synced to disk. Have you tried following the journal as root to see what is being logged? journalctl -f Alternatively, as another measure to troubleshoot, in /etc/systemd/journald.conf, change the Storage= option either to "none" (which disables logging completely) or to a path inside a tmpfs, thereby eliminating btrfs' involvement. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
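The two troubleshooting steps, concretely; restart journald after editing the config, and note that Storage=volatile is the stock way of keeping the journal on a tmpfs:

journalctl -f                          # watch what is being written, as root

# /etc/systemd/journald.conf
[Journal]
Storage=none        # drop all log storage entirely
#Storage=volatile   # ...or keep the journal only in /run (tmpfs), off the btrfs volume

systemctl restart systemd-journald

With Storage=volatile the journal lives under /run/log/journal, so if the 30-second write spikes continue after that, they are not coming from journald.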