Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
On Sat, May 13, 2017 at 07:41:24PM -0600, Andreas Dilger wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers wrote:
> >
> > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> >>
> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> >>> Theodore Ts'o writes:
> >>> > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> > 1.) Privacy implications. Say the filesystem is being shared between multiple users, and one user unpacks foo.tar.gz into their home directory, which they've set to mode 700 to hide from other users. Because of this new ioctl, all users will be able to see every (inode number, size in blocks) pair that was added to the filesystem, as well as the exact layout of the physical block allocations which might hint at how the files were created. If there is a known "fingerprint" for the unpacked foo.tar.gz in this regard, its presence on the filesystem will be revealed to all users. And if any filesystems happen to prefer allocating blocks near the containing directory, the directory the files are in would likely be revealed too.
> >>
> >> Frankly, why are container users even allowed to make unrestricted ioctl calls? I thought we had a bunch of security infrastructure to constrain what userspace can do to a system, so why don't ioctls fall under these same protections? If your containers are really that adversarial, you ought to be blacklisting as much as you can.
> >>
> >
> > Personally I don't find the presence of sandboxing features to be a very good excuse for introducing random insecure ioctls. Not everyone has everything perfectly "sandboxed" all the time, for obvious reasons. It's easy to forget about the filesystem ioctls, too, since they can be executed on any regular file, without having to open some device node in /dev.
> >
> > (And this actually does happen; the SELinux policy in Android, for example, still allows apps to call any ioctl on their data files, despite all the effort that has gone into whitelisting other types of ioctls. Which should be fixed, of course, but it shows that this kind of mistake is very easy to make.)
> >
> Unix/Linux has historically not been terribly concerned about trying to protect this kind of privacy between users. So for example, in order to do this, you would have to call GETFSMAP continuously to track this sort of thing. Someone who wanted to do this could probably get this information (and much, much more) by continuously running "ps" to see what processes are running.
>
> (I will note, wryly, that in the bad old days, when dozens of users were sharing a one MIPS Vax/780, it was considered a *good* thing that social pressure could be applied when it was found that someone was running a CPU or memory hogger on a time sharing system. The privacy right of someone running "xtrek" to be able to hide this from other users on the system was never considered important at all. :-)
> >>
> >> Not to mention someone running GETFSMAP in a loop will be pretty obvious both from the high kernel cpu usage and the huge number of metadata operations.
> >
> > Well, only if that someone running GETFSMAP actually wants to watch things in real-time (it's not necessary for all scenarios that have been mentioned), *and* there is monitoring in place which actually detects it and can do something about it.
> >
> > Yes, PIDs have traditionally been global, but today we have PID namespaces, and many other isolation features such as mount namespaces. Nothing is perfect, of course, and containers are a lot worse than VMs, but it seems weird to use that as an excuse to knowingly make things worse...
> >
> >>
> Fortunately, the days of timesharing seem to be well behind us. For those people who think that containers are as secure as VM's (hah, hah, hah), it might be that the best way to handle this is to have a mount option that requires root access to this functionality. For those people who really care about this, they can disable access.
> >>
> >> Or use separate filesystems for each container so that exploitable bugs that shut down the filesystem can't be used to kill the other containers. You could use a torrent of metadata-heavy operations (fallocate a huge file, punch every block, truncate file, repeat) to DoS the other containers.
> >>
> >>> What would be the reason for not putting this behind capable(CAP_SYS_ADMIN)?
> >>>
>
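"Putting this behind capable(CAP_SYS_ADMIN)" refers to the usual kernel-side capability gate; as a rough, purely illustrative sketch (not code from the posted patch):

	/* hypothetical check at the top of the GETFSMAP ioctl handler */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;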
Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
On May 10, 2017, at 11:10 PM, Eric Biggers wrote:
>
> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
>>
>> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
>>> Theodore Ts'o writes:
>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> 1.) Privacy implications. Say the filesystem is being shared between multiple users, and one user unpacks foo.tar.gz into their home directory, which they've set to mode 700 to hide from other users. Because of this new ioctl, all users will be able to see every (inode number, size in blocks) pair that was added to the filesystem, as well as the exact layout of the physical block allocations which might hint at how the files were created. If there is a known "fingerprint" for the unpacked foo.tar.gz in this regard, its presence on the filesystem will be revealed to all users. And if any filesystems happen to prefer allocating blocks near the containing directory, the directory the files are in would likely be revealed too.
>>
>> Frankly, why are container users even allowed to make unrestricted ioctl calls? I thought we had a bunch of security infrastructure to constrain what userspace can do to a system, so why don't ioctls fall under these same protections? If your containers are really that adversarial, you ought to be blacklisting as much as you can.
>>
>
> Personally I don't find the presence of sandboxing features to be a very good excuse for introducing random insecure ioctls. Not everyone has everything perfectly "sandboxed" all the time, for obvious reasons. It's easy to forget about the filesystem ioctls, too, since they can be executed on any regular file, without having to open some device node in /dev.
>
> (And this actually does happen; the SELinux policy in Android, for example, still allows apps to call any ioctl on their data files, despite all the effort that has gone into whitelisting other types of ioctls. Which should be fixed, of course, but it shows that this kind of mistake is very easy to make.)
>

Unix/Linux has historically not been terribly concerned about trying to protect this kind of privacy between users. So for example, in order to do this, you would have to call GETFSMAP continuously to track this sort of thing. Someone who wanted to do this could probably get this information (and much, much more) by continuously running "ps" to see what processes are running.

(I will note, wryly, that in the bad old days, when dozens of users were sharing a one MIPS Vax/780, it was considered a *good* thing that social pressure could be applied when it was found that someone was running a CPU or memory hogger on a time sharing system. The privacy right of someone running "xtrek" to be able to hide this from other users on the system was never considered important at all. :-)

>>
>> Not to mention someone running GETFSMAP in a loop will be pretty obvious both from the high kernel cpu usage and the huge number of metadata operations.
>
> Well, only if that someone running GETFSMAP actually wants to watch things in real-time (it's not necessary for all scenarios that have been mentioned), *and* there is monitoring in place which actually detects it and can do something about it.
>
> Yes, PIDs have traditionally been global, but today we have PID namespaces, and many other isolation features such as mount namespaces.
> Nothing is perfect, of course, and containers are a lot worse than VMs, but it seems weird to use that as an excuse to knowingly make things worse...
>
>>
Fortunately, the days of timesharing seem to be well behind us. For those people who think that containers are as secure as VM's (hah, hah, hah), it might be that the best way to handle this is to have a mount option that requires root access to this functionality. For those people who really care about this, they can disable access.
>>
>> Or use separate filesystems for each container so that exploitable bugs that shut down the filesystem can't be used to kill the other containers. You could use a torrent of metadata-heavy operations (fallocate a huge file, punch every block, truncate file, repeat) to DoS the other containers.
>>
>>> What would be the reason for not putting this behind capable(CAP_SYS_ADMIN)?
>>>
>>> What possible legitimate function could this functionality serve to users who don't own your filesystem?
>>
>> As I've said before, it's to enable dedupe tools to decide, given a set of files with shareable blocks, roughly how many other times each of those shareable blocks are shared
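As an illustration of what the interface under discussion exposes, here is a minimal sketch of a GETFSMAP caller (not taken from the man page patch; it assumes <linux/fsmap.h> from a kernel that implements the ioctl, which at the time meant XFS rather than btrfs). It issues a single query covering the whole filesystem and prints the returned records; a real tool would loop, re-seeding fmh_keys[0] from the last record returned, and would check errors properly.

/* getfsmap-dump.c: illustrative sketch only, minimal error handling */
#include <linux/fsmap.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

int main(int argc, char *argv[])
{
	const unsigned int nr = 128;	/* records requested per call */
	struct fsmap_head *head;
	unsigned int i;
	int fd;

	head = calloc(1, sizeof(*head) + nr * sizeof(struct fsmap));
	head->fmh_count = nr;
	/* low key stays all-zero; high key = "end of the filesystem" */
	head->fmh_keys[1].fmr_device   = UINT32_MAX;
	head->fmh_keys[1].fmr_physical = UINT64_MAX;
	head->fmh_keys[1].fmr_owner    = UINT64_MAX;
	head->fmh_keys[1].fmr_offset   = UINT64_MAX;
	head->fmh_keys[1].fmr_flags    = UINT32_MAX;

	fd = open(argv[1], O_RDONLY);	/* any file or directory on the fs */
	if (fd < 0 || ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
		perror("FS_IOC_GETFSMAP");
		return 1;
	}
	for (i = 0; i < head->fmh_entries; i++)
		printf("dev %u phys %" PRIu64 " len %" PRIu64 " owner %" PRIu64 "\n",
		       (unsigned int)head->fmh_recs[i].fmr_device,
		       (uint64_t)head->fmh_recs[i].fmr_physical,
		       (uint64_t)head->fmh_recs[i].fmr_length,
		       (uint64_t)head->fmh_recs[i].fmr_owner);
	return 0;
}

Each record is a (device, physical extent, owner, file offset, length) tuple for the entire filesystem regardless of file ownership, which is exactly the information the privacy and CAP_SYS_ADMIN discussion above is about, and also what a dedupe tool would aggregate.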
balancing every night broke balancing so now I can't balance anymore?
Kernel 4.11, btrfs-progs v4.7.3

I run scrub and balance every night, been doing this for 1.5 years on this filesystem. But it has just started failing:

saruman:~# btrfs balance start -musage=0 /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -dusage=0 /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -musage=1 /mnt/btrfs_pool1
ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
saruman:~# btrfs balance start -dusage=10 /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -dusage=20 /mnt/btrfs_pool1
ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
There may be more info in syslog - try dmesg | tail

BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): relocating block group 598566305792 flags data
BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): relocating block group 598566305792 flags data
BTRFS info (device dm-2): 1 enospc errors during balance

saruman:~# btrfs fi show /mnt/btrfs_pool1/
Label: 'btrfs_pool1'  uuid: bc115001-a8d1-445c-9ec9-6050620efd0a
        Total devices 1  FS bytes used 169.73GiB
        devid    1 size 228.67GiB used 228.67GiB path /dev/mapper/pool1

saruman:~# btrfs fi usage /mnt/btrfs_pool1/
Overall:
    Device size:                 228.67GiB
    Device allocated:            228.67GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        171.25GiB
    Free (estimated):             55.32GiB      (min: 55.32GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:221.60GiB, Used:166.28GiB
   /dev/mapper/pool1     221.60GiB

Metadata,single: Size:7.03GiB, Used:4.96GiB
   /dev/mapper/pool1       7.03GiB

System,single: Size:32.00MiB, Used:48.00KiB
   /dev/mapper/pool1      32.00MiB

Unallocated:
   /dev/mapper/pool1       1.00MiB

How did I get into such a misbalanced state when I balance every night? My filesystem is not full, I can write just fine, but I sure cannot rebalance now.

Besides adding another device to add space, is there a way around this and more generally not getting into that state anymore considering that I already rebalance every night?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
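Working just from the numbers quoted above: the device is 228.67GiB and all 228.67GiB of it is already allocated to chunks, leaving only 1.00MiB unallocated, even though the data chunks hold only 166.28GiB of the 221.60GiB allocated to them (hence the 55.32GiB "free" estimate). A balance with a usage filter relocates a block group by first writing its live data into a newly allocated chunk, and a data chunk on a device this size is normally 1GiB, so with only 1MiB unallocated there is nowhere to place the new chunk and the kernel reports ENOSPC even though plenty of space is free inside the existing chunks.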
Re: Creating btrfs RAID on LUKS devs makes devices disappear
13.05.2017 18:28, Ochi wrote:
> Hello,
>
> okay, I think I now have a repro that is stupidly simple, I'm not even sure if I overlook something here. No multi-device btrfs involved, but notably it does happen with btrfs, but not with e.g. ext4.
>
I could not reproduce it with a single device, but I finally was able to reliably reproduce your problem with multiple devices. It looks a bit different; I think the difference is due to the encrypted root in your case. Anyway:

https://github.com/systemd/systemd/issues/5955
Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
Hi Liu,

On Wed, Mar 22, 2017 at 1:40 AM, Liu Bo wrote:
> On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
>> We have a commit_root_sem, which is a read-write semaphore that protects the commit roots. But it is also used to protect the list of caching block groups.
>>
>> As a result, while doing "slow" caching, the following issue is seen:
>>
>> Some of the caching threads are scanning the extent tree with commit_root_sem acquired in shared mode, with stack like:
>> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
>> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
>> [] read_tree_block+0x40/0x70 [btrfs]
>> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
>> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
>> [] caching_thread+0x1b9/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>>
>> IO requests that want to allocate space are waiting in cache_block_group() to acquire the commit_root_sem in exclusive mode. But they only want to add the caching control structure to the list of caching block-groups:
>> [] schedule+0x29/0x70
>> [] rwsem_down_write_failed+0x145/0x320
>> [] call_rwsem_down_write_failed+0x13/0x20
>> [] cache_block_group+0x25b/0x450 [btrfs]
>> [] find_free_extent+0xd16/0xdb0 [btrfs]
>> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
>>
>> Other caching threads want to continue their scanning, and for that they are waiting to acquire commit_root_sem in shared mode. But since there are IO threads that want the exclusive lock, the caching threads are unable to continue the scanning, because (I presume) rw_semaphore guarantees some fairness:
>> [] schedule+0x29/0x70
>> [] rwsem_down_read_failed+0xc5/0x120
>> [] call_rwsem_down_read_failed+0x14/0x30
>> [] caching_thread+0x1a1/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>> [] process_one_work+0x146/0x410
>>
>> This causes slowness of the IO, especially when there are many block groups that need to be scanned for free space. In some cases it takes minutes until a single IO thread is able to allocate free space.
>>
>> I don't see a deadlock here, because the caching threads that were able to acquire the commit_root_sem will call rwsem_is_contended() and should give up the semaphore, so that IO threads are able to acquire it in exclusive mode.
>>
>> However, introducing a separate mutex that protects only the list of caching block groups makes things move forward much faster.
>>
>
> The problem did exist and the patch looks good to me.
>
>> This patch is based on kernel 3.18. Unfortunately, I am not able to submit a patch based on one of the latest kernels, because here btrfs is part of the larger system, and upgrading the kernel is a significant effort. Hence marking the patch as RFC. Hopefully, this patch still has some value to the community.
>>
>> Signed-off-by: Alex Lyakas
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 42d11e7..74feacb 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>>      struct list_head trans_list;
>>      struct list_head dead_roots;
>>      struct list_head caching_block_groups;
>> +    /* protects the above list */
>> +    struct mutex caching_block_groups_mutex;
>>
>>      spinlock_t delayed_iput_lock;
>>      struct list_head delayed_iputs;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5177954..130ec58 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>>      INIT_LIST_HEAD(&fs_info->delayed_iputs);
>>      INIT_LIST_HEAD(&fs_info->delalloc_roots);
>>      INIT_LIST_HEAD(&fs_info->caching_block_groups);
>> +    mutex_init(&fs_info->caching_block_groups_mutex);
>>      spin_lock_init(&fs_info->delalloc_root_lock);
>>      spin_lock_init(&fs_info->trans_lock);
>>      spin_lock_init(&fs_info->fs_roots_radix_lock);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a067065..906fb08 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>>              return 0;
>>      }
>>
>> -    down_write(&fs_info->commit_root_sem);
>> +    mutex_lock(&fs_info->caching_block_groups_mutex);
>>      atomic_inc(&caching_ctl->count);
>>      list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
>> -    up_write(&fs_info->commit_root_sem);
>> +    mutex_unlock(&fs_info->caching_block_groups_mutex);
>>
>>      btrfs_get_block_group(cache);
>>
>> @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>>
>>      down_write(&fs_info->commit_root_sem);
>>
>
> With the new mutex, it's not necessary to take commit_root_sem here because a) pinned_extents could not be modified outside of a transaction, and b) while at
Re: Creating btrfs RAID on LUKS devs makes devices disappear
Hello,

okay, I think I now have a repro that is stupidly simple, I'm not even sure if I overlook something here. No multi-device btrfs involved, but notably it does happen with btrfs, but not with e.g. ext4.

[Sidenote: At first I thought it had to do with systemd-cryptsetup opening multiple devices with the same key that makes a difference. Rationale: I think the whole systemd machinery for opening crypt devices is able to try the same password on multiple devices when manual keyphrase input is used, and I thought maybe the same is true for keyfiles which may cause race conditions, but after all it doesn't seem to matter much. Also it seemed to relate to multi-device btrfs volumes, but now it appears to be simpler than that. That said, I can't be sure whether there are more problems hidden when actually using RAID.]

I have tried to repro the issue on a completely fresh Arch Linux in a VirtualBox VM. No custom systemd magic involved whatsoever, all stock services, generators, etc. In addition to the root volume (no crypto), there is another virtual HDD with one partition. This is a LUKS partition with a keyfile added to open it automatically on boot. I added a corresponding /etc/crypttab line as follows:

storage0    /dev/sdb1    /etc/crypto/keyfile

Let's suppose we open the crypt device manually the first time and perform mkfs.btrfs on the /dev/mapper/storage0 device. Reboot the system such that systemd-cryptsetup can do its magic to open the dm device. After reboot, log in. /dev/mapper/storage0 should be there, and of course the corresponding /dev/dm-*. Perform another mkfs.btrfs on /dev/mapper/storage0.

What I observe is (possibly try multiple times, but it has been pretty reliable in my testing):

- /dev/mapper/storage0 and the /dev/dm-* device are gone.
- A process systemd-cryptsetup is using 100% CPU (haven't noticed before, but now on my laptop I can actually hear it)
- The dm-device was eliminated by systemd, see the logs below.
- Logging out and in again (as root in my case) solves the issue, the device is back.

I have prepared outputs of journalctl and udevadm info --export-db produced after the last step (logging out and back in). Since the logs are quite large, I link them here, I hope that is okay:

https://pastebin.com/1r6j1Par
https://pastebin.com/vXLGFQ0Z

In the journal, the interesting spots are after the two "ROOT LOGIN ON tty1". A few seconds after the first one, I performed the mkfs.

Notably, it doesn't seem to happen when using e.g. ext4 instead of btrfs. Also, it doesn't happen when opening the crypt device manually, without crypttab and thus without systemd-cryptsetup, systemd-cryptsetup-generator, etc. which parses crypttab. So after all, I suspect the systemd-cryptsetup to be the culprit in combination with btrfs volumes. Maybe someone can repro that.

Versions used in the VM:
- Current Arch Linux
- Kernel 4.10.13
- btrfs-progs 4.10.2
- systemd v232 (also tested v233 from testing repo with same results)

Hope this helps
Sebastian
Re: [PATCH v3 00/19] Btrfs-progs offline scrub
>
> Ping?
>
> Any comments?
>
> Thanks,
> Qu

Can I inject corruption with the existing script [1] and expect offline scrub to fix it? If so, I'll give it a try and let you know the results.

[1] https://patchwork.kernel.org/patch/9583455/

Cheers,
Lakshmipathi.G
[PATCH v2] btrfs-progs: btrfs-convert: Add larger device support
With a larger file system (in this case 22TB), ext2fs_open() returns the EXT2_ET_CANT_USE_LEGACY_BITMAPS error message with ext2fs_read_block_bitmap(). To overcome this issue, we need to pass the EXT2_FLAG_64BITS flag to ext2fs_open and also use 64-bit functions like ext2fs_get_block_bitmap_range2, ext2fs_inode_data_blocks2 and ext2fs_read_ext_attr2.

bug: https://bugzilla.kernel.org/show_bug.cgi?id=194795

Signed-off-by: Lakshmipathi.G
---
 convert/source-ext2.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/convert/source-ext2.c b/convert/source-ext2.c
index 1b0576b..275cb89 100644
--- a/convert/source-ext2.c
+++ b/convert/source-ext2.c
@@ -34,8 +34,9 @@ static int ext2_open_fs(struct btrfs_convert_context *cctx, const char *name)
 	ext2_filsys ext2_fs;
 	ext2_ino_t ino;
 	u32 ro_feature;
+	int open_flag = EXT2_FLAG_SOFTSUPP_FEATURES | EXT2_FLAG_64BITS;
 
-	ret = ext2fs_open(name, 0, 0, 0, unix_io_manager, &ext2_fs);
+	ret = ext2fs_open(name, open_flag, 0, 0, unix_io_manager, &ext2_fs);
 	if (ret) {
 		fprintf(stderr, "ext2fs_open: %s\n", error_message(ret));
 		return -1;
@@ -148,7 +149,7 @@ static int ext2_read_used_space(struct btrfs_convert_context *cctx)
 		return -ENOMEM;
 
 	for (i = 0; i < fs->group_desc_count; i++) {
-		ret = ext2fs_get_block_bitmap_range(fs->block_map, blk_itr,
+		ret = ext2fs_get_block_bitmap_range2(fs->block_map, blk_itr,
 				block_nbytes * 8, block_bitmap);
 		if (ret) {
 			error("fail to get bitmap from ext2, %s",
@@ -353,7 +354,7 @@ static int ext2_create_symlink(struct btrfs_trans_handle *trans,
 	int ret;
 	char *pathname;
 	u64 inode_size = btrfs_stack_inode_size(btrfs_inode);
-	if (ext2fs_inode_data_blocks(ext2_fs, ext2_inode)) {
+	if (ext2fs_inode_data_blocks2(ext2_fs, ext2_inode)) {
 		btrfs_set_stack_inode_size(btrfs_inode, inode_size + 1);
 		ret = ext2_create_file_extents(trans, root, objectid,
 					       btrfs_inode, ext2_fs, ext2_ino,
@@ -627,9 +628,9 @@ static int ext2_copy_extended_attrs(struct btrfs_trans_handle *trans,
 		ret = -ENOMEM;
 		goto out;
 	}
-	err = ext2fs_read_ext_attr(ext2_fs, ext2_inode->i_file_acl, buffer);
+	err = ext2fs_read_ext_attr2(ext2_fs, ext2_inode->i_file_acl, buffer);
 	if (err) {
-		fprintf(stderr, "ext2fs_read_ext_attr: %s\n",
+		fprintf(stderr, "ext2fs_read_ext_attr2: %s\n",
 			error_message(err));
 		ret = -1;
 		goto out;
-- 
2.7.4
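For scale (assuming the common 4KiB ext2/ext3/ext4 block size, which is an assumption rather than something stated in the patch): the legacy e2fsprogs bitmap interfaces address blocks with 32-bit block numbers, so they top out at 2^32 blocks * 4KiB = 16TiB. A 22TB filesystem is past that limit, which is why the 64-bit open flag and the *2 variants of the bitmap, inode-data-blocks and extended-attribute helpers are needed here.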
[OT] SSD performance patterns (was: Btrfs/SSD)
On Sat, 13 May 2017 09:39:39 +0000 (UTC), Duncan <1i5t5.dun...@cox.net> wrote:

> Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:
>
> > In the end, the more continuous blocks of free space there are, the better the chance for proper wear leveling.
>
> Talking about which...
>
> When I was doing my ssd research the first time around, the going recommendation was to keep 20-33% of the total space on the ssd entirely unallocated, allowing it to use that space as an FTL erase-block management pool.
>
> At the time, I added up all my "performance matters" data dirs and allowing for reasonable in-filesystem free-space, decided I could fit it in 64 GB if I had to, tho 80 GB would be a more comfortable fit, so allowing for the above entirely unpartitioned/unused slackspace recommendations, had a target of 120-128 GB, with a reasonable range depending on actual availability of 100-160 GB.
>
> It turned out, due to pricing and availability, I ended up spending somewhat more and getting 256 GB (238.5 GiB). Of course that allowed me much more flexibility than I had expected and I ended up with basically everything but the media partition on the ssds, PLUS I still left them at only just over 50% partitioned, (using the gdisk figures, 51%- partitioned, 49%+ free).

I put my ESP (for UEFI) onto the SSD and also played with putting swap onto it dedicated to hibernation. But I discarded the hibernation idea and removed the swap because it didn't work well: It wasn't much faster than waking from HDD, and hibernation is not that reliable anyways. Also, hybrid hibernation is not yet integrated into KDE so I stick to sleep mode currently.

The rest of my SSD (also 500GB) is dedicated to bcache. This fits my complete work set of daily work with hit ratios going up to 90% and beyond. My filesystem boots and feels like SSD, the HDDs are almost silent and still my file system is 3TB on 3x 1TB HDD.

> Given that, I've not enabled btrfs trim/discard (which saved me from the bugs with it a few kernel cycles ago), and while I do have a weekly fstrim systemd timer setup, I've not had to be too concerned about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was known not to be trimming everything it really should have been.

This is a good recommendation as TRIM is still a slow operation because Queued TRIM is not used for most drives due to buggy firmware. So you not only circumvent kernel and firmware bugs, but also get better performance that way.

> Anyway, that 20-33% left entirely unallocated/unpartitioned recommendation still holds, right? Am I correct in asserting that if one is following that, the FTL already has plenty of erase-blocks available for management and the discussion about filesystem level trim and free space management becomes much less urgent, tho of course it's still worth considering if it's convenient to do so?
>
> And am I also correct in believing that while it's not really worth spending more to over-provision to the near 50% as I ended up doing, if things work out that way as they did with me because the difference in price between 30% overprovisioning and 50% overprovisioning ends up being trivial, there's really not much need to worry about active filesystem trim at all, because the FTL has effectively half the device left to play erase-block musical chairs with as it decides it needs to?

I think, things may have changed since long ago. See below.
But it certainly depends on which drive manufacturer you chose, I guess. I can at least confirm that bigger drives wear their write cycles much slower, even when filled up. My old 128GB Crucial drive was worn after only 1 year (I swapped it early, I kept an eye on SMART numbers). My 500GB Samsung drive is around 1 year old now, I do write a lot more data to it, but according to SMART it should work for at least 5 to 7 more years. By that time, I probably already swapped it for a bigger drive.

So I guess you should maybe look at your SMART numbers and calculate the expected life time:

  Power_on_Hours(RAW) * WLC(VALUE) / (100 - WLC(VALUE))

with WLC = Wear_Leveling_Count should get you the expected remaining power-on hours. My drive is powered on 24/7 most of the time but if you power your drive only 8 hours per day, you can easily ramp up the life time by a factor of three compared to me. ;-)

There is also Total_LBAs_Written but that, at least for me, usually gives much higher lifetime values so I'd stick with the pessimistic ones.

Even when WLC goes to zero, the drive should still have reserved blocks available. My drive sets the threshold to 0 for WLC which makes me think that it is not fatal when it hits 0 because the drive still has reserved blocks. And for reserved blocks, the threshold is 10%.

Now combine that with your planning of getting a new drive, and you can optimize space efficiency vs. lifetime better.

> Of course the higher per-GiB cost
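To make the lifetime formula above concrete, a worked example with hypothetical SMART readings: suppose Power_on_Hours has a raw value of 8760 (one year powered on 24/7) and Wear_Leveling_Count shows a normalized VALUE of 88, i.e. 12 points consumed. Then 8760 * 88 / (100 - 88) = 64,240 remaining power-on hours, or a bit over 7 more years of 24/7 operation, which is in the same ballpark as the 5 to 7 year estimate above.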
Re: Btrfs/SSD
> Anyway, that 20-33% left entirely unallocated/unpartitioned recommendation still holds, right?

I never liked that idea. And I really disliked how people considered it to be (and even passed it down as) some magical, absolute stupid-proof fail-safe thing (because it's not).

1: Unless you reliably trim the whole LBA space (and/or run ata_secure_erase on the whole drive) before you (re-)partition the LBA space, you have zero guarantee that the drive's controller/firmware will treat the unallocated space as empty or will keep its content around as useful data (even if it's full of zeros, because zero could be very useful data unless it's specifically marked as "throwaway" by trim/erase). On the other hand, a trim-compatible filesystem should properly mark (trim) all (or at least most of) the free space as free (= free to erase internally at the controller's discretion). And even if trim isn't fail-proof either, those bugs should be temporary (and it's not like a sane SSD will die in a few weeks due to these kinds of issues during sane usage, and crazy drives will often fail under crazy usage regardless of trim and spare space).

2: It's not some daemon-summoning, world-ending catastrophe if you occasionally happen to fill your SSD to ~100%. It probably won't like it (it will probably get slow by the end of the writes and the internal write amplification might skyrocket at its peak) but nothing extraordinary will happen, and normal operation (high write speed, normal internal write amplification, etc) should resume soon after you make some room (for example, you delete your temporary files or move some old content to an archive storage and you properly trim that space). That space is there to be used, just don't leave it close to 100% all the time and try never leaving it close to 100% when you plan to keep it busy with many small random writes.

3: Some drives have plenty of hidden internal spare space (especially the expensive kinds offered for datacenters or "enthusiast" consumers by big companies like Intel and such). Even some cheap drives might have plenty of erased space at 100% LBA allocation if they use compression internally (and you don't fill it up to 100% with in-compressible content).
Re: Btrfs/SSD
On Sat, 13 May 2017 14:52:47 +0500, Roman Mamedov wrote:

> On Fri, 12 May 2017 20:36:44 +0200
> Kai Krakow wrote:
>
> > My concern is with fail scenarios of some SSDs which die unexpected and horribly. I found some reports of older Samsung SSDs which failed suddenly and unexpected, and in a way that the drive completely died: No more data access, everything gone. HDDs start with bad sectors and there's a good chance I can recover most of the data except a few sectors.
>
> Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort of RAID.
>
> In a way it's even better, that SSDs [are said to] fail abruptly and entirely. You can then just restore from backups and go on. Whereas a failing HDD can leave you puzzled on e.g. whether it's a cable or controller problem instead, and possibly can even cause some data corruption which you won't notice until too late.

My current backup strategy can handle this. I never backup files from the source again if it didn't change by timestamp. That way, silent data corruption won't creep into the backup. Additionally, I keep a backlog of 5 years of file history. Even if a corrupted file creeps into the backup, there is enough time to get a good copy back. If it's older, it probably doesn't hurt so much anyway.

-- 
Regards,
Kai
Replies to list-only preferred.
Re: Btrfs/SSD
On Fri, 12 May 2017 20:36:44 +0200
Kai Krakow wrote:

> My concern is with fail scenarios of some SSDs which die unexpected and horribly. I found some reports of older Samsung SSDs which failed suddenly and unexpected, and in a way that the drive completely died: No more data access, everything gone. HDDs start with bad sectors and there's a good chance I can recover most of the data except a few sectors.

Just have your backups up-to-date, doesn't matter if it's SSD, HDD or any sort of RAID.

In a way it's even better, that SSDs [are said to] fail abruptly and entirely. You can then just restore from backups and go on. Whereas a failing HDD can leave you puzzled on e.g. whether it's a cable or controller problem instead, and possibly can even cause some data corruption which you won't notice until too late.

-- 
With respect,
Roman
Re: Btrfs/SSD
Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:

> In the end, the more continuous blocks of free space there are, the better the chance for proper wear leveling.

Talking about which...

When I was doing my ssd research the first time around, the going recommendation was to keep 20-33% of the total space on the ssd entirely unallocated, allowing it to use that space as an FTL erase-block management pool.

At the time, I added up all my "performance matters" data dirs and allowing for reasonable in-filesystem free-space, decided I could fit it in 64 GB if I had to, tho 80 GB would be a more comfortable fit, so allowing for the above entirely unpartitioned/unused slackspace recommendations, had a target of 120-128 GB, with a reasonable range depending on actual availability of 100-160 GB.

It turned out, due to pricing and availability, I ended up spending somewhat more and getting 256 GB (238.5 GiB). Of course that allowed me much more flexibility than I had expected and I ended up with basically everything but the media partition on the ssds, PLUS I still left them at only just over 50% partitioned, (using the gdisk figures, 51%- partitioned, 49%+ free).

Given that, I've not enabled btrfs trim/discard (which saved me from the bugs with it a few kernel cycles ago), and while I do have a weekly fstrim systemd timer setup, I've not had to be too concerned about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was known not to be trimming everything it really should have been.

Anyway, that 20-33% left entirely unallocated/unpartitioned recommendation still holds, right? Am I correct in asserting that if one is following that, the FTL already has plenty of erase-blocks available for management and the discussion about filesystem level trim and free space management becomes much less urgent, tho of course it's still worth considering if it's convenient to do so?

And am I also correct in believing that while it's not really worth spending more to over-provision to the near 50% as I ended up doing, if things work out that way as they did with me because the difference in price between 30% overprovisioning and 50% overprovisioning ends up being trivial, there's really not much need to worry about active filesystem trim at all, because the FTL has effectively half the device left to play erase-block musical chairs with as it decides it needs to?

Of course the higher per-GiB cost of ssd as compared to spinning rust does mean that the above overprovisioning recommendation really does hurt, most of the time, driving per-usable-GB costs even higher, and as I recall that was definitely the case back then between 80 GiB and 160 GiB, and it was basically an accident of timing, that I was buying just as the manufacturers flooded the market with newly cost-effective 256 GB devices, that meant they were only trivially more expensive than the 128 or 160 GB, AND unlike the smaller devices, actually /available/ in the 500-ish MB/sec performance range that (for SATA-based SSDs) is actually capped by SATA-600 bus speeds more than the chips themselves. (There were lower cost 128 GB devices, but they were lower speed than I wanted, too.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
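The "weekly fstrim systemd timer setup" mentioned above is not shown in the mail; a minimal sketch of such a timer/service pair looks roughly like the following (illustrative only; the ExecStart path is an assumption, adjust to wherever fstrim is installed, and recent util-linux already ships fstrim.service/fstrim.timer units that can simply be enabled with "systemctl enable --now fstrim.timer"):

# fstrim.service (sketch)
[Unit]
Description=Discard unused blocks on all mounted filesystems

[Service]
Type=oneshot
ExecStart=/usr/sbin/fstrim --all --verbose

# fstrim.timer (sketch)
[Unit]
Description=Run fstrim weekly

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target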
Re: Creating btrfs RAID on LUKS devs makes devices disappear
12.05.2017 20:07, Chris Murphy wrote:
> On Thu, May 11, 2017 at 5:24 PM, Ochi wrote:
>> Hello,
>>
>> here is the journal.log (I hope). It's quite interesting. I rebooted the machine, performed a mkfs.btrfs on dm-{2,3,4} and dm-3 was missing afterwards (around timestamp 66.*). However, I then logged into the machine from another terminal (around timestamp 118.*) which triggered something to make the device appear again :O Indeed, dm-3 was once again there after logging in. Does systemd mix something up?
>

Yes :) Did you doubt? Please, try to reproduce it and provide both journalctl and "udevadm info --export-db" output. I have my theory what happens here.

> I don't see any Btrfs complaints. If dm-3 is the device Btrfs is expecting and it vanishes, then Btrfs would mention it on a read or write. So either nothing is happening and Btrfs isn't yet aware that dm-3 is gone, or it's looking at some other instance of that encrypted volume, maybe it's using /dev/mapper/storage1. You can find out with btrfs fi show.
>

/dev/mapper/xxx should be a link to /dev/dm-NN (although you are never sure with Linux). dm-NN is *the* device. /dev/mapper/storage1 cannot exist without /dev/dm-3, irrespectively of what btrfs shows.

> Whenever I use Btrfs on LUKS I invariably see fi show, show me one device using /dev/dm-0 notation and the other device is /dev/mapper/blah notation. I think this is just an artifact of all the weird behind the scenes shit with symlinks and such, and that is a systemd thing as far as I know.
>

Not quite. As implemented today, when a device appears (i.e. udev gets an ADD uevent) and it is detected as part of btrfs, a udev rule scans the device (with the equivalent of "btrfs device ready"). At the time the event is being processed, the only name that is available is the canonical kernel /dev/dm-0; all convenience symlinks are created later, after all rules have been processed. btrfs remembers the device name that was passed to it.

What makes it even more confusing - some btrfs utilities seem to resolve /dev/dm-0 to /dev/mapper/blah by themselves, and some not. I am not sure which ones.

Recently btrfs-progs got an extra rule that repeats "btrfs device ready", but now *after* the symlinks have been created (using a RUN directive with /dev/mapper/blah). It updates the kernel with the new name. So my guess is that for this device this rule is missing (probably the device gets created in the initrd and the rule is not added to it).

> Anyway, more information is needed. First see if the device is really missing per Btrfs (read or write something and also check with 'btrfs fi show'). You can add systemd.log_level=debug as a boot parameter to get more information in the journal, although it's rather verbose. You could combine it with rd.udev.debug but that gets really crazy verbose so I tend not to use them at the same time.
>

@ochi: Please, before running with debug repeat your test as you did and provide udevadm info --export-db. This may be enough. In my experience debug output, while being extremely verbose, contains very little additional useful information, but has a tendency of skewing relative timing thus changing behavior.

> The other possibility is there's a conflict with dracut which may be doing some things, and the debug switch for that is rd.debug and is likewise extremely verbose, producing huge logs, so I would start out with just the systemd debug and see if that reveals anything *assuming* Btrfs is complaining. If Btrfs doesn't complain about a device going away then I wouldn't worry about it.
>

This is not a btrfs issue in any case (except btrfs folks should really work together with systemd folks and finally come to a common implementation of multi-device filesystems).