[PATCH RFC] btrfs: Slightly speedup btrfs_read_block_groups
btrfs_read_block_groups() is the most time-consuming function if the whole fs is filled with small extents.

For a btrfs filled with 16K files and with 2T of space used, mounting the fs takes 10 to 12 seconds.

ftrace shows that btrfs_read_block_groups() takes about 9 seconds, while btrfs_read_chunk_tree() takes only 14ms. In theory, btrfs_read_chunk_tree() and btrfs_read_block_groups() should take about the same time, as chunks and block groups are mapped 1:1.

However, block group items are spread across the large extent tree, so it takes a lot of time to search the btree.

Furthermore, find_first_block_group(), which is used by btrfs_read_block_groups(), locates block group items in a very inefficient way: it searches once and then checks slot by slot.

In kernel space, checking slot by slot is somewhat time consuming, because in the next_leaf() case the kernel needs to do extra locking.

This patch removes the slot-by-slot checking. By the time we call btrfs_read_block_groups(), we have already read all chunks and saved them into map_tree. So we use map_tree to get the exact block group start and length, then do an exact btrfs_search_slot(), without the slot-by-slot check, to speed up the mount.

With this patch, the time spent in btrfs_read_block_groups() is reduced to 7.56s, compared to the old 8.94s.

Reported-by: Tsutomu Itoh
Signed-off-by: Qu Wenruo
---
The further fix would change the mount process from reading all block groups to reading block groups on demand.

But according to the btrfs_read_chunk_tree() call time, the real problem is the on-disk format and btree locking.

If block group items were arranged like chunks, in a dedicated tree, btrfs_read_block_groups() should take the same time as btrfs_read_chunk_tree().

Furthermore, if we can split the current huge extent tree into something like per-chunk extent trees, a lot of current code like delayed_refs could be removed, as extent tree operations would be much faster.
---
 fs/btrfs/extent-tree.c | 61 --
 fs/btrfs/extent_map.c  |  1 +
 fs/btrfs/extent_map.h  | 22 ++
 3 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8507484..9fa7728 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9520,39 +9520,20 @@ out:
 	return ret;
 }
 
-static int find_first_block_group(struct btrfs_root *root,
-		struct btrfs_path *path, struct btrfs_key *key)
+int find_block_group(struct btrfs_root *root,
+		     struct btrfs_path *path,
+		     struct extent_map *chunk_em)
 {
 	int ret = 0;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
-	int slot;
-
-	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
-	if (ret < 0)
-		goto out;
+	struct btrfs_key key;
 
-	while (1) {
-		slot = path->slots[0];
-		leaf = path->nodes[0];
-		if (slot >= btrfs_header_nritems(leaf)) {
-			ret = btrfs_next_leaf(root, path);
-			if (ret == 0)
-				continue;
-			if (ret < 0)
-				goto out;
-			break;
-		}
-		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+	key.objectid = chunk_em->start;
+	key.offset = chunk_em->len;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 
-		if (found_key.objectid >= key->objectid &&
-		    found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-			ret = 0;
-			goto out;
-		}
-		path->slots[0]++;
-	}
-out:
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret > 0)
+		ret = -ENOENT;
 	return ret;
 }
 
@@ -9771,16 +9752,14 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_fs_info *info = root->fs_info;
 	struct btrfs_space_info *space_info;
-	struct btrfs_key key;
+	struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+	struct extent_map *chunk_em;
 	struct btrfs_key found_key;
 	struct extent_buffer *leaf;
 	int need_clear = 0;
 	u64 cache_gen;
 
 	root = info->extent_root;
-	key.objectid = 0;
-	key.offset = 0;
-	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -9793,10 +9772,16 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 	if (btrfs_test_opt(root, CLEAR_CACHE))
 		need_clear = 1;
 
+	/* Here we don't lock the map tree, as we are the only reader */
+	chunk_em = first_extent_mapping(&map_tree->map_tree);
+	/* Not really possible */
+	if
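The difference between "search, then check slot by slot" and "search for the exact key" can be sketched in userspace. This is only a model under stated assumptions, not kernel code: a sorted array stands in for the extent tree, keys are ordered by (objectid, type, offset), and every `demo_` name and constant is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical item types; only their relative order matters here. */
#define DEMO_EXTENT_ITEM	168
#define DEMO_BLOCK_GROUP_ITEM	192

/* Hypothetical stand-in for a tree key, ordered by (objectid, type, offset). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Binary search: index of the first key >= target, or n if there is none. */
static size_t demo_lower_bound(const struct demo_key *keys, size_t n,
			       const struct demo_key *target)
{
	size_t lo = 0, hi = n;

	assert(keys != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (demo_key_cmp(&keys[mid], target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Old style: search with offset 0, then walk slots until a block group shows up. */
static long demo_find_first_bg(const struct demo_key *keys, size_t n,
			       uint64_t objectid, size_t *slots_checked)
{
	struct demo_key target = { objectid, DEMO_BLOCK_GROUP_ITEM, 0 };
	size_t i = demo_lower_bound(keys, n, &target);

	for (; i < n; i++) {
		(*slots_checked)++;
		if (keys[i].objectid >= objectid &&
		    keys[i].type == DEMO_BLOCK_GROUP_ITEM)
			return (long)i;
	}
	return -1;
}

/* New style: start and length are already known, so match the key exactly. */
static long demo_find_bg_exact(const struct demo_key *keys, size_t n,
			       uint64_t start, uint64_t len)
{
	struct demo_key target = { start, DEMO_BLOCK_GROUP_ITEM, len };
	size_t i = demo_lower_bound(keys, n, &target);

	if (i < n && demo_key_cmp(&keys[i], &target) == 0)
		return (long)i;
	return -1;	/* comparable to the patch mapping "not found" to -ENOENT */
}
```

In this model the exact lookup never inspects neighbouring slots, while the walk visits every extent item that sits between the search position and the next block group item.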
Re: [PATCH] btrfs-progs: Adjust timing of safety delay countdown
On Wed, May 04, 2016 at 03:43:26PM -0400, Noah Massey wrote: > When printing the countdown in the safety delay, the number should > correspond to the number of seconds remaining to wait at the time the > delay is printed. > > In other words, there should be a one second sleep after printing '1'. > > Signed-off-by: Noah Massey Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
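The timing rule described in the patch (the printed number equals the seconds still left, so a full second passes after '1') can be sketched as below. This is a minimal illustration, not the btrfs-progs code; `demo_countdown`, `demo_fake_sleep` and `demo_fake_elapsed` are hypothetical names, and the sleep function is injected so the sketch can run without actually waiting.

```c
#include <assert.h>
#include <stdio.h>

/* Test double: records the requested sleep time instead of sleeping. */
static unsigned int demo_fake_elapsed;

static void demo_fake_sleep(unsigned int seconds)
{
	demo_fake_elapsed += seconds;
}

/*
 * Print a countdown in which each number equals the seconds remaining at
 * the moment it is printed. Returns the number of one-second sleeps done.
 */
static int demo_countdown(int seconds, FILE *out,
			  void (*sleep_fn)(unsigned int))
{
	int slept = 0;

	assert(out != NULL && sleep_fn != NULL);
	for (int remaining = seconds; remaining > 0; remaining--) {
		fprintf(out, "%2d", remaining);
		fflush(out);
		sleep_fn(1);	/* sleep after printing, including after "1" */
		slept++;
	}
	fprintf(out, "\n");
	return slept;
}
```

A real caller would pass a small wrapper around sleep(3) instead of `demo_fake_sleep`.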
Re: [PATCH] btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes
On Thu, May 5, 2016 at 5:23 AM, Zygo Blaxell wrote:
> During a mount, we start the cleaner kthread first because the transaction
> kthread wants to wake up the cleaner kthread. We start the transaction
> kthread next because everything in btrfs wants transactions. We do reloc
> recovery in the thread that was doing the original mount call once the
> transaction kthread is running. This means that the cleaner kthread
> could already be running when reloc recovery happens (e.g. if a snapshot
> delete was started before a crash).
>
> Relocation does not play well with the cleaner kthread, so a mutex was
> added in commit 5f3164813b90f7dbcb5c3ab9006906222ce471b7 "Btrfs: fix
> race between balance recovery and root deletion" to prevent both from
> being active at the same time.
>
> If the cleaner kthread is already holding the mutex by the time we get
> to btrfs_recover_relocation, the mount will be blocked until at least
> one deleted subvolume is cleaned (possibly more if the mount process
> doesn't get the lock right away). During this time (which could be an
> arbitrarily long time on a large/slow filesystem), the mount process is
> stuck and the filesystem is unnecessarily inaccessible.
>
> Fix this by locking cleaner_mutex before we start cleaner_kthread, and
> unlocking the mutex after mount no longer requires it. This ensures
> that the mounting process will not be blocked by the cleaner kthread.
> The cleaner kthread is already prepared for mutex contention and will
> just go to sleep until the mutex is available.

You're missing your Signed-off-by: tag (git format-patch or git commit with -s adds it automatically). Once you have that, you can add my

Reviewed-by: Filipe Manana

> ---
>  fs/btrfs/disk-io.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index d8d68af..7c8f435 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2509,6 +2509,7 @@ int open_ctree(struct super_block *sb,
>  	int num_backups_tried = 0;
>  	int backup_index = 0;
>  	int max_active;
> +	bool cleaner_mutex_locked = false;
>
>  	tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info);
>  	chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info);
> @@ -2988,6 +2989,13 @@ retry_root_backup:
>  		goto fail_sysfs;
>  	}
>
> +	/*
> +	 * Hold the cleaner_mutex thread here so that we don't block
> +	 * for a long time on btrfs_recover_relocation. cleaner_kthread
> +	 * will wait for us to finish mounting the filesystem.
> +	 */
> +	mutex_lock(&fs_info->cleaner_mutex);
> +	cleaner_mutex_locked = true;
>  	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>  					       "btrfs-cleaner");
>  	if (IS_ERR(fs_info->cleaner_kthread))
> @@ -3046,10 +3054,8 @@ retry_root_backup:
>  		ret = btrfs_cleanup_fs_roots(fs_info);
>  		if (ret)
>  			goto fail_qgroup;
> -
> -		mutex_lock(&fs_info->cleaner_mutex);
> +		/* We locked cleaner_mutex before creating cleaner_kthread. */
>  		ret = btrfs_recover_relocation(tree_root);
> -		mutex_unlock(&fs_info->cleaner_mutex);
>  		if (ret < 0) {
>  			printk(KERN_WARNING
>  			       "BTRFS: failed to recover relocation\n");
> @@ -3057,6 +3063,8 @@ retry_root_backup:
>  			goto fail_qgroup;
>  		}
>  	}
> +	mutex_unlock(&fs_info->cleaner_mutex);
> +	cleaner_mutex_locked = false;
>
>  	location.objectid = BTRFS_FS_TREE_OBJECTID;
>  	location.type = BTRFS_ROOT_ITEM_KEY;
> @@ -3164,6 +3172,10 @@ fail_cleaner:
>  	filemap_write_and_wait(fs_info->btree_inode->i_mapping);
>
>  fail_sysfs:
> +	if (cleaner_mutex_locked) {
> +		mutex_unlock(&fs_info->cleaner_mutex);
> +		cleaner_mutex_locked = false;
> +	}
>  	btrfs_sysfs_remove_mounted(fs_info);
>
>  fail_fsdev_sysfs:
> --
> 2.1.4

--
Filipe David Manana,

"Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."
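The ordering argued for in the patch (take the mutex before the worker thread exists, do the recovery, then release) can be modeled in userspace with pthreads. This is a sketch under assumptions, not the kernel implementation: `demo_fs`, `demo_cleaner` and `demo_mount` are hypothetical names, and a sequence counter stands in for the real work so the order can be observed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct demo_fs {
	pthread_mutex_t cleaner_mutex;
	int next_event;		/* protected by cleaner_mutex */
	int recovery_order;	/* sequence number at which recovery ran */
	int cleaner_order;	/* sequence number at which the cleaner ran */
};

/* Worker: blocks on the mutex until the "mount" path releases it. */
static void *demo_cleaner(void *arg)
{
	struct demo_fs *fs = arg;

	pthread_mutex_lock(&fs->cleaner_mutex);
	fs->cleaner_order = ++fs->next_event;
	pthread_mutex_unlock(&fs->cleaner_mutex);
	return NULL;
}

static int demo_mount(struct demo_fs *fs)
{
	pthread_t cleaner;

	assert(fs != NULL);
	pthread_mutex_init(&fs->cleaner_mutex, NULL);
	fs->next_event = 0;
	fs->recovery_order = 0;
	fs->cleaner_order = 0;

	/* Lock BEFORE the worker is created, so it cannot win the race. */
	pthread_mutex_lock(&fs->cleaner_mutex);
	if (pthread_create(&cleaner, NULL, demo_cleaner, fs) != 0) {
		/* Error path: drop the mutex we are still holding. */
		pthread_mutex_unlock(&fs->cleaner_mutex);
		pthread_mutex_destroy(&fs->cleaner_mutex);
		return -1;
	}

	/* Stand-in for relocation recovery, done while holding the mutex. */
	fs->recovery_order = ++fs->next_event;

	pthread_mutex_unlock(&fs->cleaner_mutex);
	pthread_join(cleaner, NULL);
	pthread_mutex_destroy(&fs->cleaner_mutex);
	return 0;
}
```

Because the mutex is held before `pthread_create()`, the recovery step always gets sequence number 1 and the worker always gets 2, regardless of scheduling.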
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> I suggest using defaults for starters. The only thing in that list that
> needs to be there is either subvolid or subvol, not both. Add in the
> non-default options once you've proven the defaults are working, and
> add them one at a time.

Yes, I read your previous suggestion and I already dropped subvolid, but since the problem had already happened I left it in the mail for completeness. Anyway, the culprit here is genfstab, and that's probably what a beginner is going to use when installing a distro: https://wiki.archlinux.org/index.php/beginners'_guide#fstab

>> Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
>
> The firmware is old if I understand the naming scheme used by Dell. It
> says EXT49D0Q is current.
> http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH

According to this (http://forum.notebookreview.com/threads/2015-xps-13-ssd-fw-problem-with-m-2-samsung-pm851.770501/) the firmware you linked is for the mSATA version of the drive, not the M.2 one. EXT25D0Q seems to be the very latest one for my drive.

> I advise using all defaults for everything for now, otherwise it's
> anyone's guess what you're running into.

On Thursday 5 May 2016 06:12:28 CEST, Qu Wenruo wrote:
> Would it be OK for you to test your btrfs on a plain ssd, without
> encryption? And just as Chris Murphy said, reducing mount options is
> also a pretty good starting point for debugging.

Ok, I will remove dmcrypt, discard, compress=lzo, nodefrag and see what happens.

>> I made a copy of /dev/mapper/cryptroot with dd on an external drive
>> and I ran btrfs check on it (btrfs-progs 4.5.2):
>> https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
>
> Checked, but seems the output is truncated?

No, I didn't truncate the btrfs check output because it wasn't endless. I just truncated the repair output.

I also have something new to report.
Do you remember when I said that my screen was black and so I had to forcibly power off the system? Something similar happened today, and since in the meantime I had enabled the magic sysrq keys, I was able to recover this from the logs:

mag 05 11:55:51 arch-laptop kdeinit5[960]: Registering "org.kde.StatusNotifierItem-1060-1/StatusNotifierItem" to system tray
mag 05 11:55:51 arch-laptop obexd[1098]: OBEX daemon 5.39
mag 05 11:55:51 arch-laptop dbus-daemon[920]: Successfully activated service 'org.bluez.obex'
mag 05 11:55:51 arch-laptop systemd[898]: Started Bluetooth OBEX service.
mag 05 11:55:51 arch-laptop korgac[1044]: log_kidentitymanagement: IdentityManager: There was no default identity. Marking first one as default.
mag 05 11:55:51 arch-laptop kernel: BUG: unable to handle kernel paging request at 00017d11
mag 05 11:55:51 arch-laptop kernel: IP: [] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: PGD 0
mag 05 11:55:51 arch-laptop kernel: Oops: [#1] PREEMPT SMP
mag 05 11:55:51 arch-laptop kernel: Modules linked in: rfcomm(+) visor bnep uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152 crc16 mii joydev mousedev nvr
mag 05 11:55:51 arch-laptop kernel: mei_me syscopyarea sysfillrect snd sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan intel_hid sparse_keymap int3403_thermal video processor_thermal_device dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:51 arch-laptop kernel: lrw gf128mul glue_helper ablk_helper cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM TTY layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM socket layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM ver 1.11
mag 05 11:55:51 arch-laptop kernel: xhci_hcd
mag 05 11:55:51 arch-laptop kernel: i8042 serio sdhci_acpi sdhci led_class mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore usb_common
mag 05 11:55:51 arch-laptop kernel: CPU: 0 PID: 351 Comm: systemd-udevd Not tainted 4.5.1-1-ARCH #1
mag 05 11:55:51 arch-laptop kernel: Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:51 arch-laptop kernel: task: 88021347d580 ti: 880211f8c000 task.ti: 880211f8c000
mag 05 11:55:51 arch-laptop kernel: RIP: 0010:[] [] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: RSP: 0018:880211f8fd68 EFLAGS: 00010206
mag 05 11:55:51 arch-laptop kernel: RAX: 8800da2f4820 RBX: 8800bb59ce40 RCX: 8800da2f4830
mag 05 11:55:51 arch-laptop kernel: RDX: 8800da2f4828 RSI: 8800374404a0 RDI: 8800c58dfa40
mag 05 11:55:51 arch-laptop kernel: RBP: 880211f8fdb8 R08: 00017c79 R09: 0007f55e2059
mag 05 11:55:51 arch-laptop kernel: R10: 0007f55e2053 R11: 8800c58dfa40 R12: 880037440460
mag 05 11:55:51 arch-laptop kernel: R13:
Re: Spare volumes and hot auto-replacement feature
On 2016-05-04 19:18, Dmitry Katsubo wrote:
> Dear btrfs community,
>
> I am interested in the spare volumes and hot auto-replacement feature [1].
> I have a couple of questions:
>
> * In which kernel version will this feature be included?

Probably 4.7. I would not suggest using it in production for at least a few cycles though (probably 4.9).

> * The description says that replacement happens automatically whenever
>   a write or flush fails. Is it possible to control the ratio / number
>   of such failures? (e.g. in case it was a one-time accidental failure)

As far as I know, no, it just happens.

> * What happens if the spare device is smaller than the (failing) device
>   to be replaced?

I'm pretty sure that it doesn't get replaced.

> * What happens if the spare device fails (write error) during the
>   replacement?

I'm not certain about this one.

> * Is it possible for root to be notified when a drive replacement
>   (successful or unsuccessful) takes place? This question is also
>   relevant to me for write/flush failures on a btrfs volume in general
>   (btrfs monitor).

There isn't any built-in monitoring in BTRFS that I know of, though there are a couple of options for monitoring. The simplest and probably most reliable is to write a script to poll for changes in the error counts. You can also check the filesystem mount options (without the hot-spare functionality, if there's an error, the filesystem will (usually) get remounted read-only, and this works for most other filesystems too).
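The "poll for changes in the error counts" idea can be sketched as follows. This is an illustration under assumptions rather than a ready-made monitor: it assumes the caller has already captured text in which each line is a counter name followed by a number (the shape printed by `btrfs device stats`, e.g. `[/dev/sdx].write_io_errs 0`), and the `demo_` function names are hypothetical. How the text is obtained and how root is notified are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sum the counters in text where each line looks like "<name> <count>". */
static unsigned long long demo_sum_counters(const char *text)
{
	unsigned long long total = 0;
	const char *p = text;

	assert(text != NULL);
	while (*p) {
		char line[512];
		char name[256];
		unsigned long long count;
		size_t len = strcspn(p, "\n");

		/* Overlong lines are skipped rather than truncated. */
		if (len < sizeof(line)) {
			memcpy(line, p, len);
			line[len] = '\0';
			if (sscanf(line, "%255s %llu", name, &count) == 2)
				total += count;
		}
		p += len;
		if (*p == '\n')
			p++;
	}
	return total;
}

/* Returns 1 if the total grew since the previous poll; updates *last. */
static int demo_errors_increased(const char *text, unsigned long long *last)
{
	unsigned long long now = demo_sum_counters(text);
	int grew = now > *last;

	*last = now;
	return grew;
}
```

A periodic job could call `demo_errors_increased()` on each fresh snapshot and raise an alert whenever it returns 1.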
Re: Spare volumes and hot auto-replacement feature
Most of it (like policy tuning/configuring/notification) is through the sysfs interface. However, to implement this, we need the existing sysfs volume patches to be integrated.

We need to think about the implementation of a per-FSID spare, which I hope will solve the problem of incompatible spare disks.

As of now, if auto replace fails, the spare device is out of the kernel device list. If the user wants to give it a second try, they should run btrfs dev scan again, and the degraded volume will continue to look for the spare device.

Thanks for the feedback.
Anand

On 05/05/2016 07:18 AM, Dmitry Katsubo wrote:
> Dear btrfs community,
>
> I am interested in the spare volumes and hot auto-replacement feature [1].
> I have a couple of questions:
>
> * In which kernel version will this feature be included?
> * The description says that replacement happens automatically whenever
>   a write or flush fails. Is it possible to control the ratio / number
>   of such failures? (e.g. in case it was a one-time accidental failure)
> * What happens if the spare device is smaller than the (failing) device
>   to be replaced?
> * What happens if the spare device fails (write error) during the
>   replacement?
> * Is it possible for root to be notified when a drive replacement
>   (successful or unsuccessful) takes place? This question is also
>   relevant to me for write/flush failures on a btrfs volume in general
>   (btrfs monitor).
>
> Many thanks!
>
> [1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg48209.html
[PATCH 0/4] Improve compression workspaces memory management
Hi,

the compression workspaces are allocated as needed and this could fail if there's no free memory. Moreover, as we might be flushing data from restricted contexts, we should try our best not to fail.

This patchset preallocates one workspace for each compression type at module load time (and tries again later if that fails). If any further request for a new workspace fails, there's still that one to make progress. IOW, workspace allocation will not fail at writeback time.

I have tested this by instrumenting the code to limit the number of workspaces to one and did some stress tests.

David Sterba (4):
  btrfs: rename and document compression workspace members
  btrfs: preallocate compression workspaces
  btrfs: make find_workspace always succeed
  btrfs: make find_workspace warn if there are no workspaces

 fs/btrfs/compression.c | 85 --
 1 file changed, 61 insertions(+), 24 deletions(-)

--
2.7.1
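The forward-progress idea in the cover letter (keep one preallocated workspace, and wait for an idle one instead of failing when a new allocation is not possible) can be sketched in userspace. This is a simplified model, not the btrfs code: all `demo_` names are hypothetical, `fail_alloc` is a test knob that simulates an out-of-memory condition, allocation happens under the lock for brevity, and the per-CPU limit on the number of workspaces is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_MAX_IDLE 4

struct demo_pool {
	pthread_mutex_t lock;
	pthread_cond_t wait;		/* waiters for an idle workspace */
	void *idle[DEMO_MAX_IDLE];	/* idle workspaces */
	int nr_idle;
	int total;			/* workspaces currently allocated */
	int fail_alloc;			/* test knob: pretend memory is exhausted */
};

static void *demo_alloc(struct demo_pool *p)
{
	return p->fail_alloc ? NULL : malloc(64);
}

static void demo_pool_init(struct demo_pool *p)
{
	void *ws;

	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->wait, NULL);
	p->nr_idle = 0;
	p->total = 0;
	p->fail_alloc = 0;

	/* Preallocate one workspace; a failure here is not fatal. */
	ws = demo_alloc(p);
	if (ws) {
		p->idle[p->nr_idle++] = ws;
		p->total = 1;
	}
}

/* Never returns NULL: if allocation fails, wait for a workspace to come back. */
static void *demo_get(struct demo_pool *p)
{
	void *ws;

	pthread_mutex_lock(&p->lock);
	for (;;) {
		if (p->nr_idle > 0) {
			ws = p->idle[--p->nr_idle];
			break;
		}
		ws = demo_alloc(p);
		if (ws) {
			p->total++;
			break;
		}
		/* Blocks forever if total == 0, i.e. preallocation failed too. */
		pthread_cond_wait(&p->wait, &p->lock);
	}
	pthread_mutex_unlock(&p->lock);
	return ws;
}

static void demo_put(struct demo_pool *p, void *ws)
{
	assert(ws != NULL);
	pthread_mutex_lock(&p->lock);
	if (p->nr_idle < DEMO_MAX_IDLE) {
		p->idle[p->nr_idle++] = ws;
	} else {
		free(ws);
		p->total--;
	}
	pthread_cond_signal(&p->wait);
	pthread_mutex_unlock(&p->lock);
}
```

With `fail_alloc` set after initialization, `demo_get()` still succeeds as long as the preallocated workspace is idle or is eventually returned by another user.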
[PATCH 1/4] btrfs: rename and document compression workspace members
The names are confusing, pick more fitting names and add comments.

Signed-off-by: David Sterba
---
 fs/btrfs/compression.c | 35 +++
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ff61a41ac90b..4d5cd9624bb3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -743,8 +743,11 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 static struct {
 	struct list_head idle_ws;
 	spinlock_t ws_lock;
-	int num_ws;
-	atomic_t alloc_ws;
+	/* Number of free workspaces */
+	int free_ws;
+	/* Total number of allocated workspaces */
+	atomic_t total_ws;
+	/* Waiters for a free workspace */
 	wait_queue_head_t ws_wait;
 } btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
 
@@ -760,7 +763,7 @@ void __init btrfs_init_compress(void)
 	for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
 		INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws);
 		spin_lock_init(&btrfs_comp_ws[i].ws_lock);
-		atomic_set(&btrfs_comp_ws[i].alloc_ws, 0);
+		atomic_set(&btrfs_comp_ws[i].total_ws, 0);
 		init_waitqueue_head(&btrfs_comp_ws[i].ws_wait);
 	}
 }
@@ -777,35 +780,35 @@ static struct list_head *find_workspace(int type)
 
 	struct list_head *idle_ws	= &btrfs_comp_ws[idx].idle_ws;
 	spinlock_t *ws_lock		= &btrfs_comp_ws[idx].ws_lock;
-	atomic_t *alloc_ws		= &btrfs_comp_ws[idx].alloc_ws;
+	atomic_t *total_ws		= &btrfs_comp_ws[idx].total_ws;
 	wait_queue_head_t *ws_wait	= &btrfs_comp_ws[idx].ws_wait;
-	int *num_ws			= &btrfs_comp_ws[idx].num_ws;
+	int *free_ws			= &btrfs_comp_ws[idx].free_ws;
 again:
 	spin_lock(ws_lock);
 	if (!list_empty(idle_ws)) {
 		workspace = idle_ws->next;
 		list_del(workspace);
-		(*num_ws)--;
+		(*free_ws)--;
 		spin_unlock(ws_lock);
 		return workspace;
 
 	}
-	if (atomic_read(alloc_ws) > cpus) {
+	if (atomic_read(total_ws) > cpus) {
 		DEFINE_WAIT(wait);
 
 		spin_unlock(ws_lock);
 		prepare_to_wait(ws_wait, &wait, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(alloc_ws) > cpus && !*num_ws)
+		if (atomic_read(total_ws) > cpus && !*free_ws)
 			schedule();
 		finish_wait(ws_wait, &wait);
 		goto again;
 	}
-	atomic_inc(alloc_ws);
+	atomic_inc(total_ws);
 	spin_unlock(ws_lock);
 
 	workspace = btrfs_compress_op[idx]->alloc_workspace();
 	if (IS_ERR(workspace)) {
-		atomic_dec(alloc_ws);
+		atomic_dec(total_ws);
 		wake_up(ws_wait);
 	}
 	return workspace;
@@ -820,21 +823,21 @@ static void free_workspace(int type, struct list_head *workspace)
 	int idx = type - 1;
 	struct list_head *idle_ws	= &btrfs_comp_ws[idx].idle_ws;
 	spinlock_t *ws_lock		= &btrfs_comp_ws[idx].ws_lock;
-	atomic_t *alloc_ws		= &btrfs_comp_ws[idx].alloc_ws;
+	atomic_t *total_ws		= &btrfs_comp_ws[idx].total_ws;
 	wait_queue_head_t *ws_wait	= &btrfs_comp_ws[idx].ws_wait;
-	int *num_ws			= &btrfs_comp_ws[idx].num_ws;
+	int *free_ws			= &btrfs_comp_ws[idx].free_ws;
 
 	spin_lock(ws_lock);
-	if (*num_ws < num_online_cpus()) {
+	if (*free_ws < num_online_cpus()) {
 		list_add(workspace, idle_ws);
-		(*num_ws)++;
+		(*free_ws)++;
 		spin_unlock(ws_lock);
 		goto wake;
 	}
 	spin_unlock(ws_lock);
 
 	btrfs_compress_op[idx]->free_workspace(workspace);
-	atomic_dec(alloc_ws);
+	atomic_dec(total_ws);
 wake:
 	/*
 	 * Make sure counter is updated before we wake up waiters.
@@ -857,7 +860,7 @@ static void free_workspaces(void)
 			workspace = btrfs_comp_ws[i].idle_ws.next;
 			list_del(workspace);
 			btrfs_compress_op[i]->free_workspace(workspace);
-			atomic_dec(&btrfs_comp_ws[i].alloc_ws);
+			atomic_dec(&btrfs_comp_ws[i].total_ws);
 		}
 	}
 }
--
2.7.1
[PATCH 4/4] btrfs: make find_workspace warn if there are no workspaces
Be verbose if there are no workspaces at all, ie. the module init time preallocation failed.

Signed-off-by: David Sterba
---
 fs/btrfs/compression.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c70625560265..658c39b70fba 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -834,7 +834,21 @@ static struct list_head *find_workspace(int type)
 		 * workspace preallocated for each type and the compression
 		 * time is bounded so we get to a workspace eventually. This
 		 * makes our caller's life easier.
+		 *
+		 * To prevent silent and low-probability deadlocks (when the
+		 * initial preallocation fails), check if there are any
+		 * workspaces at all.
 		 */
+		if (atomic_read(total_ws) == 0) {
+			static DEFINE_RATELIMIT_STATE(_rs,
+					/* once per minute */ 60 * HZ,
+					/* no burst */ 1);
+
+			if (__ratelimit(&_rs)) {
+				printk(KERN_WARNING
+			    "no compression workspaces, low memory, retrying");
+			}
+		}
 		goto again;
 	}
 	return workspace;
--
2.7.1
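The effect of rate-limiting a warning inside a retry loop can be illustrated with a small model. This is deliberately simplified and is not the kernel's `__ratelimit()` implementation: it allows at most one message per interval with no burst handling, time is an abstract tick counter passed in by the caller, and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct demo_ratelimit {
	long interval;	/* minimum ticks between two allowed messages */
	long last;	/* tick at which the last message was allowed */
	int armed;	/* 0 until the first message has been allowed */
};

/* Returns 1 if the caller may print now, 0 if the message is suppressed. */
static int demo_ratelimit_ok(struct demo_ratelimit *rs, long now)
{
	assert(rs != NULL);
	if (rs->armed && now - rs->last < rs->interval)
		return 0;
	rs->armed = 1;
	rs->last = now;
	return 1;
}

/* Simulate one retry per tick and count how many warnings get through. */
static int demo_retry_warnings(long ticks, long interval)
{
	struct demo_ratelimit rs = { interval, 0, 0 };
	int printed = 0;

	for (long now = 0; now < ticks; now++) {
		if (demo_ratelimit_ok(&rs, now)) {
			fprintf(stderr, "no workspaces, retrying (tick %ld)\n", now);
			printed++;
		}
	}
	return printed;
}
```

Retrying once per tick for 180 ticks with an interval of 60 lets only the attempts at ticks 0, 60 and 120 print, which is the point of rate-limiting a message inside a tight `goto again` loop.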
[PATCH 3/4] btrfs: make find_workspace always succeed
With just one preallocated workspace we can guarantee forward progress even if there's no memory available for new workspaces. The cost is more waiting but we also get rid of several error paths.

On average, there will be several idle workspaces, so the waiting penalty won't be so bad.

In the worst case, all cpus will compete for one workspace until there's some memory. Attempts to allocate a new one are done each time the waiters are woken up.

Signed-off-by: David Sterba
---
 fs/btrfs/compression.c | 20
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 38c058bcf359..c70625560265 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -785,8 +785,10 @@ void __init btrfs_init_compress(void)
 }
 
 /*
- * this finds an available workspace or allocates a new one
- * ERR_PTR is returned if things go bad.
+ * This finds an available workspace or allocates a new one.
+ * If it's not possible to allocate a new one, waits until there's one.
+ * Preallocation makes a forward progress guarantees and we do not return
+ * errors.
  */
 static struct list_head *find_workspace(int type)
 {
@@ -826,6 +828,14 @@ static struct list_head *find_workspace(int type)
 	if (IS_ERR(workspace)) {
 		atomic_dec(total_ws);
 		wake_up(ws_wait);
+
+		/*
+		 * Do not return the error but go back to waiting. There's a
+		 * workspace preallocated for each type and the compression
+		 * time is bounded so we get to a workspace eventually. This
+		 * makes our caller's life easier.
+		 */
+		goto again;
 	}
 	return workspace;
 }
@@ -913,8 +923,6 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
 	int ret;
 
 	workspace = find_workspace(type);
-	if (IS_ERR(workspace))
-		return PTR_ERR(workspace);
 
 	ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
 						      start, len, pages,
@@ -949,8 +957,6 @@ static int btrfs_decompress_biovec(int type, struct page **pages_in,
 	int ret;
 
 	workspace = find_workspace(type);
-	if (IS_ERR(workspace))
-		return PTR_ERR(workspace);
 
 	ret = btrfs_compress_op[type-1]->decompress_biovec(workspace, pages_in,
 							 disk_start,
@@ -971,8 +977,6 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
 	int ret;
 
 	workspace = find_workspace(type);
-	if (IS_ERR(workspace))
-		return PTR_ERR(workspace);
 
 	ret = btrfs_compress_op[type-1]->decompress(workspace, data_in,
 						  dest_page, start_byte,
--
2.7.1
[PATCH 2/4] btrfs: preallocate compression workspaces
Preallocate one workspace for each compression type so we can guarantee forward progress in the worst case. A failure cannot be a hard error as we might not use compression at all on the filesystem. If we can't allocate the workspaces later when we need them, it might actually deadlock, but in such a situation the system effectively does not have enough memory to operate properly.

Signed-off-by: David Sterba
---
 fs/btrfs/compression.c | 16
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4d5cd9624bb3..38c058bcf359 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -761,10 +761,26 @@ void __init btrfs_init_compress(void)
 	int i;
 
 	for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
+		struct list_head *workspace;
+
 		INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws);
 		spin_lock_init(&btrfs_comp_ws[i].ws_lock);
 		atomic_set(&btrfs_comp_ws[i].total_ws, 0);
 		init_waitqueue_head(&btrfs_comp_ws[i].ws_wait);
+
+		/*
+		 * Preallocate one workspace for each compression type so
+		 * we can guarantee forward progress in the worst case
+		 */
+		workspace = btrfs_compress_op[i]->alloc_workspace();
+		if (IS_ERR(workspace)) {
+			printk(KERN_WARNING
+	"BTRFS: cannot preallocate compression workspace, will try later");
+		} else {
+			atomic_set(&btrfs_comp_ws[i].total_ws, 1);
+			btrfs_comp_ws[i].free_ws = 1;
+			list_add(workspace, &btrfs_comp_ws[i].idle_ws);
+		}
 	}
 }
 
--
2.7.1
[PATCH 3/3] Btrfs: pin logs earlier when doing a rename exchange operation
From: Filipe Manana

The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(), which had a race fixed by my previous patch titled "Btrfs: pin log earlier when renaming", and so it suffers from the same problem.

We pin the logs of the affected roots after we insert the new inode references, leaving a time window where concurrent tasks logging the inodes can end up logging both the new and old references, resulting in log trees that when replayed can turn the metadata into inconsistent states. This behaviour was added to btrfs_rename() in 2009 without any explanation of why the logs were not pinned earlier, just leaving a comment about the possibility of the race. As of today it's perfectly safe and sane to pin the logs before we start doing any of the steps involved in the rename operation.

Signed-off-by: Filipe Manana
---
 fs/btrfs/inode.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 503d749..dab6c08f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9458,6 +9458,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		/* force full log commit if subvolume involved. */
 		btrfs_set_log_full_commit(root->fs_info, trans);
 	} else {
+		btrfs_pin_log_trans(root);
+		root_log_pinned = true;
 		ret = btrfs_insert_inode_ref(trans, dest,
 					     new_dentry->d_name.name,
 					     new_dentry->d_name.len,
@@ -9465,8 +9467,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 					     btrfs_ino(new_dir), old_idx);
 		if (ret)
 			goto out_fail;
-		btrfs_pin_log_trans(root);
-		root_log_pinned = true;
 	}
 
 	/* And now for the dest. */
@@ -9474,6 +9474,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		/* force full log commit if subvolume involved. */
 		btrfs_set_log_full_commit(dest->fs_info, trans);
 	} else {
+		btrfs_pin_log_trans(dest);
+		dest_log_pinned = true;
 		ret = btrfs_insert_inode_ref(trans, root,
 					     old_dentry->d_name.name,
 					     old_dentry->d_name.len,
@@ -9481,8 +9483,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 					     btrfs_ino(old_dir), new_idx);
 		if (ret)
 			goto out_fail;
-		btrfs_pin_log_trans(dest);
-		dest_log_pinned = true;
 	}
 
 	/* Update inode version and ctime/mtime. */
--
2.7.0.rc3
[PATCH 2/3] Btrfs: unpin logs if rename exchange operation fails
From: Filipe Manana

If rename exchange operations fail at some point after we pinned any of
the logs, we end up aborting the current transaction but never unpin the
logs, which leaves concurrent tasks that are trying to sync the logs (as
part of an fsync request from user space) blocked forever and prevents
the filesystem from being unmounted.

Fix this by safely unpinning the log.

Signed-off-by: Filipe Manana
---
 fs/btrfs/inode.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ab64721..503d749 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9412,6 +9412,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 	u64 new_idx = 0;
 	u64 root_objectid;
 	int ret;
+	bool root_log_pinned = false;
+	bool dest_log_pinned = false;
 
 	/* we only allow rename subvolume link between subvolumes */
 	if (old_ino != BTRFS_FIRST_FREE_OBJECTID && root != dest)
@@ -9464,6 +9466,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		if (ret)
 			goto out_fail;
 		btrfs_pin_log_trans(root);
+		root_log_pinned = true;
 	}
 
 	/* And now for the dest. */
@@ -9479,6 +9482,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 		if (ret)
 			goto out_fail;
 		btrfs_pin_log_trans(dest);
+		dest_log_pinned = true;
 	}
 
 	/* Update inode version and ctime/mtime. */
@@ -9557,17 +9561,47 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 	if (new_inode->i_nlink == 1)
 		BTRFS_I(new_inode)->dir_index = new_idx;
 
-	if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
+	if (root_log_pinned) {
 		parent = new_dentry->d_parent;
 		btrfs_log_new_name(trans, old_inode, old_dir, parent);
 		btrfs_end_log_trans(root);
+		root_log_pinned = false;
 	}
-	if (new_ino != BTRFS_FIRST_FREE_OBJECTID) {
+	if (dest_log_pinned) {
 		parent = old_dentry->d_parent;
 		btrfs_log_new_name(trans, new_inode, new_dir, parent);
 		btrfs_end_log_trans(dest);
+		dest_log_pinned = false;
 	}
 out_fail:
+	/*
+	 * If we have pinned a log and an error happened, we unpin tasks
+	 * trying to sync the log and force them to fallback to a transaction
+	 * commit if the log currently contains any of the inodes involved in
+	 * this rename operation (to ensure we do not persist a log with an
+	 * inconsistent state for any of these inodes or leading to any
+	 * inconsistencies when replayed). If the transaction was aborted, the
+	 * abortion reason is propagated to userspace when attempting to commit
+	 * the transaction. If the log does not contain any of these inodes, we
+	 * allow the tasks to sync it.
+	 */
+	if (ret && (root_log_pinned || dest_log_pinned)) {
+		if (btrfs_inode_in_log(old_dir, root->fs_info->generation) ||
+		    btrfs_inode_in_log(new_dir, root->fs_info->generation) ||
+		    btrfs_inode_in_log(old_inode, root->fs_info->generation) ||
+		    (new_inode &&
+		     btrfs_inode_in_log(new_inode, root->fs_info->generation)))
+			btrfs_set_log_full_commit(root->fs_info, trans);
+
+		if (root_log_pinned) {
+			btrfs_end_log_trans(root);
+			root_log_pinned = false;
+		}
+		if (dest_log_pinned) {
+			btrfs_end_log_trans(dest);
+			dest_log_pinned = false;
+		}
+	}
 	ret = btrfs_end_transaction(trans, root);
 out_notrans:
 	if (new_ino == BTRFS_FIRST_FREE_OBJECTID)
-- 
2.7.0.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
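The flag-tracking idea in the patch above can be illustrated outside the kernel. The following is only a minimal userspace sketch, not btrfs code: struct demo_log, demo_pin(), demo_unpin() and demo_exchange() are hypothetical stand-ins, and the "log" is reduced to a pin counter. It shows why recording each pin in a boolean lets the shared exit label release exactly what is still held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a log; only the pin count is modelled. */
struct demo_log {
	int pins;
};

static void demo_pin(struct demo_log *log)
{
	log->pins++;
}

static void demo_unpin(struct demo_log *log)
{
	assert(log->pins > 0);	/* unpinning an unpinned log is a bug */
	log->pins--;
}

/*
 * Pins both logs, then runs a step that fails when fail_step is true.
 * Returns 0 on success and -1 on failure. No pin is leaked either way.
 */
static int demo_exchange(struct demo_log *root, struct demo_log *dest,
			 bool fail_step)
{
	bool root_pinned = false;
	bool dest_pinned = false;
	int ret = 0;

	demo_pin(root);
	root_pinned = true;
	demo_pin(dest);
	dest_pinned = true;

	if (fail_step) {
		ret = -1;
		goto out_fail;
	}

	/* Success path: release each pin and clear its flag. */
	demo_unpin(root);
	root_pinned = false;
	demo_unpin(dest);
	dest_pinned = false;

out_fail:
	/* Error path: release only what is still pinned. */
	if (ret && root_pinned)
		demo_unpin(root);
	if (ret && dest_pinned)
		demo_unpin(dest);
	return ret;
}
```

Calling demo_exchange() with fail_step set leaves both pin counts at zero, which is the property the real patch restores for the log writers.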
[PATCH 1/3] Btrfs: fix inode leak on failure to setup whiteout inode in rename
From: Filipe Manana

If we failed to fully setup the whiteout inode during a rename operation
with the whiteout flag, we ended up leaking the inode, not decrementing
its link count nor removing all its items from the fs/subvol tree.

Signed-off-by: Filipe Manana
---
 fs/btrfs/inode.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 09947cb..ab64721 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9612,21 +9612,21 @@ static int btrfs_whiteout_for_rename(struct btrfs_trans_handle *trans,
 	ret = btrfs_init_inode_security(trans, inode, dir,
 				&dentry->d_name);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = btrfs_add_nondir(trans, dir, dentry,
 				inode, 0, index);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = btrfs_update_inode(trans, root, inode);
-	if (ret)
-		return ret;
-
+out:
 	unlock_new_inode(inode);
+	if (ret)
+		inode_dec_link_count(inode);
 	iput(inode);
 
-	return 0;
+	return ret;
 }
 
 static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
-- 
2.7.0.rc3
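The single-exit shape this patch introduces can be sketched in plain userspace C. Everything below is hypothetical (struct demo_inode, demo_step(), demo_setup_whiteout()); it only models a lock flag, a link count and a reference count, and is not the btrfs implementation. The point is that every failure jumps to one label, so unlock, link-count drop and reference drop cannot be skipped.

```c
#include <assert.h>

/* Hypothetical stand-in for an inode. */
struct demo_inode {
	int nlink;	/* link count */
	int refs;	/* reference count */
	int locked;	/* 1 while the "new inode" lock is held */
};

/* Pretend setup step: fails only when step == fail_at. */
static int demo_step(int step, int fail_at)
{
	return step == fail_at ? -1 : 0;
}

/*
 * Runs three setup steps. fail_at selects the failing step (1..3),
 * 0 means all succeed. All exits pass through "out".
 */
static int demo_setup_whiteout(struct demo_inode *inode, int fail_at)
{
	int ret;

	assert(inode->locked);

	ret = demo_step(1, fail_at);
	if (ret)
		goto out;

	ret = demo_step(2, fail_at);
	if (ret)
		goto out;

	ret = demo_step(3, fail_at);
out:
	inode->locked = 0;	/* always unlock */
	if (ret)
		inode->nlink--;	/* failed setup: drop the link */
	inode->refs--;		/* always drop our reference */
	return ret;
}
```

With the early "return ret" form that the patch removes, a failure in step 1 or 2 would leave the inode locked and referenced, which is the leak described in the commit message.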
Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair
On Thu, May 05, 2016 at 12:36:52PM +0200, Niccolò Belli wrote:
> On Thursday 5 May 2016 03:07:37 CEST, Chris Murphy wrote:
> > I suggest using defaults for starters. The only thing in that list
> > that needs to be there is either subvolid or subvol, not both. Add in
> > the non-default options once you've proven the defaults are working,
> > and add them one at a time.
>
> Yes I read your previous suggestion and I already dropped subvolid, but
> since the problem already happened I left it in the mail for completeness.
> Anyway the culprit here is genfstab and that's probably what a beginner is
> going to use when installing a distro:
> https://wiki.archlinux.org/index.php/beginners'_guide#fstab
>

The redundant subvolid doesn't hurt, the kernel will just check that it
matches the passed subvol (see [1]). genfstab probably just pulls the
options out of /proc/mounts or /proc/self/mountinfo, and since we show
both, that's how it gets in fstab. If it was actually a problem, there
would be a clear message in dmesg.

1: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb289b7be62db84b9630ce00367444c810cada2c

-- 
Omar
[RFC 0/3] getfsmapx ioctl
Hi,

Building on the discussion "Exposing Extent Information to Userspace" at
LSF, this patchset offers the userspace definition, implementation, and
manpages for a new FS_IOC_GETFSMAPX ioctl that enables userspace to query
the filesystem for a map of every extent in a given range of physical
block keyspace. Note that prior to the existence of block sharing, I'd
have said "given range of physical blocks", but now that we can return
multiple owner:offset pairs for a given block, the block keyspace now has
to include extra fields to uniquely identify a reverse mapping record.

This ioctl behaves in a similar manner to XFS_IOC_GETBMAPX -- pass in an
array of struct getfsmapx with key and other control values in the first
two array elements, and the kernel passes back extent information in the
other array elements. The particulars of how to do this are documented
in the manpage that goes along with this set (it applies against
man-pages.git) and example code in the other patches is against
xfsprogs.git#for-next.

Basically, set the lowest key for which you want records in the first
array element; the highest key in the second; and the kernel spits out
records in the rest of the elements. That's similar to how GETBMAPX does
it, but different from FIEMAP. I added a dummy 64-bit "device id" per
Josef's request, though I'm thinking that could be cut down to a simple
dev_t. I also wonder if the kernel should rewrite the low key with the
last element returned so as to seed the next call, but userspace can do
that too.

The kernel-space implementation (for XFS) is buried inside the xfs
reverse mapping patchset which is treading water at github[1]. I prefer
not to patchbomb the whole kernel series until I've put the mess through
better testing, but this should be enough to get the mailing list
discussion started.

Questions? Comments? Bike sheds?
--D

[1] https://github.com/djwong/linux/tree/djwong-experimental
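The calling convention described in the cover letter (low key in element 0, high key in element 1, records from element 2 onward) can be sketched as follows. This is a hedged illustration only: struct getfsmapx is copied from patch 2/3 with int64_t/int32_t swapped in for __s64/__s32 so that it builds without kernel headers, the helpers fsmap_init_query() and fsmap_seed_next() are hypothetical, and no ioctl is issued. Whether a re-seeded query returns the last record again is not specified in this thread, so a real caller would have to handle that.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy of the proposed structure, with fixed-width types substituted. */
struct getfsmapx {
	int64_t fmv_device;	/* device id */
	int64_t fmv_block;	/* starting block */
	int64_t fmv_owner;	/* owner id */
	int64_t fmv_offset;	/* file offset of segment */
	int64_t fmv_length;	/* length of segment, blocks */
	int32_t fmv_oflags;	/* mapping flags */
	int32_t fmv_iflags;	/* control flags (1st structure) */
	int32_t fmv_count;	/* # of entries in array incl. input */
	int32_t fmv_entries;	/* # of entries filled in (output) */
	int64_t fmv_unused1;	/* future use, must be zero */
};

#define FMV_OF_LAST	0x20	/* value taken from patch 2/3 */

/* Copy only the key fields (device, block, owner, offset). */
static void fsmap_copy_key(struct getfsmapx *dst, const struct getfsmapx *src)
{
	dst->fmv_device = src->fmv_device;
	dst->fmv_block = src->fmv_block;
	dst->fmv_owner = src->fmv_owner;
	dst->fmv_offset = src->fmv_offset;
}

/* Hypothetical helper: zero the array and fill the two header elements. */
static void fsmap_init_query(struct getfsmapx *map, int32_t nr_elems,
			     const struct getfsmapx *low,
			     const struct getfsmapx *high)
{
	assert(nr_elems >= 2);
	memset(map, 0, sizeof(*map) * (size_t)nr_elems);
	fsmap_copy_key(&map[0], low);
	fsmap_copy_key(&map[1], high);
	map[0].fmv_count = nr_elems;	/* count includes both headers */
}

/*
 * Hypothetical helper: after a call has filled map[2..], rewrite the low
 * key from the last returned record. Returns 1 if another call makes
 * sense, 0 if nothing was returned or the last record has FMV_OF_LAST.
 */
static int fsmap_seed_next(struct getfsmapx *map)
{
	const struct getfsmapx *last;

	if (map[0].fmv_entries <= 0)
		return 0;
	last = &map[2 + map[0].fmv_entries - 1];
	if (last->fmv_oflags & FMV_OF_LAST)
		return 0;
	fsmap_copy_key(&map[0], last);
	return 1;
}
```

A caller would loop: fsmap_init_query(), issue the ioctl on the array, consume map[2] through map[1 + fmv_entries], then continue while fsmap_seed_next() returns 1.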
[PATCH 1/3] document the XFS_IOC_GETFSMAPX ioctl
Document the new XFS_IOC_GETFSMAPX that returns the physical layout of a
(disk-based) filesystem. (Yes, the leading 'X' needs to fall off...)

Signed-off-by: Darrick J. Wong
---
 man2/ioctl_getfsmapx.2 | 253
 1 file changed, 253 insertions(+)
 create mode 100644 man2/ioctl_getfsmapx.2

diff --git a/man2/ioctl_getfsmapx.2 b/man2/ioctl_getfsmapx.2
new file mode 100644
index 000..b79a8e5
--- /dev/null
+++ b/man2/ioctl_getfsmapx.2
@@ -0,0 +1,253 @@
+.\" Copyright (C) 2016 Oracle. All rights reserved.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" This program is free software; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation.
+.\"
+.\" This program is distributed in the hope that it would be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write the Free Software Foundation,
+.\" Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+.\" %%%LICENSE_END
+.TH IOCTL-XFS_IOC_GETFSMAPX 2 2016-05-05 "Linux" "Linux Programmer's Manual"
+.SH NAME
+ioctl_getfsmapx \- retrieve the physical layout of the filesystem
+.SH SYNOPSIS
+.br
+.B #include
+.br
+.B #include
+.sp
+.BI "int ioctl(int " fd ", XFS_IOC_GETFSMAPX, struct getfsmapx * " arg );
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+retrieves physical extent mappings for a filesystem. This information can
+be used to discover which files are mapped to a physical block, examine
+free space, or find known bad blocks, among other things.
+
+The sole argument to this ioctl should be an array of the following
+structure:
+.in +4n
+.nf
+
+struct getfsmapx {
+	__s64	fmv_device;	/* device id */
+	__s64	fmv_block;	/* starting block */
+	__s64	fmv_owner;	/* owner id */
+	__s64	fmv_offset;	/* file offset of segment */
+	__s64	fmv_length;	/* length of segment, blocks */
+	__s32	fmv_oflags;	/* mapping flags */
+	__s32	fmv_iflags;	/* control flags (1st structure) */
+	__s32	fmv_count;	/* # of entries in array incl. input */
+	__s32	fmv_entries;	/* # of entries filled in (output). */
+	__s64	fmv_unused1;	/* future use, must be zero */
+};
+
+.fi
+.in
+The array must contain at least two elements. The first two array
+elements specify the lowest and highest reverse-mapping keys, respectively,
+for which userspace would like physical mapping information. A reverse
+mapping key consists of the tuple (device, block, owner, offset). The
+owner and offset fields are part of the key because some filesystems
+support sharing physical blocks between multiple files and therefore may
+return multiple mappings for a given physical block.
+
+.SS Fields of struct getfsmapx
+.PP
+The
+.I fmv_device
+field contains a 64-bit cookie to uniquely identify the underlying storage
+device if the filesystem supports multiple devices. If not, the field
+should be
+.BR FMV_DEV_DEFAULT "."
+
+.PP
+The
+.I fmv_block
+field contains the 512-byte sector address of the extent.
+
+.PP
+The
+.I fmv_owner
+field contains the owner of the extent. This is generally an inode
+number, though if
+.B FMV_OF_SPECIAL_OWNER
+is set in the
+.I fmv_oflags
+field, then the owner value is one of the following special values:
+.TP
+.B FMV_OWN_FREE
+Free space.
+.TP
+.B FMV_OWN_UNKNOWN
+This extent has an unknown owner.
+.TP
+.B FMV_OWN_FS
+Static filesystem metadata.
+.TP
+.B FMV_OWN_LOG
+The filesystem journal.
+.TP
+.B FMV_OWN_AG
+Allocation group metadata.
+.TP
+.B FMV_OWN_INOBT
+The inode index, if one is provided.
+.TP
+.B FMV_OWN_INODES
+Inodes.
+.TP
+.B FMV_OWN_REFC
+Reference counting indexes.
+.TP
+.B FMV_OWN_COW
+This extent is being used to stage a copy-on-write.
+.TP
+.B FMV_OWN_DEFECTIVE:
+This extent has been marked defective either by the filesystem or the
+underlying device.
+
+.PP
+The
+.I fmv_offset
+field contains the logical address of the reverse mapping record, in units
+of 512-byte blocks. This field has no meaning if the
+.BR FMV_OF_SPECIAL_OWNER " or " FMV_OF_EXTENT_MAP
+flags are set in
+.IR fmv_oflags "."
+
+.PP
+The
+.I fmv_length
+field contains the length of the extent, in units of 512-byte blocks.
+This field must be zero in the second array element.
+
+.PP
+The
+.I fmv_oflags
+field is a bitmask of extent state flags. The bits are:
+.TP
+.B FMV_OF_PREALLOC
+The extent is allocated but not yet written.
+.TP
+.B FMV_OF_ATTR_FORK
+This extent contains extended attribute data.
+.TP
+.B FMV_OF_EXTENT_MAP
+This extent contains extent map information for the owner.
+.TP
+.B FMV_OF_SHARED
+Parts
[PATCH 2/3] xfs: introduce the XFS_IOC_GETFSMAPX ioctl
Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem. This is the
xfsprogs side of things for userspace support.

Signed-off-by: Darrick J. Wong
---
 libxfs/xfs_fs.h | 65 +
 3 files changed, 106 insertions(+), 14 deletions(-)

diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index d5ed090..6573fcc 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -117,6 +117,70 @@ struct getbmapx {
 #define BMV_OF_SHARED		0x8	/* segment shared with another file */
 
 /*
+ * Structure for XFS_IOC_GETFSMAPX.
+ *
+ * Similar to XFS_IOC_GETBMAPX, the first two elements in the array are
+ * used to constrain the output. The first element in the array should
+ * represent the lowest disk address that the user wants to learn about.
+ * The second element in the array should represent the highest disk
+ * address to query. Subsequent array elements will be filled out by the
+ * command.
+ *
+ * The fmv_iflags field is only used in the first structure. The
+ * fmv_oflags field is filled in for each returned structure after the
+ * second structure. The fmv_unused1 fields in the first two array
+ * elements must be zero.
+ *
+ * The fmv_count, fmv_entries, and fmv_iflags fields in the second array
+ * element must be zero.
+ *
+ * fmv_block, fmv_offset, and fmv_length are expressed in units of 512
+ * byte sectors.
+ */
+#ifndef HAVE_GETFSMAPX
+struct getfsmapx {
+	__s64		fmv_device;	/* device id */
+	__s64		fmv_block;	/* starting block */
+	__s64		fmv_owner;	/* owner id */
+	__s64		fmv_offset;	/* file offset of segment */
+	__s64		fmv_length;	/* length of segment, blocks */
+	__s32		fmv_oflags;	/* mapping flags */
+	__s32		fmv_iflags;	/* control flags (1st structure) */
+	__s32		fmv_count;	/* # of entries in array incl. input */
+	__s32		fmv_entries;	/* # of entries filled in (output). */
+	__s64		fmv_unused1;	/* future use, must be zero */
+};
+#endif
+
+/* fmv_device values - set by XFS_IOC_GETFSMAPX caller. */
+/* use this value if the filesystem doesn't support multiple devices. */
+#define FMV_DEV_DEFAULT		0
+
+/* fmv_flags values - set by XFS_IOC_GETFSMAPX caller. */
+/* no flags defined yet */
+#define FMV_IF_VALID		0
+
+/* fmv_flags values - returned for each non-header segment */
+#define FMV_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
+#define FMV_OF_ATTR_FORK	0x2	/* segment = attribute fork */
+#define FMV_OF_EXTENT_MAP	0x4	/* segment = extent map */
+#define FMV_OF_SHARED		0x8	/* segment = shared with another file */
+#define FMV_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
+#define FMV_OF_LAST		0x20	/* segment is the last in the FS */
+
+/* fmv_owner special values */
+#define FMV_OWN_FREE		(-1ULL)	/* free space */
+#define FMV_OWN_UNKNOWN		(-2ULL)	/* unknown owner */
+#define FMV_OWN_FS		(-3ULL)	/* static fs metadata */
+#define FMV_OWN_LOG		(-4ULL)	/* journalling log */
+#define FMV_OWN_AG		(-5ULL)	/* per-AG metadata */
+#define FMV_OWN_INOBT		(-6ULL)	/* inode btree blocks */
+#define FMV_OWN_INODES		(-7ULL)	/* inodes */
+#define FMV_OWN_REFC		(-8ULL)	/* refcount tree */
+#define FMV_OWN_COW		(-9ULL)	/* cow allocations */
+#define FMV_OWN_DEFECTIVE	(-10ULL) /* bad blocks */
+
+/*
  * Structure for XFS_IOC_FSSETDM.
  * For use by backup and restore programs to set the XFS on-disk inode
  * fields di_dmevmask and di_dmstate. These must be set to exactly and
@@ -523,6 +587,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_GETBMAPX	_IOWR('X', 56, struct getbmap)
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
+#define XFS_IOC_GETFSMAPX	_IOWR('X', 59, struct getfsmapx)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
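As a small usage illustration of the fmv_oflags bits defined in the header above, the sketch below renders a flag word as text. The flag values are copied from the patch; the helper fsmap_flags_str() and the short names it prints are hypothetical and are not part of xfsprogs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the header in this patch. */
#define FMV_OF_PREALLOC		0x1
#define FMV_OF_ATTR_FORK	0x2
#define FMV_OF_EXTENT_MAP	0x4
#define FMV_OF_SHARED		0x8
#define FMV_OF_SPECIAL_OWNER	0x10
#define FMV_OF_LAST		0x20

/*
 * Hypothetical helper: write a comma-separated list of the flags set in
 * oflags into buf (at most len bytes including the NUL) and return buf.
 */
static const char *fsmap_flags_str(int32_t oflags, char *buf, size_t len)
{
	static const struct {
		int32_t bit;
		const char *name;
	} names[] = {
		{ FMV_OF_PREALLOC,	"prealloc" },
		{ FMV_OF_ATTR_FORK,	"attr_fork" },
		{ FMV_OF_EXTENT_MAP,	"extent_map" },
		{ FMV_OF_SHARED,	"shared" },
		{ FMV_OF_SPECIAL_OWNER,	"special_owner" },
		{ FMV_OF_LAST,		"last" },
	};
	size_t i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(oflags & names[i].bit))
			continue;
		/* strncat appends at most n bytes plus the NUL */
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, names[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

A record with FMV_OF_SPECIAL_OWNER set would then be interpreted through the FMV_OWN_* values rather than as an inode number, as the header comments describe.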
[PATCH 3/3] xfs_io: support the new getfsmap ioctl
Add a new command, 'fsmap', to xfs_io so that we can query the filesystem
extent map on a live filesystem.

Signed-off-by: Darrick J. Wong
---
 io/Makefile       |    2 
 io/fsmap.c        |  485 +
 io/init.c         |    1 
 io/io.h           |    1 
 man/man8/xfs_io.8 |   47 +
 5 files changed, 535 insertions(+), 1 deletion(-)
 create mode 100644 io/fsmap.c

diff --git a/io/Makefile b/io/Makefile
index 0b53f41..6439e1d 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -11,7 +11,7 @@ HFILES = init.h io.h
 CFILES = init.c \
 	attr.c bmap.c file.c freeze.c fsync.c getrusage.c imap.c link.c \
 	mmap.c open.c parent.c pread.c prealloc.c pwrite.c seek.c shutdown.c \
-	sync.c truncate.c reflink.c
+	sync.c truncate.c reflink.c fsmap.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE)
diff --git a/io/fsmap.c b/io/fsmap.c
new file mode 100644
index 000..bf72555
--- /dev/null
+++ b/io/fsmap.c
@@ -0,0 +1,485 @@
+/*
+ * Copyright (c) 2016 Oracle.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "platform_defs.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+#include "input.h"
+
+static cmdinfo_t fsmap_cmd;
+
+static void
+fsmap_help(void)
+{
+	printf(_(
+"\n"
+" prints the block mapping for an XFS filesystem"
+"\n"
+" Example:\n"
+" 'fsmap -vp' - tabular format verbose map, including unwritten extents\n"
+"\n"
+" fsmap prints the map of disk blocks used by the whole filesystem.\n"
+" The map lists each extent used by the file, as well as regions in the\n"
+" filesystem that do not have any corresponding blocks (free space).\n"
+" By default, each line of the listing takes the following form:\n"
+" extent: [startoffset..endoffset] owner startblock..endblock\n"
+" All the file offsets and disk blocks are in units of 512-byte blocks.\n"
+" -n -- query n extents.\n"
+" -v -- Verbose information, specify ag info. Show flags legend on 2nd -v\n"
+"\n"));
+}
+
+static int
+numlen(
+	off64_t	val)
+{
+	off64_t	tmp;
+	int	len;
+
+	for (len = 0, tmp = val; tmp > 0; tmp = tmp/10)
+		len++;
+	return (len == 0 ? 1 : len);
+}
+
+static const char *
+special_owner(
+	__int64_t	owner)
+{
+	switch (owner) {
+	case FMV_OWN_FREE:
+		return _("free space");
+	case FMV_OWN_UNKNOWN:
+		return _("unknown");
+	case FMV_OWN_FS:
+		return _("static fs metadata");
+	case FMV_OWN_LOG:
+		return _("journalling log");
+	case FMV_OWN_AG:
+		return _("per-AG metadata");
+	case FMV_OWN_INOBT:
+		return _("inode btree");
+	case FMV_OWN_INODES:
+		return _("inodes");
+	case FMV_OWN_REFC:
+		return _("refcount btree");
+	case FMV_OWN_COW:
+		return _("cow reservation");
+	case FMV_OWN_DEFECTIVE:
+		return _("defective");
+	default:
+		return _("unknown");
+	}
+}
+
+static void
+dump_map(
+	unsigned long long	nr,
+	struct getfsmapx	*map)
+{
+	unsigned long long	i;
+	struct getfsmapx	*p;
+
+	for (i = 0, p = map + 2; i < map->fmv_entries; i++, p++) {
+		printf("\t%llu: [%lld..%lld]: ", i + nr,
+			(long long) p->fmv_block,
+			(long long)(p->fmv_block + p->fmv_length - 1));
+		if (p->fmv_oflags & FMV_OF_SPECIAL_OWNER)
+			printf("%s", special_owner(p->fmv_owner));
+		else if (p->fmv_oflags & FMV_OF_EXTENT_MAP)
+			printf(_("inode %lld extent map"),
+				(long long) p->fmv_owner);
+		else
+			printf(_("inode %lld %lld..%lld"),
+				(long long) p->fmv_owner,
+				(long long) p->fmv_offset,
+				(long long)(p->fmv_offset + p->fmv_length - 1));
+		printf(_(" %lld blocks\n"),
+			(long long)p->fmv_length);
+	}
+}
+
+/*
+ * Verbose mode displays:
+ * extent: [startblock..endblock]: startoffset..endoffset \
+ * ag# (agoffset..agendoffset) totalbbs flags
+ */
+#define MINR
Re: [PATCH 0/2] scope GFP_NOFS api
On Wed, May 04 2016, Dave Chinner wrote:

> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wakes waiters as it frees the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the mean time, having a slightly more fine grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.

You have painted two pictures here. The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than a
small step away.

The second is a band-aid which would take us in exactly the wrong
direction. It makes an interface which people apparently find hard to
use (or easy to misuse) - the setting of __GFP_FS - and makes it more
complex. Certainly it would be more powerful, but I think it would also
be more misused.

So I ask myself: can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?

A core design principle for me is to enable filesystems to take control
of their own destiny. They should have the information available to
make the decisions they need to make, and the opportunity to carry them
out.

All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the file system can decide what to do, with one
exception: evict_inode. So my first proposal would be to rectify that.

 - redefine .nr_cached_objects and .free_cached_objects so that, if they
   are defined, they are responsible for s_dentry_lru and s_inode_lru.
   e.g. super_cache_count *either* calls ->nr_cached_objects *or* makes
   two calls to list_lru_shrink_count. This would require exporting
   prune_dcache_sb and prune_icache_sb but otherwise should be a fairly
   straightforward change.
   If nr_cached_objects were defined, super_cache_scan would no longer
   abort without __GFP_FS - that test would be left to the filesystem.

 - Now any filesystem that wants to can stash its super_block pointer
   in current->journal_info while doing memory allocations, and abort
   any reclaim attempts (release_page, shrinker, nr_cached_objects) if
   and only if current->journal_info == "my superblock".
   This can be done without the core mm code knowing any more than it
   already does.

 - A more sophisticated filesystem might import much of the code for
   prune_icache_sb() - either by copy/paste or by exporting some vfs
   internals - and then store an inode pointer in current->journal_info
   and only abort reclaim which touches that inode.

 - if a filesystem happens to know that it will never block in any of
   these reclaim calls, it can always allow prune_dcache_sb to run, and
   never needs to use GFP_NOFS. I think NFS might be close to being
   able to do this as it flushes everything on last-close. But that is
   something that NFS developers can care about (or not) quite
   independently from mm people.

 - Maybe some fs developer will try to enable free_cached_objects to do
   as much work as possible for every inode, but never deadlock. It
   could do its own fs-specific deadlock detection, or could queue work
   to a work queue and wait a limited time for it. Or something.
   If some filesystem developer comes up with something that works
   really well, developers of other filesystems might copy it - or not
   as they choose.

Maybe ->journal_info isn't perfect for this. It is currently only safe
for reclaim code to compare it against a known value. It is not safe to
dereference it to see if it points to a known value. That could
possibly be cleaned up, or another task_struct field could be provided
for filesystems to track their state. Or do you find a task_struct
field unacceptable, and is there some reason that an explicitly passed
cookie is superior?

My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow. Rather the mm/vfs should get out of the filesystems' way as
much as possible and let them innovate independently.

Thanks for your time,
NeilBrown
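The second proposal above (stash the superblock pointer while allocating, and refuse reclaim only when the stashed pointer matches) can be modelled in a few lines of userspace C. This is a hedged sketch, not kernel code: struct demo_sb, demo_alloc_begin(), demo_alloc_end() and demo_may_reclaim() are hypothetical, and a thread-local variable stands in for current->journal_info. The reclaim-side check compares the pointer and never dereferences it, matching the constraint noted in the mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a filesystem's super_block. */
struct demo_sb {
	int id;
};

/* Models current->journal_info: one opaque pointer per task (C11). */
static _Thread_local void *demo_journal_info;

/* Filesystem side: mark "this task is allocating on behalf of sb". */
static void demo_alloc_begin(struct demo_sb *sb)
{
	assert(demo_journal_info == NULL);	/* sketch does not nest */
	demo_journal_info = sb;
}

static void demo_alloc_end(void)
{
	demo_journal_info = NULL;
}

/*
 * Reclaim side: a filesystem's shrinker callback would refuse to run
 * only if the current task is already inside that same filesystem.
 * The pointer is compared, never dereferenced.
 */
static bool demo_may_reclaim(const struct demo_sb *sb)
{
	return demo_journal_info != (const void *)sb;
}
```

In this model a task allocating for filesystem A can still reclaim from filesystem B, which is the finer granularity the proposal is after. A real implementation would also need to save and restore any previous journal_info value, which this sketch leaves out.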