date:20160505

[PATCH RFC] btrfs: Slightly speedup btrfs_read_block_groups

2016-05-05 Thread Qu Wenruo

The btrfs_read_block_groups() function is the most time-consuming part
of mount when the whole fs is filled with small extents.

For a btrfs filled with 16K files, with 2T of space used, mounting the
fs takes 10 to 12 seconds.

ftrace shows that btrfs_read_block_groups() takes about 9 seconds,
while btrfs_read_chunk_tree() takes only 14ms.
In theory, btrfs_read_chunk_tree() and btrfs_read_block_groups() should
take the same time, as chunks and block groups are mapped 1:1.

However, block group items are spread across the large extent tree, so
searching the btree for them takes a lot of time.

Furthermore, find_first_block_group(), which btrfs_read_block_groups()
uses, locates block group items inefficiently: it searches once and
then checks slot by slot.

In kernel space, checking slot by slot is somewhat time-consuming, since
in the next_leaf() case the kernel needs to do extra locking.

This patch removes the slot-by-slot checking. By the time we call
btrfs_read_block_groups(), we have already read all chunks and saved
them into map_tree.

So we use map_tree to get the exact block group start and length, then
do an exact btrfs_search_slot(), without the slot-by-slot check, to
speed up the mount.

With this patch, the time spent in btrfs_read_block_groups() is reduced
to 7.56s, compared to the old 8.94s.

Reported-by: Tsutomu Itoh 
Signed-off-by: Qu Wenruo 

---
A further fix would change the mount process from reading all block
groups up front to reading block groups on demand.

But judging by the time btrfs_read_chunk_tree() takes, the real
problem is the on-disk format and btree locking.

If block group items were arranged like chunks, in a dedicated tree,
btrfs_read_block_groups() should take the same time as
btrfs_read_chunk_tree().

Furthermore, if we could split the current huge extent tree into
something like per-chunk extent trees, a lot of current code like
delayed_refs could be removed, as extent tree operations would be much
faster.
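
The difference between the old slot-by-slot walk and the exact-key
lookup described in the commit message can be sketched in userspace.
This is only an illustration under stated assumptions: demo_key, the
DEMO_* type values and both helper functions are hypothetical stand-ins,
and a sorted array plus bsearch() stands in for the extent tree and
btrfs_search_slot().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for a btrfs key, ordered by (objectid, type, offset). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Hypothetical type values, used only by this sketch. */
#define DEMO_EXTENT_ITEM      168
#define DEMO_BLOCK_GROUP_ITEM 192

/* Three-way comparison in key order, usable with bsearch(). */
static int demo_key_cmp(const void *a, const void *b)
{
	const struct demo_key *k1 = a, *k2 = b;

	if (k1->objectid != k2->objectid)
		return k1->objectid < k2->objectid ? -1 : 1;
	if (k1->type != k2->type)
		return k1->type < k2->type ? -1 : 1;
	if (k1->offset != k2->offset)
		return k1->offset < k2->offset ? -1 : 1;
	return 0;
}

/*
 * Old approach: start at slot `from` and check one slot at a time until a
 * block group item with objectid >= start appears. Returns its index or -1;
 * *steps counts the slots that had to be visited.
 */
static long scan_for_block_group(const struct demo_key *keys, size_t n,
				 size_t from, uint64_t start, size_t *steps)
{
	size_t i;

	*steps = 0;
	for (i = from; i < n; i++) {
		(*steps)++;
		if (keys[i].objectid >= start &&
		    keys[i].type == DEMO_BLOCK_GROUP_ITEM)
			return (long)i;
	}
	return -1;
}

/*
 * New approach: the chunk mapping already provides start and length, so the
 * complete key is known and a single exact search finds the item.
 */
static long find_block_group_exact(const struct demo_key *keys, size_t n,
				   uint64_t start, uint64_t len)
{
	struct demo_key key = {
		.objectid = start,
		.type = DEMO_BLOCK_GROUP_ITEM,
		.offset = len,
	};
	const struct demo_key *hit;

	hit = bsearch(&key, keys, n, sizeof(*keys), demo_key_cmp);
	return hit ? (long)(hit - keys) : -1;
}
```

With many small extent items between two block group items, the scan
visits every one of them, while the exact search cost depends only on
the size of the sorted set. An exact search that finds nothing is an
error here, which mirrors the -ENOENT return in the patch.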
---
 fs/btrfs/extent-tree.c | 61 --
 fs/btrfs/extent_map.c  |  1 +
 fs/btrfs/extent_map.h  | 22 ++
 3 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8507484..9fa7728 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9520,39 +9520,20 @@ out:
return ret;
 }
 
-static int find_first_block_group(struct btrfs_root *root,
-   struct btrfs_path *path, struct btrfs_key *key)
+int find_block_group(struct btrfs_root *root,
+  struct btrfs_path *path,
+  struct extent_map *chunk_em)
 {
int ret = 0;
-   struct btrfs_key found_key;
-   struct extent_buffer *leaf;
-   int slot;
-
-   ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
-   if (ret < 0)
-   goto out;
+   struct btrfs_key key;
 
-   while (1) {
-   slot = path->slots[0];
-   leaf = path->nodes[0];
-   if (slot >= btrfs_header_nritems(leaf)) {
-   ret = btrfs_next_leaf(root, path);
-   if (ret == 0)
-   continue;
-   if (ret < 0)
-   goto out;
-   break;
-   }
-   btrfs_item_key_to_cpu(leaf, &found_key, slot);
+   key.objectid = chunk_em->start;
+   key.offset = chunk_em->len;
+   key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 
-   if (found_key.objectid >= key->objectid &&
-   found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-   ret = 0;
-   goto out;
-   }
-   path->slots[0]++;
-   }
-out:
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret > 0)
+   ret = -ENOENT;
return ret;
 }
 
@@ -9771,16 +9752,14 @@ int btrfs_read_block_groups(struct btrfs_root *root)
struct btrfs_block_group_cache *cache;
struct btrfs_fs_info *info = root->fs_info;
struct btrfs_space_info *space_info;
-   struct btrfs_key key;
+   struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+   struct extent_map *chunk_em;
struct btrfs_key found_key;
struct extent_buffer *leaf;
int need_clear = 0;
u64 cache_gen;
 
root = info->extent_root;
-   key.objectid = 0;
-   key.offset = 0;
-   key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -9793,10 +9772,16 @@ int btrfs_read_block_groups(struct btrfs_root *root)
if (btrfs_test_opt(root, CLEAR_CACHE))
need_clear = 1;
 
+   /* Here we don't lock the map tree, as we are the only reader */
+   chunk_em = first_extent_mapping(&map_tree->map_tree);
+   /* Not really possible */
+   if

Re: [PATCH] btrfs-progs: Adjust timing of safety delay countdown

2016-05-05 Thread David Sterba

On Wed, May 04, 2016 at 03:43:26PM -0400, Noah Massey wrote:
> When printing the countdown in the safety delay, the number should
> correspond to the number of seconds remaining to wait at the time the
> delay is printed.
> 
> In other words, there should be a one second sleep after printing '1'.
> 
> Signed-off-by: Noah Massey 

Applied, thanks.
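
For readers who want to see the timing rule from the quoted description
in isolation, here is a minimal sketch. It is not the btrfs-progs code:
countdown(), fake_sleep() and the output format are assumptions made for
illustration. The point is only the ordering: print the number, then
sleep, so '1' is followed by one more second of waiting.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for sleep(1) so the sketch runs instantly; counts the calls. */
static int fake_sleeps;
static void fake_sleep(void)
{
	fake_sleeps++;
}

/*
 * Write the countdown for `delay` seconds into `out`. Each number is
 * emitted first and the one-second sleep happens afterwards, so the number
 * shown is the time still left to wait. Returns the number of sleeps.
 */
static int countdown(int delay, char *out, size_t outlen,
		     void (*sleep_one_second)(void))
{
	size_t used = 0;
	int sleeps = 0;

	if (outlen == 0)
		return 0;
	out[0] = '\0';

	while (delay > 0) {
		int n = snprintf(out + used, outlen - used, "%d ", delay);

		if (n < 0 || (size_t)n >= outlen - used)
			break;	/* output buffer is full */
		used += (size_t)n;
		sleep_one_second();	/* sleep after printing, also after '1' */
		sleeps++;
		delay--;
	}
	return sleeps;
}
```

A real tool would pass a small wrapper around sleep(1) instead of
fake_sleep() and print to the terminal rather than into a buffer.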
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes

2016-05-05 Thread Filipe Manana

On Thu, May 5, 2016 at 5:23 AM, Zygo Blaxell wrote:
> During a mount, we start the cleaner kthread first because the transaction
> kthread wants to wake up the cleaner kthread.  We start the transaction
> kthread next because everything in btrfs wants transactions.  We do reloc
> recovery in the thread that was doing the original mount call once the
> transaction kthread is running.  This means that the cleaner kthread
> could already be running when reloc recovery happens (e.g. if a snapshot
> delete was started before a crash).
>
> Relocation does not play well with the cleaner kthread, so a mutex was
> added in commit 5f3164813b90f7dbcb5c3ab9006906222ce471b7 "Btrfs: fix
> race between balance recovery and root deletion" to prevent both from
> being active at the same time.
>
> If the cleaner kthread is already holding the mutex by the time we get
> to btrfs_recover_relocation, the mount will be blocked until at least
> one deleted subvolume is cleaned (possibly more if the mount process
> doesn't get the lock right away).  During this time (which could be an
> arbitrarily long time on a large/slow filesystem), the mount process is
> stuck and the filesystem is unnecessarily inaccessible.
>
> Fix this by locking cleaner_mutex before we start cleaner_kthread, and
> unlocking the mutex after mount no longer requires it.  This ensures
> that the mounting process will not be blocked by the cleaner kthread.
> The cleaner kthread is already prepared for mutex contention and will
> just go to sleep until the mutex is available.

You're missing your Signed-off-by:  tag (git format-patch or git commit
with -s adds it automatically).
Once you get that, you can add my Reviewed-by: Filipe Manana 
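
The locking order described in the quoted commit message (take the mutex
first, start the worker second, release once mount no longer needs it)
can be sketched in userspace with pthreads. This is only an analogy, not
the kernel code: the names are hypothetical, and the worker here simply
blocks on the mutex, whereas the quoted text says the real cleaner
kthread goes to sleep when it cannot get the mutex.

```c
#include <pthread.h>

static pthread_mutex_t cleaner_mutex = PTHREAD_MUTEX_INITIALIZER;
static int recovery_done;		/* protected by cleaner_mutex */
static int cleaner_saw_recovery;	/* what the worker observed */

/* Worker: cannot do anything until the starter releases the mutex. */
static void *cleaner_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cleaner_mutex);
	cleaner_saw_recovery = recovery_done;
	pthread_mutex_unlock(&cleaner_mutex);
	return NULL;
}

/*
 * Starter: the mutex is taken before the worker exists, so the worker can
 * never hold it while the recovery step is pending.
 * Returns 1 if the worker ran only after recovery, 0 otherwise, -1 on error.
 */
static int mount_sketch(void)
{
	pthread_t tid;

	pthread_mutex_lock(&cleaner_mutex);
	if (pthread_create(&tid, NULL, cleaner_thread, NULL) != 0) {
		pthread_mutex_unlock(&cleaner_mutex);
		return -1;
	}

	recovery_done = 1;	/* stands in for the relocation recovery */
	pthread_mutex_unlock(&cleaner_mutex);

	pthread_join(tid, NULL);
	return cleaner_saw_recovery;
}
```

Because the starter already owns the mutex when the worker is created,
the starter never waits for the worker, which is the property the patch
is after.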

> ---
>  fs/btrfs/disk-io.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index d8d68af..7c8f435 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2509,6 +2509,7 @@ int open_ctree(struct super_block *sb,
> int num_backups_tried = 0;
> int backup_index = 0;
> int max_active;
> +   bool cleaner_mutex_locked = false;
>
> tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info);
> chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info);
> @@ -2988,6 +2989,13 @@ retry_root_backup:
> goto fail_sysfs;
> }
>
> +   /*
> +* Hold the cleaner_mutex thread here so that we don't block
> +* for a long time on btrfs_recover_relocation.  cleaner_kthread
> +* will wait for us to finish mounting the filesystem.
> +*/
> +   mutex_lock(&fs_info->cleaner_mutex);
> +   cleaner_mutex_locked = true;
> fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>"btrfs-cleaner");
> if (IS_ERR(fs_info->cleaner_kthread))
> @@ -3046,10 +3054,8 @@ retry_root_backup:
> ret = btrfs_cleanup_fs_roots(fs_info);
> if (ret)
> goto fail_qgroup;
> -
> -   mutex_lock(&fs_info->cleaner_mutex);
> +   /* We locked cleaner_mutex before creating cleaner_kthread. */
> ret = btrfs_recover_relocation(tree_root);
> -   mutex_unlock(&fs_info->cleaner_mutex);
> if (ret < 0) {
> printk(KERN_WARNING
>"BTRFS: failed to recover relocation\n");
> @@ -3057,6 +3063,8 @@ retry_root_backup:
> goto fail_qgroup;
> }
> }
> +   mutex_unlock(&fs_info->cleaner_mutex);
> +   cleaner_mutex_locked = false;
>
> location.objectid = BTRFS_FS_TREE_OBJECTID;
> location.type = BTRFS_ROOT_ITEM_KEY;
> @@ -3164,6 +3172,10 @@ fail_cleaner:
> filemap_write_and_wait(fs_info->btree_inode->i_mapping);
>
>  fail_sysfs:
> +   if (cleaner_mutex_locked) {
> +   mutex_unlock(&fs_info->cleaner_mutex);
> +   cleaner_mutex_locked = false;
> +   }
> btrfs_sysfs_remove_mounted(fs_info);
>
>  fail_fsdev_sysfs:
> --
> 2.1.4
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-05 Thread Niccolò Belli


On Thursday, 5 May 2016 03:07:37 CEST, Chris Murphy wrote:

> I suggest using defaults for starters. The only thing in that list
> that needs to be there is either subvolid or subvol, not both. Add in
> the non-default options once you've proven the defaults are working,
> and add them one at a time.


Yes I read your previous suggestion and I already dropped subvolid, but 
since the problem already happened I left it in the mail for completeness.
Anyway the culprit here is genfstab and that's probably what a beginner is 
going to use when installing a distro: 
https://wiki.archlinux.org/index.php/beginners'_guide#fstab



> > Disk is a SAMSUNG SSD PM851 M.2 2280 256GB (Firmware Version: EXT25D0Q).
>
> The firmware is old if I understand the naming scheme used by Dell. It
> says EXT49D0Q is current.
>
> http://www.dell.com/support/home/al/en/aldhs1/Drivers/DriversDetails?driverId=0NXHH


According to this 
(http://forum.notebookreview.com/threads/2015-xps-13-ssd-fw-problem-with-m-2-samsung-pm851.770501/) 
the firmware you linked is for the mSATA version of the drive, not the M.2 
one. EXT25D0Q seems to be the very latest one for my drive.



> I advise using all defaults for everything for
> now, otherwise it's anyone's guess what you're running into.


On Thursday, 5 May 2016 06:12:28 CEST, Qu Wenruo wrote:
> Would it be OK for you to test your btrfs on a plain ssd,
> without encryption?
> And just as Chris Murphy said, reducing mount options is also a
> pretty good starting point for debugging.


Ok, I will remove dmcrypt, discard, compress=lzo, nodefrag and see what 
happens.



> > I made a copy of /dev/mapper/cryptroot with dd on an external drive and
> > I run btrfs check on it (btrfs-progs 4.5.2):
> > https://drive.google.com/open?id=0Bwe9Wtc-5xF1SjJacXpMMU5mems (37MB)
>
> Checked, but seems the output is truncated?


No, I didn't truncate the btrfs check output because it wasn't endless. I 
just truncated the repair output.


I also have something new to report. Do you remember when I said that my
screen was black and so I had to forcibly power off the system? Something
similar happened today, and since I had enabled the magic SysRq keys in the
meantime, I was able to recover this from the logs:


mag 05 11:55:51 arch-laptop kdeinit5[960]: Registering 
"org.kde.StatusNotifierItem-1060-1/StatusNotifierItem" to system tray

mag 05 11:55:51 arch-laptop obexd[1098]: OBEX daemon 5.39
mag 05 11:55:51 arch-laptop dbus-daemon[920]: Successfully activated 
service 'org.bluez.obex'

mag 05 11:55:51 arch-laptop systemd[898]: Started Bluetooth OBEX service.
mag 05 11:55:51 arch-laptop korgac[1044]: log_kidentitymanagement: 
IdentityManager: There was no default identity. Marking first one as 
default.
mag 05 11:55:51 arch-laptop kernel: BUG: unable to handle kernel paging 
request at 00017d11
mag 05 11:55:51 arch-laptop kernel: IP: [] 
anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: PGD 0 
mag 05 11:55:51 arch-laptop kernel: Oops:  [#1] PREEMPT SMP 
mag 05 11:55:51 arch-laptop kernel: Modules linked in: rfcomm(+) visor bnep 
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core 
videodev media btusb btrtl btbcm btintel cdc_ether bluetooth usbnet r8152 
crc16 mii joydev mousedev nvr
mag 05 11:55:51 arch-laptop kernel:  mei_me syscopyarea sysfillrect snd 
sysimgblt fb_sys_fops i2c_algo_bit shpchp soundcore mei wmi thermal fan 
intel_hid sparse_keymap int3403_thermal video processor_thermal_device 
dw_dmac snd_soc_sst_acpi snd_soc_sst_m
mag 05 11:55:51 arch-laptop kernel:  lrw gf128mul glue_helper ablk_helper 
cryptd ahci libahci libata scsi_mod xhci_pci rtsx_pci

mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM TTY layer initialized
mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM socket layer 
initialized

mag 05 11:55:51 arch-laptop kernel: Bluetooth: RFCOMM ver 1.11
mag 05 11:55:51 arch-laptop kernel:  xhci_hcd
mag 05 11:55:51 arch-laptop kernel:  i8042 serio sdhci_acpi sdhci led_class 
mmc_core pl2303 mos7720 usbserial parport hid_generic usbhid hid usbcore 
usb_common
mag 05 11:55:51 arch-laptop kernel: CPU: 0 PID: 351 Comm: systemd-udevd Not 
tainted 4.5.1-1-ARCH #1
mag 05 11:55:51 arch-laptop kernel: Hardware name: Dell Inc. XPS 13 
9343/0F5KF3, BIOS A07 11/11/2015
mag 05 11:55:51 arch-laptop kernel: task: 88021347d580 ti: 
880211f8c000 task.ti: 880211f8c000
mag 05 11:55:51 arch-laptop kernel: RIP: 0010:[]  
[] anon_vma_interval_tree_insert+0x3f/0x90
mag 05 11:55:51 arch-laptop kernel: RSP: 0018:880211f8fd68  EFLAGS: 
00010206
mag 05 11:55:51 arch-laptop kernel: RAX: 8800da2f4820 RBX: 
8800bb59ce40 RCX: 8800da2f4830
mag 05 11:55:51 arch-laptop kernel: RDX: 8800da2f4828 RSI: 
8800374404a0 RDI: 8800c58dfa40
mag 05 11:55:51 arch-laptop kernel: RBP: 880211f8fdb8 R08: 
00017c79 R09: 0007f55e2059
mag 05 11:55:51 arch-laptop kernel: R10: 0007f55e2053 R11: 
8800c58dfa40 R12: 880037440460
mag 05 11:55:51 arch-laptop kernel: R13:

Re: Spare volumes and hot auto-replacement feature

2016-05-05 Thread Austin S. Hemmelgarn


On 2016-05-04 19:18, Dmitry Katsubo wrote:

> Dear btrfs community,
>
> I am interested in spare volumes and hot auto-replacement feature [1]. I have a
> couple of questions:
>
> * Which kernel version will this feature be included in?

Probably 4.7.  I would not suggest using it in production for at least a
few cycles though (probably 4.9).

> * The description says that replacement happens automatically when there is any
> write failed or flush failed. Is it possible to control the ratio / number of
> such failures? (e.g. in case it was one-time accidental failure)

As far as I know, no, it just happens.

> * What happens if spare device is smaller than the (failing) device to be
> replaced?

I'm pretty sure that it doesn't get replaced.

> * What happens if during the replacement the spare device fails (write error)?

I'm not certain about this one.

> * Is it possible for root to be notified in case if drive replacement
> (successful or unsuccessful) took place? Actually this question is relevant for
> me for overall write/flush failures on btrfs volume (btrfs monitor).

There isn't any built-in monitoring in BTRFS that I know of, but there are
a couple of options.  The simplest and probably most reliable is to write
a script that polls for changes in the error counts.
You can also check the filesystem mount options: without the hot-spare
functionality, an error will usually get the filesystem remounted
read-only, and this works for most other filesystems too.
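
The read-only check mentioned above can be done from a small polling
program. The following is a minimal sketch, assuming a POSIX system; the
function name is made up for this example, and it only covers the
mount-flag check, not the error counters.

```c
#include <sys/statvfs.h>

/*
 * Returns 1 if the filesystem containing `path` is currently mounted
 * read-only, 0 if it is writable, and -1 if statvfs() fails (for example
 * because the path does not exist).
 */
static int is_mounted_readonly(const char *path)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return -1;
	return (sv.f_flag & ST_RDONLY) ? 1 : 0;
}
```

A monitoring job could call this periodically on the mount point and
raise an alert when the result changes from 0 to 1.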


Re: Spare volumes and hot auto-replacement feature

2016-05-05 Thread Anand Jain




Most of it (like policy tuning/configuring/notification) is through
the sysfs interface. However, to implement this, we need the existing
sysfs volume patches to be integrated.
We also need to think about the implementation of a per-FSID spare,
which I hope will solve the problem of an incompatible spare disk.
As of now, if auto replace fails, the spare device is dropped from the
kernel device list. If the user wants to give it a 2nd try, they should
run btrfs dev scan again, and the degraded vol will continue to look
for the spare device.

Thanks for the feedback.

Anand


On 05/05/2016 07:18 AM, Dmitry Katsubo wrote:

> Dear btrfs community,
>
> I am interested in spare volumes and hot auto-replacement feature [1]. I have a
> couple of questions:
>
> * Which kernel version will this feature be included in?
> * The description says that replacement happens automatically when there is any
> write failed or flush failed. Is it possible to control the ratio / number of
> such failures? (e.g. in case it was one-time accidental failure)
> * What happens if spare device is smaller than the (failing) device to be
> replaced?
> * What happens if during the replacement the spare device fails (write error)?
> * Is it possible for root to be notified in case if drive replacement
> (successful or unsuccessful) took place? Actually this question is relevant for
> me for overall write/flush failures on btrfs volume (btrfs monitor).
>
> Many thanks!
>
> [1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg48209.html



[PATCH 0/4] Improve compression workspaces memory management

2016-05-05 Thread David Sterba

Hi,

the compression workspaces are allocated as needed, and this could fail if
there's no free memory. Moreover, as we might be flushing data from
restricted contexts, we should try our best not to fail.

This patchset preallocates one workspace for each compression type at module
load time (and tries to get one later if that fails). If any further request
for a new workspace fails, there's still that one to make progress. IOW workspace
allocation will not fail at writeback time.
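
The forward-progress argument can be illustrated with a single-threaded
userspace sketch. All names here (ws_pool, pool_get and so on) are
hypothetical, locking and the wait queue are left out, and a NULL return
stands for "nothing available right now, wait and retry" rather than a
hard error.

```c
#include <stdlib.h>

#define POOL_MAX 8

struct ws_pool {
	void *idle[POOL_MAX];	/* idle workspaces */
	int free_ws;		/* number of idle workspaces */
	int total_ws;		/* number of allocated workspaces */
	void *(*alloc)(void);	/* may return NULL when memory is short */
};

/* Two allocators for the demonstration: one works, one always fails. */
static void *alloc_ok(void)   { return malloc(64); }
static void *alloc_fail(void) { return NULL; }

/* Preallocate one workspace at init time; failure here is not fatal. */
static void pool_init(struct ws_pool *p, void *(*alloc)(void))
{
	void *ws;

	p->free_ws = 0;
	p->total_ws = 0;
	p->alloc = alloc;

	ws = p->alloc();
	if (ws) {
		p->idle[p->free_ws++] = ws;
		p->total_ws = 1;
	}
}

/* Prefer an idle workspace, then try to allocate, otherwise ask to wait. */
static void *pool_get(struct ws_pool *p)
{
	void *ws;

	if (p->free_ws > 0)
		return p->idle[--p->free_ws];
	if (p->total_ws >= POOL_MAX)
		return NULL;

	ws = p->alloc();
	if (ws)
		p->total_ws++;
	return ws;
}

/* Return a workspace obtained from pool_get() to the idle list. */
static void pool_put(struct ws_pool *p, void *ws)
{
	p->idle[p->free_ws++] = ws;
}
```

Even when every later allocation fails, the preallocated workspace keeps
circulating between pool_get() and pool_put(), so each caller eventually
gets its turn.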

I have tested this by instrumenting the code to limit the number of workspaces
to one and did some stress tests.

David Sterba (4):
  btrfs: rename and document compression workspace members
  btrfs: preallocate compression workspaces
  btrfs: make find_workspace always succeed
  btrfs: make find_workspace warn if there are no workspaces

 fs/btrfs/compression.c | 85 --
 1 file changed, 61 insertions(+), 24 deletions(-)

-- 
2.7.1


[PATCH 1/4] btrfs: rename and document compression workspace members

2016-05-05 Thread David Sterba

The names are confusing, pick more fitting names and add comments.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 35 +++
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ff61a41ac90b..4d5cd9624bb3 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -743,8 +743,11 @@ int btrfs_submit_compressed_read(struct inode *inode, 
struct bio *bio,
 static struct {
struct list_head idle_ws;
spinlock_t ws_lock;
-   int num_ws;
-   atomic_t alloc_ws;
+   /* Number of free workspaces */
+   int free_ws;
+   /* Total number of allocated workspaces */
+   atomic_t total_ws;
+   /* Waiters for a free workspace */
wait_queue_head_t ws_wait;
 } btrfs_comp_ws[BTRFS_COMPRESS_TYPES];
 
@@ -760,7 +763,7 @@ void __init btrfs_init_compress(void)
for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws);
spin_lock_init(&btrfs_comp_ws[i].ws_lock);
-   atomic_set(&btrfs_comp_ws[i].alloc_ws, 0);
+   atomic_set(&btrfs_comp_ws[i].total_ws, 0);
init_waitqueue_head(&btrfs_comp_ws[i].ws_wait);
}
 }
@@ -777,35 +780,35 @@ static struct list_head *find_workspace(int type)
 
struct list_head *idle_ws   = &btrfs_comp_ws[idx].idle_ws;
spinlock_t *ws_lock = &btrfs_comp_ws[idx].ws_lock;
-   atomic_t *alloc_ws  = &btrfs_comp_ws[idx].alloc_ws;
+   atomic_t *total_ws  = &btrfs_comp_ws[idx].total_ws;
wait_queue_head_t *ws_wait  = &btrfs_comp_ws[idx].ws_wait;
-   int *num_ws = &btrfs_comp_ws[idx].num_ws;
+   int *free_ws= &btrfs_comp_ws[idx].free_ws;
 again:
spin_lock(ws_lock);
if (!list_empty(idle_ws)) {
workspace = idle_ws->next;
list_del(workspace);
-   (*num_ws)--;
+   (*free_ws)--;
spin_unlock(ws_lock);
return workspace;
 
}
-   if (atomic_read(alloc_ws) > cpus) {
+   if (atomic_read(total_ws) > cpus) {
DEFINE_WAIT(wait);
 
spin_unlock(ws_lock);
prepare_to_wait(ws_wait, &wait, TASK_UNINTERRUPTIBLE);
-   if (atomic_read(alloc_ws) > cpus && !*num_ws)
+   if (atomic_read(total_ws) > cpus && !*free_ws)
schedule();
finish_wait(ws_wait, &wait);
goto again;
}
-   atomic_inc(alloc_ws);
+   atomic_inc(total_ws);
spin_unlock(ws_lock);
 
workspace = btrfs_compress_op[idx]->alloc_workspace();
if (IS_ERR(workspace)) {
-   atomic_dec(alloc_ws);
+   atomic_dec(total_ws);
wake_up(ws_wait);
}
return workspace;
@@ -820,21 +823,21 @@ static void free_workspace(int type, struct list_head 
*workspace)
int idx = type - 1;
struct list_head *idle_ws   = &btrfs_comp_ws[idx].idle_ws;
spinlock_t *ws_lock = &btrfs_comp_ws[idx].ws_lock;
-   atomic_t *alloc_ws  = &btrfs_comp_ws[idx].alloc_ws;
+   atomic_t *total_ws  = &btrfs_comp_ws[idx].total_ws;
wait_queue_head_t *ws_wait  = &btrfs_comp_ws[idx].ws_wait;
-   int *num_ws = &btrfs_comp_ws[idx].num_ws;
+   int *free_ws= &btrfs_comp_ws[idx].free_ws;
 
spin_lock(ws_lock);
-   if (*num_ws < num_online_cpus()) {
+   if (*free_ws < num_online_cpus()) {
list_add(workspace, idle_ws);
-   (*num_ws)++;
+   (*free_ws)++;
spin_unlock(ws_lock);
goto wake;
}
spin_unlock(ws_lock);
 
btrfs_compress_op[idx]->free_workspace(workspace);
-   atomic_dec(alloc_ws);
+   atomic_dec(total_ws);
 wake:
/*
 * Make sure counter is updated before we wake up waiters.
@@ -857,7 +860,7 @@ static void free_workspaces(void)
workspace = btrfs_comp_ws[i].idle_ws.next;
list_del(workspace);
btrfs_compress_op[i]->free_workspace(workspace);
-   atomic_dec(&btrfs_comp_ws[i].alloc_ws);
+   atomic_dec(&btrfs_comp_ws[i].total_ws);
}
}
 }
-- 
2.7.1


[PATCH 4/4] btrfs: make find_workspace warn if there are no workspaces

2016-05-05 Thread David Sterba

Be verbose if there are no workspaces at all, i.e. the module init time
preallocation failed.
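
The "once per minute, no burst" behaviour that the warning relies on can
be shown with a simplified userspace analogue. This is not the kernel's
ratelimit implementation; the struct and function below are assumptions
for illustration, and time is passed in as plain seconds so the logic is
easy to follow.

```c
struct demo_ratelimit {
	long interval;		/* window length, in seconds */
	int burst;		/* messages allowed per window */
	long window_start;	/* when the current window began */
	int printed;		/* messages let through in this window */
	int started;		/* 0 until the first call */
};

/* Returns 1 if a message may be printed at time `now`, 0 if suppressed. */
static int demo_ratelimit_ok(struct demo_ratelimit *rs, long now)
{
	/* Open a new window on first use or once the interval has passed. */
	if (!rs->started || now - rs->window_start >= rs->interval) {
		rs->started = 1;
		rs->window_start = now;
		rs->printed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	return 0;
}
```

With interval set to 60 and burst set to 1, a retry loop that spins many
times per second produces at most one warning per minute.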

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c70625560265..658c39b70fba 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -834,7 +834,21 @@ static struct list_head *find_workspace(int type)
 * workspace preallocated for each type and the compression
 * time is bounded so we get to a workspace eventually. This
 * makes our caller's life easier.
+*
+* To prevent silent and low-probability deadlocks (when the
+* initial preallocation fails), check if there are any
+* workspaces at all.
 */
+   if (atomic_read(total_ws) == 0) {
+   static DEFINE_RATELIMIT_STATE(_rs,
+   /* once per minute */ 60 * HZ,
+   /* no burst */ 1);
+
+   if (__ratelimit(&_rs)) {
+   printk(KERN_WARNING
+   "no compression workspaces, low memory, retrying");
+   }
+   }
goto again;
}
return workspace;
-- 
2.7.1


[PATCH 3/4] btrfs: make find_workspace always succeed

2016-05-05 Thread David Sterba

With just one preallocated workspace we can guarantee forward progress
even if there's no memory available for new workspaces. The cost is more
waiting but we also get rid of several error paths.

On average, there will be several idle workspaces, so the waiting
penalty won't be so bad.

In the worst case, all cpus will compete for one workspace until there's
some memory. Attempts to allocate a new one are done each time the
waiters are woken up.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 38c058bcf359..c70625560265 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -785,8 +785,10 @@ void __init btrfs_init_compress(void)
 }
 
 /*
- * this finds an available workspace or allocates a new one
- * ERR_PTR is returned if things go bad.
+ * This finds an available workspace or allocates a new one.
+ * If it's not possible to allocate a new one, waits until there's one.
+ * Preallocation makes a forward progress guarantees and we do not return
+ * errors.
  */
 static struct list_head *find_workspace(int type)
 {
@@ -826,6 +828,14 @@ static struct list_head *find_workspace(int type)
if (IS_ERR(workspace)) {
atomic_dec(total_ws);
wake_up(ws_wait);
+
+   /*
+* Do not return the error but go back to waiting. There's a
+* workspace preallocated for each type and the compression
+* time is bounded so we get to a workspace eventually. This
+* makes our caller's life easier.
+*/
+   goto again;
}
return workspace;
 }
@@ -913,8 +923,6 @@ int btrfs_compress_pages(int type, struct address_space 
*mapping,
int ret;
 
workspace = find_workspace(type);
-   if (IS_ERR(workspace))
-   return PTR_ERR(workspace);
 
ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
  start, len, pages,
@@ -949,8 +957,6 @@ static int btrfs_decompress_biovec(int type, struct page 
**pages_in,
int ret;
 
workspace = find_workspace(type);
-   if (IS_ERR(workspace))
-   return PTR_ERR(workspace);
 
ret = btrfs_compress_op[type-1]->decompress_biovec(workspace, pages_in,
 disk_start,
@@ -971,8 +977,6 @@ int btrfs_decompress(int type, unsigned char *data_in, 
struct page *dest_page,
int ret;
 
workspace = find_workspace(type);
-   if (IS_ERR(workspace))
-   return PTR_ERR(workspace);
 
ret = btrfs_compress_op[type-1]->decompress(workspace, data_in,
  dest_page, start_byte,
-- 
2.7.1


[PATCH 2/4] btrfs: preallocate compression workspaces

2016-05-05 Thread David Sterba

Preallocate one workspace for each compression type so we can guarantee
forward progress in the worst case. A failure cannot be a hard error as
we might not use compression at all on the filesystem. If we can't
allocate the workspaces later when we need them, it might actually
deadlock, but in such a situation the system effectively does not have
enough memory to operate properly.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4d5cd9624bb3..38c058bcf359 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -761,10 +761,26 @@ void __init btrfs_init_compress(void)
int i;
 
for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) {
+   struct list_head *workspace;
+
INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws);
spin_lock_init(&btrfs_comp_ws[i].ws_lock);
atomic_set(&btrfs_comp_ws[i].total_ws, 0);
init_waitqueue_head(&btrfs_comp_ws[i].ws_wait);
+
+   /*
+* Preallocate one workspace for each compression type so
+* we can guarantee forward progress in the worst case
+*/
+   workspace = btrfs_compress_op[i]->alloc_workspace();
+   if (IS_ERR(workspace)) {
+   printk(KERN_WARNING
+   "BTRFS: cannot preallocate compression workspace, will try later");
+   } else {
+   atomic_set(&btrfs_comp_ws[i].total_ws, 1);
+   btrfs_comp_ws[i].free_ws = 1;
+   list_add(workspace, &btrfs_comp_ws[i].idle_ws);
+   }
}
 }
 
-- 
2.7.1


[PATCH 3/3] Btrfs: pin logs earlier when doing a rename exchange operation

2016-05-05 Thread fdmanana

From: Filipe Manana 

The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
which had a race fixed by my previous patch titled "Btrfs: pin log earlier
when renaming", and so it suffers from the same problem.

We pin the logs of the affected roots after we insert the new inode
references, leaving a time window where concurrent tasks logging the
inodes can end up logging both the new and old references, resulting
in log trees that, when replayed, can turn the metadata into an
inconsistent state. This behaviour was added to btrfs_rename() in 2009
without any explanation of why the logs were not pinned earlier, just
leaving a comment about the possibility of the race. As of today it's
perfectly safe and sane to pin the logs before we start doing any of
the steps involved in the rename operation.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 503d749..dab6c08f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9458,6 +9458,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
/* force full log commit if subvolume involved. */
btrfs_set_log_full_commit(root->fs_info, trans);
} else {
+   btrfs_pin_log_trans(root);
+   root_log_pinned = true;
ret = btrfs_insert_inode_ref(trans, dest,
 new_dentry->d_name.name,
 new_dentry->d_name.len,
@@ -9465,8 +9467,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 btrfs_ino(new_dir), old_idx);
if (ret)
goto out_fail;
-   btrfs_pin_log_trans(root);
-   root_log_pinned = true;
}
 
/* And now for the dest. */
@@ -9474,6 +9474,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
/* force full log commit if subvolume involved. */
btrfs_set_log_full_commit(dest->fs_info, trans);
} else {
+   btrfs_pin_log_trans(dest);
+   dest_log_pinned = true;
ret = btrfs_insert_inode_ref(trans, root,
 old_dentry->d_name.name,
 old_dentry->d_name.len,
@@ -9481,8 +9483,6 @@ static int btrfs_rename_exchange(struct inode *old_dir,
 btrfs_ino(old_dir), new_idx);
if (ret)
goto out_fail;
-   btrfs_pin_log_trans(dest);
-   dest_log_pinned = true;
}
 
/* Update inode version and ctime/mtime. */
-- 
2.7.0.rc3


[PATCH 2/3] Btrfs: unpin logs if rename exchange operation fails

2016-05-05 Thread fdmanana

From: Filipe Manana 

If a rename exchange operation fails at some point after we pinned any
of the logs, we end up aborting the current transaction but never unpin
the logs, which leaves concurrent tasks that are trying to sync the logs
(as part of an fsync request from user space) blocked forever and
prevents the filesystem from being unmounted.

Fix this by safely unpinning the log.
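
The general pattern (remember what was pinned in a flag, clear the flag
on the normal unpin, and release whatever is still flagged on the error
path) can be sketched in a few lines. This is a simplified illustration
with hypothetical names; a plain counter stands in for the log pin, and
none of the log-sync handling from the patch is modelled.

```c
struct demo_log {
	int pinned;	/* pin count; a non-zero value blocks log syncs */
};

static void demo_pin(struct demo_log *l)   { l->pinned++; }
static void demo_unpin(struct demo_log *l) { l->pinned--; }

/*
 * fail_at: 0 = no failure, 1 = fail after pinning root,
 *          2 = fail after pinning both logs.
 * Returns 0 on success and -1 on failure. In every case both pin counts
 * are back to their initial value when the function returns.
 */
static int demo_exchange(struct demo_log *root, struct demo_log *dest,
			 int fail_at)
{
	int root_log_pinned = 0;
	int dest_log_pinned = 0;
	int ret = 0;

	demo_pin(root);
	root_log_pinned = 1;
	if (fail_at == 1) {
		ret = -1;
		goto out_fail;
	}

	demo_pin(dest);
	dest_log_pinned = 1;
	if (fail_at == 2) {
		ret = -1;
		goto out_fail;
	}

	/* Success path: unpin and clear the flags. */
	demo_unpin(root);
	root_log_pinned = 0;
	demo_unpin(dest);
	dest_log_pinned = 0;

out_fail:
	/* Error path: release only what is still marked as pinned. */
	if (ret && root_log_pinned)
		demo_unpin(root);
	if (ret && dest_log_pinned)
		demo_unpin(dest);
	return ret;
}
```

Tracking the pins with explicit flags, rather than re-deriving them from
the inode numbers, keeps the error path correct no matter where the
failure happened.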

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 38 --
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ab64721..503d749 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9412,6 +9412,8 @@ static int btrfs_rename_exchange(struct inode *old_dir,
u64 new_idx = 0;
u64 root_objectid;
int ret;
+   bool root_log_pinned = false;
+   bool dest_log_pinned = false;
 
/* we only allow rename subvolume link between subvolumes */
if (old_ino != BTRFS_FIRST_FREE_OBJECTID && root != dest)
@@ -9464,6 +9466,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
if (ret)
goto out_fail;
btrfs_pin_log_trans(root);
+   root_log_pinned = true;
}
 
/* And now for the dest. */
@@ -9479,6 +9482,7 @@ static int btrfs_rename_exchange(struct inode *old_dir,
if (ret)
goto out_fail;
btrfs_pin_log_trans(dest);
+   dest_log_pinned = true;
}
 
/* Update inode version and ctime/mtime. */
@@ -9557,17 +9561,47 @@ static int btrfs_rename_exchange(struct inode *old_dir,
if (new_inode->i_nlink == 1)
BTRFS_I(new_inode)->dir_index = new_idx;
 
-   if (old_ino != BTRFS_FIRST_FREE_OBJECTID) {
+   if (root_log_pinned) {
parent = new_dentry->d_parent;
btrfs_log_new_name(trans, old_inode, old_dir, parent);
btrfs_end_log_trans(root);
+   root_log_pinned = false;
}
-   if (new_ino != BTRFS_FIRST_FREE_OBJECTID) {
+   if (dest_log_pinned) {
parent = old_dentry->d_parent;
btrfs_log_new_name(trans, new_inode, new_dir, parent);
btrfs_end_log_trans(dest);
+   dest_log_pinned = false;
}
 out_fail:
+   /*
+* If we have pinned a log and an error happened, we unpin tasks
+* trying to sync the log and force them to fallback to a transaction
+* commit if the log currently contains any of the inodes involved in
+* this rename operation (to ensure we do not persist a log with an
+* inconsistent state for any of these inodes or leading to any
+* inconsistencies when replayed). If the transaction was aborted, the
+* abortion reason is propagated to userspace when attempting to commit
+* the transaction. If the log does not contain any of these inodes, we
+* allow the tasks to sync it.
+*/
+   if (ret && (root_log_pinned || dest_log_pinned)) {
+   if (btrfs_inode_in_log(old_dir, root->fs_info->generation) ||
+   btrfs_inode_in_log(new_dir, root->fs_info->generation) ||
+   btrfs_inode_in_log(old_inode, root->fs_info->generation) ||
+   (new_inode &&
+btrfs_inode_in_log(new_inode, root->fs_info->generation)))
+   btrfs_set_log_full_commit(root->fs_info, trans);
+
+   if (root_log_pinned) {
+   btrfs_end_log_trans(root);
+   root_log_pinned = false;
+   }
+   if (dest_log_pinned) {
+   btrfs_end_log_trans(dest);
+   dest_log_pinned = false;
+   }
+   }
ret = btrfs_end_transaction(trans, root);
 out_notrans:
if (new_ino == BTRFS_FIRST_FREE_OBJECTID)
-- 
2.7.0.rc3


[PATCH 1/3] Btrfs: fix inode leak on failure to setup whiteout inode in rename

2016-05-05 Thread fdmanana

From: Filipe Manana 

If we failed to fully set up the whiteout inode during a rename operation
with the whiteout flag, we ended up leaking the inode, neither decrementing
its link count nor removing all its items from the fs/subvol tree.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 09947cb..ab64721 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9612,21 +9612,21 @@ static int btrfs_whiteout_for_rename(struct btrfs_trans_handle *trans,
ret = btrfs_init_inode_security(trans, inode, dir,
&dentry->d_name);
if (ret)
-   return ret;
+   goto out;
 
ret = btrfs_add_nondir(trans, dir, dentry,
inode, 0, index);
if (ret)
-   return ret;
+   goto out;
 
ret = btrfs_update_inode(trans, root, inode);
-   if (ret)
-   return ret;
-
+out:
unlock_new_inode(inode);
+   if (ret)
+   inode_dec_link_count(inode);
iput(inode);
 
-   return 0;
+   return ret;
 }
 
 static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
-- 
2.7.0.rc3


Re: btrfs ate my data in just two days, after a fresh install. ram and disk are ok. it still mounts, but I cannot repair

2016-05-05 Thread Omar Sandoval

On Thu, May 05, 2016 at 12:36:52PM +0200, Niccolò Belli wrote:
> On giovedì 5 maggio 2016 03:07:37 CEST, Chris Murphy wrote:
> > I suggest using defaults for starters. The only thing in that list
> > that needs to be there is either subvolid or subvol, not both. Add in
> > the non-default options once you've proven the defaults are working,
> > and add them one at a time.
> 
> Yes I read your previous suggestion and I already dropped subvolid, but
> since the problem already happened I left it in the mail for completeness.
> Anyway the culprit here is genfstab and that's probably what a beginner is
> going to use when installing a distro:
> https://wiki.archlinux.org/index.php/beginners'_guide#fstab
> 

The redundant subvolid doesn't hurt; the kernel will just check that it
matches the passed subvol (see [1]). genfstab probably just pulls the
options out of /proc/mounts or /proc/self/mountinfo, and since we show
both, that's how it gets in fstab. If it was actually a problem, there
would be a clear message in dmesg.

1: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb289b7be62db84b9630ce00367444c810cada2c

-- 
Omar

[RFC 0/3] getfsmapx ioctl

2016-05-05 Thread Darrick J. Wong

Hi,

Building on the discussion "Exposing Extent Information to Userspace"
at LSF, this patchset offers the userspace definition, implementation,
and manpages for a new FS_IOC_GETFSMAPX ioctl that enables userspace
to query the filesystem for a map of every extent in a given range of
physical block keyspace.

Note that prior to the existence of block sharing, I'd have said
"given range of physical blocks", but now that we can return multiple
owner:offset pairs for a given block, the block keyspace now has to
include extra fields to uniquely identify a reverse mapping record.

This ioctl behaves in a similar manner to XFS_IOC_GETBMAPX -- pass in
an array of struct getfsmapx with key and other control values in the
first two array elements, and the kernel passes back extent
information in the other array elements.  The particulars of how to do
this are documented in the manpage that goes along with this set (it
applies against man-pages.git) and example code in the other patches
is against xfsprogs.git#for-next.

Basically, set the lowest key for which you want records in the first
array element; the highest key in the second; and the kernel spits out
records in the rest of the elements.  That's similar to how GETBMAPX
does it, but different from FIEMAP.  I added a dummy 64-bit "device
id" per Josef's request, though I'm thinking that could be cut down to
a simple dev_t.  I also wonder if the kernel should rewrite the low
key with the last element returned so as to seed the next call, but
userspace can do that too.
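
To make the calling convention above concrete, here is a minimal sketch of
how userspace might prepare the array before issuing the ioctl.  It is not
taken from the patches: the helper name fsmap_init_query() is hypothetical,
the struct is a local copy of the proposed layout with int64_t/int32_t
standing in for __s64/__s32, and filling the high key's owner and offset
with their maximum values is an assumption about how keys compare.  The
ioctl call itself is left out because the interface is still an RFC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Local copy of the proposed struct getfsmapx, using fixed-width types so
 * the sketch builds without the (not yet merged) headers.
 */
struct getfsmapx {
	int64_t fmv_device;	/* device id */
	int64_t fmv_block;	/* starting block */
	int64_t fmv_owner;	/* owner id */
	int64_t fmv_offset;	/* file offset of segment */
	int64_t fmv_length;	/* length of segment, blocks */
	int32_t fmv_oflags;	/* mapping flags */
	int32_t fmv_iflags;	/* control flags (1st structure) */
	int32_t fmv_count;	/* # of entries in array incl. input */
	int32_t fmv_entries;	/* # of entries filled in (output) */
	int64_t fmv_unused1;	/* future use, must be zero */
};

#define FMV_DEV_DEFAULT	0

/*
 * Hypothetical helper: allocate room for nrecs result records plus the two
 * key elements, and fill in the low and high keys.  Returns NULL on
 * allocation failure; the caller frees the array.
 */
struct getfsmapx *
fsmap_init_query(int32_t nrecs, int64_t low_block, int64_t high_block)
{
	struct getfsmapx *map;

	assert(nrecs > 0);

	/* calloc zeroes every field, including fmv_unused1. */
	map = calloc((size_t)nrecs + 2, sizeof(*map));
	if (!map)
		return NULL;

	/* Element 0: lowest key, plus the total array size. */
	map[0].fmv_device = FMV_DEV_DEFAULT;
	map[0].fmv_block = low_block;
	map[0].fmv_count = nrecs + 2;

	/*
	 * Element 1: highest key.  fmv_length, fmv_count, fmv_entries and
	 * fmv_iflags stay zero here.  Maximal owner/offset is an assumption
	 * meant to include every owner of the last block.
	 */
	map[1].fmv_device = FMV_DEV_DEFAULT;
	map[1].fmv_block = high_block;
	map[1].fmv_owner = INT64_MAX;
	map[1].fmv_offset = INT64_MAX;

	return map;
}
```

After a successful call, the records would be read from map[2] onwards,
with map[0].fmv_entries saying how many were filled in.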

The kernel-space implementation (for XFS) is buried inside the xfs
reverse mapping patchset which is treading water at github[1].  I
prefer not to patchbomb the whole kernel series until I've put the
mess through better testing, but this should be enough to get the
mailing list discussion started.

Questions?  Comments?  Bike sheds?

--D

[1] https://github.com/djwong/linux/tree/djwong-experimental

[PATCH 1/3] document the XFS_IOC_GETFSMAPX ioctl

2016-05-05 Thread Darrick J. Wong

Document the new XFS_IOC_GETFSMAPX that returns the physical layout
of a (disk-based) filesystem.

(Yes, the leading 'X' needs to fall off...)

Signed-off-by: Darrick J. Wong 
---
 man2/ioctl_getfsmapx.2 |  253 
 1 file changed, 253 insertions(+)
 create mode 100644 man2/ioctl_getfsmapx.2

diff --git a/man2/ioctl_getfsmapx.2 b/man2/ioctl_getfsmapx.2
new file mode 100644
index 000..b79a8e5
--- /dev/null
+++ b/man2/ioctl_getfsmapx.2
@@ -0,0 +1,253 @@
+.\" Copyright (C) 2016 Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" This program is free software; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation.
+.\"
+.\" This program is distributed in the hope that it would be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public License
+.\" along with this program; if not, write the Free Software Foundation,
+.\" Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+.\" %%%LICENSE_END
+.TH IOCTL-XFS_IOC_GETFSMAPX 2 2016-05-05 "Linux" "Linux Programmer's Manual"
+.SH NAME
+ioctl_getfsmapx \- retrieve the physical layout of the filesystem
+.SH SYNOPSIS
+.br
+.B #include 
+.br
+.B #include 
+.sp
+.BI "int ioctl(int " fd ", XFS_IOC_GETFSMAPX, struct getfsmapx * " arg );
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+retrieves physical extent mappings for a filesystem.  This information can
+be used to discover which files are mapped to a physical block, examine
+free space, or find known bad blocks, among other things.
+
+The sole argument to this ioctl should be an array of the following
+structure:
+.in +4n
+.nf
+
+struct getfsmapx {
+   __s64   fmv_device; /* device id */
+   __s64   fmv_block;  /* starting block */
+   __s64   fmv_owner;  /* owner id */
+   __s64   fmv_offset; /* file offset of segment */
+   __s64   fmv_length; /* length of segment, blocks */
+   __s32   fmv_oflags; /* mapping flags */
+   __s32   fmv_iflags; /* control flags (1st structure) */
+   __s32   fmv_count;  /* # of entries in array incl. input */
+   __s32   fmv_entries; /* # of entries filled in (output). */
+   __s64   fmv_unused1; /* future use, must be zero */
+};
+
+.fi
+.in
+The array must contain at least two elements.  The first two array
+elements specify the lowest and highest reverse-mapping keys, respectively,
+for which userspace would like physical mapping information.  A reverse
+mapping key consists of the tuple (device, block, owner, offset).  The
+owner and offset fields are part of the key because some filesystems
+support sharing physical blocks between multiple files and therefore may
+return multiple mappings for a given physical block.
+
+.SS Fields of struct getfsmapx
+.PP
+The
+.I fmv_device
+field contains a 64-bit cookie to uniquely identify the underlying storage
+device if the filesystem supports multiple devices.  If not, the field
+should be
+.BR FMV_DEV_DEFAULT "."
+
+.PP
+The
+.I fmv_block
+field contains the 512-byte sector address of the extent.
+
+.PP
+The
+.I fmv_owner
+field contains the owner of the extent.  This is generally an inode
+number, though if
+.B FMV_OF_SPECIAL_OWNER
+is set in the
+.I fmv_oflags
+field, then the owner value is one of the following special values:
+.TP
+.B FMV_OWN_FREE
+Free space.
+.TP
+.B FMV_OWN_UNKNOWN
+This extent has an unknown owner.
+.TP
+.B FMV_OWN_FS
+Static filesystem metadata.
+.TP
+.B FMV_OWN_LOG
+The filesystem journal.
+.TP
+.B FMV_OWN_AG
+Allocation group metadata.
+.TP
+.B FMV_OWN_INOBT
+The inode index, if one is provided.
+.TP
+.B FMV_OWN_INODES
+Inodes.
+.TP
+.B FMV_OWN_REFC
+Reference counting indexes.
+.TP
+.B FMV_OWN_COW
+This extent is being used to stage a copy-on-write.
+.TP
+.B FMV_OWN_DEFECTIVE
+This extent has been marked defective either by the filesystem or the
+underlying device.
+
+.PP
+The
+.I fmv_offset
+field contains the logical address of the reverse mapping record, in units
+of 512-byte blocks.  This field has no meaning if the
+.BR FMV_OF_SPECIAL_OWNER " or " FMV_OF_EXTENT_MAP
+flags are set in
+.IR fmv_oflags "."
+
+.PP
+The
+.I fmv_length
+field contains the length of the extent, in units of 512-byte blocks.
+This field must be zero in the second array element.
+
+.PP
+The
+.I fmv_oflags
+field is a bitmask of extent state flags.  The bits are:
+.TP
+.B FMV_OF_PREALLOC
+The extent is allocated but not yet written.
+.TP
+.B FMV_OF_ATTR_FORK
+This extent contains extended attribute data.
+.TP
+.B FMV_OF_EXTENT_MAP
+This extent contains extent map information for the owner.
+.TP
+.B FMV_OF_SHARED
+Parts

[PATCH 2/3] xfs: introduce the XFS_IOC_GETFSMAPX ioctl

2016-05-05 Thread Darrick J. Wong

Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.  This is
the xfsprogs side of things for userspace support.

Signed-off-by: Darrick J. Wong 
---
 libxfs/xfs_fs.h   |   65 +
 3 files changed, 106 insertions(+), 14 deletions(-)

diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index d5ed090..6573fcc 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -117,6 +117,70 @@ struct getbmapx {
 #define BMV_OF_SHARED  0x8 /* segment shared with another file */
 
 /*
+ * Structure for XFS_IOC_GETFSMAPX.
+ *
+ * Similar to XFS_IOC_GETBMAPX, the first two elements in the array are
+ * used to constrain the output.  The first element in the array should
+ * represent the lowest disk address that the user wants to learn about.
+ * The second element in the array should represent the highest disk
+ * address to query.  Subsequent array elements will be filled out by the
+ * command.
+ *
+ * The fmv_iflags field is only used in the first structure.  The
+ * fmv_oflags field is filled in for each returned structure after the
+ * second structure.  The fmv_unused1 fields in the first two array
+ * elements must be zero.
+ *
+ * The fmv_count, fmv_entries, and fmv_iflags fields in the second array
+ * element must be zero.
+ *
+ * fmv_block, fmv_offset, and fmv_length are expressed in units of 512
+ * byte sectors.
+ */
+#ifndef HAVE_GETFSMAPX
+struct getfsmapx {
+   __s64   fmv_device; /* device id */
+   __s64   fmv_block;  /* starting block */
+   __s64   fmv_owner;  /* owner id */
+   __s64   fmv_offset; /* file offset of segment */
+   __s64   fmv_length; /* length of segment, blocks */
+   __s32   fmv_oflags; /* mapping flags */
+   __s32   fmv_iflags; /* control flags (1st structure) */
+   __s32   fmv_count;  /* # of entries in array incl. input */
+   __s32   fmv_entries; /* # of entries filled in (output). */
+   __s64   fmv_unused1; /* future use, must be zero */
+};
+#endif
+
+/* fmv_device values - set by XFS_IOC_GETFSMAPX caller. */
+/* use this value if the filesystem doesn't support multiple devices. */
+#define FMV_DEV_DEFAULT        0
+
+/* fmv_flags values - set by XFS_IOC_GETFSMAPX caller. */
+/* no flags defined yet */
+#define FMV_IF_VALID           0
+
+/* fmv_flags values - returned for each non-header segment */
+#define FMV_OF_PREALLOC        0x1  /* segment = unwritten pre-allocation */
+#define FMV_OF_ATTR_FORK       0x2  /* segment = attribute fork */
+#define FMV_OF_EXTENT_MAP      0x4  /* segment = extent map */
+#define FMV_OF_SHARED          0x8  /* segment = shared with another file */
+#define FMV_OF_SPECIAL_OWNER   0x10 /* owner is a special value */
+#define FMV_OF_LAST            0x20 /* segment is the last in the FS */
+
+/* fmv_owner special values */
+#define FMV_OWN_FREE           (-1ULL)  /* free space */
+#define FMV_OWN_UNKNOWN        (-2ULL)  /* unknown owner */
+#define FMV_OWN_FS             (-3ULL)  /* static fs metadata */
+#define FMV_OWN_LOG            (-4ULL)  /* journalling log */
+#define FMV_OWN_AG             (-5ULL)  /* per-AG metadata */
+#define FMV_OWN_INOBT          (-6ULL)  /* inode btree blocks */
+#define FMV_OWN_INODES         (-7ULL)  /* inodes */
+#define FMV_OWN_REFC           (-8ULL)  /* refcount tree */
+#define FMV_OWN_COW            (-9ULL)  /* cow allocations */
+#define FMV_OWN_DEFECTIVE      (-10ULL) /* bad blocks */
+
+
+/*
  * Structure for XFS_IOC_FSSETDM.
  * For use by backup and restore programs to set the XFS on-disk inode
  * fields di_dmevmask and di_dmstate.  These must be set to exactly and
@@ -523,6 +587,7 @@ typedef struct xfs_swapext
 #define XFS_IOC_GETBMAPX   _IOWR('X', 56, struct getbmap)
 #define XFS_IOC_ZERO_RANGE _IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS _IOR ('X', 58, struct xfs_fs_eofblocks)
+#define XFS_IOC_GETFSMAPX  _IOWR('X', 59, struct getfsmapx)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s

[PATCH 3/3] xfs_io: support the new getfsmap ioctl

2016-05-05 Thread Darrick J. Wong

Add a new command, 'fsmap', to xfs_io so that we can query the filesystem
extent map on a live filesystem.

Signed-off-by: Darrick J. Wong 
---
 io/Makefile   |2 
 io/fsmap.c|  485 +
 io/init.c |1 
 io/io.h   |1 
 man/man8/xfs_io.8 |   47 +
 5 files changed, 535 insertions(+), 1 deletion(-)
 create mode 100644 io/fsmap.c

diff --git a/io/Makefile b/io/Makefile
index 0b53f41..6439e1d 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -11,7 +11,7 @@ HFILES = init.h io.h
 CFILES = init.c \
attr.c bmap.c file.c freeze.c fsync.c getrusage.c imap.c link.c \
mmap.c open.c parent.c pread.c prealloc.c pwrite.c seek.c shutdown.c \
-   sync.c truncate.c reflink.c
+   sync.c truncate.c reflink.c fsmap.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE)
diff --git a/io/fsmap.c b/io/fsmap.c
new file mode 100644
index 000..bf72555
--- /dev/null
+++ b/io/fsmap.c
@@ -0,0 +1,485 @@
+/*
+ * Copyright (c) 2016 Oracle.
+ * All Rights Reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "platform_defs.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+#include "input.h"
+
+static cmdinfo_t fsmap_cmd;
+
+static void
+fsmap_help(void)
+{
+   printf(_(
+"\n"
+" prints the block mapping for an XFS filesystem"
+"\n"
+" Example:\n"
+" 'fsmap -vp' - tabular format verbose map, including unwritten extents\n"
+"\n"
+" fsmap prints the map of disk blocks used by the whole filesystem.\n"
+" The map lists each extent used by the file, as well as regions in the\n"
+" filesystem that do not have any corresponding blocks (free space).\n"
+" By default, each line of the listing takes the following form:\n"
+" extent: [startoffset..endoffset] owner startblock..endblock\n"
+" All the file offsets and disk blocks are in units of 512-byte blocks.\n"
+" -n -- query n extents.\n"
+" -v -- Verbose information, specify ag info.  Show flags legend on 2nd -v\n"
+"\n"));
+}
+
+static int
+numlen(
+   off64_t val)
+{
+   off64_t tmp;
+   int len;
+
+   for (len = 0, tmp = val; tmp > 0; tmp = tmp/10)
+   len++;
+   return (len == 0 ? 1 : len);
+}
+
+static const char *
+special_owner(
+   __int64_t   owner)
+{
+   switch (owner) {
+   case FMV_OWN_FREE:
+   return _("free space");
+   case FMV_OWN_UNKNOWN:
+   return _("unknown");
+   case FMV_OWN_FS:
+   return _("static fs metadata");
+   case FMV_OWN_LOG:
+   return _("journalling log");
+   case FMV_OWN_AG:
+   return _("per-AG metadata");
+   case FMV_OWN_INOBT:
+   return _("inode btree");
+   case FMV_OWN_INODES:
+   return _("inodes");
+   case FMV_OWN_REFC:
+   return _("refcount btree");
+   case FMV_OWN_COW:
+   return _("cow reservation");
+   case FMV_OWN_DEFECTIVE:
+   return _("defective");
+   default:
+   return _("unknown");
+   }
+}
+
+static void
+dump_map(
+   unsigned long long  nr,
+   struct getfsmapx*map)
+{
+   unsigned long long  i;
+   struct getfsmapx*p;
+
+   for (i = 0, p = map + 2; i < map->fmv_entries; i++, p++) {
+   printf("\t%llu: [%lld..%lld]: ", i + nr,
+   (long long) p->fmv_block,
+   (long long)(p->fmv_block + p->fmv_length - 1));
+   if (p->fmv_oflags & FMV_OF_SPECIAL_OWNER)
+   printf("%s", special_owner(p->fmv_owner));
+   else if (p->fmv_oflags & FMV_OF_EXTENT_MAP)
+   printf(_("inode %lld extent map"),
+   (long long) p->fmv_owner);
+   else
+   printf(_("inode %lld %lld..%lld"),
+   (long long) p->fmv_owner,
+   (long long) p->fmv_offset,
+   (long long)(p->fmv_offset + p->fmv_length - 1));
+   printf(_(" %lld blocks\n"),
+   (long long)p->fmv_length);
+   }
+}
+
+/*
+ * Verbose mode displays:
+ *   extent: [startblock..endblock]: startoffset..endoffset \
+ * ag# (agoffset..agendoffset) totalbbs flags
+ */
+#define MINR

Re: [PATCH 0/2] scop GFP_NOFS api

2016-05-05 Thread NeilBrown

On Wed, May 04 2016, Dave Chinner wrote:

> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wake waiters as they free the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the meantime, having a slightly more fine grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.

You have painted two pictures here.  The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than a
small step away.
The second is a band-aid which would take us in exactly the wrong
direction.  It makes an interface which people apparently find hard to
use (or easy to misuse) - the setting of __GFP_FS - and makes it more
complex.  Certainly it would be more powerful, but I think it would also
be more misused.

So I ask myself:  can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?

A core design principle for me is to enable filesystems to take control
of their own destiny.   They should have the information available to
make the decisions they need to make, and the opportunity to carry them
out.

All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the file system can decide what to do, with one
exception: evict_inode.  So my first proposal would be to rectify that.

 - redefine .nr_cached_objects and .free_cached_objects so that, if they
   are defined, they are responsible for s_dentry_lru and s_inode_lru.
   e.g. super_cache_count *either* calls ->nr_cached_objects *or* makes
   two calls to list_lru_shrink_count.  This would require exporting
   prune_dcache_sb and prune_icache_sb but otherwise should be a fairly
   straightforward change.
   If nr_cached_objects were defined, super_cache_scan would no longer
   abort without __GFP_FS - that test would be left to the filesystem.

 - Now any filesystem that wants to can stash its super_block pointer
   in current->journal_info while doing memory allocations, and abort
   any reclaim attempts (release_page, shrinker, nr_cached_objects) if
   and only if current->journal_info == "my superblock".  This can be
   done without the core mm code knowing any more than it already does.

 - A more sophisticated filesystem might import much of the code for
   prune_icache_sb() - either by copy/paste or by exporting some vfs
   internals - and then store an inode pointer in current->journal_info
   and only abort reclaim which touches that inode.

 - if a filesystem happens to know that it will never block in any of
   these reclaim calls, it can always allow prune_dcache_sb to run, and
   never needs to use GFP_NOFS.  I think NFS might be close to being
   able to do this as it flushes everything on last-close.  But that is
   something that NFS developers can care about (or not) quite
   independently from mm people.

 - Maybe some fs developer will try to enable free_cached_objects to do
   as much work as possible for every inode, but never deadlock.  It
   could do its own fs-specific deadlock detection, or could queue work
   to a work queue and wait a limited time for it.  Or something.
   If some filesystem developer comes up with something that works
   really well, developers of other filesystems might copy it - or not
   as they choose.

Maybe ->journal_info isn't perfect for this.  It is currently only safe
for reclaim code to compare it against a known value.  It is not safe to
dereference it to see if it points to a known value.  That could possibly be
cleaned up, or another task_struct field could be provided for
filesystems to track their state.  Or do you find a task_struct field
unacceptable, and is there some reason that an explicitly passed cookie
is superior?
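
As a rough illustration of the journal_info idea from the second bullet
above, here is a userspace model in plain C.  It is only a sketch: the
thread-local current_journal_info stands in for current->journal_info,
struct super_block is a toy type, and myfs_alloc_begin(), myfs_alloc_end()
and myfs_may_reclaim() are hypothetical names, not existing kernel
interfaces.  Note that the reclaim side only compares the pointer and never
dereferences it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct super_block. */
struct super_block {
	int id;
};

/* Models current->journal_info: one slot per task (here, per thread). */
static _Thread_local void *current_journal_info;

/*
 * Called by the filesystem before it allocates memory in a context where
 * re-entering itself through reclaim would deadlock.  Returns the previous
 * value so that nested sections can restore it.
 */
void *myfs_alloc_begin(struct super_block *sb)
{
	void *saved = current_journal_info;

	assert(sb != NULL);
	current_journal_info = sb;
	return saved;
}

/* Leave the section and restore whatever was stashed before. */
void myfs_alloc_end(void *saved)
{
	current_journal_info = saved;
}

/*
 * Called from the filesystem's reclaim hooks (release_page, shrinker,
 * nr_cached_objects in the proposal).  Reclaim is refused if and only if
 * this task is currently allocating on behalf of the same superblock.
 * The stashed pointer is compared, never dereferenced.
 */
bool myfs_may_reclaim(const struct super_block *sb)
{
	return current_journal_info != (const void *)sb;
}
```

With this shape, reclaim against some other superblock still proceeds while
one filesystem is inside its own allocation section, which is the extra
granularity being asked for.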

My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow.  Rather the mm/vfs should get out of the filesystems' way as much
as possible and let them innovate independently.

Thanks for your time,
NeilBrown

