[PATCH v2] btrfs: Improve btrfs_search_slot description

2017-12-12 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 880f4f693263..7e6511dfe73a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2654,17 +2654,29 @@ int btrfs_find_item(struct btrfs_root *fs_root, struct 
btrfs_path *path,
 }
 
 /*
- * look for key in the tree.  path is filled in with nodes along the way
- * if key is found, we return zero and you can find the item in the leaf
- * level of the path (level 0)
+ * btrfs_search_slot - look for a key in a tree and perform necessary
+ * modifications to preserve tree invariants.
  *
- * If the key isn't found, the path points to the slot where it should
- * be inserted, and 1 is returned.  If there are other errors during the
- * search a negative error number is returned.
+ * @trans: Handle of transaction, used when modifying the tree
+ * @p: Holds all btree nodes along the search path
+ * @root:  The root node of the tree
+ * @key:   The key we are looking for
+ * @ins_len:   Indicates purpose of search, for inserts it is 1, for
+ * deletions it's -1. 0 for plain searches
+ * @cow:   boolean should CoW operations be performed. Must always be 1
+ * when modifying the tree.
  *
- * if ins_len > 0, nodes and leaves will be split as we walk down the
- * tree.  if ins_len < 0, nodes will be merged as we walk down the tree (if
- * possible)
+ * If @ins_len > 0, nodes and leaves will be split as we walk down the tree.
+ * If @ins_len < 0, nodes will be merged as we walk down the tree (if possible)
+ *
+ * If @key is found, 0 is returned and you can find the item in the leaf level
+ * of the path (level 0)
+ *
+ * If @key isn't found, 1 is returned and the leaf level of the path (level 0)
+ * points to the slot where it should be inserted
+ *
+ * If an error is encountered while searching the tree a negative error number
+ * is returned
  */
 int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root 
*root,
  const struct btrfs_key *key, struct btrfs_path *p,
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Remove pair of bio_get/put in btrfs_schedule_bio

2017-12-12 Thread Nikolay Borisov


On 13.12.2017 00:09, Liu Bo wrote:
> On Tue, Dec 12, 2017 at 07:52:38PM +0100, David Sterba wrote:
>> On Mon, Dec 11, 2017 at 02:57:56PM -0800, Liu Bo wrote:
>>> On Mon, Dec 11, 2017 at 04:38:48PM +0200, Nikolay Borisov wrote:
 This code was added in 492bb6deee34 ("Btrfs: Hold a reference on bios
 during submit_bio, add some extra bio checks"). However, holding a
 reference on a bio is necessary only if it's going to be referenced
 after the submit_bio returns and the bio is completed. In this
 particular instance this is not the case so there is no need to hold
 an extra reference since we directly return.

>>>
>>> Looks good.  If possible, please also drop other unnecessary bio_get()
>>> in btrfs.
>>
>> You mean everywhere with similar context like in this patch?
>>
> 
> Right, there're several places where bio_get() is put before running
> functions but bio_put() is also put immediately without referencing
> bio.

I've already created patches for those and put them through fstests.
Will send them later

> 
> thanks,
> -liubo
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at

2017-12-12 Thread Darrick J. Wong
On Wed, Dec 13, 2017 at 06:38:25AM +0100, Adam Borowski wrote:
> This link is replicated in most filesystems' config stanzas.  Referring
> to an archived version of that site is pointless as it mostly deals with
> patches; user documentation is available elsewhere.
> 
> Signed-off-by: Adam Borowski 
> ---
> Sending this as one piece; if you guys would instead prefer this chopped
> into tiny per-filesystem bits, please say so.
> 
> 
>  Documentation/filesystems/ext2.txt |  2 --
>  Documentation/filesystems/ext4.txt |  7 +++
>  fs/9p/Kconfig  |  3 ---
>  fs/Kconfig |  6 +-
>  fs/btrfs/Kconfig   |  3 ---
>  fs/ceph/Kconfig|  3 ---
>  fs/cifs/Kconfig| 15 +++
>  fs/ext2/Kconfig|  6 +-
>  fs/ext4/Kconfig|  3 ---
>  fs/f2fs/Kconfig|  6 +-
>  fs/hfsplus/Kconfig |  3 ---
>  fs/jffs2/Kconfig   |  6 +-
>  fs/jfs/Kconfig |  3 ---
>  fs/reiserfs/Kconfig|  6 +-
>  fs/xfs/Kconfig |  3 ---
>  15 files changed, 15 insertions(+), 60 deletions(-)
> 
> diff --git a/Documentation/filesystems/ext2.txt 
> b/Documentation/filesystems/ext2.txt
> index 55755395d3dc..81c0becab225 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -49,12 +49,10 @@ sb=n  Use alternate 
> superblock at this location.
>  
>  user_xattr   Enable "user." POSIX Extended Attributes
>   (requires CONFIG_EXT2_FS_XATTR).
> - See also http://acl.bestbits.at
>  nouser_xattr Don't support "user." extended attributes.
>  
>  acl  Enable POSIX Access Control Lists support
>   (requires CONFIG_EXT2_FS_POSIX_ACL).
> - See also http://acl.bestbits.at
>  noaclDon't support POSIX ACLs.
>  
>  nobh Do not attach buffer_heads to file pagecache.
> diff --git a/Documentation/filesystems/ext4.txt 
> b/Documentation/filesystems/ext4.txt
> index 75236c0c2ac2..8cd63e16f171 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -202,15 +202,14 @@ inode_readahead_blks=n  This tuning parameter controls 
> the maximum
>   the buffer cache.  The default value is 32 blocks.
>  
>  nouser_xattr Disables Extended User Attributes.  See the
> - attr(5) manual page and http://acl.bestbits.at/
> - for more information about extended attributes.
> + attr(5) manual page for more information about
> + extended attributes.
>  
>  noaclThis option disables POSIX Access Control List
>   support. If ACL support is enabled in the kernel
>   configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
>   enabled by default on mount. See the acl(5) manual
> - page and http://acl.bestbits.at/ for more information
> - about acl.
> + page for more information about acl.
>  
>  bsddf(*) Make 'df' act like BSD.
>  minixdf  Make 'df' act like Minix.
> diff --git a/fs/9p/Kconfig b/fs/9p/Kconfig
> index 6489e1fc1afd..11045d8e356a 100644
> --- a/fs/9p/Kconfig
> +++ b/fs/9p/Kconfig
> @@ -25,9 +25,6 @@ config 9P_FS_POSIX_ACL
> POSIX Access Control Lists (ACLs) support permissions for users and
> groups beyond the owner/group/world scheme.
>  
> -   To learn more about Access Control Lists, visit the POSIX ACLs for
> -   Linux website .
> -
> If you don't know what Access Control Lists are, say N
>  
>  endif
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 7aee6d699fd6..0ed56752f208 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -167,17 +167,13 @@ config TMPFS_POSIX_ACL
> files for sound to work properly.  In short, if you're not sure,
> say Y.
>  
> -   To learn more about Access Control Lists, visit the POSIX ACLs for
> -   Linux website .
> -
>  config TMPFS_XATTR
>   bool "Tmpfs extended attributes"
>   depends on TMPFS
>   default n
>   help
> Extended attributes are name:value pairs associated with inodes by
> -   the kernel or by users (see the attr(5) manual page, or visit
> -    for details).
> +   the kernel or by users (see the attr(5) manual page for details).
>  
> Currently this enables support for the trusted.* and
> security.* namespaces.
> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> index 

[PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at

2017-12-12 Thread Adam Borowski
This link is replicated in most filesystems' config stanzas.  Referring
to an archived version of that site is pointless as it mostly deals with
patches; user documentation is available elsewhere.

Signed-off-by: Adam Borowski 
---
Sending this as one piece; if you guys would instead prefer this chopped
into tiny per-filesystem bits, please say so.


 Documentation/filesystems/ext2.txt |  2 --
 Documentation/filesystems/ext4.txt |  7 +++
 fs/9p/Kconfig  |  3 ---
 fs/Kconfig |  6 +-
 fs/btrfs/Kconfig   |  3 ---
 fs/ceph/Kconfig|  3 ---
 fs/cifs/Kconfig| 15 +++
 fs/ext2/Kconfig|  6 +-
 fs/ext4/Kconfig|  3 ---
 fs/f2fs/Kconfig|  6 +-
 fs/hfsplus/Kconfig |  3 ---
 fs/jffs2/Kconfig   |  6 +-
 fs/jfs/Kconfig |  3 ---
 fs/reiserfs/Kconfig|  6 +-
 fs/xfs/Kconfig |  3 ---
 15 files changed, 15 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/ext2.txt 
b/Documentation/filesystems/ext2.txt
index 55755395d3dc..81c0becab225 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -49,12 +49,10 @@ sb=nUse alternate 
superblock at this location.
 
 user_xattr Enable "user." POSIX Extended Attributes
(requires CONFIG_EXT2_FS_XATTR).
-   See also http://acl.bestbits.at
 nouser_xattr   Don't support "user." extended attributes.
 
 aclEnable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
-   See also http://acl.bestbits.at
 noacl  Don't support POSIX ACLs.
 
 nobh   Do not attach buffer_heads to file pagecache.
diff --git a/Documentation/filesystems/ext4.txt 
b/Documentation/filesystems/ext4.txt
index 75236c0c2ac2..8cd63e16f171 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -202,15 +202,14 @@ inode_readahead_blks=nThis tuning parameter controls 
the maximum
the buffer cache.  The default value is 32 blocks.
 
 nouser_xattr   Disables Extended User Attributes.  See the
-   attr(5) manual page and http://acl.bestbits.at/
-   for more information about extended attributes.
+   attr(5) manual page for more information about
+   extended attributes.
 
 noacl  This option disables POSIX Access Control List
support. If ACL support is enabled in the kernel
configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
enabled by default on mount. See the acl(5) manual
-   page and http://acl.bestbits.at/ for more information
-   about acl.
+   page for more information about acl.
 
 bsddf  (*) Make 'df' act like BSD.
 minixdfMake 'df' act like Minix.
diff --git a/fs/9p/Kconfig b/fs/9p/Kconfig
index 6489e1fc1afd..11045d8e356a 100644
--- a/fs/9p/Kconfig
+++ b/fs/9p/Kconfig
@@ -25,9 +25,6 @@ config 9P_FS_POSIX_ACL
  POSIX Access Control Lists (ACLs) support permissions for users and
  groups beyond the owner/group/world scheme.
 
- To learn more about Access Control Lists, visit the POSIX ACLs for
- Linux website .
-
  If you don't know what Access Control Lists are, say N
 
 endif
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..0ed56752f208 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -167,17 +167,13 @@ config TMPFS_POSIX_ACL
  files for sound to work properly.  In short, if you're not sure,
  say Y.
 
- To learn more about Access Control Lists, visit the POSIX ACLs for
- Linux website .
-
 config TMPFS_XATTR
bool "Tmpfs extended attributes"
depends on TMPFS
default n
help
  Extended attributes are name:value pairs associated with inodes by
- the kernel or by users (see the attr(5) manual page, or visit
-  for details).
+ the kernel or by users (see the attr(5) manual page for details).
 
  Currently this enables support for the trusted.* and
  security.* namespaces.
diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 2e558227931a..273351ee4c46 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -38,9 +38,6 @@ config BTRFS_FS_POSIX_ACL
  POSIX Access Control Lists (ACLs) support permissions for users and
  groups 

Re: [PATCH v3 0/5] define BTRFS_DEV_STATE

2017-12-12 Thread David Sterba
On Wed, Dec 13, 2017 at 06:38:12AM +0800, Anand Jain wrote:
> 
> 
> On 12/13/2017 01:42 AM, David Sterba wrote:
> > On Sun, Dec 10, 2017 at 05:15:17PM +0800, Anand Jain wrote:
> >> As of now device properties and states are being represented as int
> >> variable, patches here makes them bit flags instead. Further, wip
> >> patches such as device failed state needs this cleanup.
> >>
> >> v2:
> >>   Adds BTRFS_DEV_STATE_REPLACE_TGT
> >>   Adds BTRFS_DEV_STATE_FLUSH_SENT
> >>   Drops BTRFS_DEV_STATE_CAN_DISCARD
> >>   Starts bit flag from the bit 0
> >>   Drops unrelated change - declare btrfs_device
> >>
> >> v3:
> >>   Fix static checker warning, define respective dev state as bit number
> > 
> > The define numbers are fixed but the whitespace changes that I made in
> > misc-next
> 
>   Will do next time. Thanks. I don't see misc-next. Is it for-next ?

The kernel.org repository only gets the latest for-next, that is
assembled from the pending branches, and also after some testing. You
could still find 'misc-next' inside the for-next branch, but it's not
obvious.

All the development branches are pushed to

https://github.com/kdave/btrfs-devel or
http://repo.or.cz/linux-2.6/btrfs-unstable.git

more frequently than the k.org/for-next is updated. I thought this has
become a common knowledge, but yet it's not documented on the wiki so.
Let's fix that.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs-progs: doc: remove explanation of 'X' attribute

2017-12-12 Thread Misono, Tomohiro
1st patch requires btrfs/048's output to be modified.
If the patch is Ok, I will update the xfstest.

Thanks,
Tomohiro

On 2017/12/13 3:37, David Sterba wrote:
> On Tue, Dec 12, 2017 at 04:08:17PM +0900, Misono, Tomohiro wrote:
>> e2fsprogs has removed compression support since v1.43 and there is no field
>> 'X' (no compress) for lxattr now. So, just remove the explanation.
> 
> Hm right, I hoped that we could export the NOCOMP flag through
> lsattr/chattr, but this relied on a compile-time compression support of
> e2fsprogs.
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] Qgroup metadata reservation rework

2017-12-12 Thread Qu Wenruo


On 2017年12月13日 02:01, David Sterba wrote:
> On Tue, Dec 12, 2017 at 04:16:08PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 12.12.2017 09:34, Qu Wenruo wrote:
>>> [Overall]
>>> The previous rework on qgroup reservation system put a lot of effort on
>>> data, which works quite fine.
>>>
>>> But it takes less focus on metadata reservation, causing some problem
>>> like metadata reservation underflow and noisy kernel warning.
>>>
>>> This patchset will try to address the remaining problem of metadata
>>> reservation.
>>>
>>> The idea of new qgroup metadata reservation is to use 2 types of
>>> metadata reservation:
>>> 1) Per-transaction reservation
>>>Life span will be inside a transaction. Will be freed at transaction
>>>commit time.
>>>
>>> 2) Preallocated reservation
>>>For case where we reserve space before starting a transaction.
>>>Operation like dealloc and delayed-inode/item belongs to this type.
>>>
>>>This works similar to block_rsv, its reservation can be
>>>reserved/released at any timing caller like.
>>>
>>>The only point to notice is, if preallocated reservation is used and
>>>finished without problem, it should be converted to per-transaction
>>>type instead of just freeing.
>>>This is to co-operate with qgroup update at commit time.
>>>
>>> For preallocated type, this patch will integrate them into inode_rsv
>>> mechanism reworked by Josef, and delayed-inode/item reservation.
>>>
>>>
>>> [Problem: Over-reserve]
>>> Currently the patchset addresses metadata underflow quite well, but
>>> due to the over-reserve nature of btrfs and highly bounded to inode_rsv,
>>> qgroup metadata reservation also tends to be over-reserved.
>>>
>>> This is especially obvious for small limit.
>>> For 128M limit, it's will only be able to write about 70M before hitting
>>> quota limit.
>>> Although for larger limit, like 5G limit, it can reach 4.5G or more
>>> before hitting limit.
>>>
>>> Such over-reserved behavior can lead to some problem with existing test
>>> cases (where limit is normally less than 20M).
>>>
>>> While it's also possible to be addressed by use more accurate space other
>>> than max estimations.
>>>
>>> For example, to calculate metadata needed for delalloc, we use
>>> btrfs_calc_trans_metadata_size(), which always try to reserve space for
>>> CoW a full-height tree, and will also include csum size.
>>> Both calculate is way over-killed for qgroup metadata reservation.
>> In private chat with Chris couple of months ago we discussed making the
>> reservation a lot less pessimistic. One assumption which we could
>> exploit is the fact that upon a tree split it's unlikely we will create
>> more than 1 additional level in the tree. So we could potentially modify
>> btrfs_calc_trans_metadata_size to take a root parameter and instead of
>>
>> BTRFS_MAX_LEVEL * 2 we could change this to root_level * 2. How does
>> that sound?

My plan is more aggressive.
Since qgroup doesn't really need to care about low level space
reservation for extents, here we only needs to care about "net"
modification.

For item insert, it will takes one new tree block for split at most (if
I don't miss something).
So for qgroup, one ordered extent should only reserve one nodesize.

In fact, for per-trans reservation, we have been using this for a long time.

> 
> That sounds unsafe for low levels, because the reservations can be made
> with level N and the tree height can increase at commit time to > N,
> then we'd go ENOSPC. But BTRFS_MAX_LEVEL can be reduced to
> min(root_level + 1, 4) and even the +1 can be dropped for higher levels
> if the root node is filled up to some percent (say < 50%). The 4 is a
> guess and could be possibly 3, with some more precise calculations.

So the idea only applies to qgroup, nothing is changed for low level
space reservation.
Although this needs extra work in qgroup side.

Current patchset completely rely on current outstanding extent mechanism
to ensure its reservation is reserved and released correctly.

If qgroup is going to do its own reservation calculation, it needs extra
work to handle.

But compared to too early EDQUOT, it's still worthy anyway.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 00/14] Qgroup metadata reservation rework

2017-12-12 Thread Qu Wenruo


On 2017年12月13日 05:12, David Sterba wrote:
> On Tue, Dec 12, 2017 at 03:34:22PM +0800, Qu Wenruo wrote:
>> The patch is consist of 2 main parts:
>> 1) Type based qgroup reservation
>>The original patchset is sent several months ago.
>>Nothing is modified at all, just rebased. And not conflict at all.
>>
>>It's from patch 1 to patch 6.
>>
>> 2) Split meta qgroup reservation into per-trans and prealloc sub types
>>The real work to address metadata underflow.
>>Due to the over-reserve problem, this part is still in RFC state.
>>But the framework should mostly be fine, only needs extra fine-tuning
>>to get more accurate qgroup rsv to avoid too early limit.
>>
>>It's from patch 7 to 14.
> 
> I'm going to add the whole patchset to next, the first part has been
> there for some time and no test failures were reported. I optimistically
> expect that the second part will also be fine.

The type based reservation is completely fine, since it doesn't
introduce anything new, just a preparation for the incoming meta rework.

However I prefer not to push the whole patchset to upstream until
over-reserve behavior is solved.
Since it breaks quite some test cases with small limit.

> However I'm not sure how
> good is the test coverage of qgroups in fstests, so lack of reports may
> not mean anything.

That's why I submitted mkfs.btrfs quota support some times ago. :^)

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


[PATCH v2] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2017-12-12 Thread Timofey Titovets
At now btrfs_dedupe_file_range() restricted to 16MiB range for
limit locking time and memory requirement for dedup ioctl()

For too big input range code silently set range to 16MiB

Let's remove that restriction by do iterating over dedup range.
That's backward compatible and will not change anything for request
less then 16MiB.

Changes:
  v1 -> v2:
- Refactor btrfs_cmp_data_prepare and btrfs_extent_same
- Store memory of pages array between iterations
- Lock inodes once, not on each iteration
- Small inplace cleanups

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/ioctl.c | 160 ---
 1 file changed, 94 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d136ff0522e6..b17dcab1bb0c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2985,8 +2985,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp)
put_page(pg);
}
}
-   kfree(cmp->src_pages);
-   kfree(cmp->dst_pages);
+
+   cmp->num_pages = 0;
 }
 
 static int btrfs_cmp_data_prepare(struct inode *src, u64 loff,
@@ -2994,41 +2994,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, 
u64 loff,
  u64 len, struct cmp_pages *cmp)
 {
int ret;
-   int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
-   struct page **src_pgarr, **dst_pgarr;
-
-   /*
-* We must gather up all the pages before we initiate our
-* extent locking. We use an array for the page pointers. Size
-* of the array is bounded by len, which is in turn bounded by
-* BTRFS_MAX_DEDUPE_LEN.
-*/
-   src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   if (!src_pgarr || !dst_pgarr) {
-   kfree(src_pgarr);
-   kfree(dst_pgarr);
-   return -ENOMEM;
-   }
-   cmp->num_pages = num_pages;
-   cmp->src_pages = src_pgarr;
-   cmp->dst_pages = dst_pgarr;
 
/*
 * If deduping ranges in the same inode, locking rules make it mandatory
 * to always lock pages in ascending order to avoid deadlocks with
 * concurrent tasks (such as starting writeback/delalloc).
 */
-   if (src == dst && dst_loff < loff) {
-   swap(src_pgarr, dst_pgarr);
+   if (src == dst && dst_loff < loff)
swap(loff, dst_loff);
-   }
 
-   ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff);
+   cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;
+
+   ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff);
if (ret)
goto out;
 
-   ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff);
+   ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, 
dst_loff);
 
 out:
if (ret)
@@ -3098,31 +3079,23 @@ static int extent_same_check_offsets(struct inode 
*inode, u64 off, u64 *plen,
return 0;
 }
 
-static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
-struct inode *dst, u64 dst_loff)
+static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
+  struct inode *dst, u64 dst_loff,
+  struct cmp_pages *cmp)
 {
int ret;
u64 len = olen;
-   struct cmp_pages cmp;
bool same_inode = (src == dst);
u64 same_lock_start = 0;
u64 same_lock_len = 0;
 
-   if (len == 0)
-   return 0;
-
-   if (same_inode)
-   inode_lock(src);
-   else
-   btrfs_double_inode_lock(src, dst);
-
ret = extent_same_check_offsets(src, loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
ret = extent_same_check_offsets(dst, dst_loff, , olen);
if (ret)
-   goto out_unlock;
+   return ret;
 
if (same_inode) {
/*
@@ -3139,32 +3112,21 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * allow an unaligned length so long as it ends at
 * i_size.
 */
-   if (len != olen) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (len != olen)
+   return -EINVAL;
 
/* Check for overlapping ranges */
-   if (dst_loff + len > loff && dst_loff < loff + len) {
-   ret = -EINVAL;
-   goto out_unlock;
-   }
+   if (dst_loff + len > loff && dst_loff < loff + len)
+   return -EINVAL;
 
same_lock_start = min_t(u64, loff, dst_loff);
same_lock_len = max_t(u64, loff, dst_loff) + len - 

Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-12 Thread Josef Bacik
On Wed, Dec 13, 2017 at 09:20:04AM +1100, Dave Chinner wrote:
> On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> > On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > > From: Josef Bacik 
> > > > 
> > > > Now that we have metadata counters in the VM, we need to provide a way 
> > > > to kick
> > > > writeback on dirty metadata.  Introduce 
> > > > super_operations->write_metadata.  This
> > > > allows file systems to deal with writing back any dirty metadata we 
> > > > need based
> > > > on the writeback needs of the system.  Since there is no inode to key 
> > > > off of we
> > > > need a list in the bdi for dirty super blocks to be added.  From there 
> > > > we can
> > > > find any dirty sb's on the bdi we are currently doing writeback on and 
> > > > call into
> > > > their ->write_metadata callback.
> > > > 
> > > > Signed-off-by: Josef Bacik 
> > > > Reviewed-by: Jan Kara 
> > > > Reviewed-by: Tejun Heo 
> > > > ---
> > > >  fs/fs-writeback.c| 72 
> > > > 
> > > >  fs/super.c   |  6 
> > > >  include/linux/backing-dev-defs.h |  2 ++
> > > >  include/linux/fs.h   |  4 +++
> > > >  mm/backing-dev.c |  2 ++
> > > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 987448ed7698..fba703dff678 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct 
> > > > bdi_writeback *wb,
> > > > return pages;
> > > >  }
> > > >  
> > > > +static long writeback_sb_metadata(struct super_block *sb,
> > > > + struct bdi_writeback *wb,
> > > > + struct wb_writeback_work *work)
> > > > +{
> > > > +   struct writeback_control wbc = {
> > > > +   .sync_mode  = work->sync_mode,
> > > > +   .tagged_writepages  = work->tagged_writepages,
> > > > +   .for_kupdate= work->for_kupdate,
> > > > +   .for_background = work->for_background,
> > > > +   .for_sync   = work->for_sync,
> > > > +   .range_cyclic   = work->range_cyclic,
> > > > +   .range_start= 0,
> > > > +   .range_end  = LLONG_MAX,
> > > > +   };
> > > > +   long write_chunk;
> > > > +
> > > > +   write_chunk = writeback_chunk_size(wb, work);
> > > > +   wbc.nr_to_write = write_chunk;
> > > > +   sb->s_op->write_metadata(sb, );
> > > > +   work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > > +
> > > > +   return write_chunk - wbc.nr_to_write;
> > > 
> > > Ok, writeback_chunk_size() returns a page count. We've already gone
> > > through the "metadata is not page sized" dance on the dirty
> > > accounting side, so how are we supposed to use pages to account for
> > > metadata writeback?
> > > 
> > 
> > This is just one of those things that's going to be slightly shitty.  It's 
> > the
> > same for memory reclaim, all of those places use pages so we just take
> > METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.
> 
> Ok, so that isn't exactly easy to deal with, because all our
> metadata writeback is based on log sequence number targets (i.e. how
> far to push the tail of the log towards the current head). We've
> actually got no idea how pages/bytes actually map to a LSN target
> because while we might account a full buffer as dirty for memory
> reclaim purposes (up to 64k in size), we might have only logged 128
> bytes of it.
> 
> i.e. if we are asked to push 2MB of metadata and we treat that as
> 2MB of log space (i.e. push target of tail LSN + 2MB) we could have
> logged several tens of megabytes of dirty metadata in that LSN
> range and have to flush it all. OTOH, if the buffers are fully
> logged, then that same target might only flush 1.5MB of metadata
> once all the log overhead is taken into account.
> 
> So there's a fairly large disconnect between the "flush N bytes of
> metadata" API and the "push to a target LSN" that XFS uses for
> flushing metadata in aged order. I'm betting that extN and otehr
> filesystems might have similar mismatches with their journal
> flushing...
> 

If there's not a correlation then there's no sense in xfs using this.  If btrfs
has 16gib of dirty metadata then that's exactly how much we have to write out,
which is what this is designed for.

> > > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > > work->tagged_writepages is set, this will basically tell us to flush
> > > the entire dirty metadata cache because write_chunk will get set to
> > > LONG_MAX.
> > > 
> > > IOWs, this would 

Re: [PATCH v9 0/5] Add the ability to do BPF directed error injection

2017-12-12 Thread Darrick J. Wong
On Mon, Dec 11, 2017 at 11:36:45AM -0500, Josef Bacik wrote:
> This is the same as v8, just rebased onto the bpf tree.
> 
> v8->v9:
> - rebased onto the bpf tree.
> 
> v7->v8:
> - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
> 
> v6->v7:
> - moved the opt-in macro to bpf.h out of kprobes.h.
> 
> v5->v6:
> - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
>   feature.  This way only functions that opt-in will be allowed to be
>   overridden.
> - added a btrfs patch to allow error injection for open_ctree() so that the 
> bpf
>   sample actually works.
> 
> v4->v5:
> - disallow kprobe_override programs from being put in the prog map array so we
>   don't tail call into something we didn't check.  This allows us to make the
>   normal path still fast without a bunch of percpu operations.
> 
> v3->v4:
> - fix a build error found by kbuild test bot (I didn't wait long enough
>   apparently.)
> - Added a warning message as per Daniels suggestion.
> 
> v2->v3:
> - added a ->kprobe_override flag to bpf_prog.
> - added some sanity checks to disallow attaching bpf progs that have
>   ->kprobe_override set that aren't for ftrace kprobes.
> - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
>   ftrace kprobe.
> - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read 
> this
>   value in the kprobe path, and thus only write to it if we're overriding or
>   clearing the override.
> 
> v1->v2:
> - moved things around to make sure that bpf_override_return could really only 
> be
>   used for an ftrace kprobe.
> - killed the special return values from trace_call_bpf.
> - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
>   it was being called from an ftrace kprobe context.
> - reworked the logic in kprobe_perf_func to take advantage of 
> bpf_kprobe_state.
> - updated the test as per Alexei's review.
> 
> - Original message -
> 
> A lot of our error paths are not well tested because we have no good way of
> injecting errors generically.  Some subystems (block, memory) have ways to
> inject errors, but they are random so it's hard to get reproduceable results.
> 
> With BPF we can add determinism to our error injection.  We can use kprobes 
> and
> other things to verify we are injecting errors at the exact case we are trying
> to test.  This patch gives us the tool to actual do the error injection part.
> It is very simple, we just set the return value of the pt_regs we're given to
> whatever we provide, and then override the PC with a dummy function that 
> simply
> returns.

Heh, this looks cool.  I decided to try it to see what happens, and saw
a bunch of dmesg pasted in below.  Is that supposed to happen?  Or am I
the only fs developer still running with lockdep enabled? :)

It looks like bpf_override_return has some sort of side effect such that
we get the splat, since commenting it out makes the symptom go away.



--D

[ 1847.769183] BTRFS error (device (null)): open_ctree failed
[ 1847.770130] BUG: sleeping function called from invalid context at 
/storage/home/djwong/cdev/work/linux-xfs/kernel/locking/rwsem.c:69
[ 1847.771976] in_atomic(): 1, irqs_disabled(): 0, pid: 1524, name: mount
[ 1847.773016] 1 lock held by mount/1524:
[ 1847.773530]  #0:  (>s_umount_key#34/1){+.+.}, at: [<653a9bb4>] 
sget_userns+0x302/0x4f0
[ 1847.774731] Preemption disabled at:
[ 1847.774735] [<  (null)>]   (null)
[ 1847.777009] CPU: 2 PID: 1524 Comm: mount Tainted: GW
4.15.0-rc3-xfsx #3
[ 1847.778800] Call Trace:
[ 1847.779047]  dump_stack+0x7c/0xbe
[ 1847.779361]  ___might_sleep+0x1f7/0x260
[ 1847.779720]  down_write+0x29/0xb0
[ 1847.780046]  unregister_shrinker+0x15/0x70
[ 1847.780427]  deactivate_locked_super+0x2e/0x60
[ 1847.780935]  btrfs_mount+0xbb6/0x1000 [btrfs]
[ 1847.781353]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.781750]  ? mount_fs+0xf/0x80
[ 1847.782065]  ? alloc_vfsmnt+0x1a1/0x230
[ 1847.782429]  mount_fs+0xf/0x80
[ 1847.782733]  vfs_kern_mount+0x62/0x160
[ 1847.783128]  btrfs_mount+0x3d3/0x1000 [btrfs]
[ 1847.783493]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.783849]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.784207]  ? mount_fs+0xf/0x80
[ 1847.784502]  mount_fs+0xf/0x80
[ 1847.784835]  vfs_kern_mount+0x62/0x160
[ 1847.785235]  do_mount+0x1b1/0xd50
[ 1847.785594]  ? _copy_from_user+0x5b/0x90
[ 1847.786028]  ? memdup_user+0x4b/0x70
[ 1847.786501]  SyS_mount+0x85/0xd0
[ 1847.786835]  entry_SYSCALL_64_fastpath+0x1f/0x96
[ 1847.787311] RIP: 0033:0x7f6ebecc1b5a
[ 1847.787691] RSP: 002b:7ffc7bd1c958 EFLAGS: 0202 ORIG_RAX: 
00a5
[ 1847.788383] RAX: ffda RBX: 7f6ebefba63a RCX: 7f6ebecc1b5a
[ 1847.789106] RDX: 00bfd010 RSI: 00bfa230 RDI: 00bfa210
[ 1847.789807] RBP: 00bfa0f0 R08:  R09: 0014
[ 1847.790511] R10: c0ed R11: 0202 R12: 7f6ebf1ca83c
[ 

Re: [PATCH v3 0/5] define BTRFS_DEV_STATE

2017-12-12 Thread Anand Jain



On 12/13/2017 01:42 AM, David Sterba wrote:

On Sun, Dec 10, 2017 at 05:15:17PM +0800, Anand Jain wrote:

As of now device properties and states are being represented as int
variable, patches here makes them bit flags instead. Further, wip
patches such as device failed state needs this cleanup.

v2:
  Adds BTRFS_DEV_STATE_REPLACE_TGT
  Adds BTRFS_DEV_STATE_FLUSH_SENT
  Drops BTRFS_DEV_STATE_CAN_DISCARD
  Starts bit flag from the bit 0
  Drops unrelated change - declare btrfs_device

v3:
  Fix static checker warning, define respective dev state as bit number


The define numbers are fixed but the whitespace changes that I made in
misc-next


 Will do next time. Thanks. I don't see misc-next. Is it for-next ?

 git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/for-4.13-part1
  remotes/origin/for-4.14
  remotes/origin/for-4.15-rc2
  remotes/origin/for-4.15-rc3
  remotes/origin/for-chris-4.12
  remotes/origin/for-next
  remotes/origin/for-next-test
  remotes/origin/master
  remotes/origin/nowait-aio-btrfs-fixup
git remote -v
origin  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git (fetch)
origin  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git (push)

Thanks, Anand



while merging the patches are not in v3. I'd expect that you
update your patches if there's another iteration and there have been
changes to the merged ones, or send me only incrementals to the merged
code or instructions how to fix it up if it's a trivial change like
that.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-12 Thread Dave Chinner
On Tue, Dec 12, 2017 at 01:05:35PM -0500, Josef Bacik wrote:
> On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> > On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > > From: Josef Bacik 
> > > 
> > > Now that we have metadata counters in the VM, we need to provide a way to 
> > > kick
> > > writeback on dirty metadata.  Introduce super_operations->write_metadata. 
> > >  This
> > > allows file systems to deal with writing back any dirty metadata we need 
> > > based
> > > on the writeback needs of the system.  Since there is no inode to key off 
> > > of we
> > > need a list in the bdi for dirty super blocks to be added.  From there we 
> > > can
> > > find any dirty sb's on the bdi we are currently doing writeback on and 
> > > call into
> > > their ->write_metadata callback.
> > > 
> > > Signed-off-by: Josef Bacik 
> > > Reviewed-by: Jan Kara 
> > > Reviewed-by: Tejun Heo 
> > > ---
> > >  fs/fs-writeback.c| 72 
> > > 
> > >  fs/super.c   |  6 
> > >  include/linux/backing-dev-defs.h |  2 ++
> > >  include/linux/fs.h   |  4 +++
> > >  mm/backing-dev.c |  2 ++
> > >  5 files changed, 80 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 987448ed7698..fba703dff678 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct 
> > > bdi_writeback *wb,
> > >   return pages;
> > >  }
> > >  
> > > +static long writeback_sb_metadata(struct super_block *sb,
> > > +   struct bdi_writeback *wb,
> > > +   struct wb_writeback_work *work)
> > > +{
> > > + struct writeback_control wbc = {
> > > + .sync_mode  = work->sync_mode,
> > > + .tagged_writepages  = work->tagged_writepages,
> > > + .for_kupdate= work->for_kupdate,
> > > + .for_background = work->for_background,
> > > + .for_sync   = work->for_sync,
> > > + .range_cyclic   = work->range_cyclic,
> > > + .range_start= 0,
> > > + .range_end  = LLONG_MAX,
> > > + };
> > > + long write_chunk;
> > > +
> > > + write_chunk = writeback_chunk_size(wb, work);
> > > + wbc.nr_to_write = write_chunk;
> > > + sb->s_op->write_metadata(sb, );
> > > + work->nr_pages -= write_chunk - wbc.nr_to_write;
> > > +
> > > + return write_chunk - wbc.nr_to_write;
> > 
> > Ok, writeback_chunk_size() returns a page count. We've already gone
> > through the "metadata is not page sized" dance on the dirty
> > accounting side, so how are we supposed to use pages to account for
> > metadata writeback?
> > 
> 
> This is just one of those things that's going to be slightly shitty.  It's the
> same for memory reclaim, all of those places use pages so we just take
> METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.

Ok, so that isn't exactly easy to deal with, because all our
metadata writeback is based on log sequence number targets (i.e. how
far to push the tail of the log towards the current head). We've
actually got no idea how pages/bytes actually map to a LSN target
because while we might account a full buffer as dirty for memory
reclaim purposes (up to 64k in size), we might have only logged 128
bytes of it.

i.e. if we are asked to push 2MB of metadata and we treat that as
2MB of log space (i.e. push target of tail LSN + 2MB) we could have
logged several tens of megabytes of dirty metadata in that LSN
range and have to flush it all. OTOH, if the buffers are fully
logged, then that same target might only flush 1.5MB of metadata
once all the log overhead is taken into account.

So there's a fairly large disconnect between the "flush N bytes of
metadata" API and the "push to a target LSN" that XFS uses for
flushing metadata in aged order. I'm betting that extN and otehr
filesystems might have similar mismatches with their journal
flushing...

> > And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> > work->tagged_writepages is set, this will basically tell us to flush
> > the entire dirty metadata cache because write_chunk will get set to
> > LONG_MAX.
> > 
> > IOWs, this would appear to me to change sync() behaviour quite
> > dramatically on filesystems where ->write_metadata is implemented.
> > That is, instead of leaving all the metadata dirty in memory and
> > just forcing the journal to stable storage, filesystems will be told
> > to also write back all their dirty metadata before sync() returns,
> > even though it is not necessary to provide correct sync()
> > semantics
> 
> Well for btrfs that's exactly what we have currently since it's just backed by
> an inode.

H. That explains a lot.

Seems to me that btrfs is the odd one out here, 

Re: [PATCH] btrfs: Remove pair of bio_get/put in btrfs_schedule_bio

2017-12-12 Thread Liu Bo
On Tue, Dec 12, 2017 at 07:52:38PM +0100, David Sterba wrote:
> On Mon, Dec 11, 2017 at 02:57:56PM -0800, Liu Bo wrote:
> > On Mon, Dec 11, 2017 at 04:38:48PM +0200, Nikolay Borisov wrote:
> > > This code was added in 492bb6deee34 ("Btrfs: Hold a reference on bios
> > > during submit_bio, add some extra bio checks"). However, holding a
> > > reference on a bio is necessary only if it's going to be referenced
> > > after the submit_bio returns and the bio is completed. In this
> > > particular instance this is not the case so there is no need to hold
> > > an extra reference since we directly return.
> > >
> > 
> > Looks good.  If possible, please also drop other unnecessary bio_get()
> > in btrfs.
> 
> You mean everywhere with similar context like in this patch?
>

Right, there're several places where bio_get() is put before running
functions but bio_put() is also put immediately without referencing
bio.

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Minor compression heuristic cleanups

2017-12-12 Thread Timofey Titovets
2017-12-12 23:55 GMT+03:00 David Sterba :
> The callback pointers for radix_sort are not needed, we don't plan to
> export the function now. The compiler is smart enough to replace the
> indirect calls with direct ones, so there's no change in the resulting
> asm code.
>
> David Sterba (3):
>   btrfs: heuristic: open code get_num callback of radix sort
>   btrfs: heuristic: open code copy_call callback of radix sort
>   btrfs: heuristic: call get4bits directly
>
>  fs/btrfs/compression.c | 42 +++---
>  1 file changed, 11 insertions(+), 31 deletions(-)
>
> --
> 2.15.1
>

Thanks!
On whole series:
Reviewed-by: Timofey Titovets

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Question] Two variants of SB reads can we move into bio read instead ?

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 06:27:10PM +0800, Anand Jain wrote:
> 
> David,
> 
>   There are two variants of SB read, one using the device cache [1]
>   and the other buffer head [2].
> 
>   [1] btrfs_read_disk_super()
>   [2] btrfs_read_dev_super()
> 
>   Patch, 6f60cbd3ae442cb35861bb522f388db123d42ec1
>(btrfs: access superblock via pagecache in scan_one_device)
>   Embraced device caches to avoid blocksize set and thus avoid device
>   drop cache. But however its in the context of starting up with a new
>   mount and in practice, would it really matter if the device cache is
>   dropped in the context of mount, we any way drop it when the device
>   is successfully mounted though. I don't understand the actual problem
>   here.

The fix was like 4 years ago, I don't remember all the details but the
problem was reproducible and it's necessary to be fixed as it is by
6f60cbd3ae442cb358.

There is a potential race with udev that calls scan, mkfs that updates
the blocks and repeatedly opens the block device with rw flags (that
triggers udev). And if mount starts in between, funny things can happen
because of the unexpected set_blocksize calls.

>   Further [2] is still using buffer head, which works very well for us
>   in this context, any idea if there is any suggestion to move it to
>   newer bio read instead ?

No special reason to use buffer_head, just that it's been there and is
straightforward to use. Conversion to the bio interface is desired,
though this might not be that easy as it looks as the bh interface hides
some corner cases of the interaction with page cache.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] Qgroup metadata reservation rework

2017-12-12 Thread David Sterba
On Tue, Dec 12, 2017 at 03:34:22PM +0800, Qu Wenruo wrote:
> The patch is consist of 2 main parts:
> 1) Type based qgroup reservation
>The original patchset is sent several months ago.
>Nothing is modified at all, just rebased. And not conflict at all.
> 
>It's from patch 1 to patch 6.
> 
> 2) Split meta qgroup reservation into per-trans and prealloc sub types
>The real work to address metadata underflow.
>Due to the over-reserve problem, this part is still in RFC state.
>But the framework should mostly be fine, only needs extra fine-tuning
>to get more accurate qgroup rsv to avoid too early limit.
> 
>It's from patch 7 to 14.

I'm going to add the whole patchset to next, the first part has been
there for some time and no test failures were reported. I optimistically
expect that the second part will also be fine. However I'm not sure how
good is the test coverage of qgroups in fstests, so lack of reports may
not mean anything.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] More unnecessary argument cleanups

2017-12-12 Thread David Sterba
Another step peeling off the unnecessary arguments, this time the
unlock_extent*.

David Sterba (2):
  btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC
  btrfs: sink unlock_extent parameter gfp_flags

 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/extent_io.c| 12 +---
 fs/btrfs/extent_io.h| 11 +--
 fs/btrfs/file.c | 15 +++
 fs/btrfs/free-space-cache.c |  5 ++---
 fs/btrfs/inode.c| 26 --
 fs/btrfs/ioctl.c|  7 +++
 fs/btrfs/scrub.c|  3 +--
 8 files changed, 40 insertions(+), 41 deletions(-)

-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: sink unlock_extent parameter gfp_flags

2017-12-12 Thread David Sterba
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/extent_io.c| 10 --
 fs/btrfs/extent_io.h|  4 ++--
 fs/btrfs/file.c | 15 +++
 fs/btrfs/free-space-cache.c |  5 ++---
 fs/btrfs/inode.c| 26 --
 fs/btrfs/ioctl.c|  7 +++
 fs/btrfs/scrub.c|  3 +--
 8 files changed, 32 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a3e9b74f6006..cb14c2188e6f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -381,7 +381,7 @@ static int verify_parent_transid(struct extent_io_tree 
*io_tree,
clear_extent_buffer_uptodate(eb);
 out:
unlock_extent_cached(io_tree, eb->start, eb->start + eb->len - 1,
-_state, GFP_NOFS);
+_state);
if (need_lock)
btrfs_tree_read_unlock_blocking(eb);
return ret;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 90ecccbf0211..7980f19dbe49 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1647,7 +1647,7 @@ STATIC u64 find_lock_delalloc_range(struct inode *inode,
 EXTENT_DELALLOC, 1, cached_state);
if (!ret) {
unlock_extent_cached(tree, delalloc_start, delalloc_end,
-_state, GFP_NOFS);
+_state);
__unlock_for_delalloc(inode, locked_page,
  delalloc_start, delalloc_end);
cond_resched();
@@ -2942,8 +2942,7 @@ static int __do_readpage(struct extent_io_tree *tree,
set_extent_uptodate(tree, cur, cur + iosize - 1,
, GFP_NOFS);
unlock_extent_cached(tree, cur,
-cur + iosize - 1,
-, GFP_NOFS);
+cur + iosize - 1, );
break;
}
em = __get_extent_map(inode, page, pg_offset, cur,
@@ -3036,8 +3035,7 @@ static int __do_readpage(struct extent_io_tree *tree,
set_extent_uptodate(tree, cur, cur + iosize - 1,
, GFP_NOFS);
unlock_extent_cached(tree, cur,
-cur + iosize - 1,
-, GFP_NOFS);
+cur + iosize - 1, );
cur = cur + iosize;
pg_offset += iosize;
continue;
@@ -4632,7 +4630,7 @@ int extent_fiemap(struct inode *inode, struct 
fiemap_extent_info *fieinfo,
 out:
btrfs_free_path(path);
unlock_extent_cached(_I(inode)->io_tree, start, start + len - 1,
-_state, GFP_NOFS);
+_state);
return ret;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 978351e8e8dc..72e5af2965a8 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -312,10 +312,10 @@ static inline int unlock_extent(struct extent_io_tree 
*tree, u64 start, u64 end)
 }
 
 static inline int unlock_extent_cached(struct extent_io_tree *tree, u64 start,
-   u64 end, struct extent_state **cached, gfp_t mask)
+   u64 end, struct extent_state **cached)
 {
return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
-   mask, NULL);
+   GFP_NOFS, NULL);
 }
 
 static inline int unlock_extent_cached_atomic(struct extent_io_tree *tree,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1096398e1351..1ed2e6e9e204 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1504,7 +1504,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode 
*inode, struct page **pages,
ordered->file_offset + ordered->len > start_pos &&
ordered->file_offset <= last_pos) {
unlock_extent_cached(>io_tree, start_pos,
-   last_pos, cached_state, GFP_NOFS);
+   last_pos, cached_state);
for (i = 0; i < num_pages; i++) {
unlock_page(pages[i]);
put_page(pages[i]);
@@ -1758,8 +1758,7 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
pos, copied, NULL);
if (extents_locked)

[PATCH 1/2] btrfs: add separate helper for unlock_extent_cached with GFP_ATOMIC

2017-12-12 Thread David Sterba
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 2 +-
 fs/btrfs/extent_io.h | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6cd3da16f114..90ecccbf0211 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2492,7 +2492,7 @@ endio_readpage_release_extent(struct extent_io_tree 
*tree, u64 start, u64 len,
 
if (uptodate && tree->track_uptodate)
set_extent_uptodate(tree, start, end, , GFP_ATOMIC);
-   unlock_extent_cached(tree, start, end, , GFP_ATOMIC);
+   unlock_extent_cached_atomic(tree, start, end, );
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index db2558b0cad4..978351e8e8dc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -318,6 +318,13 @@ static inline int unlock_extent_cached(struct 
extent_io_tree *tree, u64 start,
mask, NULL);
 }
 
+static inline int unlock_extent_cached_atomic(struct extent_io_tree *tree,
+   u64 start, u64 end, struct extent_state **cached)
+{
+   return __clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, cached,
+   GFP_ATOMIC, NULL);
+}
+
 static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
u64 end, unsigned bits)
 {
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs: heuristic: open code copy_call callback of radix sort

2017-12-12 Thread David Sterba
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 37a69d4b04ce..935acabc0ea7 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1311,13 +1311,6 @@ static u8 get4bits(u64 num, int shift) {
return low4bits;
 }
 
-static void copy_cell(void *dst, int dest_i, void *src, int src_i)
-{
-   struct bucket_item *dstv = (struct bucket_item *)dst;
-   struct bucket_item *srcv = (struct bucket_item *)src;
-   dstv[dest_i] = srcv[src_i];
-}
-
 /*
  * Use 4 bits as radix base
  * Use 16 u32 counters for calculating new possition in buf array
@@ -1326,13 +1319,10 @@ static void copy_cell(void *dst, int dest_i, void *src, 
int src_i)
  * @array_buf - buffer array to store sorting results
  *  must be equal in size to @array
  * @num   - array size
- * @copy_cell - function to copy data from array to array_buf and vice versa
  * @get4bits  - function to get 4 bits from number at specified offset
  */
 static void radix_sort(struct bucket_item *array, struct bucket_item 
*array_buf,
   int num,
-  void (*copy_cell)(void *dest, int dest_i,
-void* src, int src_i),
   u8 (*get4bits)(u64 num, int shift))
 {
u64 max_num;
@@ -1376,7 +1366,7 @@ static void radix_sort(struct bucket_item *array, struct 
bucket_item *array_buf,
addr = get4bits(buf_num, shift);
counters[addr]--;
new_addr = counters[addr];
-   copy_cell(array_buf, new_addr, array, i);
+   array_buf[new_addr] = array[i];
}
 
shift += RADIX_BASE;
@@ -1403,7 +1393,7 @@ static void radix_sort(struct bucket_item *array, struct 
bucket_item *array_buf,
addr = get4bits(buf_num, shift);
counters[addr]--;
new_addr = counters[addr];
-   copy_cell(array, new_addr, array_buf, i);
+   array[new_addr] = array_buf[i];
}
 
shift += RADIX_BASE;
@@ -1437,8 +1427,7 @@ static int byte_core_set_size(struct heuristic_ws *ws)
struct bucket_item *bucket = ws->bucket;
 
/* Sort in reverse order */
-   radix_sort(ws->bucket, ws->bucket_b, BUCKET_SIZE, copy_cell,
-   get4bits);
+   radix_sort(ws->bucket, ws->bucket_b, BUCKET_SIZE, get4bits);
 
for (i = 0; i < BYTE_CORE_SET_LOW; i++)
coreset_sum += bucket[i].count;
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs: heuristic: call get4bits directly

2017-12-12 Thread David Sterba
As it's a single instance and local to the file, we don't need to pass
it as an argument.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 935acabc0ea7..208334aa6c6e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1319,11 +1319,9 @@ static u8 get4bits(u64 num, int shift) {
  * @array_buf - buffer array to store sorting results
  *  must be equal in size to @array
  * @num   - array size
- * @get4bits  - function to get 4 bits from number at specified offset
  */
 static void radix_sort(struct bucket_item *array, struct bucket_item 
*array_buf,
-  int num,
-  u8 (*get4bits)(u64 num, int shift))
+  int num)
 {
u64 max_num;
u64 buf_num;
@@ -1427,7 +1425,7 @@ static int byte_core_set_size(struct heuristic_ws *ws)
struct bucket_item *bucket = ws->bucket;
 
/* Sort in reverse order */
-   radix_sort(ws->bucket, ws->bucket_b, BUCKET_SIZE, get4bits);
+   radix_sort(ws->bucket, ws->bucket_b, BUCKET_SIZE);
 
for (i = 0; i < BYTE_CORE_SET_LOW; i++)
coreset_sum += bucket[i].count;
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] btrfs: heuristic: open code get_num callback of radix sort

2017-12-12 Thread David Sterba
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it and also make the array types explicit.

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 25 +
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 28c3940062b7..37a69d4b04ce 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1318,12 +1318,6 @@ static void copy_cell(void *dst, int dest_i, void *src, 
int src_i)
dstv[dest_i] = srcv[src_i];
 }
 
-static u64 get_num(const void *a, int i)
-{
-   struct bucket_item *av = (struct bucket_item *)a;
-   return av[i].count;
-}
-
 /*
  * Use 4 bits as radix base
  * Use 16 u32 counters for calculating new possition in buf array
@@ -1332,12 +1326,11 @@ static u64 get_num(const void *a, int i)
  * @array_buf - buffer array to store sorting results
  *  must be equal in size to @array
  * @num   - array size
- * @get_num   - function to extract number from array
  * @copy_cell - function to copy data from array to array_buf and vice versa
  * @get4bits  - function to get 4 bits from number at specified offset
  */
-static void radix_sort(void *array, void *array_buf, int num,
-  u64 (*get_num)(const void *, int i),
+static void radix_sort(struct bucket_item *array, struct bucket_item 
*array_buf,
+  int num,
   void (*copy_cell)(void *dest, int dest_i,
 void* src, int src_i),
   u8 (*get4bits)(u64 num, int shift))
@@ -1355,9 +1348,9 @@ static void radix_sort(void *array, void *array_buf, int 
num,
 * Try avoid useless loop iterations for small numbers stored in big
 * counters.  Example: 48 33 4 ... in 64bit array
 */
-   max_num = get_num(array, 0);
+   max_num = array[0].count;
for (i = 1; i < num; i++) {
-   buf_num = get_num(array, i);
+   buf_num = array[i].count;
if (buf_num > max_num)
max_num = buf_num;
}
@@ -1370,7 +1363,7 @@ static void radix_sort(void *array, void *array_buf, int 
num,
memset(counters, 0, sizeof(counters));
 
for (i = 0; i < num; i++) {
-   buf_num = get_num(array, i);
+   buf_num = array[i].count;
addr = get4bits(buf_num, shift);
counters[addr]++;
}
@@ -1379,7 +1372,7 @@ static void radix_sort(void *array, void *array_buf, int 
num,
counters[i] += counters[i - 1];
 
for (i = num - 1; i >= 0; i--) {
-   buf_num = get_num(array, i);
+   buf_num = array[i].count;
addr = get4bits(buf_num, shift);
counters[addr]--;
new_addr = counters[addr];
@@ -1397,7 +1390,7 @@ static void radix_sort(void *array, void *array_buf, int 
num,
memset(counters, 0, sizeof(counters));
 
for (i = 0; i < num; i ++) {
-   buf_num = get_num(array_buf, i);
+   buf_num = array_buf[i].count;
addr = get4bits(buf_num, shift);
counters[addr]++;
}
@@ -1406,7 +1399,7 @@ static void radix_sort(void *array, void *array_buf, int 
num,
counters[i] += counters[i - 1];
 
for (i = num - 1; i >= 0; i--) {
-   buf_num = get_num(array_buf, i);
+   buf_num = array_buf[i].count;
addr = get4bits(buf_num, shift);
counters[addr]--;
new_addr = counters[addr];
@@ -1444,7 +1437,7 @@ static int byte_core_set_size(struct heuristic_ws *ws)
struct bucket_item *bucket = ws->bucket;
 
/* Sort in reverse order */
-   radix_sort(ws->bucket, ws->bucket_b, BUCKET_SIZE, get_num, copy_cell,
+   radix_sort(ws->bucket, ws->bucket_b, BUCKET_SIZE, copy_cell,
get4bits);
 
for (i = 0; i < BYTE_CORE_SET_LOW; i++)
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Minor compression heuristic cleanups

2017-12-12 Thread David Sterba
The callback pointers for radix_sort are not needed, we don't plan to
export the function now. The compiler is smart enough to replace the
indirect calls with direct ones, so there's no change in the resulting
asm code.

David Sterba (3):
  btrfs: heuristic: open code get_num callback of radix sort
  btrfs: heuristic: open code copy_call callback of radix sort
  btrfs: heuristic: call get4bits directly

 fs/btrfs/compression.c | 42 +++---
 1 file changed, 11 insertions(+), 31 deletions(-)

-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/4] btrfs: cleanup mount path

2017-12-12 Thread David Sterba
On Mon, Oct 16, 2017 at 05:15:38PM +0200, David Sterba wrote:
> On Mon, Sep 25, 2017 at 04:26:30PM +0900, Misono, Tomohiro wrote:
> > Summary:
> > Cleanup mount path by avoiding calling btrfs_mount() twice.
> > No functional change. See below for longer explanation.
> > 
> > Changelog:
> > v3:
> >   Reorganized patches again into four and added comments to the source.
> >   Each patch can be applied and compiled while maintaining functionality.
> >   The first one is the preparation and the second one is the main part.
> >   The last two are small cleanups.
> 
> I'm conditionally adding this series to for-next for testing.

No problems while this patch has been in for-next, so I'm going to add
that to 4.16 queue. Please update the patchset according to my comments.
Also please review if there's anything relevant for th MS_* and SB_*
rename that happened in 4.15 (1751e8a6cb935). Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] btrfs: split parse_early_options() in two

2017-12-12 Thread David Sterba
On Mon, Sep 25, 2017 at 04:28:36PM +0900, Misono, Tomohiro wrote:
> Now parse_early_options() is used by both btrfs_mount() and mount_root().
> However, the former only needs subvol related part and the latter needs
> the others.
> 
> Therefore extract the subvol related parts from parse_early_options() and
> move it to new parse function (parse_subvol_options()).
> 
> Signed-off-by: Tomohiro Misono 
> ---
>  fs/btrfs/super.c | 85 
> +++-
>  1 file changed, 60 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 1c34ca6..7edd74d 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -448,7 +448,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   case Opt_subvolrootid:
>   case Opt_device:
>   /*
> -  * These are parsed by btrfs_parse_early_options
> +  * These are parsed by btrfs_parse_subvol_options
> +  * and btrfs_parse_early_options
>* and can be happily ignored here.
>*/
>   break;
> @@ -855,11 +856,63 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> char *options,
>   * only when we need to allocate a new super block.
>   */
>  static int btrfs_parse_early_options(const char *options, fmode_t flags,
> - void *holder, char **subvol_name, u64 *subvol_objectid,
> - struct btrfs_fs_devices **fs_devices)
> + void *holder, struct btrfs_fs_devices **fs_devices)
>  {
>   substring_t args[MAX_OPT_ARGS];
>   char *device_name, *opts, *orig, *p;
> + int error = 0;
> +
> + if (!options)
> + return 0;
> +
> + /*
> +  * strsep changes the string, duplicate it because btrfs_parse_options
> +  * gets called later
> +  */
> + opts = kstrdup(options, GFP_KERNEL);
> + if (!opts)
> + return -ENOMEM;
> + orig = opts;
> +
> + while ((p = strsep(, ",")) != NULL) {
> + int token;
> + if (!*p)
> + continue;
> +
> + token = match_token(p, tokens, args);
> + switch (token) {
> + case Opt_device:

Please rewrite this as a simple 'if'.

> + device_name = match_strdup([0]);
> + if (!device_name) {
> + error = -ENOMEM;
> + goto out;
> + }
> + error = btrfs_scan_one_device(device_name,
> + flags, holder, fs_devices);
> + kfree(device_name);
> + if (error)
> + goto out;
> + break;
> + default:
> + break;
> + }
> + }
> +
> +out:
> + kfree(orig);
> + return error;
> +}
> +
> +/*
> + * Parse mount options that are related to subvolume id
> + *
> + * The value is later passed to mount_subvol()
> + */
> +static int btrfs_parse_subvol_options(const char *options, fmode_t flags,
> + void *holder, char **subvol_name, u64 *subvol_objectid)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + char *opts, *orig, *p;
>   char *num = NULL;
>   int error = 0;
>  
> @@ -867,8 +920,8 @@ static int btrfs_parse_early_options(const char *options, 
> fmode_t flags,
>   return 0;
>  
>   /*
> -  * strsep changes the string, duplicate it because parse_options
> -  * gets called twice
> +  * strsep changes the string, duplicate it because
> +  * btrfs_parse_early_options gets called later
>*/
>   opts = kstrdup(options, GFP_KERNEL);
>   if (!opts)
> @@ -907,18 +960,6 @@ static int btrfs_parse_early_options(const char 
> *options, fmode_t flags,
>   case Opt_subvolrootid:
>   pr_warn("BTRFS: 'subvolrootid' mount option is 
> deprecated and has no effect\n");
>   break;
> - case Opt_device:
> - device_name = match_strdup([0]);
> - if (!device_name) {
> - error = -ENOMEM;
> - goto out;
> - }
> - error = btrfs_scan_one_device(device_name,
> - flags, holder, fs_devices);
> - kfree(device_name);
> - if (error)
> - goto out;
> - break;
>   default:
>   break;
>   }
> @@ -1492,18 +1533,14 @@ static struct dentry *mount_root(struct 
> file_system_type *fs_type, int flags,
>   struct btrfs_fs_info *fs_info = NULL;
>   struct security_mnt_opts new_sec_opts;
>   fmode_t mode = FMODE_READ;
> - 

Re: [PATCH 1/4] btrfs: add mount_root() and new file_system_type

2017-12-12 Thread David Sterba
On Mon, Sep 25, 2017 at 04:27:07PM +0900, Misono, Tomohiro wrote:
> Add mount_root() and new file_system_type for preparation of cleanup of
> btrfs_mount(). Code path is not changed yet.
> 
> mount_root() is almost the same as current btrfs_mount(), but doesn't
> have subvolume related part.
> 
> Signed-off-by: Tomohiro Misono 
> ---
>  fs/btrfs/super.c | 116 
> +++
>  1 file changed, 116 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 12540b6..fe43606 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -66,6 +66,7 @@
>  
>  static const struct super_operations btrfs_super_ops;
>  static struct file_system_type btrfs_fs_type;
> +static struct file_system_type btrfs_root_fs_type;

A comment why we need this would be useful.

>  
>  static int btrfs_remount(struct super_block *sb, int *flags, char *data);
>  
> @@ -1514,6 +1515,112 @@ static int setup_security_options(struct 
> btrfs_fs_info *fs_info,
>   return ret;
>  }
>  
> +static struct dentry *mount_root(struct file_system_type *fs_type, int flags,

'mount_root' is too generic and could be confused with other mount_*
functions from VFS, btrfs_mount_root would be better.

> + const char *device_name, void *data)
> +{
> + struct block_device *bdev = NULL;
> + struct super_block *s;
> + struct btrfs_fs_devices *fs_devices = NULL;
> + struct btrfs_fs_info *fs_info = NULL;
> + struct security_mnt_opts new_sec_opts;
> + fmode_t mode = FMODE_READ;
> + char *subvol_name = NULL;
> + u64 subvol_objectid = 0;
> + int error = 0;
> +
> + if (!(flags & MS_RDONLY))
> + mode |= FMODE_WRITE;
> +
> + error = btrfs_parse_early_options(data, mode, fs_type,
> +   _name, _objectid,
> +   _devices);
> + if (error) {
> + kfree(subvol_name);
> + return ERR_PTR(error);
> + }
> +
> + security_init_mnt_opts(_sec_opts);
> + if (data) {
> + error = parse_security_options(data, _sec_opts);
> + if (error)
> + return ERR_PTR(error);
> + }
> +
> + error = btrfs_scan_one_device(device_name, mode, fs_type, _devices);
> + if (error)
> + goto error_sec_opts;
> +
> + /*
> +  * Setup a dummy root and fs_info for test/set super.  This is because
> +  * we don't actually fill this stuff out until open_ctree, but we need
> +  * it for searching for existing supers, so this lets us do that and
> +  * then open_ctree will properly initialize everything later.
> +  */
> + fs_info = kzalloc(sizeof(struct btrfs_fs_info), GFP_NOFS);

Why GFP_NOFS?

> + if (!fs_info) {
> + error = -ENOMEM;
> + goto error_sec_opts;
> + }
> +
> + fs_info->fs_devices = fs_devices;
> +
> + fs_info->super_copy = kzalloc(BTRFS_SUPER_INFO_SIZE, GFP_NOFS);
> + fs_info->super_for_commit = kzalloc(BTRFS_SUPER_INFO_SIZE, GFP_NOFS);

Also here. Current code uses GFP_KERNEL so this might got switched in
the meantim.

> + security_init_mnt_opts(_info->security_opts);
> + if (!fs_info->super_copy || !fs_info->super_for_commit) {
> + error = -ENOMEM;
> + goto error_fs_info;
> + }
> +
> + error = btrfs_open_devices(fs_devices, mode, fs_type);
> + if (error)
> + goto error_fs_info;
> +
> + if (!(flags & MS_RDONLY) && fs_devices->rw_devices == 0) {
> + error = -EACCES;
> + goto error_close_devices;
> + }
> +
> + bdev = fs_devices->latest_bdev;
> + s = sget(fs_type, btrfs_test_super, btrfs_set_super, flags | MS_NOSEC,
> +  fs_info);
> + if (IS_ERR(s)) {
> + error = PTR_ERR(s);
> + goto error_close_devices;
> + }
> +
> + if (s->s_root) {
> + btrfs_close_devices(fs_devices);
> + free_fs_info(fs_info);
> + if ((flags ^ s->s_flags) & MS_RDONLY)
> + error = -EBUSY;
> + } else {
> + snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
> + btrfs_sb(s)->bdev_holder = fs_type;
> + error = btrfs_fill_super(s, fs_devices, data);
> + }
> + if (error) {
> + deactivate_locked_super(s);
> + goto error_sec_opts;
> + }
> +
> + fs_info = btrfs_sb(s);
> + error = setup_security_options(fs_info, s, _sec_opts);
> + if (error) {
> + deactivate_locked_super(s);
> + goto error_sec_opts;
> + }
> +
> + return dget(s->s_root);
> +
> +error_close_devices:
> + btrfs_close_devices(fs_devices);
> +error_fs_info:
> + free_fs_info(fs_info);
> +error_sec_opts:
> + security_free_mnt_opts(_sec_opts);
> + return ERR_PTR(error);
> +}
>  /*
>   * Find a superblock for the given device / mount point.
>   *
> @@ 

Re: [PATCH v3] Btrfs: heuristic replace heap sort with radix sort

2017-12-12 Thread David Sterba
On Tue, Dec 05, 2017 at 11:02:08AM +0300, Timofey Titovets wrote:
> Slowest part of heuristic for now is kernel heap sort()
> It's can take up to 55% of runtime on sorting bucket items.
> 
> As sorting will always call on most data sets to get correctly
> byte_core_set_size, the only way to speed up heuristic, is to
> speed up sort on bucket.
> 
> Add a general radix_sort function.
> Radix sort require 2 buffers, one full size of input array
> and one for store counters (jump addresses).
> 
> That increase usage per heuristic workspace +1KiB
> 8KiB + 1KiB -> 8KiB + 2KiB
> 
> That is LSD Radix, i use 4 bit as a base for calculating,
> to make counters array acceptable small (16 elements * 8 byte).
> 
> That Radix sort implementation have several points to adjust,
> I added him to make radix sort general usable in kernel,
> like heap sort, if needed.
> 
> Performance tested in userspace copy of heuristic code,
> throughput:
> - average <-> random data: ~3500 MiB/s - heap  sort
> - average <-> random data: ~6000 MiB/s - radix sort
> 
> Changes:
>   v1 -> v2:
> - Tested on Big Endian
> - Drop most of multiply operations
> - Separately allocate sort buffer
> 
> Changes:
>   v2 -> v3:
> - Fix uint -> u conversion
> - Reduce stack size, by reduce vars sizes to u32,
>   restrict input array size to u32
>   Assume that kernel will never try sorting arrays > 2^32
> - Drop max_cell arg (precheck - correctly find max value by it self)

I had some a few more style fixes merging v2 so I updated the patch with
v2/v3 manually. The arrays to sort will be typically small compared to
2^32, so using int is safe.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: raid56: fix race between merge_bio and rbio_orig_end_io

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 04:02:35PM -0700, Liu Bo wrote:
> We're not allowed to take any new bios to rbio->bio_list in
> rbio_orig_end_io(), otherwise we may get merged with more bios and
> rbio->bio_list is not empty.
> 
> This should only happens in error-out cases, the normal path of
> recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to
> disable merge before doing IO.
> 
> Reported-by: Jérôme Carretero 
> Signed-off-by: Liu Bo 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Improve btrfs_search_slot description

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 05:06:25PM +0200, Nikolay Borisov wrote:
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/ctree.c | 32 ++--
>  1 file changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 880f4f693263..1f001d31bda8 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2653,18 +2653,30 @@ int btrfs_find_item(struct btrfs_root *fs_root, 
> struct btrfs_path *path,
>   return 0;
>  }
>  
> -/*
> - * look for key in the tree.  path is filled in with nodes along the way
> - * if key is found, we return zero and you can find the item in the leaf
> - * level of the path (level 0)
> +/* btrfs_search_slot - look for a key in a tree and perform necessary

 /*
  * ...

> + * modifications to preserve tree invariants.
> + *
> + * @trans:   Handle of transaction, used when modifying the tree
> + * @p:   Holds all btree nodes along the search path
> + * @root:The root node of the tree
> + * @key: The key we are looking for
> + * @ins_len: Indicates purpose of search, for inserts it is 1, for
> + *   deletions it's -1. 0 for plain searches
> + * @cow: boolean should CoW operations be performed. Must always be 1
> + *   when modifying the tree.
> + *
> + * If @ins_len > 0, nodes and leaves will be split as we walk down the tree.
> + * If @ins_len < 0, nodes will be merged as we walk down the tree (if 
> possible)
> + *
> + * If @key is found, 0 is returned and you can find the item in the leaf 
> level
> + * of the path (level 0)
>   *
> - * If the key isn't found, the path points to the slot where it should
> - * be inserted, and 1 is returned.  If there are other errors during the
> - * search a negative error number is returned.
> + * If @key isn't found, 1 is returned and the leaf level of the path (level 
> 0)
> + * points to the slot where it should be inserted
> + * be inserted, and 1 is returned.

looks like some editing artifact

>   *
> - * if ins_len > 0, nodes and leaves will be split as we walk down the
> - * tree.  if ins_len < 0, nodes will be merged as we walk down the tree (if
> - * possible)
> + * If an error is encountered while searching the tree a negative error 
> number
> + * is returned
>   */
>  int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root 
> *root,
> const struct btrfs_key *key, struct btrfs_path *p,
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Rename bin_search -> btrfs_bin_search

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 04:27:43PM +0200, Nikolay Borisov wrote:
> Currently there are 2 function doing binary search on btrfs nodes:
> bin_search and btrfs_bin_search. The latter being a simple wrapper for
> the former. So eliminate the wrapper and just rename bin_search to
> btrfs_bin_search. No functional changes
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: sink extent_write_full_page tree argument

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 03:55:59PM +0200, Nikolay Borisov wrote:
> The tree argument passed to extent_write_full_page is referenced from
> the page being passed to the same function. Since we already have
> enough information to get the reference, remove the function parameter.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs: sink extent_write_locked_range tree parameter

2017-12-12 Thread David Sterba
On Fri, Dec 08, 2017 at 03:55:58PM +0200, Nikolay Borisov wrote:
> This function is called only from submit_compressed_extents and the
> io tree being passed is always that of the inode. But we are also
> passing the inode, so just move getting the io tree pointer in
> extent_write_locked_range to simplify the signature.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Remove pair of bio_get/put in btrfs_schedule_bio

2017-12-12 Thread David Sterba
On Mon, Dec 11, 2017 at 02:57:56PM -0800, Liu Bo wrote:
> On Mon, Dec 11, 2017 at 04:38:48PM +0200, Nikolay Borisov wrote:
> > This code was added in 492bb6deee34 ("Btrfs: Hold a reference on bios
> > during submit_bio, add some extra bio checks"). However, holding a
> > reference on a bio is necessary only if it's going to be referenced
> > after the submit_bio returns and the bio is completed. In this
> > particular instance this is not the case so there is no need to hold
> > an extra reference since we directly return.
> >
> 
> Looks good.  If possible, please also drop other unnecessary bio_get()
> in btrfs.

You mean everywhere with similar context like in this patch?

> Reviewed-by: Liu Bo 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs-progs: doc: remove explanation of 'X' attribute

2017-12-12 Thread David Sterba
On Tue, Dec 12, 2017 at 04:08:17PM +0900, Misono, Tomohiro wrote:
> e2fsprogs has removed compression support since v1.43 and there is no field
> 'X' (no compress) for lxattr now. So, just remove the explanation.

Hm right, I hoped that we could export the NOCOMP flag through
lsattr/chattr, but this relied on a compile-time compression support of
e2fsprogs.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Fix out of bounds access in btrfs_search_slot

2017-12-12 Thread David Sterba
On Tue, Dec 12, 2017 at 11:14:49AM +0200, Nikolay Borisov wrote:
> When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
> the level variable is going to be 7 (this is the max height of the
> tree). On the other hand btrfs_cow_block is always called with
> "level + 1" as an index into the nodes and slots arrays. This leads to
> an out of bounds access. Admittdely this will be benign since an OOB
> access of the nodes array will likely read the 0th element from the
> slots array, which in this case is going to be 0 (since we start CoW at
> the top of the tree). The OOB access into the slots array in turn will
> read the 0th and 1st values of the locks array, which would both be 0
> at the time. However, this benign behavior relies on the fact that the 
> path being passed hasn't been initialised, if it has already been used to 
> query a btree then it could potentially have populated the nodes/slots arrays.
> 
> Fix it by explicitly checking if we are at level 7 (the maximum allowed
> index in nodes/slots arrays) and explicitly call the CoW routine with
> NULL for parent's node/slot.
> 
> Signed-off-by: Nikolay Borisov 
> Fixes-coverity-id: 711515

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata

2017-12-12 Thread Josef Bacik
On Tue, Dec 12, 2017 at 10:36:19AM +1100, Dave Chinner wrote:
> On Mon, Dec 11, 2017 at 04:55:31PM -0500, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > Now that we have metadata counters in the VM, we need to provide a way to 
> > kick
> > writeback on dirty metadata.  Introduce super_operations->write_metadata.  
> > This
> > allows file systems to deal with writing back any dirty metadata we need 
> > based
> > on the writeback needs of the system.  Since there is no inode to key off 
> > of we
> > need a list in the bdi for dirty super blocks to be added.  From there we 
> > can
> > find any dirty sb's on the bdi we are currently doing writeback on and call 
> > into
> > their ->write_metadata callback.
> > 
> > Signed-off-by: Josef Bacik 
> > Reviewed-by: Jan Kara 
> > Reviewed-by: Tejun Heo 
> > ---
> >  fs/fs-writeback.c| 72 
> > 
> >  fs/super.c   |  6 
> >  include/linux/backing-dev-defs.h |  2 ++
> >  include/linux/fs.h   |  4 +++
> >  mm/backing-dev.c |  2 ++
> >  5 files changed, 80 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 987448ed7698..fba703dff678 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1479,6 +1479,31 @@ static long writeback_chunk_size(struct 
> > bdi_writeback *wb,
> > return pages;
> >  }
> >  
> > +static long writeback_sb_metadata(struct super_block *sb,
> > + struct bdi_writeback *wb,
> > + struct wb_writeback_work *work)
> > +{
> > +   struct writeback_control wbc = {
> > +   .sync_mode  = work->sync_mode,
> > +   .tagged_writepages  = work->tagged_writepages,
> > +   .for_kupdate= work->for_kupdate,
> > +   .for_background = work->for_background,
> > +   .for_sync   = work->for_sync,
> > +   .range_cyclic   = work->range_cyclic,
> > +   .range_start= 0,
> > +   .range_end  = LLONG_MAX,
> > +   };
> > +   long write_chunk;
> > +
> > +   write_chunk = writeback_chunk_size(wb, work);
> > +   wbc.nr_to_write = write_chunk;
> > +   sb->s_op->write_metadata(sb, );
> > +   work->nr_pages -= write_chunk - wbc.nr_to_write;
> > +
> > +   return write_chunk - wbc.nr_to_write;
> 
> Ok, writeback_chunk_size() returns a page count. We've already gone
> through the "metadata is not page sized" dance on the dirty
> accounting side, so how are we supposed to use pages to account for
> metadata writeback?
> 

This is just one of those things that's going to be slightly shitty.  It's the
same for memory reclaim, all of those places use pages so we just take
METADATA_*_BYTES >> PAGE_SHIFT to get pages and figure it's close enough.

> And, from what I can tell, if work->sync_mode = WB_SYNC_ALL or
> work->tagged_writepages is set, this will basically tell us to flush
> the entire dirty metadata cache because write_chunk will get set to
> LONG_MAX.
> 
> IOWs, this would appear to me to change sync() behaviour quite
> dramatically on filesystems where ->write_metadata is implemented.
> That is, instead of leaving all the metadata dirty in memory and
> just forcing the journal to stable storage, filesystems will be told
> to also write back all their dirty metadata before sync() returns,
> even though it is not necessary to provide correct sync()
> semantics

Well for btrfs that's exactly what we have currently since it's just backed by
an inode.  Obviously this is different for journaled fs'es, but I assumed that
in your case you would either not use this part of the infrastructure or simply
ignore WB_SYNC_ALL and use WB_SYNC_NONE as a way to be nice under memory
pressure or whatever.

> 
> Mind you, writeback invocation is so convoluted now I could easily
> be mis-interpretting this code, but it does seem to me like this
> code is going to have some unintended behaviours
> 

I don't think so, because right now this behavior is exactly what btrfs has
currently with it's inode setup.  I didn't really think the journaled use case
out since you guys are already rate limited by the journal.  If you would want
to start using this stuff what would you like to see done instead?  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] Qgroup metadata reservation rework

2017-12-12 Thread David Sterba
On Tue, Dec 12, 2017 at 04:16:08PM +0200, Nikolay Borisov wrote:
> 
> 
> On 12.12.2017 09:34, Qu Wenruo wrote:
> > [Overall]
> > The previous rework on qgroup reservation system put a lot of effort on
> > data, which works quite fine.
> > 
> > But it takes less focus on metadata reservation, causing some problem
> > like metadata reservation underflow and noisy kernel warning.
> > 
> > This patchset will try to address the remaining problem of metadata
> > reservation.
> > 
> > The idea of new qgroup metadata reservation is to use 2 types of
> > metadata reservation:
> > 1) Per-transaction reservation
> >Life span will be inside a transaction. Will be freed at transaction
> >commit time.
> > 
> > 2) Preallocated reservation
> >For case where we reserve space before starting a transaction.
> >Operation like dealloc and delayed-inode/item belongs to this type.
> > 
> >This works similar to block_rsv, its reservation can be
> >reserved/released at any timing caller like.
> > 
> >The only point to notice is, if preallocated reservation is used and
> >finished without problem, it should be converted to per-transaction
> >type instead of just freeing.
> >This is to co-operate with qgroup update at commit time.
> > 
> > For preallocated type, this patch will integrate them into inode_rsv
> > mechanism reworked by Josef, and delayed-inode/item reservation.
> > 
> > 
> > [Problem: Over-reserve]
> > Currently the patchset addresses metadata underflow quite well, but
> > due to the over-reserve nature of btrfs and highly bounded to inode_rsv,
> > qgroup metadata reservation also tends to be over-reserved.
> > 
> > This is especially obvious for small limit.
> > For 128M limit, it's will only be able to write about 70M before hitting
> > quota limit.
> > Although for larger limit, like 5G limit, it can reach 4.5G or more
> > before hitting limit.
> > 
> > Such over-reserved behavior can lead to some problem with existing test
> > cases (where limit is normally less than 20M).
> > 
> > While it's also possible to be addressed by use more accurate space other
> > than max estimations.
> > 
> > For example, to calculate metadata needed for delalloc, we use
> > btrfs_calc_trans_metadata_size(), which always try to reserve space for
> > CoW a full-height tree, and will also include csum size.
> > Both calculate is way over-killed for qgroup metadata reservation.
> In private chat with Chris couple of months ago we discussed making the
> reservation a lot less pessimistic. One assumption which we could
> exploit is the fact that upon a tree split it's unlikely we will create
> more than 1 additional level in the tree. So we could potentially modify
> btrfs_calc_trans_metadata_size to take a root parameter and instead of
> 
> BTRFS_MAX_LEVEL * 2 we could change this to root_level * 2. How does
> that sound?

That sounds unsafe for low levels, because the reservations can be made
with level N and the tree height can increase at commit time to > N,
then we'd go ENOSPC. But BTRFS_MAX_LEVEL can be reduced to
min(root_level + 1, 4) and even the +1 can be dropped for higher levels
if the root node is filled up to some percent (say < 50%). The 4 is a
guess and could be possibly 3, with some more precise calculations.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/5] define BTRFS_DEV_STATE

2017-12-12 Thread David Sterba
On Sun, Dec 10, 2017 at 05:15:17PM +0800, Anand Jain wrote:
> As of now device properties and states are being represented as int
> variable, patches here makes them bit flags instead. Further, wip
> patches such as device failed state needs this cleanup.
> 
> v2:
>  Adds BTRFS_DEV_STATE_REPLACE_TGT
>  Adds BTRFS_DEV_STATE_FLUSH_SENT
>  Drops BTRFS_DEV_STATE_CAN_DISCARD
>  Starts bit flag from the bit 0
>  Drops unrelated change - declare btrfs_device
> 
> v3:
>  Fix static checker warning, define respective dev state as bit number

The define numbers are fixed but the whitespace changes that I made in
misc-next while merging the patches are not in v3. I'd expect that you
update your patches if there's another iteration and there have been
changes to the merged ones, or send me only incrementals to the merged
code or instructions how to fix it up if it's a trivial change like
that.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 0/5] Add the ability to do BPF directed error injection

2017-12-12 Thread Alexei Starovoitov

On 12/11/17 8:36 AM, Josef Bacik wrote:

This is the same as v8, just rebased onto the bpf tree.

v8->v9:
- rebased onto the bpf tree.

v7->v8:
- removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.

v6->v7:
- moved the opt-in macro to bpf.h out of kprobes.h.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniels suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproduceable results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actual do the error injection part.
It is very simple, we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,


Applied, thanks Josef!

While applying in the patch "bpf: add a bpf_override_function helper"
I moved ifdef CONFIG_BPF_KPROBE_OVERRIDE few lines,
so when it's not set the program will fail at load time with error
"unknown func bpf_override_return#58"
instead of returning EINVAL at run-time.
That's more standard way of adding new helpers.

Thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: do not merge rbios if their fail stripe index are not identical

2017-12-12 Thread David Sterba
On Mon, Dec 11, 2017 at 02:56:31PM -0700, Liu Bo wrote:
> Since fail stripe index in rbio would be used to decide which
> algorithm reconstruction would be run, we cannot merge rbios if
> their's fail striped index are different, otherwise, one of the two
> reconstructions would fail.
> 
> Signed-off-by: Liu Bo 

Patch replaced in the branch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 23/45] fs: btrfs: remove duplicate includes

2017-12-12 Thread David Sterba
On Wed, Dec 06, 2017 at 10:14:31PM +0530, Pravin Shedge wrote:
> These duplicate includes have been found with scripts/checkincludes.pl but
> they have been removed manually to avoid removing false positives.
> 
> Signed-off-by: Pravin Shedge 

Added to queue, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: xfstests/btrfs/100 and 151 failure

2017-12-12 Thread Lakshmipathi.G
> Actually 151 has been failing for me as well but not 100
>
Okay, can you share the kernel .config file? I'll give it a try
with those config and check 100, sometime tomorrow.


Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: broken btrfs filesystem

2017-12-12 Thread Austin S. Hemmelgarn

On 2017-12-12 11:24, Hugo Mills wrote:

On Tue, Dec 12, 2017 at 04:18:09PM +, Neal Becker wrote:

Is it possible to check while it is mounted?


Certainly not while mounted read-write. While mounted read-only --
I'm not certain. Possibly.
In theory, it is possible, but I think that the safety measures in 
`btrfs check` don't distinguish between the two cases, so in practice it 
may not be possible.


Hugo.


On Tue, Dec 12, 2017 at 9:52 AM Hugo Mills  wrote:


On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:

sudo ls -la ~/
[sudo] password for nbecker:
ls: cannot access '/home/nbecker/.bash_history': No such file or

directory

ls: cannot access '/home/nbecker/.bash_history': No such file or

directory

ls: cannot access '/home/nbecker/.bash_history': No such file or

directory

ls: cannot access '/home/nbecker/.bash_history': No such file or

directory

ls: cannot access '/home/nbecker/.bash_history': No such file or

directory

ls: cannot access '/home/nbecker/.bash_history': No such file or

directory

total 11652
drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
drwxr-xr-x. 1 rootroot  48 Aug  2 19:32  ..
[...]
-rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
[...]


Could you show the result of btrfs check --readonly on this FS? The
rest, below, doesn't show up anything unusual to me.

Hugo.


uname -a
Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

  btrfs --version
btrfs-progs v4.11.1

sudo btrfs fi show
Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
Total devices 1 FS bytes used 80.96GiB
devid1 size 230.00GiB used 230.00GiB path /dev/sda3

sudo btrfs fi df /home
Data, single: total=226.99GiB, used=78.89GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=3.01GiB, used=2.07GiB
GlobalReserve, single: total=222.36MiB, used=0.00B

dmesg.log is here:
https://nbecker.fedorapeople.org/dmesg.txt

mount | grep btrfs
/dev/sda3 on / type btrfs
(rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
/dev/sda3 on /home type btrfs
(rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: broken btrfs filesystem

2017-12-12 Thread Hugo Mills
On Tue, Dec 12, 2017 at 04:18:09PM +, Neal Becker wrote:
> Is it possible to check while it is mounted?

   Certainly not while mounted read-write. While mounted read-only --
I'm not certain. Possibly.

   Hugo.

> On Tue, Dec 12, 2017 at 9:52 AM Hugo Mills  wrote:
> 
> > On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:
> > > sudo ls -la ~/
> > > [sudo] password for nbecker:
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > total 11652
> > > drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
> > > drwxr-xr-x. 1 rootroot  48 Aug  2 19:32  ..
> > > [...]
> > > -rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
> > > [...]
> >
> >Could you show the result of btrfs check --readonly on this FS? The
> > rest, below, doesn't show up anything unusual to me.
> >
> >Hugo.
> >
> > > uname -a
> > > Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
> > > 2017 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > >  btrfs --version
> > > btrfs-progs v4.11.1
> > >
> > > sudo btrfs fi show
> > > Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
> > > Total devices 1 FS bytes used 80.96GiB
> > > devid1 size 230.00GiB used 230.00GiB path /dev/sda3
> > >
> > > sudo btrfs fi df /home
> > > Data, single: total=226.99GiB, used=78.89GiB
> > > System, single: total=4.00MiB, used=48.00KiB
> > > Metadata, single: total=3.01GiB, used=2.07GiB
> > > GlobalReserve, single: total=222.36MiB, used=0.00B
> > >
> > > dmesg.log is here:
> > > https://nbecker.fedorapeople.org/dmesg.txt
> > >
> > > mount | grep btrfs
> > > /dev/sda3 on / type btrfs
> > > (rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
> > > /dev/sda3 on /home type btrfs
> > > (rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)
> > >
> >

-- 
Hugo Mills | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4  |  Ford Prefect


signature.asc
Description: Digital signature


Re: xfstests/btrfs/100 and 151 failure

2017-12-12 Thread Nikolay Borisov


On 12.12.2017 18:06, Nikolay Borisov wrote:
> 
> 
> On 12.12.2017 17:55, Lakshmipathi.G wrote:
>> Hi.
>>
>> While checking with latest misc-next branch from btrfs-devel.
>> btrfs/100 and 151 failed.
>> Details available on below bugzilla links. thanks
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=198085
>> https://bugzilla.kernel.org/show_bug.cgi?id=198087
> 
> I've been running tests on that branch + some custom patches and haven't
> seen failure on those 2 tests.

Actually 151 has been failing for me as well but not 100

> 
>>
>> 
>> Cheers,
>> Lakshmipathi.G
>> http://www.giis.co.in http://www.webminal.org
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Chunk-Recovery fails with alignment error

2017-12-12 Thread Benjamin Beichler
Hi,

the patch unfortunately did not work, because I didn't know for which
version/tree was made, since it does not apply for 4.14. or 4.15. But
since I got your hint with the possibility of the old age of my btrfs
volume, I simply tried an old ubuntu live disk, and it mounted the
volume :-) Then I started a balance of the metadata, and hope it
reloacted the unfavorable placed block and voila - it is also
mountable in newer kernels. Then I updated my arch kernel in chroot
and now my system is running again.

I know changed to DUP metadata profile and currently I run a scrub,
but I hope there are no more serious errors.

I don't know whether the patch for checking bytenr < sectorsize should
be seen as regression bug, or the block was simply placed wrong by the
bcache bug.

All in all: Many thanks for the help !

kind regards

Benjamin

2017-12-11 0:50 GMT+01:00 Qu Wenruo :
>
>
> On 2017年12月11日 05:16, Benjamin Beichler wrote:
>> The patch let the chunk-recover be successful. But I'm no lucky man,
>> the recovered chunk tree does not work or other metadata is also
>> broken.
>>
>> Mounting the system is not successful (dmesg):
>> BTRFS critical (device dm-0): corrupt node, invalid item slot:
>> block=16384, root=1, slot=0
>> BTRFS error (device dm-0): failed to read chunk root
>> BTRFS error (device dm-0): open_ctree failed
>
> For this error, you could apply this diff to by-pass it:
>
> --
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index ce4ed6ec8f39..355220e21c2e 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -413,12 +413,6 @@ int btrfs_check_node(struct btrfs_root *root,
> struct extent_buffer *node)
> btrfs_node_key_to_cpu(node, , slot);
> btrfs_node_key_to_cpu(node, _key, slot + 1);
>
> -   if (!bytenr) {
> -   generic_err(root, node, slot,
> -   "invalid NULL node pointer");
> -   ret = -EUCLEAN;
> -   goto out;
> -   }
> if (!IS_ALIGNED(bytenr, root->fs_info->sectorsize)) {
> generic_err(root, node, slot,
> "unaligned pointer, have %llu should be aligned
> to %u",
> --
>
> Thanks,
> Qu
>
>>
>> Therefore I tried a btrfs check --repair, this time without error:
>> https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24
>>
>> Then I tried btrfs check --init-extent-tree and this produces much
>> output. I put the beginning into here:
>> https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd
>> From a fast view, the messages keep to be similar to the last of the
>> gist, but the messages in the beginning are not repeating. If it helps
>> I have complete compressed log.
>>
>> Then I tried a btrfs recover to get files, but for many files (also
>> improtant data, but I filtered the output) I get outputs like in:
>> https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb
>>
>> Any new suggestions?
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: xfstests/btrfs/100 and 151 failure

2017-12-12 Thread Nikolay Borisov


On 12.12.2017 17:55, Lakshmipathi.G wrote:
> Hi.
> 
> While checking with latest misc-next branch from btrfs-devel.
> btrfs/100 and 151 failed.
> Details available on below bugzilla links. thanks
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198085
> https://bugzilla.kernel.org/show_bug.cgi?id=198087

I've been running tests on that branch + some custom patches and haven't
seen failure on those 2 tests.

> 
> 
> Cheers,
> Lakshmipathi.G
> http://www.giis.co.in http://www.webminal.org
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


xfstests/btrfs/100 and 151 failure

2017-12-12 Thread Lakshmipathi.G
Hi.

While checking with latest misc-next branch from btrfs-devel.
btrfs/100 and 151 failed.
Details available on below bugzilla links. thanks

https://bugzilla.kernel.org/show_bug.cgi?id=198085
https://bugzilla.kernel.org/show_bug.cgi?id=198087


Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-12 Thread Alan Stern
On Mon, 11 Dec 2017, Joe Perches wrote:

> >  - I don't understand the error for xa_head here:
> > 
> > struct xarray {
> > spinlock_t  xa_lock;
> > gfp_t   xa_flags;
> > void __rcu *xa_head;
> > };
> > 
> >Do people really think that:
> > 
> > struct xarray {
> > spinlock_t  xa_lock;
> > gfp_t   xa_flags;
> > void __rcu  *xa_head;
> > };
> > 
> >is more aesthetically pleasing?  And not just that, but it's an *error*
> >so the former is *RIGHT* and this is *WRONG*.  And not just a matter

Not sure what was meant here.  Neither one is *WRONG* in the sense of 
being invalid C code.  In the sense of what checkpatch will accept, the 
former is *WRONG* and the latter is *RIGHT* -- the opposite of what 
was written.

> >of taste?
> 
> No opinion really.
> That's from Andy Whitcroft's original implementation.

This one, at least, is easy to explain.  The original version tends to
lead to bugs, or easily misunderstood code.  Consider if another
variable was added to the declaration; it could easily turn into:

void __rcu *xa_head, xa_head2;

(The compiler will reject this, but it wouldn't if the underlying type
had been int instead of void.)

Doing it the other way makes the meaning a lot more clear:

void __rcu  *xa_head, *xa_head2;

This is an idiom specifically intended to reduce the likelihood of 
errors.  Rather like avoiding assignments inside conditionals.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: broken btrfs filesystem

2017-12-12 Thread Hugo Mills
On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:
> sudo ls -la ~/
> [sudo] password for nbecker:
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> total 11652
> drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
> drwxr-xr-x. 1 rootroot  48 Aug  2 19:32  ..
> [...]
> -rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
> [...]

   Could you show the result of btrfs check --readonly on this FS? The
rest, below, doesn't show up anything unusual to me.

   Hugo.

> uname -a
> Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
>  btrfs --version
> btrfs-progs v4.11.1
> 
> sudo btrfs fi show
> Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
> Total devices 1 FS bytes used 80.96GiB
> devid1 size 230.00GiB used 230.00GiB path /dev/sda3
> 
> sudo btrfs fi df /home
> Data, single: total=226.99GiB, used=78.89GiB
> System, single: total=4.00MiB, used=48.00KiB
> Metadata, single: total=3.01GiB, used=2.07GiB
> GlobalReserve, single: total=222.36MiB, used=0.00B
> 
> dmesg.log is here:
> https://nbecker.fedorapeople.org/dmesg.txt
> 
> mount | grep btrfs
> /dev/sda3 on / type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
> /dev/sda3 on /home type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)
> 

-- 
Hugo Mills | Hey, Virtual Memory! Now I can have a *really big*
hugo@... carfax.org.uk | ramdisk!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: [PATCH 00/14] Qgroup metadata reservation rework

2017-12-12 Thread Nikolay Borisov


On 12.12.2017 09:34, Qu Wenruo wrote:
> [Overall]
> The previous rework on qgroup reservation system put a lot of effort on
> data, which works quite fine.
> 
> But it takes less focus on metadata reservation, causing some problem
> like metadata reservation underflow and noisy kernel warning.
> 
> This patchset will try to address the remaining problem of metadata
> reservation.
> 
> The idea of new qgroup metadata reservation is to use 2 types of
> metadata reservation:
> 1) Per-transaction reservation
>Life span will be inside a transaction. Will be freed at transaction
>commit time.
> 
> 2) Preallocated reservation
>For case where we reserve space before starting a transaction.
>Operation like dealloc and delayed-inode/item belongs to this type.
> 
>This works similar to block_rsv, its reservation can be
>reserved/released at any timing caller like.
> 
>The only point to notice is, if preallocated reservation is used and
>finished without problem, it should be converted to per-transaction
>type instead of just freeing.
>This is to co-operate with qgroup update at commit time.
> 
> For preallocated type, this patch will integrate them into inode_rsv
> mechanism reworked by Josef, and delayed-inode/item reservation.
> 
> 
> [Problem: Over-reserve]
> Currently the patchset addresses metadata underflow quite well, but
> due to the over-reserve nature of btrfs and highly bounded to inode_rsv,
> qgroup metadata reservation also tends to be over-reserved.
> 
> This is especially obvious for small limit.
> For 128M limit, it's will only be able to write about 70M before hitting
> quota limit.
> Although for larger limit, like 5G limit, it can reach 4.5G or more
> before hitting limit.
> 
> Such over-reserved behavior can lead to some problem with existing test
> cases (where limit is normally less than 20M).
> 
> While it's also possible to be addressed by use more accurate space other
> than max estimations.
> 
> For example, to calculate metadata needed for delalloc, we use
> btrfs_calc_trans_metadata_size(), which always try to reserve space for
> CoW a full-height tree, and will also include csum size.
> Both calculate is way over-killed for qgroup metadata reservation.
In private chat with Chris couple of months ago we discussed making the
reservation a lot less pessimistic. One assumption which we could
exploit is the fact that upon a tree split it's unlikely we will create
more than 1 additional level in the tree. So we could potentially modify
btrfs_calc_trans_metadata_size to take a root parameter and instead of

BTRFS_MAX_LEVEL * 2 we could change this to root_level * 2. How does
that sound?


> 
> [Patch structure]
> The patch is consist of 2 main parts:
> 1) Type based qgroup reservation
>The original patchset is sent several months ago.
>Nothing is modified at all, just rebased. And not conflict at all.
> 
>It's from patch 1 to patch 6.
> 
> 2) Split meta qgroup reservation into per-trans and prealloc sub types
>The real work to address metadata underflow.
>Due to the over-reserve problem, this part is still in RFC state.
>But the framework should mostly be fine, only needs extra fine-tuning
>to get more accurate qgroup rsv to avoid too early limit.
> 
>It's from patch 7 to 14.
> 
> Qu Wenruo (14):
>   btrfs: qgroup: Skeleton to support separate qgroup reservation type
>   btrfs: qgroup: Introduce helpers to update and access new qgroup rsv
>   btrfs: qgroup: Make qgroup_reserve and its callers to use separate
> reservation type
>   btrfs: qgroup: Fix wrong qgroup reservation update for relationship
> modification
>   btrfs: qgroup: Update trace events to use new separate rsv types
>   btrfs: qgroup: Cleanup the remaining old reservation counters
>   btrfs: qgroup: Split meta rsv type into meta_prealloc and
> meta_pertrans
>   btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup
>   btrfs: qgroup: Introduce function to convert META_PREALLOC into
> META_PERTRANS
>   btrfs: qgroup: Use separate meta reservation type for delalloc
>   btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and
> item
>   btrfs: qgroup: Use root->qgroup_meta_rsv_* to record qgroup meta
> reserved space
>   btrfs: qgroup: Update trace events for metadata reservation
>   Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"
> 
>  fs/btrfs/ctree.h |  15 +-
>  fs/btrfs/delayed-inode.c |  50 +--
>  fs/btrfs/disk-io.c   |   2 +-
>  fs/btrfs/extent-tree.c   |  49 +++---
>  fs/btrfs/file.c  |  15 +-
>  fs/btrfs/free-space-cache.c  |   2 +-
>  fs/btrfs/inode-map.c |   4 +-
>  fs/btrfs/inode.c |  27 ++--
>  fs/btrfs/ioctl.c |  10 +-
>  fs/btrfs/ordered-data.c  |   2 +-
>  fs/btrfs/qgroup.c| 350 
> ---
>  fs/btrfs/qgroup.h| 102 -
>  fs/btrfs/relocation.c  

broken btrfs filesystem

2017-12-12 Thread Neal Becker
sudo ls -la ~/
[sudo] password for nbecker:
ls: cannot access '/home/nbecker/.bash_history': No such file or directory
ls: cannot access '/home/nbecker/.bash_history': No such file or directory
ls: cannot access '/home/nbecker/.bash_history': No such file or directory
ls: cannot access '/home/nbecker/.bash_history': No such file or directory
ls: cannot access '/home/nbecker/.bash_history': No such file or directory
ls: cannot access '/home/nbecker/.bash_history': No such file or directory
total 11652
drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
drwxr-xr-x. 1 rootroot  48 Aug  2 19:32  ..
[...]
-rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-?? ? ?   ?  ??  .bash_history
-rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
[...]

uname -a
Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

 btrfs --version
btrfs-progs v4.11.1

sudo btrfs fi show
Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
Total devices 1 FS bytes used 80.96GiB
devid1 size 230.00GiB used 230.00GiB path /dev/sda3

sudo btrfs fi df /home
Data, single: total=226.99GiB, used=78.89GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=3.01GiB, used=2.07GiB
GlobalReserve, single: total=222.36MiB, used=0.00B

dmesg.log is here:
https://nbecker.fedorapeople.org/dmesg.txt

mount | grep btrfs
/dev/sda3 on / type btrfs
(rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
/dev/sda3 on /home type btrfs
(rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)

-- 
Those who don't understand recursion are doomed to repeat it
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OFF-By-One Trouble? [BTRFS in Debian Stretch Kernel 4.9.0-4-amd64]

2017-12-12 Thread Juergen Sauer
Hi collegues,
this morning we've got a problm with btrfs in Debian Stable.

Yesterday we had our fs filled below 30% space usage.
No further usage came in the usage is nearly the same, but ...

# df -h .
DateisystemGröße Benutzt Verf. Verw% Eingehängt auf
/dev/vdb1  483G483G  148K  100% /mnt

# btrfs fi df .
Data, single: total=480.97GiB, used=480.97GiB
System, single: total=32.00MiB, used=96.00KiB
Metadata, single: total=2.00GiB, used=1.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs fi show .
Label: 'AXSDaten'  uuid: fde7e31f-4127-4a9d-b7c6-774ceff8f7c1
Total devices 1 FS bytes used 482.26GiB
devid1 size 483.00GiB used 483.00GiB path /dev/vdb1

# btrfs --version
btrfs-progs v4.7.3

we found exerpt of kernel log:
[  199.391577] attempt to access beyond end of device
[  199.391598] vdb1: rw=536870984, want=1012922336, limit=1012922335
[  199.391615] BTRFS error (device vdb1): bdev /dev/vdb1 errs: wr 301,
rd 0, flush 0, corrupt 0, gen 0
r
If I interpret the log right, on the FS the kernel wants to write on
block  1012922336, but the fs was created with max 1012922335 blocks.

Looks to us here like a typical off-by-one error on accessing the last
block of device.

This error was repeating under ff. conditions:
- shutdown the virtual server (kvm),
- extend the physical file of the according virtual hdd by 20% (from 483
Gib to 520 GiB)
- restart virtual server
- fixed partion size (parted), mounted
- called btrfs fi resize max on the corresponding mount point

Error came back, nearly at once.

Any other Idea to fix this error?

mit freundlichen Grüßen
Jürgen Sauer

-- 
Jürgen Sauer - automatiX GmbH,
+49-4209-4699, juergen.sa...@automatix.de
Geschäftsführer: Jürgen Sauer,
Gerichtstand: Amtsgericht Walsrode • HRB 120986
Ust-Id: DE191468481 • St.Nr.: 36/211/08000
GPG Public Key zur Signaturprüfung:
http://www.automatix.de/juergen_sauer_publickey.gpg



signature.asc
Description: OpenPGP digital signature


[PATCH] btrfs: Fix out of bounds access in btrfs_search_slot

2017-12-12 Thread Nikolay Borisov
When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittdely this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the 
path being passed hasn't been initialised, if it has already been used to 
query a btree then it could potentially have populated the nodes/slots arrays.

Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.

Signed-off-by: Nikolay Borisov 
Fixes-coverity-id: 711515
---
 fs/btrfs/ctree.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 1f001d31bda8..076b704798c9 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2780,6 +2780,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
 * contention with the cow code
 */
if (cow) {
+   bool last_level = (level == (BTRFS_MAX_LEVEL - 1));
+
/*
 * if we don't really need to cow this block
 * then we don't want to set the path blocking,
@@ -2804,9 +2806,13 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
}
 
btrfs_set_path_blocking(p);
-   err = btrfs_cow_block(trans, root, b,
- p->nodes[level + 1],
- p->slots[level + 1], );
+   if (last_level)
+   err = btrfs_cow_block(trans, root, b, NULL, 0,
+ );
+   else
+   err = btrfs_cow_block(trans, root, b,
+ p->nodes[level + 1],
+ p->slots[level + 1], );
if (err) {
ret = err;
goto done;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html