Re: Symlink not persisted even after fsync

2018-04-16 Thread Vijay Chidambaram
On Mon, Apr 16, 2018 at 7:07 PM, Dave Chinner  wrote:
> On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
>> Thanks! As I mentioned before, this is useful. I have a follow-up
>> question. Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1?
>
> So, strictly ordered behaviour:
>
> create foo:
> - creates dirent in inode B and new inode A in an atomic
>   transaction sequence #1
>
> link foo -> A/bar
> - creates dirent in inode C and bumps inode A link count in
>   an atomic transaction seqeunce #2.
>
> fsync foo
> - looks at inode A, sees it's "last modification" sequence
>   counter as #2
> - flushes all transactions up to and including #2 to the
>   journal.
>
> See the dependency chain? Both the inodes and dirents in the create
> operation and the link operation are chained to the inode foo via
> the atomic transactions. Hence when we flush foo, we also flush the
> dependent changes because of the change atomicity requirements
>
>> I would say 2,
>
> Correct, for strict ordering. But
>
>> but POSIX is silent on this,
>
> Well, it's not silent, POSIX explicitly allows for fsync() to do
> nothing and report success. Hence we can't really look to POSIX to
> define how fsync() should behave.
>
>> so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>
> Right. But directory A has a dependent change linked to foo. If we
> fsync() foo, we are persisting the link count change in that file,
> and hence all the other changes related to that link count change
> must also be flushed. Similarly, all the cahnges related to the
> creation on foo must be flushed, too.
>
>> In this case, its not a symlink; its a hard link, so I would say the
>> link count for foo should be 2.
>
> Right - that's the "reference counted object dependency" I refered
> to. i.e. it's a bi-direction atomic dependency - either we show both
> the new dirent and the link count change, or we show neither of
> them.  Hence fsync on one object implies that we are also persisting
> the related changes in the other object, too.
>
>> But btrfs and F2FS show link count of
>> 1 after a crash.
>
> That may be valid if the dirent A/bar does not exist after recovery,
> but it also means fsync() hasn't actually guaranteed inode changes
> made prior to the fsync to be persistent on disk. i.e. that's a
> violation of ordered metadata semantics and probably a bug.

Great, this matches our understanding perfectly. We have separately
posted to the btrfs mailing list to confirm it is a bug. Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Do super block verification before writing it to disk

2018-04-16 Thread Qu Wenruo
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.

The corruption would looks like:
--
superblock: bytenr=65536, device=/dev/sdc1
-
csum_type   41700 (INVALID)
csum0x3b252d3a [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
...
incompat_flags  0x5b224169
( MIXED_BACKREF |
  COMPRESS_LZO |
  BIG_METADATA |
  EXTENDED_IREF |
  SKINNY_METADATA |
  unknown flag: 0x5b224000 )
...
--
Or
--
superblock: bytenr=65536, device=/dev/mapper/x
-
csum_type  35355 (INVALID)
csum_size  32
csum   0xf0dbeddd [match]
bytenr 65536
flags  0x1
   ( WRITTEN )
magic  _BHRfS_M [match]
...
incompat_flags 0x176d2169
   ( MIXED_BACKREF |
 COMPRESS_LZO |
 BIG_METADATA |
 EXTENDED_IREF |
 SKINNY_METADATA |
 unknown flag: 0x176d2000 )
--

Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.

Although the cause is still unknown, at least detect it and prevent further
corruption.

Reported-by: Ken Swenson 
Reported-by: Ben Parsons <9parso...@gmail.com>
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Fix false alerts by moving the check to write_dev_supers() as
  btrfs_check_super_valid() only handles the primary superblock.
v3:
  Update commit message to show the corruption in details.
  Modify the kernel error message to show corruption is detected before
  transaction commitment.
---
 fs/btrfs/disk-io.c | 30 ++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23803102aa0d..2d543ba2b7af 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -68,7 +68,8 @@
 static const struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
 static void free_fs_root(struct btrfs_root *root);
-static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info);
+static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+  struct btrfs_super_block *sb);
 static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
  struct btrfs_fs_info *fs_info);
@@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
 
memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
 
-   ret = btrfs_check_super_valid(fs_info);
+   ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
if (ret) {
btrfs_err(fs_info, "superblock contains fatal errors");
err = -EINVAL;
@@ -3310,6 +3311,27 @@ static int write_dev_supers(struct btrfs_device *device,
 
btrfs_set_super_bytenr(sb, bytenr);
 
+   /* check the validation of the primary sb before writing */
+   if (i == 0) {
+   ret = btrfs_check_super_valid(device->fs_info, sb);
+   if (ret) {
+   btrfs_err(device->fs_info,
+"superblock corruption detected before transaction commitment for device %llu",
+ device->devid);
+   return -EUCLEAN;
+   }
+   /*
+* Unknown incompat flags can't be mounted, so newly
+* developed flags means corruption
+*/
+   if (btrfs_super_incompat_flags(sb) &
+   ~BTRFS_FEATURE_INCOMPAT_SUPP) {
+   btrfs_err(device->fs_info,
+"superblock corruption detected before transaction commitment for device %llu",
+ device->devid);
+   return -EUCLEAN;
+   }
+   }
crc = ~(u32)0;
crc = btrfs_csum_data((const char *)sb + BTRFS_CSUM_SIZE, crc,
  BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
@@ -3985,9 +4007,9 @@ int 

Re: btrfs fails to mount after power outage

2018-04-16 Thread Qu Wenruo


On 2018年04月17日 00:07, Tom Vincent wrote:
> On 12 April 2018 at 00:25, Qu Wenruo  wrote:
>> I'm curious about what's the underlying disk?
> 
> It's an Samsung PM951 NVMe SSD.
> 
>> Is it plain physical device? Or have other layers like bcache/lvm?
> 
> btrfs on LUKS
> 
>>> btrfs check
>> Full output please.
> 
> https://gist.githubusercontent.com/tlvince/acf51b37622c216e1c33cdc3dfbd321f/raw/d0237948bbffacd4bb8d53fdfa5f23391416c1e2/btrfs-check.txt

Unfortunately, not only extent tree, but also fs trees got corrupted:
root 259 inode 19916 errors 2000, link count wrong
unresolved ref dir 16196710 index 2 namelen 12 name foo.gpg filetype 0
errors 3, no dir item, no dir index

Such output along other error messages means at least one tree block of
your fs trees get corrupted.

And it seems that all corrupted tree blocks belongs to subvolume 259.

>> For transid error, btrfs check --repair can fix it, but only do it when
>> that's the only problem.
> 
> I ran this (for ~12+ hours) to no avail; it appears to have been
> looping around "Btree for root 259 is fixed". I grew impatient and
> SIGINT-ed, which unsurprisingly toasted the file system once and for
> all (I rebuilt from backups at that point).

check --repair won't help much in this case.
So btrfs-restore would be your last chance to salvage data.

> 
> Full output:
> 
> https://gist.githubusercontent.com/tlvince/8060c19526aa011b0baff2b12e3873fd/raw/ecc43bd9dc7b352e490aa0bf0deac368af04e117/btrfs-check-repair.txt
> 
> Note, the system was fine for a few days after zero-log (before check
> --repair), but then hit the same transid error at boot.

Your filesystem is already *CORRUPTED*, so whatever happens is not a
surprise.

Only a filesystem which passes "btrfs check" without any problems could
be ensured to run for a long time.

Thanks,
Qu

> 
> On 13 April 2018 at 06:46, Duncan <1i5t5.dun...@cox.net> wrote:
>> What mount options?  In particular, is the discard option used (and of
>> course I'm assuming nothing as insane as nobarrier)?
> 
> noatime,compress=lzo
> 
> ... as well as some defaults: rw,noatime,compress=lzo,ssd,space_cache
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v4 3/4] btrfs: Refactor btrfs_ioctl_snap_destroy() by using btrfs_delete_subvolume()

2018-04-16 Thread Misono Tomohiro
On 2018/04/17 2:53, David Sterba wrote:
> On Wed, Apr 11, 2018 at 02:20:49PM +0900, Misono Tomohiro wrote:
>> Use btrfs_delete_subvolume() to refactor btrfs_ioctl_snap_destroy().
>> The permission check is still done in btrfs_ioctl_snap_destroy(). Also,
>> call of d_delete() is still required since btrfs_delete_subvolume()
>> does not call it.
>>
>> As a result, btrfs_unlink_subvol() and may_destroy_subvol()
>> become static functions. No functional change happens.
>>
>> Signed-off-by: Tomohiro Misono 
> 
> Why is this still split into two patches? Factoring out a function
> should happen in one patch, ie 2 and 3 in one go. Do you have a reason
> not to do it like that?

I thought this reduces code change in one patch and might be good, but
I'm completely fine with merging 2nd and 3rd patches. Should I send v5?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symlink not persisted even after fsync

2018-04-16 Thread Dave Chinner
On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
> Thanks! As I mentioned before, this is useful. I have a follow-up
> question. Consider the following workload:
> 
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
> 
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? 

So, strictly ordered behaviour:

create foo:
- creates dirent in inode B and new inode A in an atomic
  transaction sequence #1

link foo -> A/bar
- creates dirent in inode C and bumps inode A link count in
  an atomic transaction seqeunce #2.

fsync foo
- looks at inode A, sees it's "last modification" sequence
  counter as #2
- flushes all transactions up to and including #2 to the
  journal.

See the dependency chain? Both the inodes and dirents in the create
operation and the link operation are chained to the inode foo via
the atomic transactions. Hence when we flush foo, we also flush the
dependent changes because of the change atomicity requirements

> I would say 2,

Correct, for strict ordering. But

> but POSIX is silent on this,

Well, it's not silent, POSIX explicitly allows for fsync() to do
nothing and report success. Hence we can't really look to POSIX to
define how fsync() should behave.

> so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.

Right. But directory A has a dependent change linked to foo. If we
fsync() foo, we are persisting the link count change in that file,
and hence all the other changes related to that link count change
must also be flushed. Similarly, all the cahnges related to the
creation on foo must be flushed, too.

> In this case, its not a symlink; its a hard link, so I would say the
> link count for foo should be 2.

Right - that's the "reference counted object dependency" I refered
to. i.e. it's a bi-direction atomic dependency - either we show both
the new dirent and the link count change, or we show neither of
them.  Hence fsync on one object implies that we are also persisting
the related changes in the other object, too.

> But btrfs and F2FS show link count of
> 1 after a crash.

That may be valid if the dirent A/bar does not exist after recovery,
but it also means fsync() hasn't actually guaranteed inode changes
made prior to the fsync to be persistent on disk. i.e. that's a
violation of ordered metadata semantics and probably a bug.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-04-16 Thread Qu Wenruo


On 2018年04月17日 01:27, David Sterba wrote:
> On Mon, Apr 16, 2018 at 04:53:10PM +0900, Misono Tomohiro wrote:
>> On 2018/04/12 22:13, David Sterba wrote:
>>> On Tue, Apr 03, 2018 at 03:30:34PM +0800, Qu Wenruo wrote:
 I didn't see this patch merged in your misc-next branch but only the
 remaining patches.

 However without this patch, btrfs qgroup reserved space will get
 obviously increased as prealloc metadata reserved space is never freed
 until inode reserved space is freed.

 This would cause a lot of qgroup related test cases to fail.

 Would you please merge this patch with all qgroup patchset?
>>>
>>> For the record: patch 9 and 10 are now in misc next and will go to 4.17.
>>> I need to let them through the for-next and other testing, so it will be
>>> some of the post rc1 pull requests.
>>
>> I still saw qgroup related xfstests (btrfs/022 etc.) fails in current 
>> misc-next branch
>> which includes 9th and 10th patches.
>>
>> I noticed 5th patch in the tree (already in v4.17-rc1) is older version and 
>> this
>> seems the reason that the tests still fails.
> 
> Qu, can you please have a look a send an incremental fixup? Handling of
> this patchset was not very good from my side, I should have asked for a
> fresh resend as I applied the changes from mailinglist and not from git.

No problem, and for next time, I'll double check misc-next branch for
such difference.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature


Proposal

2018-04-16 Thread MS Zeliha Omer Faruk



Hello

Greeetings to you please did you get my previous email regarding my
investment proposal last week friday ?

MS.Zeliha ömer faruk
zeliha.omer.fa...@gmail.com

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Fix race condition between delayed refs and blockgroup removal

2018-04-16 Thread David Sterba
On Wed, Apr 11, 2018 at 10:59:09AM +0300, Nikolay Borisov wrote:
> When the delayed refs for a head are all run, eventually
> cleanup_ref_head is called which (in case of deletion) obtains a
> reference for the relevant btrfs_space_info struct by querying the bg
> for the range. This is problematic because when the last extent of a
> bg is deleted a race window emerges between removal of that bg and the
> subsequent invocation of cleanup_ref_head. This can result in cache being null
> and either a null pointer dereference or assertion failure.
> 
>   task: 8d04d31ed080 task.stack: 9e5dc10cc000
>   RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
>   RSP: 0018:9e5dc10cfbe8 EFLAGS: 00010292
>   RAX: 0044 RBX:  RCX: 
>   RDX: 8d04ffc1f868 RSI: 8d04ffc178c8 RDI: 8d04ffc178c8
>   RBP: 8d04d29e5ea0 R08: 01f0 R09: 0001
>   R10: 9e5dc0507d58 R11: 0001 R12: 8d04d29e5ea0
>   R13: 8d04d29e5f08 R14: 8d04efe29b40 R15: 8d04efe203e0
>   FS:  7fbf58ead500() GS:8d04ffc0() 
> knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 7fe6c6975648 CR3: 13b2a000 CR4: 06f0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: fffe0ff0 DR7: 0400
>   Call Trace:
>__btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
>btrfs_run_delayed_refs+0x68/0x250 [btrfs]
>btrfs_should_end_transaction+0x42/0x60 [btrfs]
>btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
>btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
>evict+0xc6/0x190
>do_unlinkat+0x19c/0x300
>do_syscall_64+0x74/0x140
>entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>   RIP: 0033:0x7fbf589c57a7
> 
> To fix this, introduce a new flag "is_system" to head_ref structs,
> which is populated at insertion time. This allows to decouple the
> querying for the spaceinfo from querying the possibly deleted bg.
> 
> Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned 
> accounting")
> Suggested-by: Omar Sandoval 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> v2: 
>  * Collapsed the if/else logic in cleanup_ref_head
> 
>  fs/btrfs/delayed-ref.c | 19 ++-
>  fs/btrfs/delayed-ref.h |  1 +
>  fs/btrfs/extent-tree.c | 18 +-
>  3 files changed, 28 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 5ac2d7909782..20fa7b4c132d 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -550,8 +550,10 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>struct btrfs_delayed_ref_head *head_ref,
>struct btrfs_qgroup_extent_record *qrecord,
>u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
> -  int action, int is_data, int *qrecord_inserted_ret,
> +  int action, int is_data, int is_system,
> +  int *qrecord_inserted_ret,
>int *old_ref_mod, int *new_ref_mod)
> +
>  {
>   struct btrfs_delayed_ref_head *existing;
>   struct btrfs_delayed_ref_root *delayed_refs;
> @@ -595,6 +597,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
>   head_ref->ref_mod = count_mod;
>   head_ref->must_insert_reserved = must_insert_reserved;
>   head_ref->is_data = is_data;
> + head_ref->is_system = is_system;
>   head_ref->ref_tree = RB_ROOT;
>   INIT_LIST_HEAD(_ref->ref_add_list);
>   RB_CLEAR_NODE(_ref->href_node);
> @@ -782,6 +785,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
> *fs_info,
>   struct btrfs_delayed_ref_root *delayed_refs;
>   struct btrfs_qgroup_extent_record *record = NULL;
>   int qrecord_inserted;
> + int is_system = ref_root == BTRFS_CHUNK_TREE_OBJECTID;
>  
>   BUG_ON(extent_op && extent_op->is_data);
>   ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
> @@ -810,8 +814,8 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
> *fs_info,
>*/
>   head_ref = add_delayed_ref_head(fs_info, trans, head_ref, record,
>   bytenr, num_bytes, 0, 0, action, 0,
> - _inserted, old_ref_mod,
> - new_ref_mod);
> + is_system, _inserted,
> + old_ref_mod, new_ref_mod);
>  
>   add_delayed_tree_ref(fs_info, trans, head_ref, >node, bytenr,
>num_bytes, parent, ref_root, level, action);
> @@ -878,7 +882,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
> *fs_info,
>*/
>   head_ref = add_delayed_ref_head(fs_info, trans, head_ref, record,
>   

Re: [PATCH 4/4] btrfs: Rewrite retry logic in do_chunk_alloc

2018-04-16 Thread Nikolay Borisov


On 16.04.2018 21:53, David Sterba wrote:
> On Wed, Apr 11, 2018 at 11:21:20AM +0300, Nikolay Borisov wrote:
>> do_chunk_alloc implements logic to detect whether there is currently
>> pending chunk allocation  (by means of space_info->chunk_alloc being
>> set) and if so it loops around to the 'again' label. Additionally,
>> based on the state of the space_info (e.g. whether it's full or not)
>> and the return value of should_alloc_chunk() it decides whether this
>> is a "hard" error (ENOSPC) or we can just return 0.
>>
>> This patch refactors all of this:
>>
>> 1. Put order to the scattered ifs handling the various cases in an
>> easy-to-read if {} else if{} branches. This makes clear the various
>> cases we are interested in handling.
>>
>> 2. Call should_alloc_chunk only once and use the result in the
>> if/else if constructs. All of this is done under space_info->lock, so
>> even before multiple calls of should_alloc_chunk were unnecessary.
>>
>> 3. Rewrite the "do {} while()" loop currently implemented via label
>> into an explicit loop construct.
>>
>> 4. Move the mutex locking outside of the while loop and remove the
>> comment. The comment in fact was misleading since in the old code if the
>> mutex is acquired and we don't need to loop again (wait_for_alloc) is
>> 0 then we don't recheck if we need to allocate (the comment was stating
>> the opposite).
> 
> The comment seems correct to me. It is referring to the actual
> allocation that happens in paralell (a few lines below calling
> btrfs_alloc_chunk). So the first part attempts to determine if we should
> wait, and if yes, then take the mutex and block until the other
> allocation finishes.
> 
> This is completely mising in the new code so the loop is only blocking
> on the space info spinlock, but not going to sleep. Busy looping.

With your explanation the code now makes a bit more sense and indeed
I've missed that. Will send an updated patch.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/15] Review uuid_mutex usage

2018-04-16 Thread David Sterba
On Thu, Apr 12, 2018 at 10:29:23AM +0800, Anand Jain wrote:
> uuid_mutex lock is not a per-fs lock but a global lock. The main aim of
> this patch-set is to critically review the usage of this lock, and delete
> the unnecessary once. By doing this we improve the concurrency of
> device operations across multiple btrfs filesystems is in the system.
> 
> patch 1: Was sent before, I am including it here, as its about uuid_mutex.
> 
> patch 2-9: Are cleanup and or preparatory patches.
> 
> patch 10-14: Drops the uuid_mutex and makes sure there is enough lock,
> as discussed in the patch change log.
> 
> patch 15: A generic cleanup patch around functions in the same context.
> 
> These patches are on top of
>   https://github.com/kdave/btrfs-devel.git remove-volume-mutex
> And it will be a good idea to go along with the kill-volume-mutex patches.
> 
> This is tested with xfstests and there are no _new_ regression. And I am
> trying to understand the old regressions, and notice that they are
> inconsistent.

The series looks good to me, I don't see any obvious breakage when the
uuid mutex is being removed, most of the device list consistency is
guarded by the device_list_mutex.

It indeed looks like uuid_mutex is overused and I've noticed that the
lock description in volumes.c is not entirely correct as it says that
eg. the device count and count of rw devices is proteced by that lock.

This would be good to update too, so it matches the code.

I'll add the branch to for-next to expose it to more testing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel unaligned access at ... btrfs_real_readdir+0x51c/0x718 [btrfs]

2018-04-16 Thread David Sterba
On Mon, Apr 16, 2018 at 09:55:45PM +0200, René Rebe wrote:
> Hi,
> 
> On 04/16/2018 06:48 PM, David Sterba wrote:
> > The warnings are valid, there's unaligned access introduced by patch
> > 
> > 23b5ec74943f44378b68c0edd8e210a86318ea5e
> > btrfs: fix readdir deadlock with pagefault
> > 
> > The directory entries (struct dir_entry) are copied to a temporary
> > buffer as they fit, ie. no alignment, and the members accessed in
> > several places.
> > 
> > The following patch adds the proper unaligned access, only compile-tested.
> > Please test and let me know, thanks!
> Would have loved to immediately give it a try, however, sorry,
> I forgot to mention I'm on the latest stable release -4.16.2-
> on a first glance this does not look like it does just apply.
> 
> I would re-base myself if I would not also have a glibc initialization 
> bug to hunt and debug, too :-/
> 
> If you happen to also rebase it for current -stable, ... ;-)

Sure, attached a 4.16.2 version.
>From 4df58593a5a42c632f1c18ced3d6fae2196e29a9 Mon Sep 17 00:00:00 2001
From: David Sterba 
Date: Mon, 16 Apr 2018 21:10:14 +0200
Subject: [PATCH] test readdir unaligned access

---
 fs/btrfs/inode.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c7b75dd58fad..c2df7b158820 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -5951,11 +5952,13 @@ static int btrfs_filldir(void *addr, int entries, 
struct dir_context *ctx)
struct dir_entry *entry = addr;
char *name = (char *)(entry + 1);
 
-   ctx->pos = entry->offset;
-   if (!dir_emit(ctx, name, entry->name_len, entry->ino,
- entry->type))
+   ctx->pos = get_unaligned(>offset);
+   if (!dir_emit(ctx, name, get_unaligned(>name_len),
+   get_unaligned(>ino),
+   get_unaligned(>type)))
return 1;
-   addr += sizeof(struct dir_entry) + entry->name_len;
+   addr += sizeof(struct dir_entry) +
+   get_unaligned(>name_len);
ctx->pos++;
}
return 0;
@@ -6045,14 +6048,15 @@ static int btrfs_real_readdir(struct file *file, struct 
dir_context *ctx)
}
 
entry = addr;
-   entry->name_len = name_len;
+   put_unaligned(name_len, >name_len);
name_ptr = (char *)(entry + 1);
read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
   name_len);
-   entry->type = btrfs_filetype_table[btrfs_dir_type(leaf, di)];
+   put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
+   >type);
btrfs_dir_item_key_to_cpu(leaf, di, );
-   entry->ino = location.objectid;
-   entry->offset = found_key.offset;
+   put_unaligned(location.objectid, >ino);
+   put_unaligned(found_key.offset,>offset);
entries++;
addr += sizeof(struct dir_entry) + name_len;
total_len += sizeof(struct dir_entry) + name_len;
-- 
2.16.2



Re: [PATCH] btrfs: Do super block verification before writing it to disk

2018-04-16 Thread David Sterba
On Mon, Apr 16, 2018 at 09:00:38PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年04月16日 20:55, Anand Jain wrote:
> > 
> > 
> > On 04/16/2018 10:02 AM, Qu Wenruo wrote:
> >> There are already 2 reports about strangely corrupted super blocks,
> >> where csum type and incompat flags get some obvious garbage, but csum
> >> still matches and all other vitals are correct.
> >>
> >> This normally means some kernel memory corruption happens, although the
> >> cause is unknown, at least detect it and prevent further corruption.
> >>
> >> Signed-off-by: Qu Wenruo 
> >> ---
> > 
> >  Can you help point those 2 cases here?
> 
> Did you mean the reported-by tags?
> If so, no problem.

Yes please, also describe what got corrupted in the changelog, besides
the links to the bugreports.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Do super block verification before writing it to disk

2018-04-16 Thread David Sterba
On Mon, Apr 16, 2018 at 10:02:27AM +0800, Qu Wenruo wrote:
> There are already 2 reports about strangely corrupted super blocks,
> where csum type and incompat flags get some obvious garbage, but csum
> still matches and all other vitals are correct.
> 
> This normally means some kernel memory corruption happens, although the
> cause is unknown, at least detect it and prevent further corruption.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/disk-io.c | 24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 23803102aa0d..10d814f03f13 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -68,7 +68,8 @@
>  static const struct extent_io_ops btree_extent_io_ops;
>  static void end_workqueue_fn(struct btrfs_work *work);
>  static void free_fs_root(struct btrfs_root *root);
> -static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info);
> +static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
> +struct btrfs_super_block *sb);
>  static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
>  static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
> struct btrfs_fs_info *fs_info);
> @@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
>  
>   memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
>  
> - ret = btrfs_check_super_valid(fs_info);
> + ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
>   if (ret) {
>   btrfs_err(fs_info, "superblock contains fatal errors");
>   err = -EINVAL;
> @@ -3575,6 +3576,21 @@ int write_all_supers(struct btrfs_fs_info *fs_info, 
> int max_mirrors)
>   sb = fs_info->super_for_commit;
>   dev_item = >dev_item;
>  
> + /* Do extra check on the sb to be written */
> + ret = btrfs_check_super_valid(fs_info, sb);
> + if (ret) {
> + btrfs_err(fs_info, "fatal superblock corrupted detected");
> + return -EUCLEAN;
> + }
> + /*
> +  * Unknown incompat flags can't be mounted, so newly developed flags
> +  * means corruption
> +  */
> + if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
> + btrfs_err(fs_info, "fatal superblock corrupted detected");

The error messages could state that the corruption is detected at the
pre-commit time. Otherwise it's a good idea to do the checks, they're
lighweight.

> + return -EUCLEAN;
> + }
> +
>   mutex_lock(_info->fs_devices->device_list_mutex);
>   head = _info->fs_devices->devices;
>   max_errors = btrfs_super_num_devices(fs_info->super_copy) - 1;
> @@ -3985,9 +4001,9 @@ int btrfs_read_buffer(struct extent_buffer *buf, u64 
> parent_transid, int level,
> level, first_key);
>  }
>  
> -static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info)
> +static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
> +struct btrfs_super_block *sb)
>  {
> - struct btrfs_super_block *sb = fs_info->super_copy;
>   u64 nodesize = btrfs_super_nodesize(sb);
>   u64 sectorsize = btrfs_super_sectorsize(sb);
>   int ret = 0;
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] btrfs: Rewrite retry logic in do_chunk_alloc

2018-04-16 Thread David Sterba
On Wed, Apr 11, 2018 at 11:21:20AM +0300, Nikolay Borisov wrote:
> do_chunk_alloc implements logic to detect whether there is currently
> pending chunk allocation  (by means of space_info->chunk_alloc being
> set) and if so it loops around to the 'again' label. Additionally,
> based on the state of the space_info (e.g. whether it's full or not)
> and the return value of should_alloc_chunk() it decides whether this
> is a "hard" error (ENOSPC) or we can just return 0.
> 
> This patch refactors all of this:
> 
> 1. Put order to the scattered ifs handling the various cases in an
> easy-to-read if {} else if{} branches. This makes clear the various
> cases we are interested in handling.
> 
> 2. Call should_alloc_chunk only once and use the result in the
> if/else if constructs. All of this is done under space_info->lock, so
> even before multiple calls of should_alloc_chunk were unnecessary.
> 
> 3. Rewrite the "do {} while()" loop currently implemented via label
> into an explicit loop construct.
> 
> 4. Move the mutex locking outside of the while loop and remove the
> comment. The comment in fact was misleading since in the old code if the
> mutex is acquired and we don't need to loop again (wait_for_alloc) is
> 0 then we don't recheck if we need to allocate (the comment was stating
> the opposite).

The comment seems correct to me. It is referring to the actual
allocation that happens in paralell (a few lines below calling
btrfs_alloc_chunk). So the first part attempts to determine if we should
wait, and if yes, then take the mutex and block until the other
allocation finishes.

This is completely mising in the new code so the loop is only blocking
on the space info spinlock, but not going to sleep. Busy looping.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] btrfs: Consolidate error checking for btrfs_alloc_chunk

2018-04-16 Thread David Sterba
On Wed, Apr 11, 2018 at 11:21:19AM +0300, Nikolay Borisov wrote:
> The second if is really a subcase of ret being less than 0. So
> introduce a generic if (ret < 0) check, and inside have another if
> which explicitly handles the -ENOSPC and any other errors. No
> functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] btrfs: Fix lock release order

2018-04-16 Thread David Sterba
On Wed, Apr 11, 2018 at 11:21:18AM +0300, Nikolay Borisov wrote:
> Locks should generally be released in the oppposite order they are
> acquired. Generally lock acquisiton ordering is used to ensure
> deadlocks don't happen. However, as becomes more complicated it's
> best to also maintain proper unlock order so as to avoid possible dead
> locks. This was found by code inspection and doesn't necessarily lead
> to a deadlock scenario.
> 
> Signed-off-by: Nikolay Borisov 
Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] btrfs: Use while loop instead of labels in __endio_write_update_ordered

2018-04-16 Thread David Sterba
On Wed, Apr 11, 2018 at 11:21:17AM +0300, Nikolay Borisov wrote:
> Currently __endio_write_update_ordered uses labels to implement
> what is essentially a simple while loop. This makes the code more
> cumbersome to follow than it actually has to be. No functional
> changes. No xfstest regressions were found during testing.
> 
> Signed-off-by: Nikolay Borisov 
Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/4] btrfs: Refactor btrfs_ioctl_snap_destroy() by using btrfs_delete_subvolume()

2018-04-16 Thread David Sterba
On Wed, Apr 11, 2018 at 02:20:49PM +0900, Misono Tomohiro wrote:
> Use btrfs_delete_subvolume() to refactor btrfs_ioctl_snap_destroy().
> The permission check is still done in btrfs_ioctl_snap_destroy(). Also,
> call of d_delete() is still required since btrfs_delete_subvolume()
> does not call it.
> 
> As a result, btrfs_unlink_subvol() and may_destroy_subvol()
> become static functions. No functional change happens.
> 
> Signed-off-by: Tomohiro Misono 

Why is this still split into two patches? Factoring out a function
should happen in one patch, ie 2 and 3 in one go. Do you have a reason
not to do it like that?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel unaligned access at ... btrfs_real_readdir+0x51c/0x718 [btrfs]

2018-04-16 Thread René Rebe

Hi,

On 04/16/2018 06:48 PM, David Sterba wrote:

The warnings are valid, there's unaligned access introduced by patch

23b5ec74943f44378b68c0edd8e210a86318ea5e
btrfs: fix readdir deadlock with pagefault

The directory entries (struct dir_entry) are copied to a temporary
buffer as they fit, ie. no alignment, and the members accessed in
several places.

The following patch adds the proper unaligned access, only compile-tested.
Please test and let me know, thanks!

Would have loved to immediately give it a try, however, sorry,
I forgot to mention I'm on the latest stable release -4.16.2-
on a first glance this does not look like it does just apply.

I would re-base myself if I would not also have a glibc initialization 
bug to hunt and debug, too :-/


If you happen to also rebase it for current -stable, ... ;-)

--
  René Rebe, ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
  https://exactcode.com | https://t2sde.org | https://rene.rebe.de
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add udev-md-raid-safe-timeouts.rules

2018-04-16 Thread Alan Stern
On Mon, 16 Apr 2018, Chris Murphy wrote:

> Adding linux-usb@ and linux-scsi@
> (This email does contain the thread initiating email, but some replies
> are on the other lists.)
> 
> On Mon, Apr 16, 2018 at 5:43 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2018-04-15 21:04, Chris Murphy wrote:
> >>
> >> I just ran into this:
> >>
> >> https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec
> >>
> >> This solution is inadequate, can it be made more generic? This isn't
> >> an md specific problem, it affects Btrfs and LVM as well. And in fact
> >> raid0, and even none raid setups.
> >>
> >> There is no good reason to prevent deep recovery, which is what
> >> happens with the default command timer of 30 seconds, with this class
> >> of drive. Basically that value is going to cause data loss for the
> >> single device and also raid0 case, where the reset happens before deep
> >> recovery has a chance. And even if deep recovery fails to return user
> >> data, what we need to see is the proper error message: read error UNC,
> >> rather than a link reset message which just obfuscates the problem.
> >
> >
> > This has been discussed at least once here before (probably more times, hard
> > to be sure since it usually comes up as a side discussion in an only
> > marginally related thread).  Last I knew, the consensus here was that it
> > needs to be changed upstream in the kernel, not by adding a udev rule
> > because while the value is technically system policy, the default policy is
> > brain-dead for anything but the original disks it was i9ntended for (30
> > seconds works perfectly fine for actual SCSI devices because they behave
> > sanely in the face of media errors, but it's horribly inadequate for ATA
> > devices).
> >
> > To re-iterate what I've said before on the subject:
> >
> > For ATA drives it should probably be 150 seconds.  That's 30 seconds beyond
> > the typical amount of time most consumer drives will keep retrying a sector,
> > so even if it goes the full time to try and recover a sector this shouldn't
> > trigger.  The only people this change should negatively impact are those who
> > have failing drives which support SCT ERC and have it enabled, but aren't
> > already adjusting this timeout.
> >
> > For physical SCSI devices, it should continue to be 30 seconds.  SCSI disks
> > are sensible here and don't waste your time trying to recover a sector.  For
> > PV-SCSI devices, it should probably be adjusted too, but I don't know what a
> > reasonable value is.
> >
> > For USB devices it should probably be higher than 30 seconds, but again I
> > have no idea what a reasonable value is.
> 
> I don't know how all of this is designed but it seems like there's
> only one location for the command timer, and the SCSI driver owns it,
> and then everyone else (ATA and USB and for all I know SAN) are on top
> of that and lack any ability to have separate timeouts.

As far as mass-storage is concerned, USB is merely a transport.  It 
doesn't impose any timeout rules; the appropriate timeout value is 
whatever the device at the end of the USB link needs.  Thus, a SCSI 
drive connected over USB could use a 30-second timeout, an ATA drive 
could use 150 seconds, and so on.

Unfortunately, the only way to tell what sort of drive you've got is by
looking at the Vendor/Product IDs or other information provided by the
drive itself.  You can't tell anything just from knowing what sort of
bus it's on.

Alan Stern

> The nice thing about the udev rule is that it tests for SCT ERC before
> making a change. There certainly are enterprise and almost enterprise
> "NAS" SATA drives that have short SCT ERC times enabled out of the box
> - and the udev method makes them immune to the change.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-04-16 Thread David Sterba
On Mon, Apr 16, 2018 at 04:53:10PM +0900, Misono Tomohiro wrote:
> On 2018/04/12 22:13, David Sterba wrote:
> > On Tue, Apr 03, 2018 at 03:30:34PM +0800, Qu Wenruo wrote:
> >> I didn't see this patch merged in your misc-next branch but only the
> >> remaining patches.
> >>
> >> However without this patch, btrfs qgroup reserved space will get
> >> obviously increased as prealloc metadata reserved space is never freed
> >> until inode reserved space is freed.
> >>
> >> This would cause a lot of qgroup related test cases to fail.
> >>
> >> Would you please merge this patch with all qgroup patchset?
> > 
> > For the record: patch 9 and 10 are now in misc next and will go to 4.17.
> > I need to let them through the for-next and other testing, so it will be
> > some of the post rc1 pull requests.
> 
> I still saw qgroup related xfstests (btrfs/022 etc.) fails in current 
> misc-next branch
> which includes 9th and 10th patches.
> 
> I noticed 5th patch in the tree (already in v4.17-rc1) is older version and 
> this
> seems the reason that the tests still fails.

Qu, can you please have a look a send an incremental fixup? Handling of
this patchset was not very good from my side, I should have asked for a
fresh resend as I applied the changes from mailinglist and not from git.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add udev-md-raid-safe-timeouts.rules

2018-04-16 Thread Chris Murphy
Adding linux-usb@ and linux-scsi@
(This email does contain the thread initiating email, but some replies
are on the other lists.)

On Mon, Apr 16, 2018 at 5:43 AM, Austin S. Hemmelgarn
 wrote:
> On 2018-04-15 21:04, Chris Murphy wrote:
>>
>> I just ran into this:
>>
>> https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec
>>
>> This solution is inadequate, can it be made more generic? This isn't
>> an md specific problem, it affects Btrfs and LVM as well. And in fact
>> raid0, and even none raid setups.
>>
>> There is no good reason to prevent deep recovery, which is what
>> happens with the default command timer of 30 seconds, with this class
>> of drive. Basically that value is going to cause data loss for the
>> single device and also raid0 case, where the reset happens before deep
>> recovery has a chance. And even if deep recovery fails to return user
>> data, what we need to see is the proper error message: read error UNC,
>> rather than a link reset message which just obfuscates the problem.
>
>
> This has been discussed at least once here before (probably more times, hard
> to be sure since it usually comes up as a side discussion in an only
> marginally related thread).  Last I knew, the consensus here was that it
> needs to be changed upstream in the kernel, not by adding a udev rule
> because while the value is technically system policy, the default policy is
> brain-dead for anything but the original disks it was i9ntended for (30
> seconds works perfectly fine for actual SCSI devices because they behave
> sanely in the face of media errors, but it's horribly inadequate for ATA
> devices).
>
> To re-iterate what I've said before on the subject:
>
> For ATA drives it should probably be 150 seconds.  That's 30 seconds beyond
> the typical amount of time most consumer drives will keep retrying a sector,
> so even if it goes the full time to try and recover a sector this shouldn't
> trigger.  The only people this change should negatively impact are those who
> have failing drives which support SCT ERC and have it enabled, but aren't
> already adjusting this timeout.
>
> For physical SCSI devices, it should continue to be 30 seconds.  SCSI disks
> are sensible here and don't waste your time trying to recover a sector.  For
> PV-SCSI devices, it should probably be adjusted too, but I don't know what a
> reasonable value is.
>
> For USB devices it should probably be higher than 30 seconds, but again I
> have no idea what a reasonable value is.

I don't know how all of this is designed but it seems like there's
only one location for the command timer, and the SCSI driver owns it,
and then everyone else (ATA and USB and for all I know SAN) are on top
of that and lack any ability to have separate timeouts.

The nice thing about the udev rule is that it tests for SCT ERC before
making a change. There certainly are enterprise and almost enterprise
"NAS" SATA drives that have short SCT ERC times enabled out of the box
- and the udev method makes them immune to the change.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel unaligned access at ... btrfs_real_readdir+0x51c/0x718 [btrfs]

2018-04-16 Thread David Sterba
On Mon, Apr 16, 2018 at 04:52:16PM +0200, René Rebe wrote:
> I just installed the latest #t2sde test build on a sparc64 system with
> btrfs rootfs - you know, just for the fun of testing, and in contrast
> to my x86 and ppc systems I get tons of unaligned access warnings, in
> the form of:
> 
> [0.00] Btrfs loaded, crc32c=crc32c-generic
> [0.00] BTRFS: device fsid c84acf49-452a-4ed3-ae86-576975967db5 devid 
> 1 transid 42 /dev/sda3
> [0.00] BTRFS info (device sda3): disk space caching is enabled
> [0.00] BTRFS info (device sda3): has skinny extents
> [0.00] Kernel unaligned access at TPC[102f3080] 
> btrfs_real_readdir+0x51c/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f30dc] 
> btrfs_real_readdir+0x578/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f313c] 
> btrfs_real_readdir+0x5d8/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f31a0] 
> btrfs_real_readdir+0x63c/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f3080] 
> btrfs_real_readdir+0x51c/0x718 [btrfs]
> [0.00] log_unaligned: 38289 callbacks suppressed
> [0.00] Kernel unaligned access at TPC[102f3080] 
> btrfs_real_readdir+0x51c/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f30dc] 
> btrfs_real_readdir+0x578/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f313c] 
> btrfs_real_readdir+0x5d8/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f31a0] 
> btrfs_real_readdir+0x63c/0x718 [btrfs]
> [0.00] Kernel unaligned access at TPC[102f3080] 
> btrfs_real_readdir+0x51c/0x718 [btrfs]
> [0.00] log_unaligned: 112774 callbacks suppressed
> 
> in case some someone likes to address those, or has a quick patch to
> test, … just let me know ;-)

The warnings are valid, there's unaligned access introduced by patch

23b5ec74943f44378b68c0edd8e210a86318ea5e
btrfs: fix readdir deadlock with pagefault

The directory entries (struct dir_entry) are copied to a temporary
buffer as they fit, ie. no alignment, and the members accessed in
several places.

The following patch adds the proper unaligned access, only compile-tested.
Please test and let me know, thanks!

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e064c49c9a9a..f958eb686462 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -5905,11 +5906,13 @@ static int btrfs_filldir(void *addr, int entries, 
struct dir_context *ctx)
struct dir_entry *entry = addr;
char *name = (char *)(entry + 1);
 
-   ctx->pos = entry->offset;
-   if (!dir_emit(ctx, name, entry->name_len, entry->ino,
- entry->type))
+   ctx->pos = get_unaligned(>offset);
+   if (!dir_emit(ctx, name, get_unaligned(>name_len),
+   get_unaligned(>ino),
+   get_unaligned(>type)))
return 1;
-   addr += sizeof(struct dir_entry) + entry->name_len;
+   addr += sizeof(struct dir_entry) +
+   get_unaligned(>name_len);
ctx->pos++;
}
return 0;
@@ -5999,14 +6002,15 @@ static int btrfs_real_readdir(struct file *file, struct 
dir_context *ctx)
}
 
entry = addr;
-   entry->name_len = name_len;
+   put_unaligned(name_len, >name_len);
name_ptr = (char *)(entry + 1);
read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
   name_len);
-   entry->type = btrfs_filetype_table[btrfs_dir_type(leaf, di)];
+   put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
+   >type);
btrfs_dir_item_key_to_cpu(leaf, di, );
-   entry->ino = location.objectid;
-   entry->offset = found_key.offset;
+   put_unaligned(location.objectid, >ino);
+   put_unaligned(found_key.offset, >offset);
entries++;
addr += sizeof(struct dir_entry) + name_len;
total_len += sizeof(struct dir_entry) + name_len;
---
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs fails to mount after power outage

2018-04-16 Thread Tom Vincent
On 12 April 2018 at 00:25, Qu Wenruo  wrote:
> I'm curious about what's the underlying disk?

It's an Samsung PM951 NVMe SSD.

> Is it plain physical device? Or have other layers like bcache/lvm?

btrfs on LUKS

>> btrfs check
> Full output please.

https://gist.githubusercontent.com/tlvince/acf51b37622c216e1c33cdc3dfbd321f/raw/d0237948bbffacd4bb8d53fdfa5f23391416c1e2/btrfs-check.txt

> For transid error, btrfs check --repair can fix it, but only do it when
> that's the only problem.

I ran this (for ~12+ hours) to no avail; it appears to have been
looping around "Btree for root 259 is fixed". I grew impatient and
SIGINT-ed, which unsurprisingly toasted the file system once and for
all (I rebuilt from backups at that point).

Full output:

https://gist.githubusercontent.com/tlvince/8060c19526aa011b0baff2b12e3873fd/raw/ecc43bd9dc7b352e490aa0bf0deac368af04e117/btrfs-check-repair.txt

Note, the system was fine for a few days after zero-log (before check
--repair), but then hit the same transid error at boot.

On 13 April 2018 at 06:46, Duncan <1i5t5.dun...@cox.net> wrote:
> What mount options?  In particular, is the discard option used (and of
> course I'm assuming nothing as insane as nobarrier)?

noatime,compress=lzo

... as well as some defaults: rw,noatime,compress=lzo,ssd,space_cache
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add udev-md-raid-safe-timeouts.rules

2018-04-16 Thread Wol's lists

On 16/04/18 12:43, Austin S. Hemmelgarn wrote:

On 2018-04-15 21:04, Chris Murphy wrote:

I just ran into this:
https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec 



This solution is inadequate, can it be made more generic? This isn't
an md specific problem, it affects Btrfs and LVM as well. And in fact
raid0, and even none raid setups.

There is no good reason to prevent deep recovery, which is what
happens with the default command timer of 30 seconds, with this class
of drive. Basically that value is going to cause data loss for the
single device and also raid0 case, where the reset happens before deep
recovery has a chance. And even if deep recovery fails to return user
data, what we need to see is the proper error message: read error UNC,
rather than a link reset message which just obfuscates the problem.


This has been discussed at least once here before (probably more times, 
hard to be sure since it usually comes up as a side discussion in an 
only marginally related thread). 


Sorry, but where is "here"? This message is cross-posted to about three 
lists at least ...


 Last I knew, the consensus here was
that it needs to be changed upstream in the kernel, not by adding a udev 
rule because while the value is technically system policy, the default 
policy is brain-dead for anything but the original disks it was 
i9ntended for (30 seconds works perfectly fine for actual SCSI devices 
because they behave sanely in the face of media errors, but it's 
horribly inadequate for ATA devices).


To re-iterate what I've said before on the subject:

imho (and it's probably going to be a pain to implement :-) there should 
be a soft time-out and a hard time-out. The soft time-out should trigger 
"drive is taking too long to respond" messages that end up in a log - so 
that people who actually care can keep a track of this sort of thing. 
The hard timeout should be the current set-up, where the kernel just 
gives up.


Cheers,
Wol
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add udev-md-raid-safe-timeouts.rules

2018-04-16 Thread Roger Heflin
And then there are SAN devices managed by multipath, were the timeouts
should maybe even lower.  I know in the scsi layer there are some
extra retries going on and that that actual timeout hits at 5x the
base timeout.  So there kind of is a soft timeout on SAN devices at
the base timeout.Though there are no messages before the 5x
timeout message indicating that the system is having issues at least
in the SAN case.

For mdraid it should almost be a parameter defined on the md-device to
override the timeout since one could have some disks with ERC and some
without.  Multipath.conf has a setting fast_io_fail_tmo that is
supposed to set the scsi timeout to that value if set.

On Mon, Apr 16, 2018 at 10:02 AM, Wol's lists  wrote:
> On 16/04/18 12:43, Austin S. Hemmelgarn wrote:
>>
>> On 2018-04-15 21:04, Chris Murphy wrote:
>>>
>>> I just ran into this:
>>>
>>> https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec
>>>
>>> This solution is inadequate, can it be made more generic? This isn't
>>> an md specific problem, it affects Btrfs and LVM as well. And in fact
>>> raid0, and even none raid setups.
>>>
>>> There is no good reason to prevent deep recovery, which is what
>>> happens with the default command timer of 30 seconds, with this class
>>> of drive. Basically that value is going to cause data loss for the
>>> single device and also raid0 case, where the reset happens before deep
>>> recovery has a chance. And even if deep recovery fails to return user
>>> data, what we need to see is the proper error message: read error UNC,
>>> rather than a link reset message which just obfuscates the problem.
>>
>>
>> This has been discussed at least once here before (probably more times,
>> hard to be sure since it usually comes up as a side discussion in an only
>> marginally related thread).
>
>
> Sorry, but where is "here"? This message is cross-posted to about three
> lists at least ...
>
>  Last I knew, the consensus here was
>>
>> that it needs to be changed upstream in the kernel, not by adding a udev
>> rule because while the value is technically system policy, the default
>> policy is brain-dead for anything but the original disks it was i9ntended
>> for (30 seconds works perfectly fine for actual SCSI devices because they
>> behave sanely in the face of media errors, but it's horribly inadequate for
>> ATA devices).
>>
>> To re-iterate what I've said before on the subject:
>>
> imho (and it's probably going to be a pain to implement :-) there should be
> a soft time-out and a hard time-out. The soft time-out should trigger "drive
> is taking too long to respond" messages that end up in a log - so that
> people who actually care can keep a track of this sort of thing. The hard
> timeout should be the current set-up, where the kernel just gives up.
>
> Cheers,
> Wol
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symlink not persisted even after fsync

2018-04-16 Thread Vijay Chidambaram
On Mon, Apr 16, 2018 at 12:39 AM, Amir Goldstein  wrote:
> On Mon, Apr 16, 2018 at 3:10 AM, Vijay Chidambaram  
> wrote:
> [...]
>> Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>>
>> In this case, its not a symlink; its a hard link, so I would say the
>> link count for foo should be 2. But btrfs and F2FS show link count of
>> 1 after a crash.
>>
>
> That sounds like a clear bug - nlink is metadata of inode foo, so
> should be made persistent by fsync(foo).

This is what we think as well. We have posted this as a separate
thread to confirm this with other btrfs developers.

> For non-journaled fs you would need to fsync(A) to guarantee
> seeing A/bar after crash, but for a journaled fs, if you didn't see
> A/bar after crash and did see nlink 2 on foo then you would get
> a filesystem inconsistency, so practically, fsync(foo) takes care
> of persisting A/bar entry as well. But as you already understand,
> these rules have not been formalized by a standard, instead, they
> have been "formalized" by various fsck.* tools.

I don't think fsck tools are very useful here: fsck could return the
file system to an empty state, and that would still be consistent.
fsck makes no guarantees about data loss. I think fsck is allow to
truncate files, remove directory entries etc. which could lead to data
loss.

But I agree the guarantees haven't been formalized.

> Allow me to suggest a different framing for CrashMonkey.
> You seem to be engaging in discussions with the community
> about whether X behavior is a bug or not and as you can see
> the answer depends on the filesystem (and sometimes on the
> developer). Instead, you could declare that CrashMonkey
> is a "Certification tool" to certify filesystems to a certain
> crash consistency behavior. Then you can discuss with the
> community about specific models that CrashMonkey should
> be testing. The model describes the implicit dependencies
> and ordering guaranties between operations.
> Dave has mentioned the "strictly ordered metadata" model.
> I do not know of any formal definition of this model for filesystems,
> but you can take a shot at starting one and encoding it into
> CrashMonkey. This sounds like a great paper to me.

This is a great idea! We will be submitting the basic CrashMonkey
paper soon, so I don't know if we have enough time to do this.
Currently, we just explicitly say this behavior is supported by ext4,
but not btrfs, etc. So the bugs are file-system specific. But we would
definitely consider doing this in the future.

Btw, such models are what we introduced in the ALICE paper that Ted
had mentioned before. We called them "Abstract Persistence Models",
but it was essentially the same idea.

> I don't know if Btrfs and f2fs will qualify as "strictly ordered
> metadata" and I don't know if they would want to qualify.
> Mind you a filesystem can be crash consistent without
> following "strictly ordered metadata". In fact, in many cases
> "strictly ordered metadata" imposes performance penalty by
> coupling together unrelated metadata updates (e.g. create
> A/a and create B/b), but it is also quite hard to decouple them
> because future operation can create a dependency (e.g.
> mv A/a B/b).

I agree that total ordering might lead to performance loss. I'm not
advocating for btrfs/F2FS to be totally ordered; I merely want them to
be clear about what guarantees they do provide.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: don't bug_on with enomem in __clear_state_bit

2018-04-16 Thread Josef Bacik
On Sat, Apr 14, 2018 at 02:49:52AM +0200, David Sterba wrote:
> On Fri, Apr 13, 2018 at 04:28:55PM -0400, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > Since we're allocating under atomic we could every easily enomem, so if
> > that's the case and we can block then loop around and try to allocate
> > the prealloc not under a lock.
> > 
> > We also saw this happen during try_to_release_page in production, in
> > which case it's completely valid to return ENOMEM so we can tell
> > try_to_release_page that we can't release this page.
> > 
> > Signed-off-by: Josef Bacik 
> 
> Exactly same patch as
> 
> https://patchwork.kernel.org/patch/10053319/
> 
> so the same comment applies.

Moving the bugon just makes the same problem happen again.
try_to_release_page() will call with whatever arbitrary gfp mask it has, so if
it's GFP_ATOMIC we _want_ it to return -ENOMEM.  Everybody else that calls that
doesn't check the return value calls with a gfp mask that's will allow retrying
the allocation.  If they fail it's because the box is super out of memory and
we'll have larger problems than us not handling the case properly.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symlink not persisted even after fsync

2018-04-16 Thread Vijay Chidambaram
On Mon, Apr 16, 2018 at 12:52 AM, Theodore Y. Ts'o  wrote:
> On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
>>
>> I don't think this is what the paper's ext3-fast does. All the paper
>> says is if you have a file system where the fsync of a file persisted
>> only data related to that file, it would increase performance.
>> ext3-fast is the name given to such a file system. Note that we do not
>> present a design of ext3-fast or analyze it in any detail. In fact, we
>> explicitly say "The ext3-fast file system (derived from inferences
>> provided by ALICE) seems interesting for application safety, though
>> further investigation is required into the validity of its design."
>
> Well, says that it's based on ext3's data=journal "Abstract Persistent
> Model".  It's true that a design was not proposed --- but if you
> don't propose a design, how do you know what the performance is or
> whether it's even practical?  That's one of those things I find
> extremely distasteful in the paper.  Sure, I can model a faster than
> light interstellar engine ala Star Trek's Warp Drive --- and I can
> talk about it having, say, better performance than a reaction drive.
> But it doesn't tell us anything useful about whether it can be built,
> or whether it's even useful to dream about it.
>
> To me, that part of the paper, really read as, "watch as I wave my
> hands around widely, that they never leave the ends of my arms!"

I partially understand where you are coming from, but your argument
seems to boil down to "don't say anything until you have worked out
every detail". I don't agree with this. Yes, it was speculative, but
we did have a fairly clear disclaimer.

To the point about it being obvious: you might be surprised at how
many people outside this community take it for granted that if you
fsync a file, only that file's contents and metadata will be persisted
:) So it was obvious to you, but truly shocking for many.

Btw, ext3-fast is what led to our CCFS work in FAST 17:
http://www.cs.utexas.edu/~vijay/papers/fast17-c2fs.pdf. In this paper,
we do show that if you divide your application writes into streams, it
is possible to persist only the data/metadata of one stream,
independent of the IO being done in other streams. So as it turned
out, it wasn't an impossible file-system design.

But we digress. I think we both agree that researchers should engage
more with the file-system community.

>
>> Thanks! As I mentioned before, this is useful. I have a follow-up
>> question. Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>>
>> In this case, its not a symlink; its a hard link, so I would say the
>> link count for foo should be 2. But btrfs and F2FS show link count of
>> 1 after a crash.
>
> Well, is the link count accurate?  That is to say, does A/bar exist?
> I would think that the requirement that the file system be self
> consistent is the most important consideration.

There are two ways to look at this.

1. A/bar does not exist, link count is 1, and so it is not a bug.

2. We are calling fsync on the inode when the inode's link count is 2.
So it should persist the inode plus the dependency that is A/bar. The
file system after a crash should show both A/bar and the file with
link count 2. This is what ext4, xfs, and F2FS do.

We've posted separately to figure out what semantics btrfs supports.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Kernel unaligned access at ... btrfs_real_readdir+0x51c/0x718 [btrfs]

2018-04-16 Thread René Rebe
Hi,

I just installed the latest #t2sde test build on a sparc64 system with btrfs 
rootfs - you know, just for the fun of testing, and in contrast to my x86 and 
ppc systems I get tons of unaligned access warnings, in the form of:

[0.00] Btrfs loaded, crc32c=crc32c-generic
[0.00] BTRFS: device fsid c84acf49-452a-4ed3-ae86-576975967db5 devid 1 
transid 42 /dev/sda3
[0.00] BTRFS info (device sda3): disk space caching is enabled
[0.00] BTRFS info (device sda3): has skinny extents
[0.00] Kernel unaligned access at TPC[102f3080] 
btrfs_real_readdir+0x51c/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f30dc] 
btrfs_real_readdir+0x578/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f313c] 
btrfs_real_readdir+0x5d8/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f31a0] 
btrfs_real_readdir+0x63c/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f3080] 
btrfs_real_readdir+0x51c/0x718 [btrfs]
[0.00] log_unaligned: 38289 callbacks suppressed
[0.00] Kernel unaligned access at TPC[102f3080] 
btrfs_real_readdir+0x51c/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f30dc] 
btrfs_real_readdir+0x578/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f313c] 
btrfs_real_readdir+0x5d8/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f31a0] 
btrfs_real_readdir+0x63c/0x718 [btrfs]
[0.00] Kernel unaligned access at TPC[102f3080] 
btrfs_real_readdir+0x51c/0x718 [btrfs]
[0.00] log_unaligned: 112774 callbacks suppressed

in case some someone likes to address those, or has a quick patch to test, … 
just let me know ;-)

Greetings,
René Rebe

-- 
 ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
 http://exactcode.com | http://exactscan.com | http://ocrkit.com | 
http://t2-project.org | http://rene.rebe.de
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Hard link not persisted on fsync

2018-04-16 Thread Jayashree Mohan
Hi,

The following seems to be a crash consistency bug on btrfs, where in
the link count is not persisted even after a fsync on the original
file.

Consider the following workload :
creat foo
link (foo, A/bar)
fsync(foo)
---Crash---

Now, on recovery we expect the metadata of foo to be persisted i.e
have a link count of 2. However in btrfs, the link count is 1 and file
A/bar is not persisted. The expected behaviour would be to persist the
dependencies of inode foo. That is to say, shouldn't fsync of foo
persist A/bar and correctly update the link count?
Note that ext4, xfs and f2fs recover to the correct link count of 2
for the above workload.

Let us know what you think about this behavior.

Thanks,
Jayashree Mohan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Unclear `btrfs subvolume snapshot` command

2018-04-16 Thread Stefan Klinger
Hi,

I'm using

# btrfs version
btrfs-progs v4.16


In btrfs-subvolume(8), I find the description of the `snapshot`
command unclear:

snapshot [-r]  |[/]

I find the preference of `|` is unclear.  From the text passage

If only  is given, the subvolume will be named the basename
of .

I infer, that `` must be there even “if only  is given”.
Also, it's unclear how the cases

snapshot [-r]  
snapshot [-r]  [/]

are told apart.  I assume it's based on whether `` is an
existing directory.  I came to this from the rather unfortunate error
message

ERROR: invalid snapshot name '/'

when trying

# btrfs subvolume snapshot / /foo

where `/foo` already existed.  It took me some time to arrive at the
conclusion that `btrfs` took this as the ` ` case, and
tried to make a subdir `/` under `/foo`, which is, of course, illegal.

May I suggest the following changes:

  * in the manual, change to:
  
snapshot [-r]  
Create a snapshot of the subvolume .

If  is an existing directory, the snapshot will be
named /.  Note that this is
illegal if  is `/`.

If  does not exist, it will be created, and this
will be the snapshot.

  * in `btrfs subvolume`:  Change the error message from

ERROR: invalid snapshot name '/'

to

ERROR: '' exists, but cannot create snapshot named '/'
under it.  Specify explicit snapshot name.

IMHO it would be better to replace an existing  with the
snapshot, instead of creating one under it.  It would solve above
problem, and also be more similar to how `mount` handles a mountpoint.

Thank you,
Stefan


-- 
http://stefan-klinger.de  o/X
Send plain text messages only, not exceeding 32kB./\/
\
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Do super block verification before writing it to disk

2018-04-16 Thread Qu Wenruo


On 2018年04月16日 20:55, Anand Jain wrote:
> 
> 
> On 04/16/2018 10:02 AM, Qu Wenruo wrote:
>> There are already 2 reports about strangely corrupted super blocks,
>> where csum type and incompat flags get some obvious garbage, but csum
>> still matches and all other vitals are correct.
>>
>> This normally means some kernel memory corruption happens, although the
>> cause is unknown, at least detect it and prevent further corruption.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
> 
>  Can you help point those 2 cases here?

Did you mean the reported-by tags?
If so, no problem.

Thanks,
Qu
> 
> Thanks, Anand
> 
> 
>>   fs/btrfs/disk-io.c | 24 
>>   1 file changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 23803102aa0d..10d814f03f13 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -68,7 +68,8 @@
>>   static const struct extent_io_ops btree_extent_io_ops;
>>   static void end_workqueue_fn(struct btrfs_work *work);
>>   static void free_fs_root(struct btrfs_root *root);
>> -static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info);
>> +static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
>> +   struct btrfs_super_block *sb);
>>   static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
>>   static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
>>     struct btrfs_fs_info *fs_info);
>> @@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
>>     memcpy(fs_info->fsid, fs_info->super_copy->fsid,
>> BTRFS_FSID_SIZE);
>>   -    ret = btrfs_check_super_valid(fs_info);
>> +    ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
>>   if (ret) {
>>   btrfs_err(fs_info, "superblock contains fatal errors");
>>   err = -EINVAL;
>> @@ -3575,6 +3576,21 @@ int write_all_supers(struct btrfs_fs_info
>> *fs_info, int max_mirrors)
>>   sb = fs_info->super_for_commit;
>>   dev_item = >dev_item;
>>   +    /* Do extra check on the sb to be written */
>> +    ret = btrfs_check_super_valid(fs_info, sb);
>> +    if (ret) {
>> +    btrfs_err(fs_info, "fatal superblock corrupted detected");
>> +    return -EUCLEAN;
>> +    }
>> +    /*
>> + * Unknown incompat flags can't be mounted, so newly developed flags
>> + * means corruption
>> + */
>> +    if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
>> +    btrfs_err(fs_info, "fatal superblock corrupted detected");
>> +    return -EUCLEAN;
>> +    }
>> +
>>   mutex_lock(_info->fs_devices->device_list_mutex);
>>   head = _info->fs_devices->devices;
>>   max_errors = btrfs_super_num_devices(fs_info->super_copy) - 1;
>> @@ -3985,9 +4001,9 @@ int btrfs_read_buffer(struct extent_buffer *buf,
>> u64 parent_transid, int level,
>>     level, first_key);
>>   }
>>   -static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info)
>> +static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
>> +   struct btrfs_super_block *sb)
>>   {
>> -    struct btrfs_super_block *sb = fs_info->super_copy;
>>   u64 nodesize = btrfs_super_nodesize(sb);
>>   u64 sectorsize = btrfs_super_sectorsize(sb);
>>   int ret = 0;
>>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



signature.asc
Description: OpenPGP digital signature


Re: [PATCH] btrfs: Do super block verification before writing it to disk

2018-04-16 Thread Anand Jain



On 04/16/2018 10:02 AM, Qu Wenruo wrote:

There are already 2 reports about strangely corrupted super blocks,
where csum type and incompat flags get some obvious garbage, but csum
still matches and all other vitals are correct.

This normally means some kernel memory corruption happens, although the
cause is unknown, at least detect it and prevent further corruption.

Signed-off-by: Qu Wenruo 
---


 Can you help point those 2 cases here?

Thanks, Anand



  fs/btrfs/disk-io.c | 24 
  1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 23803102aa0d..10d814f03f13 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -68,7 +68,8 @@
  static const struct extent_io_ops btree_extent_io_ops;
  static void end_workqueue_fn(struct btrfs_work *work);
  static void free_fs_root(struct btrfs_root *root);
-static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info);
+static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+  struct btrfs_super_block *sb);
  static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
  static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
  struct btrfs_fs_info *fs_info);
@@ -2680,7 +2681,7 @@ int open_ctree(struct super_block *sb,
  
  	memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
  
-	ret = btrfs_check_super_valid(fs_info);

+   ret = btrfs_check_super_valid(fs_info, fs_info->super_copy);
if (ret) {
btrfs_err(fs_info, "superblock contains fatal errors");
err = -EINVAL;
@@ -3575,6 +3576,21 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int 
max_mirrors)
sb = fs_info->super_for_commit;
dev_item = >dev_item;
  
+	/* Do extra check on the sb to be written */

+   ret = btrfs_check_super_valid(fs_info, sb);
+   if (ret) {
+   btrfs_err(fs_info, "fatal superblock corrupted detected");
+   return -EUCLEAN;
+   }
+   /*
+* Unknown incompat flags can't be mounted, so newly developed flags
+* means corruption
+*/
+   if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
+   btrfs_err(fs_info, "fatal superblock corrupted detected");
+   return -EUCLEAN;
+   }
+
mutex_lock(_info->fs_devices->device_list_mutex);
head = _info->fs_devices->devices;
max_errors = btrfs_super_num_devices(fs_info->super_copy) - 1;
@@ -3985,9 +4001,9 @@ int btrfs_read_buffer(struct extent_buffer *buf, u64 
parent_transid, int level,
  level, first_key);
  }
  
-static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info)

+static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+  struct btrfs_super_block *sb)
  {
-   struct btrfs_super_block *sb = fs_info->super_copy;
u64 nodesize = btrfs_super_nodesize(sb);
u64 sectorsize = btrfs_super_sectorsize(sb);
int ret = 0;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] fstests: test btrfs fsync after hole punching with no-holes mode

2018-04-16 Thread Eryu Guan
On Mon, Apr 16, 2018 at 12:28:59PM +0100, Filipe Manana wrote:
> On Mon, Apr 9, 2018 at 2:05 PM, Eryu Guan  wrote:
> > On Sun, Apr 08, 2018 at 09:46:24AM +0100, Filipe Manana wrote:
> >> On Sun, Apr 8, 2018 at 8:46 AM, Eryu Guan  wrote:
> >> > On Wed, Mar 28, 2018 at 12:55:30PM +0100, fdman...@kernel.org wrote:
> >> >> From: Filipe Manana 
> >> >>
> >> >> Test that when we have the no-holes mode enabled and a specific metadata
> >> >> layout, if we punch a hole and fsync the file, at replay time the whole
> >> >> hole was preserved.
> >> >>
> >> >> This issue is fixed by the following btrfs patch for the linux kernel:
> >> >>
> >> >>   "Btrfs: fix fsync after hole punching when using no-holes feature"
> >> >>
> >> >> Signed-off-by: Filipe Manana 
> >> >> ---
> >> >>
> >> >> V2: Made the test work when selinux is enabled, and made it use direct 
> >> >> IO
> >> >> writes to ensure 256K extents.
> >> >
> >> > Test fails with selinux enabled now on unpatched kernel. But I found
> >> > that, in my release testing, test still fails when testing with current
> >> > Linus tree (HEAD is 642e7fd23353, without selinux this time), which
> >> > should contain the mentioned fix. Does that mean the bug is not fully
> >> > fixed?
> >>
> >> The bug is fully fixed. But HEAD 642e7fd23353 does not contain the
> >> fix. The current  linus' master has it.
> >
> > I'll double check.. Thanks for the heads-up!
> 
> Hi Eryu, any reason this didn't go in the last update?
> Thanks.

Sorry, I thought I queued it for last update, but actually I queued
"generic: test for fsync after fallocate" not this one (and I double
checked that test not this one..). I'll give it another try and queue it
for next update if all look well.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Add udev-md-raid-safe-timeouts.rules

2018-04-16 Thread Austin S. Hemmelgarn

On 2018-04-15 21:04, Chris Murphy wrote:

I just ran into this:
https://github.com/neilbrown/mdadm/pull/32/commits/af1ddca7d5311dfc9ed60a5eb6497db1296f1bec

This solution is inadequate, can it be made more generic? This isn't
an md specific problem, it affects Btrfs and LVM as well. And in fact
raid0, and even none raid setups.

There is no good reason to prevent deep recovery, which is what
happens with the default command timer of 30 seconds, with this class
of drive. Basically that value is going to cause data loss for the
single device and also raid0 case, where the reset happens before deep
recovery has a chance. And even if deep recovery fails to return user
data, what we need to see is the proper error message: read error UNC,
rather than a link reset message which just obfuscates the problem.


This has been discussed at least once here before (probably more times, 
hard to be sure since it usually comes up as a side discussion in an 
only marginally related thread).  Last I knew, the consensus here was 
that it needs to be changed upstream in the kernel, not by adding a udev 
rule because while the value is technically system policy, the default 
policy is brain-dead for anything but the original disks it was 
i9ntended for (30 seconds works perfectly fine for actual SCSI devices 
because they behave sanely in the face of media errors, but it's 
horribly inadequate for ATA devices).


To re-iterate what I've said before on the subject:

For ATA drives it should probably be 150 seconds.  That's 30 seconds 
beyond the typical amount of time most consumer drives will keep 
retrying a sector, so even if it goes the full time to try and recover a 
sector this shouldn't trigger.  The only people this change should 
negatively impact are those who have failing drives which support SCT 
ERC and have it enabled, but aren't already adjusting this timeout.


For physical SCSI devices, it should continue to be 30 seconds.  SCSI 
disks are sensible here and don't waste your time trying to recover a 
sector.  For PV-SCSI devices, it should probably be adjusted too, but I 
don't know what a reasonable value is.


For USB devices it should probably be higher than 30 seconds, but again 
I have no idea what a reasonable value is.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] fstests: test btrfs fsync after hole punching with no-holes mode

2018-04-16 Thread Filipe Manana
On Mon, Apr 9, 2018 at 2:05 PM, Eryu Guan  wrote:
> On Sun, Apr 08, 2018 at 09:46:24AM +0100, Filipe Manana wrote:
>> On Sun, Apr 8, 2018 at 8:46 AM, Eryu Guan  wrote:
>> > On Wed, Mar 28, 2018 at 12:55:30PM +0100, fdman...@kernel.org wrote:
>> >> From: Filipe Manana 
>> >>
>> >> Test that when we have the no-holes mode enabled and a specific metadata
>> >> layout, if we punch a hole and fsync the file, at replay time the whole
>> >> hole was preserved.
>> >>
>> >> This issue is fixed by the following btrfs patch for the linux kernel:
>> >>
>> >>   "Btrfs: fix fsync after hole punching when using no-holes feature"
>> >>
>> >> Signed-off-by: Filipe Manana 
>> >> ---
>> >>
>> >> V2: Made the test work when selinux is enabled, and made it use direct IO
>> >> writes to ensure 256K extents.
>> >
>> > Test fails with selinux enabled now on unpatched kernel. But I found
>> > that, in my release testing, test still fails when testing with current
>> > Linus tree (HEAD is 642e7fd23353, without selinux this time), which
>> > should contain the mentioned fix. Does that mean the bug is not fully
>> > fixed?
>>
>> The bug is fully fixed. But HEAD 642e7fd23353 does not contain the
>> fix. The current  linus' master has it.
>
> I'll double check.. Thanks for the heads-up!

Hi Eryu, any reason this didn't go in the last update?
Thanks.

>
> Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: add comment about BTRFS_FS_EXCL_OP

2018-04-16 Thread Anand Jain
Adds comments about BTRFS_FS_EXCL_OP to existing comments
about the device locks.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3a3912a9dec9..18306165ffaa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -202,6 +202,22 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
  *   device_list_mutex
  * chunk_mutex
  *   balance_mutex
+ *
+ * BTRFS_FS_EXCL_OP
+ * 
+ * Maintains the exclusivity of the following device operations.
+ *
+ * - Balance (*)
+ * - Add
+ * - Remove
+ * - Replace (*)
+ * - Resize
+ *
+ * The device operation with (*) indicate it can continue after a CLI pause
+ * (only balance can be paused using CLI) or remount or unmount-mount sequence
+ * or a power-recycle, as long as the FS is still writeable. And during the
+ * course of the pause, the BTRFS_FS_EXCL_OP remains set. BTRFS_FS_EXCL_OP flag
+ * is set and cleared using atomic functions.
  */
 
 DEFINE_MUTEX(uuid_mutex);
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/16] btrfs: remove redundant read-only check from btrfs_cancel_balance

2018-04-16 Thread Anand Jain



On 04/04/2018 02:34 AM, David Sterba wrote:

Balance cannot be started on a read-only filesystem and will have to
finish/exit before eg. going to read-only via remount. Cancelling does
not need to check for that.

In case the filesystem is forcibly set to read-only after an error,
balance will finish anyway and if the cancel call is too fast it will
just wait for that to happen. Again does not have to check.


 What if there is a power recycle and mounted as readonly
 after the reboot?

Thanks, Anand



Signed-off-by: David Sterba 
---
  fs/btrfs/volumes.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a0420af1fad9..2956e7b4cb9f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4053,9 +4053,6 @@ int btrfs_pause_balance(struct btrfs_fs_info *fs_info)
  
  int btrfs_cancel_balance(struct btrfs_fs_info *fs_info)

  {
-   if (sb_rdonly(fs_info->sb))
-   return -EROFS;
-
mutex_lock(_info->balance_mutex);
if (!fs_info->balance_ctl) {
mutex_unlock(_info->balance_mutex);


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: error: redefinition of 'struct btrfs_ioctl_defrag_range_args

2018-04-16 Thread Ilan Schwarts
Thanks.
If you see no problems, then its probably locally on my machine.

On Thu, Apr 12, 2018 at 5:10 PM, David Sterba  wrote:
> On Wed, Apr 11, 2018 at 01:22:23PM +0300, Ilan Schwarts wrote:
>> Hi
>> While trying to compile my kernel module on suse 12.2 kernel
>> 4.4.103-92.53-default
>> I recieve the following warning: error: redefinition of 'struct
>> btrfs_ioctl_defrag_range_args
>> I see that struct is defined in 2 places:
>> /lib/modules/4.4.103-92.53-default/source/fs/btrfs/ctree.h:1985:8
>> /usr/src/linux-4.4.103-92.53/include/uapi/linux/btrfs.h:497:8
>
> Some of the defintions were moved from fs/btrfs/ to uapi, so the sources
> seem to be in a half-state where it exists in both places.
>
>> Why is it defined in 2 places ?
>> No one else is getting that warning (which treated as error for me) ?
>
> I've never seen the warning, but I don't compile the 4.4 distro kernel.



-- 


-
Ilan Schwarts
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Why btrfsInode -> root is null

2018-04-16 Thread Ilan Schwarts
Hi all,

I maintain kernel module on top of VFS.
When action executes, e.g vfs_rename, i get dentry object.
I need to get the fsid from the dentry object.
Since I maintain the same code for alot of suse distro and kernels
(11.4, 12.0, 12.1, 12.2), i have many #if macros,

My question is in SLES 12.2 kernel 4.4.103*, How do i get BTRFS dentry FSID ?
In previous kernels I did:
1. dentry->d_inode->i_op->getattr(NULL, dentry, )
2. btrfsInode->root->anon_super.s_dev
3. btrfsInode->root->anon_dev

I cast Inode to btrfs inode:
struct btrfs_inode *btrfsInode;
btrfsInode = BTRFS_I(dentry->d_inode);


On this version 4.4.103*,, there is the operation:
dentry->d_inode->i_sb->s_op->get_inode_dev(dentry->d_inode);

I have 2 questions,
1. Is this the proper way to obtain it ?
2. In some dentries, after casting inode to btrfsInode, the
btrfsInode->root is null, how could that be ? shouldn't the cast fail
(return null) ?

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 10/10] btrfs: qgroup: Use independent and accurate per inode qgroup rsv

2018-04-16 Thread Misono Tomohiro


On 2018/04/12 22:13, David Sterba wrote:
> On Tue, Apr 03, 2018 at 03:30:34PM +0800, Qu Wenruo wrote:
>> I didn't see this patch merged in your misc-next branch but only the
>> remaining patches.
>>
>> However without this patch, btrfs qgroup reserved space will get
>> obviously increased as prealloc metadata reserved space is never freed
>> until inode reserved space is freed.
>>
>> This would cause a lot of qgroup related test cases to fail.
>>
>> Would you please merge this patch with all qgroup patchset?
> 
> For the record: patch 9 and 10 are now in misc next and will go to 4.17.
> I need to let them through the for-next and other testing, so it will be
> some of the post rc1 pull requests.

I still saw qgroup related xfstests (btrfs/022 etc.) fails in current misc-next 
branch
which includes 9th and 10th patches.

I noticed 5th patch in the tree (already in v4.17-rc1) is older version and this
seems the reason that the tests still fails.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/16] btrfs: add sanity check when resuming balance after mount

2018-04-16 Thread Anand Jain



On 04/04/2018 02:34 AM, David Sterba wrote:

Replace a WARN_ON with a proper check and message in case something goes
really wrong and resumed balance cannot set up its exclusive status.



The check is a user friendly assertion, I don't expect to ever happen
under normal circumstances.

Also document that the paused balance starts here and owns the exclusive
op status.




Signed-off-by: David Sterba 
---
  fs/btrfs/volumes.c | 16 ++--
  1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb78c1d0ce2b..843982a2cbdb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3982,6 +3982,20 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info)
struct btrfs_key key;
int ret;
  
+	/*

+* This should never happen, as the paused balance state is recovered
+* during mount without any chance of other exclusive ops to collide.
+* Let it fail early and do not continue mount.
+*
+* Otherwise, this gives the exclusive op status to balance and keeps
+* in paused state until user intervention (cancel or umount).
+*/
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags)) {
+   btrfs_err(fs_info,
+   "cannot set exclusive op status to resume balance");
+   return -EINVAL;
+   }



 We need the test_and_set_bit(BTRFS_FS_EXCL_OP) only if we confirm that
 there is a pending balance. Its better to test and set at the same
 place as WARN_ON before.

Thanks, Anand



path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -4018,8 +4032,6 @@ int btrfs_recover_balance(struct btrfs_fs_info *fs_info)
btrfs_balance_sys(leaf, item, _bargs);
btrfs_disk_balance_args_to_cpu(>sys, _bargs);
  
-	WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags));

-
mutex_lock(_info->volume_mutex);
mutex_lock(_info->balance_mutex);
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html