Re: [PATCH v13 00/15] Btrfs In-band De-duplication

2016-09-09 Thread Mark Fasheh
On Thu, Sep 08, 2016 at 03:12:49PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160907
> 
> This version is just another small update, rebased to David's
> for-next-20160906 branch.
> 
> This update includes only one small fix, for a problem exposed by recent
> commits which check space_info->bytes_may_use at umount time.
> It was caused by the fact that we only free quota reserved space at hash
> hit, but don't free space_info->bytes_may_use.
> 
> Other rebase changes are all related to recent infrastructure change,
> like io_tree and quota flags change.
> 
> We ran xfstests with dedupe enabled.

Is there an xfstests patch for this I can look at? We want to be able to run
and reproduce the same tests as you.

Also where are the disk portion patches or did I miss them somehow?
--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] ioctl_xfs_ioc_getfsmap.2: document XFS_IOC_GETFSMAP ioctl

2016-09-09 Thread Dave Chinner
On Thu, Sep 08, 2016 at 11:07:16PM -0700, Darrick J. Wong wrote:
> On Fri, Sep 09, 2016 at 09:38:06AM +1000, Dave Chinner wrote:
> > On Tue, Aug 30, 2016 at 12:09:49PM -0700, Darrick J. Wong wrote:
> > > > I recall for FIEMAP that some filesystems may not have files aligned
> > > > to sector offsets, and we just used byte offsets.  Storage like
> > > > NVDIMMs are cacheline granular, so I don't think it makes sense to
> > > > tie this to old disk sector sizes.  Alternately, the units could be
> > > > in terms of fs blocks as returned by statvfs.st_bsize, but mixing
> > > > units for fmv_block, fmv_offset, fmv_length is unneeded complexity.
> > > 
> > > Ugh.  I'd rather just change the units to bytes rather than force all
> > > the users to multiply things. :)
> > 
> > Yup, units need to be either in disk addresses (i.e. 512 byte units)
> > or bytes. If people can't handle disk addresses (seems to be the
> > case), then bytes it should be.
> 
> 
> 
> > > I'd much rather just add more special owner codes for any other
> > > filesystem that has distinguishable metadata types that are not
> > > covered by the existing OWN_ codes.  We /do/ have 2^64 possible
> > > values, so it's not like we're going to run out.
> > 
> > This is diagnostic information as much as anything, just like
> > fiemap is diagnostic information. So if we have specific type
> > information, it needs to be reported accurately to be useful.
> > 
> > Hence I really don't care if the users and developers of other fs
> > types don't understand what the special owner codes that a specific
> > filesystem returns mean. i.e. it's not useful user information -
> > only a tool that groks the specific filesystem is going to be able
> > to do anything useful with special owner codes. So, IMO, there's little
> > point trying to make them generic or even trying to define and
> > explain them in the man page
> 
>  I'm ok with describing generally what each special owner code
> means.  Maybe the manpage could be more explicit about "None of these
> codes are useful unless you're a low level filesystem tool"?

You can add that, but it doesn't address the underlying problem.
i.e.  that we can add/change the codes, their name, meaning, etc,
and now there's a third party man page that is incorrect and out of
date. It's the same problem with documenting filesystem specific
mount options in mount(8). Better, IMO, is to simply say "refer to
filesystem specific documentation for a description of these special
values", e.g. refer them to the XFS Filesystem Structure
document where this is all spelled out in enough detail to be useful
for someone thinking that they might want to use them.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] btrfs: fix a possible umount deadlock

2016-09-09 Thread Anand Jain



On 09/09/2016 08:53 PM, David Sterba wrote:
> On Fri, Sep 09, 2016 at 04:31:04PM +0800, Anand Jain wrote:
>>  static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
>>  {
>>  	struct btrfs_device *device, *tmp;
>> +	static LIST_HEAD(pending_put);
>
> Why is it static?

Sorry, my mistake; it's a typo. v2 has been sent out.

Thanks, Anand

>> +	INIT_LIST_HEAD(&pending_put);
>>
>>  	if (--fs_devices->opened > 0)
>>  		return 0;
>> @@ -906,9 +904,24 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
>>  	mutex_lock(&fs_devices->device_list_mutex);
>>  	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
>>  		btrfs_close_one_device(device);
>> +		list_add(&device->dev_list, &pending_put);
>>  	}
>>  	mutex_unlock(&fs_devices->device_list_mutex);
>>
>> +	/*
>> +	 * btrfs_show_devname() is using the device_list_mutex,
>> +	 * sometimes a call to blkdev_put() leads vfs calling
>> +	 * into this func. So do put outside of device_list_mutex,
>> +	 * as of now.
>> +	 */
>> +	while (!list_empty(&pending_put)) {
>> +		device = list_entry(pending_put.next,
>> +				    struct btrfs_device, dev_list);
>> +		list_del(&device->dev_list);
>> +		btrfs_close_bdev(device);
>> +		call_rcu(&device->rcu, free_device);
>> +	}
>> +
>>  	WARN_ON(fs_devices->open_devices);
>>  	WARN_ON(fs_devices->rw_devices);
>>  	fs_devices->opened = 0;



[PATCH v2] btrfs: fix a possible umount deadlock

2016-09-09 Thread Anand Jain
btrfs_show_devname() takes the device_list_mutex, and a call to
blkdev_put() can sometimes lead the VFS to call into this function.
So, for now, call blkdev_put() outside of device_list_mutex.
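
As a rough illustration only (a userspace Python sketch, not the kernel
patch itself), the idea is to do nothing under device_list_mutex except
detach the devices onto a private list, and to run the blkdev_put()-style
release only after the mutex has been dropped. The FsDevices class and
fake_blkdev_put() below are hypothetical stand-ins invented for this
sketch.

```python
import threading


class FsDevices:
    """Toy stand-in for btrfs_fs_devices; hypothetical, not kernel code."""

    def __init__(self, names):
        self.device_list_mutex = threading.Lock()
        self.devices = list(names)

    def close_devices(self, put_device):
        pending_put = []
        # Phase 1: under the list mutex, only detach devices and queue them.
        with self.device_list_mutex:
            while self.devices:
                pending_put.append(self.devices.pop(0))
        # Phase 2: the mutex is released here, so put_device() may take
        # other locks without creating a lock-order cycle through this one.
        for dev in pending_put:
            put_device(dev)


fs = FsDevices(["sda", "sdb"])
held_during_put = []


def fake_blkdev_put(dev):
    # Record whether the list mutex was held while the "put" ran.
    held_during_put.append((dev, fs.device_list_mutex.locked()))


fs.close_devices(fake_blkdev_put)
print(held_during_put)
```

Every recorded entry should show the mutex as not held, which is the
property the patch is after.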

[  983.284212] ==
[  983.290401] [ INFO: possible circular locking dependency detected ]
[  983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted
[  983.302081] ---
[  983.308357] umount/21720 is trying to acquire lock:
[  983.313243]  (&bdev->bd_mutex){+.+.+.}, at: [] blkdev_put+0x31/0x150
[  983.321264]
[  983.321264] but task is already holding lock:
[  983.327101]  (&fs_devs->device_list_mutex){+.+...}, at: [] __btrfs_close_devices+0x46/0x200 [btrfs]
[  983.337839]
[  983.337839] which lock already depends on the new lock.
[  983.337839]
[  983.346024]
[  983.346024] the existing dependency chain (in reverse order) is:
[  983.353512]
-> #4 (&fs_devs->device_list_mutex){+.+...}:
[  983.359096][] lock_acquire+0x1bc/0x1f0
[  983.365143][] mutex_lock_nested+0x65/0x350
[  983.371521][] btrfs_show_devname+0x36/0x1f0 [btrfs]
[  983.378710][] show_vfsmnt+0x4e/0x150
[  983.384593][] m_show+0x17/0x20
[  983.389957][] seq_read+0x2b5/0x3b0
[  983.395669][] __vfs_read+0x28/0x100
[  983.401464][] vfs_read+0xab/0x150
[  983.407080][] SyS_read+0x52/0xb0
[  983.412609][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.419617]
-> #3 (namespace_sem){++}:
[  983.424024][] lock_acquire+0x1bc/0x1f0
[  983.430074][] down_write+0x49/0x80
[  983.435785][] lock_mount+0x67/0x1c0
[  983.441582][] do_add_mount+0x32/0xf0
[  983.447458][] finish_automount+0x5a/0xc0
[  983.453682][] follow_managed+0x1b3/0x2a0
[  983.459912][] lookup_fast+0x300/0x350
[  983.465875][] path_openat+0x3a7/0xaa0
[  983.471846][] do_filp_open+0x85/0xe0
[  983.477731][] do_sys_open+0x14c/0x1f0
[  983.483702][] SyS_open+0x1e/0x20
[  983.489240][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.496254]
-> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}:
[  983.501798][] lock_acquire+0x1bc/0x1f0
[  983.507855][] down_write+0x49/0x80
[  983.513558][] start_creating+0x87/0x100
[  983.519703][] debugfs_create_dir+0x17/0x100
[  983.526195][] bdi_register+0x93/0x210
[  983.532165][] bdi_register_owner+0x43/0x70
[  983.538570][] device_add_disk+0x1fb/0x450
[  983.544888][] loop_add+0x1e6/0x290
[  983.550596][] loop_init+0x10b/0x14f
[  983.556394][] do_one_initcall+0xa7/0x180
[  983.562618][] kernel_init_freeable+0x1cc/0x266
[  983.569370][] kernel_init+0xe/0x100
[  983.575166][] ret_from_fork+0x1f/0x40
[  983.581131]
-> #1 (loop_index_mutex){+.+.+.}:
[  983.585801][] lock_acquire+0x1bc/0x1f0
[  983.591858][] mutex_lock_nested+0x65/0x350
[  983.598256][] lo_open+0x1f/0x60
[  983.603704][] __blkdev_get+0x123/0x400
[  983.609757][] blkdev_get+0x34a/0x350
[  983.615639][] blkdev_open+0x64/0x80
[  983.621428][] do_dentry_open+0x1c6/0x2d0
[  983.627651][] vfs_open+0x69/0x80
[  983.633181][] path_openat+0x834/0xaa0
[  983.639152][] do_filp_open+0x85/0xe0
[  983.645035][] do_sys_open+0x14c/0x1f0
[  983.650999][] SyS_open+0x1e/0x20
[  983.656535][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.663541]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[  983.668107][] __lock_acquire+0x1003/0x17b0
[  983.674510][] lock_acquire+0x1bc/0x1f0
[  983.680561][] mutex_lock_nested+0x65/0x350
[  983.686967][] blkdev_put+0x31/0x150
[  983.692761][] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.699699][] __btrfs_close_devices+0xcb/0x200 [btrfs]
[  983.707178][] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  983.714380][] close_ctree+0x265/0x340 [btrfs]
[  983.721061][] btrfs_put_super+0x19/0x20 [btrfs]
[  983.727908][] generic_shutdown_super+0x6f/0x100
[  983.734744][] kill_anon_super+0x16/0x30
[  983.740888][] btrfs_kill_super+0x1e/0x130 [btrfs]
[  983.747909][] deactivate_locked_super+0x49/0x80
[  983.754745][] deactivate_super+0x5d/0x70
[  983.760977][] cleanup_mnt+0x5c/0x80
[  983.766773][] __cleanup_mnt+0x12/0x20
[  983.772738][] task_work_run+0x7e/0xc0
[  983.778708][] exit_to_usermode_loop+0x7e/0xb4
[  983.785373][] syscall_return_slowpath+0xbb/0xd0
[  983.792212][] entry_SYSCALL_64_fastpath+0xbf/0xc1
[  983.799225]
[  983.799225] other info that might help us debug this:
[  983.799225]
[  983.807291] Chain exists of:
  &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex

[  983.816521]  Possible unsafe locking scenario:
[  983.816521]
[  983.822489]CPU0CPU1
[  983.827043]

Re: btrfs kernel oops on mount

2016-09-09 Thread Duncan
moparisthebest posted on Fri, 09 Sep 2016 15:23:13 -0400 as excerpted:

> On 09/09/2016 02:47 PM, Austin S. Hemmelgarn wrote:
>> On 2016-09-09 12:12, moparisthebest wrote:
>>> Hi,
>>>
>>> I'm hoping to get some help with mounting my btrfs array which quit
>>> working yesterday.  My array was in the middle of a balance, about 50%
>>> remaining, when it hit an error and remounted itself read-only [1].
>>> btrfs fi show output [2], btrfs df output [3].
>>>
>>> I unmounted the array, and when I tried to mount it again, it locked
>>> up the whole system so even alt+sysrq would not work.  I rebooted,
>>> tried to mount again, same lockup.  This was all kernel 4.5.7.
>>>
>>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>>> message appeared on the screen and I took a picture [4].
>>>
>>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>>> again, got some dmesg output before it crashed [5] and took a picture
>>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>>> pointer dereference at 01f0'.
>>>
>>> Is there anything I can do to get this in a working state again or
>>> perhaps even recover some data?
>>>
>>> Thanks much for any help
>>>
>>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt [2]:
>>> https://www.moparisthebest.com/btrfs/btrfsfishow.txt [3]:
>>> https://www.moparisthebest.com/btrfs/btrfsdf.txt [4]:
>>> https://www.moparisthebest.com/btrfsoops.jpg [5]:
>>> https://www.moparisthebest.com/btrfs/dmsgprecrash.txt [6]:
>>> https://www.moparisthebest.com/btrfsnulldereference.jpg
>> 
>> The output from btrfs fi show and fi df both indicate that the
>> filesystem is essentially completely full.  You've gotten to the point
>> where you're using the global metadata reserve, and I think things are
>> getting stuck trying (and failing) to reclaim the space that's used
>> there.

>> Given that the FS is pretty much wedged, I think your best bet for
>> fixing this is probably going to be to use btrfs restore to get the
>> data onto a new (larger) set of disks.  If you do take this approach, a
>> metadata dump might be useful, if somebody could find enough room to
>> extract it.

> If I read btrfs fi show right, it's got minimum ~600gb free on each one
> of the 8 drives, shouldn't that be more than enough for most things?  (I
> guess unless I have single files over 600gb that need COW'd, I don't
> though)

Austin did pick up on something I (and apparently Chris) missed, the
non-zero used global reserve, but as best I can tell he's wrongly attributing 
it to fully used devices, when as you (and Chris) point out that's not 
the case.

What he picked up on is this.  Under normal conditions, global reserve 
"used" should always be zero, as sans bugs, btrfs has to be in pretty 
dire lack of space condition before it'll start using the reserve.  Under 
most conditions, btrfs will simply ENOSPC an operation before it starts 
using reserve, so the fact that it's used indicates that btrfs *BELIEVES* 
that it is in dire straits, space-wise, and has no place to go *but* 
reserves.

But as you point out, all eight devices seem to have a half-TiB plus 
available, unallocated and free to allocate as necessary.  Given that 
btrfs raid1 only does pair-mirroring, and that chunks should be at 
absolute largest, 10 GiB, there's *plenty* of space to allocate as needed.
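
To put rough numbers on that, the sketch below counts how many raid1
chunks could still be allocated, assuming (hypothetically) exactly
512 GiB unallocated on each of eight devices and the 10 GiB worst-case
chunk size mentioned above.  It assumes each chunk goes to the two
devices with the most unallocated space, which is a simplification of
the real allocator, and count_raid1_chunks() is a made-up helper, not a
btrfs API.

```python
def count_raid1_chunks(unallocated_gib, chunk_gib=10):
    """Count raid1 chunks that fit, mirroring each chunk on two devices.

    Simplified model: every chunk takes chunk_gib from each of the two
    devices that currently have the most unallocated space.
    """
    free = list(unallocated_gib)
    chunks = 0
    while True:
        free.sort(reverse=True)
        # raid1 needs two different devices with room for one copy each.
        if len(free) < 2 or free[1] < chunk_gib:
            return chunks
        free[0] -= chunk_gib
        free[1] -= chunk_gib
        chunks += 1


# Eight devices, 512 GiB unallocated each (assumed figures).
print(count_raid1_chunks([512] * 8))  # 204
```

Under those assumptions there is room for a couple of hundred maximum-size
chunks, so a genuine lack of unallocated space looks unlikely.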


Which can only mean that you've hit one of those elusive ENOSPC bugs 
where there's plenty of space left to allocate, but btrfs simply refuses 
to allocate it, instead triggering ENOSPC errors left and right, and of 
particular interest here, btrfs believes the ENOSPC problems to be severe 
enough that it has even run substantially into global reserves, *DESPITE* 
there *actually* being *plenty* of space!

Now I'm not a dev (just a btrfs user and list regular) and the traces, 
etc, don't tend to add much usable information for me, so I can't judge 
whether your particular case is affected by the following or not, but as 
it so happens, there's active patches going into 4.8 dealing with some of 
these previously unsolved ENOSPC when there's *plenty* of space bugs.

So there's a fair chance the patches in either current 4.8-git or still 
in-process at this very moment will fix at least the evident false ENOSPC 
despite loads of space actually being available, which based on the fact 
that used reserve is /not/ zero was very likely the original trigger for 
the auto-remount-ro.  However, it's also possible that there are other 
issues now as well, that the current patches may /not/ fix, even if they 
fix all the ENOSPC issues, which itself I can't guarantee.  But it's 
worth a shot.

The other known problem with a known (mount-option) fix that you're 
almost certainly running into ATM is the unfinished balance, since the 
balance will try to resume once you mount the btrfs writable, and at 
least without the ENOSPC patches mentioned above, that balance is 
immediately running into the 

btrfstune -x -> extent-tree.c:2688: btrfs_reserve_extent: Assertion `ret` failed.

2016-09-09 Thread Hans van Kranenburg
Hi,

While trying to enable skinny metadata on a filesystem, I got this error
(after minutes of reading from disk by the program):

-# btrfstune -x /dev/xvdb
extent-tree.c:2688: btrfs_reserve_extent: Assertion `ret` failed.
btrfstune[0x410ef6]
btrfstune[0x410f1d]
btrfstune(btrfs_reserve_extent+0x781)[0x41522e]
btrfstune(btrfs_alloc_free_block+0x63)[0x415413]
btrfstune(__btrfs_cow_block+0xfc)[0x409176]
btrfstune(btrfs_cow_block+0x8b)[0x409718]
btrfstune[0x40d8ad]
btrfstune(btrfs_commit_transaction+0xb8)[0x40f10d]
btrfstune(main+0x3b3)[0x407e31]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f945fa06700]
btrfstune(_start+0x29)[0x408509]

This is a ~40TiB filesystem that was created about one and a half years
ago, has grown from 1TiB to this size now and has always been running
with the Debian 3.16-ckt kernel.

# uname -a
Linux backups-dolly 4.7.0-1-amd64 #1 SMP Debian 4.7.2-1 (2016-08-28)
x86_64 GNU/Linux

# btrfs version
btrfs-progs v4.7.1

One of the things I already did earlier today was switching to
space_cache=v2

Does the shown error ring a bell? What's the next step to debug this?

The filesystem is a clone of the production filesystem (not btrfs clone,
but lower level, on iSCSI storage) meant to be used for upgrade-testing
and performance testing, so if anything goes wrong in whatever way,
there will be no panicking involved.

-- 
Hans van Kranenburg


Re: segfault btrfs scrub

2016-09-09 Thread Liu Bo
On Fri, Sep 09, 2016 at 02:41:45PM +0200, Jan Koester wrote:
> 
>  
>  
> Hi,
> 
> I got a segfault from the btrfs scrub command. I use btrfs tools 4.7.2.
>  
> root@dibsi:/home/jan# btrfs scrub status /local
> Segmentation fault
> root@dibsi:/home/jan# dmesg
> [78294.556713] BTRFS error (device sda): bad tree block start 
> 18427384836265136347 2304683610112
> [78294.556956] BTRFS error (device sda): bad tree block start 
> 17385487456874290426 2304683610112
> [78294.558323] BTRFS error (device sda): bad tree block start 
> 17385487456874290426 2304683610112
> [78294.558397] [ cut here ]
> [78294.569900] kernel BUG at fs/btrfs/ctree.c:5202!
> [78294.581634] invalid opcode:  [#15] SMP
> [78294.593089] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs 
> xfs libcrc32c binfmt_misc btrfs xor raid6_pq kvm_amd kvm irqbypass serio_raw 
> snd_usb_audio input_leds joydev snd_usbmidi_lib snd_hda_codec_hdmi 
> edac_mce_amd snd_hda_intel edac_core snd_hda_codec k10temp snd_ctxfi 
> snd_hda_core snd_hwdep snd_pcm i2c_piix4 snd_seq_midi snd_seq_midi_event 
> snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore tpm_infineon 
> mac_hid 8250_fintek shpchp sunrpc parport_pc ppdev lp parport autofs4 
> hid_generic usbhid hid amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm 
> drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt ptp fb_sys_fops r8169 
> drm mii ahci pps_core libahci wmi fjes
> [78294.629504] CPU: 3 PID: 16486 Comm: btrfs Tainted: G  D W   
> 4.6.0-rc4 #1
> [78294.629506] Hardware name: Gigabyte Technology Co., Ltd. 
> GA-970A-D3/GA-970A-D3, BIOS F12 09/03/2013
> [78294.629510] task: 880070766800 ti: 8801c2d3 task.ti: 
> 8801c2d3
> [78294.629568] RIP: 0010:[]  [] 
> btrfs_search_forward+0x24d/0x330 [btrfs]
> [78294.629572] RSP: 0018:8801c2d33c10  EFLAGS: 00010246
> [78294.629581] RAX:  RBX:  RCX: 
> 0001
> [78294.629583] RDX: 0001 RSI:  RDI: 
> 880080638d40
> [78294.629585] RBP: 8801c2d33c70 R08: 021899d9 R09: 
> 02189fd9
> [78294.629587] R10:  R11: 0003 R12: 
> 88008826e8c0
> [78294.629589] R13: 0001 R14: 0001 R15: 
> 
> [78294.629593] FS:  7ff69486f8c0() GS:88022fcc() 
> knlGS:e71e3b40
> [78294.629595] CS:  0010 DS:  ES:  CR0: 80050033
> [78294.629598] CR2: 01a94088 CR3: 000221fe6000 CR4: 
> 06e0
> [78294.629599] Stack:
> [78294.629605]  024280ca 8801c2d33cbf 880223bfa800 
> 01ff
> [78294.629609]  d800 0001 db9fb905 
> 88008826e8c0
> [78294.629613]  8801c2d33d18 8802008ee000 8801c2d33cbf 
> 8801f91e6800
> [78294.629614] Call Trace:
> [78294.629669]  [] search_ioctl+0xf2/0x1a0 [btrfs]
> [78294.629720]  [] btrfs_ioctl_tree_search+0x72/0xc0 [btrfs]
> [78294.629769]  [] btrfs_ioctl+0x3e4/0x21a0 [btrfs]
> [78294.629777]  [] ? handle_mm_fault+0x14cf/0x1e60
> [78294.629782]  [] ? cp_new_stat+0x153/0x180
> [78294.629789]  [] do_vfs_ioctl+0xa1/0x5b0
> [78294.629794]  [] ? __do_page_fault+0x205/0x4d0
> [78294.629800]  [] SyS_ioctl+0x79/0x90
> [78294.629806]  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
> [78294.629847] Code: 8b 4d a0 48 8b 55 a8 4d 89 f8 48 8b 7d b0 4c 89 e6 e8 68 
> fb ff ff 85 c0 0f 85 bf 00 00 00 4c 89 e7 e8 88 7f ff ff e9 fa fd ff ff <0f> 
> 0b 48 8d 04 92 43 89 54 ac 40 48 8d 75 bf b9 11 00 00 00 48
> [78294.629885] RIP  [] btrfs_search_forward+0x24d/0x330 
> [btrfs]
> [78294.629887]  RSP 
> [78294.629969] ---[ end trace fa1ffcf4f496deaf ]---

We have a commit[1] in 4.8 which has cleaned up this BUG_ON().

But it will only help us return gracefully; for the invalid metadata,
try btrfsck instead.

[1]:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fb770ae414d018255afa7a70b14ba1f8620762dd


Thanks,

-liubo


Re: btrfs kernel oops on mount

2016-09-09 Thread Chris Murphy
On Fri, Sep 9, 2016 at 12:47 PM, Austin S. Hemmelgarn wrote:
>
> The output from btrfs fi show and fi df both indicate that the filesystem is
> essentially completely full.

What am I missing?

https://www.moparisthebest.com/btrfs/btrfsfishow.txt

There are thousands of GiB totally unallocated. Just taking the last
two devices:

devid   13 size 3.64TiB used 3.04TiB path /dev/mapper/fourtb5
devid   14 size 7.28TiB used 6.21TiB path /dev/mapper/eighttb

There's plenty of room for it to allocate some 600GiB of new metadata
or data chunks, mirrored on just these two devices. None of the others
is totally full either.
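
As a quick sanity check of that arithmetic, here is a small sketch using
only the two "size"/"used" pairs quoted above.  The figures are rounded
by btrfs fi show, so the results are approximate, and the variable names
are just for illustration.

```python
TIB_IN_GIB = 1024

# devid -> (size TiB, used TiB), copied from the btrfs fi show lines above.
devices = {
    13: (3.64, 3.04),
    14: (7.28, 6.21),
}

# Unallocated space per device, in GiB.
unallocated_gib = {
    devid: (size - used) * TIB_IN_GIB for devid, (size, used) in devices.items()
}

# A raid1 chunk needs a copy on each device, so the smaller device limits
# how much can be mirrored across just this pair.
mirrored_gib = min(unallocated_gib.values())

for devid, gib in unallocated_gib.items():
    print(f"devid {devid}: about {gib:.0f} GiB unallocated")
print(f"mirrored on these two alone: about {mirrored_gib:.0f} GiB")
```

That comes out at a little over 600 GiB for devid 13 and over 1 TiB for
devid 14, so the pair on its own can mirror roughly 600 GiB.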

It sounds like, for ENOSPC problems, the devs want to see a couple of
things beyond what I asked for:

enospc_debug
grep -IR . /sys/fs/btrfs/UUID/allocation/

That's kinda hard to do right now if it's not mounting though...



-- 
Chris Murphy


Re: btrfs kernel oops on mount

2016-09-09 Thread moparisthebest
On 09/09/2016 02:47 PM, Austin S. Hemmelgarn wrote:
> On 2016-09-09 12:12, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday.  My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work.  I rebooted, tried to
>> mount again, same lockup.  This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 01f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
> 
> The output from btrfs fi show and fi df both indicate that the
> filesystem is essentially completely full.  You've gotten to the point
> where you're using the global metadata reserve, and I think things are
> getting stuck trying (and failing) to reclaim the space that's used
> there.  The fact that the kernel is crashing in response to this is
> concerning, but it isn't surprising as this is not something that's
> really all that tested, and is very much not a normal operational
> scenario.  I'm guessing that the error you hit that forced the
> filesystem read-only is something that requires recovery, which in turn
> requires copy-on-write updates of some of the metadata, which you have
> essentially zero room for, and that's what's causing the kernel to choke
> when trying to mount the filesystem.
> 
> Given that the FS is pretty much wedged, I think your best bet for
> fixing this is probably going to be to use btrfs restore to get the data
> onto a new (larger) set of disks.  If you do take this approach, a
> metadata dump might be useful, if somebody could find enough room to
> extract it.
> 
> Alternatively, because of the small amount of free space on the largest
> device in the array, you _might_ be able to fix things if you can get it
> mounted read-write by running a balance converting both data and
> metadata to single profiles, adding a few more disks (or replacing some
> with bigger ones), and then converting back to raid1 profiles.  This is
> exponentially more risky than just restoring to a new filesystem, and
> will almost certainly take longer.
> 
> A couple of other things to comment about on this:
> 1. 'can_overcommit' (the function that the Arch kernel choked on) is
> from the memory management subsystem.  The fact that that's throwing a
> null pointer says to me either your hardware has issues, or the Arch
> kernel itself has problems (which would probably mean the kernel image
> is corrupted).
> 2. You may want to look for more symmetrically sized disks if you're
> going to be using raid1 mode.  The space that's free on the last listed
> disk in the filesystem is unusable in raid1 mode because there are no
> other disks with usable space.
> 3. In general, it's a good idea to keep an eye on space usage on your
> filesystems.  If it's getting to be more than about 95% full, you should
> be looking at getting some more storage space.  This is especially true
> for BTRFS, as a 100% full BTRFS filesystem functionally becomes
> permanently read-only because there's nowhere for the copy-on-write
> updates to write to.

If I read btrfs fi show right, it's got minimum ~600gb free on each one
of the 8 drives, shouldn't that be more than enough for most things?  (I
guess unless I have single files over 600gb that need COW'd, I don't though)

Didn't Ubuntu on kernel 4.4 die in the same can_overcommit function
(https://www.moparisthebest.com/btrfsoops.jpg)? What kind of hardware
issues would cause a repeatable kernel crash like that?  Am I looking
at memory issues, or the SAS controller, or what?

Thanks!


Re: btrfs kernel oops on mount

2016-09-09 Thread Chris Murphy
On Fri, Sep 9, 2016 at 12:32 PM, moparisthebest wrote:

> This is indeed an lzo compressed system, it's always been mounted with
> that option anyhow.
>
> btrfs check has been running for ~6 hours so far, I'll follow up with
> output on that when it finishes.
>
> Hmm, the problem with the 4.7.2/systemd system is it's a live usb system
> so the log/journal wouldn't be saved anywhere except tmpfs, I'll see
> what I can rig up unless someone has any amazing ideas?  I'm still brand
> new to systemd...

Pick the easier of:
1.
ssh with a remote computer; the blocked tasks will slow down sshd and
the responsiveness of everything; but it shouldn't totally inhibit it
and may be more reliable than a local VT if the command is pretyped
and ready to go before you initiate the mount. Use journalctl -fk to
follow, and save the output as a text file from that remote
computer.
2.
netconsole might be more reliable than sshd in this case, again just
connect with a remote computer, and in its Terminal you can do:
journalctl -fk
3.
Create a file system on a USB stick partition, copy live's /var to the
stick, then mount the stick over the live's /var, and now it's
read-write. And then:
   mkdir -p /var/log/journal
   systemd-tmpfiles --create --prefix /var/log/journal

I think that will cause systemd-journald to flush to /var now, you can do:
journalctl -b | grep journald

And see if you have lines like this:
Sep 09 09:11:05 f24m systemd-journald[238]: Journal stopped
Sep 09 09:11:06 f24m systemd-journald[549]: Runtime journal
(/run/log/journal/) is 8.0M, max 393.2M, 385.2M free.
Sep 09 09:11:06 f24m systemd-journald[549]: System journal
(/var/log/journal/) is 999.7M, max 1.0G, 24.2M free.
Sep 09 09:11:07 f24m systemd-journald[549]: Time spent on flushing to
/var is 1.040757s for 1490 entries.
Sep 09 09:11:07 f24m systemd-journald[238]: Received SIGTERM from PID
1 (systemd).

So what happens when you force reboot? Mount this stick, and use
'journalctl -D /mnt/log/journal/machineid/ > outputfile.txt' which
will point to the journal binary file and write it out to a text file.
You could try -k to filter out just kernel messages but since that
implies -b and you have a different boot than what's in this journal I
have no idea off hand if that will work;  you could also filter by |
grep kernel > outputfile.txt but maybe not every line will have kernel
in it? I just tried it  with sysrq t and everything relevant seems to
have "kernel" in each line.


They're probably in order of ease; but not sure which is more reliable
when things are being blocked. Network may be more or less blocked
*shrug* I'd use XFS for the stick file system for /var.


Chris Murphy


Re: Security implications of btrfs receive?

2016-09-09 Thread Chris Murphy
On Thu, Sep 8, 2016 at 5:48 AM, Austin S. Hemmelgarn wrote:
> On 2016-09-07 15:34, Chris Murphy wrote:

> I like the idea of matching WWN as part of the check, with a couple of
> caveats:
> 1. We need to keep in mind that in some environments, this can be spoofed
> (Virtualization for example, although doing so would require source level
> modifications to most hypervisors).
> 2. There needs to be a way to forcibly mount in the case of a mismatch, as
> well as a way to update the filesystem to match the current WWNs of all of
> its disks.  I also specifically think that these should be separate
> options; the first is useful for debugging a filesystem using image files,
> while the second is useful for external clones of disks.
> 3. single device filesystems should store the WWN, and ideally keep it
> up-to-date, but not check it.  They have no need to check it, and single
> device is the primary use case for a traditional user, so it should be as
> simple as possible.
> 4. We should be matching on more than just fsuuid, devuuid, and WWN, because
> just matching those would allow a second partition on the same device to
> cause issues.

Probably a different abstraction is necessary: WWN is appropriate
where member devices are drives; but maybe it's an LVM UUID in other
cases, e.g. where there are LVM snapshots. I'm not sure how DRBD devices
are uniquely identified, but that'd also be in the "one of these"
list.
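
To make the idea a bit more concrete, here is a purely hypothetical
sketch of what such a check might look like.  Nothing like this exists
in btrfs as far as this thread shows: MemberIdentity, its hw_id field
(standing in for "WWN, LVM UUID, or whatever fits") and may_mount() are
all invented names, and the rules simply mirror the four caveats quoted
above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MemberIdentity:
    """Hypothetical per-member record; none of this is real btrfs metadata."""
    fsuuid: str
    devuuid: str
    hw_id: Optional[str]       # WWN, LVM UUID, ... (None if unknown)
    partition: Optional[int]   # caveat 4: distinguishes partitions on one disk


def may_mount(stored: Sequence[MemberIdentity],
              seen: Sequence[MemberIdentity],
              force: bool = False) -> bool:
    # Caveat 3: single-device filesystems store the identity but skip the check.
    if len(stored) == 1:
        return True
    # Caveat 2: an explicit override allows mounting despite a mismatch.
    if force:
        return True
    # Otherwise every stored member must be seen exactly as recorded.
    return set(stored) == set(seen)


a = MemberIdentity("fs-1", "dev-1", "wwn-A", 1)
b = MemberIdentity("fs-1", "dev-2", "wwn-B", 1)
# A copy of member b sitting on a second partition of disk A.
b_copy = MemberIdentity("fs-1", "dev-2", "wwn-A", 2)

print(may_mount([a, b], [a, b]))
print(may_mount([a, b], [a, b_copy]))
print(may_mount([a, b], [a, b_copy], force=True))
```

The second call is the interesting one: fsuuid and devuuid match, but
the hardware identifier and partition do not, so the mount is refused
unless it is forced.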




>> It is also kinda important to see things like udisks and storaged as
>> user agents, ensuring they have a way to communicate with the helper
>> so things are mounted and umounted correctly as most DE's now expect
>> to just automount everything. I still get weird behaviors on GNOME
>> with udisks2 and multiple device Btrfs volumes with current upstream
>> GNOME stuff.
>
> DE's expect the ability to automount things as a regular user, not
> necessarily that it has to happen.  I'm not all that worried personally
> about automounting of multi-device filesystems, largely because the type of
> person who automounting in the desktop primarily caters to is not likely to
> have a multi-device filesystem to begin with.

It should work better than it does because it works well for LVM and
mdadm arrays.

I think what's going on is the DE's mounter (udisksd) tries to mount
each Btrfs device node, even though those nodes make up a single fs
volume. It issues multiple mount and umount commands for that one
array. This doesn't happen with LVM and mdadm because an array has one
node. That's not true with Btrfs, it has one or many, depending on
your point of view. There's no way to mount just an fs volume UUID as
far as I know.
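
One way to picture the mismatch: a mounter that wants a single mount per
volume has to collapse the device nodes that share a filesystem UUID before
it acts. A minimal sketch, assuming input lines of the form
"NAME FSTYPE UUID" such as `lsblk -rno NAME,FSTYPE,UUID` prints. The function
name `group_by_fsuuid` is made up for illustration, and this is not what
udisks does today:

```shell
# Hypothetical helper: read "NAME FSTYPE UUID" lines on stdin and print one
# line per Btrfs filesystem UUID, followed by all of its member device nodes.
group_by_fsuuid() {
    awk '$2 == "btrfs" && NF >= 3 {
             if (!($3 in members)) order[++n] = $3   # remember first-seen order
             members[$3] = members[$3] " /dev/" $1
         }
         END {
             for (i = 1; i <= n; i++) print order[i] ":" members[order[i]]
         }'
}

# Usage on a live system (assumes util-linux lsblk is installed):
#   lsblk -rno NAME,FSTYPE,UUID | group_by_fsuuid
```

A two-device raid1 would then show up as one line with two members, which is
the unit a desktop mounter would ideally present as a single icon.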


>For that matter, the primary
> (only realistic?) use for multi-device filesystems on removable media is
> backups, and the few people who are going to set things up to automatically
> run backups when the disks get plugged in will be smart enough to get things
> working correctly themselves, while anyone else is going to be running the
> backup manually and can mount the FS by hand if they aren't using something
> like autofs.

Yeah, I am that person, but it's the DE that's getting confused, and
then confusing me with its confusion, so it's bad UX. GNOME automounts
a Btrfs raid1 by showing two disk icons with the exact same name, and
gets confused upon ejecting either with the GUI eject button or via
the CLI. So we can say udisks is doing something wrong, but what, and
is there anything we can do to make it easier for it to do the right
thing seeing as Btrfs is so different?


Here are some 2- to 6-year-old bugs related to this:
https://bugs.freedesktop.org/show_bug.cgi?id=87277
https://bugzilla.gnome.org/show_bug.cgi?id=746769
https://bugzilla.gnome.org/show_bug.cgi?id=608204


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs kernel oops on mount

2016-09-09 Thread Austin S. Hemmelgarn

On 2016-09-09 14:32, moparisthebest wrote:
> On 09/09/2016 01:51 PM, Chris Murphy wrote:
>> On Fri, Sep 9, 2016 at 10:12 AM, moparisthebest wrote:
>>> Hi,
>>>
>>> I'm hoping to get some help with mounting my btrfs array which quit
>>> working yesterday.  My array was in the middle of a balance, about 50%
>>> remaining, when it hit an error and remounted itself read-only [1].
>>> btrfs fi show output [2], btrfs df output [3].
>>>
>>> I unmounted the array, and when I tried to mount it again, it locked up
>>> the whole system so even alt+sysrq would not work.  I rebooted, tried to
>>> mount again, same lockup.  This was all kernel 4.5.7.
>>>
>>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>>> message appeared on the screen and I took a picture [4].
>>>
>>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>>> again, got some dmesg output before it crashed [5] and took a picture
>>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>>> pointer dereference at 01f0'.
>>>
>>> Is there anything I can do to get this in a working state again or
>>> perhaps even recover some data?
>>>
>>> Thanks much for any help
>>>
>>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
>>
>> Good report. Try on the 4.7.2 kernel system, two consoles, have one
>> ready with 'echo w > /proc/sysrq-trigger' as root (sudo doesn't work)
>> but don't issue it, mount in the other console and then switch back
>> and issue the sysrq. It'll take a while, minutes maybe even to switch
>> consoles, and then also for the command itself to issue, and then
>> minutes before the result actually gets committed to systemd journal
>> or var/log/messages. If it's a systemd system, and if you have to
>> force reboot to regain control, you can get the sysrq with 'journalctl
>> -b-1 -k > outputfile.txt'
>>
>> Also btrfs check output is useful to include also (without --repair
>> for starters).
>>
>> The thing that concerns me is this occasional problem that comes up
>> sometimes with lzo compressed volumes. Duncan knows more about that
>> one so he may chime in. I would definitely only do default mounts for
>> the above, don't include the compression option. You could also try -o
>> ro,recovery and see where that gets you.
>
> This is indeed an lzo compressed system, it's always been mounted with
> that option anyhow.
>
> btrfs check has been running for ~6 hours so far, I'll follow up with
> output on that when it finishes.
>
> Hmm, the problem with the 4.7.2/systemd system is it's a live usb system
> so the log/journal wouldn't be saved anywhere except tmpfs, I'll see
> what I can rig up unless someone has any amazing ideas?  I'm still brand
> new to systemd...

I don't know much about systemd myself, but I do know it's possible to 
set up a remote journal (essentially a remote logging server like people 
have been doing for decades with syslogd).  I don't know if this would 
catch the error or not though.  Alternatively, if you could set up a 
serial console, you could capture all the output there instead without 
even having to touch the journal.
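
For what it's worth, on a live USB system the goal is just to get kernel
messages off the box before it locks up. A rough sketch of the two usual
routes follows. The helper name `netconsole_param`, the interface `eth0`, the
address 192.168.1.50, and the serial settings are all placeholders rather
than anything taken from this thread:

```shell
# Hypothetical helper: build the netconsole module parameter.
# Format per the kernel's netconsole documentation:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
netconsole_param() {
    dev=$1; target_ip=$2; port=${3:-6666}
    printf 'netconsole=@/%s,%s@%s/\n' "$dev" "$port" "$target_ip"
}

# On the crashing machine, as root (eth0 and 192.168.1.50 are placeholders):
#   modprobe netconsole "$(netconsole_param eth0 192.168.1.50)"
# Serial console alternative, appended to the kernel command line at boot:
#   console=tty0 console=ttyS0,115200n8
```

The receiving machine only needs something listening on the chosen UDP port,
or a terminal program attached to the other end of the serial cable.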




Re: btrfs kernel oops on mount

2016-09-09 Thread Austin S. Hemmelgarn

On 2016-09-09 12:12, moparisthebest wrote:

> Hi,
>
> I'm hoping to get some help with mounting my btrfs array which quit
> working yesterday.  My array was in the middle of a balance, about 50%
> remaining, when it hit an error and remounted itself read-only [1].
> btrfs fi show output [2], btrfs df output [3].
>
> I unmounted the array, and when I tried to mount it again, it locked up
> the whole system so even alt+sysrq would not work.  I rebooted, tried to
> mount again, same lockup.  This was all kernel 4.5.7.
>
> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
> message appeared on the screen and I took a picture [4].
>
> I rebooted into an arch live system with kernel 4.7.2, tried to mount
> again, got some dmesg output before it crashed [5] and took a picture
> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
> pointer dereference at 01f0'.
>
> Is there anything I can do to get this in a working state again or
> perhaps even recover some data?
>
> Thanks much for any help
>
> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
> [4]: https://www.moparisthebest.com/btrfsoops.jpg
> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg


The output from btrfs fi show and fi df both indicate that the 
filesystem is essentially completely full.  You've gotten to the point 
where you're using the global metadata reserve, and I think things are
getting stuck trying (and failing) to reclaim the space that's used 
there.  The fact that the kernel is crashing in response to this is 
concerning, but it isn't surprising as this is not something that's 
really all that tested, and is very much not a normal operational 
scenario.  I'm guessing that the error you hit that forced the 
filesystem read-only is something that requires recovery, which in turn 
requires copy-on-write updates of some of the metadata, which you have 
essentially zero room for, and that's what's causing the kernel to choke 
when trying to mount the filesystem.


Given that the FS is pretty much wedged, I think your best bet for 
fixing this is probably going to be to use btrfs restore to get the data 
onto a new (larger) set of disks.  If you do take this approach, a 
metadata dump might be useful, if somebody could find enough room to 
extract it.
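
A minimal sketch of what the restore route could look like, assuming
/dev/sdX stands for any member device of the broken array and /mnt/new is a
directory on a separate, already-mounted filesystem. Both paths and the
wrapper names are placeholders:

```shell
# Set DRY_RUN=1 to print each command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Hypothetical wrapper: src_dev is a member device of the unmounted array,
# dest_dir is where the recovered files should be written.
restore_to_new_fs() {
    src_dev=$1; dest_dir=$2
    run btrfs restore -D -v "$src_dev" "$dest_dir" &&  # dry run: list only
    run btrfs restore -v "$src_dev" "$dest_dir"        # actually copy files
}

# Optional metadata-only image for developers (no file data is included):
#   btrfs-image -c9 -t4 /dev/sdX /path/with/space/metadata.img
```

Running it once with DRY_RUN=1 shows the exact commands before anything
touches the disks.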


Alternatively, because of the small amount of free space on the largest 
device in the array, you _might_ be able to fix things if you can get it 
mounted read-write by running a balance converting both data and 
metadata to single profiles, adding a few more disks (or replacing some 
with bigger ones), and then converting back to raid1 profiles.  This is 
exponentially more risky than just restoring to a new filesystem, and 
will almost certainly take longer.
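
Spelled out as commands, that sequence might look roughly like the sketch
below. It assumes the filesystem can be mounted read-write at all; the mount
point, the new device, and the function names are placeholders:

```shell
# Set DRY_RUN=1 to print each command instead of running it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }

# Hypothetical sequence: mnt is the read-write mount point, new_dev a disk to add.
convert_via_single() {
    mnt=$1; new_dev=$2
    # -f is needed because going from raid1 to single reduces metadata redundancy.
    run btrfs balance start -f -dconvert=single -mconvert=single "$mnt" &&
    run btrfs device add "$new_dev" "$mnt" &&
    run btrfs balance start -dconvert=raid1 -mconvert=raid1 "$mnt"
}
```

Between the first and last step the array has no redundancy at all, which is
a large part of why this path is the riskier one.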


A couple of other things to comment about on this:
1. 'can_overcommit' (the function that the Arch kernel choked on) is 
from the memory management subsystem.  The fact that that's throwing a 
null pointer says to me either your hardware has issues, or the Arch 
kernel itself has problems (which would probably mean the kernel image 
is corrupted).
2. You may want to look for more symmetrically sized disks if you're 
going to be using raid1 mode.  The space that's free on the last listed 
disk in the filesystem is unusable in raid1 mode because there are no 
other disks with usable space.
3. In general, it's a good idea to keep an eye on space usage on your 
filesystems.  If it's getting to be more than about 95% full, you should 
be looking at getting some more storage space.  This is especially true 
for BTRFS, as a 100% full BTRFS filesystem functionally becomes 
permanently read-only because there's nowhere for the copy-on-write 
updates to write to.
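
To put a number on point 2: a common rule of thumb for two-copy raid1 is
that usable space is the smaller of half the total and the total minus the
largest device. The sketch below applies that rule and ignores metadata
overhead and chunk granularity; the function name is made up for
illustration:

```shell
# Print the approximate usable raid1 capacity for the given device sizes.
# All sizes must be in the same unit; the result is in that unit too.
raid1_usable() {
    printf '%s\n' "$@" | awk '
        BEGIN { max = 0 }
        { sum += $1; if ($1 + 0 > max) max = $1 + 0 }
        END {
            rest = sum - max               # space on all but the largest device
            usable = (rest < sum / 2) ? rest : sum / 2
            print usable
        }'
}

# Example: disks of 8, 1 and 1 TB can only mirror 2 TB of data.
#   raid1_usable 8 1 1
```

Whatever the largest disk holds beyond the combined size of the others has
nowhere to keep its second copy, which is the unusable space described above.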



Re: btrfs kernel oops on mount

2016-09-09 Thread moparisthebest
On 09/09/2016 01:51 PM, Chris Murphy wrote:
> On Fri, Sep 9, 2016 at 10:12 AM, moparisthebest wrote:
>> Hi,
>>
>> I'm hoping to get some help with mounting my btrfs array which quit
>> working yesterday.  My array was in the middle of a balance, about 50%
>> remaining, when it hit an error and remounted itself read-only [1].
>> btrfs fi show output [2], btrfs df output [3].
>>
>> I unmounted the array, and when I tried to mount it again, it locked up
>> the whole system so even alt+sysrq would not work.  I rebooted, tried to
>> mount again, same lockup.  This was all kernel 4.5.7.
>>
>> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
>> message appeared on the screen and I took a picture [4].
>>
>> I rebooted into an arch live system with kernel 4.7.2, tried to mount
>> again, got some dmesg output before it crashed [5] and took a picture
>> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
>> pointer dereference at 01f0'.
>>
>> Is there anything I can do to get this in a working state again or
>> perhaps even recover some data?
>>
>> Thanks much for any help
>>
>> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
>> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
>> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
>> [4]: https://www.moparisthebest.com/btrfsoops.jpg
>> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
>> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg
> 
> Good report. Try on the 4.7.2 kernel system, two consoles, have one
> ready with 'echo w > /proc/sysrq-trigger' as root (sudo doesn't work)
> but don't issue it, mount in the other console and then switch back
> and issue the sysrq. It'll take a while, minutes maybe even to switch
> consoles, and then also for the command itself to issue, and then
> minutes before the result actually gets committed to systemd journal
> or var/log/messages. If it's a systemd system, and if you have to
> force reboot to regain control, you can get the sysrq with 'journalctl
> -b-1 -k > outputfile.txt'
> 
> Also btrfs check output is useful to include also (without --repair
> for starters).
> 
> The thing that concerns me is this occasional problem that comes up
> sometimes with lzo compressed volumes. Duncan knows more about that
> one so he may chime in. I would definitely only do default mounts for
> the above, don't include the compression option. You could also try -o
> ro,recovery and see where that gets you.
> 
> 

This is indeed an lzo compressed system, it's always been mounted with
that option anyhow.

btrfs check has been running for ~6 hours so far, I'll follow up with
output on that when it finishes.

Hmm, the problem with the 4.7.2/systemd system is it's a live usb system
so the log/journal wouldn't be saved anywhere except tmpfs, I'll see
what I can rig up unless someone has any amazing ideas?  I'm still brand
new to systemd...

Thanks!


Re: btrfs kernel oops on mount

2016-09-09 Thread Chris Murphy
On Fri, Sep 9, 2016 at 10:12 AM, moparisthebest wrote:
> Hi,
>
> I'm hoping to get some help with mounting my btrfs array which quit
> working yesterday.  My array was in the middle of a balance, about 50%
> remaining, when it hit an error and remounted itself read-only [1].
> btrfs fi show output [2], btrfs df output [3].
>
> I unmounted the array, and when I tried to mount it again, it locked up
> the whole system so even alt+sysrq would not work.  I rebooted, tried to
> mount again, same lockup.  This was all kernel 4.5.7.
>
> I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
> message appeared on the screen and I took a picture [4].
>
> I rebooted into an arch live system with kernel 4.7.2, tried to mount
> again, got some dmesg output before it crashed [5] and took a picture
> when it crashed [6], says in part 'BUG: unable to handle kernel NULL
> pointer dereference at 01f0'.
>
> Is there anything I can do to get this in a working state again or
> perhaps even recover some data?
>
> Thanks much for any help
>
> [1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
> [2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
> [3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
> [4]: https://www.moparisthebest.com/btrfsoops.jpg
> [5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
> [6]: https://www.moparisthebest.com/btrfsnulldereference.jpg

Good report. Try on the 4.7.2 kernel system, two consoles, have one
ready with 'echo w > /proc/sysrq-trigger' as root (sudo doesn't work)
but don't issue it, mount in the other console and then switch back
and issue the sysrq. It'll take a while, minutes maybe even to switch
consoles, and then also for the command itself to issue, and then
minutes before the result actually gets committed to systemd journal
or var/log/messages. If it's a systemd system, and if you have to
force reboot to regain control, you can get the sysrq with 'journalctl
-b-1 -k > outputfile.txt'
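
Laid out as commands, the procedure above could be sketched like this. The
function names and the SYSRQ_TRIGGER override are made up (the override only
exists so the first function can be tried against a scratch file); the two
underlying commands are the ones quoted above:

```shell
# Console 1, as root: have this ready, but only run it once the mount hangs.
dump_blocked_tasks() {
    # 'w' asks the kernel to log all tasks in uninterruptible (blocked) state.
    echo w > "${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
}

# After a forced reboot, on a systemd system with a persistent journal:
# save the kernel messages of the previous boot to a file.
save_previous_boot_kernel_log() {
    journalctl -b -1 -k > "${1:-outputfile.txt}"
}

# Console 2: the mount attempt itself, e.g.
#   mount /dev/sdX /mnt
```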

btrfs check output is also useful to include (without --repair
for starters).

The thing that concerns me is this occasional problem that comes up
sometimes with lzo compressed volumes. Duncan knows more about that
one so he may chime in. I would definitely only do default mounts for
the above, don't include the compression option. You could also try -o
ro,recovery and see where that gets you.


-- 
Chris Murphy


[GIT PULL] Btrfs

2016-09-09 Thread Chris Mason
Hi Linus,

We have three fixes in my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8

I'm not proud of how long it took me to track down that one liner in
btrfs_sync_log(), but the good news is the patches I was trying to blame
for these problems were actually fine (sorry Filipe).

Wang Xiaoguang (2) commits (+16/-8):
btrfs: introduce tickets_id to determine whether asynchronous metadata 
reclaim work makes progress (+7/-5)
btrfs: do not decrease bytes_may_use when replaying extents (+9/-3)

Chris Mason (1) commits (+1/-0):
Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns

Total: (3) commits (+17/-8)

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 23 +++
 fs/btrfs/tree-log.c|  1 +
 3 files changed, 17 insertions(+), 8 deletions(-)


[PATCH 6/7][V2] Btrfs: kill the btree_inode

2016-09-09 Thread Josef Bacik
In order to more efficiently support sub-page blocksizes we need to stop
allocating pages from pagecache for our metadata.  Instead switch to using the
account_metadata* counters for making sure we are keeping the system aware of
how much dirty metadata we have, and use the ->free_cached_objects super
operation in order to handle freeing up extent buffers.  This greatly simplifies
how we deal with extent buffers as now we no longer have to tie the page cache
reclamation stuff to the extent buffer stuff.  This will also allow us to
simply kmalloc() our data for sub-page blocksizes.

Signed-off-by: Josef Bacik 
---
V1->V2
-fixed the unlock_start as pointed out by Chandan.
-fixed a panic when fs_info->eb_info is null.

 fs/btrfs/btrfs_inode.h |   3 +-
 fs/btrfs/ctree.c   |  10 +-
 fs/btrfs/ctree.h   |  14 +-
 fs/btrfs/disk-io.c | 389 --
 fs/btrfs/extent_io.c   | 913 ++---
 fs/btrfs/extent_io.h   |  49 +-
 fs/btrfs/inode.c   |   6 +-
 fs/btrfs/root-tree.c   |   2 +-
 fs/btrfs/super.c   |  29 +-
 fs/btrfs/tests/btrfs-tests.c   |  37 +-
 fs/btrfs/tests/extent-io-tests.c   |   4 +-
 fs/btrfs/tests/free-space-tree-tests.c |   4 +-
 fs/btrfs/tests/qgroup-tests.c  |   4 +-
 fs/btrfs/transaction.c |  11 +-
 14 files changed, 727 insertions(+), 748 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1a8fa46..ad7b185 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -229,10 +229,9 @@ static inline u64 btrfs_ino(struct inode *inode)
u64 ino = BTRFS_I(inode)->location.objectid;
 
/*
-* !ino: btree_inode
 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
 */
-   if (!ino || BTRFS_I(inode)->location.type == BTRFS_ROOT_ITEM_KEY)
+   if (BTRFS_I(inode)->location.type == BTRFS_ROOT_ITEM_KEY)
ino = inode->i_ino;
return ino;
 }
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d1c56c9..b267053 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1373,8 +1373,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct 
btrfs_path *path,
 
if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
BUG_ON(tm->slot != 0);
-   eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start,
-   eb->len);
+   eb_rewin = alloc_dummy_extent_buffer(fs_info->eb_info,
+eb->start, eb->len);
if (!eb_rewin) {
btrfs_tree_read_unlock_blocking(eb);
free_extent_buffer(eb);
@@ -1455,8 +1455,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
} else if (old_root) {
btrfs_tree_read_unlock(eb_root);
free_extent_buffer(eb_root);
-   eb = alloc_dummy_extent_buffer(root->fs_info, logical,
-   root->nodesize);
+   eb = alloc_dummy_extent_buffer(root->fs_info->eb_info, logical,
+  root->nodesize);
} else {
btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK);
eb = btrfs_clone_extent_buffer(eb_root);
@@ -1772,7 +1772,7 @@ static noinline int generic_bin_search(struct 
extent_buffer *eb,
int err;
 
if (low > high) {
-   btrfs_err(eb->fs_info,
+   btrfs_err(eb->eb_info->fs_info,
 "%s: low (%d) > high (%d) eb %llu owner %llu level %d",
  __func__, low, high, eb->start,
  btrfs_header_owner(eb), btrfs_header_level(eb));
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 282a031..b9ee7cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -675,6 +676,7 @@ struct btrfs_device;
 struct btrfs_fs_devices;
 struct btrfs_balance_control;
 struct btrfs_delayed_root;
+struct btrfs_eb_info;
 
 #define BTRFS_FS_BARRIER   1
 #define BTRFS_FS_CLOSING_START 2
@@ -797,7 +799,7 @@ struct btrfs_fs_info {
struct btrfs_super_block *super_for_commit;
struct block_device *__bdev;
struct super_block *sb;
-   struct inode *btree_inode;
+   struct btrfs_eb_info *eb_info;
struct backing_dev_info bdi;
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
@@ -1042,10 +1044,6 @@ struct btrfs_fs_info {
/* readahead works cnt */
atomic_t reada_works_cnt;
 
-   /* Extent buffer radix tree */
-   spinlock_t buffer_lock;
-   struct radix_tree_root buffer_radix;
-
  

Re: Security implications of btrfs receive?

2016-09-09 Thread Austin S. Hemmelgarn

On 2016-09-09 12:33, David Sterba wrote:

On Wed, Sep 07, 2016 at 03:08:18PM -0400, Austin S. Hemmelgarn wrote:

On 2016-09-07 14:07, Christoph Anton Mitterer wrote:

On Wed, 2016-09-07 at 11:06 -0400, Austin S. Hemmelgarn wrote:

This is an issue with any filesystem,

Not really... any other filesystem I'd know (not sure about ZFS) keeps
working when there are UUID collisions... or at least it won't cause
arbitrary corruptions, which then in the end may even be used for such
attacks as described in that thread.

Even other multi-device containers (LVM, MD) don't at least corrupt
your data like it allegedly can happen with btrfs.




 it is just a bigger issue with
BTRFS.

No corruption vs. possible arbitrary data corruption and leakage seems
to be more than "just bigger".
I'd call it unacceptable for a production system.

So is refusing to boot.  In most cases, downtime is just as bad as data
corruption.




  Take a system using ext4, or XFS, or almost any other Linux
filesystem, running almost any major distro, create a minimum sized
partition on the disk for that filesystem type, and create a
filesystem
there with the same UUID as the root filesystem.  Next time that
system
reboots, things will usually blow up (XFS will refuse to mount, ext4
and
most other filesystems will work sometimes and not others).

Well but that's something completely different.
It would be perfectly fine if, in case of an UUID collision, the system
simply denies mounting/assembly (actually that's one of the solutions
others and I've proposed in the aforementioned thread).

But it's not acceptable if the system does *something* in such
situation,... or if such fs/container is already mounted/active and
another device with colliding UUID appears *then*, it's neither
acceptable that the already active fs/container wouldn't continue to
work properly.

And that seems to my experience just how e.g. LVM handles this.

"Not booting" is not really an issue in terms of data corruption.


At least I'm pretty sure to remember that one of the main developers
(was it Qu?) acknowledged these issues (both in terms of accidental
corruption and security wise) and that he was glad that these issues
were brought up and that they should be solved.



It hasn't, because there's not any way it can be completely
fixed.

Why not? As it was laid out by myself and others, the basic solution
would be:
- Refuse any mounting in case UUID collisions are detected.
- Generally don't do any auto-rebuilds or e.g. RAID assemblies unless
  specifically allowed/configured by the user (as this might also be
  used to extract data from a system).
- If there are any collisions (either by mounting or by processes like
  rebuilds/added devices) require the user to specify uniquely which
  device he actually wants (e.g. by path).
- And in case a filesystem is already mounted and UUID collisions
  happens then (e.g. a dd clone gets plugged in), continue to use the
  already active device... just as e.g. LVM does.


  This
particular case is an excellent example of why it's so hard to
fix.  To
close this particular hole, BTRFS itself would have to become aware
of
whether whoever is running an ioctl is running in a chroot or not,
which
is non-trivial to determine to begin with, and even harder when you
factor in the fact that chroot() is a VFS level thing, not a
underlying
filesystem thing, while ioctls are much lower level.

Isn't it simply enough to:
- know which blockdevices with a btrfs and with which UUIDs there are
- let userland tools deny any mount/assembly/etc. actions in case of
  collisions
- do the actual addressing of devices via the device path (so that
  proper devices will continue to be used if the fs was already mounted
  when a collision occurs)
?

That's not the issue being discussed in this case.  The ultimate issue
is of course the same (the flawed assumption that some arbitrary bytes
will be globally unique), but the particular resultant issues are
different.  The problem being discussed is that receive doesn't verify
that subvolume UUID's it has been told to clone from are within the area
it's been told to work in.  This can cause an information leak, but not
data corruption, and is actually an issue with the clone ioctl in
general.  Graham actually proposed a good solution to this particular
problem (require an open fd to a source file containing the blocks to be
passed into the ioctl in addition to everything else), but it's still
orthogonal to the symptoms you're talking about.


And further, as I've said, security wise auto-assembly of multi-device
seems always prone to attacks at least in certain use cases, so for the
security conscious people:
- Don't do auto-assembly/rebuild/etc. based on scanning for UUID
- Let the user choose to do this manually via specifying the devices
  (via e.g. path).
  So a user could say something like
  mount -t btrfs -o 
device=/dev/disk/by-path/pci-\:00\:1f.2-ata-1,device=/dev/disk/by-path/pci-\:00\:2f.2-ata-2
 

Re: Security implications of btrfs receive?

2016-09-09 Thread Austin S. Hemmelgarn

On 2016-09-09 12:18, David Sterba wrote:
> On Wed, Sep 07, 2016 at 07:58:30AM -0400, Austin S. Hemmelgarn wrote:
>> On 2016-09-06 13:20, Graham Cobb wrote:
>>> Thanks to Austin and Duncan for their replies.
>>>
>>> On 06/09/16 13:15, Austin S. Hemmelgarn wrote:
>>>> On 2016-09-05 05:59, Graham Cobb wrote:
>>>>> Does the "path" argument of btrfs-receive mean that *all* operations are
>>>>> confined to that path?  For example, if a UUID or transid is sent which
>>>>> refers to an entity outside the path will that other entity be affected
>>>>> or used?
>>>>
>>>> As far as I know, no, it won't be affected.
>>>>
>>>>> Is it possible for a file to be created containing shared
>>>>> extents from outside the path?
>>>>
>>>> As far as I know, the only way for this to happen is if you're
>>>> referencing a parent subvolume for a relative send that is itself
>>>> sharing extents outside of the path.  From a practical perspective,
>>>> unless you're doing deduplication on the receiving end, this
>>>> shouldn't be possible.
>>>
>>> Unfortunately that is not the case.  I decided to do some tests to see
>>> what happens.  It is possible for a receive into one path to reference
>>> and access a subvolume from a different path on the same btrfs disk.  I
>>> have created a bash script to demonstrate this at:
>>>
>>> https://gist.github.com/GrahamCobb/c7964138057e4e092a75319c9fb240a3
>>>
>>> This does require the attacker to know the (source) subvolume UUID they
>>> want to copy.  I am not sure how hard UUIDs are to guess.
>>
>> Oh, I forgot about the fact that it checks the whole filesystem for
>> referenced source subvolumes.
>
> What if the stream is verified first? I.e. check whether there are
> operations using subvolumes not owned by the user.

I think that extending the ioctl to require proof of access to the 
source being cloned from would be a better approach to this, as this is 
an issue with the ioctl in general, it's just discussion of send/receive 
that brought this up.  I'm actually kind of surprised that this didn't 
get noticed before, seeing as it's a pretty significant and not all that 
difficult to use information leak.  Ideally, this needs to be decided 
before the VFS layer clone ioctl gets finalized.



Re: Security implications of btrfs receive?

2016-09-09 Thread David Sterba
On Wed, Sep 07, 2016 at 03:08:18PM -0400, Austin S. Hemmelgarn wrote:
> On 2016-09-07 14:07, Christoph Anton Mitterer wrote:
> > On Wed, 2016-09-07 at 11:06 -0400, Austin S. Hemmelgarn wrote:
> >> This is an issue with any filesystem,
> > Not really... any other filesystem I'd know (not sure about ZFS) keeps
> > working when there are UUID collisions... or at least it won't cause
> > arbitrary corruptions, which then in the end may even be used for such
> > attacks as described in that thread.
> >
> > Even other multi-device containers (LVM, MD) don't at least corrupt
> > your data like it allegedly can happen with btrfs.
> >
> >
> >
> >>  it is just a bigger issue with
> >> BTRFS.
> > No corruption vs. possible arbitrary data corruption and leakage seems
> > to be more than "just bigger".
> > I'd call it unacceptable for a production system.
> So is refusing to boot.  In most cases, downtime is just as bad as data 
> corruption.
> >
> >
> >>   Take a system using ext4, or XFS, or almost any other Linux
> >> filesystem, running almost any major distro, create a minimum sized
> >> partition on the disk for that filesystem type, and create a
> >> filesystem
> >> there with the same UUID as the root filesystem.  Next time that
> >> system
> >> reboots, things will usually blow up (XFS will refuse to mount, ext4
> >> and
> >> most other filesystems will work sometimes and not others).
> > Well but that's something completely different.
> > It would be perfectly fine if, in case of an UUID collision, the system
> > simply denies mounting/assembly (actually that's one of the solutions
> > others and I've proposed in the aforementioned thread).
> >
> > But it's not acceptable if the system does *something* in such
> > situation,... or if such fs/container is already mounted/active and
> > another device with colliding UUID appears *then*, it's neither
> > acceptable that the already active fs/container wouldn't continue to
> > work properly.
> >
> > And that seems to my experience just how e.g. LVM handles this.
> >
> > "Not booting" is not really an issue in terms of data corruption.
> >
> >
> > At least I'm pretty sure to remember that one of the main developers
> > (was it Qu?) acknowledged these issues (both in terms of accidental
> > corruption and security wise) and that he was glad that these issues
> > were brought up and that they should be solved.
> >
> >
> >> It hasn't, because there's not any way it can be completely
> >> fixed.
> > Why not? As it was laid out by myself and others, the basic solution
> > would be:
> > - Refuse any mounting in case UUID collisions are detected.
> > - Generally don't do any auto-rebuilds or e.g. RAID assemblies unless
> >   specifically allowed/configured by the user (as this might also be
> >   used to extract data from a system).
> > - If there are any collisions (either by mounting or by processes like
> >   rebuilds/added devices) require the user to specify uniquely which
> >   device he actually wants (e.g. by path).
> > - And in case a filesystem is already mounted and UUID collisions
> >   happens then (e.g. a dd clone gets plugged in), continue to use the
> >   already active device... just as e.g. LVM does.
> >
> >>   This
> >> particular case is an excellent example of why it's so hard to
> >> fix.  To
> >> close this particular hole, BTRFS itself would have to become aware
> >> of
> >> whether whoever is running an ioctl is running in a chroot or not,
> >> which
> >> is non-trivial to determine to begin with, and even harder when you
> >> factor in the fact that chroot() is a VFS level thing, not a
> >> underlying
> >> filesystem thing, while ioctls are much lower level.
> > Isn't it simply enough to:
> > - know which blockdevices with a btrfs and with which UUIDs there are
> > - let userland tools deny any mount/assembly/etc. actions in case of
> >   collisions
> > > - do the actual addressing of devices via the device path (so that
> > >   the proper devices will continue to be used if the fs was already
> > >   mounted when a collision occurs)
> > ?
> That's not the issue being discussed in this case.  The ultimate issue 
> is of course the same (the flawed assumption that some arbitrary bytes 
> will be globally unique), but the particular resultant issues are 
> different.  The problem being discussed is that receive doesn't verify
> that subvolume UUIDs it has been told to clone from are within the area
> it's been told to work in.  This can cause an information leak, but not
> data corruption, and is actually an issue with the clone ioctl in 
> general.  Graham actually proposed a good solution to this particular 
> problem (require an open fd to a source file containing the blocks to be 
> passed into the ioctl in addition to everything else), but it's still 
> orthogonal to the symptoms you're talking about.
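The missing check described above (is the clone source inside the area receive was told to work in?) might look roughly like this. It is only a userland sketch in Python, not Graham's fd-based proposal and not btrfs-progs code: `is_within`, `clone_source_allowed` and the `uuid_to_path` mapping are hypothetical names, and the mapping from subvolume UUID to path is assumed to be available already.

```python
import os
from typing import Dict


def is_within(root: str, candidate: str) -> bool:
    """True if candidate resolves to root itself or to something beneath it."""
    root_real = os.path.realpath(root)
    cand_real = os.path.realpath(candidate)
    # commonpath compares whole path components, so "/a/bc" is not under "/a/b".
    return os.path.commonpath([root_real, cand_real]) == root_real


def clone_source_allowed(receive_root: str, source_uuid: str,
                         uuid_to_path: Dict[str, str]) -> bool:
    """Allow a clone only from a known subvolume located under receive_root."""
    path = uuid_to_path.get(source_uuid)  # hypothetical UUID -> path lookup
    return path is not None and is_within(receive_root, path)
```

With such a check, a stream received into one user's directory could not name a subvolume UUID that lives under another user's directory, which is the information leak described above.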
> >
> > And further, as I've said, security wise auto-assembly of multi-device
> > seems always prone to attacks at least in certain use cases, so for the

btrfs kernel oops on mount

2016-09-09 Thread moparisthebest
Hi,

I'm hoping to get some help with mounting my btrfs array which quit
working yesterday.  My array was in the middle of a balance, about 50%
remaining, when it hit an error and remounted itself read-only [1].
btrfs fi show output [2], btrfs df output [3].

I unmounted the array, and when I tried to mount it again, it locked up
the whole system so even alt+sysrq would not work.  I rebooted, tried to
mount again, same lockup.  This was all kernel 4.5.7.

I rebooted to kernel 4.4.0, tried to mount, crashed again, this time a
message appeared on the screen and I took a picture [4].

I rebooted into an arch live system with kernel 4.7.2, tried to mount
again, got some dmesg output before it crashed [5] and took a picture
when it crashed [6], says in part 'BUG: unable to handle kernel NULL
pointer dereference at 01f0'.

Is there anything I can do to get this in a working state again or
perhaps even recover some data?

Thanks much for any help

[1]: https://www.moparisthebest.com/btrfs/initial_crash.txt
[2]: https://www.moparisthebest.com/btrfs/btrfsfishow.txt
[3]: https://www.moparisthebest.com/btrfs/btrfsdf.txt
[4]: https://www.moparisthebest.com/btrfsoops.jpg
[5]: https://www.moparisthebest.com/btrfs/dmsgprecrash.txt
[6]: https://www.moparisthebest.com/btrfsnulldereference.jpg


Re: Security implications of btrfs receive?

2016-09-09 Thread David Sterba
On Wed, Sep 07, 2016 at 07:58:30AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-09-06 13:20, Graham Cobb wrote:
> > Thanks to Austin and Duncan for their replies.
> >
> > On 06/09/16 13:15, Austin S. Hemmelgarn wrote:
> >> On 2016-09-05 05:59, Graham Cobb wrote:
> >>> Does the "path" argument of btrfs-receive mean that *all* operations are
> >>> confined to that path?  For example, if a UUID or transid is sent which
> >>> refers to an entity outside the path will that other entity be affected
> >>> or used?
> >> As far as I know, no, it won't be affected.
> >>> Is it possible for a file to be created containing shared
> >>> extents from outside the path?
> >> As far as I know, the only way for this to happen is if you're
> >> referencing a parent subvolume for a relative send that is itself
> >> sharing extents outside of the path.  From a practical perspective,
> >> unless you're doing deduplication on the receiving end, this
> >> shouldn't be possible.
> >
> > Unfortunately that is not the case.  I decided to do some tests to see
> > what happens.  It is possible for a receive into one path to reference
> > and access a subvolume from a different path on the same btrfs disk.  I
> > have created a bash script to demonstrate this at:
> >
> > https://gist.github.com/GrahamCobb/c7964138057e4e092a75319c9fb240a3
> >
> > This does require the attacker to know the (source) subvolume UUID they
> > want to copy.  I am not sure how hard UUIDs are to guess.
> Oh, I forgot about the fact that it checks the whole filesystem for 
> referenced source subvolumes.

What if the stream is verified first? I.e. check whether it contains any
operations that use subvolumes not owned by the user.
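Such a verification pass could be sketched as follows, in Python for brevity. This is an illustration under assumptions, not the receive implementation: the stream is assumed to be already parsed into simple records, and `StreamCmd`, `find_foreign_sources`, `verify_stream` and the `owned_uuids` set are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class StreamCmd:
    """Hypothetical, already-parsed send-stream command."""
    op: str                            # e.g. "snapshot", "clone", "write"
    source_uuid: Optional[str] = None  # subvolume the command reads from, if any


def find_foreign_sources(cmds: Iterable[StreamCmd],
                         owned_uuids: Set[str]) -> List[StreamCmd]:
    """Return every command that references a subvolume the user does not own."""
    return [c for c in cmds
            if c.source_uuid is not None and c.source_uuid not in owned_uuids]


def verify_stream(cmds: Iterable[StreamCmd], owned_uuids: Set[str]) -> bool:
    """True only if no command touches a foreign subvolume; nothing is applied."""
    return not find_foreign_sources(cmds, owned_uuids)
```

A stream that fails the check would be rejected before anything is created. The likely cost is that the stream has to be read twice or buffered, since it normally arrives on a pipe.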


Re: [PATCH 1/3] btrfs-progs: check: remove unused found_key variable in walk_down_tree()

2016-09-09 Thread David Sterba
On Mon, Aug 29, 2016 at 06:22:17PM +0200, David Sterba wrote:
> On Thu, Aug 25, 2016 at 01:20:59PM +0800, Wang Xiaoguang wrote:
> > Signed-off-by: Wang Xiaoguang 
> > ---
> >  cmds-check.c | 5 -
> >  1 file changed, 5 deletions(-)
> > 
> > diff --git a/cmds-check.c b/cmds-check.c
> > index 0ddfd24..1cd0421 100644
> > --- a/cmds-check.c
> > +++ b/cmds-check.c
> > @@ -3737,7 +3737,6 @@ static int check_fs_root(struct btrfs_root *root,
> > path.slots[level] = 0;
> > } else {
> > struct btrfs_key key;
> > -   struct btrfs_disk_key found_key;
> >  
> > > btrfs_disk_key_to_cpu(&key, &root_item->drop_progress);
> > > level = root_item->drop_level;
> > > @@ -3745,10 +3744,6 @@ static int check_fs_root(struct btrfs_root *root,
> > > wret = btrfs_search_slot(NULL, root, &key, &path, 0, 0);
> > > if (wret < 0)
> > > goto skip_walking;
> > > -   btrfs_node_key(path.nodes[level], &found_key,
> > > -   path.slots[level]);
> > > -   WARN_ON(memcmp(&found_key, &root_item->drop_progress,
> > > -   sizeof(found_key)));
> 
> It's not unused: the WARN_ON is an if in disguise, and memcmp does the
> check. Am I missing something here?

So, the warning should stay, please replace it with an if and a message,
unless there are other reasons to drop the check completely.


Re: [PATCH 2/3] btrfs-progs: check: make low memory mode support partially dropped snapshots

2016-09-09 Thread David Sterba
On Thu, Aug 25, 2016 at 01:21:00PM +0800, Wang Xiaoguang wrote:
> Signed-off-by: Wang Xiaoguang 

This + test image applied, thanks.


Re: [PATCH v13 00/15] Btrfs In-band De-duplication

2016-09-09 Thread David Sterba
On Thu, Sep 08, 2016 at 03:12:49PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160907
> 
> This version is just another small update, rebased to David's
> for-next-20160906 branch.

I've rebased it locally onto the 4.9 patch queue and Josef's btree-inode
branch; it is now pushed to for-next-test. It's really for testing only.


Re: lockdep warning in btrfs in 4.8-rc3

2016-09-09 Thread Chris Mason

On 09/08/2016 08:50 PM, Dave Jones wrote:

On Thu, Sep 08, 2016 at 08:58:48AM -0400, Chris Mason wrote:
 > On 09/08/2016 07:50 AM, Christian Borntraeger wrote:
 > > On 09/08/2016 01:48 PM, Christian Borntraeger wrote:
 > >> Chris,
 > >>
 > >> with 4.8-rc3 I get the following on an s390 box:
 > >
 > > Sorry for the noise, just saw the fix in your pull request.
 > >
 >
 > The lockdep splat is still there, we'll need to annotate this one a little.

Here's another one (unrelated?) that I've not seen before today:

WARNING: CPU: 1 PID: 10664 at kernel/locking/lockdep.c:704 register_lock_class+0x33f/0x510
CPU: 1 PID: 10664 Comm: kworker/u8:5 Not tainted 4.8.0-rc5-think+ #2
Workqueue: writeback wb_workfn (flush-btrfs-1)
 0097 b97fbad3 88013b8c3770 a63d3ab1
   a6bf1792 a60df22f
 88013b8c37b0 a60897a0 02c0b97fbad3 a6bf1792
Call Trace:
 [] dump_stack+0x6c/0x9b
 [] ? register_lock_class+0x33f/0x510
 [] __warn+0x110/0x130
 [] warn_slowpath_null+0x2c/0x40
 [] register_lock_class+0x33f/0x510
 [] ? bio_add_page+0x7e/0x120
 [] __lock_acquire.isra.32+0x5b/0x8c0
 [] lock_acquire+0x58/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] _raw_write_lock+0x38/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] lock_extent_buffer_for_io+0x28/0x2e0 [btrfs]
 [] btree_write_cache_pages+0x231/0x550 [btrfs]
 [] ? btree_set_page_dirty+0x20/0x20 [btrfs]
 [] btree_writepages+0x74/0x90 [btrfs]
 [] do_writepages+0x3e/0x80
 [] __writeback_single_inode+0x42/0x220
 [] writeback_sb_inodes+0x351/0x730
 [] ? __wb_update_bandwidth+0x1c1/0x2b0
 [] wb_writeback+0x138/0x2a0
 [] wb_workfn+0x10e/0x340
 [] ? __lock_acquire.isra.32+0x1cf/0x8c0
 [] process_one_work+0x24f/0x5d0
 [] ? process_one_work+0x1e0/0x5d0
 [] worker_thread+0x53/0x5b0
 [] ? process_one_work+0x5d0/0x5d0
 [] kthread+0x120/0x140
 [] ? finish_task_switch+0x6a/0x200
 [] ret_from_fork+0x1f/0x40
 [] ? kthread_create_on_node+0x270/0x270
---[ end trace 7b39395c07435bf1 ]---


 700 /*
 701  * Huh! same key, different name? Did someone trample
 702  * on some memory? We're most confused.
 703  */
 704 WARN_ON_ONCE(class->name != lock->name);

That seems kinda scary. There was a trinity run going on at the same time,
so this _might_ be a random scribble from something unrelated to btrfs,
but just in case..

IWBNI that code printed out both cases so I could see if this was
corruption or two unrelated keys. I'll make it do that in case it
happens again.



I haven't seen this one before, if you could make it happen again, that 
would be great ;)


-chris



Dave




Re: bug report about patch "Btrfs: kill the btree_inode"

2016-09-09 Thread Josef Bacik

On 09/09/2016 04:28 AM, Wang Xiaoguang wrote:

> hello,
>
> When we rebased the dedupe patches onto David's for-next-20160906 branch,
> we hit the panic below. Bisecting suggests that "Btrfs: kill the btree_inode"
> causes this bug; please check.
> Fstests case btrfs/060 can easily reproduce this bug.


Oops, I forgot to run with SCRATCH_DEV_POOL set.  Thanks, I'll fix this up
and send out the corrected patch,


Josef


Re: Some help with the code.

2016-09-09 Thread David Sterba
On Tue, Sep 06, 2016 at 04:22:25PM +0100, Tomasz Kusmierz wrote:
> This is predominantly for maintainers:
> 
> I've noticed that there is a lot of code for btrfs ... and after few
> glimpses I've noticed that there are occurrences which beg for some
> refactoring to make it less of a pain to maintain.
> 
> I'm speaking of occurrences where:
> - within a function there are multiple checks for null pointer and
> then whenever there is anything hanging on the end of that pointer to
> finally call the function, pass the pointer to it and watch it perform
> same checks to finally deallocate stuff on the end of a pointer.

Can you please point me to an example? If it's a bad pattern it would be
worth cleaning up.

> - single line functions ... called only in two places

That might not always be useless: the function name tells us what it
does, not how, so it's a form of self-documenting code. If the function
body is some common code construct, it would be harder to grep for it.

But I understand what you mean. This could also be a leftover from some
broader changes that removed calls and reduced the function to a single
line.

> and so on.
> 
> I know that you guys are busy, but maintaining code that is only
> growing must be a pain.

Depends. Standalone features bring a lot of new code, but it's
separated. A random sample of patches from recent releases tells me that
the net line growth is spread across many patches that add just a few
lines (e.g. enhanced tests, more helpers).

https://btrfs.wiki.kernel.org/index.php/Contributors#Statistics

Doing broader cleanups is good when done from time to time, as it tends
to interfere with other patches, so it's more a matter of scheduling
when to do it. The beginning or end of the particular development cycle
are good candidates.

Reducing size should be done in a way that does not make the code less
readable, which is a rather subjective metric, but that can be sorted out
when patches (or samples) are posted. That said, cleanups and refactoring
patches are welcome.


Re: State of the fuzzer

2016-09-09 Thread David Sterba
On Tue, Sep 06, 2016 at 10:32:28PM +0200, Lukas Lueg wrote:
> I'm currently fuzzing rev 2076992 and things start to slowly, slowly
> quiet down. We will probably run out of steam at the end of the week
> when a total of (roughly) half a billion BTRFS-images have passed by.
> I will switch revisions to current HEAD and restart the whole process
> then. A few things:
> 
> * There are a couple of crashes (mostly segfaults) I have not reported
> yet. I'll report them if they show up again with the latest revision.

Ok.

> * The coverage-analysis shows assertion failures which are currently
> silenced. An assertion failure is technically a worse disaster
> successfully prevented, it still constitutes unexpected/unusable
> behaviour, though. Do you want assertions to be enabled and images
> triggering those assertions reported? This is basically the same
> conundrum as with BUG_ON and abort().

Yes please. I'd like to turn most bugons/assertions into a normal
failure report if it would make sense.

> * A few endless loops entered into by btrfsck are currently
> unmitigated (see bugs 155621, 155571, 11 and 155151). It would be
> nice if those had been taken care of by next week if possible.

Two of them are fixed, the other two need more work, updating all
callers of read_node_slot and the callchain. So you may still see that
kind of looping in more images. I don't have an ETA for the fix, I won't
be available during the next week.

At the moment, the initial sanity checks should catch most of the
corrupted values, so I'm expecting that you'll see different classes of
problems in the next rounds.

The testsuite now contains all the images you reported for which we have
a fix in git. More utilities are now run on the images, so there may be
more problems for us to fix.


Re: [PATCH] btrfs: fix a possible umount deadlock

2016-09-09 Thread David Sterba
On Fri, Sep 09, 2016 at 04:31:04PM +0800, Anand Jain wrote:
>  static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
>  {
>   struct btrfs_device *device, *tmp;
> + static LIST_HEAD(pending_put);

Why is it static?

> + INIT_LIST_HEAD(&pending_put);
>  
>   if (--fs_devices->opened > 0)
>   return 0;
> @@ -906,9 +904,24 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
>   mutex_lock(&fs_devices->device_list_mutex);
>   list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
>   btrfs_close_one_device(device);
> + list_add(&device->dev_list, &pending_put);
>   }
>   mutex_unlock(&fs_devices->device_list_mutex);
>  
> + /*
> +  * btrfs_show_devname() is using the device_list_mutex,
> +  * sometimes a call to blkdev_put() leads vfs calling
> +  * into this func. So do put outside of device_list_mutex,
> +  * as of now.
> +  */
> + while (!list_empty(&pending_put)) {
> + device = list_entry(pending_put.next,
> + struct btrfs_device, dev_list);
> + list_del(&device->dev_list);
> + btrfs_close_bdev(device);
> + call_rcu(&device->rcu, free_device);
> + }
> +
>   WARN_ON(fs_devices->open_devices);
>   WARN_ON(fs_devices->rw_devices);
>   fs_devices->opened = 0;
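For readers following the locking argument, the pattern in the quoted hunk (detach under the mutex, do the put outside it) can be shown with a small userspace analogy. This is a Python sketch, not kernel code: `FsDevices`, `show_devname`, `put_device` and `close_devices` are toy stand-ins, and the pending list is deliberately a local variable of the call rather than shared state.

```python
import threading
from typing import List


class FsDevices:
    """Toy stand-in for a device list guarded by a non-reentrant mutex."""

    def __init__(self, names: List[str]) -> None:
        self.device_list_mutex = threading.Lock()  # not reentrant
        self.devices = list(names)
        self.log: List[str] = []

    def show_devname(self) -> str:
        """Takes the mutex, like the callback the patch comment describes."""
        with self.device_list_mutex:
            return self.devices[0] if self.devices else "<none>"

    def put_device(self, name: str) -> None:
        """Releasing a device may call back into show_devname()."""
        self.log.append(f"put {name}, first remaining: {self.show_devname()}")

    def close_devices(self) -> None:
        pending_put: List[str] = []  # local to this call, not shared state
        with self.device_list_mutex:
            # Under the lock: only detach the devices from the shared list.
            pending_put.extend(self.devices)
            self.devices.clear()
        # Outside the lock: the callbacks can take the mutex without deadlock.
        for name in pending_put:
            self.put_device(name)
```

Calling `put_device()` inside the `with` block instead would hang on the second acquisition of the same lock, which is the kind of deadlock the patch comment is about.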


segfault btrfs scrub

2016-09-09 Thread Jan Koester

 
 
Hi,

I got a segfault from the btrfs scrub command. I am using btrfs tools 4.7.2.
 
root@dibsi:/home/jan# btrfs scrub status /local
Segmentation fault
root@dibsi:/home/jan# dmesg
[78294.556713] BTRFS error (device sda): bad tree block start 18427384836265136347 2304683610112
[78294.556956] BTRFS error (device sda): bad tree block start 17385487456874290426 2304683610112
[78294.558323] BTRFS error (device sda): bad tree block start 17385487456874290426 2304683610112
[78294.558397] [ cut here ]
[78294.569900] kernel BUG at fs/btrfs/ctree.c:5202!
[78294.581634] invalid opcode:  [#15] SMP
[78294.593089] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs 
libcrc32c binfmt_misc btrfs xor raid6_pq kvm_amd kvm irqbypass serio_raw 
snd_usb_audio input_leds joydev snd_usbmidi_lib snd_hda_codec_hdmi edac_mce_amd 
snd_hda_intel edac_core snd_hda_codec k10temp snd_ctxfi snd_hda_core snd_hwdep 
snd_pcm i2c_piix4 snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
snd_seq_device snd_timer snd soundcore tpm_infineon mac_hid 8250_fintek shpchp 
sunrpc parport_pc ppdev lp parport autofs4 hid_generic usbhid hid amdkfd 
amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper e1000e syscopyarea 
sysfillrect sysimgblt ptp fb_sys_fops r8169 drm mii ahci pps_core libahci wmi 
fjes
[78294.629504] CPU: 3 PID: 16486 Comm: btrfs Tainted: G  D W   
4.6.0-rc4 #1
[78294.629506] Hardware name: Gigabyte Technology Co., Ltd. 
GA-970A-D3/GA-970A-D3, BIOS F12 09/03/2013
[78294.629510] task: 880070766800 ti: 8801c2d3 task.ti: 
8801c2d3
[78294.629568] RIP: 0010:[]  [] 
btrfs_search_forward+0x24d/0x330 [btrfs]
[78294.629572] RSP: 0018:8801c2d33c10  EFLAGS: 00010246
[78294.629581] RAX:  RBX:  RCX: 0001
[78294.629583] RDX: 0001 RSI:  RDI: 880080638d40
[78294.629585] RBP: 8801c2d33c70 R08: 021899d9 R09: 02189fd9
[78294.629587] R10:  R11: 0003 R12: 88008826e8c0
[78294.629589] R13: 0001 R14: 0001 R15: 
[78294.629593] FS:  7ff69486f8c0() GS:88022fcc() 
knlGS:e71e3b40
[78294.629595] CS:  0010 DS:  ES:  CR0: 80050033
[78294.629598] CR2: 01a94088 CR3: 000221fe6000 CR4: 06e0
[78294.629599] Stack:
[78294.629605]  024280ca 8801c2d33cbf 880223bfa800 
01ff
[78294.629609]  d800 0001 db9fb905 
88008826e8c0
[78294.629613]  8801c2d33d18 8802008ee000 8801c2d33cbf 
8801f91e6800
[78294.629614] Call Trace:
[78294.629669]  [] search_ioctl+0xf2/0x1a0 [btrfs]
[78294.629720]  [] btrfs_ioctl_tree_search+0x72/0xc0 [btrfs]
[78294.629769]  [] btrfs_ioctl+0x3e4/0x21a0 [btrfs]
[78294.629777]  [] ? handle_mm_fault+0x14cf/0x1e60
[78294.629782]  [] ? cp_new_stat+0x153/0x180
[78294.629789]  [] do_vfs_ioctl+0xa1/0x5b0
[78294.629794]  [] ? __do_page_fault+0x205/0x4d0
[78294.629800]  [] SyS_ioctl+0x79/0x90
[78294.629806]  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
[78294.629847] Code: 8b 4d a0 48 8b 55 a8 4d 89 f8 48 8b 7d b0 4c 89 e6 e8 68 
fb ff ff 85 c0 0f 85 bf 00 00 00 4c 89 e7 e8 88 7f ff ff e9 fa fd ff ff <0f> 0b 
48 8d 04 92 43 89 54 ac 40 48 8d 75 bf b9 11 00 00 00 48
[78294.629885] RIP  [] btrfs_search_forward+0x24d/0x330 
[btrfs]
[78294.629887]  RSP 
[78294.629969] ---[ end trace fa1ffcf4f496deaf ]---


Re: recent complete stalls of btrfs (4.7.0-rc2+) -- any advice?

2016-09-09 Thread Yaroslav Halchenko

On Tue, 09 Aug 2016, Yaroslav Halchenko wrote:

> The beast has died on me today's morning :-/  Last kern.log msg was

> (Fixing recursive fault but reboot is needed!)

It locked up again, but this time the stack seems to be different from
before (and the above message is absent):

(full list of oopses since boot at
http://www.onerussian.com/tmp/journal-20160909-oopses.log
)

Sep 09 02:18:33 smaug kernel: [ cut here ]
Sep 09 02:18:33 smaug kernel: WARNING: CPU: 4 PID: 2189174 at 
lib/list_debug.c:33 __list_add+0x86/0xb0
Sep 09 02:18:33 smaug kernel: list_add corruption. prev->next should be next 
(8820079d6308), but was 88181e7e0d28. (prev=8810b209fe10).
Sep 09 02:18:33 smaug kernel: Modules linked in: veth xt_addrtype 
ipt_MASQUERADE nf_nat_masquerade_ipv4 bridge stp llc pci_stub cpufreq_stats 
cpufreq_userspace cpufreq_conservative cpufreq_powersave xt_pkttype nf_log_ipv4 
nf_log_common xt_tcpudp ip6table_mangle nfsd auth_rpcgss oid_registry nfs_acl 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_TCPMSS 
xt_LOG ipt_REJECT nf_reject_ipv4 iptable_mangle xt_multiport xt_state xt_limit 
xt_conntrack nf_conntrack_ftp nfs lockd grace nf_conntrack ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables fscache sunrpc binfmt_misc 
ipmi_watchdog intel_rapl sb_edac edac_core x86_pkg_temp_thermal 
intel_powerclamp coretemp ipmi_poweroff ipmi_devintf kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support 
drbg
Sep 09 02:18:33 smaug kernel:  ansi_cprng snd_pcm snd_timer aesni_intel snd 
aes_x86_64 soundcore lrw fuse gf128mul glue_helper ablk_helper cryptd pcspkr 
ast ttm drm_kms_helper joydev drm mei_me evdev i2c_algo_bit i2c_i801 mei shpchp 
lpc_ich ioatdma mfd_core ipmi_si wmi ipmi_msghandler tpm_tis tpm acpi_pad 
acpi_power_meter button ecryptfs cbc sha256_ssse3 sha256_generic hmac 
encrypted_keys autofs4 ext4 crc16 jbd2 mbcache btrfs dm_mod raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c crc32c_generic raid1 md_mod sg ses enclosure sd_mod hid_generic 
usbhid hid crc32c_intel ahci libahci mpt3sas raid_class scsi_transport_sas 
ehci_pci xhci_pci xhci_hcd ehci_hcd libata ixgbe dca usbcore usb_common 
scsi_mod ptp pps_core mdio fjes [last unloaded: vboxdrv]
Sep 09 02:18:33 smaug kernel: CPU: 4 PID: 2189174 Comm: git-annex Tainted: G        W IO    4.7.0-rc2+ #1
Sep 09 02:18:33 smaug kernel: Hardware name: Supermicro X10DRi/X10DRI-T, BIOS 
1.0b 09/17/2014
Sep 09 02:18:33 smaug kernel:  0286 0ab947c2 
8130c605 881292cfbd28
Sep 09 02:18:33 smaug kernel:   8107a314 
881292cfbe10 881292cfbd80
Sep 09 02:18:33 smaug kernel:  8810b209fe10 881037a07a98 
881f24b1a800 881037a07800
Sep 09 02:18:33 smaug kernel: Call Trace:
Sep 09 02:18:33 smaug kernel:  [] ? dump_stack+0x5c/0x77
Sep 09 02:18:33 smaug kernel:  [] ? __warn+0xc4/0xe0
Sep 09 02:18:33 smaug kernel:  [] ? 
warn_slowpath_fmt+0x5f/0x80
Sep 09 02:18:33 smaug kernel:  [] ? 
btrfs_write_marked_extents+0x95/0x130 [btrfs]
Sep 09 02:18:33 smaug kernel:  [] ? __list_add+0x86/0xb0
Sep 09 02:18:33 smaug kernel:  [] ? 
btrfs_sync_log+0x249/0xa80 [btrfs]
Sep 09 02:18:33 smaug kernel:  [] ? 
btrfs_sync_file+0x39a/0x3e0 [btrfs]
Sep 09 02:18:33 smaug kernel:  [] ? do_fsync+0x38/0x60
Sep 09 02:18:33 smaug kernel:  [] ? SyS_fdatasync+0xf/0x20
Sep 09 02:18:33 smaug kernel:  [] ? 
entry_SYSCALL_64_fastpath+0x1e/0xa8
Sep 09 02:18:33 smaug kernel: ---[ end trace 125800d45db3ce41 ]---
Sep 09 02:18:34 smaug kernel: general protection fault:  [#1] SMP
Sep 09 02:18:34 smaug kernel: Modules linked in: veth xt_addrtype 
ipt_MASQUERADE nf_nat_masquerade_ipv4 bridge stp llc pci_stub cpufreq_stats 
cpufreq_userspace cpufreq_conservative cpufreq_powersave xt_pkttype nf_log_ipv4 
nf_log_common xt_tcpudp ip6table_mangle nfsd auth_rpcgss oid_registry nfs_acl 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_TCPMSS 
xt_LOG ipt_REJECT nf_reject_ipv4 iptable_mangle xt_multiport xt_state xt_limit 
xt_conntrack nf_conntrack_ftp nfs lockd grace nf_conntrack ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables fscache sunrpc binfmt_misc 
ipmi_watchdog intel_rapl sb_edac edac_core x86_pkg_temp_thermal 
intel_powerclamp coretemp ipmi_poweroff ipmi_devintf kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support 
drbg
Sep 09 02:18:34 smaug kernel:  ansi_cprng snd_pcm snd_timer aesni_intel snd 
aes_x86_64 soundcore lrw fuse gf128mul glue_helper ablk_helper cryptd pcspkr 
ast ttm drm_kms_helper joydev drm mei_me evdev i2c_algo_bit i2c_i801 mei shpchp 
lpc_ich ioatdma mfd_core ipmi_si wmi ipmi_msghandler tpm_tis tpm acpi_pad 
acpi_power_meter button ecryptfs cbc sha256_ssse3 sha256_generic hmac 
encrypted_keys autofs4 ext4 crc16 jbd2 mbcache btrfs dm_mod raid456 
async_raid6_recov async_memcpy async_pq async_xor

Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread Holger Hoffstätte
On 09/09/16 12:18, Holger Hoffstätte wrote:
> On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:
> 
>> cleaner_kthread() may run at any time, in which it'll call
>> btrfs_delete_unused_bgs() to delete unused block groups. Because this
>> work is asynchronous, it may also result in false ENOSPC error.
> 
> 
> With this v3 I can now no longer balance (tested only with metadata).
> New chunks are allocated (as balance does) but nothing ever shrinks, until
> after unmount/remount, when the cleaner eventually kicks in.
> 
> This might be related to the recent patch by Naohiro Aota:
> "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
> 
> which by itself doesn't seem to do any harm (i.e. everything still seems
> to work as expected).

Actually even that is not true; both patches seem to be wrong in subtle
ways. Naohiro's patch seems to prevent the deletion during balance, whereas
yours prevents the cleaner from kicking in.

As a simple reproducer you can convert from -mdup to -msingle (to create
bloat) and then balance with -musage=10. Depending on which of the two
patches is applied, you end up with bloat that only grows and never
shrinks, or bloat that ends up in a mixed state (dup and single).
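Spelled out, the two steps could look like the sketch below, written as a small Python helper that only builds the command lines. The helper names are hypothetical, and the `-f` flag is an assumption on my side: balance generally asks for it when a conversion reduces metadata redundancy, which dup to single does.

```python
import subprocess
from typing import List


def reproducer_commands(mountpoint: str) -> List[List[str]]:
    """Build the two balance invocations described above (nothing is run)."""
    return [
        # Step 1: convert metadata to single to create bloat.
        # -f is an assumption: balance usually wants it when redundancy drops.
        ["btrfs", "balance", "start", "-f", "-mconvert=single", mountpoint],
        # Step 2: balance only metadata block groups below 10% usage.
        ["btrfs", "balance", "start", "-musage=10", mountpoint],
    ]


def run_reproducer(mountpoint: str) -> None:
    """Run both steps; needs root and a scratch btrfs filesystem."""
    for argv in reproducer_commands(mountpoint):
        subprocess.run(argv, check=True)
```

Checking `btrfs filesystem df` on the mountpoint before and after should show whether the metadata allocation shrinks or stays in a mixed state.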

Undoing both makes both balancing and cleaning work again.

-h



Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread Holger Hoffstätte
On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote:

> cleaner_kthread() may run at any time, in which it'll call
> btrfs_delete_unused_bgs() to delete unused block groups. Because this
> work is asynchronous, it may also result in false ENOSPC error.


With this v3 I can now no longer balance (tested only with metadata).
New chunks are allocated (as balance does) but nothing ever shrinks, until
after unmount/remount, when the cleaner eventually kicks in.

This might be related to the recent patch by Naohiro Aota:
"btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"

which by itself doesn't seem to do any harm (i.e. everything still seems
to work as expected).

-h



Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread David Sterba
On Fri, Sep 09, 2016 at 04:25:15PM +0800, Wang Xiaoguang wrote:
> hello David,
> 
> This patch's v2 version in your for-next-20160906 branch is still wrong,
> really sorry, please revert it.

Patch replaced with V3 in the upcoming for-next.

> Stefan Priebe has reported another similar issue, though I didn't see
> it in my test environment. Now I choose to not call
> down_read(bg_delete_sem) for the free space inode, which I think can
> resolve these issues, please check, thanks.


bug report about patch "Btrfs: kill the btree_inode"

2016-09-09 Thread Wang Xiaoguang

hello,

When we rebased the dedupe patches onto David's for-next-20160906 branch,
we hit the panic below. Bisecting suggests that "Btrfs: kill the btree_inode"
causes this bug; please check.
Fstests case btrfs/060 can easily reproduce this bug.

localhost login: [   43.694734] BUG: unable to handle kernel NULL pointer dereference at 0070

[   43.695812] IP: [] list_lru_destroy+0x11/0xe0
[   43.696526] PGD 0
[   43.696765] Oops:  [#1] SMP
[   43.697105] Modules linked in: uinput fuse ip6t_rpfilter ip6t_REJECT 
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat 
ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_mangle iptable_security iptable_raw iptable_filter dm_mirror 
dm_region_hash dm_log dm_mod snd_hda_codec_generic crct10dif_pclmul 
crc32_pclmul ext4 snd_hda_intel ppdev snd_hda_codec jbd2 btrfs 
ghash_clmulni_intel mbcache snd_hwdep snd_hda_core snd_seq xor 
snd_seq_device aesni_intel glue_helper lrw raid6_pq snd_pcm gf128mul 
ablk_helper cryptd parport_pc snd_timer pcspkr virtio_balloon snd 
parport soundcore sg i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace 
sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi 
qxl virtio_console drm_kms_helper 8139too syscopyarea sysfillrect ahci 
sysimgblt fb_sys_fops ttm libahci ata_piix drm libata crc32c_intel 
serio_raw virtio_pci i2c_core virtio_ring virtio 8139cp mii floppy

[   43.709009] CPU: 0 PID: 8267 Comm: mount Not tainted 4.8.0-rc5+ #50
[   43.709680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.8.2-20150714_191134- 04/01/2014

[   43.710691] task: 880074e3ab80 task.stack: 88006c1e8000
[   43.711322] RIP: 0010:[] [] 
list_lru_destroy+0x11/0xe0

[   43.712227] RSP: 0018:88006c1ebb88  EFLAGS: 00010246
[   43.712796] RAX:  RBX:  RCX: 
dead0200
[   43.713552] RDX: 81c78d78 RSI: 880074e3ab80 RDI: 
0070
[   43.714314] RBP: 88006c1ebba0 R08: 88006c1ebb00 R09: 
88003337e000
[   43.715074] R10:  R11: 000a282f3176 R12: 
0070
[   43.715948] R13: 8800738b6000 R14: 88007b028680 R15: 
8800769b0a80
[   43.716709] FS:  7fd734456880() GS:88007de0() 
knlGS:

[   43.717570] CS:  0010 DS:  ES:  CR0: 80050033
[   43.718187] CR2: 0070 CR3: 77b8d000 CR4: 
000406f0

[   43.718954] Stack:
[   43.719177]   a05f31a0 8800738b6000 
88006c1ebc80
[   43.720018]  a052a1fe c9abea50 00080294 
00017b403b01
[   43.720862]  88006c1ebbd8 813640ae 88006c1ebc08 
811b8b4e

[   43.721675] Call Trace:
[   43.721968]  [] btrfs_mount+0xb6e/0xfc0 [btrfs]
[   43.722676]  [] ? find_next_zero_bit+0x1e/0x20
[   43.723321]  [] ? pcpu_next_unpop+0x3e/0x50
[   43.723938]  [] ? find_next_bit+0x19/0x20
[   43.724537]  [] mount_fs+0x39/0x160
[   43.725085]  [] ? __alloc_percpu+0x15/0x20
[   43.725696]  [] vfs_kern_mount+0x67/0x100
[   43.726332]  [] btrfs_mount+0x19d/0xfc0 [btrfs]
[   43.726992]  [] ? find_next_zero_bit+0x1e/0x20
[   43.727646]  [] mount_fs+0x39/0x160
[   43.728192]  [] ? __alloc_percpu+0x15/0x20
[   43.728881]  [] vfs_kern_mount+0x67/0x100
[   43.729480]  [] do_mount+0x1e2/0xca0
[   43.730036]  [] ? kmem_cache_alloc_trace+0x14b/0x1b0
[   43.730742]  [] SyS_mount+0x83/0xd0
[   43.731290]  [] do_syscall_64+0x67/0x160
[   43.731888]  [] entry_SYSCALL64_slow_path+0x25/0x25
[   43.732575] Code: 4d 8b 26 4c 89 e7 e8 9f 64 03 00 5b 41 5c 41 5d 41 5e 5d c3 66 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 49 89 fc 53 <48> 83 3f 00 0f 84 b2 00 00 00 e8 50 9a 04 00 48 c7 c7 20 34 c9

[   43.735379] RIP  [] list_lru_destroy+0x11/0xe0
[   43.736043]  RSP 
[   43.736421] CR2: 0070
[   43.737102] ---[ end trace 7f226c7f270332f0 ]---
[   43.737837] Kernel panic - not syncing: Fatal exception
[   43.738430] Kernel Offset: disabled
[   43.738735] ---[ end Kernel panic - not syncing: Fatal exception

Regards,
Xiaoguang Wang




Re: Another 4.8-rc locked splat: btrfs_close_devices()

2016-09-09 Thread Anand Jain




 Looks like we need to take the time to clean up the use of
 device_list_mutex, chunk_mutex, volume_mutex and RCU. For now I have sent out
 [PATCH] btrfs: fix a possible umount deadlock

 This has passed xfstests/btrfs.

Thanks, Anand

On 09/09/2016 08:38 AM, Anand Jain wrote:


Thanks for the report Ilya.

Yep. I have seen similar issues during the hotspare fixes as well,
where the VFS call into btrfs_show_devname() conflicts on its
device_list_mutex lock. One of them is fixed here:

--
779bf3fefa835cb52a07457c8acac6f2f66f2493
btrfs: fix lock dep warning, move scratch dev out of
device_list_mutex and uuid_mutex
--

I was kind of expecting this here as well when I wrote 142388194191.
However, I couldn't reproduce it.

To fix this permanently, I see the following choices.

Chris/David,

 1. Do you think device_list_mutex is needed in btrfs_show_devname(),
 or should RCU suffice?

 2. To me, the role of fs_info->volume_mutex could be taken over by
 device_list_mutex. Am I missing something?

Thanks, Anand


On 09/08/2016 10:34 PM, Ilya Dryomov wrote:

Hello,

This one seems to have appeared after Anand's commit
142388194191 ("btrfs: do not background blkdev_put()") got merged into
4.8-rc4.

Thanks,

Ilya




Re: [PATCH 3/3] writeback: introduce super_operations->write_metadata

2016-09-09 Thread Jan Kara
On Mon 22-08-16 13:35:02, Josef Bacik wrote:
> Now that we have metadata counters in the VM, we need to provide a way to kick
> writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
> allows file systems to deal with writing back any dirty metadata we need based
> on the writeback needs of the system.  Since there is no inode to key off of we
> need a list in the bdi for dirty super blocks to be added.  From there we can
> find any dirty sb's on the bdi we are currently doing writeback on and call into
> their ->write_metadata callback.
> 
> Signed-off-by: Josef Bacik 
...
> @@ -1639,11 +1664,38 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  
>  		/* refer to the same tests at the end of writeback_sb_inodes */
>  		if (wrote) {
> -			if (time_is_before_jiffies(start_time + HZ / 10UL))
> -				break;
> -			if (work->nr_pages <= 0)
> +			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
> +			    work->nr_pages <= 0) {
> +				done = true;
>  				break;
> +			}
> +		}
> +	}
> +
> +	if (!done && wb_stat(wb, WB_METADATA_DIRTY)) {
> +		LIST_HEAD(list);
> +
> +		spin_unlock(&wb->list_lock);
> +		spin_lock(&wb->bdi->sb_list_lock);
> +		list_splice_init(&wb->bdi->dirty_sb_list, &list);
> +		while (!list_empty(&list)) {
> +			struct super_block *sb;
> +
> +			sb = list_first_entry(&list, struct super_block,
> +					      s_bdi_list);
> +			list_move_tail(&sb->s_bdi_list,
> +				       &wb->bdi->dirty_sb_list);
> +			if (!sb->s_op->write_metadata)
> +				continue;
> +			if (!trylock_super(sb))
> +				continue;
> +			spin_unlock(&wb->bdi->sb_list_lock);
> +			wrote += writeback_sb_metadata(sb, wb, work);
> +			spin_lock(&wb->bdi->sb_list_lock);
> +			up_read(&sb->s_umount);
>  		}
> +		spin_unlock(&wb->bdi->sb_list_lock);
> +		spin_lock(&wb->list_lock);
>  	}
>  	/* Leave any unwritten inodes on b_io */
>  	return wrote;

So this will hook metadata writeback into the periodic writeback but when
work->sb is set, metadata won't be written because in that case we call
writeback_sb_inodes() directly. So you need to call writeback_sb_metadata()
from wb_writeback() in that case as well. 

...

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f3f0b4c8..c063ac6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1430,6 +1430,8 @@ struct super_block {
>  
>  	spinlock_t		s_inode_wblist_lock;
>  	struct list_head	s_inodes_wb;	/* writeback inodes */
> +
> +	struct list_head	s_bdi_list;

Maybe call this s_bdi_dirty_list?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread Wang Xiaoguang

Hello David,

The v2 version of this patch in your for-next-20160906 branch is still wrong,
really sorry; please revert it.

Stefan Priebe has reported another similar issue, though I didn't see it in my
test environment. I now choose not to call down_read(bg_delete_sem) for the free
space inode, which I think resolves these issues. Please check, thanks.

Regards,
Xiaoguang Wang

On 09/09/2016 04:17 PM, Wang Xiaoguang wrote:

cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it can
result in a false ENOSPC error. See the race window below:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and has no   |   |
    |   free space to satisfy the data        |   |
    |   request.                              |   |
    |                                         |   |-> btrfs_delete_unused_bgs()
    |                                         |   |   it will decrease btrfs_space_info
    |                                         |   |   total_bytes and make
    |                                         |   |   btrfs_space_info not full.
    |                                         |   |

In this case, we may get an ENOSPC error even though btrfs_space_info is not
full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block
btrfs_delete_unused_bgs(). Here we introduce a new struct rw_semaphore,
bg_delete_sem, to do this job.

There is already a "struct mutex delete_unused_bgs_mutex", but because it is a
mutex we cannot use it for this purpose. Of course, we could redefine it as a
struct rw_semaphore and then use it in btrfs_alloc_data_chunk_ondemand().
Either method will work.

But given that delete_unused_bgs_mutex's name is longer than bg_delete_sem,
I choose the first method: create a new struct rw_semaphore bg_delete_sem and
delete delete_unused_bgs_mutex :)

Reported-by: Stefan Priebe 
Signed-off-by: Wang Xiaoguang 
---
v2: fix a deadlock revealed by fstests case btrfs/071; we call
    start_transaction() before down_write(bg_delete_sem) in
    btrfs_delete_unused_bgs().

v3: Stefan Priebe reported another similar deadlock, so here we choose
    not to call down_read(bg_delete_sem) for the free space inode in
    btrfs_alloc_data_chunk_ondemand(). Because we only do the data space
    reservation for the free space cache in transaction context,
    btrfs_delete_unused_bgs() will either have finished its job or will
    start a new transaction and wait for the current transaction to
    complete; there will be no unused block groups to delete, so it is
    safe not to call down_read(bg_delete_sem).
---
---
  fs/btrfs/ctree.h   |  2 +-
  fs/btrfs/disk-io.c | 13 +--
  fs/btrfs/extent-tree.c | 59 --
  fs/btrfs/volumes.c | 42 +--
  4 files changed, 76 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eff3993..fa78ef9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -788,6 +788,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
@@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
-	struct mutex delete_unused_bgs_mutex;
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54bc8c7..3cdbd05 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
 		btrfs_run_defrag_inodes(root->fs_info);
 
 		/*
-		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
-		 * with relocation (btrfs_relocate_chunk) and relocation
-		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
-		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
-		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
-		 * unused block groups.
+		 * Acquires fs_info->bg_delete_sem to avoid racing with
+		 * relocation (btrfs_relocate_chunk) and relocation acquires
+		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
+		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor 

[PATCH] btrfs: fix a possible umount deadlock

2016-09-09 Thread Anand Jain
btrfs_show_devname() takes device_list_mutex, and a call to blkdev_put()
can sometimes lead the VFS to call into this function. So, for now,
call blkdev_put() outside of device_list_mutex.
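
As an illustration of the general pattern (not the patch itself), the sketch
below detaches a list while holding a mutex and only then calls the release
function that may take other locks. Everything here is a hypothetical
userspace stand-in: struct dev, release_handle() and close_all_devices() are
invented for the example, and a pthread mutex plays the role of
device_list_mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a device on the list; not the btrfs structure. */
struct dev {
	struct dev *next;
	int open;	/* 1 while the underlying handle is held */
};

static pthread_mutex_t device_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct dev *device_list;
static int handles_released;

/* Stand-in for blkdev_put(): may take other locks, so call it unlocked. */
static void release_handle(struct dev *d)
{
	d->open = 0;
	handles_released++;
}

/* Detach the list under the mutex, then release each handle outside it. */
void close_all_devices(void)
{
	struct dev *pending;

	pthread_mutex_lock(&device_list_mutex);
	pending = device_list;
	device_list = NULL;
	pthread_mutex_unlock(&device_list_mutex);

	while (pending) {
		struct dev *next = pending->next;

		release_handle(pending);
		pending = next;
	}
}
```

Because the mutex is dropped before release_handle() runs, no lock taken
inside the release path can ever be ordered after device_list_mutex, which is
the dependency the report below complains about.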

[  983.284212] ==
[  983.290401] [ INFO: possible circular locking dependency detected ]
[  983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted
[  983.302081] ---
[  983.308357] umount/21720 is trying to acquire lock:
[  983.313243]  (&bdev->bd_mutex){+.+.+.}, at: [] blkdev_put+0x31/0x150
[  983.321264]
[  983.321264] but task is already holding lock:
[  983.327101]  (&fs_devs->device_list_mutex){+.+...}, at: [] __btrfs_close_devices+0x46/0x200 [btrfs]
[  983.337839]
[  983.337839] which lock already depends on the new lock.
[  983.337839]
[  983.346024]
[  983.346024] the existing dependency chain (in reverse order) is:
[  983.353512]
-> #4 (&fs_devs->device_list_mutex){+.+...}:
[  983.359096][] lock_acquire+0x1bc/0x1f0
[  983.365143][] mutex_lock_nested+0x65/0x350
[  983.371521][] btrfs_show_devname+0x36/0x1f0 [btrfs]
[  983.378710][] show_vfsmnt+0x4e/0x150
[  983.384593][] m_show+0x17/0x20
[  983.389957][] seq_read+0x2b5/0x3b0
[  983.395669][] __vfs_read+0x28/0x100
[  983.401464][] vfs_read+0xab/0x150
[  983.407080][] SyS_read+0x52/0xb0
[  983.412609][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.419617]
-> #3 (namespace_sem){++}:
[  983.424024][] lock_acquire+0x1bc/0x1f0
[  983.430074][] down_write+0x49/0x80
[  983.435785][] lock_mount+0x67/0x1c0
[  983.441582][] do_add_mount+0x32/0xf0
[  983.447458][] finish_automount+0x5a/0xc0
[  983.453682][] follow_managed+0x1b3/0x2a0
[  983.459912][] lookup_fast+0x300/0x350
[  983.465875][] path_openat+0x3a7/0xaa0
[  983.471846][] do_filp_open+0x85/0xe0
[  983.477731][] do_sys_open+0x14c/0x1f0
[  983.483702][] SyS_open+0x1e/0x20
[  983.489240][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.496254]
-> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}:
[  983.501798][] lock_acquire+0x1bc/0x1f0
[  983.507855][] down_write+0x49/0x80
[  983.513558][] start_creating+0x87/0x100
[  983.519703][] debugfs_create_dir+0x17/0x100
[  983.526195][] bdi_register+0x93/0x210
[  983.532165][] bdi_register_owner+0x43/0x70
[  983.538570][] device_add_disk+0x1fb/0x450
[  983.544888][] loop_add+0x1e6/0x290
[  983.550596][] loop_init+0x10b/0x14f
[  983.556394][] do_one_initcall+0xa7/0x180
[  983.562618][] kernel_init_freeable+0x1cc/0x266
[  983.569370][] kernel_init+0xe/0x100
[  983.575166][] ret_from_fork+0x1f/0x40
[  983.581131]
-> #1 (loop_index_mutex){+.+.+.}:
[  983.585801][] lock_acquire+0x1bc/0x1f0
[  983.591858][] mutex_lock_nested+0x65/0x350
[  983.598256][] lo_open+0x1f/0x60
[  983.603704][] __blkdev_get+0x123/0x400
[  983.609757][] blkdev_get+0x34a/0x350
[  983.615639][] blkdev_open+0x64/0x80
[  983.621428][] do_dentry_open+0x1c6/0x2d0
[  983.627651][] vfs_open+0x69/0x80
[  983.633181][] path_openat+0x834/0xaa0
[  983.639152][] do_filp_open+0x85/0xe0
[  983.645035][] do_sys_open+0x14c/0x1f0
[  983.650999][] SyS_open+0x1e/0x20
[  983.656535][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.663541]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[  983.668107][] __lock_acquire+0x1003/0x17b0
[  983.674510][] lock_acquire+0x1bc/0x1f0
[  983.680561][] mutex_lock_nested+0x65/0x350
[  983.686967][] blkdev_put+0x31/0x150
[  983.692761][] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.699699][] __btrfs_close_devices+0xcb/0x200 [btrfs]
[  983.707178][] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  983.714380][] close_ctree+0x265/0x340 [btrfs]
[  983.721061][] btrfs_put_super+0x19/0x20 [btrfs]
[  983.727908][] generic_shutdown_super+0x6f/0x100
[  983.734744][] kill_anon_super+0x16/0x30
[  983.740888][] btrfs_kill_super+0x1e/0x130 [btrfs]
[  983.747909][] deactivate_locked_super+0x49/0x80
[  983.754745][] deactivate_super+0x5d/0x70
[  983.760977][] cleanup_mnt+0x5c/0x80
[  983.766773][] __cleanup_mnt+0x12/0x20
[  983.772738][] task_work_run+0x7e/0xc0
[  983.778708][] exit_to_usermode_loop+0x7e/0xb4
[  983.785373][] syscall_return_slowpath+0xbb/0xd0
[  983.792212][] entry_SYSCALL_64_fastpath+0xbf/0xc1
[  983.799225]
[  983.799225] other info that might help us debug this:
[  983.799225]
[  983.807291] Chain exists of:
  &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex

[  983.816521]  Possible unsafe locking scenario:
[  983.816521]
[  983.822489]CPU0CPU1
[  983.827043]

[PATCH v3] btrfs: should block unused block groups deletion work when allocating data space

2016-09-09 Thread Wang Xiaoguang
cleaner_kthread() may run at any time and call btrfs_delete_unused_bgs()
to delete unused block groups. Because this work is asynchronous, it can
result in a false ENOSPC error. See the race window below:

               CPU1                           |             CPU2
                                              |
|-> btrfs_alloc_data_chunk_ondemand()         |-> cleaner_kthread()
    |-> do_chunk_alloc()                      |   |
    |   assume it returns ENOSPC, which means |   |
    |   btrfs_space_info is full and has no   |   |
    |   free space to satisfy the data        |   |
    |   request.                              |   |
    |                                         |   |-> btrfs_delete_unused_bgs()
    |                                         |   |   it will decrease btrfs_space_info
    |                                         |   |   total_bytes and make
    |                                         |   |   btrfs_space_info not full.
    |                                         |   |

In this case, we may get an ENOSPC error even though btrfs_space_info is not
full.

To fix this issue, in btrfs_alloc_data_chunk_ondemand(), if we need to call
do_chunk_alloc() to allocate a new chunk, we should block
btrfs_delete_unused_bgs(). Here we introduce a new struct rw_semaphore,
bg_delete_sem, to do this job.

There is already a "struct mutex delete_unused_bgs_mutex", but because it is a
mutex we cannot use it for this purpose. Of course, we could redefine it as a
struct rw_semaphore and then use it in btrfs_alloc_data_chunk_ondemand().
Either method will work.

But given that delete_unused_bgs_mutex's name is longer than bg_delete_sem,
I choose the first method: create a new struct rw_semaphore bg_delete_sem and
delete delete_unused_bgs_mutex :)

Reported-by: Stefan Priebe 
Signed-off-by: Wang Xiaoguang 
---
v2: fix a deadlock revealed by fstests case btrfs/071; we call
    start_transaction() before down_write(bg_delete_sem) in
    btrfs_delete_unused_bgs().

v3: Stefan Priebe reported another similar deadlock, so here we choose
    not to call down_read(bg_delete_sem) for the free space inode in
    btrfs_alloc_data_chunk_ondemand(). Because we only do the data space
    reservation for the free space cache in transaction context,
    btrfs_delete_unused_bgs() will either have finished its job or will
    start a new transaction and wait for the current transaction to
    complete; there will be no unused block groups to delete, so it is
    safe not to call down_read(bg_delete_sem).
---
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/disk-io.c | 13 +--
 fs/btrfs/extent-tree.c | 59 --
 fs/btrfs/volumes.c | 42 +--
 4 files changed, 76 insertions(+), 40 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eff3993..fa78ef9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -788,6 +788,7 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+	struct rw_semaphore bg_delete_sem;
 
 	/*
 	 * this is taken to make sure we don't set block groups ro after
@@ -1068,7 +1069,6 @@ struct btrfs_fs_info {
 	spinlock_t unused_bgs_lock;
 	struct list_head unused_bgs;
 	struct mutex unused_bg_unpin_mutex;
-	struct mutex delete_unused_bgs_mutex;
 
 	/* For btrfs to record security options */
 	struct security_mnt_opts security_opts;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54bc8c7..3cdbd05 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1868,12 +1868,11 @@ static int cleaner_kthread(void *arg)
 		btrfs_run_defrag_inodes(root->fs_info);
 
 		/*
-		 * Acquires fs_info->delete_unused_bgs_mutex to avoid racing
-		 * with relocation (btrfs_relocate_chunk) and relocation
-		 * acquires fs_info->cleaner_mutex (btrfs_relocate_block_group)
-		 * after acquiring fs_info->delete_unused_bgs_mutex. So we
-		 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
-		 * unused block groups.
+		 * Acquires fs_info->bg_delete_sem to avoid racing with
+		 * relocation (btrfs_relocate_chunk) and relocation acquires
+		 * fs_info->cleaner_mutex (btrfs_relocate_block_group) after
+		 * acquiring fs_info->bg_delete_sem. So we can't hold, nor need
+		 * to, fs_info->cleaner_mutex when deleting unused block groups.
 		 */
 		btrfs_delete_unused_bgs(root->fs_info);
 sleep:
@@ -2634,7 +2633,6 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->unused_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->unused_bg_unpin_mutex);
-	mutex_init(&fs_info->delete_unused_bgs_mutex);

Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

2016-09-09 Thread Jan Kara
On Mon 22-08-16 13:35:01, Josef Bacik wrote:
> Provide a mechanism for file systems to indicate how much dirty metadata they
> are holding.  This introduces a few things
> 
> 1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
> 2) WB stat for dirty metadata.  This way we know if we need to try and call into
> the file system to write out metadata.  This could potentially be used in the
> future to make balancing of dirty pages smarter.

So I'm curious about one thing: In the previous posting you have mentioned
that the main motivation for this work is to have a simple support for
sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
do the dirty accounting at page granularity. What are your plans to handle
this mismatch?

The thing is, you shouldn't miscount by too much, as that could upset some
checks in mm that look at how many dirty pages a node has to decide how
reclaim should be done... But it's a question whether NR_METADATA_DIRTY
should actually be used in the checks in node_limits_ok() or in
node_pagecache_reclaimable() at all, because once you start accounting dirty
slab objects, you are really on thin ice...
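
To put a number on the possible miscount, here is a small arithmetic sketch.
It assumes, purely for illustration, that each dirty sub-page metadata block
is charged as one full page; the block and page sizes in the usage note are
example values, and both helper names are made up for this example.

```c
#include <stddef.h>

/*
 * Pages charged if every dirty metadata block is accounted as one
 * full page (the assumed worst case for page-granular accounting).
 */
size_t pages_charged_per_block(size_t nr_dirty_blocks)
{
	return nr_dirty_blocks;
}

/* Pages' worth of memory the dirty blocks really occupy, rounded up. */
size_t pages_actually_dirty(size_t nr_dirty_blocks, size_t block_size,
			    size_t page_size)
{
	size_t bytes = nr_dirty_blocks * block_size;

	return (bytes + page_size - 1) / page_size;
}
```

With 16 dirty 4 KiB blocks on a 64 KiB-page machine, the first helper reports
16 pages while the second reports 1, so the dirty counters would be off by a
factor of 16 under that assumption.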

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 56c8fda..d329f89 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1809,6 +1809,7 @@ static unsigned long get_nr_dirty_pages(void)
>  {
>   return global_node_page_state(NR_FILE_DIRTY) +
>   global_node_page_state(NR_UNSTABLE_NFS) +
> + global_node_page_state(NR_METADATA_DIRTY) +
>   get_nr_dirty_inodes();

Connected with my question is this: once we have NR_METADATA_DIRTY,
we could just account dirty inodes there and get rid of this
get_nr_dirty_inodes() hack...

But actually, getting this to work right so that we can track dirty inodes
would be useful on its own: some throttling of the creation of dirty inodes
would be useful for several filesystems (ext4, xfs, ...).

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 121a6e3..6a52723 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -506,6 +506,7 @@ bool node_dirty_ok(struct pglist_data *pgdat)
>   nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
>   nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
>   nr_pages += node_page_state(pgdat, NR_WRITEBACK);
> + nr_pages += node_page_state(pgdat, NR_METADATA_DIRTY);
>  
>   return nr_pages <= limit;
>  }
> @@ -1595,7 +1596,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
>  		 * been flushed to permanent storage.
>  		 */
>  		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
> -					global_node_page_state(NR_UNSTABLE_NFS);
> +				global_node_page_state(NR_UNSTABLE_NFS) +
> +				global_node_page_state(NR_METADATA_DIRTY);
>  		gdtc->avail = global_dirtyable_memory();
>  		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
>  
> @@ -1935,7 +1937,8 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>*/
>   gdtc->avail = global_dirtyable_memory();
>   gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
> -   global_node_page_state(NR_UNSTABLE_NFS);
> +   global_node_page_state(NR_UNSTABLE_NFS) +
> +   global_node_page_state(NR_METADATA_DIRTY);
>   domain_dirty_limits(gdtc);
>  
>   if (gdtc->dirty > gdtc->bg_thresh)
> @@ -2009,7 +2012,8 @@ void laptop_mode_timer_fn(unsigned long data)
>  {
>   struct request_queue *q = (struct request_queue *)data;
>   int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
> - global_node_page_state(NR_UNSTABLE_NFS);
> + global_node_page_state(NR_UNSTABLE_NFS) +
> + global_node_page_state(NR_METADATA_DIRTY);
>   struct bdi_writeback *wb;
>  
>   /*
> @@ -2473,6 +2477,96 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
>  EXPORT_SYMBOL(account_page_dirtied);
>  
>  /*
> + * account_metadata_dirtied
> + * @page - the page being dirited
> + * @bdi - the bdi that owns this page
> + *
> + * Do the dirty page accounting for metadata pages that aren't backed by an
> + * address_space.
> + */
> +void account_metadata_dirtied(struct page *page, struct backing_dev_info *bdi)
> +{
> + unsigned long flags;
> +

A bdi_cap_account_dirty() check here and in following functions?

> + local_irq_save(flags);
> + __inc_node_page_state(page, NR_METADATA_DIRTY);
> + __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> + __inc_node_page_state(page, NR_DIRTIED);
> + __inc_wb_stat(&bdi->wb, WB_RECLAIMABLE);
> + __inc_wb_stat(&bdi->wb, WB_DIRTIED);
> + __inc_wb_stat(&bdi->wb, WB_METADATA_DIRTY);
> + current->nr_dirtied++;
> + task_io_account_write(PAGE_SIZE);
> + this_cpu_inc(bdp_ratelimits);
> + 

[PATCH v3 3/3] ioctl_getfsmap.2: document the GETFSMAP ioctl

2016-09-09 Thread Darrick J. Wong
Document the new GETFSMAP ioctl that returns the physical layout of a
(disk-based) filesystem.

Signed-off-by: Darrick J. Wong 
---
 man2/ioctl_getfsmap.2 |  313 +
 1 file changed, 313 insertions(+)
 create mode 100644 man2/ioctl_getfsmap.2

diff --git a/man2/ioctl_getfsmap.2 b/man2/ioctl_getfsmap.2
new file mode 100644
index 000..fac3ff4
--- /dev/null
+++ b/man2/ioctl_getfsmap.2
@@ -0,0 +1,313 @@
+.\" Copyright (c) 2016, Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" The GNU General Public License's references to "object code"
+.\" and "executables" are to be interpreted as the output of any
+.\" document formatting or typesetting system, including
+.\" intermediate and printed output.
+.\"
+.\" This manual is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" .
+.\" %%%LICENSE_END
+.TH IOCTL-GETFSMAP 2 2016-09-08 "Linux" "Linux Programmer's Manual"
+.SH NAME
+ioctl_getfsmap \- retrieve the physical layout of the filesystem
+.SH SYNOPSIS
+.br
+.B #include 
+.br
+.B #include 
+.sp
+.BI "int ioctl(int " fd ", GETFSMAP, struct fsmap_head * " arg );
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+retrieves physical extent mappings for a filesystem.
+This information can be used to discover which files are mapped to a physical
+block, examine free space, or find known bad blocks, among other things.
+
+The sole argument to this ioctl should be a pointer to a single
+.BR "struct fsmap_head" ":"
+.in +4n
+.nf
+
+struct fsmap {
+	__u32	fmr_device;	/* device id */
+	__u32	fmr_flags;	/* mapping flags */
+	__u64	fmr_physical;	/* device offset of segment */
+	__u64	fmr_owner;	/* owner id */
+	__u64	fmr_offset;	/* file offset of segment */
+	__u64	fmr_length;	/* length of segment */
+	__u64	fmr_reserved[3];	/* must be zero */
+};
+
+struct fsmap_head {
+	__u32	fmh_iflags;	/* control flags */
+	__u32	fmh_oflags;	/* output flags */
+	__u32	fmh_count;	/* # of entries in array incl. input */
+	__u32	fmh_entries;	/* # of entries filled in (output). */
+	__u64	fmh_reserved[6];	/* must be zero */
+
+	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
+	struct fsmap	fmh_recs[];	/* returned records */
+};
+
+.fi
+.in
+The two
+.I fmh_keys
+array elements specify the lowest and highest reverse-mapping
+keys, respectively, for which userspace would like physical mapping
+information.
+A reverse mapping key consists of the tuple (device, block, owner, offset).
+The owner and offset fields are part of the key because some filesystems
+support sharing physical blocks between multiple files and
+therefore may return multiple mappings for a given physical block.
+.PP
+Filesystem mappings are copied into the
+.I fmh_recs
+array, which immediately follows the header data.
+.SS Fields of struct fsmap_head
+.PP
+The
+.I fmh_iflags
+field is a bitmask passed to the kernel to alter the output.
+There are no flags defined, so this value must be zero.
+
+.PP
+The
+.I fmh_oflags
+field is a bitmask of flags that concern all output mappings.
+If
+.B FMH_OF_DEV_T
+is set, then the
+.I fmr_device
+field represents a
+.B dev_t
+structure containing the major and minor numbers of the block device.
+
+.PP
+The
+.I fmh_count
+field contains the number of elements in the array being passed to the
+kernel.
+If this value is 0,
+.I fmh_entries
+will be set to the number of records that would have been returned had
+the array been large enough;
+no mapping information will be returned.
+
+.PP
+The
+.I fmh_entries
+field contains the number of elements in the
+.I fmh_recs
+array that contain useful information.
+
+.PP
+The
+.I fmh_reserved
+fields must be set to zero.
+
+.SS Keys
+.PP
+The two key records in
+.B fsmap_head.fmh_keys
+specify the lowest and highest extent records in the keyspace that the caller
+wants returned.
+A filesystem that can share blocks between files likely requires the tuple
+.RI "(" "device" ", " "physical" ", " "owner" ", " "offset" ", " "flags" ")"
+to uniquely index any filesystem mapping record.
+Classic non-sharing filesystems might be able to 

Re: [PATCH 3/3] ioctl_xfs_ioc_getfsmap.2: document XFS_IOC_GETFSMAP ioctl

2016-09-09 Thread Darrick J. Wong
On Fri, Sep 09, 2016 at 09:38:06AM +1000, Dave Chinner wrote:
> On Tue, Aug 30, 2016 at 12:09:49PM -0700, Darrick J. Wong wrote:
> > > I recall for FIEMAP that some filesystems may not have files aligned
> > > to sector offsets, and we just used byte offsets.  Storage like
> > > NVDIMMs are cacheline granular, so I don't think it makes sense to
> > > tie this to old disk sector sizes.  Alternately, the units could be
> > > in terms of fs blocks as returned by statvfs.st_bsize, but mixing
> > > > units for fmv_block, fmv_offset, fmv_length is unneeded complexity.
> > 
> > Ugh.  I'd rather just change the units to bytes rather than force all
> > the users to multiply things. :)
> 
> Yup, units need to be either in disk addresses (i.e. 512 byte units)
> or bytes. If people can't handle disk addresses (seems to be the
> case), then bytes it should be.



> > I'd much rather just add more special owner codes for any other
> > filesystem that has distinguishable metadata types that are not
> > covered by the existing OWN_ codes.  We /do/ have 2^64 possible
> > values, so it's not like we're going to run out.
> 
> This is diagnostic information as much as anything, just like
> fiemap is diagnostic information. So if we have specific type
> information, it needs to be reported accurately to be useful.
> 
> Hence I really don't care if the users and developers of other fs
> types don't understand what the special owner codes that a specific
> filesystem returns mean. i.e. it's not useful user information -
> only a tool that groks the specific filesystem is going to be able
> to do anything useful with special owner codes. So, IMO, there's little
> point trying to make them generic or even trying to define and
> explain them in the man page

I'm ok with describing generally what each special owner code
means.  Maybe the manpage could be more explicit about "None of these
codes are useful unless you're a low level filesystem tool"?

> > > It seems like there are several fields in the structure that are used for
> > > only input or only output?  Does it make more sense to have one structure
> > > used only for the input request, and then the array of values returned be
> > > in a different structure?  I'm not necessarily requesting that it be 
> > > changed,
> > > but it definitely is something I noticed a few times while reading this 
> > > doc.
> > 
> > I've been thinking about rearranging this a bit, since the flags
> > handling is very awkward with the current array structure.  Each
> > rmap has its own flags; we may someday want to pass operation flags
> > into the ioctl; and we currently have one operation flag to pass back
> > to userspace.  Each of those flags can be a separate field.  I think
> > people will get confused about FMV_OF_* and FMV_HOF_* being referenced
> > in oflags, and iflags has no meaning for returned records.
> 
> Yup, that's what I initially noticed when I glanced at this. The XFS
> getbmap interface is just plain nasty, and we shouldn't be copying
> that API pattern if we can help it.

Lol ok. :)

> > So, this instead?
> > 
> > struct getfsmap_rec {
> > u32 device; /* device id */
> > u32 flags;  /* mapping flags */
> > u64 block;  /* physical addr, bytes */
> > u64 owner;  /* inode or special owner code */
> > u64 offset; /* file offset of mapping, bytes */
> > u64 length; /* length of segment, bytes */
> > u64 reserved;   /* will be set to zero */
> > }; /* 48 bytes */
> > 
> > struct getfsmap_head {
> > u32 iflags; /* none defined yet */
> > u32 oflags; /* FMV_HOF_DEV_T */
> > u32 count;  /* # entries in recs array */
> > u32 entries;/* # entries filled in (output) */
> > u64 reserved[2];/* must be zero */
> > 
> > 	struct getfsmap_rec keys[2]; /* low and high keys for the mapping search */
> > 	struct getfsmap_rec recs[0];
> > }; /* 32 bytes + 2*48 = 128 bytes */
> > 
> > #define XFS_IOC_GETFSMAP	_IOWR('X', 59, struct getfsmap_head)
> > 
> > This also means that userspace can set up for the next ioctl
> > invocation with memcpy(&head->keys[0], &head->recs[head->entries - 1]).
> > 
> > Yes, I think I like this better.  Everyone else, please chime in. :)
> 
> That's pretty much the structure I was going to suggest - it matches
> the fiemap pattern. i.e control parameters are separated from record
> data. I'd dump a bit more reserved space in the structure, though;
> we've got heaps of flag space for future expansion, but if we need
> to pass new parameters into/out of the kernel we'll quickly use the
> reserved space.

I padded struct fsmap with enough reserved space to make it an even 64 bytes,
and padded struct fsmap_head so that the space before keys is 64 bytes in
length.  See v3 patch of the ioctl manpage.
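
For anyone who wants to check the padding arithmetic, here is a small
userspace sketch. It mirrors the struct layout from the v3 manpage using
<stdint.h> types instead of the kernel __u32/__u64 types, and the helper
fsmap_advance() is a hypothetical name for the "copy the last record into the
low key" step discussed above; it is not part of any proposed API, and whether
the kernel treats the low key as inclusive is not settled by this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the v3 manpage layout (stdint types are an assumption). */
struct fsmap {
	uint32_t fmr_device;		/* device id */
	uint32_t fmr_flags;		/* mapping flags */
	uint64_t fmr_physical;		/* device offset of segment */
	uint64_t fmr_owner;		/* owner id */
	uint64_t fmr_offset;		/* file offset of segment */
	uint64_t fmr_length;		/* length of segment */
	uint64_t fmr_reserved[3];	/* must be zero */
};

struct fsmap_head {
	uint32_t fmh_iflags;		/* control flags */
	uint32_t fmh_oflags;		/* output flags */
	uint32_t fmh_count;		/* # of entries in array */
	uint32_t fmh_entries;		/* # of entries filled in (output) */
	uint64_t fmh_reserved[6];	/* must be zero */
	struct fsmap fmh_keys[2];	/* low and high keys */
	struct fsmap fmh_recs[];	/* returned records */
};

/* 2*4 + 4*8 + 3*8 = 64 bytes per record; 4*4 + 6*8 = 64 bytes before keys. */
_Static_assert(sizeof(struct fsmap) == 64, "record is 64 bytes");
_Static_assert(offsetof(struct fsmap_head, fmh_keys) == 64,
	       "keys start 64 bytes into the header");

/*
 * Hypothetical helper: seed the low key for the next call from the last
 * record returned. Returns 0 if no records were returned, 1 otherwise.
 */
int fsmap_advance(struct fsmap_head *head)
{
	if (head->fmh_entries == 0)
		return 0;
	memcpy(&head->fmh_keys[0], &head->fmh_recs[head->fmh_entries - 1],
	       sizeof(struct fsmap));
	return 1;
}
```

Since keys and records share one type, the copy is a plain fixed-size memcpy
with no field-by-field translation.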

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com