[PATCH] btrfs-progs: allow device deletion using 'missing' keyword again

2015-11-06 Thread Alexander Fougner
The device deletion procedure ensures the device is a block device.
This patch introduces 'missing' as keyword again, correctly
passing it on to the kernel instead of complaining about
'missing' not being a block device.

Signed-off-by: Alexander Fougner 
---
 cmds-device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-device.c b/cmds-device.c
index c2f3a40..dca30d7 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -161,7 +161,7 @@ static int _cmd_device_remove(int argc, char **argv,
struct  btrfs_ioctl_vol_args arg;
int res;
 
-   if (is_block_device(argv[i]) != 1) {
+   if (is_block_device(argv[i]) != 1 && strcmp(argv[i], "missing")) {
fprintf(stderr,
"ERROR: %s is not a block device\n", argv[i]);
ret++;
-- 
2.6.2
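
The check relaxed by this patch can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the btrfs-progs code: `path_is_block_device()` and `is_valid_remove_target()` are hypothetical names, and the stat()-based test merely approximates the `is_block_device()` helper called in cmds-device.c. The idea is that the literal keyword 'missing' is accepted alongside real block devices, so it can be passed on to the kernel instead of being rejected.

```c
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stand-in for is_block_device(): returns 1 if path names
 * a block device, 0 otherwise (including when stat() fails). */
static int path_is_block_device(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;
    return S_ISBLK(st.st_mode) ? 1 : 0;
}

/* Hypothetical helper mirroring the patched condition: an argument is
 * acceptable if it is the keyword "missing" or a block device. */
int is_valid_remove_target(const char *arg)
{
    return strcmp(arg, "missing") == 0 || path_is_block_device(arg);
}
```

With this logic, an argument of "missing" passes the check, while a directory or a nonexistent path is still reported as not being a block device.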

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-06 Thread Patrik Lundquist
On 6 November 2015 at 10:03, Janos Toth F.  wrote:
>
> However, I did update the firmware of the drives. (I found an IMPORTANT
> update when I went there to download SeaTools, although there was no
> change log to tell me why this was important.) This might have changed
> the error handling behavior of the drive...?

I've had Seagate drives not reporting errors until I updated the
firmware. They tended to time out instead. Got a shitload of SMART
errors after I updated, but they still didn't handle errors very well
(became unresponsive).


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-06 Thread Janos Toth F.
I created a fresh RAID-5 mode Btrfs on the same 3 disks (including the
faulty one which is still producing numerous random read errors) and
Btrfs now seems to work exactly as I would anticipate.

I copied some data and verified the checksum. The data is readable and
correct regardless of the constant warning messages in the kernel log
about the read errors on the single faulty HDD (the bad behavior is
confirmed by the SMART logs and I tested it in a different PC as
well...).

I also ran several scrubs and now it always finishes with X corrected
and 0 uncorrected errors. (The errors are supposedly corrected but the
faulty HDD keeps randomly corrupting the data...)
Last time, I saw uncorrected errors during the scrub and not all of the
data was readable. Rather strange...

I ran 24 hours of the GIMPS/Prime95 Blend stress test without errors on the
problematic machine.
However, I did update the firmware of the drives. (I found an IMPORTANT
update when I went there to download SeaTools, although there was no
change log to tell me why this was important.) This might have changed
the error handling behavior of the drive...?


Re: Btrfs progs pre-release 4.3-rc1

2015-11-06 Thread David Sterba
On Tue, Nov 03, 2015 at 12:10:14AM +, Duncan wrote:
> David Sterba posted on Mon, 02 Nov 2015 16:14:53 +0100 as excerpted:
> 
> > the kernel 4.3 was released yesterday, the btrfs-progs will follow at
> > the end of this week. I've tagged an rc1 from current devel branch.
> > There are lots of small invisible changes and one change in the
> > defaults:
> > 
> > * mkfs: mixed mode is not forced anymore for devices smaller than 1 GiB
> 
> It says one change in the /defaults/, but then it says mixed mode isn't 
> /forced/ anymore under a GiB.

Well, it may be a loose definition of 'default'. I meant a change in the
current behaviour without further tuning.

> Which is it, a change in the /defaults/, under a gig now defaults to 
> separate data/metadata, or same /defaults/, but now there's a way to 
> overrule them and do separate data/metadata under a gig, so while mixed 
> remains the default, it's no longer /forced/?
> 
> If the /defaults/ changed, is mixed mode still /recommended/ for small 
> filesystems?

Yes it is, where small remains < 1 GiB.


btrfs locking in linux 4.2.5

2015-11-06 Thread Sree Harsha Totakura
Hi,

I believe I ran into some btrfs related locking issues with linux-4.2.5.
It happened when I updated my Arch Linux OS, restarted the system, and
tried to copy (not move) 20GB+ of data within the btrfs /root partition.

The system's load shot up with load averages peaking up to 20.00 at some
point (the system has 8 cores).  Programs were stuck while loading.

After about 20 minutes, the system returned to normal operation and sane
load averages.

The system log had the following warnings hinting that this could be due
to btrfs locking:

> Nov 05 14:26:42 fulcrum kernel: INFO: task systemd-journal:330 blocked for 
> more than 120 seconds.
> Nov 05 14:26:42 fulcrum kernel: Tainted: P IO 4.2.5-1-ARCH #1
> Nov 05 14:26:42 fulcrum kernel: "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 05 14:26:42 fulcrum kernel: systemd-journal D 0005 0 330 1 
> 0x0004
> Nov 05 14:26:42 fulcrum kernel: 8806027a7d28 0082 
> 88060125b200 880602a2b200
> Nov 05 14:26:42 fulcrum kernel:  8806027a8000 
> 8806017689f0 8806017689f0
> Nov 05 14:26:42 fulcrum kernel: 880487f1b170 0001 
> 8806027a7d48 8157283e
> Nov 05 14:26:42 fulcrum kernel: Call Trace:
> Nov 05 14:26:42 fulcrum kernel: [] schedule+0x3e/0x90
> Nov 05 14:26:42 fulcrum kernel: [] 
> wait_current_trans.isra.9+0xca/0x110 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] ? 
> wake_atomic_t_function+0x60/0x60
> Nov 05 14:26:42 fulcrum kernel: [] 
> start_transaction+0x420/0x580 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] 
> btrfs_start_transaction+0x1b/0x20 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] 
> btrfs_sync_file+0x204/0x380 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] vfs_fsync_range+0x4b/0xb0
> Nov 05 14:26:42 fulcrum kernel: [] ? 
> SyS_timerfd_settime+0x53/0xa0
> Nov 05 14:26:42 fulcrum kernel: [] do_fsync+0x3d/0x70
> Nov 05 14:26:42 fulcrum kernel: [] SyS_fsync+0x10/0x20
> Nov 05 14:26:42 fulcrum kernel: [] 
> entry_SYSCALL_64_fastpath+0x12/0x71
> Nov 05 14:26:42 fulcrum kernel: INFO: task postgres:894 blocked for more than 
> 120 seconds.
> Nov 05 14:26:42 fulcrum kernel: Tainted: P IO 4.2.5-1-ARCH #1
> Nov 05 14:26:42 fulcrum kernel: "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 05 14:26:42 fulcrum kernel: postgres D 88061fc95200 0 894 865 
> 0x
> Nov 05 14:26:42 fulcrum kernel: 8805f435fb38 0086 
> 880603ed1900 880603513200
> Nov 05 14:26:42 fulcrum kernel: 8805f435fbb8 8805f436 
> 8806017689f0 8806017689f0
> Nov 05 14:26:42 fulcrum kernel: 88059e5acc38 0001 
> 8805f435fb58 8157283e
> Nov 05 14:26:42 fulcrum kernel: Call Trace:
> Nov 05 14:26:42 fulcrum kernel: [] schedule+0x3e/0x90
> Nov 05 14:26:42 fulcrum kernel: [] 
> wait_current_trans.isra.9+0xca/0x110 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] ? 
> wake_atomic_t_function+0x60/0x60
> Nov 05 14:26:42 fulcrum kernel: [] 
> start_transaction+0x420/0x580 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] 
> btrfs_start_transaction+0x1b/0x20 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] btrfs_create+0x4a/0x210 
> [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] vfs_create+0x9c/0xd0
> Nov 05 14:26:42 fulcrum kernel: [] path_openat+0xfb7/0x1110
> Nov 05 14:26:42 fulcrum kernel: [] do_filp_open+0x8a/0x100
> Nov 05 14:26:42 fulcrum kernel: [] ? __alloc_fd+0x88/0x110
> Nov 05 14:26:42 fulcrum kernel: [] do_sys_open+0x146/0x230
> Nov 05 14:26:42 fulcrum kernel: [] SyS_open+0x1e/0x20
> Nov 05 14:26:42 fulcrum kernel: [] 
> entry_SYSCALL_64_fastpath+0x12/0x71
> Nov 05 14:26:42 fulcrum kernel: INFO: task zsh:1959 blocked for more than 120 
> seconds.
> Nov 05 14:26:42 fulcrum kernel: Tainted: P IO 4.2.5-1-ARCH #1
> Nov 05 14:26:42 fulcrum kernel: "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 05 14:26:42 fulcrum kernel: zsh D 0001 0 1959 1323 0x
> Nov 05 14:26:42 fulcrum kernel: 880556813cd8 0082 
> 8805ec457080 8805eac5b200
> Nov 05 14:26:42 fulcrum kernel: 880556813ce8 880556814000 
> 8805fcde09f0 8805fcde09f0
> Nov 05 14:26:42 fulcrum kernel: 8800d4f05ac8 0001 
> 880556813cf8 8157283e
> Nov 05 14:26:42 fulcrum kernel: Call Trace:
> Nov 05 14:26:42 fulcrum kernel: [] schedule+0x3e/0x90
> Nov 05 14:26:42 fulcrum kernel: [] 
> wait_current_trans.isra.9+0xca/0x110 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] ? 
> wake_atomic_t_function+0x60/0x60
> Nov 05 14:26:42 fulcrum kernel: [] 
> start_transaction+0x420/0x580 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] ? __d_lookup+0xa1/0x160
> Nov 05 14:26:42 fulcrum kernel: [] 
> btrfs_start_transaction+0x1b/0x20 [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] btrfs_symlink+0x80/0x3e0 
> [btrfs]
> Nov 05 14:26:42 fulcrum kernel: [] ? lookup_dcache+0x30/0xb0
> Nov 05 14:26:42 fulcrum kernel: [] vfs_symlink+0x8b/0xc0
> 

Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs

2015-11-06 Thread Jeff Mahoney

On 11/5/15 11:03 PM, Jeff Mahoney wrote:
> On 11/5/15 10:18 PM, Al Viro wrote:
>> On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote:
>> 
>>> So now file_operations callbacks can't assume that
>>> file->f_path.dentry belongs to the same file system that
>>> implements the callback.  More than that, any code that could
>>> ultimately get a dentry that comes from an open file can't
>>> trust that it's from the same file system.
>> 
>> Use file_inode() for inode.
>> 
>>> This crash is due to this issue.  Unlike xfs and ext2/3/4, we
>>> use file->f_path.dentry->d_inode to resolve the inode.  Using
>>> file_inode() is an easy enough fix here, but we run into
>>> trouble later.  We have logic in the btrfs fsync() call path
>>> (check_parent_dirs_for_sync) that walks back up the dentry
>>> chain examining the inode's last transaction and last unlink
>>> transaction to determine whether a full transaction commit is
>>> required.  This obviously doesn't work if we're walking the 
>>> overlayfs path instead.  Regardless of any argument over
>>> whether that's doing the right thing, it's a pretty common
>>> pattern to assume that file->f_path.dentry comes from the same
>>> file system when using a file_operation.  Is it intended that
>>> that assumption is no longer valid?
>> 
>> It's actually rare, and your example is a perfect demonstration
>> of the reasons why it is so rare.  What's to protect
>> btrfs_log_dentry_safe() from racing with rename(2)?  Sure, you do
>> dget_parent().  Which protects you from having one-time parent
>> dentry freed under you.  What it doesn't do is making any
>> promises about its relationship with your file.
> 
> I suppose the irony here is that, AFAIK, that code is to ensure a
> file doesn't get lost between transactions due to rename.
> 
> Isn't the file->f_path.dentry relationship stable otherwise,
> though? The name might change and the parent might change but the
> dentry that the file points to won't.

And, taking it a bit further, it's impossible for a rename to end up
with a file pointing into a different file system.  So this btrfs case
might misbehave, but it would never crash like we're seeing here.

- -Jeff
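
The mismatch discussed in this thread can be pictured with a toy userspace model. Everything below is an assumption made for illustration: the structs only mimic the kernel fields named in the discussion (`f_path.dentry`, `d_inode`, `f_inode`), and `dentry_inode()` and `file_paths_agree()` are hypothetical names. The sketch shows why a callback that resolves the inode through the dentry can land on a different file system than one that uses file_inode().

```c
/* Toy userspace model; these structs only mimic the kernel fields
 * named in the thread and are not the real kernel definitions. */
struct super_block { const char *s_name; };
struct inode       { const struct super_block *i_sb; };
struct dentry      { struct inode *d_inode; };
struct path        { struct dentry *dentry; };
struct file        { struct path f_path; struct inode *f_inode; };

/* Same shape as the kernel helper: use the inode stored in the file. */
static inline struct inode *file_inode(const struct file *f)
{
    return f->f_inode;
}

/* The pattern under discussion: resolve the inode through the dentry. */
static inline struct inode *dentry_inode(const struct file *f)
{
    return f->f_path.dentry->d_inode;
}

/* Hypothetical check: returns 1 when both lookups end up on the same
 * file system, 0 when they disagree. */
int file_paths_agree(const struct file *f)
{
    return file_inode(f)->i_sb == dentry_inode(f)->i_sb;
}
```

In this model, a file opened directly on btrfs has both lookups agree, while a file whose dentry belongs to an overlay and whose inode belongs to btrfs does not, which is the situation the fsync path was not prepared for.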


> I did find a few other places where that assumption happens without
> any questionable traversals.  Sure, all three are in file systems
> unlikely to be used with overlayfs.
> 
> ocfs2_prepare_inode_for_write uses file->f_path.dentry for 
> should_remove_suid (due to needing to do it early since cluster
> locking is unknown in setattr, according to the commit).  Having 
> should_remove_suid operate on an inode would solve that easily.
> 
> fat_ioctl_set_attributes uses it to call fat_setattr, but that only
> uses the inode and could have the inode_operation use a wrapper.
> 
> cifs_new_fileinfo keeps a reference to the dentry but it seems to
> be used mostly to access the inode except for the nasty-looking
> call to build_path_from_dentry in cifs_reopen_file, which I won't
> be touching. That does look like a questionable traversal,
> especially with the "we can't take the rename lock here" comment.
> 
> -Jeff
> 


- -- 
Jeff Mahoney
SUSE Labs


Btrfs progs release 4.3

2015-11-06 Thread David Sterba
Hi,

the btrfs-progs 4.3 have been released.

There's a notable change in the behaviour of mkfs on devices smaller than 1
GiB. The forced --mixed mode is no more. The resulting filesystem will have
separate data and metadata block groups. This may lead to an earlier 'no-space'
condition because part of the space is reserved for metadata use.

* mkfs
  * mixed mode is not forced for filesystems smaller than 1GiB
  * mixed mode broken with mismatching sectorsize and nodesize, fixed
  * print version info earlier
  * print devices sorted by id
  * do not truncate target image with --rootsize

* fi usage:
  * don't print global block reserve
  * print device id
  * minor output tuning
  * other cleanups

* calc-size:
  * div-by-zero fix on an empty filesystem
  * fix crash

* bugfixes:
  * more superblock sanity checks
  * consistently round size of all devices down to sectorsize
  * misc leak fixes
  * convert: don't try to rollback with a half-deleted ext2_saved subvolume

* other:
  * check: add progress indicator
  * scrub: enhanced error message
  * show-super: read superblock from a given offset
  * add README
  * docs: update manual page for mkfs.btrfs, btrfstune, balance, convert and
inspect-internal
  * build: optional build with more warnings (W=...)
  * build: better support for static checkers
  * build: html output of documentation
  * pretty-print: last_snapshot for root_item
  * pretty-print: stripe dev uuid
  * error reporting wrappers, introduced and example use
  * refactor open_file_or_dir
  * other docs and help updates

* testing:
  * test for nodes crossing stripes
  * test for broken 'subvolume sync'
  * basic tests for mkfs, raid option combinations
  * basic tests for fuzzed images (check)
  * command instrumentation (eg valgrind)
  * print commands if requested
  * add README for tests


Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git


Shortlog:

Anand Jain (4):
  btrfs-progs: move is_numerical() helper to utils and rename
  btrfs-progs: device add: cleanup argument handling
  btrfs-progs: fix uninitialized copy of btrfs_fs_devices list
  btrfs-progs: fix missing initialization of list head for dev_list

Chandan Rajendra (2):
  Btrfs-progs: Do not force mixed block group creation unless '-M' option 
is specified
  Btrfs-progs: Prevent creation of filesystem with 'mixed bgs' and having 
differing sectorsize and nodesize.

David Sterba (46):
  btrfs-progs: misc tests: add 009-subvolume-sync-must-wait
  btrfs-progs: tests: print commands on terminal if requested
  btrfs-progs: build: allow to build with various compiler warnings
  btrfs-progs: a bit of makefile documentation
  btrfs-progs: check: update help text
  btrfs-progs: build: make support for static checkers more generic
  btrfs-progs: docs: add html build target
  btrfs-progs: cleanup and comment parse_range
  btrfs-progs: do not modify the string in parse_range
  btrfs-progs: extend parse_range API to accept a relaxed range
  btrfs-progs: add helpers for parsing 32bit ranges
  btrfs-progs: add helpers to print ranges
  btrfs-progs: tests: add mkfs tests
  btrfs-progs: tests: add 001-basic-profiles mkfs tests
  btrfs-progs: tests: add 002-no-force-mixed-on-small-volume
  btrfs-progs: tests: add 010-convert-delete-ext2-subvol
  btrfs-progs: tests: set default test image size to 2G
  btrfs-progs: tests: do not run sudo helper tests if not necessary
  btrfs-progs: tests: add test driver for fuzzed images
  btrfs-progs: tests: 001-simple-unmounted: iterate over fuzzed images and 
run check
  btrfs-progs: tests: add support for command instrumentation
  btrfs-progs: tests: do not log output of run_mayfail to terminal
  btrfs-progs: tests: add 003-mixed-with-wrong-nodesize
  btrfs-progs: mkfs: remove stray message about forced mixed-bg
  btrfs-progs: add an initial README
  btrfs-progs: add initial tests/README
  btrfs-progs: image: fix bogus check after cpu on-line detection
  btrfs-progs: mkfs: print version info first
  btrfs-progs: docs: enhance manual page for mkfs
  btrfs-progs: docs: enhance manual page for btrfstune
  btrfs-progs: docs: enhance manual page for balance
  btrfs-progs: docs: enhance the manual page for convert
  btrfs-progs: docs: enhance manual page for inspect-internal
  Btrfs progs v4.3-rc1
  btrfs-progs: fi usage: do not print global block reserve
  btrfs-progs: fi usage: cleanup, print header in one go
  btrfs-progs: fi usage: print path header in the tabular mode
  btrfs-progs: fi usage: properly count real space infos
  btrfs-progs: fi usage: cleanup, replace header constant
  btrfs-progs: fi usage: cleanup, replace space info starting column 
constant
  btrfs-progs: fi usage: print device id column in the tabular output
  btrfs-progs: string 

Re: Bad fs performance, IO freezes

2015-11-06 Thread cheater00 .
I am getting a sata dock for my laptop next week. Until then, is it
possible to perform an action in btrfs (like rm which seems to trigger
the issue) and make it log what exactly it's doing?


On Thu, Oct 29, 2015 at 9:01 PM, Austin S Hemmelgarn
 wrote:
> On 2015-10-29 11:49, cheater00 . wrote:
>>
>> Hi Austin,
>> seek times are fine, but this literally freezes my computer for a
>> split second. I've had to re-type this email twice because the freezes
>> meant letters I typed would not arrive on the screen.
>> USB disks are so common they should not be having issues.
>
> That's debatable.  USB is commonly used because it's almost impossible to
> find a system that doesn't have it, not because it's reliable.  The original
> intent was for it to be used for stuff like mice and keyboards, so it was
> designed with low-latency and fair scheduling in mind, both of which really
> hurt performance of bulk data storage devices.
>>
>> I have 4.3.0-040300rc7-generic #201510260712 which is just three days old.
>
> That should be perfectly recent enough, although FWIW, the official version
> of 4.3 should be out this Sunday.
>>
>>
>> Please advise. Isn't it better to *not* use a vm to debug this?
>
> That depends.  For something like this, it could go either way.  I just use
> a VM because that's what I always use, because it's nice not crashing your
> system when trying to debug a kernel panic.
>>
>> BTW, if we are talking about slow speed making things worse, I could
>> try downgrading the cable to usb2.
>> Is there a standard virtualbox VM that I could use?
>
> In general, it's pretty easy to set something like Ubuntu up in VirtualBox,
> the install is essentially identical to regular hardware aside from the
> initial setup of the VM itself.  The documentation for VirtualBox is really
> good, if you've never used virtualization before, it's definitely worth
> reading.
>>
>> I'll download Gentoo in the meantime. I have never used it. I'm
>> getting the "minimal installation cd" from 29th september.
>>
>> http://distfiles.gentoo.org/releases/x86/autobuilds/20150929/install-x86-minimal-20150929.iso
>
> I meant by no means that you needed to use Gentoo, I only mentioned it
> because it's what I use (which in turn is because that's what I use on just
> about everything except stuff like the Raspberry Pi or the BeagleBoard).  If
> you just want to debug this and then be done with it, I would actually
> advise against using Gentoo, it takes a lot of effort to get a system up and
> running with it, and it's very involved to maintain compared to Ubuntu.  On
> the other hand though, if you are willing to learn to use it, it's one of
> the most highly customizable Linux distros out there, and can have
> noticeably better performance than more generic distros (FWIW, it's also one
> of the last big distros that doesn't force systemd on its users by
> default).
>


[PATCH] btrfs: test quota disable during quota rescan

2015-11-06 Thread Justin Maggard
This test case tests if we are able to disable quotas on a filesystem
while a quota rescan is running.  Up to now (4.3) this would result
in a kernel NULL pointer dereference.

Fixed by patch (btrfs: qgroup: fix quota disable during rescan).

Signed-off-by: Justin Maggard 
---
 tests/btrfs/115 | 62 +
 tests/btrfs/115.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 65 insertions(+)
 create mode 100755 tests/btrfs/115
 create mode 100644 tests/btrfs/115.out

diff --git a/tests/btrfs/115 b/tests/btrfs/115
new file mode 100755
index 000..0d1cb3a
--- /dev/null
+++ b/tests/btrfs/115
@@ -0,0 +1,62 @@
+#! /bin/bash
+# FS QA Test No. btrfs/115
+#
+# btrfs quota scan/disable sanity test
+# Make sure that disabling quotas during a quota rescan doesn't crash
+#
+#---
+# Copyright (c) 2015 NETGEAR, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount
+
+for i in `seq 0 1 45`; do
+   echo -n > $SCRATCH_MNT/file.$i
+done
+echo 3 > /proc/sys/vm/drop_caches
+$BTRFS_UTIL_PROG quota enable $SCRATCH_MNT
+$BTRFS_UTIL_PROG quota disable $SCRATCH_MNT
+
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/115.out b/tests/btrfs/115.out
new file mode 100644
index 000..d9dd136
--- /dev/null
+++ b/tests/btrfs/115.out
@@ -0,0 +1,2 @@
+QA output created by 115
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 10ab26b..39b9aff 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -117,3 +117,4 @@
 112 auto quick clone
 113 auto quick compress clone
 114 auto qgroup
+115 auto qgroup
-- 
2.6.3



[PATCH] btrfs: qgroup: fix quota disable during rescan

2015-11-06 Thread Justin Maggard
There's a race condition that leads to a NULL pointer dereference if you
disable quotas while a quota rescan is running.  To fix this, we just need
to wait for the quota rescan worker to actually exit before tearing down
the quota structures.

Signed-off-by: Justin Maggard 
---
 fs/btrfs/qgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 75c0249..a7cf504 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -993,9 +993,10 @@ int btrfs_quota_disable(struct btrfs_trans_handle *trans,
mutex_lock(&fs_info->qgroup_ioctl_lock);
if (!fs_info->quota_root)
goto out;
-   spin_lock(&fs_info->qgroup_lock);
fs_info->quota_enabled = 0;
fs_info->pending_quota_state = 0;
+   btrfs_qgroup_wait_for_completion(fs_info);
+   spin_lock(&fs_info->qgroup_lock);
quota_root = fs_info->quota_root;
fs_info->quota_root = NULL;
fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_ON;
-- 
2.6.3
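
The ordering this patch enforces (clear the enabled state, wait for the rescan worker to exit, only then tear down the structures) can be sketched in userspace with POSIX threads. This is a rough analogy under stated assumptions, not the kernel code: `struct quota_state`, `rescan_worker()`, `quota_enable()` and `quota_disable()` are hypothetical names, and the counter array merely stands in for the quota structures. Build with `-pthread` where the toolchain requires it.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical state; counters stands in for the quota structures. */
struct quota_state {
    atomic_bool enabled;     /* cleared to ask the worker to stop */
    pthread_t worker;
    bool worker_started;
    long *counters;
    size_t ncounters;
};

/* Worker that keeps touching the shared structures while enabled. */
static void *rescan_worker(void *arg)
{
    struct quota_state *qs = arg;
    size_t i = 0;

    while (atomic_load(&qs->enabled)) {
        qs->counters[i % qs->ncounters]++;
        i++;
    }
    return NULL;
}

/* Allocate the structures and start the worker; 0 on success. */
int quota_enable(struct quota_state *qs, size_t n)
{
    qs->counters = calloc(n, sizeof(*qs->counters));
    if (!qs->counters)
        return -1;
    qs->ncounters = n;
    atomic_store(&qs->enabled, true);
    if (pthread_create(&qs->worker, NULL, rescan_worker, qs) != 0) {
        atomic_store(&qs->enabled, false);
        free(qs->counters);
        qs->counters = NULL;
        return -1;
    }
    qs->worker_started = true;
    return 0;
}

/* Stop the worker and wait for it BEFORE freeing what it uses. */
void quota_disable(struct quota_state *qs)
{
    atomic_store(&qs->enabled, false);
    if (qs->worker_started) {
        pthread_join(qs->worker, NULL);
        qs->worker_started = false;
    }
    free(qs->counters);
    qs->counters = NULL;
}
```

If the free() came before the join, the worker could still be dereferencing the freed array, which is the userspace analogue of the NULL pointer dereference described above.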



[GIT PULL] Btrfs

2015-11-06 Thread Chris Mason
Hi Linus,

Please pull my for-linus-4.4 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.4

My branch was based on 4.3-rc5, and it ended up with a minor conflict
against the btrfs changes sent in for a later rc.  I put a sample merge
resolution up here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.4-merged

The only non-obvious part of all this is in fs/btrfs/volumes.h.  Both
sides of the merge ended up creating the BTRFS_BALANCE_ARGS_MASK
definition, which was my fault (sorry) for letting a duplicate patch in
the mix.  When you merge, please keep the longer of the two definitions,
since we added to it later in the for-linus-4.4 branch.

I'm happy to redo things merged with 4.3-final if you don't want to
bother with this, sorry for the hassle.

On to the fun part, we have a lot of subvolume quota improvements in here,
along with big piles of cleanups from Dave Sterba and Anand Jain and
others.

Josef pitched in a batch of allocator fixes based on production use here
at FB.  We found that mount -o ssd_spread greatly improved our
performance on hardware raid5/6, but it exposed some CPU bottlenecks in
the allocator.  These patches make a huge difference.

Qu Wenruo (24) commits (+1031/-375):
btrfs: extent-tree: Add new version of btrfs_check_data_free_space and 
btrfs_free_reserved_data_space. (+79/-9)
btrfs: extent-tree: Switch to new check_data_free_space and 
free_reserved_data_space (+27/-19)
btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in 
clear_bit_hook (+22/-12)
btrfs: delayed_ref: Add new function to record reserved space into delayed 
ref (+43/-0)
btrfs: extent-tree: Add new version of btrfs_delalloc_reserve/release_space 
(+61/-0)
btrfs: extent_io: Introduce needed structure for recoding set/clear bits 
(+12/-0)
btrfs: qgroup: Introduce functions to release/free qgroup reserve data 
(+62/-0)
btrfs: extent-tree: Switch to new delalloc space reserve and release 
(+38/-25)
btrfs: delayed_ref: release and free qgroup reserved at proper timing 
(+34/-4)
btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans 
(+32/-21)
btrfs: extent_io: Introduce new function clear_record_extent_bits() 
(+42/-11)
btrfs: qgroup: Fix a rebase bug which will cause qgroup double free (+0/-4)
btrfs: extent_io: Introduce new function set_record_extent_bits (+56/-18)
btrfs: qgroup: Introduce new functions to reserve/free metadata (+48/-0)
btrfs: qgroup: Don't copy extent buffer to do qgroup rescan (+16/-10)
btrfs: qgroup: Add new trace point for qgroup data reserve (+130/-2)
btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function (+52/-0)
btrfs: fallocate: Add support to accurate qgroup reserve (+117/-44)
btrfs: qgroup: Check if qgroup reserved space leaked (+34/-0)
btrfs: qgroup: Cleanup old inaccurate facilities (+60/-156)
btrfs: qgroup: Add handler for NOCOW and inline (+21/-1)
btrfs: qgroup: Use new metadata reservation. (+13/-36)
btrfs: Fix a data space underflow warning (+8/-3)
btrfs: Add handler for invalidate page (+24/-0)

David Sterba (17) commits (+393/-112):
btrfs: introduce ratelimited _in_rcu variants of message printing functions 
(+38/-0)
btrfs: remove waitqueue_active check from btrfs_rm_dev_replace_unblocked 
(+1/-2)
btrfs: add balance filters limits, stripes and usage to supported mask 
(+4/-1)
btrfs: comment the rest of implicit barriers before waitqueue_active 
(+22/-0)
btrfs: introduce ratelimited variants of message printing functions (+21/-0)
btrfs: switch message printers to ratelimited _in_rcu variants (+16/-16)
btrfs: introduce _in_rcu variants of message printing functions (+29/-0)
btrfs: extend balance filter usage to take minimum and maximum (+60/-4)
btrfs: extend balance filter limit to take minimum and maximum (+67/-3)
btrfs: add barrier for waitqueue_active in clear_btree_io_tree (+6/-0)
btrfs: add comments to barriers before waitqueue_active (+16/-2)
btrfs: switch message printers to ratelimited variants (+31/-33)
btrfs: check unsupported filters in balance arguments (+13/-0)
btrfs: switch message printers to _in_rcu variants (+27/-27)
btrfs: remove extra barrier before waitqueue_active (+6/-2)
btrfs: comment waitqueue_active implied by locks (+11/-1)
btrfs: switch more printks to our helpers (+25/-21)

Anand Jain (14) commits (+205/-184):
Btrfs: __btrfs_std_error() logic should be consistent w/out CONFIG_PRINTK 
defined (+5/-22)
Btrfs: use BTRFS_ERROR_DEV_MISSING_NOT_FOUND when missing device is not 
found (+2/-4)
Btrfs: kernel operation should come after user input has been verified 
(+13/-13)
Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks (+27/-13)
Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link (+5/-5)
Btrfs: rename btrfs_sysfs_remove_one to 

Re: Unable to allocate for space usage in particular btrfs volume

2015-11-06 Thread Calvin Walton
On Thu, 2015-11-05 at 10:44 +, OmegaPhil wrote:
> On 05/11/15 04:18, Duncan wrote:
> > OmegaPhil posted on Wed, 04 Nov 2015 21:53:09 + as excerpted:
> > VM image files (and large database files, for the same reason) are a bit
> > of a problem on btrfs, and indeed, any COW-based filesystem, since the
> > random rewrite pattern matching that use-case is pretty much the absolute
> > worst-case match for a COW-based filesystem there is.
> > Since you're not doing snapshotting (which conflicts with this option,
> > with an imperfect workaround), setting nocow on those files may well
> > eliminate the problem, but be aware if you aren't already that (1) nocow
> > does turn off checksumming as well, in order to avoid a race that could
> > easily lead to data corruption, and (2) you can't just activate

> So a couple of gig still unaccountable but irrelevant. Thanks, problem
> solved! Although hopefully checksumming will be allowed on nocow files
> in the future as that's currently 17% of all data unprotected and will
> get worse...

There's actually an interesting workaround to this: Although the VM
disk images aren't checksummed on the host filesystem, you can use
btrfs *inside* the VMs and enable checksumming there. The downside is
that you can only verify the VM data by booting the VM and running a
scrub from inside.

This of course doesn't help if your VMs are Windows or legacy versions
of Linux without btrfs support. On BSD you could try ZFS.

-- 
Calvin Walton 



Re: Unable to allocate for space usage in particular btrfs volume

2015-11-06 Thread Austin S Hemmelgarn

On 2015-11-06 15:15, Calvin Walton wrote:

> On Thu, 2015-11-05 at 10:44 +, OmegaPhil wrote:
>> On 05/11/15 04:18, Duncan wrote:
>>> OmegaPhil posted on Wed, 04 Nov 2015 21:53:09 + as excerpted:
>>> VM image files (and large database files, for the same reason) are
>>> a bit of a problem on btrfs, and indeed, any COW-based filesystem,
>>> since the random rewrite pattern matching that use-case is pretty
>>> much the absolute worst-case match for a COW-based filesystem there
>>> is.
>>> Since you're not doing snapshotting (which conflicts with this
>>> option, with an imperfect workaround), setting nocow on those files
>>> may well eliminate the problem, but be aware if you aren't already
>>> that (1) nocow does turn off checksumming as well, in order to
>>> avoid a race that could easily lead to data corruption, and (2) you
>>> can't just activate
>>
>> So a couple of gig still unaccountable but irrelevant. Thanks,
>> problem solved! Although hopefully checksumming will be allowed on
>> nocow files in the future as that's currently 17% of all data
>> unprotected and will get worse...
>
> There's actually an interesting workaround to this: Although the VM
> disk images aren't checksummed on the host filesystem, you can use
> btrfs *inside* the VMs and enable checksumming there. The downside is
> that you can only verify the VM data by booting the VM and running a
> scrub from inside.

Actually, by using a combination of loop devices and kpartx, it's fully 
possible to mount the FS and verify it without booting the VM.  Of 
course, doing this usually requires root access on the host system, but 
for most people I know, that's usually not an issue.  I do this on 
occasion when I need to pull a file off of one of my VM disks on my 
laptop and don't have the time to spin up the VM itself.


Another option if you're doing a direct boot of the kernel (for example, 
when using a fully paravirtualized domain on Xen, or using some of the 
QEMU ARM systems) is to just do the volume management (partitioning and 
such) on the host, and expose each filesystem to the guest as a separate 
disk.  I do this with most of my Linux VMs on my Xen system where I use
LVM as the back-end storage for the virtual disk images, as it allows me
to easily directly mount the VM's filesystems on the host if need be
(and lets you do all kinds of cool things like using a cluster-aware
filesystem for the VM's root so that you can mount it from the host 
safely while the VM is still online).







[PATCH v8 4/4] vfs: Add vfs_copy_file_range() support for pagecache copies

2015-11-06 Thread Anna Schumaker
This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space.  This is especially useful so
NFSD can support server-side copies.

The default (flags=0) means to first attempt copy acceleration, but use
the pagecache if that fails.
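
As an illustration of the fallback order described above, here is a minimal userspace sketch. It is not the kernel code: `struct fake_file`, `accel_copy_fn`, `refuse_accel`, `generic_copy` and `copy_range` are hypothetical names that model files as memory buffers. Only the -EOPNOTSUPP convention is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct fake_file;

/* Signature of an optional accelerated copy hook (hypothetical). */
typedef ssize_t (*accel_copy_fn)(struct fake_file *in, size_t pos_in,
                                 struct fake_file *out, size_t pos_out,
                                 size_t len);

/* Hypothetical stand-in for an open file: a flat in-memory buffer. */
struct fake_file {
    char *data;
    size_t size;
    accel_copy_fn accel_copy;   /* NULL when no acceleration exists */
};

/* An accelerated hook that declines the request, forcing the fallback. */
static ssize_t refuse_accel(struct fake_file *in, size_t pos_in,
                            struct fake_file *out, size_t pos_out,
                            size_t len)
{
    (void)in; (void)pos_in; (void)out; (void)pos_out; (void)len;
    return -EOPNOTSUPP;
}

/* Generic fallback, playing the role of the pagecache copy. */
static ssize_t generic_copy(struct fake_file *in, size_t pos_in,
                            struct fake_file *out, size_t pos_out,
                            size_t len)
{
    if (pos_in > in->size || pos_out > out->size)
        return -EINVAL;
    /* Clamp to what is available: a short copy is a valid result. */
    if (len > in->size - pos_in)
        len = in->size - pos_in;
    if (len > out->size - pos_out)
        len = out->size - pos_out;
    memmove(out->data + pos_out, in->data + pos_in, len);
    return (ssize_t)len;
}

/* Try the accelerated hook first; use the generic copy if it is absent
 * or reports -EOPNOTSUPP. Any other result is returned unchanged. */
static ssize_t copy_range(struct fake_file *in, size_t pos_in,
                          struct fake_file *out, size_t pos_out,
                          size_t len)
{
    ssize_t ret = -EOPNOTSUPP;

    assert(in != NULL && out != NULL);
    if (out->accel_copy)
        ret = out->accel_copy(in, pos_in, out, pos_out, len);
    if (ret == -EOPNOTSUPP)
        ret = generic_copy(in, pos_in, out, pos_out, len);
    return ret;
}
```

Seeding `ret` with -EOPNOTSUPP lets a missing hook and a hook that declines take the same fallback path, which mirrors the shape of the hunk below.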

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Padraig Brady 
Reviewed-by: Christoph Hellwig 
---
 fs/read_write.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 97c15ca..a093830 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1329,6 +1329,24 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 }
 #endif
 
+static ssize_t vfs_copy_fr_copy(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len)
+{
+   ssize_t ret = rw_verify_area(READ, file_in, &pos_in, len);
+
+   if (ret >= 0) {
+   len = ret;
+   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+   if (ret >= 0)
+   len = ret;
+   }
+   if (ret < 0)
+   return ret;
+
+   return do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+}
+
 /*
  * copy_file_range() differs from regular file read and write in that it
  * specifically allows return partial success.  When it does so is up to
@@ -1345,17 +1363,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags != 0)
return -EINVAL;
 
-   /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-   ret = rw_verify_area(READ, file_in, &pos_in, len);
-   if (ret >= 0)
-   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-   if (ret < 0)
-   return ret;
-
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
-   (file_out->f_flags & O_APPEND) ||
-   !file_out->f_op->copy_file_range)
+   (file_out->f_flags & O_APPEND))
return -EBADF;
 
/* this could be relaxed once a method supports cross-fs copies */
@@ -1369,8 +1379,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;
 
-   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+   ret = -EOPNOTSUPP;
+   if (file_out->f_op->copy_file_range)
+   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+ pos_out, len, flags);
+   if (ret == -EOPNOTSUPP)
+   ret = vfs_copy_fr_copy(file_in, pos_in, file_out, pos_out, len);
+
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
-- 
2.6.2



[PATCH v8 3/4] btrfs: add .copy_file_range file operation

2015-11-06 Thread Anna Schumaker
From: Zach Brown 

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown 
[Anna Schumaker: Make flags an unsigned int,
 Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker 
Reviewed-by: Josef Bacik 
Reviewed-by: David Sterba 
Reviewed-by: Christoph Hellwig 
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..0046567 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct 
inode *inode,
  loff_t pos, size_t write_bytes,
  struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = btrfs_ioctl,
 #endif
+   .copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3e3e613..ad75e48 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-  u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+   u64 off, u64 olen, u64 destoff)
 {
struct inode *inode = file_inode(file);
+   struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct fd src_file;
-   struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
-   int same_inode = 0;
+   int same_inode = src == inode;
 
/*
 * TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file 
*file, unsigned long srcfd,
 *   be either compressed or non-compressed.
 */
 
-   /* the destination must be opened for writing */
-   if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-   return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;
 
-   ret = mnt_want_write_file(file);
-   if (ret)
-   return ret;
-
-   src_file = fdget(srcfd);
-   if (!src_file.file) {
-   ret = -EBADF;
-   goto out_drop_write;
-   }
-
-   ret = -EXDEV;
-   if (src_file.file->f_path.mnt != file->f_path.mnt)
-   goto out_fput;
-
-   src = file_inode(src_file.file);
-
-   ret = -EINVAL;
-   if (src == inode)
-   same_inode = 1;
-
-   /* the src must be open for reading */
-   if (!(src_file.file->f_mode & FMODE_READ))
-   goto out_fput;
+   if (file_src->f_path.mnt != file->f_path.mnt ||
+   src->i_sb != inode->i_sb)
+   return -EXDEV;
 
/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-   goto out_fput;
+   return -EINVAL;
 
-   ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-   goto out_fput;
-
-   ret = -EXDEV;
-   if (src->i_sb != inode->i_sb)
-   goto out_fput;
+   return -EISDIR;
 
if (!same_inode) {
btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
btrfs_double_inode_unlock(src, inode);
else
mutex_unlock(&src->i_mutex);
+   return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+   ssize_t ret;
+
+   ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+   if (ret == 0)
+   ret = len;
+   return ret;
+}
+

[PATCH v8 1/4] vfs: add copy_file_range syscall and vfs helper

2015-11-06 Thread Anna Schumaker
From: Zach Brown 

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown 
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
 Change flags parameter from int to unsigned int,
 Add function to include/linux/syscalls.h,
 Check copy len after file open mode,
 Don't forbid ranges inside the same file,
 Use rw_verify_area() to verify ranges,
 Use file_out rather than file_in,
 Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker 
Reviewed-by: Christoph Hellwig 
---
-v8:
- Remove redundant checks
- Clear up fdget() / fdput() confusion
---
 fs/read_write.c   | 120 ++
 include/linux/fs.h|   3 +
 include/linux/syscalls.h  |   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c   |   1 +
 5 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..97c15ca 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -1327,3 +1328,122 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, 
in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
+{
+   struct inode *inode_in = file_inode(file_in);
+   struct inode *inode_out = file_inode(file_out);
+   ssize_t ret;
+
+   if (flags != 0)
+   return -EINVAL;
+
+   /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+   ret = rw_verify_area(READ, file_in, &pos_in, len);
+   if (ret >= 0)
+   ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+   if (ret < 0)
+   return ret;
+
+   if (!(file_in->f_mode & FMODE_READ) ||
+   !(file_out->f_mode & FMODE_WRITE) ||
+   (file_out->f_flags & O_APPEND) ||
+   !file_out->f_op->copy_file_range)
+   return -EBADF;
+
+   /* this could be relaxed once a method supports cross-fs copies */
+   if (inode_in->i_sb != inode_out->i_sb)
+   return -EXDEV;
+
+   if (len == 0)
+   return 0;
+
+   ret = mnt_want_write_file(file_out);
+   if (ret)
+   return ret;
+
+   ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, 
pos_out,
+ len, flags);
+   if (ret > 0) {
+   fsnotify_access(file_in);
+   add_rchar(current, ret);
+   fsnotify_modify(file_out);
+   add_wchar(current, ret);
+   }
+   inc_syscr(current);
+   inc_syscw(current);
+
+   mnt_drop_write_file(file_out);
+
+   return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+   int, fd_out, loff_t __user *, off_out,
+   size_t, len, unsigned int, flags)
+{
+   loff_t pos_in;
+   loff_t pos_out;
+   struct fd f_in;
+   struct fd f_out;
+   ssize_t ret = -EBADF;
+
+   f_in = fdget(fd_in);
+   if (!f_in.file)
+   goto out2;
+
+   f_out = fdget(fd_out);
+   if (!f_out.file)
+   goto out1;
+
+   ret = -EFAULT;
+   if (off_in) {
+   if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+   goto out;
+   } else {
+   pos_in = f_in.file->f_pos;
+   }
+
+   if (off_out) {
+   if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+   goto out;
+   } else {
+   pos_out = 

[PATCH v8 0/4] VFS: In-kernel copy system call

2015-11-06 Thread Anna Schumaker
Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations.  We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls.

This posting fixes up a few minor issues that came up on the mailing list.  I
looked into the O_APPEND question, and do_splice_direct() specifically
disallows files that are open for appending.  I've decided to keep the
no-O_APPEND requirement for now since I use this function for pagecache copies.
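
Because of this restriction, a userspace caller may want to check the output descriptor before attempting a copy. A minimal sketch using fcntl(2) follows; `fd_is_append` is a hypothetical helper name and is not part of this series.

```c
#include <fcntl.h>

/*
 * Returns 1 if fd was opened with O_APPEND, 0 if it was not,
 * and -1 if fcntl() fails (for example, on an invalid descriptor).
 */
static int fd_is_append(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return (flags & O_APPEND) ? 1 : 0;
}
```

A caller that gets 1 back can fall back to a plain read/write loop instead of waiting for the copy to fail with EBADF.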

Changes in v8:
- Remove redundant checks.
- Make the fdget() / fdput() calls more obvious.
- Document disallowing files open with O_APPEND.

Thanks,
Anna


Anna Schumaker (1):
  vfs: Add vfs_copy_file_range() support for pagecache copies

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h   |   3 +
 fs/btrfs/file.c|   1 +
 fs/btrfs/ioctl.c   |  91 --
 fs/read_write.c| 135 +
 include/linux/fs.h |   3 +
 include/linux/syscalls.h   |   3 +
 include/uapi/asm-generic/unistd.h  |   4 +-
 kernel/sys_ni.c|   1 +
 10 files changed, 203 insertions(+), 40 deletions(-)

-- 
2.6.2



[PATCH v8 2/4] x86: add sys_copy_file_range to syscall tables

2015-11-06 Thread Anna Schumaker
From: Zach Brown 

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown 
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker 
Reviewed-by: Christoph Hellwig 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373i386shutdownsys_shutdown
 374i386userfaultfd sys_userfaultfd
 375i386membarrier  sys_membarrier
+376i386copy_file_range sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 32264  execveatstub_execveat
 323common  userfaultfd sys_userfaultfd
 324common  membarrier  sys_membarrier
+325common  copy_file_range sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.6.2



[PATCH v8 5/4] copy_file_range.2: New page documenting copy_file_range()

2015-11-06 Thread Anna Schumaker
copy_file_range() is a new system call for copying ranges of data
completely in the kernel.  This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).
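
Since the call may copy fewer bytes than requested, userspace callers generally need a loop. The following is a minimal sketch, assuming the installed headers define SYS_copy_file_range and that the raw call is reachable through syscall(2). `cfr`, `copy_rw` and `copy_all` are hypothetical names. Falling back to read()/write() when the call is unavailable is a design choice of this sketch, not something the patch requires.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Raw syscall wrapper; NULL offsets mean "use and advance the file offsets". */
static ssize_t cfr(int fd_in, int fd_out, size_t len)
{
#ifdef SYS_copy_file_range
    return (ssize_t)syscall(SYS_copy_file_range, fd_in, NULL,
                            fd_out, NULL, len, 0U);
#else
    (void)fd_in; (void)fd_out; (void)len;
    errno = ENOSYS;
    return -1;
#endif
}

/* Plain read/write loop, used when the syscall cannot do the job. */
static ssize_t copy_rw(int fd_in, int fd_out, size_t len)
{
    char buf[65536];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t r = read(fd_in, buf, want);

        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break;                      /* end of source */
        for (ssize_t off = 0; off < r; ) {
            ssize_t w = write(fd_out, buf + off, (size_t)(r - off));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        done += (size_t)r;
    }
    return (ssize_t)done;
}

/* Copy up to len bytes, retrying after short copies. Returns the number
 * of bytes copied (possibly fewer than len at end of file) or -1. */
static ssize_t copy_all(int fd_in, int fd_out, size_t len)
{
    size_t done = 0;

    assert(fd_in >= 0 && fd_out >= 0);
    while (done < len) {
        ssize_t n = cfr(fd_in, fd_out, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            /* Nothing copied yet and the call is unusable here. */
            if (done == 0 && (errno == ENOSYS || errno == EXDEV ||
                              errno == EOPNOTSUPP || errno == EINVAL))
                return copy_rw(fd_in, fd_out, len);
            return -1;
        }
        if (n == 0)
            break;                      /* end of source */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The fallback is only taken before the first byte has been copied, so a failure partway through is reported rather than silently retried with a different mechanism.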

Signed-off-by: Anna Schumaker 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Christoph Hellwig 
---
v8:
- Document that files can not be open with O_APPEND.
---
 man2/copy_file_range.2 | 201 +
 man2/splice.2  |   1 +
 2 files changed, 202 insertions(+)
 create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 000..d9f76d1
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,201 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume
+.\" no responsibility for errors or omissions, or for damages resulting
+.\" from the use of the information contained herein.  The author(s) may
+.\" not have taken the same level of care in the production of this
+.\" manual, which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-11-06 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include 
+.B #include 
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI "loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read.  The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument must be set to 0.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing; or
+.I fd_out
+is open for appending.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is not 0.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EXDEV
+.IR file_in " and " file_out
+are not on the same mounted filesystem.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH NOTES
+If
+.I file_in
+is a sparse file, then
+.BR copy_file_range ()
+may expand any holes existing in the requested range.
+Users may benefit from calling
+.BR copy_file_range ()
+in a loop, and using
+.BR lseek (2)
+to find the locations of data segments.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+   loff_t *off_out, size_t len, unsigned int flags)
+{
+return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+   

Re: btrfs_sync_file alignment trap on arm (kernel 4.2.5)

2015-11-06 Thread Cody P Schafer
On Wed, Nov 4, 2015 at 5:55 PM, Cody P Schafer  wrote:
> Ideas as to what could cause this would be appreciated.
>
> This consistently is triggered shortly after boot (I presume due to
> connmand calling fsync on a file).
>
> Note that I'm not quite running 4.2.5, but none of the changes I have
> additionally applied are to btrfs or atomics.
>
> Let me know if there is a way for me to get you more info.
>
> It looks like the line is:
>
> mutex_lock(&inode->i_mutex);
> atomic_inc(&root->log_batch);
> full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> &BTRFS_I(inode)->runtime_flags);
>
>
> addr2line of the trapping instruction:
> addr2line -e
> ./work/beaglebone-poky-linux-gnueabi/linux-yocto-ikabit/4.2.5+gitAUTOINC+c29ac1-r1/linux-beaglebone-standard-build/arch/arm/boot/vmlinux
> c0217a34 -i
> 
> /home/cody/obj/y/tmp/work-shared/beaglebone/kernel-source/arch/arm/include/asm/atomic.h:194
> 
> /home/cody/obj/y/tmp/work-shared/beaglebone/kernel-source/fs/btrfs/file.c:1886
>
> fault log:
>
> [   11.488382] Alignment trap: not handling instruction e1993f9f at 
> []
> [   11.560558] Unhandled fault: alignment exception (0x001) at 0x026d
> [   11.607301] pgd = dc09
> [   11.610166] [026d] *pgd=9c063831, *pte=, *ppte=
> [   11.665548] Internal error: : 1 [#1] PREEMPT ARM
> [   11.670388] Modules linked in: omaplfb(O) bufferclass_ti(O) pvrsrvkm(O)
> [   11.677341] CPU: 0 PID: 248 Comm: connmand Tainted: G   O
>  4.2.5-yocto-standard #1
> [   11.686172] Hardware name: Generic AM33XX (Flattened Device Tree)
> [   11.692551] task: dc00d100 ti: dc068000 task.ti: dc068000
> [   11.698219] PC is at btrfs_sync_file+0x104/0x3f4
> [   11.703051] LR is at btrfs_sync_file+0x100/0x3f4
> [   11.707883] pc : []lr : []psr: 60060013
> [   11.707883] sp : dc069e98  ip : dc069e98  fp : dc069f1c
> [   11.719897] r10:   r9 : 026d  r8 : dd746b40
> [   11.725363] r7 : dcef8dcc  r6 :   r5 : dcef8d68  r4 : 0001
> [   11.732192] r3 : 7fff  r2 :   r1 :   r0 : dcef8dcc
> [   11.739024] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
> user
> [   11.746491] Control: 10c5387d  Table: 9c090019  DAC: 0015
> [   11.752502] Process connmand (pid: 248, stack limit = 0xdc068210)
> [   11.758878] Stack: (0xdc069e98 to 0xdc06a000)
> [   11.763436] 9e80:
>  7fff
> [   11.771998] 9ea0: dc069f38 c014582c b30b  
> 37f6 81a4 d5012f68
> [   11.780559] 9ec0:  8000   
>  008d 
> [   11.789121] 9ee0: 0400  0001 20b04e34 563924e6
> dd746b40 dce2beb8 
> [   11.797683] 9f00:   dc068000  dc069f54
> dc069f20 c0171d00 c0217940
> [   11.806245] 9f20:  7fff  dc069f38 c0145b6c
> dd746b40  dd746b40
> [   11.814807] 9f40: 0076 c000f7a4 dc069f74 dc069f58 c0171d3c
> c0171c40  7fff
> [   11.823369] 9f60:  c015cc9c dc069f94 dc069f78 c0171d7c
> c0171d14  bebe2bf8
> [   11.831931] 9f80: 00e64c4d b6a104d0 dc069fa4 dc069f98 c017204c
> c0171d50  dc069fa8
> [   11.840492] 9fa0: c000f600 c017203c bebe2bf8 00e64c4d 0009
> bebe2bf8 008d 
> [   11.849054] 9fc0: bebe2bf8 00e64c4d b6a104d0 0076 00e64810
> 00e60cd0 0009 
> [   11.857616] 9fe0:  bebe2bd4 b6e63384 b6dcef00 60060010
> 0009 044a1103 099b0303
> [   11.866199] [] (btrfs_sync_file) from []
> (vfs_fsync_range+0xcc/0xd4)
> [   11.874674] [] (vfs_fsync_range) from []
> (vfs_fsync+0x34/0x3c)
> [   11.882600] [] (vfs_fsync) from [] (do_fsync+0x38/0x54)
> [   11.889890] [] (do_fsync) from [] (SyS_fsync+0x1c/0x20)
> [   11.897188] [] (SyS_fsync) from []
> (ret_fast_syscall+0x0/0x3c)
> [   11.905116] Code: e14b05fc e1a7 eb1251f7 e1993f9f (e2833001)
> [   13.418969] ---[ end trace d7bcd93aea7d243c ]---
> [   13.442420] Kernel panic - not syncing: Fatal exception

I looked into this a bit further, and it looks like my issue is that
the btrfs_root* from BTRFS_I(inode)->root is somehow `1` instead of an
actual pointer value, so it appears this isn't quite an alignment
issue.

Kernel starts without issue if I just stop doing the overlay (and as a
result stop using btrfs at all).

I've added some debugging, and the actual file it's trying to fsync
appears to be: `/var/lib/connman/settings.4CAC7X`

That directory is an overlayfs dir with a btrfs upper and a squashfs lower.

Ideas on why the btrfs_root pointer is 0x1?
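
A small arithmetic check is consistent with the bogus-pointer reading. The sketch below assumes the faulting access is to a member of the bogus btrfs_root, so that the fault address equals the pointer value plus the member offset. The offset 0x26c is inferred by subtracting 1 from the 0x26d in the fault log and is not taken from the struct definition. `is_aligned` and `member_addr` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align; align must be a power of two. */
static bool is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Address of a member at byte offset off inside an object at base. */
static uintptr_t member_addr(uintptr_t base, uintptr_t off)
{
    return base + off;
}
```

With a base of 1 and an offset of 0x26c, `member_addr` gives 0x26d, which is not 4-byte aligned, while 0x26c on its own is. A word-sized atomic access at such an address would trap on ARM even though the structure layout itself is properly aligned.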


Re: btrfs_sync_file alignment trap on arm (kernel 4.2.5)

2015-11-06 Thread Filipe Manana
On Fri, Nov 6, 2015 at 10:16 PM, Cody P Schafer  wrote:
> On Wed, Nov 4, 2015 at 5:55 PM, Cody P Schafer  wrote:
>> Ideas as to what could cause this would be appreciated.
>>
>> This consistently is triggered shortly after boot (I presume due to
>> connmand calling fsync on a file).
>>
>> Note that I'm not quite running 4.2.5, but none of the changes I have
>> additionally applied are to btrfs or atomics.
>>
>> Let me know if there is a way for me to get you more info.
>>
>> It looks like the line is:
>>
>> mutex_lock(&inode->i_mutex);
>> atomic_inc(&root->log_batch);
>> full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>> &BTRFS_I(inode)->runtime_flags);
>>
>>
>> addr2line of the trapping instruction:
>> addr2line -e
>> ./work/beaglebone-poky-linux-gnueabi/linux-yocto-ikabit/4.2.5+gitAUTOINC+c29ac1-r1/linux-beaglebone-standard-build/arch/arm/boot/vmlinux
>> c0217a34 -i
>> 
>> /home/cody/obj/y/tmp/work-shared/beaglebone/kernel-source/arch/arm/include/asm/atomic.h:194
>> 
>> /home/cody/obj/y/tmp/work-shared/beaglebone/kernel-source/fs/btrfs/file.c:1886
>>
>> fault log:
>>
>> [   11.488382] Alignment trap: not handling instruction e1993f9f at 
>> []
>> [   11.560558] Unhandled fault: alignment exception (0x001) at 0x026d
>> [   11.607301] pgd = dc09
>> [   11.610166] [026d] *pgd=9c063831, *pte=, *ppte=
>> [   11.665548] Internal error: : 1 [#1] PREEMPT ARM
>> [   11.670388] Modules linked in: omaplfb(O) bufferclass_ti(O) pvrsrvkm(O)
>> [   11.677341] CPU: 0 PID: 248 Comm: connmand Tainted: G   O
>>  4.2.5-yocto-standard #1
>> [   11.686172] Hardware name: Generic AM33XX (Flattened Device Tree)
>> [   11.692551] task: dc00d100 ti: dc068000 task.ti: dc068000
>> [   11.698219] PC is at btrfs_sync_file+0x104/0x3f4
>> [   11.703051] LR is at btrfs_sync_file+0x100/0x3f4
>> [   11.707883] pc : []lr : []psr: 60060013
>> [   11.707883] sp : dc069e98  ip : dc069e98  fp : dc069f1c
>> [   11.719897] r10:   r9 : 026d  r8 : dd746b40
>> [   11.725363] r7 : dcef8dcc  r6 :   r5 : dcef8d68  r4 : 0001
>> [   11.732192] r3 : 7fff  r2 :   r1 :   r0 : dcef8dcc
>> [   11.739024] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
>> user
>> [   11.746491] Control: 10c5387d  Table: 9c090019  DAC: 0015
>> [   11.752502] Process connmand (pid: 248, stack limit = 0xdc068210)
>> [   11.758878] Stack: (0xdc069e98 to 0xdc06a000)
>> [   11.763436] 9e80:
>>  7fff
>> [   11.771998] 9ea0: dc069f38 c014582c b30b  
>> 37f6 81a4 d5012f68
>> [   11.780559] 9ec0:  8000   
>>  008d 
>> [   11.789121] 9ee0: 0400  0001 20b04e34 563924e6
>> dd746b40 dce2beb8 
>> [   11.797683] 9f00:   dc068000  dc069f54
>> dc069f20 c0171d00 c0217940
>> [   11.806245] 9f20:  7fff  dc069f38 c0145b6c
>> dd746b40  dd746b40
>> [   11.814807] 9f40: 0076 c000f7a4 dc069f74 dc069f58 c0171d3c
>> c0171c40  7fff
>> [   11.823369] 9f60:  c015cc9c dc069f94 dc069f78 c0171d7c
>> c0171d14  bebe2bf8
>> [   11.831931] 9f80: 00e64c4d b6a104d0 dc069fa4 dc069f98 c017204c
>> c0171d50  dc069fa8
>> [   11.840492] 9fa0: c000f600 c017203c bebe2bf8 00e64c4d 0009
>> bebe2bf8 008d 
>> [   11.849054] 9fc0: bebe2bf8 00e64c4d b6a104d0 0076 00e64810
>> 00e60cd0 0009 
>> [   11.857616] 9fe0:  bebe2bd4 b6e63384 b6dcef00 60060010
>> 0009 044a1103 099b0303
>> [   11.866199] [] (btrfs_sync_file) from []
>> (vfs_fsync_range+0xcc/0xd4)
>> [   11.874674] [] (vfs_fsync_range) from []
>> (vfs_fsync+0x34/0x3c)
>> [   11.882600] [] (vfs_fsync) from [] 
>> (do_fsync+0x38/0x54)
>> [   11.889890] [] (do_fsync) from [] 
>> (SyS_fsync+0x1c/0x20)
>> [   11.897188] [] (SyS_fsync) from []
>> (ret_fast_syscall+0x0/0x3c)
>> [   11.905116] Code: e14b05fc e1a7 eb1251f7 e1993f9f (e2833001)
>> [   13.418969] ---[ end trace d7bcd93aea7d243c ]---
>> [   13.442420] Kernel panic - not syncing: Fatal exception
>
> I looked into this a bit further, and it looks like my issue is that
> the btrfs_root* from BTRFS_I(inode)->root is somehow `1` instead of an
> actual pointer value, so it appears this isn't quite an alignment
> issue.
>
> Kernel starts without issue if I just stop doing the overlay (and as a
> result stop using btrfs at all).
>
> I've added some debugging, and the actual file it's trying to fsync
> appears to be: `/var/lib/connman/settings.4CAC7X`
>
> That directory is an overlayfs dir with a btrfs upper and a squashfs lower.
>
> Ideas on why the btrfs_root pointer is 0x1?

Just a bad interaction with overlayfs. This is being discussed in this
thread:  http://www.spinics.net/lists/linux-btrfs/msg47744.html


> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a