Re: RAID5 doesn't mount on boot, but you can afterwards?
hi,

my fstab looks as follows (nb: I added the recovery option to see if that
would help, which it didn't). The boot disk (and @home) is an SSD, and the
label STORAGE represents the RAID5 array:

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# / was on /dev/sdc1 during installation
UUID=0ea60d4d-3f34-4451-8272-442fcccb7f2e /      btrfs recovery,noatime,nodiratime,subvol=@ 0 1
# /home was on /dev/sdc1 during installation
UUID=0ea60d4d-3f34-4451-8272-442fcccb7f2e /home  btrfs recovery,noatime,nodiratime,subvol=@home 0 2
# STORAGE
LABEL=STORAGE /data/HOME       btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@home_dir 0 2
LABEL=STORAGE /data/Pictures   btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@pictures 0 2
LABEL=STORAGE /data/Multimedia btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@multimedia 0 2
LABEL=STORAGE /data/docker     btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@docker 0 2
LABEL=STORAGE /data/vms        btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@vms 0 2
LABEL=STORAGE /data/Downloads  btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@downloads 0 2
LABEL=STORAGE /data/Backups    btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@backups 0 2
LABEL=STORAGE /data/Software   btrfs recovery,noatime,nodiratime,compress=zlib,subvol=@software 0 2

On September 30, 2015 9:04:39 PM Leonidas Spyropoulos wrote:
> Hello,
>
> On 30/09/15, Sjoerd wrote:
> > Hi All,
> >
> > A RAID5 setup on raw devices doesn't want to automount on boot.
> [..]
>
> Post your /etc/fstab file please.
> Thanks
>
> --
> Sent using mutt
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 6/9] vfs: Copy should use file_out rather than file_in
The way to think about this is that the destination filesystem reads the
data from the source file and processes it accordingly. This is especially
important to avoid an infinite loop when doing a "server to server" copy
on NFS.

Signed-off-by: Anna Schumaker
---
 fs/read_write.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 8e7cb33..6f74f1f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1355,7 +1355,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (!(file_in->f_mode & FMODE_READ) ||
 	    !(file_out->f_mode & FMODE_WRITE) ||
 	    (file_out->f_flags & O_APPEND) ||
-	    !file_in->f_op || !file_in->f_op->copy_file_range)
+	    !file_out->f_op || !file_out->f_op->copy_file_range)
 		return -EBADF;
 
 	inode_in = file_inode(file_in);
@@ -1378,8 +1378,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
 	if (ret)
 		return ret;
 
-	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					     len, flags);
+	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					      len, flags);
 	if (ret > 0) {
 		fsnotify_access(file_in);
 		add_rchar(current, ret);
--
2.6.0
[PATCH v5 3/9] btrfs: add .copy_file_range file operation
From: Zach Brown

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown
[Anna Schumaker: Make flags an unsigned int]
Signed-off-by: Anna Schumaker
Reviewed-by: Josef Bacik
Reviewed-by: David Sterba
---
v5:
- Make flags variable an unsigned int
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..0046567 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..d3697e8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+				      u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 * be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
 		btrfs_double_inode_unlock(src, inode);
 	else
 		mutex_unlock(&src->i_mutex);
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, unsigned int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct
[PATCH v5 0/9] VFS: In-kernel copy system call
Copy system calls came up during Plumbers a while ago, mostly because
several filesystems (including NFS and XFS) are currently working on copy
acceleration implementations. We haven't heard from Zach Brown in a while,
so I volunteered to push his patches upstream so individual filesystems
don't need to keep writing their own ioctls.

This posting fixes a few issues that popped up after I submitted v4
yesterday.

Changes in v5:
- Bump syscall number (again)
- Add sys_copy_file_range() to include/linux/syscalls.h
- Change flags parameter on btrfs to an unsigned int

Anna Schumaker (6):
  vfs: Copy should check len after file open mode
  vfs: Copy shouldn't forbid ranges inside the same file
  vfs: Copy should use file_out rather than file_in
  vfs: Remove copy_file_range mountpoint checks
  vfs: Add vfs_copy_file_range() support for pagecache copies
  btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
  vfs: add copy_file_range syscall and vfs helper
  x86: add sys_copy_file_range to syscall tables
  btrfs: add .copy_file_range file operation

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/btrfs/ctree.h                       |   3 +
 fs/btrfs/file.c                        |   1 +
 fs/btrfs/ioctl.c                       |  95 +-
 fs/read_write.c                        | 141 +
 include/linux/copy.h                   |   6 ++
 include/linux/fs.h                     |   3 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/Kbuild              |   1 +
 include/uapi/linux/copy.h              |   8 ++
 kernel/sys_ni.c                        |   1 +
 13 files changed, 228 insertions(+), 40 deletions(-)
 create mode 100644 include/linux/copy.h
 create mode 100644 include/uapi/linux/copy.h

--
2.6.0
[PATCH v5 2/9] x86: add sys_copy_file_range to syscall tables
From: Zach Brown

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	copy_file_range		sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	copy_file_range		sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
--
2.6.0
Re: RAID5 doesn't mount on boot, but you can afterwards?
Hello,

On 30/09/15, Sjoerd wrote:
> Hi All,
>
> A RAID5 setup on raw devices doesn't want to automount on boot.
> [..]

Post your /etc/fstab file please.

Thanks
--
Sent using mutt
[PATCH] btrfs: fix resending received snapshot with parent
This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.

When a snapshot is received, its received_uuid is set to the original
uuid of the subvolume. When that snapshot is then resent to a third
filesystem, its received_uuid is set to the second uuid instead of the
original one. The same was true for the parent_uuid. This behaviour was
partially changed in 37b8d27d, but in that patch only the parent_uuid was
taken from the real original, not the uuid itself, causing the search for
the parent to fail in the case below.

This happens for example when trying to send a series of linked snapshots
(e.g. created by snapper) from the backup file system back to the
original one.

The following commands reproduce the issue in v4.2.1 (no error in 4.1.6):

    # setup three test file systems
    for i in 1 2 3; do
        truncate -s 50M fs$i
        mkfs.btrfs fs$i
        mkdir $i
        mount fs$i $i
    done
    echo "content" > 1/testfile
    btrfs su snapshot -r 1/ 1/snap1
    echo "changed content" > 1/testfile
    btrfs su snapshot -r 1/ 1/snap2

    # works fine:
    btrfs send 1/snap1 | btrfs receive 2/
    btrfs send -p 1/snap1 1/snap2 | btrfs receive 2/

    # ERROR: could not find parent subvolume
    btrfs send 2/snap1 | btrfs receive 3/
    btrfs send -p 2/snap1 2/snap2 | btrfs receive 3/

Signed-off-by: Robin Ruede
---
 fs/btrfs/send.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index aa72bfd..890933b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2351,8 +2351,14 @@ static int send_subvol_begin(struct send_ctx *sctx)
 	}
 
 	TLV_PUT_STRING(sctx, BTRFS_SEND_A_PATH, name, namelen);
-	TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
-		     sctx->send_root->root_item.uuid);
+
+	if (!btrfs_is_empty_uuid(sctx->send_root->root_item.received_uuid))
+		TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
+			     sctx->send_root->root_item.received_uuid);
+	else
+		TLV_PUT_UUID(sctx, BTRFS_SEND_A_UUID,
+			     sctx->send_root->root_item.uuid);
+
 	TLV_PUT_U64(sctx, BTRFS_SEND_A_CTRANSID,
 		    le64_to_cpu(sctx->send_root->root_item.ctransid));
 
 	if (parent_root) {
--
2.6.0
Re: [PATCH v2] fstests: generic: Check if a bull fallocate will change extent number
Qu Wenruo posted on Tue, 29 Sep 2015 18:48:37 +0800 as excerpted:

> Both gives quite good expression, I'll pick one of them.

... And for the one-line title, /bull/bad/ should do it. =:^)

People wanting details about bad /how/ can look at the fuller
description or source.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH 1/2] btrfs: Fix lost-data-profile caused by auto removing bg
On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei wrote:
> Reproduce:
> (In integration-4.3 branch)
>
> TEST_DEV=(/dev/vdg /dev/vdh)
> TEST_DIR=/mnt/tmp
>
> umount "$TEST_DEV" >/dev/null
> mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
>
> mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
> umount "$TEST_DEV"
>
> mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
> btrfs filesystem usage $TEST_DIR
>
> We can see the data chunk changed from raid1 to single:
> # btrfs filesystem usage $TEST_DIR
> Data,single: Size:8.00MiB, Used:0.00B
>    /dev/vdg    8.00MiB
> #
>
> Reason:
> When a empty filesystem mount with -o nospace_cache, the last
> data blockgroup will be auto-removed in umount.
>
> Then if we mount it again, there is no data chunk in the
> filesystem, so the only available data profile is 0x0, result
> is all new chunks are created as single type.
>
> Fix:
> Don't auto-delete last blockgroup for a raid type.
>
> Test:
> Test by above script, and confirmed the logic by debug output.
>
> Signed-off-by: Zhao Lei
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 79a5bd9..3505649 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                        bg_list);
>                space_info = block_group->space_info;
>                list_del_init(&block_group->bg_list);
> -               if (ret || btrfs_mixed_space_info(space_info)) {
> +               if (ret || btrfs_mixed_space_info(space_info) ||
> +                   block_group->list.next == block_group->list.prev) {

This isn't race free. The list block_group->list is protected by the
groups_sem semaphore. Need to take it before doing this check.

You can do that in the "if" statement below this one, where we're
holding &fs_info->groups_sem [1]

thanks

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.3-rc3#n10021

>                        btrfs_put_block_group(block_group);
>                        continue;
>                }
> --
> 1.8.5.1

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
Re: [PATCH] btrfs: fix a compiler warning of may be used uninitialized
On 09/30/15 05:55, Zhao Lei wrote:
>> count is defined iff add_to_ctl == true, so the patch is not
>> necessary. And I'm not quite sure that 0 passed down to
>> __btrfs_add_free_space as 'bytes' makes sense at all.
>
> Agree above all.
>
> So I write following description in changelog:
> "Not real problem, just avoid warning of: ..."
>
> It is just to avoid compiler warning, no function changed.
> A warning in compiler output is not pretty :)

This looks more like a false positive with gcc 4.8.3. With 5.2:

  ..
  CC [M]  fs/btrfs/file-item.o
  CC [M]  fs/btrfs/inode-item.o
  CC [M]  fs/btrfs/inode-map.o
  CC [M]  fs/btrfs/disk-io.o
  CC [M]  fs/btrfs/transaction.o
  ..

No warning, as expected.

-h
RE: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
Hi, Filipe Manana

> -----Original Message-----
> From: Filipe Manana [mailto:fdman...@gmail.com]
> Sent: Wednesday, September 30, 2015 3:41 PM
> To: Zhao Lei
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
>
> On Wed, Sep 30, 2015 at 5:20 AM, Zhao Lei wrote:
> > Hi, Filipe Manana
> >
> > Thanks for reviewing.
> >
> >> -----Original Message-----
> >> From: Filipe Manana [mailto:fdman...@gmail.com]
> >> Sent: Tuesday, September 29, 2015 11:48 PM
> >> To: Zhao Lei
> >> Cc: linux-btrfs@vger.kernel.org
> >> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by
> >> balance bg
> >>
> >> On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei wrote:
> >> > Reproduce:
> >> > (In integration-4.3 branch)
> >> >
> >> > TEST_DEV=(/dev/vdg /dev/vdh)
> >> > TEST_DIR=/mnt/tmp
> >> >
> >> > umount "$TEST_DEV" >/dev/null
> >> > mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
> >> >
> >> > mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
> >> > btrfs balance start -dusage=0 $TEST_DIR
> >> > btrfs filesystem usage $TEST_DIR
> >> >
> >> > dd if=/dev/zero of="$TEST_DIR"/file count=100
> >> > btrfs filesystem usage $TEST_DIR
> >> >
> >> > Result:
> >> > We can see "no data chunk" in first "btrfs filesystem usage":
> >> > # btrfs filesystem usage $TEST_DIR
> >> > Overall:
> >> > ...
> >> > Metadata,single: Size:8.00MiB, Used:0.00B
> >> >    /dev/vdg    8.00MiB
> >> > Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
> >> >    /dev/vdg  122.88MiB
> >> >    /dev/vdh  122.88MiB
> >> > System,single: Size:4.00MiB, Used:0.00B
> >> >    /dev/vdg    4.00MiB
> >> > System,RAID1: Size:8.00MiB, Used:16.00KiB
> >> >    /dev/vdg    8.00MiB
> >> >    /dev/vdh    8.00MiB
> >> > Unallocated:
> >> >    /dev/vdg    1.06GiB
> >> >    /dev/vdh    1.07GiB
> >> >
> >> > And "data chunks changed from raid1 to single" in second "btrfs
> >> > filesystem usage":
> >> > # btrfs filesystem usage $TEST_DIR
> >> > Overall:
> >> > ...
> >> > Data,single: Size:256.00MiB, Used:0.00B
> >> >    /dev/vdh  256.00MiB
> >> > Metadata,single: Size:8.00MiB, Used:0.00B
> >> >    /dev/vdg    8.00MiB
> >> > Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
> >> >    /dev/vdg  122.88MiB
> >> >    /dev/vdh  122.88MiB
> >> > System,single: Size:4.00MiB, Used:0.00B
> >> >    /dev/vdg    4.00MiB
> >> > System,RAID1: Size:8.00MiB, Used:16.00KiB
> >> >    /dev/vdg    8.00MiB
> >> >    /dev/vdh    8.00MiB
> >> > Unallocated:
> >> >    /dev/vdg    1.06GiB
> >> >    /dev/vdh  841.92MiB
> >> >
> >> > Reason:
> >> > btrfs balance delete last data chunk in case of no data in the
> >> > filesystem, then we can see "no data chunk" by "fi usage"
> >> > command.
> >> >
> >> > And when we do write operation to fs, the only available data
> >> > profile is 0x0, result is all new chunks are created as single type.
> >> >
> >> > Fix:
> >> > Allocate a data chunk explicitly in balance operation, to reserve
> >> > at least one data chunk in the filesystem.
> >>
> >> Allocate a data chunk explicitly to ensure we don't lose the raid
> >> profile for data.
> >>
> >
> > Thanks, will change in v2.
> >
> >> >
> >> > Test:
> >> > Test by above script, and confirmed the logic by debug output.
> >> >
> >> > Signed-off-by: Zhao Lei
> >> > ---
> >> >  fs/btrfs/volumes.c | 19 +++
> >> >  1 file changed, 19 insertions(+)
> >> >
> >> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> > index 6fc73586..3d5e41e 100644
> >> > --- a/fs/btrfs/volumes.c
> >> > +++ b/fs/btrfs/volumes.c
> >> > @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
> >> >         u64 limit_data = bctl->data.limit;
> >> >         u64 limit_meta = bctl->meta.limit;
> >> >         u64 limit_sys = bctl->sys.limit;
> >> > +       int chunk_reserved = 0;
> >> >
> >> >         /* step one make some room on all the devices */
> >> >         devices = &fs_info->fs_devices->devices;
> >> > @@ -3387,6 +3388,24 @@ again:
> >> >                         goto loop;
> >> >                 }
> >> >
> >> > +               if (!chunk_reserved) {
> >> > +                       trans = btrfs_start_transaction(chunk_root, 0);
> >> > +                       if (IS_ERR(trans)) {
> >> > +                               mutex_unlock(&fs_info->delete_unused_bgs_mutex);
> >> > +                               ret = PTR_ERR(trans);
> >> > +                               goto error;
> >> > +                       }
> >> > +
> >> > +                       ret = btrfs_force_chunk_alloc(trans,
> >> > +                                                     chunk_root, 1);
> >>
> >> Can we please use the symbol BTRFS_BLOCK_GROUP_DATA instead of 1?
> >>
> > Thanks, will change in v2.
> >
> >
> >> > +                       if (ret < 0) {
> >> > +                               mutex_unlock(&fs_info->delete_unused_bgs_mutex);
[PATCH] fstests: btrfs: Check if fallocate re-truncates page beyond EOF
Even if the fallocate range doesn't cover the last page of a file, btrfs
will still re-truncate the last page. Such behavior is completely
redundant and unneeded, and is fixed by the following kernel patch:
btrfs: Avoid truncate tailing page if fallocate range doesn't exceed
inode size

Add this test case to check for that malfunction.

Signed-off-by: Qu Wenruo
---
 tests/btrfs/104     | 83 +
 tests/btrfs/104.out |  3 ++
 tests/btrfs/group   |  1 +
 3 files changed, 87 insertions(+)
 create mode 100755 tests/btrfs/104
 create mode 100644 tests/btrfs/104.out

diff --git a/tests/btrfs/104 b/tests/btrfs/104
new file mode 100755
index 0000000..f3ddc15
--- /dev/null
+++ b/tests/btrfs/104
@@ -0,0 +1,83 @@
+#! /bin/bash
+# FS QA Test 104
+#
+# Test that calling fallocate against a range which is already
+# allocated does not truncate beyond EOF
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2015 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/defrag
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_need_to_be_root
+
+# Use 64K file size to match any sectorsize
+# And with an unaligned tailing range to ensure it will be at least 2 pages
+filesize=$(( 64 * 1024 + 1024 ))
+
+# Fallocate a range that will not cover the tailing page
+fallocrange=$(( 64 * 1024 ))
+
+_scratch_mkfs > /dev/null 2>&1
+_scratch_mount
+$XFS_IO_PROG -f -c "pwrite 0 $filesize" $SCRATCH_MNT/foo | _filter_xfs_io
+sync
+orig_extent_nr=`_extent_count $SCRATCH_MNT/foo`
+
+# As all space is allocated and even written to disk, this falloc
+# should do nothing with extent modification.
+$XFS_IO_PROG -f -c "falloc 0 $fallocrange" $SCRATCH_MNT/foo
+sync
+new_extent_nr=`_extent_count $SCRATCH_MNT/foo`
+
+echo "orig: $orig_extent_nr, new: $new_extent_nr" >> $seqres.full
+
+if [ "x$orig_extent_nr" != "x$new_extent_nr" ]; then
+	echo "Extent beyond EOF is re-truncated"
+	echo "orig_extent_nr: $orig_extent_nr new_extent_nr: $new_extent_nr"
+fi
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/104.out b/tests/btrfs/104.out
new file mode 100644
index 0000000..4c43e17
--- /dev/null
+++ b/tests/btrfs/104.out
@@ -0,0 +1,3 @@
+QA output created by 104
+wrote 66560/66560 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index e92a65a..640336b 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -106,3 +106,4 @@
 101 auto quick replace
 102 auto quick metadata enospc
 103 auto quick clone compress
+104 auto quick prealloc
--
1.8.3.1
Re: lots of snapshots, forcing defragmentation, and figuring out which files to defrag or nodatacow
James Cook posted on Mon, 28 Sep 2015 22:51:05 -0700 as excerpted:

> The context of these three questions is that I'm experiencing
> occasional hangs for several seconds while btrfs-transacti works, and
> very long boot times. General comments welcome. System info at bottom,
> end part of dmesg.log attached.
>
> Q1:
>
> I keep a lot of read-only snapshots (> 300 total) of all of my
> subvolumes and haven't deleted any so far. Is this in itself a problem
> or unanticipated use of btrfs?

Very large numbers of snapshots do slow things down, but ~300 isn't what
I'd call "very large" -- we're talking thousands to tens of thousands.

My general recommendation is to keep it to ~250ish (under 300) per
snapshotted subvolume, preferably under 2000 (and if possible 1000)
total -- easy enough to do even with automated frequent snapshotting (on
the order of an hour apart, initially), as long as an equally automated
snapshot-thinning program is also established. At ~250 per subvolume,
1000 is 4 subvolumes' worth, 2000 is 8 subvolumes' worth.

A bit over 300, assuming they're all of the same subvolume, is getting a
bit high, but it shouldn't be causing a lot of trouble yet. It's just
time to start thinking about a thinning program.

There's one exception: quotas. Quotas continue to be an issue on btrfs;
they're on their third rewrite now, and while the devs believe it will
work this time, there are still some serious bugs that will take a
couple more kernels to work out. And besides not working right, they
dramatically increase scalability issues. So my recommendation, unless
you're directly working with the devs to test, report problems with, and
bug-trace various quota issues, is just don't run them on btrfs at this
time. If you need quotas, use a filesystem where they're mature and
work. If you don't, use btrfs without them. Really. I've seen at least
two confirmed cases posted where people running quotas turned them off
and their scaling issues disappeared.

So if you have them on, that could well be your problem, right there.

> Q2:
>
> I have some files that remain heavily fragmented (according to
> filefrag) even after I run btrfs fi defragment. I think this happens
> because btrfs doesn't want to unlink parts of the files from their
> snapshotted copies. Can I tell btrfs to defragment them anyway, and not
> worry about wasting space? And can I make the autodefrag mount option
> do this?
>
> For example (not all output shown):
>
> # filefrag *
> ...
> system@1973a03e3af1449ba5dd93362953fd5f-0001-00051f9377f11af6.journal:
> 553 extents found
> ...
>
> # btrfs fi defragment -rf .
>
> # filefrag *
> ...
> system@1973a03e3af1449ba5dd93362953fd5f-0001-00051f9377f11af6.journal:
> 331 extents found
> ...

Several points to note here:

1) Filefrag doesn't understand btrfs compression.

If you don't use btrfs compression, this doesn't apply, but for
btrfs-compressed files, filefrag reports huge numbers of extents --
generally one per btrfs compression block (128 KiB), so 8 per MiB, 8192
per GiB of (before-compression -- not that btrfs gives you a way to see
post-compression file size anyway) file size. But unless you run
compress-force you won't see it everywhere, because btrfs only
compresses some files.

2) Btrfs defrag isn't snapshot-aware, and will only defrag the files
it's directly pointed at, using more space as it breaks the reflinks to
the snapshotted copy.

Around 3.9, snapshot-aware defrag was introduced, but it turned out to
have *severe* scalability issues, so that was rolled back, and
snapshot-aware defrag was turned off again in, IIRC, 3.12 (thus well
before what you're running). So worrying about breaking snapshot
reflinks while defragging isn't going to be your problem; that, per se,
is simply not an issue.

3) What /can/ be an issue is dealt with using defrag's -t parameter.

I don't remember what the default target extent size is, but it's
somewhat smaller than you might expect, well under a gig. Extent sizes
larger than this are considered to be already defragged and aren't
touched. (While this does touch on #2 above as well, not unnecessarily
breaking reflinks to extents shared with other snapshots, the mechanism
is one of extent size, not whether the extent is shared with another
snapshot. So even if it's a new file not yet snapshotted, extents over
this size won't be touched.)

It's worth keeping in mind that btrfs' nominal data chunk size is 1 GiB.
As such, that's the nominal largest extent size as well, altho in some
cases (data chunks created on nearly empty TiB-scale filesystems) data
chunk size can be larger, multiple GiB, in which case extent size can be
larger as well. Regardless, extent sizes > 1 GiB really aren't going to
be a performance issue anyway, so while using the -t 1G or -t 2G option
is a good idea and should reduce fragmentation further for extents
between the default size and your -t size, going above that isn't
Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
On Wed, Sep 30, 2015 at 5:20 AM, Zhao Lei wrote:
> Hi, Filipe Manana
>
> Thanks for reviewing.
>
>> -----Original Message-----
>> From: Filipe Manana [mailto:fdman...@gmail.com]
>> Sent: Tuesday, September 29, 2015 11:48 PM
>> To: Zhao Lei
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
>>
>> On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei wrote:
>> > Reproduce:
>> > (In integration-4.3 branch)
>> >
>> > TEST_DEV=(/dev/vdg /dev/vdh)
>> > TEST_DIR=/mnt/tmp
>> >
>> > umount "$TEST_DEV" >/dev/null
>> > mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
>> >
>> > mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
>> > btrfs balance start -dusage=0 $TEST_DIR
>> > btrfs filesystem usage $TEST_DIR
>> >
>> > dd if=/dev/zero of="$TEST_DIR"/file count=100
>> > btrfs filesystem usage $TEST_DIR
>> >
>> > Result:
>> > We can see "no data chunk" in first "btrfs filesystem usage":
>> > # btrfs filesystem usage $TEST_DIR
>> > Overall:
>> > ...
>> > Metadata,single: Size:8.00MiB, Used:0.00B
>> >    /dev/vdg    8.00MiB
>> > Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
>> >    /dev/vdg  122.88MiB
>> >    /dev/vdh  122.88MiB
>> > System,single: Size:4.00MiB, Used:0.00B
>> >    /dev/vdg    4.00MiB
>> > System,RAID1: Size:8.00MiB, Used:16.00KiB
>> >    /dev/vdg    8.00MiB
>> >    /dev/vdh    8.00MiB
>> > Unallocated:
>> >    /dev/vdg    1.06GiB
>> >    /dev/vdh    1.07GiB
>> >
>> > And "data chunks changed from raid1 to single" in second "btrfs
>> > filesystem usage":
>> > # btrfs filesystem usage $TEST_DIR
>> > Overall:
>> > ...
>> > Data,single: Size:256.00MiB, Used:0.00B
>> >    /dev/vdh  256.00MiB
>> > Metadata,single: Size:8.00MiB, Used:0.00B
>> >    /dev/vdg    8.00MiB
>> > Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
>> >    /dev/vdg  122.88MiB
>> >    /dev/vdh  122.88MiB
>> > System,single: Size:4.00MiB, Used:0.00B
>> >    /dev/vdg    4.00MiB
>> > System,RAID1: Size:8.00MiB, Used:16.00KiB
>> >    /dev/vdg    8.00MiB
>> >    /dev/vdh    8.00MiB
>> > Unallocated:
>> >    /dev/vdg    1.06GiB
>> >    /dev/vdh  841.92MiB
>> >
>> > Reason:
>> > btrfs balance delete last data chunk in case of no data in the
>> > filesystem, then we can see "no data chunk" by "fi usage"
>> > command.
>> >
>> > And when we do write operation to fs, the only available data
>> > profile is 0x0, result is all new chunks are allocated single type.
>> >
>> > Fix:
>> > Allocate a data chunk explicitly in balance operation, to reserve at
>> > least one data chunk in the filesystem.
>>
>> Allocate a data chunk explicitly to ensure we don't lose the raid
>> profile for data.
>>
>
> Thanks, will change in v2.
>
>> >
>> > Test:
>> > Test by above script, and confirmed the logic by debug output.
>> >
>> > Signed-off-by: Zhao Lei
>> > ---
>> >  fs/btrfs/volumes.c | 19 +++
>> >  1 file changed, 19 insertions(+)
>> >
>> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> > index 6fc73586..3d5e41e 100644
>> > --- a/fs/btrfs/volumes.c
>> > +++ b/fs/btrfs/volumes.c
>> > @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
>> >         u64 limit_data = bctl->data.limit;
>> >         u64 limit_meta = bctl->meta.limit;
>> >         u64 limit_sys = bctl->sys.limit;
>> > +       int chunk_reserved = 0;
>> >
>> >         /* step one make some room on all the devices */
>> >         devices = &fs_info->fs_devices->devices;
>> > @@ -3387,6 +3388,24 @@ again:
>> >                         goto loop;
>> >                 }
>> >
>> > +               if (!chunk_reserved) {
>> > +                       trans = btrfs_start_transaction(chunk_root, 0);
>> > +                       if (IS_ERR(trans)) {
>> > +                               mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> > +                               ret = PTR_ERR(trans);
>> > +                               goto error;
>> > +                       }
>> > +
>> > +                       ret = btrfs_force_chunk_alloc(trans,
>> > +                                                     chunk_root, 1);
>>
>> Can we please use the symbol BTRFS_BLOCK_GROUP_DATA instead of 1?
>>
> Thanks, will change in v2.
>
>
>> > +                       if (ret < 0) {
>> > +                               mutex_unlock(&fs_info->delete_unused_bgs_mutex);
>> > +                               goto error;
>> > +                       }
>> > +
>> > +                       btrfs_end_transaction(trans, chunk_root);
>> > +                       chunk_reserved = 1;
>> > +               }
>>
>> Can we do this logic only if the block group is a data one? I.e. no
>> point allocating a data block group if our balance only touches
>> metadata ones (due to some filter for e.g.).
>>
> Thanks for point out it, will change in v2.

I find it somewhat awkward that we always allocate a new data block
group no matter what. Some balance
Re: [PATCH v4 0/9] Btrfs: free space B-tree
On Tue, Sep 29, 2015 at 08:50:29PM -0700, Omar Sandoval wrote: > Hi, > > Here's one more reroll of the free space B-tree patches, a more scalable > alternative to the free space cache. Minimal changes this time around, I > mainly wanted to resend this after Holger and I cleared up his bug > report here: http://www.spinics.net/lists/linux-btrfs/msg47165.html. It > initially looked like it was a bug in a patch that Josef sent, then in > this series, but finally Holger and I figured out that it was something > else in the queue of patches he carries around, we just don't know what > yet (I'm in the middle of looking into it). Okay, I tracked down Holger's bug to a bad merge in his patch queue, so we're off the hook. > While trying to reproduce > that bug, I ran xfstests about a trillion times and a bunch of stress > tests, so this is fairly well tested now. Additionally, the last time > around, Holger and Austin both bravely offered their Tested-bys on the > series. I wasn't sure which patch(es) to tack them onto so here they > are: > > Tested-by: Holger Hoffstätte> Tested-by: Austin S. Hemmelgarn > > Thanks, everyone! > > Omar > > Changes from v3->v4: > > - Added a missing btrfs_end_transaction() to btrfs_create_free_space_tree() > and > btrfs_clear_free_space_tree() in the error cases after we abort the > transaction (see http://www.spinics.net/lists/linux-btrfs/msg47545.html) > - Rebased the kernel patches on v4.3-rc3 > - Rebased the progs patches on v4.2.1 > > v3: http://www.spinics.net/lists/linux-btrfs/msg47095.html > > Changes from v2->v3: > > - Fixed a warning in the free space tree sanity tests caught by Zhao Lei. > - Moved the addition of a block group to the free space tree to occur either > on > the first attempt to modify the free space for the block group or in > btrfs_create_pending_block_groups(), whichever happens first. This avoids a > deadlock (lock recursion) when modifying the free space tree requires > allocating a new block group. 
In order to do this, it was simpler to change > the on-disk semantics: the superblock stripes will now appear to be free > space > according to the free space tree, but load_free_space_tree() will still > exclude them when building the in-memory free space cache. > - Changed the free_space_tree option to space_cache=v2 and made clear_cache > clear the free space tree. If the free space tree has been created, > the mount will fail unless space_cache=v2 or nospace_cache,clear_cache > is given because we cannot allow the free space tree to get out of > date. > - Did a once-over of the code and caught a couple of error handling typos. > > v2: http://www.spinics.net/lists/linux-btrfs/msg46796.html > > Changes from v1->v2: > > - Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0" > - Added aborts in the free space tree code closer to the site the error is > encountered: where we add or remove block groups, add or remove free space, > and also when we convert formats > - Moved loading of the free space tree into caching_thread() and added a new > patch 3 in preparation for it > - Commented a bunch of stuff in the extent buffer bitmap operations and > refactored some of the complicated logic > > v1: http://www.spinics.net/lists/linux-btrfs/msg46713.html > > Omar Sandoval (9): > Btrfs: add extent buffer bitmap operations > Btrfs: add extent buffer bitmap sanity tests > Btrfs: add helpers for read-only compat bits > Btrfs: refactor caching_thread() > Btrfs: introduce the free space B-tree on-disk format > Btrfs: implement the free space B-tree > Btrfs: add free space tree sanity tests > Btrfs: wire up the free space tree to the extent tree > Btrfs: add free space tree mount option > > fs/btrfs/Makefile |5 +- > fs/btrfs/ctree.h | 157 +++- > fs/btrfs/disk-io.c | 38 + > fs/btrfs/extent-tree.c | 98 +- > fs/btrfs/extent_io.c | 183 +++- > fs/btrfs/extent_io.h | 10 +- > fs/btrfs/free-space-tree.c | 1584 > > fs/btrfs/free-space-tree.h | 72 ++ > 
fs/btrfs/super.c | 56 +- > fs/btrfs/tests/btrfs-tests.c | 52 ++ > fs/btrfs/tests/btrfs-tests.h | 10 + > fs/btrfs/tests/extent-io-tests.c | 138 ++- > fs/btrfs/tests/free-space-tests.c | 35 +- > fs/btrfs/tests/free-space-tree-tests.c | 571 > fs/btrfs/tests/qgroup-tests.c | 20 +- > include/trace/events/btrfs.h |3 +- > 16 files changed, 2925 insertions(+), 107 deletions(-) > create mode 100644 fs/btrfs/free-space-tree.c > create mode 100644 fs/btrfs/free-space-tree.h > create mode 100644 fs/btrfs/tests/free-space-tree-tests.c > > -- > 2.6.0 > -- Omar
[RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory
As per overlayfs documentation, any activity on a merged directory for a application that is doing such activity should work exactly as if that would be a normal, non overlayfs-merged directory. That is, e.g. simple fopen-fwrite-fsync-fclose sequence should work just fine. But apparently it does not. Add a simple generic test to check that. As of right now (linux-4.2.1) this test fails at least on btrfs. PS: An alternative (and probably better approach) would be to run fstests test suite with TEST_DIR set to overlayfs work directory. Also, i'm not sure that this test fits here, but it's my best guess. Signed-off-by: Roman Lebedev--- tests/generic/111 | 80 +++ tests/generic/111.out | 5 tests/generic/group | 1 + 3 files changed, 86 insertions(+) create mode 100755 tests/generic/111 create mode 100644 tests/generic/111.out diff --git a/tests/generic/111 b/tests/generic/111 new file mode 100755 index 000..3c2599b --- /dev/null +++ b/tests/generic/111 @@ -0,0 +1,80 @@ +#! /bin/bash +# FS QA Test 111 +# +# Test that fsync works on file in overlayfs merged directory +# +#--- +# Copyright (c) 2015 Roman Lebedev. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! 
+trap "_cleanup; exit \$status" 0 1 2 3 15 + +lower=$TEST_DIR/lower.$$ +upper=$TEST_DIR/upper.$$ +work=$TEST_DIR/work.$$ +merged=$TEST_DIR/merged.$$ + +_cleanup() +{ + cd / + rm -f $tmp.* + umount $merged + rm -rf $merged + rm -rf $work + rm -rf $upper + rm -rf $lower +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here + +_supported_fs generic +_supported_os IRIX Linux +_require_test + +mkdir $lower + +$XFS_IO_PROG -f -c "pwrite 0 4k" -c "fsync" \ + $lower/file | _filter_xfs_io + +mkdir $upper +mkdir $work +mkdir $merged + +sync + +mount -t overlay overlay -olowerdir=$lower \ + -oupperdir=$upper -oworkdir=$work $merged + +$XFS_IO_PROG -f -c "pwrite 0 4k" -c "fsync" \ + $merged/file | _filter_xfs_io + +# if we are here, then fsync did not crash, so we're good. + +# success, all done +status=0 +exit diff --git a/tests/generic/111.out b/tests/generic/111.out new file mode 100644 index 000..36c7fde --- /dev/null +++ b/tests/generic/111.out @@ -0,0 +1,5 @@ +QA output created by 111 +wrote 4096/4096 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 4096/4096 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) diff --git a/tests/generic/group b/tests/generic/group index 4ae256f..d3516f9 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -112,6 +112,7 @@ 107 auto quick metadata 108 auto quick rw 109 auto metadata dir +111 auto quick 112 rw aio auto quick 113 rw aio auto quick 117 attr auto quick -- 2.6.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs
Hello. My / is btrfs. To do some my local stuff more cleanly i wanted to use overlayfs, but it didn't quite work. Simple non-automatic sequence to reproduce the issue: mkdir lower upper work merged mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged vi merged/file :wq Results in vi being killed on exit, and the following trace appears in dmesg: [34304.047841] BUG: unable to handle kernel paging request at 09618e56 [34304.047846] IP: [] btrfs_sync_file+0xa6/0x350 [btrfs] [34304.047864] PGD 0 [34304.047866] Oops: 0002 [#12] SMP [34304.047867] Modules linked in: overlay cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc fglrx(PO) nls_utf8 joydev nls_cp437 vfat fat hid_generic usbhid kvm_amd hid kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi sha256_ssse3 sha256_generic snd_hda_intel snd_hda_codec hmac drbg ansi_cprng aesni_intel snd_hda_core aes_x86_64 mxm_wmi snd_hwdep lrw eeepc_wmi snd_pcm gf128mul asus_wmi sparse_keymap rfkill video snd_timer glue_helper sp5100_tco evdev ablk_helper e1000e ohci_pci pcspkr snd ohci_hcd xhci_pci edac_mce_amd ehci_pci serio_raw xhci_hcd soundcore fam15h_power ehci_hcd cryptd edac_core ptp pps_core usbcore k10temp i2c_piix4 [34304.047893] sg usb_common acpi_cpufreq wmi tpm_infineon button processor shpchp tpm_tis tpm thermal_sys tcp_yeah tcp_vegas it87 hwmon_vid loop parport_pc ppdev lp parport autofs4 crc32c_generic btrfs xor raid6_pq sd_mod crc32c_intel ahci libahci libata scsi_mod [34304.047905] CPU: 4 PID: 13990 Comm: vi Tainted: P DO 4.2.0-1-amd64 #1 Debian 4.2.1-2 [34304.047906] Hardware name: To be filled by O.E.M. 
To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015 [34304.047908] task: 8803d5f7f2c0 ti: 8806a3ec8000 task.ti: 8806a3ec8000 [34304.047909] RIP: 0010:[] [] btrfs_sync_file+0xa6/0x350 [btrfs] [34304.047920] RSP: 0018:8806a3ecbe88 EFLAGS: 00010246 [34304.047921] RAX: 8803d5f7f2c0 RBX: 8807b2d46600 RCX: 81a6ad00 [34304.047922] RDX: 8000 RSI: RDI: 8807c19f8970 [34304.047923] RBP: 8807c19f8970 R08: R09: 0001 [34304.047924] R10: R11: 0246 R12: 8807c19f88c8 [34304.047925] R13: R14: 09618b22 R15: 55cb20184a70 [34304.047926] FS: 7f31c5492800() GS:88082fd0() knlGS: [34304.047927] CS: 0010 DS: ES: CR0: 80050033 [34304.047928] CR2: 09618e56 CR3: 00044af44000 CR4: 000406e0 [34304.047929] Stack: [34304.047930] 0001 7fff 880403d5b918 8000 [34304.047932] 55cb20186d40 8807b2d46600 [34304.047933] 0004 88044b249000 0020 8807b2d46600 [34304.047935] Call Trace: [34304.047939] [] ? do_fsync+0x38/0x60 [34304.047940] [] ? SyS_fsync+0x10/0x20 [34304.047943] [] ? system_call_fast_compare_end+0xc/0x6b [34304.047944] Code: 49 8b 0f 48 85 c9 75 e9 eb b3 48 8b 44 24 08 49 8d ac 24 a8 00 00 00 48 89 ef 4c 29 e8 48 83 c0 01 48 89 44 24 18 e8 3a 59 3e e1 41 ff 86 34 03 00 00 49 8b 84 24 70 ff ff ff 48 c1 e8 07 83 [34304.047959] RIP [] btrfs_sync_file+0xa6/0x350 [btrfs] [34304.047970] RSP [34304.047970] CR2: 09618e56 [34304.047972] ---[ end trace 414199893a542949 ]--- I was able to create a new fstests test that reproduces my issue, and i'm sending it as follow-up to this message. Roman Lebedev (1): fstests: generic: Test that fsync works on file in overlayfs merged directory tests/generic/111 | 80 +++ tests/generic/111.out | 5 tests/generic/group | 1 + 3 files changed, 86 insertions(+) create mode 100755 tests/generic/111 create mode 100644 tests/generic/111.out -- 2.6.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2] btrfs: Fix lost-data-profile caused by auto removing bg
Hi, Filipe Manana > -Original Message- > From: linux-btrfs-ow...@vger.kernel.org > [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Filipe Manana > Sent: Wednesday, September 30, 2015 3:43 PM > To: Zhao Lei> Cc: linux-btrfs@vger.kernel.org > Subject: Re: [PATCH 1/2] btrfs: Fix lost-data-profile caused by auto removing > bg > > On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei wrote: > > Reproduce: > > (In integration-4.3 branch) > > > > TEST_DEV=(/dev/vdg /dev/vdh) > > TEST_DIR=/mnt/tmp > > > > umount "$TEST_DEV" >/dev/null > > mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" > > > > mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" > > umount "$TEST_DEV" > > > > mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" > > btrfs filesystem usage $TEST_DIR > > > > We can see the data chunk changed from raid1 to single: > > # btrfs filesystem usage $TEST_DIR > > Data,single: Size:8.00MiB, Used:0.00B > > /dev/vdg8.00MiB > > # > > > > Reason: > > When a empty filesystem mount with -o nospace_cache, the last data > > blockgroup will be auto-removed in umount. > > > > Then if we mount it again, there is no data chunk in the filesystem, > > so the only available data profile is 0x0, result is all new chunks > > are created as single type. > > > > Fix: > > Don't auto-delete last blockgroup for a raid type. > > > > Test: > > Test by above script, and confirmed the logic by debug output. 
> > > > Signed-off-by: Zhao Lei > > --- > > fs/btrfs/extent-tree.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index > > 79a5bd9..3505649 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -10012,7 +10012,8 @@ void btrfs_delete_unused_bgs(struct > btrfs_fs_info *fs_info) > >bg_list); > > space_info = block_group->space_info; > > list_del_init(_group->bg_list); > > - if (ret || btrfs_mixed_space_info(space_info)) { > > + if (ret || btrfs_mixed_space_info(space_info) || > > + block_group->list.next == block_group->list.prev) > > + { > > This isn't race free. The list block_group->list is protected by the > groups_sem > semaphore. Need to take before doing this check. Thanks for pointing out this. > You can do that in the "if" statement below this one, where we're holding > _info->groups_sem [1] > It is hard to do check in btrfs_remove_block_group(), because it is common function used by both balance and auto-remove bg. For balance operation, we can remove lattest bg in some case, or we need add additional argument to separate these two operation(complex). So I decided to take groups_sem semaphore in place of checking. Thanks for notice this lock problem. btw, could I add your signed-off-by or reviewed-by in patch 2/2? Thanks Zhaolei > thanks > > [1] > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/exte > nt-tree.c?id=refs/tags/v4.3-rc3#n10021 > > > btrfs_put_block_group(block_group); > > continue; > > } > > -- > > 1.8.5.1 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" > > in the body of a message to majord...@vger.kernel.org More majordomo > > info at http://vger.kernel.org/majordomo-info.html > > > > -- > Filipe David Manana, > > "Reasonable men adapt themselves to the world. > Unreasonable men adapt the world to themselves. > That's why all progress depends on unreasonable men." 
[PATCH v2 2/2] btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
 (In integration-4.3 branch)

 TEST_DEV=(/dev/vdg /dev/vdh)
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 btrfs balance start -dusage=0 $TEST_DIR
 btrfs filesystem usage $TEST_DIR

 dd if=/dev/zero of="$TEST_DIR"/file count=100
 btrfs filesystem usage $TEST_DIR

Result:
We can see "no data chunk" in the first "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
 ...
 Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
    /dev/vdg      122.88MiB
    /dev/vdh      122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
    /dev/vdg        4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
    /dev/vdg        8.00MiB
    /dev/vdh        8.00MiB
 Unallocated:
    /dev/vdg        1.06GiB
    /dev/vdh        1.07GiB

And "data chunks changed from raid1 to single" in the second "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
 ...
 Data,single: Size:256.00MiB, Used:0.00B
    /dev/vdh      256.00MiB
 Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
    /dev/vdg      122.88MiB
    /dev/vdh      122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
    /dev/vdg        4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
    /dev/vdg        8.00MiB
    /dev/vdh        8.00MiB
 Unallocated:
    /dev/vdg        1.06GiB
    /dev/vdh      841.92MiB

Reason:
btrfs balance deletes the last data chunk when there is no data in the filesystem, so afterwards "fi usage" shows "no data chunk".

And when we then write to the fs, the only available data profile is 0x0, with the result that all new chunks are allocated as single type.

Fix:
Allocate a data chunk explicitly to ensure we don't lose the raid profile for data.

Test:
Tested by the above script, and confirmed the logic by debug output.

Changelog v1->v2:
1: Update the patch description of the "Fix" field
2: Use BTRFS_BLOCK_GROUP_DATA for btrfs_force_chunk_alloc instead of 1
3: Only reserve a chunk when balancing data chunks.
All suggested-by: Filipe MananaSigned-off-by: Zhao Lei --- fs/btrfs/volumes.c | 24 1 file changed, 24 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 6fc73586..cd9e5bd 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info) u64 limit_data = bctl->data.limit; u64 limit_meta = bctl->meta.limit; u64 limit_sys = bctl->sys.limit; + int chunk_reserved = 0; /* step one make some room on all the devices */ devices = _info->fs_devices->devices; @@ -3326,6 +3327,8 @@ again: key.type = BTRFS_CHUNK_ITEM_KEY; while (1) { + u64 chunk_type; + if ((!counting && atomic_read(_info->balance_pause_req)) || atomic_read(_info->balance_cancel_req)) { ret = -ECANCELED; @@ -3371,8 +3374,10 @@ again: spin_unlock(_info->balance_lock); } + chunk_type = btrfs_chunk_type(leaf, chunk); ret = should_balance_chunk(chunk_root, leaf, chunk, found_key.offset); + btrfs_release_path(path); if (!ret) { mutex_unlock(_info->delete_unused_bgs_mutex); @@ -3387,6 +3392,25 @@ again: goto loop; } + if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) && !chunk_reserved) { + trans = btrfs_start_transaction(chunk_root, 0); + if (IS_ERR(trans)) { + mutex_unlock(_info->delete_unused_bgs_mutex); + ret = PTR_ERR(trans); + goto error; + } + + ret = btrfs_force_chunk_alloc(trans, chunk_root, + BTRFS_BLOCK_GROUP_DATA); + if (ret < 0) { + mutex_unlock(_info->delete_unused_bgs_mutex); + goto error; + } + + btrfs_end_transaction(trans, chunk_root); + chunk_reserved = 1; + } + ret = btrfs_relocate_chunk(chunk_root, found_key.offset); mutex_unlock(_info->delete_unused_bgs_mutex); -- 1.8.5.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
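The control flow the v2 patch adds to __btrfs_balance() can be mimicked in userspace (purely illustrative; the chunk-type names and function below are made up, not kernel code): the chunk_reserved flag ensures exactly one forced data-chunk allocation, performed just before the first data chunk is relocated, so the data raid profile survives even if every data chunk is relocated away.

```shell
# Illustrative model of the fix: before relocating the first data
# chunk, force-allocate one data chunk; never do it again in the
# same balance run, and never for metadata/system-only balances.
balance_chunks() {
    chunk_reserved=0
    actions=
    for chunk_type in "$@"; do
        if [ "$chunk_type" = data ] && [ "$chunk_reserved" -eq 0 ]; then
            actions="$actions force-alloc-data"
            chunk_reserved=1
        fi
        actions="$actions relocate-$chunk_type"
    done
    echo $actions   # unquoted on purpose: collapses the leading space
}

balance_chunks system metadata data data
# exactly one force-alloc appears, immediately before the first
# data relocation; non-data chunks trigger no allocation at all
```

The last point is Filipe's review comment applied: a metadata-only balance (e.g. via filters) performs no data chunk allocation.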
[PATCH v2 1/2] btrfs: Fix lost-data-profile caused by auto removing bg
Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" umount "$TEST_DEV" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs filesystem usage $TEST_DIR We can see the data chunk changed from raid1 to single: # btrfs filesystem usage $TEST_DIR Data,single: Size:8.00MiB, Used:0.00B /dev/vdg8.00MiB # Reason: When a empty filesystem mount with -o nospace_cache, the last data blockgroup will be auto-removed in umount. Then if we mount it again, there is no data chunk in the filesystem, so the only available data profile is 0x0, result is all new chunks are created as single type. Fix: Don't auto-delete last blockgroup for a raid type. Test: Test by above script, and confirmed the logic by debug output. Changelog v1->v2: 1: Put code of checking block_group->list into semaphore of space_info->groups_sem. Noticed-by: Filipe MananaSigned-off-by: Zhao Lei --- fs/btrfs/extent-tree.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 79a5bd9..ed9426c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10010,8 +10010,18 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) block_group = list_first_entry(_info->unused_bgs, struct btrfs_block_group_cache, bg_list); - space_info = block_group->space_info; list_del_init(_group->bg_list); + + space_info = block_group->space_info; + + down_read(_info->groups_sem); + if (block_group->list.next == block_group->list.prev) { + up_read(_info->groups_sem); + btrfs_put_block_group(block_group); + continue; + } + up_read(_info->groups_sem); + if (ret || btrfs_mixed_space_info(space_info)) { btrfs_put_block_group(block_group); continue; -- 1.8.5.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html
[PATCH V5 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache
When reading the page from the disk, we can race with Direct I/O which can get the page lock (before prepare_uptodate_page() gets it) and can go ahead and invalidate the page. Hence if the page is not found in the inode's address space, retry the operation of getting a page. Signed-off-by: Chandan Rajendra--- fs/btrfs/file.c | 16 1 file changed, 16 insertions(+) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 5715e29..76db77c 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1316,6 +1316,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages, int faili; for (i = 0; i < num_pages; i++) { +again: pages[i] = find_or_create_page(inode->i_mapping, index + i, mask | __GFP_WRITE); if (!pages[i]) { @@ -1330,6 +1331,21 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages, if (i == num_pages - 1) err = prepare_uptodate_page(pages[i], pos + write_bytes, false); + + /* +* When reading the page from the disk, we can race +* with direct i/o which can get the page lock (before +* prepare_uptodate_page() gets it) and can go ahead +* and invalidate the page. Hence if the page is found +* to be not belonging to the inode's address space, +* retry the operation of getting a page. +*/ + if (unlikely(pages[i]->mapping != inode->i_mapping)) { + unlock_page(pages[i]); + page_cache_release(pages[i]); + goto again; + } + if (err) { page_cache_release(pages[i]); faili = i - 1; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
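The retry the patch adds (take the page, re-check pages[i]->mapping under the page lock, loop if direct I/O invalidated it in the window) is a generic validate-after-acquire pattern. A toy shell model, with the invalidation race simulated by a counter (everything here is a stand-in, not kernel code):

```shell
# Toy model of the prepare_pages() retry: acquiring the "page" can race
# with an invalidation, so validate after acquiring and retry on failure.
# The first attempt is simulated to fail, the second to succeed.
attempts=0
acquire_and_validate() {
    attempts=$((attempts + 1))
    # hypothetical stand-in for: pages[i]->mapping == inode->i_mapping
    [ "$attempts" -ge 2 ]
}

until acquire_and_validate; do
    : # drop the stale page and go around again
done
echo "took $attempts attempts"   # → took 2 attempts
```

The kernel version additionally unlocks and releases the stale page before jumping back to the `again:` label, which is what keeps the loop free of reference leaks.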
[PATCH V5 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log
In subpagesize-blocksize a page can map multiple extent buffers and hence using (page index, seq) as the search key is incorrect. For example, searching through tree modification log tree can return an entry associated with the first extent buffer mapped by the page (if such an entry exists), when we are actually searching for entries associated with extent buffers that are mapped at position 2 or more in the page. Reviewed-by: Liu BoSigned-off-by: Chandan Rajendra --- fs/btrfs/ctree.c | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 5f745ea..719ed3c 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -311,7 +311,7 @@ struct tree_mod_root { struct tree_mod_elem { struct rb_node node; - u64 index; /* shifted logical */ + u64 logical; u64 seq; enum mod_log_op op; @@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info, /* * key order of the log: - * index -> sequence + * node/leaf start address -> sequence * - * the index is the shifted logical of the *new* root node for root replace - * operations, or the shifted logical of the affected block for all other - * operations. + * The 'start address' is the logical address of the *new* root node + * for root replace operations, or the logical address of the affected + * block for all other operations. * * Note: must be called with write lock (tree_mod_log_write_lock). 
*/ @@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm) while (*new) { cur = container_of(*new, struct tree_mod_elem, node); parent = *new; - if (cur->index < tm->index) + if (cur->logical < tm->logical) new = &((*new)->rb_left); - else if (cur->index > tm->index) + else if (cur->logical > tm->logical) new = &((*new)->rb_right); else if (cur->seq < tm->seq) new = &((*new)->rb_left); @@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot, if (!tm) return NULL; - tm->index = eb->start >> PAGE_CACHE_SHIFT; + tm->logical = eb->start; if (op != MOD_LOG_KEY_ADD) { btrfs_node_key(eb, >key, slot); tm->blockptr = btrfs_node_blockptr(eb, slot); @@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info, goto free_tms; } - tm->index = eb->start >> PAGE_CACHE_SHIFT; + tm->logical = eb->start; tm->slot = src_slot; tm->move.dst_slot = dst_slot; tm->move.nr_items = nr_items; @@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info, goto free_tms; } - tm->index = new_root->start >> PAGE_CACHE_SHIFT; + tm->logical = new_root->start; tm->old_root.logical = old_root->start; tm->old_root.level = btrfs_header_level(old_root); tm->generation = btrfs_header_generation(old_root); @@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq, struct rb_node *node; struct tree_mod_elem *cur = NULL; struct tree_mod_elem *found = NULL; - u64 index = start >> PAGE_CACHE_SHIFT; tree_mod_log_read_lock(fs_info); tm_root = _info->tree_mod_log; node = tm_root->rb_node; while (node) { cur = container_of(node, struct tree_mod_elem, node); - if (cur->index < index) { + if (cur->logical < start) { node = node->rb_left; - } else if (cur->index > index) { + } else if (cur->logical > start) { node = node->rb_right; } else if (cur->seq < min_seq) { node = node->rb_left; @@ -1230,9 +1229,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info, return NULL; /* -* the 
very last operation that's logged for a root is the replacement -* operation (if it is replaced at all). this has the index of the *new* -* root, making it the very first operation that's logged for this root. +* the very last operation that's logged for a root is the +* replacement operation (if it is replaced at all). this has +* the logical address of the *new* root, making it the very +* first operation that's logged for this root. */ while (1) { tm = tree_mod_log_search_oldest(fs_info, root_logical, @@ -1336,7 +1336,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb, if (!next) break;
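The key collision this patch fixes is easy to demonstrate numerically (4 KiB pages, so PAGE_CACHE_SHIFT is 12; a 2 KiB blocksize is an assumed subpage configuration): two extent buffers in the same page shift down to the same page index, so (index, seq) cannot tell them apart, while the raw eb->start can.

```shell
# With 4 KiB pages (PAGE_CACHE_SHIFT = 12) and an assumed 2 KiB
# blocksize, two extent buffers share one page and thus one page index.
page_index() { echo $(( $1 >> 12 )); }

page_index 0      # eb->start = 0    -> index 0
page_index 2048   # eb->start = 2048 -> index 0 again: ambiguous key
page_index 4096   # eb->start = 4096 -> index 1
```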
[PATCH V5 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this by doing reservation/releases in block size units.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/file.c | 44 +++-
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..12ce401 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -499,7 +499,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	loff_t isize = i_size_read(inode);
 
 	start_pos = pos & ~((u64)root->sectorsize - 1);
-	num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
+	num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize);
 
 	end_of_last_block = start_pos + num_bytes - 1;
 	err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
@@ -1362,16 +1362,19 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
 				size_t num_pages, loff_t pos,
+				size_t write_bytes,
 				u64 *lockstart, u64 *lockend,
 				struct extent_state **cached_state)
 {
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 start_pos;
 	u64 last_pos;
 	int i;
 	int ret = 0;
 
-	start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-	last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+	start_pos = round_down(pos, root->sectorsize);
+	last_pos = start_pos
+		+ round_up(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
 	if (start_pos < inode->i_size) {
 		struct btrfs_ordered_extent *ordered;
@@ -1489,6 +1492,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
 	while (iov_iter_count(i) > 0) {
 		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+		size_t sector_offset;
 		size_t write_bytes = min(iov_iter_count(i),
					 nrptrs * (size_t)PAGE_CACHE_SIZE -
					 offset);
@@ -1497,6 +1501,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		size_t reserve_bytes;
 		size_t dirty_pages;
 		size_t copied;
+		size_t dirty_sectors;
+		size_t num_sectors;
 
 		WARN_ON(num_pages > nrptrs);
 
@@ -1509,8 +1515,12 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			break;
 		}
 
-		reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+		sector_offset = pos & (root->sectorsize - 1);
+		reserve_bytes = round_up(write_bytes + sector_offset,
+					 root->sectorsize);
+
 		ret = btrfs_check_data_free_space(inode, reserve_bytes, write_bytes);
 		if (ret == -ENOSPC &&
 		    (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
					      BTRFS_INODE_PREALLOC))) {
@@ -1523,7 +1533,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
			 */
			num_pages = DIV_ROUND_UP(write_bytes + offset,
						 PAGE_CACHE_SIZE);
-			reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+			reserve_bytes = round_up(write_bytes +
						 sector_offset,
						 root->sectorsize);
+
			ret = 0;
		} else {
			ret = -ENOSPC;
@@ -1558,8 +1571,8 @@ again:
			break;
 
		ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
-						      pos, &lockstart, &lockend,
-						      &cached_state);
+						      pos, write_bytes, &lockstart,
+						      &lockend, &cached_state);
		if (ret < 0) {
			if (ret == -EAGAIN)
				goto again;
@@ -1595,9 +1608,14 @@ again:
		 * we still have an outstanding extent for the chunk we actually
		 * managed to copy.
		 */
-		if (num_pages > dirty_pages) {
-			release_bytes = (num_pages - dirty_pages) <<
-					PAGE_CACHE_SHIFT;
+		num_sectors = reserve_bytes >> inode->i_blkbits;
+		dirty_sectors = round_up(copied + sector_offset,
+					 root->sectorsize);
+		dirty_sectors >>=
[PATCH V5 05/13] Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
In subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.

Reviewed-by: Liu Bo
Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 36 +++-
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5301d4e..5e6052d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8659,11 +8659,24 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 	int reserved = 0;
+	u64 reserved_space;
 	u64 page_start;
 	u64 page_end;
+	u64 end;
+
+	reserved_space = PAGE_CACHE_SIZE;
 
 	sb_start_pagefault(inode->i_sb);
-	ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+
+	/*
+	 * Reserving delalloc space after obtaining the page lock can lead to
+	 * deadlock. For example, if a dirty page is locked by this function
+	 * and the call to btrfs_delalloc_reserve_space() ends up triggering
+	 * dirty page write out, then the btrfs_writepage() function could
+	 * end up waiting indefinitely to get a lock on the page currently
+	 * being processed by btrfs_page_mkwrite() function.
+	 */
+	ret = btrfs_delalloc_reserve_space(inode, reserved_space);
 	if (!ret) {
 		ret = file_update_time(vma->vm_file);
 		reserved = 1;
@@ -8684,6 +8697,7 @@ again:
 	size = i_size_read(inode);
 	page_start = page_offset(page);
 	page_end = page_start + PAGE_CACHE_SIZE - 1;
+	end = page_end;
 
 	if ((page->mapping != inode->i_mapping) ||
	    (page_start >= size)) {
@@ -8699,7 +8713,7 @@ again:
	 * we can't set the delalloc bits if there are pending ordered
	 * extents. Drop our locks and wait for them to finish
	 */
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
 	if (ordered) {
		unlock_extent_cached(io_tree, page_start, page_end,
				     &cached_state, GFP_NOFS);
@@ -8709,6 +8723,18 @@ again:
		goto again;
	}
 
+	if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT)) {
+		reserved_space = round_up(size - page_start, root->sectorsize);
+		if (reserved_space < PAGE_CACHE_SIZE) {
+			end = page_start + reserved_space - 1;
+			spin_lock(&BTRFS_I(inode)->lock);
+			BTRFS_I(inode)->outstanding_extents++;
+			spin_unlock(&BTRFS_I(inode)->lock);
+			btrfs_delalloc_release_space(inode,
+					PAGE_CACHE_SIZE - reserved_space);
+		}
+	}
+
	/*
	 * XXX - page_mkwrite gets called every time the page is dirtied, even
	 * if it was already dirty, so for space accounting reasons we need to
@@ -8716,12 +8742,12 @@ again:
	 * is probably a better way to do this, but for now keep consistent with
	 * prepare_pages in the normal write path.
	 */
-	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+	clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, end,
			 EXTENT_DIRTY | EXTENT_DELALLOC |
			 EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 0, 0,
			 &cached_state, GFP_NOFS);
 
-	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+	ret = btrfs_set_extent_delalloc(inode, page_start, end,
					&cached_state);
	if (ret) {
		unlock_extent_cached(io_tree, page_start, page_end,
@@ -8760,7 +8786,7 @@ out_unlock:
	}
	unlock_page(page);
out:
-	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+	btrfs_delalloc_release_space(inode, reserved_space);
out_noreserve:
	sb_end_pagefault(inode->i_sb);
	return ret;
-- 
2.1.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V5 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated
The following issue was observed when running the generic/095 test on the
subpagesize-blocksize patchset.

Assume that we are trying to write a dirty page that maps the file offset
range [159744, 163839].

writepage_delalloc()
  find_lock_delalloc_range(*start = 159744, *end = 0)
    find_delalloc_range()
      Returns range [X, Y] where (X > 163839)
    lock_delalloc_pages()
      One of the pages in range [X, Y] has its dirty flag cleared;
    Loop once more, restricting the delalloc range to span only
    PAGE_CACHE_SIZE bytes;
    find_delalloc_range()
      Returns range [356352, 360447];
    lock_delalloc_pages()
      The page [356352, 360447] has its dirty flag cleared;
    Returns with *start = 159744 and *end = 0;
  *start = *end + 1;
  find_lock_delalloc_range(*start = 1, *end = 0)
    Finds and returns delalloc range [1, 12288];
  cow_file_range()
    Clears delalloc range [1, 12288]
    Creates an ordered extent for range [1, 12288]

The ordered extent thus created breaks the rule that extents have to be
aligned to the filesystem's block size.

In cases where lock_delalloc_pages() fails (either due to the PG_dirty flag
being cleared or the page no longer being a member of the inode's page
cache), this patch sets and returns the delalloc range that was found by
find_delalloc_range().

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/extent_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0ee486a..3912d1f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1731,6 +1731,8 @@ again:
 			goto again;
 		} else {
 			found = 0;
+			*start = delalloc_start;
+			*end = delalloc_end;
 			goto out_failed;
 		}
 	}
-- 
2.1.0
[PATCH V5 09/13] Btrfs: Limit inline extents to root->sectorsize
cow_file_range_inline() limits the size of an inline extent to
PAGE_CACHE_SIZE. This breaks in subpagesize-blocksize scenarios. Fix this by
comparing against root->sectorsize.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b1ceba4..b2eedb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -257,7 +257,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 		data_len = compressed_size;
 
 	if (start > 0 ||
-	    actual_end > PAGE_CACHE_SIZE ||
+	    actual_end > root->sectorsize ||
 	    data_len > BTRFS_MAX_INLINE_DATA_SIZE(root) ||
 	    (!compressed_size &&
	    (actual_end & (root->sectorsize - 1)) == 0) ||
-- 
2.1.0
[PATCH V5 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
In subpagesize-blocksize scenario, map_length can be less than the length of
a bio vector. Such a condition may cause btrfs_submit_direct_hook() to submit
a zero-length bio. Fix this by comparing map_length against the block size
rather than bv_len.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4fbe9de..b1ceba4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8148,9 +8148,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 file_offset = dip->logical_offset;
 	u64 submit_len = 0;
 	u64 map_length;
-	int nr_pages = 0;
-	int ret;
+	u32 blocksize = root->sectorsize;
 	int async_submit = 0;
+	int nr_sectors;
+	int ret;
+	int i;
 
 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -8180,9 +8182,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	atomic_inc(&dip->pending_bios);
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
-		if (map_length < submit_len + bvec->bv_len ||
-		    bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				 bvec->bv_offset) < bvec->bv_len) {
+		nr_sectors = bvec->bv_len >> inode->i_blkbits;
+		i = 0;
+next_block:
+		if (unlikely(map_length < submit_len + blocksize ||
+			     bio_add_page(bio, bvec->bv_page, blocksize,
					  bvec->bv_offset + (i * blocksize)) < blocksize)) {
 			/*
 			 * inc the count before we submit the bio so
 			 * we know the end IO handler won't happen before
@@ -8203,7 +8208,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 			file_offset += submit_len;
 			submit_len = 0;
-			nr_pages = 0;
 
 			bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
						  start_sector, GFP_NOFS);
@@ -8221,9 +8225,14 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
				bio_put(bio);
				goto out_err;
			}
+
+			goto next_block;
 		} else {
-			submit_len += bvec->bv_len;
-			nr_pages++;
+			submit_len += blocksize;
+			if (--nr_sectors) {
+				i++;
+				goto next_block;
+			}
			bvec++;
		}
	}
-- 
2.1.0
[PATCH V5 04/13] Btrfs: fallocate: Work with sectorsized blocks
While at it, this commit changes btrfs_truncate_page() to truncate
sectorsized blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  | 47 +--
 fs/btrfs/inode.c | 52 +++-
 3 files changed, 53 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..99a0fff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3893,7 +3893,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
			struct btrfs_root *root,
			struct inode *dir, u64 objectid,
			const char *name, int name_len);
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
			int front);
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
			       struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 12ce401..360d56d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2280,23 +2280,26 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	u64 tail_len;
 	u64 orig_start = offset;
 	u64 cur_offset;
+	unsigned char blocksize_bits;
 	u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
 	u64 drop_end;
 	int ret = 0;
 	int err = 0;
 	int rsv_count;
-	bool same_page;
+	bool same_block;
 	bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
 	u64 ino_size;
-	bool truncated_page = false;
+	bool truncated_block = false;
 	bool updated_inode = false;
 
+	blocksize_bits = inode->i_blkbits;
+
 	ret = btrfs_wait_ordered_range(inode, offset, len);
 	if (ret)
 		return ret;
 
 	mutex_lock(&inode->i_mutex);
-	ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE);
+	ino_size = round_up(inode->i_size, root->sectorsize);
 	ret = find_first_non_hole(inode, &offset, &len);
 	if (ret < 0)
 		goto out_only_mutex;
@@ -2309,31 +2312,30 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
 	lockend = round_down(offset + len,
			     BTRFS_I(inode)->root->sectorsize) - 1;
-	same_page = ((offset >> PAGE_CACHE_SHIFT) ==
-		    ((offset + len - 1) >> PAGE_CACHE_SHIFT));
-
+	same_block = ((offset >> blocksize_bits)
+		      == ((offset + len - 1) >> blocksize_bits));
 	/*
-	 * We needn't truncate any page which is beyond the end of the file
+	 * We needn't truncate any block which is beyond the end of the file
	 * because we are sure there is no data there.
	 */
	/*
-	 * Only do this if we are in the same page and we aren't doing the
-	 * entire page.
+	 * Only do this if we are in the same block and we aren't doing the
+	 * entire block.
	 */
-	if (same_page && len < PAGE_CACHE_SIZE) {
+	if (same_block && len < root->sectorsize) {
		if (offset < ino_size) {
-			truncated_page = true;
-			ret = btrfs_truncate_page(inode, offset, len, 0);
+			truncated_block = true;
+			ret = btrfs_truncate_block(inode, offset, len, 0);
		} else {
			ret = 0;
		}
		goto out_only_mutex;
	}
 
-	/* zero back part of the first page */
+	/* zero back part of the first block */
	if (offset < ino_size) {
-		truncated_page = true;
-		ret = btrfs_truncate_page(inode, offset, 0, 0);
+		truncated_block = true;
+		ret = btrfs_truncate_block(inode, offset, 0, 0);
		if (ret) {
			mutex_unlock(&inode->i_mutex);
			return ret;
@@ -2368,9 +2370,10 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
		if (!ret) {
			/* zero the front end of the last page */
			if (tail_start + tail_len < ino_size) {
-				truncated_page = true;
-				ret = btrfs_truncate_page(inode,
-					tail_start + tail_len, 0, 1);
+				truncated_block = true;
+				ret = btrfs_truncate_block(inode,
							tail_start + tail_len,
							0, 1);
				if (ret)
[PATCH V5 02/13] Btrfs: Compute and look up csums based on sectorsized blocks
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works on machines
where sectorsize == PAGE_SIZE. This patch makes the checksum computation and
look-up code work with sectorsize units.

Reviewed-by: Liu Bo
Reviewed-by: Josef Bacik
Signed-off-by: Chandan Rajendra
---
 fs/btrfs/file-item.c | 93 +---
 1 file changed, 59 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 58ece65..818c859 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	u64 item_start_offset = 0;
 	u64 item_last_offset = 0;
 	u64 disk_bytenr;
+	u64 page_bytes_left;
 	u32 diff;
 	int nblocks;
 	int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 	disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
 	if (dio)
 		offset = logical_offset;
+
+	page_bytes_left = bvec->bv_len;
 	while (bio_index < bio->bi_vcnt) {
 		if (!dio)
 			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 				if (BTRFS_I(inode)->root->root_key.objectid ==
				    BTRFS_DATA_RELOC_TREE_OBJECTID) {
					set_extent_bits(io_tree, offset,
-						offset + bvec->bv_len - 1,
+						offset + root->sectorsize - 1,
						EXTENT_NODATASUM, GFP_NOFS);
				} else {
					btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
found:
 		csum += count * csum_size;
 		nblocks -= count;
-		bio_index += count;
+
 		while (count--) {
-			disk_bytenr += bvec->bv_len;
-			offset += bvec->bv_len;
-			bvec++;
+			disk_bytenr += root->sectorsize;
+			offset += root->sectorsize;
+			page_bytes_left -= root->sectorsize;
+			if (!page_bytes_left) {
+				bio_index++;
+				bvec++;
+				page_bytes_left = bvec->bv_len;
+			}
		}
	}
	btrfs_free_path(path);
@@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 	struct bio_vec *bvec = bio->bi_io_vec;
 	int bio_index = 0;
 	int index;
+	int nr_sectors;
+	int i;
 	unsigned long total_bytes = 0;
 	unsigned long this_sum_bytes = 0;
 	u64 offset;
@@ -451,7 +462,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 	offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 	ordered = btrfs_lookup_ordered_extent(inode, offset);
-	BUG_ON(!ordered); /* Logic error */
+	ASSERT(ordered); /* Logic error */
 	sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
 	index = 0;
@@ -459,41 +470,55 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 		if (!contig)
			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
-		if (offset >= ordered->file_offset + ordered->len ||
-		    offset < ordered->file_offset) {
-			unsigned long bytes_left;
-			sums->len = this_sum_bytes;
-			this_sum_bytes = 0;
-			btrfs_add_ordered_sum(inode, ordered, sums);
-			btrfs_put_ordered_extent(ordered);
+		data = kmap_atomic(bvec->bv_page);
 
-			bytes_left = bio->bi_iter.bi_size - total_bytes;
+		nr_sectors = (bvec->bv_len + root->sectorsize - 1)
+			     >> inode->i_blkbits;
+
+		for (i = 0; i < nr_sectors; i++) {
+			if (offset >= ordered->file_offset + ordered->len ||
+			    offset < ordered->file_offset) {
+				unsigned long bytes_left;
+
+				kunmap_atomic(data);
+				sums->len = this_sum_bytes;
+				this_sum_bytes = 0;
+				btrfs_add_ordered_sum(inode, ordered, sums);
+				btrfs_put_ordered_extent(ordered);
+
+				bytes_left =
[PATCH V5 03/13] Btrfs: Direct I/O read: Work on sectorsized blocks
The direct I/O read's endio and corresponding repair functions work on page
sized blocks. This commit adds the ability for direct I/O read to work on
subpagesized blocks.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 96 ++--
 1 file changed, 73 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b7e439b..5a47731 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7664,9 +7664,9 @@ static int btrfs_check_dio_repairable(struct inode *inode,
 }
 
 static int dio_read_error(struct inode *inode, struct bio *failed_bio,
-			struct page *page, u64 start, u64 end,
-			int failed_mirror, bio_end_io_t *repair_endio,
-			void *repair_arg)
+			struct page *page, unsigned int pgoff,
+			u64 start, u64 end, int failed_mirror,
+			bio_end_io_t *repair_endio, void *repair_arg)
 {
 	struct io_failure_record *failrec;
 	struct bio *bio;
@@ -7687,7 +7687,9 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
 		return -EIO;
 	}
 
-	if (failed_bio->bi_vcnt > 1)
+	if ((failed_bio->bi_vcnt > 1)
+	    || (failed_bio->bi_io_vec->bv_len
+		> BTRFS_I(inode)->root->sectorsize))
 		read_mode = READ_SYNC | REQ_FAILFAST_DEV;
 	else
 		read_mode = READ_SYNC;
@@ -7695,7 +7697,7 @@ static int dio_read_error(struct inode *inode, struct bio *failed_bio,
 	isector = start - btrfs_io_bio(failed_bio)->logical;
 	isector >>= inode->i_sb->s_blocksize_bits;
 	bio = btrfs_create_repair_bio(inode, failed_bio, failrec, page,
-				      0, isector, repair_endio, repair_arg);
+				      pgoff, isector, repair_endio, repair_arg);
 	if (!bio) {
 		free_io_failure(inode, failrec);
 		return -EIO;
@@ -7725,12 +7727,17 @@ struct btrfs_retry_complete {
 static void btrfs_retry_endio_nocsum(struct bio *bio, int err)
 {
 	struct btrfs_retry_complete *done = bio->bi_private;
+	struct inode *inode;
 	struct bio_vec *bvec;
 	int i;
 
 	if (err)
 		goto end;
 
+	ASSERT(bio->bi_vcnt == 1);
+	inode = bio->bi_io_vec->bv_page->mapping->host;
+	ASSERT(bio->bi_io_vec->bv_len == BTRFS_I(inode)->root->sectorsize);
+
 	done->uptodate = 1;
 	bio_for_each_segment_all(bvec, bio, i)
 		clean_io_failure(done->inode, done->start, bvec->bv_page, 0);
@@ -7745,22 +7752,30 @@ static int __btrfs_correct_data_nocsum(struct inode *inode,
 	struct bio_vec *bvec;
 	struct btrfs_retry_complete done;
 	u64 start;
+	unsigned int pgoff;
+	u32 sectorsize;
+	int nr_sectors;
 	int i;
 	int ret;
 
+	sectorsize = BTRFS_I(inode)->root->sectorsize;
+
 	start = io_bio->logical;
 	done.inode = inode;
 
 	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
-try_again:
+		nr_sectors = bvec->bv_len >> inode->i_blkbits;
+		pgoff = bvec->bv_offset;
+
+next_block_or_try_again:
 		done.uptodate = 0;
 		done.start = start;
 		init_completion(&done.done);
 
-		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page, start,
-				     start + bvec->bv_len - 1,
-				     io_bio->mirror_num,
-				     btrfs_retry_endio_nocsum, &done);
+		ret = dio_read_error(inode, &io_bio->bio, bvec->bv_page,
+				     pgoff, start, start + sectorsize - 1,
+				     io_bio->mirror_num,
+				     btrfs_retry_endio_nocsum, &done);
 		if (ret)
 			return ret;
@@ -7768,10 +7783,15 @@ try_again:
 
 		if (!done.uptodate) {
 			/* We might have another mirror, so try again */
-			goto try_again;
+			goto next_block_or_try_again;
 		}
 
-		start += bvec->bv_len;
+		start += sectorsize;
+
+		if (nr_sectors--) {
+			pgoff += sectorsize;
+			goto next_block_or_try_again;
+		}
 	}
 
 	return 0;
@@ -7781,7 +7801,9 @@ static void btrfs_retry_endio(struct bio *bio, int err)
 {
 	struct btrfs_retry_complete *done = bio->bi_private;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	struct inode *inode;
 	struct bio_vec *bvec;
+	u64 start;
 	int uptodate;
 	int ret;
 	int i;
@@ -7790,13 +7812,20 @@ static void btrfs_retry_endio(struct bio *bio, int err)
 		goto end;
 
 	uptodate = 1;
+
+	start = done->start;
+
[PATCH V5 06/13] Btrfs: Search for all ordered extents that could span across a page
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents present
across the page. Fix this.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/extent_io.c |  3 ++-
 fs/btrfs/inode.c     | 25 ++---
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 11aa8f7..0ee486a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3224,7 +3224,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 
 	while (1) {
 		lock_extent(tree, start, end);
-		ordered = btrfs_lookup_ordered_extent(inode, start);
+		ordered = btrfs_lookup_ordered_range(inode, start,
+						     PAGE_CACHE_SIZE);
 		if (!ordered)
 			break;
 		unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e6052d..4fbe9de 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1975,7 +1975,8 @@ again:
 	if (PagePrivate2(page))
 		goto out;
 
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start,
+					     PAGE_CACHE_SIZE);
 	if (ordered) {
 		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
				     page_end, &cached_state, GFP_NOFS);
@@ -8552,6 +8553,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	struct extent_state *cached_state = NULL;
 	u64 page_start = page_offset(page);
 	u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
+	u64 start;
+	u64 end;
 	int inode_evicting = inode->i_state & I_FREEING;
 
 	/*
@@ -8571,14 +8574,18 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	if (!inode_evicting)
 		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+again:
+	start = page_start;
+	ordered = btrfs_lookup_ordered_range(inode, start,
+					     page_end - start + 1);
 	if (ordered) {
+		end = min(page_end, ordered->file_offset + ordered->len - 1);
 		/*
 		 * IO on this page will never be started, so we need
 		 * to account for any ordered extents now
 		 */
 		if (!inode_evicting)
-			clear_extent_bit(tree, page_start, page_end,
+			clear_extent_bit(tree, start, end,
					 EXTENT_DIRTY | EXTENT_DELALLOC |
					 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
					 EXTENT_DEFRAG, 1, 0, &cached_state,
@@ -8595,22 +8602,26 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
			spin_lock_irq(&tree->lock);
			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			new_len = page_start - ordered->file_offset;
+			new_len = start - ordered->file_offset;
			if (new_len < ordered->truncated_len)
				ordered->truncated_len = new_len;
			spin_unlock_irq(&tree->lock);
			if (btrfs_dec_test_ordered_pending(inode, &ordered,
-							   page_start,
-							   PAGE_CACHE_SIZE, 1))
+							   start,
+							   end - start + 1, 1))
				btrfs_finish_ordered_io(ordered);
		}
		btrfs_put_ordered_extent(ordered);
		if (!inode_evicting) {
			cached_state = NULL;
-			lock_extent_bits(tree, page_start, page_end, 0,
+			lock_extent_bits(tree, start, end, 0,
					 &cached_state);
		}
+
+		start = end + 1;
+		if (start < page_end)
+			goto again;
	}
 
	if (!inode_evicting) {
-- 
2.1.0
[PATCH V5 11/13] Btrfs: Clean pte corresponding to page straddling i_size
When extending a file by either "truncate up" or by writing beyond i_size,
the page which had i_size needs to be marked "read only" so that future
writes to the page via the mmap interface cause btrfs_page_mkwrite() to be
invoked. If not, a write performed after extending the file via the mmap
interface will find the page writeable and continue writing to the page
without invoking btrfs_page_mkwrite(), i.e. we end up writing to a file
without reserving disk space.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/file.c  | 12 ++--
 fs/btrfs/inode.c |  2 +-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 360d56d..5715e29 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1757,6 +1757,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	ssize_t err;
 	loff_t pos;
 	size_t count;
+	loff_t oldsize;
+	int clean_page = 0;
 
 	mutex_lock(&inode->i_mutex);
 	err = generic_write_checks(iocb, from);
@@ -1795,14 +1797,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	pos = iocb->ki_pos;
 	count = iov_iter_count(from);
 	start_pos = round_down(pos, root->sectorsize);
-	if (start_pos > i_size_read(inode)) {
+	oldsize = i_size_read(inode);
+	if (start_pos > oldsize) {
 		/* Expand hole size to cover write data, preventing empty gap */
 		end_pos = round_up(pos + count, root->sectorsize);
-		err = btrfs_cont_expand(inode, i_size_read(inode), end_pos);
+		err = btrfs_cont_expand(inode, oldsize, end_pos);
 		if (err) {
 			mutex_unlock(&inode->i_mutex);
 			goto out;
 		}
+		if (start_pos > round_up(oldsize, root->sectorsize))
+			clean_page = 1;
 	}
 
 	if (sync)
@@ -1814,6 +1819,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		num_written = __btrfs_buffered_write(file, from, pos);
 		if (num_written > 0)
 			iocb->ki_pos = pos + num_written;
+		if (clean_page)
+			pagecache_isize_extended(inode, oldsize,
						 i_size_read(inode));
 	}
 
 	mutex_unlock(&inode->i_mutex);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c937357..f31da87 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4853,7 +4853,6 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 	}
 
 	if (newsize > oldsize) {
-		truncate_pagecache(inode, newsize);
 		/*
 		 * Don't do an expanding truncate while snapshoting is ongoing.
 		 * This is to ensure the snapshot captures a fully consistent
@@ -4876,6 +4875,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		i_size_write(inode, newsize);
 		btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL);
+		pagecache_isize_extended(inode, oldsize, newsize);
 		ret = btrfs_update_inode(trans, root, inode);
 		btrfs_end_write_no_snapshoting(root);
 		btrfs_end_transaction(trans, root);
-- 
2.1.0
Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg
On Wed, Sep 30, 2015 at 9:26 AM, Zhao Leiwrote: > Hi, Filipe Manana > >> -Original Message- >> From: Filipe Manana [mailto:fdman...@gmail.com] >> Sent: Wednesday, September 30, 2015 3:41 PM >> To: Zhao Lei >> Cc: linux-btrfs@vger.kernel.org >> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by balance bg >> >> On Wed, Sep 30, 2015 at 5:20 AM, Zhao Lei wrote: >> > Hi, Filipe Manana >> > >> > Thanks for reviewing. >> > >> >> -Original Message- >> >> From: Filipe Manana [mailto:fdman...@gmail.com] >> >> Sent: Tuesday, September 29, 2015 11:48 PM >> >> To: Zhao Lei >> >> Cc: linux-btrfs@vger.kernel.org >> >> Subject: Re: [PATCH 2/2] btrfs: Fix lost-data-profile caused by >> >> balance bg >> >> >> >> On Tue, Sep 29, 2015 at 2:51 PM, Zhao Lei wrote: >> >> > Reproduce: >> >> > (In integration-4.3 branch) >> >> > >> >> > TEST_DEV=(/dev/vdg /dev/vdh) >> >> > TEST_DIR=/mnt/tmp >> >> > >> >> > umount "$TEST_DEV" >/dev/null >> >> > mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" >> >> > >> >> > mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" >> >> > btrfs balance start -dusage=0 $TEST_DIR btrfs filesystem usage >> >> > $TEST_DIR >> >> > >> >> > dd if=/dev/zero of="$TEST_DIR"/file count=100 btrfs filesystem >> >> > usage $TEST_DIR >> >> > >> >> > Result: >> >> > We can see "no data chunk" in first "btrfs filesystem usage": >> >> > # btrfs filesystem usage $TEST_DIR >> >> > Overall: >> >> > ... 
>> >> > Metadata,single: Size:8.00MiB, Used:0.00B >> >> > /dev/vdg8.00MiB >> >> > Metadata,RAID1: Size:122.88MiB, Used:112.00KiB >> >> > /dev/vdg 122.88MiB >> >> > /dev/vdh 122.88MiB >> >> > System,single: Size:4.00MiB, Used:0.00B >> >> > /dev/vdg4.00MiB >> >> > System,RAID1: Size:8.00MiB, Used:16.00KiB >> >> > /dev/vdg8.00MiB >> >> > /dev/vdh8.00MiB >> >> > Unallocated: >> >> > /dev/vdg1.06GiB >> >> > /dev/vdh1.07GiB >> >> > >> >> > And "data chunks changed from raid1 to single" in second "btrfs >> >> > filesystem usage": >> >> > # btrfs filesystem usage $TEST_DIR >> >> > Overall: >> >> > ... >> >> > Data,single: Size:256.00MiB, Used:0.00B >> >> > /dev/vdh 256.00MiB >> >> > Metadata,single: Size:8.00MiB, Used:0.00B >> >> > /dev/vdg8.00MiB >> >> > Metadata,RAID1: Size:122.88MiB, Used:112.00KiB >> >> > /dev/vdg 122.88MiB >> >> > /dev/vdh 122.88MiB >> >> > System,single: Size:4.00MiB, Used:0.00B >> >> > /dev/vdg4.00MiB >> >> > System,RAID1: Size:8.00MiB, Used:16.00KiB >> >> > /dev/vdg8.00MiB >> >> > /dev/vdh8.00MiB >> >> > Unallocated: >> >> > /dev/vdg1.06GiB >> >> > /dev/vdh 841.92MiB >> >> > >> >> > Reason: >> >> > btrfs balance delete last data chunk in case of no data in the >> >> > filesystem, then we can see "no data chunk" by "fi usage" >> >> > command. >> >> > >> >> > And when we do write operation to fs, the only available data >> >> > profile is 0x0, result is all new chunks are allocated single type. >> >> > >> >> > Fix: >> >> > Allocate a data chunk explicitly in balance operation, to reserve >> >> > at least one data chunk in the filesystem. >> >> >> >> Allocate a data chunk explicitly to ensure we don't lose the raid profile >> >> for >> data. >> >> >> > >> > Thanks, will change in v2. >> > >> >> > >> >> > Test: >> >> > Test by above script, and confirmed the logic by debug output. 
>> >> > >> >> > Signed-off-by: Zhao Lei >> >> > --- >> >> > fs/btrfs/volumes.c | 19 +++ >> >> > 1 file changed, 19 insertions(+) >> >> > >> >> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index >> >> > 6fc73586..3d5e41e 100644 >> >> > --- a/fs/btrfs/volumes.c >> >> > +++ b/fs/btrfs/volumes.c >> >> > @@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct >> >> > btrfs_fs_info >> >> *fs_info) >> >> > u64 limit_data = bctl->data.limit; >> >> > u64 limit_meta = bctl->meta.limit; >> >> > u64 limit_sys = bctl->sys.limit; >> >> > + int chunk_reserved = 0; >> >> > >> >> > /* step one make some room on all the devices */ >> >> > devices = _info->fs_devices->devices; @@ -3387,6 >> >> > +3388,24 @@ again: >> >> > goto loop; >> >> > } >> >> > >> >> > + if (!chunk_reserved) { >> >> > + trans = btrfs_start_transaction(chunk_root, >> 0); >> >> > + if (IS_ERR(trans)) { >> >> > + >> >> mutex_unlock(_info->delete_unused_bgs_mutex); >> >> > + ret = PTR_ERR(trans); >> >> > + goto error; >> >> > + } >> >> > + >> >> > + ret = btrfs_force_chunk_alloc(trans, >> >> > + chunk_root, 1); >> >> >> >> Can
[RFC PATCH V4 10/13] Btrfs: Fix block size returned to user space
btrfs_getattr() returns PAGE_CACHE_SIZE as the block size. Since
generic_fillattr() already does the right thing (by obtaining the block
size from inode->i_blkbits), just remove the statement from
btrfs_getattr().

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2eedb9..c937357 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9197,7 +9197,6 @@ static int btrfs_getattr(struct vfsmount *mnt,
 
 	generic_fillattr(inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
-	stat->blksize = PAGE_CACHE_SIZE;
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->delalloc_bytes;
-- 
2.1.0
[RFC PATCH V4 08/13] Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
In the subpagesize-blocksize scenario, map_length can be less than the
length of a bio vector. Such a condition may cause
btrfs_submit_direct_hook() to submit a zero length bio. Fix this by
comparing map_length against the block size rather than against bv_len.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 25 +++++++++++++++---------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4fbe9de..b1ceba4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8148,9 +8148,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 file_offset = dip->logical_offset;
 	u64 submit_len = 0;
 	u64 map_length;
-	int nr_pages = 0;
-	int ret;
+	u32 blocksize = root->sectorsize;
 	int async_submit = 0;
+	int nr_sectors;
+	int ret;
+	int i;
 
 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -8180,9 +8182,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	atomic_inc(&dip->pending_bios);
 
 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
-		if (map_length < submit_len + bvec->bv_len ||
-		    bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				 bvec->bv_offset) < bvec->bv_len) {
+		nr_sectors = bvec->bv_len >> inode->i_blkbits;
+		i = 0;
+next_block:
+		if (unlikely(map_length < submit_len + blocksize ||
+		    bio_add_page(bio, bvec->bv_page, blocksize,
+			    bvec->bv_offset + (i * blocksize)) < blocksize)) {
 			/*
 			 * inc the count before we submit the bio so
 			 * we know the end IO handler won't happen before
@@ -8203,7 +8208,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 
 			file_offset += submit_len;
 			submit_len = 0;
-			nr_pages = 0;
 
 			bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
 						  start_sector, GFP_NOFS);
@@ -8221,9 +8225,14 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 				bio_put(bio);
 				goto out_err;
 			}
+
+			goto next_block;
 		} else {
-			submit_len += bvec->bv_len;
-			nr_pages++;
+			submit_len += blocksize;
+			if (--nr_sectors) {
+				i++;
+				goto next_block;
+			}
 			bvec++;
 		}
 	}
-- 
2.1.0
[RFC PATCH V4 02/13] Btrfs: Compute and look up csums based on sectorsized blocks
Checksums are applicable to sectorsize units. The current code uses bio->bv_len units to compute and look up checksums. This works on machines where sectorsize == PAGE_SIZE. This patch makes the checksum computation and look up code to work with sectorsize units. Reviewed-by: Liu BoReviewed-by: Josef Bacik Signed-off-by: Chandan Rajendra --- fs/btrfs/file-item.c | 93 +--- 1 file changed, 59 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 58ece65..818c859 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, u64 item_start_offset = 0; u64 item_last_offset = 0; u64 disk_bytenr; + u64 page_bytes_left; u32 diff; int nblocks; int bio_index = 0; @@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, disk_bytenr = (u64)bio->bi_iter.bi_sector << 9; if (dio) offset = logical_offset; + + page_bytes_left = bvec->bv_len; while (bio_index < bio->bi_vcnt) { if (!dio) offset = page_offset(bvec->bv_page) + bvec->bv_offset; @@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, if (BTRFS_I(inode)->root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) { set_extent_bits(io_tree, offset, - offset + bvec->bv_len - 1, + offset + root->sectorsize - 1, EXTENT_NODATASUM, GFP_NOFS); } else { btrfs_info(BTRFS_I(inode)->root->fs_info, @@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, found: csum += count * csum_size; nblocks -= count; - bio_index += count; + while (count--) { - disk_bytenr += bvec->bv_len; - offset += bvec->bv_len; - bvec++; + disk_bytenr += root->sectorsize; + offset += root->sectorsize; + page_bytes_left -= root->sectorsize; + if (!page_bytes_left) { + bio_index++; + bvec++; + page_bytes_left = bvec->bv_len; + } + } } btrfs_free_path(path); @@ -432,6 +441,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode, struct bio_vec *bvec = 
bio->bi_io_vec; int bio_index = 0; int index; + int nr_sectors; + int i; unsigned long total_bytes = 0; unsigned long this_sum_bytes = 0; u64 offset; @@ -451,7 +462,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode, offset = page_offset(bvec->bv_page) + bvec->bv_offset; ordered = btrfs_lookup_ordered_extent(inode, offset); - BUG_ON(!ordered); /* Logic error */ + ASSERT(ordered); /* Logic error */ sums->bytenr = (u64)bio->bi_iter.bi_sector << 9; index = 0; @@ -459,41 +470,55 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode, if (!contig) offset = page_offset(bvec->bv_page) + bvec->bv_offset; - if (offset >= ordered->file_offset + ordered->len || - offset < ordered->file_offset) { - unsigned long bytes_left; - sums->len = this_sum_bytes; - this_sum_bytes = 0; - btrfs_add_ordered_sum(inode, ordered, sums); - btrfs_put_ordered_extent(ordered); + data = kmap_atomic(bvec->bv_page); - bytes_left = bio->bi_iter.bi_size - total_bytes; + nr_sectors = (bvec->bv_len + root->sectorsize - 1) + >> inode->i_blkbits; + + for (i = 0; i < nr_sectors; i++) { + if (offset >= ordered->file_offset + ordered->len || + offset < ordered->file_offset) { + unsigned long bytes_left; + + kunmap_atomic(data); + sums->len = this_sum_bytes; + this_sum_bytes = 0; + btrfs_add_ordered_sum(inode, ordered, sums); + btrfs_put_ordered_extent(ordered); + + bytes_left =
[RFC PATCH V4 00/13] Btrfs: Pre subpagesize-blocksize cleanups
The patches posted along with this cover letter are cleanups made during
the development of the subpagesize-blocksize patchset. I believe they
can be integrated with the mainline kernel, hence I have posted them
separately from the subpagesize-blocksize patchset.

I have tested the patchset by running xfstests on ppc64 and x86_64. On
ppc64, some of the Btrfs-specific tests and generic/255 fail because
they assume 4K as the filesystem's block size. I have fixed some of the
test cases. I will fix the rest and mail them to the fstests mailing
list in the near future.

Changes from V3:
Two new issues have been fixed by the patches,
1. Btrfs: prepare_pages: Retry adding a page to the page cache.
2. Btrfs: Return valid delalloc range when the page does not have
   PG_Dirty flag set or has been invalidated.
IMHO, the above issues are also applicable to the "page size == block
size" scenario, but for reasons unknown to me they aren't seen even when
the tests are run for a long time.

Changes from V2:
1. For detecting logical errors, use ASSERT() calls instead of calls to
   BUG_ON().
2. In the patch "Btrfs: Compute and look up csums based on sectorsized
   blocks", fix the usage of kmap_atomic()/kunmap_atomic() such that
   between the kmap_atomic() and kunmap_atomic() calls we do not invoke
   any function that might cause the current task to sleep.

Changes from V1:
1. Call round_[down,up]() functions instead of doing hard coded
   alignment.
Chandan Rajendra (13):
  Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to
    block size
  Btrfs: Compute and look up csums based on sectorsized blocks
  Btrfs: Direct I/O read: Work on sectorsized blocks
  Btrfs: fallocate: Work with sectorsized blocks
  Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
  Btrfs: Search for all ordered extents that could span across a page
  Btrfs: Use (eb->start, seq) as search key for tree modification log
  Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
  Btrfs: Limit inline extents to root->sectorsize
  Btrfs: Fix block size returned to user space
  Btrfs: Clean pte corresponding to page straddling i_size
  Btrfs: prepare_pages: Retry adding a page to the page cache
  Btrfs: Return valid delalloc range when the page does not have
    PG_Dirty flag set or has been invalidated

 fs/btrfs/ctree.c     |  34 ++---
 fs/btrfs/ctree.h     |   2 +-
 fs/btrfs/extent_io.c |   5 +-
 fs/btrfs/file-item.c |  93 +++++++-----
 fs/btrfs/file.c      | 119 ++++++++-----
 fs/btrfs/inode.c     | 239 ++++++++++++---------
 6 files changed, 331 insertions(+), 161 deletions(-)

-- 
2.1.0
[RFC PATCH V4 11/13] Btrfs: Clean pte corresponding to page straddling i_size
When extending a file by either "truncate up" or by writing beyond
i_size, the page which contained i_size needs to be marked "read only"
so that future writes to the page via the mmap interface cause
btrfs_page_mkwrite() to be invoked. If not, a write performed after
extending the file via the mmap interface will find the page to be
writeable and continue writing to the page without invoking
btrfs_page_mkwrite(), i.e. we end up writing to a file without
reserving disk space.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/file.c  | 12 ++++++++++--
 fs/btrfs/inode.c |  2 +-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 360d56d..5715e29 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1757,6 +1757,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	ssize_t err;
 	loff_t pos;
 	size_t count;
+	loff_t oldsize;
+	int clean_page = 0;
 
 	mutex_lock(&inode->i_mutex);
 	err = generic_write_checks(iocb, from);
@@ -1795,14 +1797,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	pos = iocb->ki_pos;
 	count = iov_iter_count(from);
 	start_pos = round_down(pos, root->sectorsize);
-	if (start_pos > i_size_read(inode)) {
+	oldsize = i_size_read(inode);
+	if (start_pos > oldsize) {
 		/* Expand hole size to cover write data, preventing empty gap */
 		end_pos = round_up(pos + count, root->sectorsize);
-		err = btrfs_cont_expand(inode, i_size_read(inode), end_pos);
+		err = btrfs_cont_expand(inode, oldsize, end_pos);
 		if (err) {
 			mutex_unlock(&inode->i_mutex);
 			goto out;
 		}
+		if (start_pos > round_up(oldsize, root->sectorsize))
+			clean_page = 1;
 	}
 
 	if (sync)
@@ -1814,6 +1819,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 		num_written = __btrfs_buffered_write(file, from, pos);
 		if (num_written > 0)
 			iocb->ki_pos = pos + num_written;
+		if (clean_page)
+			pagecache_isize_extended(inode, oldsize,
+						 i_size_read(inode));
 	}
 
 	mutex_unlock(&inode->i_mutex);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c937357..f31da87 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4853,7 +4853,6 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 	}
 
 	if (newsize > oldsize) {
-		truncate_pagecache(inode, newsize);
 		/*
 		 * Don't do an expanding truncate while snapshoting is ongoing.
 		 * This is to ensure the snapshot captures a fully consistent
@@ -4876,6 +4875,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 
 		i_size_write(inode, newsize);
 		btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL);
+		pagecache_isize_extended(inode, oldsize, newsize);
 		ret = btrfs_update_inode(trans, root, inode);
 		btrfs_end_write_no_snapshoting(root);
 		btrfs_end_transaction(trans, root);
-- 
2.1.0
[RFC PATCH V4 13/13] Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated
The following issue was observed when running the generic/095 test on
the subpagesize-blocksize patchset.

Assume that we are trying to write a dirty page that is mapping file
offset range [159744, 163839].

writepage_delalloc()
  find_lock_delalloc_range(*start = 159744, *end = 0)
    find_delalloc_range()
      Returns range [X, Y] where (X > 163839)
    lock_delalloc_pages()
      One of the pages in range [X, Y] has its dirty flag cleared;
      Loop once more restricting the delalloc range to span only
      PAGE_CACHE_SIZE bytes;
    find_delalloc_range()
      Returns range [356352, 360447];
    lock_delalloc_pages()
      The page [356352, 360447] has its dirty flag cleared;
  Returns with *start = 159744 and *end = 0;
  *start = *end + 1;
  find_lock_delalloc_range(*start = 1, *end = 0)
    Finds and returns delalloc range [1, 12288];
  cow_file_range()
    Clears delalloc range [1, 12288];
    Creates an ordered extent for the range [1, 12288];

The ordered extent thus created above breaks the rule that extents have
to be aligned to the filesystem's block size.

In cases where lock_delalloc_pages() fails (either due to the PG_dirty
flag being cleared or the page no longer being a member of the inode's
page cache), this patch sets and returns the delalloc range that was
found by find_delalloc_range().

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/extent_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0ee486a..3912d1f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1731,6 +1731,8 @@ again:
 			goto again;
 		} else {
 			found = 0;
+			*start = delalloc_start;
+			*end = delalloc_end;
 			goto out_failed;
 		}
 	}
-- 
2.1.0
[RFC PATCH V4 07/13] Btrfs: Use (eb->start, seq) as search key for tree modification log
In subpagesize-blocksize a page can map multiple extent buffers and hence using (page index, seq) as the search key is incorrect. For example, searching through tree modification log tree can return an entry associated with the first extent buffer mapped by the page (if such an entry exists), when we are actually searching for entries associated with extent buffers that are mapped at position 2 or more in the page. Reviewed-by: Liu BoSigned-off-by: Chandan Rajendra --- fs/btrfs/ctree.c | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 5f745ea..719ed3c 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -311,7 +311,7 @@ struct tree_mod_root { struct tree_mod_elem { struct rb_node node; - u64 index; /* shifted logical */ + u64 logical; u64 seq; enum mod_log_op op; @@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info, /* * key order of the log: - * index -> sequence + * node/leaf start address -> sequence * - * the index is the shifted logical of the *new* root node for root replace - * operations, or the shifted logical of the affected block for all other - * operations. + * The 'start address' is the logical address of the *new* root node + * for root replace operations, or the logical address of the affected + * block for all other operations. * * Note: must be called with write lock (tree_mod_log_write_lock). 
*/ @@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm) while (*new) { cur = container_of(*new, struct tree_mod_elem, node); parent = *new; - if (cur->index < tm->index) + if (cur->logical < tm->logical) new = &((*new)->rb_left); - else if (cur->index > tm->index) + else if (cur->logical > tm->logical) new = &((*new)->rb_right); else if (cur->seq < tm->seq) new = &((*new)->rb_left); @@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot, if (!tm) return NULL; - tm->index = eb->start >> PAGE_CACHE_SHIFT; + tm->logical = eb->start; if (op != MOD_LOG_KEY_ADD) { btrfs_node_key(eb, >key, slot); tm->blockptr = btrfs_node_blockptr(eb, slot); @@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info, goto free_tms; } - tm->index = eb->start >> PAGE_CACHE_SHIFT; + tm->logical = eb->start; tm->slot = src_slot; tm->move.dst_slot = dst_slot; tm->move.nr_items = nr_items; @@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info, goto free_tms; } - tm->index = new_root->start >> PAGE_CACHE_SHIFT; + tm->logical = new_root->start; tm->old_root.logical = old_root->start; tm->old_root.level = btrfs_header_level(old_root); tm->generation = btrfs_header_generation(old_root); @@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq, struct rb_node *node; struct tree_mod_elem *cur = NULL; struct tree_mod_elem *found = NULL; - u64 index = start >> PAGE_CACHE_SHIFT; tree_mod_log_read_lock(fs_info); tm_root = _info->tree_mod_log; node = tm_root->rb_node; while (node) { cur = container_of(node, struct tree_mod_elem, node); - if (cur->index < index) { + if (cur->logical < start) { node = node->rb_left; - } else if (cur->index > index) { + } else if (cur->logical > start) { node = node->rb_right; } else if (cur->seq < min_seq) { node = node->rb_left; @@ -1230,9 +1229,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info, return NULL; /* -* the 
very last operation that's logged for a root is the replacement -* operation (if it is replaced at all). this has the index of the *new* -* root, making it the very first operation that's logged for this root. +* the very last operation that's logged for a root is the +* replacement operation (if it is replaced at all). this has +* the logical address of the *new* root, making it the very +* first operation that's logged for this root. */ while (1) { tm = tree_mod_log_search_oldest(fs_info, root_logical, @@ -1336,7 +1336,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb, if (!next) break;
[RFC PATCH V4 04/13] Btrfs: fallocate: Work with sectorsized blocks
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized blocks instead of pages. Hence the function has been renamed to btrfs_truncate_block(). Signed-off-by: Chandan Rajendra--- fs/btrfs/ctree.h | 2 +- fs/btrfs/file.c | 47 +-- fs/btrfs/inode.c | 52 +++- 3 files changed, 53 insertions(+), 48 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 938efe3..99a0fff 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3893,7 +3893,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct inode *dir, u64 objectid, const char *name, int name_len); -int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len, +int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, int front); int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 12ce401..360d56d 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2280,23 +2280,26 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) u64 tail_len; u64 orig_start = offset; u64 cur_offset; + unsigned char blocksize_bits; u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); u64 drop_end; int ret = 0; int err = 0; int rsv_count; - bool same_page; + bool same_block; bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES); u64 ino_size; - bool truncated_page = false; + bool truncated_block = false; bool updated_inode = false; + blocksize_bits = inode->i_blkbits; + ret = btrfs_wait_ordered_range(inode, offset, len); if (ret) return ret; mutex_lock(>i_mutex); - ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE); + ino_size = round_up(inode->i_size, root->sectorsize); ret = find_first_non_hole(inode, , ); if (ret < 0) goto out_only_mutex; @@ -2309,31 +2312,30 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize); lockend = round_down(offset 
+ len, BTRFS_I(inode)->root->sectorsize) - 1; - same_page = ((offset >> PAGE_CACHE_SHIFT) == - ((offset + len - 1) >> PAGE_CACHE_SHIFT)); - + same_block = ((offset >> blocksize_bits) + == ((offset + len - 1) >> blocksize_bits)); /* -* We needn't truncate any page which is beyond the end of the file +* We needn't truncate any block which is beyond the end of the file * because we are sure there is no data there. */ /* -* Only do this if we are in the same page and we aren't doing the -* entire page. +* Only do this if we are in the same block and we aren't doing the +* entire block. */ - if (same_page && len < PAGE_CACHE_SIZE) { + if (same_block && len < root->sectorsize) { if (offset < ino_size) { - truncated_page = true; - ret = btrfs_truncate_page(inode, offset, len, 0); + truncated_block = true; + ret = btrfs_truncate_block(inode, offset, len, 0); } else { ret = 0; } goto out_only_mutex; } - /* zero back part of the first page */ + /* zero back part of the first block */ if (offset < ino_size) { - truncated_page = true; - ret = btrfs_truncate_page(inode, offset, 0, 0); + truncated_block = true; + ret = btrfs_truncate_block(inode, offset, 0, 0); if (ret) { mutex_unlock(>i_mutex); return ret; @@ -2368,9 +2370,10 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) if (!ret) { /* zero the front end of the last page */ if (tail_start + tail_len < ino_size) { - truncated_page = true; - ret = btrfs_truncate_page(inode, - tail_start + tail_len, 0, 1); + truncated_block = true; + ret = btrfs_truncate_block(inode, + tail_start + tail_len, + 0, 1); if (ret)
[RFC PATCH V4 06/13] Btrfs: Search for all ordered extents that could span across a page
In subpagesize-blocksize scenario it is not sufficient to search using the first byte of the page to make sure that there are no ordered extents present across the page. Fix this. Signed-off-by: Chandan Rajendra--- fs/btrfs/extent_io.c | 3 ++- fs/btrfs/inode.c | 25 ++--- 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 11aa8f7..0ee486a 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3224,7 +3224,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree, while (1) { lock_extent(tree, start, end); - ordered = btrfs_lookup_ordered_extent(inode, start); + ordered = btrfs_lookup_ordered_range(inode, start, + PAGE_CACHE_SIZE); if (!ordered) break; unlock_extent(tree, start, end); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 5e6052d..4fbe9de 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1975,7 +1975,8 @@ again: if (PagePrivate2(page)) goto out; - ordered = btrfs_lookup_ordered_extent(inode, page_start); + ordered = btrfs_lookup_ordered_range(inode, page_start, + PAGE_CACHE_SIZE); if (ordered) { unlock_extent_cached(_I(inode)->io_tree, page_start, page_end, _state, GFP_NOFS); @@ -8552,6 +8553,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, struct extent_state *cached_state = NULL; u64 page_start = page_offset(page); u64 page_end = page_start + PAGE_CACHE_SIZE - 1; + u64 start; + u64 end; int inode_evicting = inode->i_state & I_FREEING; /* @@ -8571,14 +8574,18 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, if (!inode_evicting) lock_extent_bits(tree, page_start, page_end, 0, _state); - ordered = btrfs_lookup_ordered_extent(inode, page_start); +again: + start = page_start; + ordered = btrfs_lookup_ordered_range(inode, start, + page_end - start + 1); if (ordered) { + end = min(page_end, ordered->file_offset + ordered->len - 1); /* * IO on this page will never be started, so we need * to account for any 
ordered extents now */ if (!inode_evicting) - clear_extent_bit(tree, page_start, page_end, + clear_extent_bit(tree, start, end, EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_LOCKED | EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 0, _state, @@ -8595,22 +8602,26 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, spin_lock_irq(>lock); set_bit(BTRFS_ORDERED_TRUNCATED, >flags); - new_len = page_start - ordered->file_offset; + new_len = start - ordered->file_offset; if (new_len < ordered->truncated_len) ordered->truncated_len = new_len; spin_unlock_irq(>lock); if (btrfs_dec_test_ordered_pending(inode, , - page_start, - PAGE_CACHE_SIZE, 1)) + start, + end - start + 1, 1)) btrfs_finish_ordered_io(ordered); } btrfs_put_ordered_extent(ordered); if (!inode_evicting) { cached_state = NULL; - lock_extent_bits(tree, page_start, page_end, 0, + lock_extent_bits(tree, start, end, 0, _state); } + + start = end + 1; + if (start < page_end) + goto again; } if (!inode_evicting) { -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH V4 12/13] Btrfs: prepare_pages: Retry adding a page to the page cache
When reading the page from the disk, we can race with direct I/O, which
can get the page lock (before prepare_uptodate_page() gets it) and can
go ahead and invalidate the page. Hence if the page is not found in the
inode's address space, retry the operation of getting a page.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/file.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5715e29..76db77c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1316,6 +1316,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 	int faili;
 
 	for (i = 0; i < num_pages; i++) {
+again:
 		pages[i] = find_or_create_page(inode->i_mapping, index + i,
 					       mask | __GFP_WRITE);
 		if (!pages[i]) {
@@ -1330,6 +1331,21 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 		if (i == num_pages - 1)
 			err = prepare_uptodate_page(pages[i],
 						    pos + write_bytes, false);
+
+		/*
+		 * When reading the page from the disk, we can race
+		 * with direct i/o which can get the page lock (before
+		 * prepare_uptodate_page() gets it) and can go ahead
+		 * and invalidate the page. Hence if the page is found
+		 * to be not belonging to the inode's address space,
+		 * retry the operation of getting a page.
+		 */
+		if (unlikely(pages[i]->mapping != inode->i_mapping)) {
+			unlock_page(pages[i]);
+			page_cache_release(pages[i]);
+			goto again;
+		}
+
 		if (err) {
 			page_cache_release(pages[i]);
 			faili = i - 1;
-- 
2.1.0
[RFC PATCH V4 01/13] Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE units. Fix this by doing reservation/releases in block size units. Signed-off-by: Chandan Rajendra --- fs/btrfs/file.c | 44 +++++++++++++++++++++++++++++++------------- 1 file changed, 31 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index b823fac..12ce401 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -499,7 +499,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, loff_t isize = i_size_read(inode); start_pos = pos & ~((u64)root->sectorsize - 1); - num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize); + num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize); end_of_last_block = start_pos + num_bytes - 1; err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block, @@ -1362,16 +1362,19 @@ fail: static noinline int lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, + size_t write_bytes, u64 *lockstart, u64 *lockend, struct extent_state **cached_state) { + struct btrfs_root *root = BTRFS_I(inode)->root; u64 start_pos; u64 last_pos; int i; int ret = 0; - start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1); - last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1; + start_pos = round_down(pos, root->sectorsize); + last_pos = start_pos + + round_up(pos + write_bytes - start_pos, root->sectorsize) - 1; if (start_pos < inode->i_size) { struct btrfs_ordered_extent *ordered; @@ -1489,6 +1492,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, while (iov_iter_count(i) > 0) { size_t offset = pos & (PAGE_CACHE_SIZE - 1); + size_t sector_offset; size_t write_bytes = min(iov_iter_count(i), nrptrs * (size_t)PAGE_CACHE_SIZE - offset); @@ -1497,6 +1501,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, size_t reserve_bytes; size_t dirty_pages; size_t copied; + size_t dirty_sectors; + size_t num_sectors; WARN_ON(num_pages > nrptrs); @@ -1509,8 +1515,12 @@ 
static noinline ssize_t __btrfs_buffered_write(struct file *file, break; } - reserve_bytes = num_pages << PAGE_CACHE_SHIFT; + sector_offset = pos & (root->sectorsize - 1); + reserve_bytes = round_up(write_bytes + sector_offset, + root->sectorsize); + ret = btrfs_check_data_free_space(inode, reserve_bytes, write_bytes); + if (ret == -ENOSPC && (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC))) { @@ -1523,7 +1533,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, */ num_pages = DIV_ROUND_UP(write_bytes + offset, PAGE_CACHE_SIZE); - reserve_bytes = num_pages << PAGE_CACHE_SHIFT; + reserve_bytes = round_up(write_bytes + + sector_offset, + root->sectorsize); + ret = 0; } else { ret = -ENOSPC; @@ -1558,8 +1571,8 @@ again: break; ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages, - pos, &lockstart, &lockend, - &cached_state); + pos, write_bytes, &lockstart, + &lockend, &cached_state); if (ret < 0) { if (ret == -EAGAIN) goto again; @@ -1595,9 +1608,14 @@ again: * we still have an outstanding extent for the chunk we actually * managed to copy. */ - if (num_pages > dirty_pages) { - release_bytes = (num_pages - dirty_pages) << - PAGE_CACHE_SHIFT; + num_sectors = reserve_bytes >> inode->i_blkbits; + dirty_sectors = round_up(copied + sector_offset, + root->sectorsize); + dirty_sectors >>=
[PATCH V5 00/13] Btrfs: Pre subpagesize-blocksize cleanups
The patches posted along with this cover letter are cleanups made during the development of the subpagesize-blocksize patchset. I believe that they can be integrated with the mainline kernel. Hence I have posted them separately from the subpagesize-blocksize patchset. I have tested the patchset by running xfstests on ppc64 and x86_64. On ppc64, some of the Btrfs specific tests and generic/255 fail because they assume 4K as the filesystem's block size. I have fixed some of the test cases. I will fix the rest and mail them to the fstests mailing list in the near future. Changes from V4: 1. Removed the RFC tag. Changes from V3: Two new issues have been fixed by the patches, 1. Btrfs: prepare_pages: Retry adding a page to the page cache. 2. Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated. IMHO, the above issues are also applicable to the "page size == block size" scenario but for reasons unknown to me they aren't seen even when the tests are run for a long time. Changes from V2: 1. For detecting logical errors, use ASSERT() calls instead of calls to BUG_ON(). 2. In the patch "Btrfs: Compute and look up csums based on sectorsized blocks", fix usage of kmap_atomic/kunmap_atomic such that between the kmap_atomic() and kunmap_atomic() calls we do not invoke any function that might cause the current task to sleep. Changes from V1: 1. Call round_[down,up]() functions instead of doing hard coded alignment. 
Chandan Rajendra (13): Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size Btrfs: Compute and look up csums based on sectorsized blocks Btrfs: Direct I/O read: Work on sectorsized blocks Btrfs: fallocate: Work with sectorsized blocks Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units Btrfs: Search for all ordered extents that could span across a page Btrfs: Use (eb->start, seq) as search key for tree modification log Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length Btrfs: Limit inline extents to root->sectorsize Btrfs: Fix block size returned to user space Btrfs: Clean pte corresponding to page straddling i_size Btrfs: prepare_pages: Retry adding a page to the page cache Btrfs: Return valid delalloc range when the page does not have PG_Dirty flag set or has been invalidated fs/btrfs/ctree.c | 34 ++++++++--------- fs/btrfs/ctree.h | 2 +- fs/btrfs/extent_io.c | 5 +- fs/btrfs/file-item.c | 93 ++++++++++++++--------- fs/btrfs/file.c | 119 +++++++++++++++++++--------- fs/btrfs/inode.c | 239 +++++++++++++++++++++++++++++++--------------------- 6 files changed, 331 insertions(+), 161 deletions(-) -- 2.1.0
Re: [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory
On Wed, Sep 30, 2015 at 10:57:45PM +0300, Roman Lebedev wrote: > As per overlayfs documentation, any activity on a merged directory > for an application that is doing such activity should work exactly > as if that would be a normal, non overlayfs-merged directory. > > That is, e.g. simple fopen-fwrite-fsync-fclose sequence should > work just fine. We have plenty of tests that do things like that. > But apparently it does not. Add a simple generic test to check that. > As of right now (linux-4.2.1) this test fails at least on btrfs. > > PS: An alternative (and probably better approach) would be to run > fstests test suite with TEST_DIR set to overlayfs work directory. Much better is to run xfstests directly on overlayfs. There have been some patches to do that posted in the past, but those patches and discussions kinda ended up going nowhere: http://www.mail-archive.com/fstests@vger.kernel.org/msg00474.html Perhaps you'd like to pick this up, and then overlay will be much easier to test and hence likely not to have bugs like this... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH] fstests: generic: Test that fsync works on file in overlayfs merged directory
On 9/30/15 4:56 PM, Dave Chinner wrote: > On Wed, Sep 30, 2015 at 10:57:45PM +0300, Roman Lebedev wrote: >> As per overlayfs documentation, any activity on a merged directory >> for an application that is doing such activity should work exactly >> as if that would be a normal, non overlayfs-merged directory. >> >> That is, e.g. simple fopen-fwrite-fsync-fclose sequence should >> work just fine. > > We have plenty of tests that do things like that. > >> But apparently it does not. Add a simple generic test to check that. >> As of right now (linux-4.2.1) this test fails at least on btrfs. >> >> PS: An alternative (and probably better approach) would be to run >> fstests test suite with TEST_DIR set to overlayfs work directory. > > Much better is to run xfstests directly on overlayfs. There have > been some patches to do that posted in the past, but those patches > and discussions kinda ended up going nowhere: > > http://www.mail-archive.com/fstests@vger.kernel.org/msg00474.html > > Perhaps you'd like to pick this up, and then overlay will be much > easier to test and hence likely not to have bugs like this... Yeah, that could still be used for fun, but Zach's POV was that we should just have a specific overlayfs config (dictating paths to over/under/merge/around/through/whatever directories), a special mount_overlayfs helper, etc, ala NFS & CIFS. It may actually be easier than what I proposed. If you want to take a stab at it I'm happy to help, answer questions, etc - I'm not sure when I'll get back to it... -Eric
Re: RAID5 doesn't mount on boot, but you can afterwards?
Sjoerd posted on Wed, 30 Sep 2015 18:49:21 +0200 as excerpted: > A RAID5 setup on raw devices doesn't want to automount on boot. After I > skip mounting I can log in (Ubuntu server 14.04 on kernel 4.1.8) and > just do a "sudo mount -a" to get all mounted fine. So the array doesn't > seem to be broken. "btrfs fi show /data" doesn't show anything wrong > either. > > The only weird thing I see in the syslog is : > > BTRFS info (device sdd): disk space caching is enabled BTRFS: has skinny > extents BTRFS: failed to read the system array on sdd BTRFS: open_ctree > failed > > If I reboot the machine the drive in the log changed and looks random > (i've seen in 3 boots sda, sdc and sde passing by) > > I am using btrfs-progs 4.2.1 if that matters in this case... > > Anyone have a clue why it's not automounting? Or something I can do to > troubleshoot? That's very likely because unlike traditional single-device filesystems (including single-device btrfs), multi-device btrfs has multiple devices it must know about before it can mount the filesystem, while mount only feeds it one device. There are two ways to tell btrfs (the kernel side) about the other devices. 1) Do a btrfs device scan before trying to mount. 2) Name the component devices in the mount options, using the device= option (multiple times as necessary to list all devices). For various reasons including dynamic device discovery effectively randomizing device sd* assignment, btrfs device scan is the normally used option. What's probably happening is that at some point in the boot process, btrfs device scan is being automatically run, but it's after the attempt to mount the filesystem during boot, so the boot attempt to mount fails, but doing a manual mount succeeds, because the scan has already been done by the time you get a prompt in order to run the command. 
So what you need to do is find the service that runs the btrfs device scan, and make the mount depend on it, so the scan is done before the attempt to mount. Then it should work. Or if it's easier, simply create a new service that runs the scan, and have it run before the mount, since running the scan twice won't hurt anything; it simply needs to run before the mount is attempted in order for btrfs to know what devices compose the filesystem, so it can be mounted. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
[PATCH v5 5/9] vfs: Copy shouldn't forbid ranges inside the same file
This is perfectly valid for BTRFS and XFS, so let's leave this up to filesystems to check. Signed-off-by: Anna Schumaker Reviewed-by: David Sterba Reviewed-by: Darrick J. Wong --- fs/read_write.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index f3d6c48..8e7cb33 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1371,10 +1371,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, file_in->f_path.mnt != file_out->f_path.mnt) return -EXDEV; - /* forbid ranges in the same file */ - if (inode_in == inode_out) - return -EINVAL; - if (len == 0) return 0; -- 2.6.0
[PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks
Reject copies that don't have the COPY_FR_REFLINK flag set. Signed-off-by: Anna Schumaker Reviewed-by: David Sterba --- fs/btrfs/ioctl.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index d3697e8..c1f115d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -44,6 +44,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in, { ssize_t ret; + if (!(flags & COPY_FR_REFLINK)) + return -EOPNOTSUPP; + ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out); if (ret == 0) ret = len; -- 2.6.0
[PATCH v5 4/9] vfs: Copy should check len after file open mode
I don't think it makes sense to report that a copy succeeded if the files aren't open properly. Signed-off-by: Anna Schumaker Reviewed-by: David Sterba --- fs/read_write.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index dd10750..f3d6c48 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1345,9 +1345,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (flags) return -EINVAL; - if (len == 0) - return 0; - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */ ret = rw_verify_area(READ, file_in, &pos_in, len); if (ret >= 0) @@ -1378,6 +1375,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (inode_in == inode_out) return -EINVAL; + if (len == 0) + return 0; + ret = mnt_want_write_file(file_out); if (ret) return ret; -- 2.6.0
[PATCH v5 10/9] copy_file_range.2: New page documenting copy_file_range()
copy_file_range() is a new system call for copying ranges of data completely in the kernel. This gives filesystems an opportunity to implement some kind of "copy acceleration", such as reflinks or server-side-copy (in the case of NFS). Signed-off-by: Anna Schumaker Reviewed-by: Darrick J. Wong --- man2/copy_file_range.2 | 224 +++++++++++++++++++++++++++++++++++++++++++++++ man2/splice.2 | 1 + 2 files changed, 225 insertions(+) create mode 100644 man2/copy_file_range.2 diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2 new file mode 100644 index 000..23e3875 --- /dev/null +++ b/man2/copy_file_range.2 @@ -0,0 +1,224 @@ +.\" This manpage is Copyright (C) 2015 Anna Schumaker +.\" +.\" %%%LICENSE_START(VERBATIM) +.\" Permission is granted to make and distribute verbatim copies of this +.\" manual provided the copyright notice and this permission notice are +.\" preserved on all copies. +.\" +.\" Permission is granted to copy and distribute modified versions of +.\" this manual under the conditions for verbatim copying, provided that +.\" the entire resulting derived work is distributed under the terms of +.\" a permission notice identical to this one. +.\" +.\" Since the Linux kernel and libraries are constantly changing, this +.\" manual page may be incorrect or out-of-date. The author(s) assume +.\" no responsibility for errors or omissions, or for damages resulting +.\" from the use of the information contained herein. The author(s) may +.\" not have taken the same level of care in the production of this +.\" manual, which is licensed free of charge, as they might when working +.\" professionally. +.\" +.\" Formatted or processed versions of this manual, if unaccompanied by +.\" the source, must acknowledge the copyright and authors of this work. 
+.\" %%%LICENSE_END +.\" +.TH COPY 2 2015-09-29 "Linux" "Linux Programmer's Manual" +.SH NAME +copy_file_range \- Copy a range of data from one file to another +.SH SYNOPSIS +.nf +.B #include +.B #include +.B #include + +.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ", +.BI "loff_t *" off_out ", size_t " len \ +", unsigned int " flags ); +.fi +.SH DESCRIPTION +The +.BR copy_file_range () +system call performs an in-kernel copy between two file descriptors +without the additional cost of transferring data from the kernel to userspace +and then back into the kernel. +It copies up to +.I len +bytes of data from file descriptor +.I fd_in +to file descriptor +.IR fd_out , +overwriting any data that exists within the requested range of the target file. + +The following semantics apply for +.IR off_in , +and similar statements apply to +.IR off_out : +.IP * 3 +If +.I off_in +is NULL, then bytes are read from +.I fd_in +starting from the current file offset, and the offset is +adjusted by the number of bytes copied. +.IP * +If +.I off_in +is not NULL, then +.I off_in +must point to a buffer that specifies the starting +offset where bytes from +.I fd_in +will be read. The current file offset of +.I fd_in +is not changed, but +.I off_in +is adjusted appropriately. +.PP + +The +.I flags +argument can have one of the following flags set: +.TP 1.9i +.B COPY_FR_COPY +Copy all the file data in the requested range. +Some filesystems might be able to accelerate this copy +to avoid unnecessary data transfers. +.TP +.B COPY_FR_REFLINK +Create a lightweight "reflink", where data is not copied until +one of the files is modified. +.TP +.B COPY_FR_DEDUP +Create a reflink, but only if the contents of +both files' byte ranges are identical. +If ranges do not match, +.B EILSEQ +will be returned. 
+.PP +The default behavior +.RI ( flags +== 0) is to try creating a reflink, +and if reflinking fails +.BR copy_file_range () +will fall back to performing a full data copy. +.SH RETURN VALUE +Upon successful completion, +.BR copy_file_range () +will return the number of bytes copied between files. +This could be less than the length originally requested. + +On error, +.BR copy_file_range () +returns \-1 and +.I errno +is set to indicate the error. +.SH ERRORS +.TP +.B EBADF +One or more file descriptors are not valid; or +.I fd_in +is not open for reading; or +.I fd_out +is not open for writing. +.TP +.B EILSEQ +The contents of both files' byte ranges did not match. +.TP +.B EINVAL +Requested range extends beyond the end of the source file; or the +.I flags +argument is set to an invalid value. +.TP +.B EIO +A low level I/O error occurred while copying. +.TP +.B ENOMEM +Out of memory. +.TP +.B ENOSPC +There is not enough space on the target filesystem to complete the copy. +.TP +.B EOPNOTSUPP +.B COPY_REFLINK +or +.B COPY_DEDUP +was specified in +.IR flags , +but the target filesystem does not support the given operation. +.TP +.B EXDEV +Target filesystem doesn't support cross-filesystem copies. +.SH VERSIONS +The +.BR
[PATCH v5 1/9] vfs: add copy_file_range syscall and vfs helper
From: Zach Brown Add a copy_file_range() system call for offloading copies between regular files. This gives an interface to underlying layers of the storage stack which can copy without reading and writing all the data. There are a few candidates that should support copy offloading in the nearer term: - btrfs shares extent references with its clone ioctl - NFS has patches to add a COPY command which copies on the server - SCSI has a family of XCOPY commands which copy in the device This system call avoids the complexity of also accelerating the creation of the destination file by operating on an existing destination file descriptor, not a path. Currently the high level vfs entry point limits copy offloading to files on the same mount and super (and not in the same file). This can be relaxed if we get implementations which can copy between file systems safely. Signed-off-by: Zach Brown [Anna Schumaker: Change -EINVAL to -EBADF during file verification] [Anna Schumaker: Change flags parameter from int to unsigned int] [Anna Schumaker: Add function to include/linux/syscalls.h] Signed-off-by: Anna Schumaker --- v5: - Bump syscall number again - Add to include/linux/syscalls.h --- fs/read_write.c | 129 +++++++++++++++++++++++++++++++++ include/linux/fs.h | 3 + include/linux/syscalls.h | 3 + include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c | 1 + 5 files changed, 139 insertions(+), 1 deletion(-) diff --git a/fs/read_write.c b/fs/read_write.c index 819ef3f..dd10750 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "internal.h" #include @@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, return do_sendfile(out_fd, in_fd, NULL, count, 0); } #endif + +/* + * copy_file_range() differs from regular file read and write in that it + * specifically allows return partial success. When it does so is up to + * the copy_file_range method. 
+ */ +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t len, unsigned int flags) +{ + struct inode *inode_in; + struct inode *inode_out; + ssize_t ret; + + if (flags) + return -EINVAL; + + if (len == 0) + return 0; + + /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */ + ret = rw_verify_area(READ, file_in, &pos_in, len); + if (ret >= 0) + ret = rw_verify_area(WRITE, file_out, &pos_out, len); + if (ret < 0) + return ret; + + if (!(file_in->f_mode & FMODE_READ) || + !(file_out->f_mode & FMODE_WRITE) || + (file_out->f_flags & O_APPEND) || + !file_in->f_op || !file_in->f_op->copy_file_range) + return -EBADF; + + inode_in = file_inode(file_in); + inode_out = file_inode(file_out); + + /* make sure offsets don't wrap and the input is inside i_size */ + if (pos_in + len < pos_in || pos_out + len < pos_out || + pos_in + len > i_size_read(inode_in)) + return -EINVAL; + + /* this could be relaxed once a method supports cross-fs copies */ + if (inode_in->i_sb != inode_out->i_sb || + file_in->f_path.mnt != file_out->f_path.mnt) + return -EXDEV; + + /* forbid ranges in the same file */ + if (inode_in == inode_out) + return -EINVAL; + + ret = mnt_want_write_file(file_out); + if (ret) + return ret; + + ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out, +len, flags); + if (ret > 0) { + fsnotify_access(file_in); + add_rchar(current, ret); + fsnotify_modify(file_out); + add_wchar(current, ret); + } + inc_syscr(current); + inc_syscw(current); + + mnt_drop_write_file(file_out); + + return ret; +} +EXPORT_SYMBOL(vfs_copy_file_range); + +SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in, + int, fd_out, loff_t __user *, off_out, + size_t, len, unsigned int, flags) +{ + loff_t pos_in; + loff_t pos_out; + struct fd f_in; + struct fd f_out; + ssize_t ret; + + f_in = fdget(fd_in); + f_out = fdget(fd_out); + if (!f_in.file || !f_out.file) { + ret = -EBADF; + goto out; + } + + ret 
= -EFAULT; + if (off_in) { + if (copy_from_user(&pos_in, off_in, sizeof(loff_t))) + goto out; + } else { + pos_in = f_in.file->f_pos; + } + + if (off_out) { + if
[PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies
This allows us to have an in-kernel copy mechanism that avoids frequent switches between kernel and user space. This is especially useful so NFSD can support server-side copies. I make pagecache copies configurable by adding three new (exclusive) flags: - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. - COPY_FR_DEDUP creates a reflink, but only if the contents of both ranges are identical. The default (flags=0) means to first attempt a reflink, but use the pagecache if that fails. I moved the rw_verify_area() calls into the fallback code since some filesystems can handle reflinking a large range. Signed-off-by: Anna Schumaker Reviewed-by: Darrick J. Wong Reviewed-by: Padraig Brady --- fs/read_write.c | 61 ++++++++++++++++++++++++++++++----------------- include/linux/copy.h | 6 ++++++ include/uapi/linux/Kbuild | 1 + include/uapi/linux/copy.h | 8 ++++++++ 4 files changed, 56 insertions(+), 20 deletions(-) create mode 100644 include/linux/copy.h create mode 100644 include/uapi/linux/copy.h diff --git a/fs/read_write.c b/fs/read_write.c index ee9fa37..4fb9b8e 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, } #endif +static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t len) +{ + ssize_t ret; + + ret = rw_verify_area(READ, file_in, &pos_in, len); + if (ret >= 0) { + len = ret; + ret = rw_verify_area(WRITE, file_out, &pos_out, len); + if (ret >= 0) + len = ret; + } + if (ret < 0) + return ret; + + file_start_write(file_out); + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0); + file_end_write(file_out); + + return ret; +} + /* * copy_file_range() differs from regular file read and write in that it * specifically allows return partial success. 
When it does so is up to @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, size_t len, unsigned int flags) { - struct inode *inode_in; - struct inode *inode_out; ssize_t ret; - if (flags) + /* Flags should only be used exclusively. */ + if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY)) + return -EINVAL; + if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK)) + return -EINVAL; + if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP)) return -EINVAL; - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */ - ret = rw_verify_area(READ, file_in, &pos_in, len); - if (ret >= 0) - ret = rw_verify_area(WRITE, file_out, &pos_out, len); - if (ret < 0) - return ret; + /* Default behavior is to try both. */ + if (flags == 0) + flags = COPY_FR_COPY | COPY_FR_REFLINK; if (!(file_in->f_mode & FMODE_READ) || !(file_out->f_mode & FMODE_WRITE) || (file_out->f_flags & O_APPEND) || - !file_out->f_op || !file_out->f_op->copy_file_range) + !file_out->f_op) return -EBADF; - inode_in = file_inode(file_in); - inode_out = file_inode(file_out); - - /* make sure offsets don't wrap and the input is inside i_size */ - if (pos_in + len < pos_in || pos_out + len < pos_out || - pos_in + len > i_size_read(inode_in)) - return -EINVAL; - if (len == 0) return 0; @@ -1373,8 +1389,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (ret) return ret; - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out, - len, flags); + ret = -EOPNOTSUPP; + if (file_out->f_op->copy_file_range && (file_in->f_op == file_out->f_op)) + ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, + pos_out, len, flags); + if ((ret < 0) && (flags & COPY_FR_COPY)) + ret = vfs_copy_file_pagecache(file_in, pos_in, file_out, + pos_out, len); if (ret > 0) { fsnotify_access(file_in); add_rchar(current, ret); diff --git a/include/linux/copy.h b/include/linux/copy.h new file mode 100644 index
[PATCH v5 7/9] vfs: Remove copy_file_range mountpoint checks
I still want to do an in-kernel copy even if the files are on different mountpoints, and NFS has a "server to server" copy that expects two files on different mountpoints. Let's have individual filesystems implement this check instead. Signed-off-by: Anna Schumaker Reviewed-by: David Sterba --- fs/read_write.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 6f74f1f..ee9fa37 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1366,11 +1366,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, pos_in + len > i_size_read(inode_in)) return -EINVAL; - /* this could be relaxed once a method supports cross-fs copies */ - if (inode_in->i_sb != inode_out->i_sb || - file_in->f_path.mnt != file_out->f_path.mnt) - return -EXDEV; - if (len == 0) return 0; -- 2.6.0
Re: [PATCH] btrfs: fix a compiler warning of may be used uninitialized
On Wed, Sep 30, 2015 at 11:55:13AM +0800, Zhao Lei wrote: > > AFAICS the codepath that would use uninitialized value of count is not > > reachable: > > > > add_to_ctl = true > > > > 270 if (info->offset > root->ino_cache_progress) > > 271 add_to_ctl = false; > > 272 else if (info->offset + info->bytes > > > root->ino_cache_progress) > > 273 count = root->ino_cache_progress - > > info->offset + 1; > > 274 else > > 275 count = info->bytes; > > 276 > > 277 rb_erase(>offset_index, rbroot); > > 278 spin_unlock(rbroot_lock); > > 279 if (add_to_ctl) > > 280 __btrfs_add_free_space(ctl, info->offset, > > count); > > > > count is defined iff add_to_ctl == true, so the patch is not necessary. And > > I'm > > not quite sure that 0 passed down to __btrfs_add_free_space as 'bytes' makes > > sense at all. > > Agree above all. > > So I write following description in changelog: > "Not real problem, just avoid warning of: ..." > > It is just to avoid compiler warning, no function changed. > A warning in compiler output is not pretty:) And the compiler is wrong in this case, the code is fine as is. I'd say go fix your compiler and the output will be pretty :) No really, this kind of fix brings a false sense of "fixing something in the code".
Re: [PATCH 00/23] btrfs device related patch set
On Wed, Sep 30, 2015 at 06:10:53AM +0800, Anand Jain wrote: > > > On 09/29/2015 10:34 PM, David Sterba wrote: > > On Fri, Aug 14, 2015 at 06:32:45PM +0800, Anand Jain wrote: > >> Anand Jain (22): > >>Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mounted > >>Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mounted > >>Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link > >>Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link > >>Btrfs: rename super_kobj to fsid_kobj > >>Btrfs: SB read failure should return EIO for __bread failure > >>Btrfs: __btrfs_std_error() logic should be consistent w/out > >> CONFIG_PRINTK defined > > > > FYI, I'm picking the above for 4.4 as they're quite straightforward, > > the other patches touch interfaces and I have some comments. > > Thanks David. > Except for >[PATCH 08/23] Btrfs: device delete by devid >[PATCH 23/23] Btrfs: allow -o rw,degraded for single group profile In case the two patches are independent of the rest of the series, it would be better to put them towards the end of the series. I was going down the list and stopped at 08/23 because it introduced something nontrivial and then I can't be sure that skipping the single patch would not break the whole series. > rest are straightforward as well. If there is any comment I will take it. I'll have another look and will let you know.
Re: [PATCH v3 1/2] btrfs-progs: Introduce warning and error for common use
On Mon, Sep 28, 2015 at 09:58:13PM +0800, Zhao Lei wrote:
> And sometimes, we forgot to add the trailing '\n',

This is actually a good point and we don't need to put the trailing newline
in all the messages, similar to the btrfs_* macros used in the kernel.
I'll update the patches.
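A sketch of the kind of helper being discussed: the trailing '\n' is appended by the function itself, the way the kernel's btrfs_* message macros work, so callers cannot forget it. The message buffer and the "warning: " prefix here are assumptions for illustration, not the final btrfs-progs interface:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char msgbuf[256];	/* captured for inspection; a real helper would only print */

__attribute__((format(printf, 1, 2)))
static void warning(const char *fmt, ...)
{
	va_list args;
	int n;

	n = snprintf(msgbuf, sizeof(msgbuf), "warning: ");
	va_start(args, fmt);
	n += vsnprintf(msgbuf + n, sizeof(msgbuf) - n, fmt, args);
	va_end(args);
	/* the helper, not the caller, supplies the newline */
	snprintf(msgbuf + n, sizeof(msgbuf) - n, "\n");
	fputs(msgbuf, stderr);
}
```

A caller then writes `warning("bad superblock on %s", path);` with no '\n' in the format string.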
Re: [PATCH v3 0/2] btrfs-progs: Introduce warning and error for common use
On Mon, Sep 28, 2015 at 09:58:12PM +0800, Zhao Lei wrote:
> This patch introduces warning() and error() as common functions, ...
> Converting all sources is a big work, this patch converts cmds-scrub.c
> We'll convert the others these days, and new code can use these functions
> directly.
>
> Zhao Lei (2):
>   btrfs-progs: Introduce warning and error for common use
>   btrfs-progs: use common warning/error for cmds-scrub.c

Both applied, thanks. With some changes like joining lines where possible
and shifting the messages left so they fit in ~80 chars.
Re: Add stripes filter
On Tue, Sep 29, 2015 at 12:21:39PM +0000, Hugo Mills wrote:
> On Tue, Sep 29, 2015 at 08:10:19AM -0400, Austin S Hemmelgarn wrote:
> > On 2015-09-29 08:00, David Sterba wrote:
> > > On Mon, Sep 28, 2015 at 05:57:05PM +0000, Gabríel Arthúr Pétursson wrote:
> > > > The attached patches to linux and btrfs-progs add support for filtering
> > > > based on the number of stripes in a block group when balancing.
> > >
> > > What usecase do you want to address? As I understand it, this would help
> > > the raid56 rebalancing to process only blockgroups that are not spread
> > > across enough devices.
>
> Exactly. Last week, I was trying to help Gabríel on IRC with a
> close-to-full filesystem, balancing it to add some new devices in a
> parity RAID configuration. He'd added the devices and balanced, but
> the usage was unequal across the devices. The only way I could think
> of dealing with it with the current tools was either to do a full
> balance repeatedly until it worked itself out, or to delve into the
> metadata with btrfs-debug-tree and balance selected block groups
> individually.
>
> I whinged that we needed a filter to pick just the block groups
> that weren't "as full as possible", and Gabríel picked up the idea and
> ran with it.

That's great, thanks. The stripe filters are really missing.
Re: [PATCH] Btrfs: check pending chunks when shrinking fs to avoid corruption
Hi Filipe,

Looking at the code of this patch, I see that if we discover a pending
chunk, we unlock the chunk mutex, commit the transaction (which completes
the allocation of all pending chunks and inserts the relevant items into
the device tree and chunk tree), and retry the search. However, after we
unlock the chunk mutex, somebody could have attempted a new chunk
allocation, which would have resulted in a new pending chunk. On the other
hand, we have done:

btrfs_device_set_total_bytes(device, new_size);

so this line should prevent anybody from allocating beyond the new size.
In that case, we are sure that on the second pass there will be no pending
chunks beyond the new size, so we can shrink to new_size safely. Is my
understanding correct?

Thanks, Alex.

On Tue, Jun 2, 2015 at 3:43 PM, Filipe Manana wrote:
> From: Filipe Manana
>
> When we shrink the usable size of a device (its total_bytes), we go over
> all the device extent items in the device tree and attempt to relocate
> the chunk of any device extent that goes beyond the new usable size for
> the device. We do that after setting the new usable size (total_bytes) in
> the device object, so that all new allocations (and reallocations) don't
> use areas of the device that go beyond the new (shorter) size. However we
> were not considering that before setting the new size in the device,
> pending chunks might have been created that use device extents that go
> beyond the new size, and those device extents are not yet in the device
> tree after we search the device tree - they are still attached to the
> list of new block groups for some ongoing transaction handle, and they
> are only added to the device tree when the transaction handle is ended
> (via btrfs_create_pending_block_groups()).
>
> So check for pending chunks with device extents that go beyond the new
> size and if any exists, commit the current transaction and repeat the
> search in the device tree.
>
> Not doing this would mean we would return success to user space while
> still having extents that go beyond the new size, and later user space
> could override those locations on the device while the fs still
> references them, causing all sorts of corruption and unexpected events.
>
> Signed-off-by: Filipe Manana
> ---
>  fs/btrfs/volumes.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dbea12e..09e89a6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3984,6 +3984,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>  	int slot;
>  	int failed = 0;
>  	bool retried = false;
> +	bool checked_pending_chunks = false;
>  	struct extent_buffer *l;
>  	struct btrfs_key key;
>  	struct btrfs_super_block *super_copy = root->fs_info->super_copy;
> @@ -4064,15 +4065,6 @@ again:
>  		goto again;
>  	} else if (failed && retried) {
>  		ret = -ENOSPC;
> -		lock_chunks(root);
> -
> -		btrfs_device_set_total_bytes(device, old_size);
> -		if (device->writeable)
> -			device->fs_devices->total_rw_bytes += diff;
> -		spin_lock(&root->fs_info->free_chunk_lock);
> -		root->fs_info->free_chunk_space += diff;
> -		spin_unlock(&root->fs_info->free_chunk_lock);
> -		unlock_chunks(root);
>  		goto done;
>  	}
>
> @@ -4084,6 +4076,35 @@ again:
>  	}
>
>  	lock_chunks(root);
> +
> +	/*
> +	 * We checked in the above loop all device extents that were already in
> +	 * the device tree. However before we have updated the device's
> +	 * total_bytes to the new size, we might have had chunk allocations that
> +	 * have not complete yet (new block groups attached to transaction
> +	 * handles), and therefore their device extents were not yet in the
> +	 * device tree and we missed them in the loop above. So if we have any
> +	 * pending chunk using a device extent that overlaps the device range
> +	 * that we can not use anymore, commit the current transaction and
> +	 * repeat the search on the device tree - this way we guarantee we will
> +	 * not have chunks using device extents that end beyond 'new_size'.
> +	 */
> +	if (!checked_pending_chunks) {
> +		u64 start = new_size;
> +		u64 len = old_size - new_size;
> +
> +		if (contains_pending_extent(trans, device, &start, len)) {
> +			unlock_chunks(root);
> +			checked_pending_chunks = true;
> +			failed = 0;
> +			retried = false;
> +			ret = btrfs_commit_transaction(trans, root);
> +			if (ret)
> +
Re: [PATCH v2 1/2] btrfs: Fix lost-data-profile caused by auto removing bg
On Wed, Sep 30, 2015 at 12:11 PM, Zhao Lei wrote:
> Reproduce:
> (In integration-4.3 branch)
>
> TEST_DEV=(/dev/vdg /dev/vdh)
> TEST_DIR=/mnt/tmp
>
> umount "$TEST_DEV" >/dev/null
> mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
>
> mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
> umount "$TEST_DEV"
>
> mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
> btrfs filesystem usage $TEST_DIR
>
> We can see the data chunk changed from raid1 to single:
> # btrfs filesystem usage $TEST_DIR
> Data,single: Size:8.00MiB, Used:0.00B
>    /dev/vdg   8.00MiB
> #
>
> Reason:
> When an empty filesystem is mounted with -o nospace_cache, the last
> data blockgroup will be auto-removed in umount.
>
> Then if we mount it again, there is no data chunk in the
> filesystem, so the only available data profile is 0x0, and the result
> is that all new chunks are created as single type.
>
> Fix:
> Don't auto-delete the last blockgroup for a raid type.
>
> Test:
> Tested by the above script, and confirmed the logic by debug output.
>
> Changelog v1->v2:
> 1: Put the code checking block_group->list inside the
>    space_info->groups_sem semaphore.
>    Noticed-by: Filipe Manana
>
> Signed-off-by: Zhao Lei

Reviewed-by: Filipe Manana

I would have made the check in the "if" statement below that is already
done while holding a write lock on the semaphore (smaller code diff), but
this is equally correct.
thanks

> ---
>  fs/btrfs/extent-tree.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 79a5bd9..ed9426c 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10010,8 +10010,18 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		block_group = list_first_entry(&fs_info->unused_bgs,
>  					       struct btrfs_block_group_cache,
>  					       bg_list);
> -		space_info = block_group->space_info;
>  		list_del_init(&block_group->bg_list);
> +
> +		space_info = block_group->space_info;
> +
> +		down_read(&space_info->groups_sem);
> +		if (block_group->list.next == block_group->list.prev) {
> +			up_read(&space_info->groups_sem);
> +			btrfs_put_block_group(block_group);
> +			continue;
> +		}
> +		up_read(&space_info->groups_sem);
> +
>  		if (ret || btrfs_mixed_space_info(space_info)) {
>  			btrfs_put_block_group(block_group);
>  			continue;
>  		}
> --
> 1.8.5.1

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
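The `list.next == list.prev` test in the patch is easy to verify with a minimal kernel-style list: when a node is the sole entry, both of its links point at the list head and are therefore equal. This is a standalone illustration, not kernel code:

```c
#include <assert.h>

/* Minimal kernel-style circular doubly linked list with a separate head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* The patch's check: a node is the only entry of its list exactly when
 * both of its links point at the same place (the head). */
static int is_only_entry(const struct list_node *n)
{
	return n->next == n->prev;
}
```

So the added block detects "this is the last block group of its type in space_info->block_groups" without walking the list.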
Re: [PATCH] btrfs: add stripes filter
Hi,

thanks for the patch. The stripe filter is really helpful. There are some
minor comments below but otherwise the patch looks good.

On Mon, Sep 28, 2015 at 10:32:41PM +0000, Gabríel Arthúr Pétursson wrote:
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -849,7 +849,11 @@ struct btrfs_disk_balance_args {
>  	/* BTRFS_BALANCE_ARGS_LIMIT value */
>  	__le64 limit;
>
> -	__le64 unused[7];
> +	/* btrfs stripes filter */
> +	__le64 sstart;
> +	__le64 send;

Please be more descriptive, eg. min_stripes/max_stripes. The u64 type
seems too much, I think we can fit the stripe count into a 32bit number.

I made a mistake with the u64 type for the 'limit' filter but I think that
we can somehow extend it to be two u32 with the min/max meaning as well.
Either way, this is independent of your patch.

> +
> +	__le64 unused[5];
>  } __attribute__ ((__packed__));
>
>  /*
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3236,6 +3248,12 @@ static int should_balance_chunk(struct btrfs_root *root,
>  		return 0;
>  	}
>
> +	/* stripes filter */
> +	if ((bargs->flags & BTRFS_BALANCE_ARGS_STRIPES) &&
> +	    chunk_stripes_filter(leaf, chunk, bargs)) {
> +		return 0;
> +	}

Ok, I think that this ordering of the filters is right.

> +
>  	/* soft profile changing mode */
>  	if ((bargs->flags & BTRFS_BALANCE_ARGS_SOFT) &&
>  	    chunk_soft_convert_filter(chunk_type, bargs)) {
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 2ca784a..fb6b89a 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -375,6 +375,7 @@ struct map_lookup {
>  #define BTRFS_BALANCE_ARGS_DRANGE	(1ULL << 3)
>  #define BTRFS_BALANCE_ARGS_VRANGE	(1ULL << 4)
>  #define BTRFS_BALANCE_ARGS_LIMIT	(1ULL << 5)
> +#define BTRFS_BALANCE_ARGS_STRIPES	(1ULL << 6)
>
>  /*
>   * Profile changing flags. When SOFT is set we won't relocate chunk if
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index b6dec05..a7819d0 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -218,7 +218,11 @@ struct btrfs_balance_args {
>  	__u64 flags;
>
>  	__u64 limit;	/* limit number of processed chunks */
> -	__u64 unused[7];

same comment from the ctree.h applies here

> +
> +	__u64 sstart;
> +	__u64 send;
> +
> +	__u64 unused[5];
>  } __attribute__ ((__packed__));
>
>  /* report balance progress to userspace */
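The two-u32-in-one-u64 idea mentioned for the 'limit' filter can be sketched with an anonymous union: one 64-bit slot keeps its old raw view while also exposing a min/max pair. Field names here are illustrative, not the final UAPI layout:

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit slot carrying a min/max pair as two 32-bit halves. */
struct balance_args_sketch {
	uint64_t flags;
	union {
		uint64_t stripes;		/* raw view of the whole slot */
		struct {
			uint32_t stripes_min;	/* low half on little-endian */
			uint32_t stripes_max;
		};
	};
};

/* Filter semantics as discussed: skip (return 1) chunks whose stripe
 * count falls outside [min, max]. */
static int stripes_filter(const struct balance_args_sketch *a,
			  uint32_t num_stripes)
{
	return num_stripes < a->stripes_min || num_stripes > a->stripes_max;
}
```

On-disk use would of course need explicit little-endian conversion; the union only shows how the range packs into the existing field width.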
Re: Add stripes filter
On Tue, Sep 29, 2015 at 08:10:19AM -0400, Austin S Hemmelgarn wrote:
> On 2015-09-29 08:00, David Sterba wrote:
> > On Mon, Sep 28, 2015 at 05:57:05PM +0000, Gabríel Arthúr Pétursson wrote:
> > > The attached patches to linux and btrfs-progs add support for filtering
> > > based on the number of stripes in a block group when balancing.
> >
> > What usecase do you want to address? As I understand it, this would help
> > the raid56 rebalancing to process only blockgroups that are not spread
> > across enough devices.
>
> This could also be helpful when reshaping a raid10 or raid0 setup.

Right, I forgot to mention it, I should have said "any raid profile that
uses stripes".
Re: [PATCH v2 2/2] btrfs-progs: device delete to accept devid
On Wed, Sep 30, 2015 at 06:03:42AM +0800, Anand Jain wrote:
> +struct btrfs_ioctl_vol_args_v3 {

Can we use struct btrfs_ioctl_vol_args_v2 for that purpose? It contains
the 'flags' so we can abuse the name field to store the device id and set
the flags accordingly.

> +	__s64 fd;
> +	char name[BTRFS_PATH_NAME_MAX + 1];
> +	__u64 devid;
> +};
> +
>  #define BTRFS_DEVICE_PATH_NAME_MAX 1024
>
>  #define BTRFS_SUBVOL_CREATE_ASYNC	(1ULL << 0)
> @@ -683,6 +689,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
>  					struct btrfs_ioctl_feature_flags[2])
>  #define BTRFS_IOC_GET_SUPPORTED_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
>  					struct btrfs_ioctl_feature_flags[3])
> +#define BTRFS_IOC_RM_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 58, \
> +					struct btrfs_ioctl_vol_args_v3)

And we can reuse the ioctl number 11

#define BTRFS_IOC_RM_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 11, \
				 struct btrfs_ioctl_vol_args_v2)

The vol_v2 structure is extensible so we can add more functionality there
and then I think it justifies the V2 interface bump.
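The suggestion — reusing vol_args_v2 with a flag bit that reinterprets the name area as a numeric device id — might look like the sketch below. The flag name, its bit position, and the union layout are assumptions for illustration; the real structure is defined in include/uapi/linux/btrfs.h:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAME_MAX_SKETCH 4039			/* size of the name area being reused */
#define DEVICE_SPEC_BY_ID_SKETCH (1ULL << 3)	/* hypothetical flag bit */

/* Shape of the reuse: 'flags' says whether the union carries a path
 * string or a device id, so one ioctl structure serves both cases. */
struct vol_args_v2_sketch {
	int64_t fd;
	uint64_t transid;
	uint64_t flags;
	union {
		char name[NAME_MAX_SKETCH + 1];
		uint64_t devid;
	};
};

/* Caller side: fill the struct for delete-by-path or delete-by-devid. */
static void set_delete_target(struct vol_args_v2_sketch *args,
			      const char *path, uint64_t devid)
{
	memset(args, 0, sizeof(*args));
	if (path) {
		strncpy(args->name, path, NAME_MAX_SKETCH);
	} else {
		args->flags |= DEVICE_SPEC_BY_ID_SKETCH;
		args->devid = devid;
	}
}
```

The kernel side would check the flag first and only treat the buffer as a NUL-terminated path when it is clear.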
Re: [PATCH 00/11] using of calloc instead of malloc+memset
On Tue, Sep 29, 2015 at 07:10:35PM +0200, Silvio Fricke wrote:
> Silvio Fricke (11):
>   btrfs-progs: use calloc instead of malloc+memset for btrfs-image.c
>   btrfs-progs: use calloc instead of malloc+memset for btrfs-list.c
>   btrfs-progs: use calloc instead of malloc+memset for chunk-recover.c
>   btrfs-progs: use calloc instead of malloc+memset for cmds-check.c
>   btrfs-progs: use calloc instead of malloc+memset for disk-io.c
>   btrfs-progs: use calloc instead of malloc+memset for extent_io.c
>   btrfs-progs: use calloc instead of malloc+memset for mkfs.c
>   btrfs-progs: use calloc instead of malloc+memset for qgroup.c
>   btrfs-progs: use calloc instead of malloc+memset for quick-test.c
>   btrfs-progs: use calloc instead of malloc+memset for volumes.c

Thanks. I think that all of these can be squeezed into one, the change is
logically the same in all the files and easy to review even in a big
patch. So I'll do that; the last patch will be separate as it's a fix.
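The conversion in this series is mechanical; a minimal before/after, using a hypothetical record type and function names for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct fake_extent {	/* hypothetical record type, not from btrfs-progs */
	uint64_t start;
	uint64_t len;
};

/* Before: separate allocation and zeroing. */
static struct fake_extent *alloc_extents_old(size_t n)
{
	struct fake_extent *e = malloc(n * sizeof(*e));

	if (!e)
		return NULL;
	memset(e, 0, n * sizeof(*e));
	return e;
}

/* After: one call; calloc also checks the n * size multiplication
 * for overflow, which the open-coded malloc+memset does not. */
static struct fake_extent *alloc_extents_new(size_t n)
{
	return calloc(n, sizeof(struct fake_extent));
}
```

Both return fully zeroed memory, which is why the conversion is safe to review file by file or as one squashed patch.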