Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
LGTM

On 2017/12/28 15:48, Gang He wrote:
> If we can't get inode lock immediately in the function
> ocfs2_inode_lock_with_page() when reading a page, we should not
> return directly here, since this will lead to a softlockup problem
> when the kernel is configured with CONFIG_PREEMPT is not set.
> The method is to get a blocking lock and immediately unlock before
> returning, this can avoid CPU resource waste due to lots of retries,
> and benefits fairness in getting lock among multiple nodes, increase
> efficiency in case modifying the same file frequently from multiple
> nodes.
> The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
> looks like,
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
>
>  dump_stack+0x5c/0x82
>  panic+0xd5/0x21e
>  watchdog_timer_fn+0x208/0x210
>  ? watchdog_park_threads+0x70/0x70
>  __hrtimer_run_queues+0xcc/0x200
>  hrtimer_interrupt+0xa6/0x1f0
>  smp_apic_timer_interrupt+0x34/0x50
>  apic_timer_interrupt+0x96/0xa0
>
> RIP: 0010:unlock_page+0x17/0x30
> RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
> RAX: dead0100 RBX: f21e009f5300 RCX: 0004
> RDX: dead00ff RSI: 0202 RDI: f21e009f5300
> RBP: R08: R09: af154080bb00
> R10: af154080bc30 R11: 0040 R12: 993749a39518
> R13: R14: f21e009f5300 R15: f21e009f5300
>  ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>  ocfs2_readpage+0x41/0x2d0 [ocfs2]
>  ? pagecache_get_page+0x30/0x200
>  filemap_fault+0x12b/0x5c0
>  ? recalc_sigpending+0x17/0x50
>  ? __set_task_blocked+0x28/0x70
>  ? __set_current_blocked+0x3d/0x60
>  ocfs2_fault+0x29/0xb0 [ocfs2]
>  __do_fault+0x1a/0xa0
>  __handle_mm_fault+0xbe8/0x1090
>  handle_mm_fault+0xaa/0x1f0
>  __do_page_fault+0x235/0x4b0
>  trace_do_page_fault+0x3c/0x110
>  async_page_fault+0x28/0x30
> RIP: 0033:0x7fa75ded638e
> RSP: 002b:7ffd6657db18 EFLAGS: 00010287
> RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
> RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
> RBP: 0003 R08: 000e R09:
> R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
> R13: 000e R14: 1770 R15:
>
> About performance improvement, we can see the testing time is reduced,
> and CPU utilization decreases, the detailed data is as follows.
> I ran multi_mmap test case in ocfs2-test package in a three nodes cluster.
> Before apply this patch,
>   PID USER      PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
>  2754 ocfs2te+  20   0 170248   6980  4856 D 80.73 0.341 0:18.71 multi_mmap
>  1505 root      rt   0     36 123060 97224 S 2.658 6.015 0:01.44 corosync
>     5 root      20   0      0      0     0 S 1.329 0.000 0:00.19 kworker/u8:0
>    95 root      20   0      0      0     0 S 1.329 0.000 0:00.25 kworker/u8:1
>  2728 root      20   0      0      0     0 S 0.997 0.000 0:00.24 jbd2/sda1-33
>  2721 root      20   0      0      0     0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
>  2750 ocfs2te+  20   0 142976   4652  3532 S 0.664 0.227 0:00.28 mpirun
>
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 14:44:52 CST 2017
> multi_mmap..Passed.
> Runtime 783 seconds.
>
> After apply this patch,
>   PID USER      PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
>  2508 ocfs2te+  20   0 170248   6804  4680 R 54.00 0.333 0:55.37 multi_mmap
>   155 root      20   0      0      0     0 S 2.667 0.000 0:01.20 kworker/u8:3
>    95 root      20   0      0      0     0 S 2.000 0.000 0:01.58 kworker/u8:1
>  2504 ocfs2te+  20   0 142976   4604  3480 R 1.667 0.225 0:01.65 mpirun
>     5 root      20   0      0      0     0 S 1.000 0.000 0:01.36 kworker/u8:0
>  2482 root      20   0      0      0     0 S 1.000 0.000 0:00.86 jbd2/sda1-33
>   299 root       0 -20      0      0     0 S 0.333 0.000 0:00.13 kworker/2:1H
>   335 root       0 -20      0      0     0 S 0.333 0.000 0:00.17 kworker/1:1H
>   535 root      20   0  12140   7268  1456 S 0.333 0.355 0:00.34 haveged
>  1282 root      rt   0     84 123108 97224 S 0.333 6.017 0:01.33 corosync
>
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 15:04:12 CST 2017
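To make the retry argument in the quoted description concrete, here is a toy user-space model in C. It is only a sketch and not the ocfs2 code: the names toy_lock, lock_for_read and read_attempts are hypothetical, RETRY_PAGE merely stands in for AOP_TRUNCATED_PAGE, and the remote lock holder is modelled as a plain counter that drops by one per polling attempt.

```c
#include <assert.h>
#include <errno.h>

#define RETRY_PAGE 1	/* hypothetical stand-in for AOP_TRUNCATED_PAGE */

/* Toy lock: another node holds it for hold_ticks more polling attempts. */
struct toy_lock {
	int hold_ticks;
};

/* Non-blocking attempt: fails with -EAGAIN while the lock is held. */
int toy_trylock(struct toy_lock *l)
{
	if (l->hold_ticks > 0) {
		l->hold_ticks--;	/* time passes while we poll */
		return -EAGAIN;
	}
	return 0;
}

/* Blocking acquire: modelled as sleeping until the holder is done. */
void toy_lock_blocking(struct toy_lock *l)
{
	l->hold_ticks = 0;
}

/* Release: a no-op in this model, kept so the call sites read naturally. */
void toy_unlock(struct toy_lock *l)
{
	(void)l;
}

/*
 * Take the lock for a page read. On contention, either return the retry
 * code at once (old behaviour) or first wait for the lock and drop it
 * again (behaviour described in the quoted patch), then ask for a retry.
 */
int lock_for_read(struct toy_lock *l, int block_on_contention)
{
	int ret = toy_trylock(l);

	if (ret != -EAGAIN)
		return ret;
	if (block_on_contention) {
		toy_lock_blocking(l);
		toy_unlock(l);
	}
	return RETRY_PAGE;
}

/* Caller loop: count how many attempts are needed until the lock is held. */
int read_attempts(struct toy_lock *l, int block_on_contention)
{
	int attempts = 1;

	assert(l != 0);
	while (lock_for_read(l, block_on_contention) == RETRY_PAGE)
		attempts++;
	toy_unlock(l);
	return attempts;
}
```

In this model, a lock that stays busy for 1000 polls costs the caller 1001 attempts when it retries immediately, but only 2 attempts when it blocks once before retrying. The model says nothing about real timings; it only shows where the busy loop comes from.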
[Ocfs2-devel] [PATCH v3 0/3] ocfs2: add nowait aio support
As you know, the VFS layer has introduced the non-blocking AIO flag
IOCB_NOWAIT, which tells the kernel to bail out if an AIO request would
block, for reasons such as file allocation, a triggered writeback, or
request allocation while performing direct I/O.
pwritev2/preadv2 can also leverage this kernel code.
So far ext4, xfs and btrfs support this feature; I'd like to add the
related code for the ocfs2 file system.

Compared with v2:
 - modify ocfs2_overwrite_io() for the OCFS2_INLINE_DATA_FL case.

Compared with v1:
 - use the osb pointer in ocfs2_try_rw_lock(),
 - modify ocfs2_overwrite_io() so that all error values are returned
   to the upper code,
 - move the ocfs2_overwrite_io() call from ocfs2_file_write_iter() to
   ocfs2_prepare_inode_for_write(), so that acquiring the related
   locks can be combined.

Gang He (3):
  ocfs2: add ocfs2_try_rw_lock and ocfs2_try_inode_lock
  ocfs2: add ocfs2_overwrite_io function
  ocfs2: nowait aio support

 fs/ocfs2/dir.c         |  2 +-
 fs/ocfs2/dlmglue.c     | 41 +++---
 fs/ocfs2/dlmglue.h     |  6 +++-
 fs/ocfs2/extent_map.c  | 45
 fs/ocfs2/extent_map.h  |  3 ++
 fs/ocfs2/file.c        | 95 +++---
 fs/ocfs2/mmap.c        |  2 +-
 fs/ocfs2/ocfs2_trace.h | 10 +++---
 8 files changed, 172 insertions(+), 32 deletions(-)

--
1.8.5.6

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH v3 3/3] ocfs2: nowait aio support
Return -EAGAIN if any of the following checks fail for direct I/O
with the nowait flag:
 - The related locks cannot be acquired immediately.
 - Blocks are not allocated at the write location, so the write would
   trigger block allocation, which would block the IO operation.

Signed-off-by: Gang He
---
 fs/ocfs2/dir.c         |  2 +-
 fs/ocfs2/dlmglue.c     | 20 ---
 fs/ocfs2/dlmglue.h     |  2 +-
 fs/ocfs2/file.c        | 95 +++---
 fs/ocfs2/mmap.c        |  2 +-
 fs/ocfs2/ocfs2_trace.h | 10 +++---
 6 files changed, 99 insertions(+), 32 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe631..ea50901 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1957,7 +1957,7 @@ int ocfs2_readdir(struct file *file, struct dir_context *ctx)
 
 	trace_ocfs2_readdir((unsigned long long)OCFS2_I(inode)->ip_blkno);
 
-	error = ocfs2_inode_lock_atime(inode, file->f_path.mnt, &lock_level);
+	error = ocfs2_inode_lock_atime(inode, file->f_path.mnt, &lock_level, 1);
 	if (lock_level && error >= 0) {
 		/* We release EX lock which used to update atime
 		 * and get PR lock again to reduce contention
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index a68efa3..07e169f 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2515,13 +2515,18 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
 
 int ocfs2_inode_lock_atime(struct inode *inode,
 			  struct vfsmount *vfsmnt,
-			  int *level)
+			  int *level, int wait)
 {
 	int ret;
 
-	ret = ocfs2_inode_lock(inode, NULL, 0);
+	if (wait)
+		ret = ocfs2_inode_lock(inode, NULL, 0);
+	else
+		ret = ocfs2_try_inode_lock(inode, NULL, 0);
+
 	if (ret < 0) {
-		mlog_errno(ret);
+		if (ret != -EAGAIN)
+			mlog_errno(ret);
 		return ret;
 	}
 
@@ -2533,9 +2538,14 @@ int ocfs2_inode_lock_atime(struct inode *inode,
 		struct buffer_head *bh = NULL;
 
 		ocfs2_inode_unlock(inode, 0);
-		ret = ocfs2_inode_lock(inode, &bh, 1);
+		if (wait)
+			ret = ocfs2_inode_lock(inode, &bh, 1);
+		else
+			ret = ocfs2_try_inode_lock(inode, &bh, 1);
+
 		if (ret < 0) {
-			mlog_errno(ret);
+			if (ret != -EAGAIN)
+				mlog_errno(ret);
 			return ret;
 		}
 		*level = 1;
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index 05910fc..c83dbb5 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -123,7 +123,7 @@ void ocfs2_refcount_lock_res_init(struct ocfs2_lock_res *lockres,
 void ocfs2_open_unlock(struct inode *inode);
 int ocfs2_inode_lock_atime(struct inode *inode,
 			  struct vfsmount *vfsmnt,
-			  int *level);
+			  int *level, int wait);
 int ocfs2_inode_lock_full_nested(struct inode *inode,
 				 struct buffer_head **ret_bh,
 				 int ex,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index a1d0510..caef9b1 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -140,6 +140,8 @@ static int ocfs2_file_open(struct inode *inode, struct file *file)
 		spin_unlock(&oi->ip_lock);
 	}
 
+	file->f_mode |= FMODE_NOWAIT;
+
 leave:
 	return status;
 }
@@ -2132,12 +2134,12 @@ static int ocfs2_prepare_inode_for_refcount(struct inode *inode,
 }
 
 static int ocfs2_prepare_inode_for_write(struct file *file,
-					 loff_t pos,
-					 size_t count)
+					 loff_t pos, size_t count, int wait)
 {
-	int ret = 0, meta_level = 0;
+	int ret = 0, meta_level = 0, overwrite_io = 0;
 	struct dentry *dentry = file->f_path.dentry;
 	struct inode *inode = d_inode(dentry);
+	struct buffer_head *di_bh = NULL;
 	loff_t end;
 
 	/*
@@ -2145,13 +2147,40 @@ static int ocfs2_prepare_inode_for_write(struct file *file,
 	 * if we need to make modifications here.
 	 */
 	for(;;) {
-		ret = ocfs2_inode_lock(inode, NULL, meta_level);
+		if (wait)
+			ret = ocfs2_inode_lock(inode, NULL, meta_level);
+		else
+			ret = ocfs2_try_inode_lock(inode,
+				overwrite_io ? NULL : &di_bh, meta_level);
 		if (ret < 0) {
 			meta_level = -1;
-			mlog_errno(ret);
+			if (ret != -EAGAIN)
+				mlog_errno(ret);
 			goto out;
 		}
 
+		/*
+		 * Check if IO will overwrite allocated blocks in case
+
[Ocfs2-devel] [PATCH v3 1/3] ocfs2: add ocfs2_try_rw_lock and ocfs2_try_inode_lock
Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which
will be used in non-block IO scenarios.

Signed-off-by: Gang He
---
 fs/ocfs2/dlmglue.c | 21 +
 fs/ocfs2/dlmglue.h |  4 
 2 files changed, 25 insertions(+)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 4689940..a68efa3 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1742,6 +1742,27 @@ int ocfs2_rw_lock(struct inode *inode, int write)
 	return status;
 }
 
+int ocfs2_try_rw_lock(struct inode *inode, int write)
+{
+	int status, level;
+	struct ocfs2_lock_res *lockres;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	mlog(0, "inode %llu try to take %s RW lock\n",
+	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
+	     write ? "EXMODE" : "PRMODE");
+
+	if (ocfs2_mount_local(osb))
+		return 0;
+
+	lockres = &OCFS2_I(inode)->ip_rw_lockres;
+
+	level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
+
+	status = ocfs2_cluster_lock(osb, lockres, level, DLM_LKF_NOQUEUE, 0);
+	return status;
+}
+
 void ocfs2_rw_unlock(struct inode *inode, int write)
 {
 	int level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index a7fc18b..05910fc 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -116,6 +116,7 @@ void ocfs2_refcount_lock_res_init(struct ocfs2_lock_res *lockres,
 int ocfs2_create_new_inode_locks(struct inode *inode);
 int ocfs2_drop_inode_locks(struct inode *inode);
 int ocfs2_rw_lock(struct inode *inode, int write);
+int ocfs2_try_rw_lock(struct inode *inode, int write);
 void ocfs2_rw_unlock(struct inode *inode, int write);
 int ocfs2_open_lock(struct inode *inode);
 int ocfs2_try_open_lock(struct inode *inode, int write);
@@ -140,6 +141,9 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
 /* 99% of the time we don't want to supply any additional flags --
  * those are for very specific cases only. */
 #define ocfs2_inode_lock(i, b, e) ocfs2_inode_lock_full_nested(i, b, e, 0, OI_LS_NORMAL)
+#define ocfs2_try_inode_lock(i, b, e)\
+		ocfs2_inode_lock_full_nested(i, b, e, OCFS2_META_LOCK_NOQUEUE,\
+		OI_LS_NORMAL)
 void ocfs2_inode_unlock(struct inode *inode,
 		       int ex);
 int ocfs2_super_lock(struct ocfs2_super *osb,
--
1.8.5.6
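As a rough user-space analogy for the try-lock helpers in this patch (an assumption made for illustration, not a description of how the DLM works), flock(2) with LOCK_NB fails immediately instead of waiting, much like a cluster lock request with DLM_LKF_NOQUEUE, and LOCK_SH/LOCK_EX play the role of PR/EX. The names try_rw_lock, rw_unlock and demo_contention below are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Non-blocking lock request: shared for readers, exclusive for writers.
 * Returns 0 on success, -EAGAIN if a conflicting lock is already held,
 * or another negative errno value on failure.
 */
int try_rw_lock(int fd, int write)
{
	int level = write ? LOCK_EX : LOCK_SH;

	if (flock(fd, level | LOCK_NB) == 0)
		return 0;
	return errno == EWOULDBLOCK ? -EAGAIN : -errno;
}

/* Drop whatever lock this open file description holds. */
void rw_unlock(int fd)
{
	flock(fd, LOCK_UN);
}

/*
 * Open one temporary file twice, so that the two descriptors are
 * independent lock owners, take the lock through the first descriptor and
 * return the result of a non-blocking request through the second one.
 */
int demo_contention(int first_write, int second_write)
{
	char path[] = "/tmp/trylock-XXXXXX";
	int fd1 = mkstemp(path);
	int fd2, ret;

	assert(fd1 >= 0);
	fd2 = open(path, O_RDWR);
	assert(fd2 >= 0);
	unlink(path);

	ret = try_rw_lock(fd1, first_write);
	assert(ret == 0);
	ret = try_rw_lock(fd2, second_write);

	rw_unlock(fd1);
	close(fd2);
	close(fd1);
	return ret;
}
```

With this sketch, two shared requests succeed together, while any combination involving an exclusive holder makes the second request return -EAGAIN right away, which is the behaviour a nowait caller relies on.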
[Ocfs2-devel] [PATCH v3 2/3] ocfs2: add ocfs2_overwrite_io function
Add the ocfs2_overwrite_io function, which checks whether a write only
overwrites already allocated blocks; otherwise the write would incur
extra block allocation overhead.

Signed-off-by: Gang He
---
 fs/ocfs2/extent_map.c | 45 +
 fs/ocfs2/extent_map.h |  3 +++
 2 files changed, 48 insertions(+)

diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
index e4719e0..06cb964 100644
--- a/fs/ocfs2/extent_map.c
+++ b/fs/ocfs2/extent_map.c
@@ -38,6 +38,7 @@
 #include "inode.h"
 #include "super.h"
 #include "symlink.h"
+#include "aops.h"
 #include "ocfs2_trace.h"
 
 #include "buffer_head_io.h"
@@ -832,6 +833,50 @@ int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	return ret;
 }
 
+/* Is IO overwriting allocated blocks? */
+int ocfs2_overwrite_io(struct inode *inode, struct buffer_head *di_bh,
+		       u64 map_start, u64 map_len)
+{
+	int ret = 0, is_last;
+	u32 mapping_end, cpos;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct ocfs2_extent_rec rec;
+
+	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+		if (ocfs2_size_fits_inline_data(di_bh, map_start + map_len))
+			return ret;
+		else
+			return -EAGAIN;
+	}
+
+	cpos = map_start >> osb->s_clustersize_bits;
+	mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
+					       map_start + map_len);
+	is_last = 0;
+	while (cpos < mapping_end && !is_last) {
+		ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
+						 NULL, &rec, &is_last);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (rec.e_blkno == 0ULL)
+			break;
+
+		if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
+			break;
+
+		cpos = le32_to_cpu(rec.e_cpos) +
+			le16_to_cpu(rec.e_leaf_clusters);
+	}
+
+	if (cpos < mapping_end)
+		ret = -EAGAIN;
+out:
+	return ret;
+}
+
 int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int whence)
 {
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
index 67ea57d..1057586 100644
--- a/fs/ocfs2/extent_map.h
+++ b/fs/ocfs2/extent_map.h
@@ -53,6 +53,9 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
 int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		 u64 map_start, u64 map_len);
 
+int ocfs2_overwrite_io(struct inode *inode, struct buffer_head *di_bh,
+		       u64 map_start, u64 map_len);
+
 int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin);
 
 int ocfs2_xattr_get_clusters(struct inode *inode, u32 v_cluster,
--
1.8.5.6
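The extent walk in ocfs2_overwrite_io() can be pictured with a small user-space model. The sketch below is an assumption-laden simplification and not the kernel code: struct extent, find_extent, overwrite_io and EXT_REFCOUNTED are hypothetical names, the extent map is a flat array instead of an on-disk tree, and the inline-data case is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_REFCOUNTED 0x1u	/* hypothetical stand-in for OCFS2_EXT_REFCOUNTED */

/* Simplified extent record: a run of clusters; blkno == 0 marks a hole. */
struct extent {
	uint32_t cpos;		/* first logical cluster of the run */
	uint32_t clusters;	/* length of the run in clusters */
	uint64_t blkno;		/* physical start block, 0 if unallocated */
	unsigned int flags;
};

/* Return the extent containing cluster cpos, or NULL if none is mapped. */
const struct extent *find_extent(const struct extent *exts, size_t n,
				 uint64_t cpos)
{
	for (size_t i = 0; i < n; i++) {
		if (cpos >= exts[i].cpos &&
		    cpos < (uint64_t)exts[i].cpos + exts[i].clusters)
			return &exts[i];
	}
	return NULL;
}

/*
 * Return 0 if the byte range [start, start + len) lies entirely on
 * allocated, non-refcounted clusters, or -EAGAIN if writing it would need
 * allocation (hole) or copy-on-write (refcounted extent).
 */
int overwrite_io(const struct extent *exts, size_t n,
		 unsigned int cluster_bits, uint64_t start, uint64_t len)
{
	uint64_t csize = (uint64_t)1 << cluster_bits;
	uint64_t cpos = start >> cluster_bits;
	uint64_t end = (start + len + csize - 1) >> cluster_bits;

	assert(exts != NULL || n == 0);

	while (cpos < end) {
		const struct extent *e = find_extent(exts, n, cpos);

		if (!e || e->blkno == 0 || (e->flags & EXT_REFCOUNTED))
			return -EAGAIN;
		/* Skip to the first cluster after this extent. */
		cpos = (uint64_t)e->cpos + e->clusters;
	}
	return 0;
}
```

For example, with 4 KiB clusters and a map holding a plain extent for clusters 0 to 3 followed by a refcounted extent for clusters 4 to 7, a 16384-byte write at offset 0 is a pure overwrite, while a 16385-byte write touches cluster 4 and is rejected with -EAGAIN.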
Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
Hi Gang,

It looks good to me.

Thanks,
Alex

On 2017/12/28 15:48, Gang He wrote:
> [...]
Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
Hi Gang,

It looks good to me.

Thanks,
Changwei

On 2017/12/28 15:49, Gang He wrote:
> [...]