Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-28 Thread piaojun
LGTM

On 2017/12/28 15:48, Gang He wrote:
> If we can't get the inode lock immediately in
> ocfs2_inode_lock_with_page() when reading a page, we should not
> return directly, since this leads to a softlockup problem
> when the kernel is built without CONFIG_PREEMPT.
> The fix is to take a blocking lock and immediately unlock it before
> returning. This avoids wasting CPU on a large number of retries,
> improves fairness in acquiring the lock among multiple nodes, and
> increases efficiency when the same file is modified frequently from
> multiple nodes.
> The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
> looks like this:
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
>   <IRQ>
>   dump_stack+0x5c/0x82
>   panic+0xd5/0x21e
>   watchdog_timer_fn+0x208/0x210
>   ? watchdog_park_threads+0x70/0x70
>   __hrtimer_run_queues+0xcc/0x200
>   hrtimer_interrupt+0xa6/0x1f0
>   smp_apic_timer_interrupt+0x34/0x50
>   apic_timer_interrupt+0x96/0xa0
>   </IRQ>
>  RIP: 0010:unlock_page+0x17/0x30
>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>  RBP:  R08:  R09: af154080bb00
>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>  R13:  R14: f21e009f5300 R15: f21e009f5300
>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>   ? pagecache_get_page+0x30/0x200
>   filemap_fault+0x12b/0x5c0
>   ? recalc_sigpending+0x17/0x50
>   ? __set_task_blocked+0x28/0x70
>   ? __set_current_blocked+0x3d/0x60
>   ocfs2_fault+0x29/0xb0 [ocfs2]
>   __do_fault+0x1a/0xa0
>   __handle_mm_fault+0xbe8/0x1090
>   handle_mm_fault+0xaa/0x1f0
>   __do_page_fault+0x235/0x4b0
>   trace_do_page_fault+0x3c/0x110
>   async_page_fault+0x28/0x30
>  RIP: 0033:0x7fa75ded638e
>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>  RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>  RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>  RBP: 0003 R08: 000e R09: 
>  R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>  R13: 000e R14: 1770 R15: 
> 
> As for performance, the test run time is reduced and CPU utilization
> decreases; the detailed data is as follows.
> I ran the multi_mmap test case from the ocfs2-test package in a three-node cluster.
> Before applying this patch,
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
>  2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341  0:18.71 multi_mmap
>  1505 root      rt   0      36 123060  97224 S 2.658 6.015  0:01.44 corosync
>     5 root      20   0       0      0      0 S 1.329 0.000  0:00.19 kworker/u8:0
>    95 root      20   0       0      0      0 S 1.329 0.000  0:00.25 kworker/u8:1
>  2728 root      20   0       0      0      0 S 0.997 0.000  0:00.24 jbd2/sda1-33
>  2721 root      20   0       0      0      0 S 0.664 0.000  0:00.07 ocfs2dc-3C8CFD4
>  2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227  0:00.28 mpirun
> 
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o 
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d 
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 14:44:52 CST 2017
> multi_mmap..Passed.
> Runtime 783 seconds.
> 
> After applying this patch,
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
>  2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333  0:55.37 multi_mmap
>   155 root      20   0       0      0      0 S 2.667 0.000  0:01.20 kworker/u8:3
>    95 root      20   0       0      0      0 S 2.000 0.000  0:01.58 kworker/u8:1
>  2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225  0:01.65 mpirun
>     5 root      20   0       0      0      0 S 1.000 0.000  0:01.36 kworker/u8:0
>  2482 root      20   0       0      0      0 S 1.000 0.000  0:00.86 jbd2/sda1-33
>   299 root       0 -20       0      0      0 S 0.333 0.000  0:00.13 kworker/2:1H
>   335 root       0 -20       0      0      0 S 0.333 0.000  0:00.17 kworker/1:1H
>   535 root      20   0   12140   7268   1456 S 0.333 0.355  0:00.34 haveged
>  1282 root      rt   0      84 123108  97224 S 0.333 6.017  0:01.33 corosync
> 
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o 
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d 
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 15:04:12 CST 2017
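
The retry behaviour described in the quoted commit message can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the kernel code: struct fake_lock, its
busy_ticks counter, RETRY_PAGE (standing in for AOP_TRUNCATED_PAGE) and
all function names are hypothetical, and "blocking" is modelled by
clearing the counter instead of sleeping.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's AOP_TRUNCATED_PAGE (hypothetical value). */
#define RETRY_PAGE 1

/* Hypothetical model of a cluster lock currently held by another node. */
struct fake_lock {
	int busy_ticks;	/* failed try-lock attempts left until the holder is done */
	int held;	/* 1 while the caller holds the lock */
};

/* Non-blocking attempt: fails with -EAGAIN while the other node holds it. */
static int fake_try_lock(struct fake_lock *l)
{
	if (l->busy_ticks > 0) {
		l->busy_ticks--;
		return -EAGAIN;
	}
	l->held = 1;
	return 0;
}

/* Blocking attempt: models sleeping until the holder releases the lock. */
static void fake_lock_blocking(struct fake_lock *l)
{
	l->busy_ticks = 0;
	l->held = 1;
}

static void fake_unlock(struct fake_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Old behaviour: report "retry" immediately when the lock is busy. */
static int lock_with_page_old(struct fake_lock *l)
{
	int ret = fake_try_lock(l);

	return ret == -EAGAIN ? RETRY_PAGE : ret;
}

/* Patched behaviour: wait for the lock once, drop it, then report "retry". */
static int lock_with_page_new(struct fake_lock *l)
{
	int ret = fake_try_lock(l);

	if (ret == -EAGAIN) {
		fake_lock_blocking(l);
		fake_unlock(l);
		ret = RETRY_PAGE;
	}
	return ret;
}

/* Caller loop in the spirit of a page-fault retry; returns attempts made. */
static int read_page_attempts(struct fake_lock *l,
			      int (*lock_fn)(struct fake_lock *))
{
	int attempts = 0;
	int ret;

	do {
		attempts++;
		ret = lock_fn(l);
	} while (ret == RETRY_PAGE);

	fake_unlock(l);
	return attempts;
}
```

In this model, a lock that starts with busy_ticks = 1000 costs 1001
attempts with lock_with_page_old() and 2 attempts with
lock_with_page_new(), which is the kind of retry storm the patch is
meant to avoid.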

[Ocfs2-devel] [PATCH v3 0/3] ocfs2: add nowait aio support

2017-12-28 Thread Gang He
As you know, the VFS layer has introduced the non-blocking
AIO flag IOCB_NOWAIT, which tells the kernel to bail out
if an AIO request would block for reasons such as file
allocation, triggered writeback, or blocking while
allocating requests during direct I/O.
Subsequently, pwritev2/preadv2 can also make use of this
kernel code.
So far, ext4/xfs/btrfs support this feature;
I'd like to add the related code for the ocfs2 file system.

Compared with v2, ocfs2_overwrite_io() is modified to handle
the OCFS2_INLINE_DATA_FL case.
Compared with v1, the changes are as follows:
use the osb pointer in ocfs2_try_rw_lock(),
modify ocfs2_overwrite_io() so that all error
values are returned to the caller,
move the call to ocfs2_overwrite_io() from
ocfs2_file_write_iter() to ocfs2_prepare_inode_for_write(),
so that acquiring the related locks can be combined.

Gang He (3):
  ocfs2: add ocfs2_try_rw_lock and ocfs2_try_inode_lock
  ocfs2: add ocfs2_overwrite_io function
  ocfs2: nowait aio support

 fs/ocfs2/dir.c         |  2 +-
 fs/ocfs2/dlmglue.c     | 41 +++---
 fs/ocfs2/dlmglue.h     |  6 +++-
 fs/ocfs2/extent_map.c  | 45 
 fs/ocfs2/extent_map.h  |  3 ++
 fs/ocfs2/file.c        | 95 +++---
 fs/ocfs2/mmap.c        |  2 +-
 fs/ocfs2/ocfs2_trace.h | 10 +++---
 8 files changed, 172 insertions(+), 32 deletions(-)

-- 
1.8.5.6


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
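
From the application side, the nowait behaviour described in the cover
letter is typically used as "try without blocking, fall back if the
kernel says it would block". The following is a minimal userspace
sketch using pwritev2() with RWF_NOWAIT. The helper name
write_nowait_or_block() and the fallback policy are assumptions made
for illustration; depending on kernel version and filesystem, an
unsupported flag/file combination may be reported as EOPNOTSUPP or
EINVAL rather than EAGAIN, so the sketch treats all three as "use the
blocking path".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008	/* value from the Linux UAPI headers */
#endif

/*
 * Try a non-blocking write first. If the kernel reports that it would
 * block (EAGAIN) or that the flag is not supported for this file
 * (EOPNOTSUPP/EINVAL), repeat the write without RWF_NOWAIT.
 * Returns the number of bytes written or -errno; *fell_back records
 * whether the blocking path was used.
 */
static ssize_t write_nowait_or_block(int fd, const void *buf, size_t len,
				     off_t off, int *fell_back)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t n;

	*fell_back = 0;
	n = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);
	if (n >= 0)
		return n;
	if (errno != EAGAIN && errno != EOPNOTSUPP && errno != EINVAL)
		return -errno;

	/* A real application might hand this off to a worker thread. */
	*fell_back = 1;
	n = pwritev2(fd, &iov, 1, off, 0);
	return n >= 0 ? n : -errno;
}
```

The series targets direct I/O, so a caller exercising it on ocfs2 would
presumably open the file with O_DIRECT and pass suitably aligned
buffers; the helper itself does not depend on that.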


[Ocfs2-devel] [PATCH v3 3/3] ocfs2: nowait aio support

2017-12-28 Thread Gang He
Return -EAGAIN if any of the following checks fails for
direct I/O with the nowait flag:
the related locks cannot be acquired immediately, or
blocks are not allocated at the write location, in which case the
write would trigger block allocation and block the I/O operation.

Signed-off-by: Gang He 
---
 fs/ocfs2/dir.c         |  2 +-
 fs/ocfs2/dlmglue.c     | 20 ---
 fs/ocfs2/dlmglue.h     |  2 +-
 fs/ocfs2/file.c        | 95 +++---
 fs/ocfs2/mmap.c        |  2 +-
 fs/ocfs2/ocfs2_trace.h | 10 +++---
 6 files changed, 99 insertions(+), 32 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe631..ea50901 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1957,7 +1957,7 @@ int ocfs2_readdir(struct file *file, struct dir_context *ctx)
 
 	trace_ocfs2_readdir((unsigned long long)OCFS2_I(inode)->ip_blkno);
 
-	error = ocfs2_inode_lock_atime(inode, file->f_path.mnt, &lock_level);
+	error = ocfs2_inode_lock_atime(inode, file->f_path.mnt, &lock_level, 1);
 	if (lock_level && error >= 0) {
 		/* We release EX lock which used to update atime
 		 * and get PR lock again to reduce contention
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index a68efa3..07e169f 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2515,13 +2515,18 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
 
 int ocfs2_inode_lock_atime(struct inode *inode,
 			  struct vfsmount *vfsmnt,
-			  int *level)
+			  int *level, int wait)
 {
 	int ret;
 
-	ret = ocfs2_inode_lock(inode, NULL, 0);
+	if (wait)
+		ret = ocfs2_inode_lock(inode, NULL, 0);
+	else
+		ret = ocfs2_try_inode_lock(inode, NULL, 0);
+
 	if (ret < 0) {
-		mlog_errno(ret);
+		if (ret != -EAGAIN)
+			mlog_errno(ret);
 		return ret;
 	}
 
@@ -2533,9 +2538,14 @@ int ocfs2_inode_lock_atime(struct inode *inode,
 		struct buffer_head *bh = NULL;
 
 		ocfs2_inode_unlock(inode, 0);
-		ret = ocfs2_inode_lock(inode, &bh, 1);
+		if (wait)
+			ret = ocfs2_inode_lock(inode, &bh, 1);
+		else
+			ret = ocfs2_try_inode_lock(inode, &bh, 1);
+
 		if (ret < 0) {
-			mlog_errno(ret);
+			if (ret != -EAGAIN)
+				mlog_errno(ret);
 			return ret;
 		}
 		*level = 1;
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index 05910fc..c83dbb5 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -123,7 +123,7 @@ void ocfs2_refcount_lock_res_init(struct ocfs2_lock_res *lockres,
 void ocfs2_open_unlock(struct inode *inode);
 int ocfs2_inode_lock_atime(struct inode *inode,
 			  struct vfsmount *vfsmnt,
-			  int *level);
+			  int *level, int wait);
 int ocfs2_inode_lock_full_nested(struct inode *inode,
 				 struct buffer_head **ret_bh,
 				 int ex,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index a1d0510..caef9b1 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -140,6 +140,8 @@ static int ocfs2_file_open(struct inode *inode, struct file *file)
 		spin_unlock(&oi->ip_lock);
 	}
 
+	file->f_mode |= FMODE_NOWAIT;
+
 leave:
 	return status;
 }
@@ -2132,12 +2134,12 @@ static int ocfs2_prepare_inode_for_refcount(struct inode *inode,
 }
 
 static int ocfs2_prepare_inode_for_write(struct file *file,
-					 loff_t pos,
-					 size_t count)
+					 loff_t pos, size_t count, int wait)
 {
-	int ret = 0, meta_level = 0;
+	int ret = 0, meta_level = 0, overwrite_io = 0;
 	struct dentry *dentry = file->f_path.dentry;
 	struct inode *inode = d_inode(dentry);
+	struct buffer_head *di_bh = NULL;
 	loff_t end;
 
 	/*
@@ -2145,13 +2147,40 @@ static int ocfs2_prepare_inode_for_write(struct file *file,
 	 * if we need to make modifications here.
 	 */
 	for(;;) {
-		ret = ocfs2_inode_lock(inode, NULL, meta_level);
+		if (wait)
+			ret = ocfs2_inode_lock(inode, NULL, meta_level);
+		else
+			ret = ocfs2_try_inode_lock(inode,
+				overwrite_io ? NULL : &di_bh, meta_level);
 		if (ret < 0) {
 			meta_level = -1;
-			mlog_errno(ret);
+			if (ret != -EAGAIN)
+				mlog_errno(ret);
 			goto out;
 		}
 
+		/*
+		 * Check if IO will overwrite allocated blocks in case
+  

[Ocfs2-devel] [PATCH v3 1/3] ocfs2: add ocfs2_try_rw_lock and ocfs2_try_inode_lock

2017-12-28 Thread Gang He
Add the ocfs2_try_rw_lock() and ocfs2_try_inode_lock() functions, which
will be used in non-blocking I/O scenarios.

Signed-off-by: Gang He 
---
 fs/ocfs2/dlmglue.c | 21 +
 fs/ocfs2/dlmglue.h |  4 
 2 files changed, 25 insertions(+)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 4689940..a68efa3 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1742,6 +1742,27 @@ int ocfs2_rw_lock(struct inode *inode, int write)
 	return status;
 }
 
+int ocfs2_try_rw_lock(struct inode *inode, int write)
+{
+	int status, level;
+	struct ocfs2_lock_res *lockres;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	mlog(0, "inode %llu try to take %s RW lock\n",
+	     (unsigned long long)OCFS2_I(inode)->ip_blkno,
+	     write ? "EXMODE" : "PRMODE");
+
+	if (ocfs2_mount_local(osb))
+		return 0;
+
+	lockres = &OCFS2_I(inode)->ip_rw_lockres;
+
+	level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
+
+	status = ocfs2_cluster_lock(osb, lockres, level, DLM_LKF_NOQUEUE, 0);
+	return status;
+}
+
 void ocfs2_rw_unlock(struct inode *inode, int write)
 {
 	int level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index a7fc18b..05910fc 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -116,6 +116,7 @@ void ocfs2_refcount_lock_res_init(struct ocfs2_lock_res *lockres,
 int ocfs2_create_new_inode_locks(struct inode *inode);
 int ocfs2_drop_inode_locks(struct inode *inode);
 int ocfs2_rw_lock(struct inode *inode, int write);
+int ocfs2_try_rw_lock(struct inode *inode, int write);
 void ocfs2_rw_unlock(struct inode *inode, int write);
 int ocfs2_open_lock(struct inode *inode);
 int ocfs2_try_open_lock(struct inode *inode, int write);
@@ -140,6 +141,9 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
 /* 99% of the time we don't want to supply any additional flags --
  * those are for very specific cases only. */
 #define ocfs2_inode_lock(i, b, e) ocfs2_inode_lock_full_nested(i, b, e, 0, OI_LS_NORMAL)
+#define ocfs2_try_inode_lock(i, b, e)\
+		ocfs2_inode_lock_full_nested(i, b, e, OCFS2_META_LOCK_NOQUEUE,\
+		OI_LS_NORMAL)
 void ocfs2_inode_unlock(struct inode *inode,
 		       int ex);
 int ocfs2_super_lock(struct ocfs2_super *osb,
-- 
1.8.5.6
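
The difference between the blocking lock calls and the new try
variants can be shown with a toy userspace lock. This is a sketch
under stated assumptions and says nothing about how the DLM works
internally: struct toy_lock and the toy_* functions are hypothetical,
and the lock is a plain C11 atomic_flag. The only property modelled is
that a "no queue" request fails with -EAGAIN instead of waiting, and
that callers select the behaviour through a wait argument, as the
series later does for ocfs2_inode_lock_atime().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical userspace lock; it only models "queue" versus "no queue". */
struct toy_lock {
	atomic_flag taken;
};

/* Like a NOQUEUE request: fail with -EAGAIN instead of waiting. */
static int toy_try_lock(struct toy_lock *l)
{
	return atomic_flag_test_and_set(&l->taken) ? -EAGAIN : 0;
}

/* Like a normal request: keep waiting until the lock is free. */
static int toy_lock(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->taken))
		;	/* a real implementation would sleep, not spin */
	return 0;
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->taken);
}

/*
 * wait != 0 may block; wait == 0 returns -EAGAIN when the lock is busy.
 * -EAGAIN is an expected outcome for nowait callers, so it is not
 * reported as an error. (In this toy no other error can occur.)
 */
static int toy_lock_maybe_wait(struct toy_lock *l, int wait)
{
	int ret = wait ? toy_lock(l) : toy_try_lock(l);

	if (ret < 0 && ret != -EAGAIN)
		fprintf(stderr, "lock failed: %d\n", ret);
	return ret;
}
```

A lock initialised with ATOMIC_FLAG_INIT can be taken once with
toy_lock_maybe_wait(&l, 0); a second nowait attempt returns -EAGAIN
until toy_unlock() is called.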




[Ocfs2-devel] [PATCH v3 2/3] ocfs2: add ocfs2_overwrite_io function

2017-12-28 Thread Gang He
Add the ocfs2_overwrite_io() function, which checks whether a write
only overwrites already allocated blocks; otherwise, the write would
incur extra block allocation overhead.

Signed-off-by: Gang He 
---
 fs/ocfs2/extent_map.c | 45 +
 fs/ocfs2/extent_map.h |  3 +++
 2 files changed, 48 insertions(+)

diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
index e4719e0..06cb964 100644
--- a/fs/ocfs2/extent_map.c
+++ b/fs/ocfs2/extent_map.c
@@ -38,6 +38,7 @@
 #include "inode.h"
 #include "super.h"
 #include "symlink.h"
+#include "aops.h"
 #include "ocfs2_trace.h"
 
 #include "buffer_head_io.h"
@@ -832,6 +833,50 @@ int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	return ret;
 }
 
+/* Is IO overwriting allocated blocks? */
+int ocfs2_overwrite_io(struct inode *inode, struct buffer_head *di_bh,
+		       u64 map_start, u64 map_len)
+{
+	int ret = 0, is_last;
+	u32 mapping_end, cpos;
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct ocfs2_extent_rec rec;
+
+	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+		if (ocfs2_size_fits_inline_data(di_bh, map_start + map_len))
+			return ret;
+		else
+			return -EAGAIN;
+	}
+
+	cpos = map_start >> osb->s_clustersize_bits;
+	mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
+					       map_start + map_len);
+	is_last = 0;
+	while (cpos < mapping_end && !is_last) {
+		ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
+						 NULL, &rec, &is_last);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+
+		if (rec.e_blkno == 0ULL)
+			break;
+
+		if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
+			break;
+
+		cpos = le32_to_cpu(rec.e_cpos) +
+			le16_to_cpu(rec.e_leaf_clusters);
+	}
+
+	if (cpos < mapping_end)
+		ret = -EAGAIN;
+out:
+	return ret;
+}
+
 int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int whence)
 {
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
index 67ea57d..1057586 100644
--- a/fs/ocfs2/extent_map.h
+++ b/fs/ocfs2/extent_map.h
@@ -53,6 +53,9 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
 int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		 u64 map_start, u64 map_len);
 
+int ocfs2_overwrite_io(struct inode *inode, struct buffer_head *di_bh,
+		       u64 map_start, u64 map_len);
+
 int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin);
 
 int ocfs2_xattr_get_clusters(struct inode *inode, u32 v_cluster,
-- 
1.8.5.6
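
The idea behind the overwrite check can be tried out in userspace with
a simplified extent list. This is a sketch under stated assumptions,
not the ocfs2 implementation: struct toy_extent, toy_find_extent() and
toy_overwrite_io() are hypothetical, extents live in a flat array
instead of an on-disk tree, and the inline-data case is left out. The
walk returns -EAGAIN as soon as the byte range touches a hole, a
refcounted extent, or a cluster not covered by any extent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record; units are clusters. */
struct toy_extent {
	uint32_t cpos;		/* first logical cluster covered */
	uint32_t clusters;	/* length in clusters */
	int allocated;		/* 0 models a hole */
	int refcounted;		/* 1 models shared (copy-on-write) clusters */
};

/* Return the extent that covers cpos, or NULL if there is none. */
static const struct toy_extent *toy_find_extent(const struct toy_extent *map,
						size_t n, uint32_t cpos)
{
	for (size_t i = 0; i < n; i++)
		if (cpos >= map[i].cpos && cpos - map[i].cpos < map[i].clusters)
			return &map[i];
	return NULL;
}

/*
 * Return 0 if the byte range [start, start + len) only touches
 * allocated, unshared clusters, and -EAGAIN if writing it would
 * require allocation (and therefore might block).
 */
static int toy_overwrite_io(const struct toy_extent *map, size_t n,
			    unsigned int cluster_bits,
			    uint64_t start, uint64_t len)
{
	uint64_t bytes_per_cluster = 1ULL << cluster_bits;
	uint32_t cpos = (uint32_t)(start >> cluster_bits);
	/* Round the end of the range up to a whole cluster. */
	uint32_t end = (uint32_t)((start + len + bytes_per_cluster - 1)
				  >> cluster_bits);

	while (cpos < end) {
		const struct toy_extent *e = toy_find_extent(map, n, cpos);

		if (!e || !e->allocated || e->refcounted)
			return -EAGAIN;
		cpos = e->cpos + e->clusters;	/* jump past this extent */
	}
	return 0;
}
```

With 32 KiB clusters (cluster_bits = 15), a write confined to an
allocated, unshared extent returns 0, while a write that spills one
byte into a following hole returns -EAGAIN.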




Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-28 Thread Joseph Qi



Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-28 Thread alex chen
Hi Gang,

It looks good to me.

Thanks,
Alex



Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2017-12-28 Thread Changwei Ge
Hi Gang,

It looks good to me.

Thanks,
Changwei
