[Devel] [PATCH rh7] netfilter: Add warning on nft NAT init if "iptable_nat" already loaded

2020-11-30 Thread Konstantin Khorenko
nft NAT cannot work alongside iptables NAT.
The "iptable_nat" module is always loaded on the VZ Node (libvirt triggers
the load), so warn on "nft_nat" module load.

I've added an additional check that the "ip(6)table_nat" modules are really
loaded, since at some point libvirt may stop triggering their load.
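
For reference, a minimal sketch of that check factored into a helper
(illustrative only; it relies on the rh7 field names used in the hunk below,
init_net.ipv4.nat_table and init_net.ipv6.ip6table_nat):

static bool legacy_nat_tables_loaded(void)
{
	/* non-NULL table pointers mean the legacy ip(6)table_nat
	 * modules have already registered their NAT tables */
	return init_net.ipv4.nat_table || init_net.ipv6.ip6table_nat;
}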

https://jira.sw.ru/browse/PSBM-102919
https://jira.sw.ru/browse/PSBM-123111

Signed-off-by: Konstantin Khorenko 
---
 net/netfilter/nft_nat.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nft_nat.c b/net/netfilter/nft_nat.c
index 3883504db5c3..d12d540e1b60 100644
--- a/net/netfilter/nft_nat.c
+++ b/net/netfilter/nft_nat.c
@@ -279,6 +279,12 @@ static struct nft_expr_type nft_nat_type __read_mostly = {
 
 static int __init nft_nat_module_init(void)
 {
+   /* nft NAT does not work if ip(6)table_nat module is loaded */
+   WARN_ONCE(init_net.ipv4.nat_table || init_net.ipv6.ip6table_nat,
+ "WARNING: 'nft_nat' kernel module is being loaded "
+ "while 'ip(6)table_nat' module already loaded. "
+ "nft NAT will not work.\n");
+
return nft_register_expr(&nft_nat_type);
 }
 
-- 
2.24.3


[Devel] [PATCH rh7 8/8] ms/mm/memory.c: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

The unmap_mapping_range() family of functions does the unmapping of user
pages (ultimately via zap_page_range_single()) without touching the actual
interval tree, so the lock can be shared.

Signed-off-by: Davidlohr Bueso 
Cc: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit c8475d144abb1e62958cc5ec281d2a9e161c1946)
Signed-off-by: Andrey Ryabinin 
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e66dea08f3f..3e5124d14996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2712,10 +2712,10 @@ void unmap_mapping_range(struct address_space *mapping,
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-- 
2.26.2


[Devel] [PATCH rh7 7/8] ms/mm/nommu: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify
that any shared mappings of the inode in question aren't broken (dead
zone). As far as I can tell, the only user is ramfs, which uses it to handle
the size-change attribute.

Sharing the lock here is pretty much a no-brainer.

Signed-off-by: Davidlohr Bueso 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 1acf2e040721564d579297646862b8ea3dd4511b)
Signed-off-by: Andrey Ryabinin 
---
 mm/nommu.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index f994621e52f0..290fe3031147 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -2134,14 +2134,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
down_write(&nommu_region_sem);
-   i_mmap_lock_write(inode->i_mapping);
+   i_mmap_lock_read(inode->i_mapping);
 
/* search for VMAs that fall within the dead zone */
vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
/* found one - only interested if it's shared out of the page
 * cache */
if (vma->vm_flags & VM_SHARED) {
-   i_mmap_unlock_write(inode->i_mapping);
+   i_mmap_unlock_read(inode->i_mapping);
up_write(&nommu_region_sem);
return -ETXTBSY; /* not quite true, but near enough */
}
@@ -2153,8 +2153,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 * we don't check for any regions that start beyond the EOF as there
 * shouldn't be any
 */
-   vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap,
- 0, ULONG_MAX) {
+   vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) {
if (!(vma->vm_flags & VM_SHARED))
continue;
 
@@ -2169,7 +2168,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
}
}
 
-   i_mmap_unlock_write(inode->i_mapping);
+   i_mmap_unlock_read(inode->i_mapping);
up_write(&nommu_region_sem);
return 0;
 }
-- 
2.26.2


[Devel] [PATCH rh7 5/8] ms/uprobes: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Both register and unregister call build_map_info() in order to create the
list of mappings before installing or removing breakpoints for every mm
which maps file backed memory.  As such, there is no reason to hold the
i_mmap_rwsem exclusively, so share it and allow concurrent readers to
build the mapping data.

Signed-off-by: Davidlohr Bueso 
Acked-by: Srikar Dronamraju 
Acked-by: "Kirill A. Shutemov" 
Cc: Oleg Nesterov 
Acked-by: Hugh Dickins 
Acked-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 4a23717a236b2ab31efb1651f586126789fc997f)
Signed-off-by: Andrey Ryabinin 
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9f312227a769..be501d8d9704 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -690,7 +690,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
int more = 0;
 
  again:
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -721,7 +721,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 
if (!more)
goto out;
-- 
2.26.2


[Devel] [PATCH rh7 6/8] ms/mm/memory-failure: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

No-brainer conversion: collect_procs_file() only schedules a process for a
later kill; share the lock, similarly to the anon vma variant.

Signed-off-by: Davidlohr Bueso 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit d28eb9c861f41aa2af4cfcc5eeeddff42b13d31e)
Signed-off-by: Andrey Ryabinin 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index da1ef2edd5dd..a5f5e604c0b8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -497,7 +497,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct address_space *mapping = page->mapping;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
qread_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page_to_pgoff(page);
@@ -519,7 +519,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
}
}
qread_unlock(&tasklist_lock);
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 }
 
 /*
-- 
2.26.2


[Devel] [PATCH rh7 1/8] ms/mm, fs: introduce helpers around the i_mmap_mutex

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

This series is a continuation of the conversion of the i_mmap_mutex to
rwsem, following what we have for the anon memory counterpart.  With
Hugh's feedback from the first iteration.

Ultimately, the most obvious paths that require exclusive ownership of the
lock are those that modify the VMA interval tree, via the
vma_interval_tree_insert() and vma_interval_tree_remove() families.  Cases
such as unmapping, where the pte contents are changed but the tree remains
untouched, should make it safe to share the i_mmap_rwsem.
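
A minimal illustrative sketch of that split (example_insert()/example_walk()
are hypothetical wrappers added only for this note; the i_mmap_lock_read()
and i_mmap_lock_write() helpers are the ones used throughout this series):

static void example_insert(struct address_space *mapping,
			   struct vm_area_struct *vma)
{
	i_mmap_lock_write(mapping);	/* exclusive: the interval tree is modified */
	vma_interval_tree_insert(vma, &mapping->i_mmap);
	i_mmap_unlock_write(mapping);
}

static void example_walk(struct address_space *mapping, pgoff_t pgoff)
{
	struct vm_area_struct *vma;

	i_mmap_lock_read(mapping);	/* shared: the tree is only walked */
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		/* inspect pages of this vma; the tree itself stays untouched */
	}
	i_mmap_unlock_read(mapping);
}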

As such, the code of course is straightforward; however, the devil is very
much in the details.  While it's been tested on a number of workloads
without anything exploding, I would not be surprised if there are some
less documented/known assumptions about the lock that could suffer from
these changes.  Or maybe I'm just missing something, but either way I
believe it's at the point where it could use more eyes and hopefully some
time in linux-next.

Because the lock type conversion is the heart of this patchset,
it's worth noting a few comparisons between mutex and rwsem (xadd):

  (i) Same size, no extra footprint.

  (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
   exclusive lock ownership.

  (iii) Both can be slightly unfair wrt exclusive ownership, with
writer lock stealing properties, not necessarily respecting
FIFO order for granting the lock when contended.

  (iv) Mutexes can be slightly faster than rwsems when
   the lock is non-contended.

  (v) Both suck at performance for debug (slowpaths), which
  shouldn't matter anyway.

Sharing the lock is obviously beneficial, and sem writer ownership is
close enough to mutexes.  The biggest winner of these changes is
migration.

As for concrete numbers, the following performance results are for a
4-socket 60-core IvyBridge-EX with 130Gb of RAM.

Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well
with this set, with a steady ~60% throughput (jpm) increase for alltests
and up to ~30% for disk for high amounts of concurrency.  Lower counts of
workload users (< 100) do not show much difference at all, so at least
there are no regressions.

                         3.18-rc1              3.18-rc1-i_mmap_rwsem
alltests-100     17918.72 (  0.00%)     28417.97 ( 58.59%)
alltests-200     16529.39 (  0.00%)     26807.92 ( 62.18%)
alltests-300     16591.17 (  0.00%)     26878.08 ( 62.00%)
alltests-400     16490.37 (  0.00%)     26664.63 ( 61.70%)
alltests-500     16593.17 (  0.00%)     26433.72 ( 59.30%)
alltests-600     16508.56 (  0.00%)     26409.20 ( 59.97%)
alltests-700     16508.19 (  0.00%)     26298.58 ( 59.31%)
alltests-800     16437.58 (  0.00%)     26433.02 ( 60.81%)
alltests-900     16418.35 (  0.00%)     26241.61 ( 59.83%)
alltests-1000    16369.00 (  0.00%)     26195.76 ( 60.03%)
alltests-1100    16330.11 (  0.00%)     26133.46 ( 60.03%)
alltests-1200    16341.30 (  0.00%)     26084.03 ( 59.62%)
alltests-1300    16304.75 (  0.00%)     26024.74 ( 59.61%)
alltests-1400    16231.08 (  0.00%)     25952.35 ( 59.89%)
alltests-1500    16168.06 (  0.00%)     25850.58 ( 59.89%)
alltests-1600    16142.56 (  0.00%)     25767.42 ( 59.62%)
alltests-1700    16118.91 (  0.00%)     25689.58 ( 59.38%)
alltests-1800    16068.06 (  0.00%)     25599.71 ( 59.32%)
alltests-1900    16046.94 (  0.00%)     25525.92 ( 59.07%)
alltests-2000    16007.26 (  0.00%)     25513.07 ( 59.38%)

disk-100  7582.14 (  0.00%) 7257.48 ( -4.28%)
disk-200  6962.44 (  0.00%) 7109.15 (  2.11%)
disk-300  6435.93 (  0.00%) 6904.75 (  7.28%)
disk-400  6370.84 (  0.00%) 6861.26 (  7.70%)
disk-500  6353.42 (  0.00%) 6846.71 (  7.76%)
disk-600  6368.82 (  0.00%) 6806.75 (  6.88%)
disk-700  6331.37 (  0.00%) 6796.01 (  7.34%)
disk-800  6324.22 (  0.00%) 6788.00 (  7.33%)
disk-900  6253.52 (  0.00%) 6750.43 (  7.95%)
disk-1000 6242.53 (  0.00%) 6855.11 (  9.81%)
disk-1100 6234.75 (  0.00%) 6858.47 ( 10.00%)
disk-1200 6312.76 (  0.00%) 6845.13 (  8.43%)
disk-1300 6309.95 (  0.00%) 6834.51 (  8.31%)
disk-1400 6171.76 (  0.00%) 6787.09 (  9.97%)
disk-1500 6139.81 (  0.00%) 6761.09 ( 10.12%)
disk-1600 4807.12 (  0.00%) 6725.33 ( 39.90%)
disk-1700 4669.50 (  0.00%) 5985.38 ( 28.18%)
disk-1800 4663.51 (  0.00%) 5972.99 ( 28.08%)
disk-1900 4674.31 (  0.00%) 5949.94 ( 27.29%)
disk-2000 4668.36 (  0.00%) 5834.93 ( 24.99%)

In addition, there is a 67.5% increase in successfully migrated NUMA pages,
thus improving node locality.

The patch layout is simple but designed for bisection (in case reversion
is needed if the changes break upstream) and easier review:

o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
o Patches 5-10 share the lock in specific paths, each patch
  details the rationale behind why it should be safe.

This p

[Devel] [PATCH rh7 4/8] ms/mm/rmap: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Similarly to the anon memory counterpart, we can share the mapping's lock
ownership as the interval tree is not modified when doing the walk,
only the file page.

Signed-off-by: Davidlohr Bueso 
Acked-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 3dec0ba0be6a532cac949e02b853021bf6d57dad)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/fs.h | 10 ++
 mm/rmap.c  |  9 +
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f422b0f7b02a..acedffc46fe4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -709,6 +709,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping)
up_write(&mapping->i_mmap_rwsem);
 }
 
+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+   down_read(&mapping->i_mmap_rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+   up_read(&mapping->i_mmap_rwsem);
+}
+
 /*
  * Might pages of this file be mapped into userspace?
  */
diff --git a/mm/rmap.c b/mm/rmap.c
index e72be32c3dae..523957450d20 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1723,7 +1723,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
if (!mapping)
return ret;
pgoff = page_to_pgoff(page);
-   down_write_nested(&mapping->i_mmap_rwsem, SINGLE_DEPTH_NESTING);
+
+   i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
 
@@ -1748,7 +1749,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
if (!mapping_mapped(peer))
continue;
 
-   i_mmap_lock_write(peer);
+   i_mmap_lock_read(peer);
 
vma_interval_tree_foreach(vma, &peer->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
@@ -1764,7 +1765,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 
cond_resched();
}
-   i_mmap_unlock_write(peer);
+   i_mmap_unlock_read(peer);
 
if (ret != SWAP_AGAIN)
goto done;
@@ -1772,7 +1773,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
goto done;
}
 done:
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
return ret;
 }
 
-- 
2.26.2


[Devel] [PATCH rh7 3/8] ms/mm: convert i_mmap_mutex to rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory.  To
this end, this lock can also be a rwsem.  In addition, there are some
important opportunities to share the lock when there are no tree
modifications.

This conversion is straightforward.  For now, all users take the write
lock.
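
For reference, a sketch of what the write-side helpers look like after this
conversion (these wrappers live in include/linux/fs.h; the read-side
counterparts are added later in the series):

static inline void i_mmap_lock_write(struct address_space *mapping)
{
	/* exclusive ownership of the i_mmap interval tree */
	down_write(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_unlock_write(struct address_space *mapping)
{
	up_write(&mapping->i_mmap_rwsem);
}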

[s...@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso 
Reviewed-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit c8c06efa8b552608493b7066c234cfa82c47fcea)
Signed-off-by: Andrey Ryabinin 
---
 Documentation/vm/locking |  2 +-
 fs/hugetlbfs/inode.c | 10 +-
 fs/inode.c   |  2 +-
 include/linux/fs.h   |  7 ---
 include/linux/mmu_notifier.h |  2 +-
 kernel/events/uprobes.c  |  2 +-
 mm/filemap.c | 10 +-
 mm/hugetlb.c | 10 +-
 mm/memory.c  |  2 +-
 mm/mmap.c|  6 +++---
 mm/mremap.c  |  2 +-
 mm/rmap.c|  4 ++--
 12 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index f61228bd6395..fb6402884062 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
+The page_table_lock nests with the inode i_mmap_rwsem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fb40a55cc8f1..68f8f2f0eaf5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -757,12 +757,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 }
 
 /*
- * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
+ * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never
  * be taken from reclaim -- unlike regular filesystems. This needs an
  * annotation because huge_pmd_share() does an allocation under hugetlb's
- * i_mmap_mutex.
+ * i_mmap_rwsem.
  */
-struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
 
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct inode *dir,
@@ -779,8 +779,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
if (inode) {
inode->i_ino = get_next_ino();
inode_init_owner(inode, dir, mode);
-   lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
-   &hugetlbfs_i_mmap_mutex_key);
+   lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
+   &hugetlbfs_i_mmap_rwsem_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info = &hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/inode.c b/fs/inode.c
index 5253272c3742..2423a30dda1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,7 +356,7 @@ void address_space_init_once(struct address_space *mapping)
memset(mapping, 0, sizeof(*mapping));
INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
spin_lock_init(&mapping->tree_lock);
-   mutex_init(&mapping->i_mmap_mutex);
+   init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
spin_lock_init(&mapping->private_lock);
mapping->i_mmap = RB_ROOT;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e32cb9b71042..f422b0f7b02a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -626,7 +627,7 @@ struct address_space {
RH_KABI_REPLACE(unsigned int i_mmap_writable,
 atomic_t i_mmap_writable) /* count VM_SHARED mappings */
struct rb_root  i_mmap; /* tree of private and shared mappings */
-   struct mutexi_mmap_mutex;   /* protect tree, count, list */
+   struct rw_semaphore i_mmap_rwsem;   /* protect tree, count, list */
/* Protected by tree_lock together with the radix tree */
unsign

[Devel] [PATCH rh7 2/8] ms/mm: use new helper functions around the i_mmap_mutex

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.
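
The mechanical transformation applied throughout is sketched below
(illustrative only; "mapping" stands for whatever address_space pointer the
call site already has):

	/* before this patch */
	mutex_lock(&mapping->i_mmap_mutex);
	/* ... walk or modify mapping->i_mmap ... */
	mutex_unlock(&mapping->i_mmap_mutex);

	/* after this patch */
	i_mmap_lock_write(mapping);
	/* ... walk or modify mapping->i_mmap ... */
	i_mmap_unlock_write(mapping);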

Signed-off-by: Davidlohr Bueso 
Acked-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 83cde9e8ba95d180eaefefe834958fbf7008cf39)
Signed-off-by: Andrey Ryabinin 
---
 fs/dax.c|  4 ++--
 fs/hugetlbfs/inode.c| 12 ++--
 kernel/events/uprobes.c |  4 ++--
 kernel/fork.c   |  4 ++--
 mm/hugetlb.c| 12 ++--
 mm/memory-failure.c |  4 ++--
 mm/memory.c | 28 ++--
 mm/mmap.c   | 14 +++---
 mm/mremap.c |  4 ++--
 mm/nommu.c  | 14 +++---
 mm/rmap.c   |  6 +++---
 11 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f22e3b32b6cc..7a18745acf01 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -909,7 +909,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
spinlock_t *ptl;
bool changed;
 
-   mutex_lock(&mapping->i_mmap_mutex);
+   i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
unsigned long address;
 
@@ -960,7 +960,7 @@ unlock_pte:
if (changed)
mmu_notifier_invalidate_page(vma->vm_mm, address);
}
-   mutex_unlock(&mapping->i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
 }
 
 static int dax_writeback_one(struct dax_device *dax_dev,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index bdd5c7827391..fb40a55cc8f1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -493,11 +493,11 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
 
-   mutex_lock(&mapping->i_mmap_mutex);
+   i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(&mapping->i_mmap,
next * pages_per_huge_page(h),
(next + 1) * pages_per_huge_page(h));
-   mutex_unlock(&mapping->i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
}
 
lock_page(page);
@@ -553,10 +553,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff = offset >> PAGE_SHIFT;
 
i_size_write(inode, offset);
-   mutex_lock(&mapping->i_mmap_mutex);
+   i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
-   mutex_unlock(&mapping->i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
return 0;
 }
@@ -578,12 +578,12 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
struct address_space *mapping = inode->i_mapping;
 
mutex_lock(&inode->i_mutex);
-   mutex_lock(&mapping->i_mmap_mutex);
+   i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end  >> PAGE_SHIFT);
-   mutex_unlock(&mapping->i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
mutex_unlock(&inode->i_mutex);
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a5a59cc93fb6..816ad8e3d92f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -690,7 +690,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
int more = 0;
 
  again:
-   mutex_lock(&mapping->i_mmap_mutex);
+   i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -721,7 +721,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
-   mutex_unlock(&mapping->i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
 
if (!more)
goto out;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9467e21a8fa4..b6a5279403be 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -5

[Devel] [PATCH rh7] Revert "mm: Port diff-mm-vmscan-disable-fs-related-activity-for-direct-direct-reclaim"

2020-11-30 Thread Andrey Ryabinin
This reverts commit 50fb388878b646872b78143de3c1bf3fa6f7f148.
Sometimes we can see a lot of reclaimable dcache and no other reclaimable
memory. It looks like kswapd can't keep up with reclaiming dcache fast enough.

Commit 50fb388878b6 forbids reclaiming dcache in direct reclaim to prevent
potential deadlocks that might happen due to bugs in other subsystems.
Revert it to allow more aggressive dcache reclaim. This is unlikely to cause
any problems since we already reclaim dcache directly in memcg reclaim,
so let's do the same for global reclaim.
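
For clarity, a sketch of the condition being removed (reconstructed from the
hunk below; the helper name is made up for illustration):

static bool is_global_direct_reclaim(struct scan_control *sc)
{
	/* no target memcg => global reclaim; PF_MEMALLOC without PF_KSWAPD
	 * => we entered from the allocation path rather than from kswapd */
	return !sc->target_mem_cgroup &&
	       (current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC;
}

Only when this was true did the reverted commit clear __GFP_FS before
calling shrink_slab(); that restriction is what this revert lifts.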

https://jira.sw.ru/browse/PSBM-122663
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85622f235e78..240435eb6d84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2653,15 +2653,9 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc,
 {
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
-   gfp_t slab_gfp = sc->gfp_mask;
bool slab_only = sc->slab_only;
bool retry;
 
-   /* Disable fs-related IO for direct reclaim */
-   if (!sc->target_mem_cgroup &&
-   (current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
-   slab_gfp &= ~__GFP_FS;
-
do {
struct mem_cgroup *root = sc->target_mem_cgroup;
struct mem_cgroup_reclaim_cookie reclaim = {
@@ -2695,7 +2689,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc,
}
 
if (is_classzone) {
-   shrink_slab(slab_gfp, zone_to_nid(zone),
+   shrink_slab(sc->gfp_mask, zone_to_nid(zone),
memcg, sc->priority, false);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-- 
2.26.2
