[Devel] [PATCH rh7] mm/memcg: sleep if mem_cgroup_force_empty_list() stumped on busy page

2017-09-05 Thread Andrey Ryabinin
mem_cgroup_force_empty_list() is executed in workqueue context. If a work item
does not go to sleep, the workqueue engine assumes that it is making progress,
so it sees no need to start more workers to execute other work items.

So, if other work items must run in order to unlock those pages, we might
deadlock. I think this might easily happen with fuse, something
like this:

fuse:                           cgroup_destroy work:
                                  mem_cgroup_force_empty_list()
                                  //busy wait for fuse to unlock pages.
 queue_work()

 //this may have to wait
 //for mem_cgroup_force_empty_list
 //to finish
 flush_work()

 read pages and unlock them.

The solution is to have mem_cgroup_force_empty_list() take a short sleep
(schedule_timeout_uninterruptible(1)) instead of calling cond_resched(). This
lets other work items make progress while mem_cgroup_force_empty_list() is stuck.
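
The reasoning above can be illustrated with a toy userspace model. This is only a sketch, not kernel code: it assumes a single pool with one running worker, and the names drain_rounds, wait_mode and unlock_pending are made up for the illustration. It shows that a worker which never blocks keeps the queued "unlock" work item from running, while a worker that sleeps lets the pool execute it.

```c
#include <stdbool.h>

/* How the draining work item waits when it finds a busy page. */
enum wait_mode {
    WAIT_COND_RESCHED,  /* stays runnable: the pool sees a busy worker */
    WAIT_SHORT_SLEEP    /* blocks briefly: the pool may run another item */
};

/*
 * Toy model, not kernel code: one pool, one running worker.
 * Returns the polling round in which the drain sees the page unlocked,
 * or -1 if it never does within max_rounds.
 */
int drain_rounds(enum wait_mode mode, int max_rounds)
{
    bool page_locked = true;     /* page held by the filesystem */
    bool unlock_pending = true;  /* queued work item that would unlock it */

    for (int round = 1; round <= max_rounds; round++) {
        if (!page_locked)
            return round;        /* drain can finally move the page */

        if (mode == WAIT_SHORT_SLEEP && unlock_pending) {
            /* Worker sleeps, so the pool executes the pending item. */
            page_locked = false;
            unlock_pending = false;
        }
        /* WAIT_COND_RESCHED: nothing else in the pool gets to run. */
    }
    return -1;
}
```

In this model drain_rounds(WAIT_COND_RESCHED, n) never succeeds for any n, while drain_rounds(WAIT_SHORT_SLEEP, n) succeeds on the round after the first sleep.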

https://jira.sw.ru/browse/PSBM-70021
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 940cca31ed5d..b09d5be27444 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4034,7 +4034,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
if (mem_cgroup_move_parent(page, pc, memcg)) {
/* found lock contention or "pc" is obsolete. */
busy = page;
-   cond_resched();
+   schedule_timeout_uninterruptible(1);
} else
busy = NULL;
} while (!list_empty(list));
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RHEL7 COMMIT] ve/autofs: drop fix double pid put in error path and leaked pid on error path in autofs4_fill_super

2017-09-05 Thread Konstantin Khorenko

On 09/05/2017 04:26 PM, Vasily Averin wrote:

Kostja,
why does it change autofs_sb_info?
That hunk looks unrelated to the problem,
and so does the first hunk in fs/autofs4/inode.c.


The patch rolls back hunks of our earlier patch;
none of those hunks are needed now.

Agreed, the bug itself is fixed by the last 2 hunks alone; I noted that in the bug report.
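
A minimal sketch of why an extra put on an error path unbalances a reference count, using a toy counter instead of the kernel's struct pid. The names ref, fill_super_model and extra_put are hypothetical, and the sketch assumes that the shared cleanup already drops the reference once; the real autofs4_fill_super() in this tree may be structured differently.

```c
/* Toy reference counter standing in for struct pid (illustration only). */
struct ref { int count; };

static void ref_get(struct ref *r) { r->count++; }
static void ref_put(struct ref *r) { r->count--; }

/*
 * Hypothetical fill_super-like function. Assumption: the reference taken
 * here is dropped exactly once by the shared cleanup below.
 * extra_put models an additional fail_put_pid label on the error path.
 */
int fill_super_model(struct ref *pgrp, int pipe_ok, int extra_put)
{
    ref_get(pgrp);
    if (!pipe_ok) {
        if (extra_put)
            goto fail_put_pid;
        goto fail_dput;
    }
    return 0;

fail_put_pid:
    ref_put(pgrp);      /* second put: unbalances the counter */
fail_dput:
    /* dput(root) would go here in the real function */
    ref_put(pgrp);      /* the one put that matches ref_get() above */
    return -1;
}
```

With extra_put set, a failed call leaves the counter one lower than it was before the call; without it, the counter is unchanged.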



On 2017-09-05 15:47, Konstantin Khorenko wrote:

Please consider preparing a ReadyKernel patch for it.

This is needed for all kernels prior to 3.10.0-693.x

https://readykernel.com/

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 09/05/2017 03:18 PM, Konstantin Khorenko wrote:

The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.6
-->
commit e851cc10aa14e1ca311187fda9a3a53a5e3dee25
Author: Konstantin Khorenko 
Date:   Tue Sep 5 15:13:26 2017 +0300

ve/autofs: drop fix double pid put in error path and leaked pid on error 
path in autofs4_fill_super

Drop redundant hunks of 078889e ("VE/AUTOFS: port 71-diff-autofs-combined"),
they lead to unbalanced pid get/put in autofs4_fill_super().

Fixes: 078889e ("VE/AUTOFS: port 71-diff-autofs-combined")

Signed-off-by: Konstantin Khorenko 
---
 fs/autofs4/autofs_i.h | 1 -
 fs/autofs4/inode.c| 6 ++
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c957d14..39f197c 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -123,7 +123,6 @@ struct autofs_sb_info {
 struct list_head active_list;
 struct list_head expiring_list;
 struct rcu_head rcu;
-unsigned is32bit:1;
 };

 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index b23cf2a..af7506c 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -127,7 +127,7 @@ static const match_table_t tokens = {
 {Opt_indirect, "indirect"},
 {Opt_direct, "direct"},
 {Opt_offset, "offset"},
-{Opt_err, NULL}
+{Opt_err, NULL}
 };

 static int parse_options(char *options, int *pipefd, kuid_t *uid, kgid_t *gid,
@@ -313,7 +313,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)

 if (!pipe) {
 printk("autofs: could not open pipe file descriptor\n");
-goto fail_put_pid;
+goto fail_dput;
 }
 ret = autofs_prepare_pipe(pipe);
 if (ret < 0)
@@ -335,8 +335,6 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 printk("autofs: pipe file descriptor does not contain proper ops\n");
 fput(pipe);
 /* fall through */
-fail_put_pid:
-put_pid(sbi->oz_pgrp);
 fail_dput:
 dput(root);
 goto fail_free;


Re: [Devel] [PATCH RHEL7 COMMIT] ve/autofs: drop fix double pid put in error path and leaked pid on error path in autofs4_fill_super

2017-09-05 Thread Vasily Averin
Kostja,
why does it change autofs_sb_info?
That hunk looks unrelated to the problem,
and so does the first hunk in fs/autofs4/inode.c.

On 2017-09-05 15:47, Konstantin Khorenko wrote:
> Please consider to prepare a ReadyKernel patch for it.
> 
> This is needed for all kernels prior to 3.10.0-693.x
> 
> https://readykernel.com/
> 
> -- 
> Best regards,
> 
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
> 
> On 09/05/2017 03:18 PM, Konstantin Khorenko wrote:
>> The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will 
>> appear at https://src.openvz.org/scm/ovz/vzkernel.git
>> after rh7-3.10.0-514.26.1.vz7.35.6
>> -->
>> commit e851cc10aa14e1ca311187fda9a3a53a5e3dee25
>> Author: Konstantin Khorenko 
>> Date:   Tue Sep 5 15:13:26 2017 +0300
>>
>> ve/autofs: drop fix double pid put in error path and leaked pid on error 
>> path in autofs4_fill_super
>>
>> Drop redundant hunks of 078889e ("VE/AUTOFS: port 
>> 71-diff-autofs-combined"),
>> they lead to unbalanced pid get/put in autofs4_fill_super().
>>
>> Fixes: 078889e ("VE/AUTOFS: port 71-diff-autofs-combined")
>>
>> Signed-off-by: Konstantin Khorenko 
>> ---
>>  fs/autofs4/autofs_i.h | 1 -
>>  fs/autofs4/inode.c| 6 ++
>>  2 files changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>> index c957d14..39f197c 100644
>> --- a/fs/autofs4/autofs_i.h
>> +++ b/fs/autofs4/autofs_i.h
>> @@ -123,7 +123,6 @@ struct autofs_sb_info {
>>  struct list_head active_list;
>>  struct list_head expiring_list;
>>  struct rcu_head rcu;
>> -unsigned is32bit:1;
>>  };
>>
>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>> index b23cf2a..af7506c 100644
>> --- a/fs/autofs4/inode.c
>> +++ b/fs/autofs4/inode.c
>> @@ -127,7 +127,7 @@ static const match_table_t tokens = {
>>  {Opt_indirect, "indirect"},
>>  {Opt_direct, "direct"},
>>  {Opt_offset, "offset"},
>> -{Opt_err, NULL}
>> +{Opt_err, NULL}
>>  };
>>
>>  static int parse_options(char *options, int *pipefd, kuid_t *uid, kgid_t 
>> *gid,
>> @@ -313,7 +313,7 @@ int autofs4_fill_super(struct super_block *s, void 
>> *data, int silent)
>>
>>  if (!pipe) {
>>  printk("autofs: could not open pipe file descriptor\n");
>> -goto fail_put_pid;
>> +goto fail_dput;
>>  }
>>  ret = autofs_prepare_pipe(pipe);
>>  if (ret < 0)
>> @@ -335,8 +335,6 @@ int autofs4_fill_super(struct super_block *s, void 
>> *data, int silent)
>>  printk("autofs: pipe file descriptor does not contain proper ops\n");
>>  fput(pipe);
>>  /* fall through */
>> -fail_put_pid:
>> -put_pid(sbi->oz_pgrp);
>>  fail_dput:
>>  dput(root);
>>  goto fail_free;
>> .
>>
> 


[Devel] [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-09-05 Thread Pavel Tikhomirov
We have a problem with SCSI EH on several of our nodes. Imagine the
following order of execution of two threads:

CPU1 scsi_eh_scmd_add                   CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initially */

                                        if (shost->shost_state == SHOST_RECOVERY)
                                                /* does not get here */
                                                return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

                                        busy = shost->host_busy++;
                                        /* host->can_queue == 1 initially, busy == 1
                                         * - go to starved label */
                                        lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
        if (host_busy == host_failed) {
                /* does not get here */
                wake_up_process(shost->ehandler)
        }
}
unlock(shost->host_lock)

                                        /* acquire lock */
                                        shost->host_busy--;

As a result, scsi_error_handler is never woken up, and all subsequent
commands hang: the host stays in the recovery state forever because no
one is left to wake up the handler.

So the SCSI disks on such a host become unresponsive and all bios on the
node hang. (We trigger this problem when SCSI commands to a DVD drive time out.)

The main idea of the fix is to attempt a wakeup every time we decrement
host_busy or increment host_failed (the latter is already OK).

Now the very *last* of the busy threads to take host_lock after
decrementing host_busy will see all writes to the host's
shost_state, host_busy and host_failed completed, thanks to the implied
memory barriers of spin_lock/unlock, so once busy == failed
at least one thread will trigger the wakeup. (That is why the
recovery and failed checks are put under the lock.)
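
The interleaving above can be replayed deterministically in userspace. The following is only a sketch under simplifying assumptions: it runs in a single thread, uses plain integers instead of atomics and locks, and the names host_model, eh_wakeup and replay_race are invented for the illustration. It shows that with a wakeup check only at host_failed++ the handler is never woken, while repeating the check after host_busy-- wakes it.

```c
#include <stdbool.h>

struct host_model {
    int  host_busy;
    int  host_failed;
    bool in_recovery;
    bool eh_woken;
};

/* Mirrors the condition quoted in the trace: wake only when busy == failed. */
static void eh_wakeup(struct host_model *h)
{
    if (h->host_busy == h->host_failed)
        h->eh_woken = true;
}

/*
 * Replays the interleaving from the trace in a single thread.
 * recheck_after_dec selects whether the wakeup check is repeated
 * after host_busy is decremented (the behaviour the patch adds).
 */
bool replay_race(bool recheck_after_dec)
{
    struct host_model h = { .host_busy = 1 };

    /* CPU2: sees the host not yet in recovery and continues. */
    /* CPU1: */ h.in_recovery = true;
    /* CPU2: */ h.host_busy++;            /* 2: over the limit, will back off */
    /* CPU1: */ h.host_failed++;          /* 1 */
    /* CPU1: */ eh_wakeup(&h);            /* 2 != 1, no wakeup */
    /* CPU2: */ h.host_busy--;            /* 1 == host_failed */

    if (recheck_after_dec && h.in_recovery && h.host_failed)
        eh_wakeup(&h);

    return h.eh_woken;
}
```

replay_race(false) models the code before the patch, replay_race(true) the code after it.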

Signed-off-by: Pavel Tikhomirov 
---
 drivers/scsi/scsi_lib.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..6c99221d60aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev)
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
 
+   spin_lock_irqsave(shost->host_lock, flags);
if (unlikely(scsi_host_in_recovery(shost) &&
-(shost->host_failed || shost->host_eh_scheduled))) {
-   spin_lock_irqsave(shost->host_lock, flags);
+(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
-   spin_unlock_irqrestore(shost->host_lock, flags);
-   }
+   spin_unlock_irqrestore(shost->host_lock, flags);
 
atomic_dec(&sdev->device_busy);
 }
@@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
spin_unlock_irq(shost->host_lock);
 out_dec:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
return 0;
 }
 
@@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 out_dec_host_busy:
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
 out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);
-- 
2.13.5



Re: [Devel] [PATCH RHEL7 COMMIT] ve/autofs: drop fix double pid put in error path and leaked pid on error path in autofs4_fill_super

2017-09-05 Thread Konstantin Khorenko

Please consider preparing a ReadyKernel patch for it.

This is needed for all kernels prior to 3.10.0-693.x

https://readykernel.com/

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 09/05/2017 03:18 PM, Konstantin Khorenko wrote:

The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.6
-->
commit e851cc10aa14e1ca311187fda9a3a53a5e3dee25
Author: Konstantin Khorenko 
Date:   Tue Sep 5 15:13:26 2017 +0300

ve/autofs: drop fix double pid put in error path and leaked pid on error 
path in autofs4_fill_super

Drop redundant hunks of 078889e ("VE/AUTOFS: port 71-diff-autofs-combined"),
they lead to unbalanced pid get/put in autofs4_fill_super().

Fixes: 078889e ("VE/AUTOFS: port 71-diff-autofs-combined")

Signed-off-by: Konstantin Khorenko 
---
 fs/autofs4/autofs_i.h | 1 -
 fs/autofs4/inode.c| 6 ++
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c957d14..39f197c 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -123,7 +123,6 @@ struct autofs_sb_info {
struct list_head active_list;
struct list_head expiring_list;
struct rcu_head rcu;
-   unsigned is32bit:1;
 };

 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index b23cf2a..af7506c 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -127,7 +127,7 @@ static const match_table_t tokens = {
{Opt_indirect, "indirect"},
{Opt_direct, "direct"},
{Opt_offset, "offset"},
-{Opt_err, NULL}
+   {Opt_err, NULL}
 };

 static int parse_options(char *options, int *pipefd, kuid_t *uid, kgid_t *gid,
@@ -313,7 +313,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)

if (!pipe) {
printk("autofs: could not open pipe file descriptor\n");
-   goto fail_put_pid;
+   goto fail_dput;
}
ret = autofs_prepare_pipe(pipe);
if (ret < 0)
@@ -335,8 +335,6 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
printk("autofs: pipe file descriptor does not contain proper ops\n");
fput(pipe);
/* fall through */
-fail_put_pid:
-   put_pid(sbi->oz_pgrp);
 fail_dput:
dput(root);
goto fail_free;


[Devel] [PATCH RHEL7 COMMIT] ve/autofs: drop redundant hunks of 71-diff-autofs-combined

2017-09-05 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-693.1.1.vz7.37.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-693.1.1.vz7.37.2
-->
commit 5ce3e1be78f85902e8606e77a1883248dbc5ea6e
Author: Konstantin Khorenko 
Date:   Tue Sep 5 15:24:34 2017 +0300

ve/autofs: drop redundant hunks of 71-diff-autofs-combined

Fixes: 078889e ("VE/AUTOFS: port 71-diff-autofs-combined")

Signed-off-by: Konstantin Khorenko 
---
 fs/autofs4/autofs_i.h | 1 -
 fs/autofs4/inode.c| 1 -
 2 files changed, 2 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index e105f59..7e44fa7 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -119,7 +119,6 @@ struct autofs_sb_info {
struct list_head active_list;
struct list_head expiring_list;
struct rcu_head rcu;
-   unsigned is32bit:1;
 };
 
 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 0af786c..175ae8a 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -338,7 +338,6 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 fail_fput:
pr_err("pipe file descriptor does not contain proper ops\n");
fput(pipe);
-   /* fall through */
 fail_put_pid:
put_pid(sbi->oz_pgrp);
 fail_dput:


[Devel] [PATCH RHEL7 COMMIT] ve/autofs: drop fix double pid put in error path and leaked pid on error path in autofs4_fill_super

2017-09-05 Thread Konstantin Khorenko
The commit is pushed to "branch-rh7-3.10.0-514.26.1.vz7.35.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.26.1.vz7.35.6
-->
commit e851cc10aa14e1ca311187fda9a3a53a5e3dee25
Author: Konstantin Khorenko 
Date:   Tue Sep 5 15:13:26 2017 +0300

ve/autofs: drop fix double pid put in error path and leaked pid on error 
path in autofs4_fill_super

Drop redundant hunks of 078889e ("VE/AUTOFS: port 71-diff-autofs-combined"),
they lead to unbalanced pid get/put in autofs4_fill_super().

Fixes: 078889e ("VE/AUTOFS: port 71-diff-autofs-combined")

Signed-off-by: Konstantin Khorenko 
---
 fs/autofs4/autofs_i.h | 1 -
 fs/autofs4/inode.c| 6 ++
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c957d14..39f197c 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -123,7 +123,6 @@ struct autofs_sb_info {
struct list_head active_list;
struct list_head expiring_list;
struct rcu_head rcu;
-   unsigned is32bit:1;
 };
 
 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index b23cf2a..af7506c 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -127,7 +127,7 @@ static const match_table_t tokens = {
{Opt_indirect, "indirect"},
{Opt_direct, "direct"},
{Opt_offset, "offset"},
-{Opt_err, NULL}
+   {Opt_err, NULL}
 };
 
 static int parse_options(char *options, int *pipefd, kuid_t *uid, kgid_t *gid,
@@ -313,7 +313,7 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 
if (!pipe) {
printk("autofs: could not open pipe file descriptor\n");
-   goto fail_put_pid;
+   goto fail_dput;
}
ret = autofs_prepare_pipe(pipe);
if (ret < 0)
@@ -335,8 +335,6 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
printk("autofs: pipe file descriptor does not contain proper ops\n");
fput(pipe);
/* fall through */
-fail_put_pid:
-   put_pid(sbi->oz_pgrp);
 fail_dput:
dput(root);
goto fail_free;


Re: [Devel] [PATCH RFC] mm: Limit number of busy-looped shrinking processes

2017-09-05 Thread Dmitry Monakhov
Kirill Tkhai  writes:

> When a FUSE process is making shrink, it must not wait
> on page writeback. Otherwise, it may meet a page,
> that is being writebacked by him, and the process will stall.
>
> So, our kernel does not wait writeback after commit a9707947010d
> "mm: vmscan: never wait on writeback pages".
>
> But in case of huge number of writebacked pages and
> memory pressure, this lead to busy loop: many process
> in the system are trying to shrink memory and have
> no success. And the node shows high time, spent in kernel.
>
> This patch reduces the number of processes, which may
> busy looping on shrink. Only one userspace process --
> vstorage -- will be allowed not to sleep on writeback.
> Other processes will sleep up to 5 seconds to wait
> writeback completion on every page.
>
> The detection of vstorage is very simple and it based
> on process name. It seems, there is no a way to detect
NAK. Detection by name is a very bad design.
fused and others should mark themselves as writeback-proof explicitly
via an API similar to ioctl/madvise/ionice/ulimit;
maybe it is reasonable to place such apps in a specific cgroup.
You may pick any recipe you like, but please do not do comm-name
matching.
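
To make the objection concrete, here is a small userspace sketch of the name check from the RFC next to an explicit opt-in flag. It is an illustration only: task_model and its writeback_proof field are hypothetical and do not exist in the kernel. The point is that a prefix comparison on comm accepts any task whose name merely starts with "vstorage", and a task can set its own name (for example with prctl(PR_SET_NAME)).

```c
#include <stdbool.h>
#include <string.h>

/* The RFC's test: only the first 8 bytes of the task name are compared. */
bool comm_is_vstorage(const char *comm)
{
    return strncmp(comm, "vstorage", 8) == 0;
}

/* Hypothetical alternative: an explicit per-task flag set through some API. */
struct task_model {
    char comm[16];           /* same size as the kernel's TASK_COMM_LEN */
    bool writeback_proof;    /* assumed flag, not an existing kernel field */
};

bool task_is_writeback_proof(const struct task_model *t)
{
    return t->writeback_proof;
}
```

A task named "vstorage-fake" passes comm_is_vstorage() but would not pass the flag check unless it had explicitly been marked.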

> all FUSE processes, especially from !ve0, because FUSE
> mount is tricky, and a process doing mount may not be
> a FUSE daemon. So, we remain the vanila kernel behaviour,
> but we don't wait forever, just 5 second. This will save
> us from lookup messages from kernel and will allow
> to kill FUSE daemon if necessary.
>
> https://jira.sw.ru/browse/PSBM-69296
>
> Signed-off-by: Kirill Tkhai 
> ---
>  mm/vmscan.c |   19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5db5940bb1..e72d515c111 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -959,8 +959,16 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>  
>   /* Case 3 above */
>   } else {
> - nr_immediate++;
> - goto keep_locked;
> + /*
> +  * Currently, vstorage is the only fuse process,
> +  * exercising writeback; it mustn't sleep to 
> avoid
> +  * deadlocks.
> +  */
> + if (!strncmp(current->comm, "vstorage", 8) ||
> + wait_on_page_bit_killable_timeout(page, 
> PG_writeback, 5 * HZ) != 0) {
> + nr_immediate++;
> + goto keep_locked;
> + }
>   }
>   }
>  
> @@ -1592,9 +1600,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> lruvec *lruvec,
>   if (nr_writeback && nr_writeback == nr_taken)
>   zone_set_flag(zone, ZONE_WRITEBACK);
>  
> - if (!global_reclaim(sc) && nr_immediate)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> + /*
> +  * memcg will stall in page writeback so only consider forcibly
> +  * stalling for global reclaim
> +  */
>   if (global_reclaim(sc)) {
>   /*
>* Tag a zone as congested if all the dirty pages scanned were


[Devel] [PATCH RH7] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy

2017-09-05 Thread Pavel Tikhomirov
We have a problem with SCSI EH on several of our nodes. Imagine the
following order of execution of two threads:

CPU1 scsi_eh_scmd_add                   CPU2 scsi_host_queue_ready
/* shost->host_busy == 1 initially */

                                        if (shost->shost_state == SHOST_RECOVERY)
                                                /* does not get here */
                                                return 0;

lock(shost->host_lock);
shost->shost_state = SHOST_RECOVERY;

                                        busy = shost->host_busy++;
                                        /* host->can_queue == 1 initially, busy == 1 */
                                        lock(shost->host_lock) /* wait */

shost->host_failed++;
/* shost->host_busy == 2, shost->host_failed == 1 */
call scsi_eh_wakeup(shost) {
        if (host_busy == host_failed) {
                /* does not get here */
                wake_up_process(shost->ehandler)
        }
}
unlock(shost->host_lock)

                                        /* acquire lock */
                                        shost->host_busy--;

As a result, scsi_error_handler is never woken up, and all subsequent
commands hang: the host stays in the recovery state forever because no
one is left to wake up the handler.

So the SCSI disks on such a host become unresponsive and all bios on the
node hang. (We trigger this problem when SCSI commands to a DVD drive time out.)

The main idea of the fix is to attempt a wakeup every time we decrement
host_busy or increment host_failed (the latter is already OK).

Now the very *last* of the busy threads to take host_lock after
decrementing host_busy will see all writes to the host's
shost_state, host_busy and host_failed completed, thanks to the implied
memory barriers of spin_lock/unlock, so once busy == failed
at least one thread will trigger the wakeup. (That is why the
recovery and failed checks are put under the lock.)

https://jira.sw.ru/browse/PSBM-69788

Signed-off-by: Pavel Tikhomirov 
---
 drivers/scsi/scsi_lib.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0f55949765c7..ec211185abd6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -320,13 +320,13 @@ void scsi_device_unbusy(struct scsi_device *sdev)
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);
 
+   spin_lock_irqsave(shost->host_lock, flags);
if (unlikely(scsi_host_in_recovery(shost) &&
 (shost->host_failed || shost->host_eh_scheduled))) {
-   spin_lock_irqsave(shost->host_lock, flags);
scsi_debug_log_shost(SCSI_DEVICE_UNBUSY_CALLS_EH_WAKEUP, shost);
scsi_eh_wakeup(shost);
-   spin_unlock_irqrestore(shost->host_lock, flags);
}
+   spin_unlock_irqrestore(shost->host_lock, flags);
 
atomic_dec(&sdev->device_busy);
 }
@@ -1568,6 +1568,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 out_dec:
scsi_debug_log_sdev(SCSI_HOST_QUEUE_READY_DEC_HOST_BUSY, sdev);
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
return 0;
 }
 
@@ -1958,6 +1965,13 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_dec_host_busy:
scsi_debug_log_sdev(SCSI_QUEUE_RQ_DEC_HOST_BUSY, sdev);
atomic_dec(&shost->host_busy);
+
+   spin_lock_irq(shost->host_lock);
+   if (unlikely(scsi_host_in_recovery(shost) &&
+(shost->host_failed || shost->host_eh_scheduled)))
+   scsi_eh_wakeup(shost);
+   spin_unlock_irq(shost->host_lock);
+
 out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
atomic_dec(&scsi_target(sdev)->target_busy);
-- 
2.13.5



[Devel] [PATCH RFC] mm: Limit number of busy-looped shrinking processes

2017-09-05 Thread Kirill Tkhai
When a FUSE process performs memory shrinking, it must not wait
on page writeback. Otherwise it may encounter a page
that it is writing back itself, and the process will stall.

So our kernel does not wait on writeback since commit a9707947010d
"mm: vmscan: never wait on writeback pages".

But with a huge number of pages under writeback and
memory pressure, this leads to a busy loop: many processes
in the system try to shrink memory and make
no progress, and the node shows high time spent in the kernel.

This patch reduces the number of processes that may
busy-loop on shrink. Only one userspace process,
vstorage, will be allowed not to sleep on writeback.
Other processes will sleep up to 5 seconds waiting for
writeback completion on every page.

The detection of vstorage is very simple and is based
on the process name. There seems to be no way to detect
all FUSE processes, especially from !ve0, because FUSE
mount is tricky, and a process doing the mount may not be
a FUSE daemon. So we keep the vanilla kernel behaviour,
but we don't wait forever, just 5 seconds. This will save
us from lockup messages from the kernel and will allow
killing the FUSE daemon if necessary.
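
The control flow described above can be sketched in userspace as follows. This is a model under stated assumptions, not the patch itself: the timed wait is replaced by a caller-supplied stub, a return value of 0 is assumed to mean "writeback finished in time", and the names page_action, writeback_case3, wait_done and wait_timed_out are invented here. It shows that the exempt process never enters the wait, and that everyone else skips the page only if the wait fails.

```c
#include <string.h>

enum page_action { PAGE_KEEP_LOCKED, PAGE_CONTINUE_RECLAIM };

/* Stand-in for the timed wait; 0 means writeback completed (assumed). */
typedef int (*wait_fn)(int *calls);

/* Stub: writeback finished before the timeout. */
int wait_done(int *calls)      { (*calls)++; return 0; }
/* Stub: the timeout expired (or the task was killed). */
int wait_timed_out(int *calls) { (*calls)++; return 1; }

/*
 * Same shape as the condition in the patch: the name check comes first,
 * so for the exempt process the wait is short-circuited away.
 */
enum page_action writeback_case3(const char *comm, wait_fn wait, int *calls)
{
    if (!strncmp(comm, "vstorage", 8) || wait(calls) != 0)
        return PAGE_KEEP_LOCKED;       /* counted as nr_immediate */
    return PAGE_CONTINUE_RECLAIM;      /* page is no longer under writeback */
}
```

The calls counter exists only so that the short-circuit is observable from outside.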

https://jira.sw.ru/browse/PSBM-69296

Signed-off-by: Kirill Tkhai 
---
 mm/vmscan.c |   19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5db5940bb1..e72d515c111 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -959,8 +959,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
/* Case 3 above */
} else {
-   nr_immediate++;
-   goto keep_locked;
+   /*
+* Currently, vstorage is the only fuse process,
+* exercising writeback; it mustn't sleep to avoid
+* deadlocks.
+*/
+   if (!strncmp(current->comm, "vstorage", 8) ||
+   wait_on_page_bit_killable_timeout(page, PG_writeback, 5 * HZ) != 0) {
+   nr_immediate++;
+   goto keep_locked;
+   }
}
}
 
@@ -1592,9 +1600,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (nr_writeback && nr_writeback == nr_taken)
zone_set_flag(zone, ZONE_WRITEBACK);
 
-   if (!global_reclaim(sc) && nr_immediate)
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
-
+   /*
+* memcg will stall in page writeback so only consider forcibly
+* stalling for global reclaim
+*/
if (global_reclaim(sc)) {
/*
 * Tag a zone as congested if all the dirty pages scanned were
