[RFC PATCH V4 5/5] workqueue: introduce a way to set workqueue's scheduler

2018-01-24 Thread Wen Yang
When RT threads are pinned to specific cores via CPU affinity, the
kworkers on those CPUs can starve, which may lead to a priority
inversion. In that case, the RT threads themselves also suffer a
significant performance impact.

The priority inversion looks like this:
CPU 0:  libvirtd acquired cgroup_mutex, triggered
lru_add_drain_per_cpu, and is waiting for all the kworkers to complete:
PID: 44145  TASK: 8807bec7b980  CPU: 0   COMMAND: "libvirtd"
#0 [8807f2cbb9d0] __schedule at 816410ed
#1 [8807f2cbba38] schedule at 81641789
#2 [8807f2cbba48] schedule_timeout at 8163f479
#3 [8807f2cbbaf8] wait_for_completion at 81641b56
#4 [8807f2cbbb58] flush_work at 8109efdc
#5 [8807f2cbbbd0] lru_add_drain_all at 81179002
#6 [8807f2cbbc08] migrate_prep at 811c77be
#7 [8807f2cbbc18] do_migrate_pages at 811b8010
#8 [8807f2cbbcf8] cpuset_migrate_mm at 810fea6c
#9 [8807f2cbbd10] cpuset_attach at 810ff91e
#10 [8807f2cbbd50] cgroup_attach_task at 810f9972
#11 [8807f2cbbe08] attach_task_by_pid at 810fa520
#12 [8807f2cbbe58] cgroup_tasks_write at 810fa593
#13 [8807f2cbbe68] cgroup_file_write at 810f8773
#14 [8807f2cbbef8] vfs_write at 811dfdfd
#15 [8807f2cbbf38] sys_write at 811e089f
#16 [8807f2cbbf80] system_call_fastpath at 8164c809

CPU 43: kworker/43 starved because of the RT threads:
CURRENT: PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
RT PRIO_ARRAY: 883fff3f4950
[ 79] PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
[ 79] PID: 21295  TASK: 88276d481700  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21351  TASK: 8807be822280  COMMAND: "dispatcher"
[ 79] PID: 21129  TASK: 8807bef0f300  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21337  TASK: 88276d482e00  COMMAND: "handler_3"
[ 79] PID: 21352  TASK: 8807be824500  COMMAND: "flow_dumper"
[ 79] PID: 21336  TASK: 88276d480b80  COMMAND: "handler_2"
[ 79] PID: 21342  TASK: 88276d484500  COMMAND: "handler_8"
[ 79] PID: 21341  TASK: 88276d482280  COMMAND: "handler_7"
[ 79] PID: 21338  TASK: 88276d483980  COMMAND: "handler_4"
[ 79] PID: 21339  TASK: 88276d48  COMMAND: "handler_5"
[ 79] PID: 21340  TASK: 88276d486780  COMMAND: "handler_6"
CFS RB_ROOT: 883fff3f4868
[120] PID: 37959  TASK: 88276e148000  COMMAND: "kworker/43:1"

CPU 28: systemd (the victim) was blocked on cgroup_mutex:
PID: 1  TASK: 883fd2d4  CPU: 28  COMMAND: "systemd"
#0 [881fd317bd60] __schedule at 816410ed
#1 [881fd317bdc8] schedule_preempt_disabled at 81642869
#2 [881fd317bdd8] __mutex_lock_slowpath at 81640565
#3 [881fd317be38] mutex_lock at 8163f9cf
#4 [881fd317be50] proc_cgroup_show at 810fd256
#5 [881fd317be98] seq_read at 81203cda
#6 [881fd317bf08] vfs_read at 811dfc6c
#7 [881fd317bf38] sys_read at 811e07bf
#8 [881fd317bf80] system_call_fastpath at 81

The simplest way to fix this is to give the kworkers a higher RT
priority, e.g.:
chrt --fifo -p 61 
However, that cannot prevent other WORK_CPU_BOUND worker threads from
running and starving.
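A hedged sketch of that workaround: the loop below (rt_boost_kworkers is a hypothetical helper, not part of this patch) prints one chrt invocation per per-CPU kworker on a given CPU; pipe the output to sh as root to actually apply the priorities.

```shell
# Sketch: emit chrt commands for every kworker bound to one CPU.
# rt_boost_kworkers is an illustrative helper, not part of this patch.
rt_boost_kworkers() {
    cpu=$1
    prio=$2
    # kworker threads for CPU N are named kworker/N:M (kworker/N:MH
    # for the highpri pool), so match the comm prefix with pgrep.
    for pid in $(pgrep "^kworker/${cpu}:"); do
        echo chrt --fifo -p "$prio" "$pid"
    done
}

rt_boost_kworkers 43 61
```

As noted above, this only touches the kworkers that exist at the moment it runs; newly created WORK_CPU_BOUND workers come up at the default priority, which is why a per-pool attribute is preferable.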

This patch introduces a way to set the scheduler (policy and priority)
of a percpu worker_pool. With it, users can set a suitable scheduler
policy and priority on a worker_pool as needed, which then applies to
all the WORK_CPU_BOUND workers on the same CPU. For WORK_CPU_UNBOUND
workers, /sys/devices/virtual/workqueue/cpumask can be used to keep
them off the RT CPUs and prevent them from starving.
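For the WORK_CPU_UNBOUND side, the global unbound-workqueue cpumask can be shrunk so unbound workers never land on the RT CPU. A minimal sketch, assuming a 64-CPU machine with the RT threads pinned to CPU 43 (the sysfs write itself needs root):

```shell
# Build a 64-bit cpumask with bit 43 (the RT CPU) cleared; bash
# arithmetic is 64-bit, so ~(1 << 43) is exactly that mask.
rt_cpu=43
mask=$(printf '%x' $(( ~(1 << rt_cpu) )))
echo "$mask"    # fffff7ffffffffff

# As root, apply it to the global unbound-workqueue cpumask:
# echo "$mask" > /sys/devices/virtual/workqueue/cpumask
```

The cpumask attribute is the existing upstream interface for unbound workqueues; this patch only adds the analogous control for the percpu pools.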

Tejun Heo suggested:
"* Add scheduler type to wq_attrs so that unbound workqueues can be
 configured.

* Rename system_wq's wq->name from "events" to "system_percpu", and
 similarly for the similarly named workqueues.

* Enable wq_attrs (only the applicable part should show up in the
 interface) for system_percpu and system_percpu_highpri, and use that
 to change the attributes of the percpu pools."

This patch implements the basic infrastructure and the /sys interface,
for example:
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# echo "policy=1 prio=1 nice=0" > /sys/devices/virtual/workqueue/system_percpu/sched_attr
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=1 prio=1 nice=0
# cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20
# echo "policy=1 prio=2 nice=0" > 
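The numeric policy values in sched_attr follow the kernel's scheduling-policy constants (SCHED_NORMAL=0, SCHED_FIFO=1, SCHED_RR=2), so the transcript above switches system_percpu from SCHED_NORMAL to SCHED_FIFO priority 1. A small sketch decoding one such line (the helper is illustrative, not part of the patch):

```shell
# Decode one sched_attr line into a human-readable policy name.
# Policy numbers match include/uapi/linux/sched.h: 0=SCHED_NORMAL,
# 1=SCHED_FIFO, 2=SCHED_RR.
decode_sched_attr() {
    attr=$1
    policy=${attr#policy=}     # strip the "policy=" prefix
    policy=${policy%% *}       # keep only the first field
    case "$policy" in
        0) echo "SCHED_NORMAL" ;;
        1) echo "SCHED_FIFO" ;;
        2) echo "SCHED_RR" ;;
        *) echo "unknown ($policy)" ;;
    esac
}

decode_sched_attr "policy=1 prio=1 nice=0"    # SCHED_FIFO
```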
