Hello, recently we have been debugging a regression of basically any IO workload that appeared once systemd started enabling the blkio controller for user sessions (due to the delegation feature). Using the blkio controller certainly has its costs, but some of the hits seemed just too heavy - e.g. dbench4 throughput dropped from ~150 MB/s to ~26 MB/s for ext4 with the barrier=0 mount option on an ordinary SATA drive. The reason for the drop is visible in the following blktrace:
0.000383426 5122 A  WS 27691328 + 8 <- (259,851968) 21473600
0.000384039 5122 Q  WS 27691328 + 8 [jbd2/sdb3-8]
0.000385944 5122 G  WS 27691328 + 8 [jbd2/sdb3-8]
0.000386315 5122 P   N [jbd2/sdb3-8]
...
0.000394031 5122 A  WS 27691384 + 8 <- (259,851968) 21473656
0.000394210 5122 Q  WS 27691384 + 8 [jbd2/sdb3-8]
0.000394569 5122 M  WS 27691384 + 8 [jbd2/sdb3-8]
0.000395239 5122 I  WS 27691328 + 64 [jbd2/sdb3-8]
0.000396572    0 m   N cfq5122SN / insert_request
0.000397389    0 m   N cfq5122SN / add_to_rr
0.000398458 5122 U   N [jbd2/sdb3-8] 1
	<<< Here we wait 7.5 ms for the idle timer on the dbench sync-noidle queue to fire
0.008001111    0 m   N cfq idle timer fired
0.008003152    0 m   N cfq5174SN /dbench slice expired t=0
0.008004871    0 m   N /dbench served: vt=24796020 min_vt=24771438
0.008006508    0 m   N cfq5174SN /dbench sl_used=2 disp=1 charge=2 iops=0 sect=24
0.008007509    0 m   N cfq5174SN /dbench del_from_rr
0.008008197    0 m   N /dbench del_from_rr group
0.008008771    0 m   N cfq schedule dispatch
0.008013506    0 m   N cfq workload slice:16
0.008014979    0 m   N cfq5122SN / set_active wl_class:0 wl_type:1
0.008017229    0 m   N cfq5122SN / fifo= (null)
0.008018149    0 m   N cfq5122SN / dispatch_insert
0.008019863    0 m   N cfq5122SN / dispatched a request
0.008020829    0 m   N cfq5122SN / activate rq, drv=1
0.008021578  389 D  WS 27691328 + 64 [kworker/5:1H]
0.008491262    0 C  WS 27691328 + 64 [0]
0.008498654    0 m   N cfq5122SN / complete rqnoidle 1
0.008500202    0 m   N cfq5122SN / set_slice=19
0.008501797    0 m   N cfq5122SN / arm_idle: 2 group_idle: 0
0.008502073    0 m   N cfq schedule dispatch
0.008517281 5122 A  WS 27691392 + 8 <- (259,851968) 21473664
0.008517627 5122 Q  WS 27691392 + 8 [jbd2/sdb3-8]
0.008519126 5122 G  WS 27691392 + 8 [jbd2/sdb3-8]
0.008519534 5122 I  WS 27691392 + 8 [jbd2/sdb3-8]
0.008520560    0 m   N cfq5122SN / insert_request
0.008521908    0 m   N cfq5122SN / dispatch_insert
0.008522798    0 m   N cfq5122SN / dispatched a request
0.008523558    0 m   N cfq5122SN / activate rq, drv=1
0.008523841 5122 D  WS 27691392 + 8 [jbd2/sdb3-8]
0.008718527    0 C  WS 27691392 + 8 [0]
0.008721911    0 m   N cfq5122SN / complete rqnoidle 1
0.008723186    0 m   N cfq5122SN / arm_idle: 2 group_idle: 0
0.008723578    0 m   N cfq schedule dispatch
0.009062333 5174 A  WS 23276680 + 24 <- (259,851968) 17058952
0.009062950 5174 Q  WS 23276680 + 24 [dbench4]
0.009065427 5174 G  WS 23276680 + 24 [dbench4]
0.009065717 5174 P   N [dbench4]
0.009067472 5174 I  WS 23276680 + 24 [dbench4]
0.009069038    0 m   N cfq5174SN /dbench insert_request
0.009069913    0 m   N cfq5174SN /dbench add_to_rr
0.009071190 5174 U   N [dbench4] 1
	<<<< Here we wait another 7 ms for the idle timer on the jbd2 sync-noidle queue to fire
0.016001504    0 m   N cfq idle timer fired
0.016002924    0 m   N cfq5122SN / slice expired t=0
0.016004424    0 m   N / served: vt=24783779 min_vt=24771488
0.016005888    0 m   N cfq5122SN / sl_used=2 disp=2 charge=2 iops=0 sect=72
0.016006635    0 m   N cfq5122SN / del_from_rr
0.016007152    0 m   N / del_from_rr group
0.016007613    0 m   N cfq schedule dispatch
0.016014571    0 m   N cfq workload slice:24
0.016015679    0 m   N cfq5174SN /dbench set_active wl_class:0 wl_type:1
0.016016794    0 m   N cfq5174SN /dbench fifo= (null)
0.016017652    0 m   N cfq5174SN /dbench dispatch_insert
0.016018883    0 m   N cfq5174SN /dbench dispatched a request
0.016019714    0 m   N cfq5174SN /dbench activate rq, drv=1
0.016019973  382 D  WS 23276680 + 24 [kworker/6:1H]
0.016347056    0 C  WS 23276680 + 24 [0]
0.016357022    0 m   N cfq5174SN /dbench complete rqnoidle 1
0.016358509    0 m   N cfq5174SN /dbench set_slice=24
0.016360127    0 m   N cfq5174SN /dbench arm_idle: 2 group_idle: 0
0.016360508    0 m   N cfq schedule dispatch
...

When dbench isn't in a separate cgroup, the dbench and jbd2 sync-noidle queues just freely preempt each other. When dbench gets contained in a dedicated blkio cgroup, preemption is not allowed and the throughput drops. The idling happens because we want to provide separation of IO between different blkio cgroups, so we idle to avoid starving a cgroup whose process is submitting only dependent IO.
I am of the opinion that when an ancestor cgroup would like to preempt a descendant cgroup, there is no strong reason to provide the separation, and we can save at least one of the idle times (when switching from the dbench to the jbd2 thread). Hence the following patch set, which improves dbench4 throughput from ~26 MB/s to ~48 MB/s. The first patch in the set is just an unrelated improvement where I've spotted some asymmetry in how slice_idle and group_idle are handled. Patches two and three prepare cfq_should_preempt() to be able to work on service trees of different cgroups, and patch four then adds the logic in cfq_should_preempt() to allow preemption by an ancestor cgroup.

Comments welcome!

								Honza
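For illustration, the core of the ancestor check could look something like the sketch below. The struct and helper names here are made up for the example (the real patches operate on the blkcg/cfq_group hierarchy); only the shape of the parent walk is what matters:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy cgroup node: just a parent pointer, mirroring the hierarchy walk
 * the kernel does. Purely illustrative, not the kernel's structures. */
struct cgroup {
	struct cgroup *parent;
};

/* Return true if 'ancestor' is on the parent chain of 'cg' (a cgroup
 * counts as its own ancestor here, matching the "same group" case). */
static bool cgroup_is_ancestor(struct cgroup *cg, struct cgroup *ancestor)
{
	for (; cg; cg = cg->parent)
		if (cg == ancestor)
			return true;
	return false;
}

/* Sketch of the extra test in cfq_should_preempt(): let a queue whose
 * cgroup is an ancestor of the active queue's cgroup preempt it, since
 * hierarchically the ancestor's service already contains the
 * descendant's, so no isolation is lost. */
static bool may_preempt_hierarchy(struct cgroup *new_cg,
				  struct cgroup *active_cg)
{
	return cgroup_is_ancestor(active_cg, new_cg);
}
```

In the dbench case above, the root cgroup (jbd2) is an ancestor of the dbench cgroup, so the jbd2 queue would be allowed to preempt immediately instead of waiting for the idle timer; the reverse direction still idles, preserving isolation for the descendant.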