Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen

2015-04-13 Thread Jonathan Davies

On 13/04/15 11:56, George Dunlap wrote:

On Thu, Apr 9, 2015 at 5:36 PM, Stefano Stabellini
<stefano.stabell...@eu.citrix.com> wrote:

On Thu, 9 Apr 2015, Eric Dumazet wrote:

On Thu, 2015-04-09 at 16:46 +0100, Stefano Stabellini wrote:

Hi all,

I found a performance regression when running netperf -t TCP_MAERTS from
an external host to a Xen VM on ARM64: v3.19 and v4.0-rc4 running in the
virtual machine are 30% slower than v3.18.

Through bisection I found that the performance regression is caused by the
presence of the following commit in the guest kernel:


commit 605ad7f184b60cfaacbc038aa6c55ee68dee3c89
Author: Eric Dumazet <eduma...@google.com>
Date:   Sun Dec 7 12:22:18 2014 -0800

 tcp: refine TSO autosizing


[snip]


I recently discussed this issue on netdev in the following thread:

https://www.marc.info/?l=linux-netdev&m=142738853820517


This commit restored the original TCP Small Queues behavior, which is the
first step in fighting bufferbloat.

Some network drivers are known to be problematic because of a delayed TX
completion.
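
For context, the core of that commit is a per-socket cap on how many bytes
may sit queued toward the NIC at once. A minimal sketch of the limit
computation, paraphrased from commit 605ad7f184b6 (approximate, not a
verbatim excerpt):

	static u32 tsq_limit(const struct sock *sk, const struct sk_buff *skb)
	{
		/* Allow roughly 1 ms worth of data at the current pacing
		 * rate, but always at least two skbs...
		 */
		u32 limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);

		/* ...and never more than the sysctl cap (131072 bytes by
		 * default). */
		return min_t(u32, limit, sysctl_tcp_limit_output_bytes);
	}

Once sk_wmem_alloc exceeds this limit the socket is throttled until TX
completions release the queued bytes, which is exactly where a driver with
delayed TX completion hurts.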


[snip]


Try tweaking /proc/sys/net/ipv4/tcp_limit_output_bytes to see if it
makes a difference?


A very big difference:

echo 262144 > /proc/sys/net/ipv4/tcp_limit_output_bytes
brings us much closer to the original performance; the slowdown is just 8%.

echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
fills the gap entirely; performance is the same as before "tcp: refine TSO
autosizing".
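
As a rough sanity check on those numbers (back-of-the-envelope arithmetic,
assuming the limit computation sketched earlier): at a pacing rate of
4 Gbit/s, roughly 500 MB/s, the pacing term permits about 1 ms worth of
data, i.e. around 488 KB, to be queued toward the NIC, but the sysctl
default of 131072 bytes caps it far below that. With netfront's delayed TX
completions the socket throttles on that cap long before the backend has
freed earlier skbs, so the pipe drains; raising the cap to 256 KB or 1 MB
restores the headroom.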


What would be the next step from here?  Should I just document this as an
important performance tweaking step for Xen, or is there something else
we can do?


Is the problem perhaps that netback/netfront delays TX completion?
Would it be better to see if that can be addressed properly, so that
the original purpose of the patch (fighting bufferbloat) can be
achieved while not degrading performance for Xen?  Or at least, so
that people get decent performance out of the box without having to
tweak TCP parameters?


I agree; reducing the completion latency should be the ultimate goal. 
However, that won't be easy, so we need a work-around in the short term. 
I don't like the idea of relying on documenting the recommendation to 
change tcp_limit_output_bytes; too many people won't read this advice 
and will expect the out-of-the-box defaults to be reasonable.


Following Eric's pointers to where a similar problem had been 
experienced in wifi drivers, I came up with two proof-of-concept patches 
that gave a similar performance gain without any changes to sysctl 
parameters or core tcp/ip code. See 
https://www.marc.info/?l=linux-netdev&m=142746161307283.
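
For readers who don't follow the link: the wifi precedent Eric pointed to
is the skb_orphan() technique, where the driver releases the socket's
send-buffer accounting as soon as a packet has been handed to the hardware,
so TSQ no longer waits on the slow completion path. A hedged sketch of the
idea in a driver's transmit hook (illustrative only; not necessarily what
the proof-of-concept patches do):

	static netdev_tx_t example_start_xmit(struct sk_buff *skb,
					      struct net_device *dev)
	{
		/* ... place the skb on the device's TX ring as usual ... */

		/* Drop the socket's accounting for this skb now rather
		 * than at TX-completion time, so TCP Small Queues stops
		 * counting it against tcp_limit_output_bytes.  The
		 * trade-off: TSQ can no longer bound how much this flow
		 * queues inside the driver.
		 */
		skb_orphan(skb);

		return NETDEV_TX_OK;
	}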


I haven't yet received any feedback from the xen-netfront maintainers 
about whether those ideas could be reasonably adopted.


Jonathan


Re: [PATCH RFC] sched/core: Make idle_cpu return 0 if doing softirq work

2014-07-21 Thread Jonathan Davies



On 18/07/14 15:08, Peter Zijlstra wrote:

On Fri, Jul 18, 2014 at 01:59:06PM +0100, Jonathan Davies wrote:

The current implementation of idle_cpu only considers tasks that might be in the
CPU's runqueue. If there's nothing in the specified CPU's runqueue, it will
return 1. But if the CPU is doing work in softirq context, it is wrong for
idle_cpu to return 1. This patch makes it return 0.

I observed this to be a problem with a device driver kicking a kthread by
executing wake_up from softirq context. The Completely Fair Scheduler's
select_task_rq_fair was looking for an "idle sibling" of the CPU executing it by
calling select_idle_sibling, passing the executing CPU as the 'target'
parameter. The first thing that select_idle_sibling does is to check whether the
'target' CPU is idle, using idle_cpu, and to return that CPU if so. Despite the
executing CPU being busy in softirq context, idle_cpu was returning 1, meaning
that the scheduler would consistently try to run the kthread on the same CPU as
the kick came from. Given that the softirq work was ongoing, this led to a
multi-millisecond delay before the scheduler eventually realised it should
migrate the kthread to a different CPU.


If your softirq takes _that_ long it's broken anyhow.


Modern NICs can sustain 40 Gb/s of traffic. For network device drivers
that use NAPI, polling is done in softirq context. At this data rate,
the per-packet processing overhead means that a lot of CPU time is
spent in softirq.
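
To put a number on that (back-of-the-envelope, assuming 1500-byte MTU
frames): 40 Gbit/s is about 3.25 million packets per second (40e9 bits /
(1538 bytes * 8 bits per byte), counting Ethernet framing), which leaves a
budget of roughly 300 ns per packet; even modest per-packet processing
therefore keeps a core permanently busy in softirq.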


(CCing Dave and Eric for their thoughts about long-running softirq due 
to NAPI. The example I gave above was of xen-netback sending data to 
another virtual interface at a high rate.)



A solution to this problem would be to make idle_cpu return 0 when the CPU is
running in softirq context. I haven't got a patch for that because I couldn't
find an easy way of querying whether an arbitrary CPU is doing this. (Perhaps I
should look at the per-CPU softirq_work_list[]...?)


in_serving_softirq()?


That's probably more appropriate, but only tells us about the currently 
executing CPU, rather than what other CPUs are doing.
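
For reference, the distinction between the two predicates (paraphrased from
include/linux/preempt.h; exact macro spellings may differ across kernel
versions):

	/* Both macros inspect the *current* CPU's preempt count:
	 *   in_softirq()         - softirq_count() != 0: true while serving
	 *                          a softirq, but also inside any
	 *                          local_bh_disable() section
	 *   in_serving_softirq() - softirq_count() & SOFTIRQ_OFFSET: true
	 *                          only while a softirq handler is actually
	 *                          running
	 * Neither can tell us whether some *other* CPU is in softirq
	 * context.
	 */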



Instead, the following patch is a partial solution, only handling the case when
the currently-executing CPU is in softirq context. This was sufficient to solve
the problem I observed.


NAK, IRQ and SoftIRQ are outside of what the scheduler can control, so
for its purpose the CPU is indeed idle.


The scheduler can't control those things, but surely it wants to make 
the best possible placement for the things it can control? So it seems 
odd to me that it would ignore relevant information about the resources 
it can use. As I observed, it leads to pathological behaviour, and is 
easily fixed.


Jonathan


[PATCH RFC] sched/core: Make idle_cpu return 0 if doing softirq work

2014-07-18 Thread Jonathan Davies
The current implementation of idle_cpu only considers tasks that might be in the
CPU's runqueue. If there's nothing in the specified CPU's runqueue, it will
return 1. But if the CPU is doing work in softirq context, it is wrong for
idle_cpu to return 1. This patch makes it return 0.

I observed this to be a problem with a device driver kicking a kthread by
executing wake_up from softirq context. The Completely Fair Scheduler's
select_task_rq_fair was looking for an "idle sibling" of the CPU executing it by
calling select_idle_sibling, passing the executing CPU as the 'target'
parameter. The first thing that select_idle_sibling does is to check whether the
'target' CPU is idle, using idle_cpu, and to return that CPU if so. Despite the
executing CPU being busy in softirq context, idle_cpu was returning 1, meaning
that the scheduler would consistently try to run the kthread on the same CPU as
the kick came from. Given that the softirq work was ongoing, this led to a
multi-millisecond delay before the scheduler eventually realised it should
migrate the kthread to a different CPU.
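
For reference, a sketch of the early exit in select_idle_sibling() as it
looks in kernels of this era (kernel/sched/fair.c, paraphrased rather than
quoted verbatim):

	static int select_idle_sibling(struct task_struct *p, int target)
	{
		/* If the waking CPU looks idle, place the task right
		 * there.  idle_cpu() only inspects the runqueue, so a CPU
		 * grinding through softirq work still passes this test.
		 */
		if (idle_cpu(target))
			return target;

		/* ... otherwise scan the LLC domain for an idle CPU ... */
	}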

A solution to this problem would be to make idle_cpu return 0 when the CPU is
running in softirq context. I haven't got a patch for that because I couldn't
find an easy way of querying whether an arbitrary CPU is doing this. (Perhaps I
should look at the per-CPU softirq_work_list[]...?)

Instead, the following patch is a partial solution, only handling the case when
the currently-executing CPU is in softirq context. This was sufficient to solve
the problem I observed.

Signed-off-by: Jonathan Davies <jonathan.dav...@citrix.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599d..4ee58c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3169,6 +3169,10 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	/* When the current CPU is in softirq context, count it as non-idle */
+	if (cpu == smp_processor_id() && in_softirq())
+		return 0;
+
 	return 1;
 }
 
-- 
1.9.1


