Re: Fw: Benchmarking for vhost polling patch

2015-01-17 Thread Razya Ladelsky
  
  Our suggestion would be to use the maximum (a large enough) value,
  so that vhost is polling 100% of the time.
 
  The polling optimization mainly addresses users who want to maximize
  their performance, even at the expense of wasting cpu cycles. The
  maximum value will produce the biggest impact on performance.
 
 *Everyone* is interested in getting maximum performance from
 their systems.
 

Maybe so, but not everyone is willing to pay the price.
That is also the reason why this optimization should not be enabled by 
default. 

  However, using the maximum default value will be valuable even for
  users who care more about the normalized throughput/cpu criterion.
  Such users, interested in finer tuning of the polling timeout, need
  to look for an optimal timeout value for their system. The maximum
  value serves as the upper limit of the range that needs to be
  searched for such an optimal timeout value.
 
 Number of users who are going to do this kind of tuning
 can be counted on one hand.
 

If the optimization is not enabled by default, the default value is almost
irrelevant: when users turn on the feature they should understand that
there's an associated cost, and that they have to tune their system if they
want to get the maximum benefit (depending on how they define their maximum
benefit).
The maximum value is a good starting point that will work in most cases
and can be used to start the tuning.

  
   There are some cases where networking stack already
   exposes low-level hardware detail to userspace, e.g.
   tcp polling configuration. If we can't come up with
   a way to abstract hardware, maybe we can at least tie
   it to these existing controls rather than introducing
   new ones?
   
  
  We've spent time thinking about the possible interfaces that
  could be appropriate for such an optimization (including tcp polling).
  We think that using an ioctl as the interface to configure the virtual
  device/vhost, in the same manner that e.g. SET_NET_BACKEND is
  configured, makes a lot of sense and is consistent with the existing
  mechanism.
  
  Thanks,
  Razya
 
 The guest is giving up its share of CPU for the benefit of vhost, right?
 So maybe exposing this to the guest is appropriate, and then
 add e.g. an ethtool interface for the guest admin to set this.
 

The decision of whether to turn polling on (and at what rate)
should be made by the system administrator, who has a broad view of the
system and workload, and not by the guest administrator.
Polling should be a tunable parameter on the host side; the guest should
not be aware of it.
The guest is not necessarily giving up its time. It may be that there's
just an extra dedicated core or free cpu cycles on a different cpu.
We provide a mechanism and an interface that can be tuned by some other
program to implement its policy.
This patch is all about the mechanism, not the policy of how to use it.
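
(For illustration, a minimal userspace sketch of what such a host-side
knob could look like, assuming a hypothetical VHOST_SET_POLL_STOP_IDLE
ioctl; the ioctl name, request number and argument layout are
placeholders invented for this example, not part of the patch:)

/* Hypothetical host-side helper: set the polling timeout (in usec)
 * on an open vhost device fd. VHOST_VIRTIO (0xAF) is the real vhost
 * ioctl magic; the 0x70 request number is made up for this sketch. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define VHOST_VIRTIO 0xAF
#define VHOST_SET_POLL_STOP_IDLE _IOW(VHOST_VIRTIO, 0x70, unsigned int)

int set_poll_stop_idle(int vhost_fd, unsigned int usecs)
{
        if (ioctl(vhost_fd, VHOST_SET_POLL_STOP_IDLE, &usecs) < 0) {
                perror("VHOST_SET_POLL_STOP_IDLE");
                return -1;
        }
        return 0;
}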

Thank you,
Razya 



Re: Fw: Benchmarking for vhost polling patch

2015-01-14 Thread Michael S. Tsirkin
On Wed, Jan 14, 2015 at 05:01:05PM +0200, Razya Ladelsky wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 12/01/2015 12:36:13 PM:
 
  From: Michael S. Tsirkin m...@redhat.com
  To: Razya Ladelsky/Haifa/IBM@IBMIL
  Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, 
  Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
  abel.gor...@gmail.com, kvm@vger.kernel.org, Eyal 
 Moscovici/Haifa/IBM@IBMIL
  Date: 12/01/2015 12:36 PM
  Subject: Re: Fw: Benchmarking for vhost polling patch
  
  On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
Hi Razya,
Thanks for the update.
So that's reasonable I think, and I think it makes sense
to keep working on this in isolation - it's more
manageable at this size.

The big questions in my mind:
- What happens if system is lightly loaded?
  E.g. a ping/pong benchmark. How much extra CPU are
  we wasting?
- We see the best performance on your system is with 10usec worth of
  polling.
  It's OK to be able to tune it for best performance, but
  most people don't have the time or the inclination.
  So what would be the best value for other CPUs?
   
   The extra cpu waste vs throughput gain depends on the polling timeout
   value (poll_stop_idle).
   The best value to choose is dependent on the workload and the system
   hardware and configuration.
   There is nothing that we can say about this value in advance. The
   system's manager/administrator should use this optimization with the
   awareness that polling consumes extra cpu cycles, as documented.
   
- Should this be tunable from userspace per vhost instance?
  Why is it only tunable globally?
   
   It should be tunable per vhost thread.
   We can do it in a subsequent patch.
  
  So I think whether the patchset is appropriate upstream
  will depend exactly on coming up with a reasonable
  interface for enabling and tuning the functionality.
  
 
 How about adding a new ioctl for each vhost device that 
 sets the poll_stop_idle (the timeout)? 
 This should be aligned with the QEMU way of doing things.

  I was hopeful some reasonable default value can be
  derived from e.g. cost of the exit.
  If that is not the case, it becomes that much harder
  for users to select good default values.
  
 
 Our suggestion would be to use the maximum (a large enough) value,
 so that vhost is polling 100% of the time.

 The polling optimization mainly addresses users who want to maximize their 
 performance, even at the expense of wasting cpu cycles. The maximum value
 will produce the biggest impact on performance.

*Everyone* is interested in getting maximum performance from
their systems.

 However, using the maximum default value will be valuable even for users
 who care more about the normalized throughput/cpu criterion. Such users,
 interested in finer tuning of the polling timeout, need to look for an
 optimal timeout value for their system. The maximum value serves as the
 upper limit of the range that needs to be searched for such an optimal
 timeout value.

Number of users who are going to do this kind of tuning
can be counted on one hand.

Polling all the time also works well
only if you have dedicated CPUs for VMs, and no HT.

I'm concerned you didn't really try to do something more widely useful,
and easier to use, being too focused on getting your high netperf
number.


 
  There are some cases where networking stack already
  exposes low-level hardware detail to userspace, e.g.
  tcp polling configuration. If we can't come up with
  a way to abstract hardware, maybe we can at least tie
  it to these existing controls rather than introducing
  new ones?
  
 
 We've spent time thinking about the possible interfaces that
 could be appropriate for such an optimization (including tcp polling).
 We think that using an ioctl as the interface to configure the virtual
 device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured,
 makes a lot of sense and is consistent with the existing mechanism.
 
 Thanks,
 Razya

The guest is giving up its share of CPU for the benefit of vhost, right?
So maybe exposing this to the guest is appropriate, and then
add e.g. an ethtool interface for the guest admin to set this.
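
(If the guest-side route were taken, the closest existing knob is
probably interrupt coalescing, e.g. something in the spirit of
"ethtool -C eth0 tx-usecs 10", reusing the standard coalescing
controls rather than introducing a new one. That mapping is a
speculative illustration, not something proposed in this thread.)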

This means we'll want virtio and qemu patches for this.

But really, you want to find a way to enable it by default.


- How bad is it if you don't pin vhost and vcpu threads?
  Is the scheduler smart enough to pull them apart?
- What happens in overcommit scenarios? Does polling make things
  much worse?
  Clearly polling will work worse if e.g. vhost and vcpu
  share the host cpu. How can we avoid conflicts?

  For the two last questions, better cooperation with the host scheduler
  will likely help here.
  See e.g. 
   http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
  I'm currently looking at pushing something similar upstream,
  if it goes in vhost polling can do something similar.

Re: Fw: Benchmarking for vhost polling patch

2015-01-14 Thread Razya Ladelsky
Michael S. Tsirkin m...@redhat.com wrote on 12/01/2015 12:36:13 PM:

 From: Michael S. Tsirkin m...@redhat.com
 To: Razya Ladelsky/Haifa/IBM@IBMIL
 Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, 
 Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
 abel.gor...@gmail.com, kvm@vger.kernel.org, Eyal 
Moscovici/Haifa/IBM@IBMIL
 Date: 12/01/2015 12:36 PM
 Subject: Re: Fw: Benchmarking for vhost polling patch
 
 On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
   Hi Razya,
   Thanks for the update.
   So that's reasonable I think, and I think it makes sense
   to keep working on this in isolation - it's more
   manageable at this size.
   
   The big questions in my mind:
   - What happens if system is lightly loaded?
 E.g. a ping/pong benchmark. How much extra CPU are
 we wasting?
   - We see the best performance on your system is with 10usec worth of
     polling.
 It's OK to be able to tune it for best performance, but
 most people don't have the time or the inclination.
 So what would be the best value for other CPUs?
  
  The extra cpu waste vs throughput gain depends on the polling timeout
  value (poll_stop_idle).
  The best value to choose is dependent on the workload and the system
  hardware and configuration.
  There is nothing that we can say about this value in advance. The
  system's manager/administrator should use this optimization with the
  awareness that polling consumes extra cpu cycles, as documented.
  
   - Should this be tunable from userspace per vhost instance?
 Why is it only tunable globally?
  
  It should be tunable per vhost thread.
  We can do it in a subsequent patch.
 
 So I think whether the patchset is appropriate upstream
 will depend exactly on coming up with a reasonable
 interface for enabling and tuning the functionality.
 

How about adding a new ioctl for each vhost device that 
sets the poll_stop_idle (the timeout)? 
This should be aligned with the QEMU way of doing things.

 I was hopeful some reasonable default value can be
 derived from e.g. cost of the exit.
 If that is not the case, it becomes that much harder
 for users to select good default values.
 

Our suggestion would be to use the maximum (a large enough) value,
so that vhost is polling 100% of the time.
The polling optimization mainly addresses users who want to maximize their
performance, even at the expense of wasting cpu cycles. The maximum value
will produce the biggest impact on performance.
However, using the maximum default value will be valuable even for users
who care more about the normalized throughput/cpu criterion. Such users,
interested in finer tuning of the polling timeout, need to look for an
optimal timeout value for their system. The maximum value serves as the
upper limit of the range that needs to be searched for such an optimal
timeout value.


 There are some cases where networking stack already
 exposes low-level hardware detail to userspace, e.g.
 tcp polling configuration. If we can't come up with
 a way to abstract hardware, maybe we can at least tie
 it to these existing controls rather than introducing
 new ones?
 

We've spent time thinking about the possible interfaces that
could be appropriate for such an optimization (including tcp polling).
We think that using an ioctl as the interface to configure the virtual
device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured,
makes a lot of sense and is consistent with the existing mechanism.
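
(To make that concrete, a hedged sketch of what the kernel side of such
an ioctl handler might look like, modeled on how vhost-net dispatches
its existing ioctls; VHOST_SET_POLL_STOP_IDLE and the poll_stop_idle_us
field are assumptions for this example, not the actual patch code:)

/* Sketch only: per-device handler for a hypothetical polling ioctl.
 * A per-device value would replace the global module parameter and
 * make the timeout tunable per vhost thread, as discussed above. */
static long vhost_net_set_poll_stop_idle(struct vhost_net *n,
                                         unsigned int __user *argp)
{
        unsigned int usecs;

        if (copy_from_user(&usecs, argp, sizeof(usecs)))
                return -EFAULT;
        mutex_lock(&n->dev.mutex);
        n->dev.poll_stop_idle_us = usecs;   /* assumed field */
        mutex_unlock(&n->dev.mutex);
        return 0;
}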

Thanks,
Razya



 
   - How bad is it if you don't pin vhost and vcpu threads?
 Is the scheduler smart enough to pull them apart?
   - What happens in overcommit scenarios? Does polling make things
 much worse?
 Clearly polling will work worse if e.g. vhost and vcpu
 share the host cpu. How can we avoid conflicts?
   
 For the two last questions, better cooperation with the host scheduler
 will likely help here.
 See e.g. 
  http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
 I'm currently looking at pushing something similar upstream,
 if it goes in vhost polling can do something similar.
   
   Any data points to shed light on these questions?
  
  I ran a simple apache benchmark, with an overcommit scenario, where
  both the vcpu and vhost share the same core.
  In some cases (c4 in my testcases) polling surprisingly produced a
  better throughput.
 
 Likely because latency is hurt, so you get better batching?
 
  Therefore, it is hard to predict how the polling will impact
  performance in advance.
 
 If it's so hard, users will struggle to configure this properly.
 Looks like an argument for us developers to do the hard work,
 and expose simpler controls to users?
 
  It is up to whoever is using this optimization to use it wisely.
  Thanks,
  Razya 
  
 


Re: Fw: Benchmarking for vhost polling patch

2015-01-12 Thread Michael S. Tsirkin
On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
  Hi Razya,
  Thanks for the update.
  So that's reasonable I think, and I think it makes sense
  to keep working on this in isolation - it's more
  manageable at this size.
  
  The big questions in my mind:
  - What happens if system is lightly loaded?
E.g. a ping/pong benchmark. How much extra CPU are
we wasting?
  - We see the best performance on your system is with 10usec worth of 
 polling.
It's OK to be able to tune it for best performance, but
most people don't have the time or the inclination.
So what would be the best value for other CPUs?
 
 The extra cpu waste vs throughput gain depends on the polling timeout
 value (poll_stop_idle).
 The best value to choose is dependent on the workload and the system
 hardware and configuration.
 There is nothing that we can say about this value in advance. The system's
 manager/administrator should use this optimization with the awareness that
 polling consumes extra cpu cycles, as documented.
 
  - Should this be tunable from userspace per vhost instance?
Why is it only tunable globally?
 
 It should be tunable per vhost thread.
 We can do it in a subsequent patch.

So I think whether the patchset is appropriate upstream
will depend exactly on coming up with a reasonable
interface for enabling and tuning the functionality.

I was hopeful some reasonable default value can be
derived from e.g. cost of the exit.
If that is not the case, it becomes that much harder
for users to select good default values.
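
(A back-of-envelope version of that derivation, purely illustrative:
if a notification round trip, i.e. guest exit plus vhost wakeup, costs
on the order of 5 us on a given machine, then polling starts to pay
off once the timeout is in the same range,

    poll_stop_idle ~ 1-2x exit cost ~ 5-10 us,

which is roughly where the 10 usec sweet spot reported for the test
system sits. The 5 us figure is an assumed, machine-dependent number.)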

There are some cases where networking stack already
exposes low-level hardware detail to userspace, e.g.
tcp polling configuration. If we can't come up with
a way to abstract hardware, maybe we can at least tie
it to these existing controls rather than introducing
new ones?


  - How bad is it if you don't pin vhost and vcpu threads?
Is the scheduler smart enough to pull them apart?
  - What happens in overcommit scenarios? Does polling make things
much worse?
Clearly polling will work worse if e.g. vhost and vcpu
share the host cpu. How can we avoid conflicts?
  
For the two last questions, better cooperation with the host scheduler will
likely help here.
See e.g.  
 http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
I'm currently looking at pushing something similar upstream,
if it goes in vhost polling can do something similar.
  
  Any data points to shed light on these questions?
 
 I ran a simple apache benchmark, with an overcommit scenario, where both
 the vcpu and vhost share the same core.
 In some cases (c4 in my testcases) polling surprisingly produced a better
 throughput.

Likely because latency is hurt, so you get better batching?

 Therefore, it is hard to predict how the polling will impact performance 
 in advance. 

If it's so hard, users will struggle to configure this properly.
Looks like an argument for us developers to do the hard work,
and expose simpler controls to users?

 It is up to whoever is using this optimization to use it wisely.
 Thanks,
 Razya 
 


Re: Fw: Benchmarking for vhost polling patch

2015-01-11 Thread Razya Ladelsky
 Hi Razya,
 Thanks for the update.
 So that's reasonable I think, and I think it makes sense
 to keep working on this in isolation - it's more
 manageable at this size.
 
 The big questions in my mind:
 - What happens if system is lightly loaded?
   E.g. a ping/pong benchmark. How much extra CPU are
   we wasting?
 - We see the best performance on your system is with 10usec worth of 
polling.
   It's OK to be able to tune it for best performance, but
   most people don't have the time or the inclination.
   So what would be the best value for other CPUs?

The extra cpu waste vs throughput gain depends on the polling timeout
value (poll_stop_idle).
The best value to choose is dependent on the workload and the system
hardware and configuration.
There is nothing that we can say about this value in advance. The system's
manager/administrator should use this optimization with the awareness that
polling consumes extra cpu cycles, as documented.

 - Should this be tunable from userspace per vhost instance?
   Why is it only tunable globally?

It should be tunable per vhost thread.
We can do it in a subsequent patch.

 - How bad is it if you don't pin vhost and vcpu threads?
   Is the scheduler smart enough to pull them apart?
 - What happens in overcommit scenarios? Does polling make things
   much worse?
   Clearly polling will work worse if e.g. vhost and vcpu
   share the host cpu. How can we avoid conflicts?
 
   For the two last questions, better cooperation with the host scheduler will
   likely help here.
   See e.g.  
http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
   I'm currently looking at pushing something similar upstream,
   if it goes in vhost polling can do something similar.
 
 Any data points to shed light on these questions?

I ran a simple apache benchmark, with an overcommit scenario, where both
the vcpu and vhost share the same core.
In some cases (c4 in my testcases) polling surprisingly produced a better
throughput.
Therefore, it is hard to predict how the polling will impact performance 
in advance. 
It is up to whoever is using this optimization to use it wisely.
Thanks,
Razya 




Re: Fw: Benchmarking for vhost polling patch

2015-01-05 Thread Michael S. Tsirkin
Hi Razya,
Thanks for the update.
So that's reasonable I think, and I think it makes sense
to keep working on this in isolation - it's more
manageable at this size.

The big questions in my mind:
- What happens if system is lightly loaded?
  E.g. a ping/pong benchmark. How much extra CPU are
  we wasting?
- We see the best performance on your system is with 10usec worth of polling.
  It's OK to be able to tune it for best performance, but
  most people don't have the time or the inclination.
  So what would be the best value for other CPUs?
- Should this be tunable from userspace per vhost instance?
  Why is it only tunable globally?
- How bad is it if you don't pin vhost and vcpu threads?
  Is the scheduler smart enough to pull them apart?
- What happens in overcommit scenarios? Does polling make things
  much worse?
  Clearly polling will work worse if e.g. vhost and vcpu
  share the host cpu. How can we avoid conflicts?

  For the two last questions, better cooperation with the host scheduler will
  likely help here.
  See e.g.  http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
  I'm currently looking at pushing something similar upstream,
  if it goes in vhost polling can do something similar.

Any data points to shed light on these questions?

On Thu, Jan 01, 2015 at 02:59:21PM +0200, Razya Ladelsky wrote:
 Hi Michael,
 Just a follow-up on the polling patch numbers..
 Please let me know if you find these numbers satisfying enough to continue 
 with submitting this patch.
 Otherwise - we'll have this patch submitted as part of the larger Elvis 
 patch set rather than independently.
 Thank you,
 Razya 
 
 - Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM -
 
 From:   Razya Ladelsky/Haifa/IBM@IBMIL
 To: m...@redhat.com
 Cc: 
 Date:   25/11/2014 02:43 PM
 Subject:Re: Benchmarking for vhost polling patch
 Sent by:kvm-ow...@vger.kernel.org
 
 
 
 Hi Michael,
 
  Hi Razya,
  On the netperf benchmark, it looks like polling=10 gives a modest but
  measureable gain.  So from that perspective it might be worth it if it's
  not too much code, though we'll need to spend more time checking the
  macro effect - we barely moved the needle on the macro benchmark and
  that is suspicious.
 
 I ran memcached with various values for the key & value arguments, and
 managed to see a bigger impact of polling than when I used the default
 values.
 Here are the numbers:
 
 key=250     TPS     net    vhost  vm    TPS/cpu  TPS/cpu
 value=2048          rate   util   util           change
 
 polling=0   101540   103.0  46   100   695.47
 polling=5   136747   123.0  83   100   747.25   0.074440609
 polling=7   140722   125.7  84   100   764.79   0.099663658
 polling=10  141719   126.3  87   100   757.85   0.089688003
 polling=15  142430   127.1  90   100   749.63   0.077863015
 polling=25  146347   128.7  95   100   750.49   0.079107993
 polling=50  150882   131.1  100  100   754.41   0.084733701
 
 Macro benchmarks are less I/O intensive than the micro benchmark, which
 is why we can expect less impact for polling as compared to netperf.
 However, as shown above, we managed to get a 10% TPS/cpu improvement
 with the polling patch.
 
  Is there a chance you are actually trading latency for throughput?
  do you observe any effect on latency?
 
 No.
 
  How about trying some other benchmark, e.g. NFS?
  
 
 Tried, but didn't have enough I/O produced (vhost was at most at 15% util)

OK but was there a regression in this case?


  
  Also, I am wondering:
  
  since vhost thread is polling in kernel anyway, shouldn't
  we try and poll the host NIC?
  that would likely reduce at least the latency significantly,
  won't it?
  
 
 Yes, it could be a great addition at some point, but needs a thorough 
 investigation. In any case, not a part of this patch...
 
 Thanks,
 Razya
 


Fw: Benchmarking for vhost polling patch

2015-01-01 Thread Razya Ladelsky
Hi Michael,
Just a follow-up on the polling patch numbers..
Please let me know if you find these numbers satisfying enough to continue 
with submitting this patch.
Otherwise - we'll have this patch submitted as part of the larger Elvis 
patch set rather than independently.
Thank you,
Razya 

- Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM -

From:   Razya Ladelsky/Haifa/IBM@IBMIL
To: m...@redhat.com
Cc: 
Date:   25/11/2014 02:43 PM
Subject:Re: Benchmarking for vhost polling patch
Sent by:kvm-ow...@vger.kernel.org



Hi Michael,

 Hi Razya,
 On the netperf benchmark, it looks like polling=10 gives a modest but
 measureable gain.  So from that perspective it might be worth it if it's
 not too much code, though we'll need to spend more time checking the
 macro effect - we barely moved the needle on the macro benchmark and
 that is suspicious.

I ran memcached with various values for the key & value arguments, and
managed to see a bigger impact of polling than when I used the default
values.
Here are the numbers:

key=250     TPS     net    vhost  vm    TPS/cpu  TPS/cpu
value=2048          rate   util   util           change

polling=0   101540   103.0  46   100   695.47
polling=5   136747   123.0  83   100   747.25   0.074440609
polling=7   140722   125.7  84   100   764.79   0.099663658
polling=10  141719   126.3  87   100   757.85   0.089688003
polling=15  142430   127.1  90   100   749.63   0.077863015
polling=25  146347   128.7  95   100   750.49   0.079107993
polling=50  150882   131.1  100  100   754.41   0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which
is why we can expect less impact for polling as compared to netperf.
However, as shown above, we managed to get a 10% TPS/cpu improvement
with the polling patch.

 Is there a chance you are actually trading latency for throughput?
 do you observe any effect on latency?

No.

 How about trying some other benchmark, e.g. NFS?
 

Tried, but didn't have enough I/O produced (vhost was at most at 15% util)

 
 Also, I am wondering:
 
 since vhost thread is polling in kernel anyway, shouldn't
 we try and poll the host NIC?
 that would likely reduce at least the latency significantly,
 won't it?
 

Yes, it could be a great addition at some point, but needs a thorough 
investigation. In any case, not a part of this patch...

Thanks,
Razya



Re: Benchmarking for vhost polling patch

2014-11-25 Thread Razya Ladelsky
Hi Michael,

 Hi Razya,
 On the netperf benchmark, it looks like polling=10 gives a modest but
 measureable gain.  So from that perspective it might be worth it if it's
 not too much code, though we'll need to spend more time checking the
 macro effect - we barely moved the needle on the macro benchmark and
 that is suspicious.

I ran memcached with various values for the key & value arguments, and
managed to see a bigger impact of polling than when I used the default values.
Here are the numbers:

key=250     TPS     net    vhost  vm    TPS/cpu  TPS/cpu
value=2048          rate   util   util           change

polling=0   101540   103.0  46   100   695.47
polling=5   136747   123.0  83   100   747.25   0.074440609
polling=7   140722   125.7  84   100   764.79   0.099663658
polling=10  141719   126.3  87   100   757.85   0.089688003
polling=15  142430   127.1  90   100   749.63   0.077863015
polling=25  146347   128.7  95   100   750.49   0.079107993
polling=50  150882   131.1  100  100   754.41   0.084733701
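
(A note on reading the table: TPS/cpu is the TPS column divided by the
summed cpu utilization of the vhost and vm cores, e.g. for the first
row 101540 / (46 + 100) = 695.47, and the last column is the relative
change of that ratio versus the polling=0 baseline, e.g.
747.25 / 695.47 - 1 = 0.0744.)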

Macro benchmarks are less I/O intensive than the micro benchmark, which is why 
we can expect less impact for polling as compared to netperf. 
However, as shown above, we managed to get a 10% TPS/cpu improvement with the
polling patch.

 Is there a chance you are actually trading latency for throughput?
 do you observe any effect on latency?

No.

 How about trying some other benchmark, e.g. NFS?
 

Tried, but didn't have enough I/O produced (vhost was at most at 15% util)

 
 Also, I am wondering:
 
 since vhost thread is polling in kernel anyway, shouldn't
 we try and poll the host NIC?
 that would likely reduce at least the latency significantly,
 won't it?
 

Yes, it could be a great addition at some point, but needs a thorough 
investigation. In any case, not a part of this patch...

Thanks,
Razya



Benchmarking for vhost polling patch

2014-11-16 Thread Razya Ladelsky
Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM:

 From: Razya Ladelsky/Haifa/IBM@IBMIL
 To: m...@redhat.com
 Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, 
 Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, 
 Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org
 Date: 29/10/2014 02:38 PM
 Subject: Benchmarking for vhost polling patch
 
 Hi Michael,
 
 Following the polling patch thread:
 http://marc.info/?l=kvm&m=140853271510179&w=2,
 I changed poll_stop_idle to be counted in microseconds, and carried out
 experiments using varying sizes of this value.
 
 If it makes sense to you, I will continue with the other changes 
 requested for 
 the patch.
 
 Thank you,
 Razya
 
 

Dear Michael,
I'm still interested in hearing your opinion about these numbers
http://marc.info/?l=kvm&m=141458631532669&w=2,
and whether it is worthwhile to continue with the polling patch.
Thank you,
Razya 


 
 



Re: Benchmarking for vhost polling patch

2014-11-16 Thread Michael S. Tsirkin
On Sun, Nov 16, 2014 at 02:08:49PM +0200, Razya Ladelsky wrote:
 Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM:
 
  From: Razya Ladelsky/Haifa/IBM@IBMIL
  To: m...@redhat.com
  Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, 
  Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, 
  Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org
  Date: 29/10/2014 02:38 PM
  Subject: Benchmarking for vhost polling patch
  
  Hi Michael,
  
  Following the polling patch thread:
  http://marc.info/?l=kvm&m=140853271510179&w=2,
  I changed poll_stop_idle to be counted in microseconds, and carried out
  experiments using varying sizes of this value.
  
  If it makes sense to you, I will continue with the other changes 
  requested for 
  the patch.
  
  Thank you,
  Razya
  
  
 
 Dear Michael,
 I'm still interested in hearing your opinion about these numbers 
 http://marc.info/?l=kvm&m=141458631532669&w=2,
 and whether it is worthwhile to continue with the polling patch.
 Thank you,
 Razya 
 
 
  
  

Hi Razya,
On the netperf benchmark, it looks like polling=10 gives a modest but
measureable gain.  So from that perspective it might be worth it if it's
not too much code, though we'll need to spend more time checking the
macro effect - we barely moved the needle on the macro benchmark and
that is suspicious.
Is there a chance you are actually trading latency for throughput?
do you observe any effect on latency?
How about trying some other benchmark, e.g. NFS?


Also, I am wondering:

since vhost thread is polling in kernel anyway, shouldn't
we try and poll the host NIC?
that would likely reduce at least the latency significantly,
won't it?


-- 
MST


Re: Benchmarking for vhost polling patch

2014-11-09 Thread Razya Ladelsky
Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM:

 From: Razya Ladelsky/Haifa/IBM@IBMIL
 To: m...@redhat.com
 Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, 
 Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, 
 Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org
 Date: 29/10/2014 02:38 PM
 Subject: Benchmarking for vhost polling patch
 
 Hi Michael,
 
 Following the polling patch thread:
 http://marc.info/?l=kvm&m=140853271510179&w=2,
 I changed poll_stop_idle to be counted in microseconds, and carried out
 experiments using varying sizes of this value.
 
 If it makes sense to you, I will continue with the other changes 
 requested for 
 the patch.
 
 Thank you,
 Razya
 
 

Hi Michael,
Have you had the chance to look into these numbers?
Thank you,
Razya 


 
 



Re: Benchmarking for vhost polling patch

2014-10-30 Thread Zhang Haoyu
 Hi Michael,
 
 Following the polling patch thread:
 http://marc.info/?l=kvm&m=140853271510179&w=2,
 I changed poll_stop_idle to be counted in microseconds, and carried out
 experiments using varying sizes of this value. The setup for netperf
 consisted of 1 vm and 1 vhost, each running on their own dedicated core.
 
Could you provide your code changes?

Thanks,
Zhang Haoyu



Re: Benchmarking for vhost polling patch

2014-10-30 Thread Razya Ladelsky
Zhang Haoyu zhan...@sangfor.com wrote on 30/10/2014 01:30:08 PM:

 From: Zhang Haoyu zhan...@sangfor.com
 To: Razya Ladelsky/Haifa/IBM@IBMIL, mst m...@redhat.com
 Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm kvm@vger.kernel.org
 Date: 30/10/2014 01:30 PM
 Subject: Re: Benchmarking for vhost polling patch
 
  Hi Michael,
  
  Following the polling patch thread:
  http://marc.info/?l=kvm&m=140853271510179&w=2,
  I changed poll_stop_idle to be counted in microseconds, and carried out
  experiments using varying sizes of this value. The setup for netperf
  consisted of 1 vm and 1 vhost, each running on their own dedicated core.
  
 Could you provide your code changes?
 
 Thanks,
 Zhang Haoyu
 
Hi Zhang,
Do you mean the change in code for poll_stop_idle?
Thanks,
Razya



Re: Benchmarking for vhost polling patch

2014-10-30 Thread Zhang Haoyu
  Hi Michael,
  
  Following the polling patch thread:
  http://marc.info/?l=kvm&m=140853271510179&w=2,
  I changed poll_stop_idle to be counted in microseconds, and carried out
  experiments using varying sizes of this value. The setup for netperf
  consisted of 1 vm and 1 vhost, each running on their own dedicated core.
  
 Could you provide your code changes?
 
 Thanks,
 Zhang Haoyu
 
 Hi Zhang,
 Do you mean the change in code for poll_stop_idle?

Yes, it's better to provide the complete code, including the polling patch.

Thanks,
Zhang Haoyu



Benchmarking for vhost polling patch

2014-10-29 Thread Razya Ladelsky
Hi Michael,

Following the polling patch thread:
http://marc.info/?l=kvm&m=140853271510179&w=2,
I changed poll_stop_idle to be counted in microseconds, and carried out
experiments using varying sizes of this value. The setup for netperf
consisted of 1 vm and 1 vhost, each running on their own dedicated core.
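
(For reference, a heavily simplified sketch of the mechanism under
test: the vhost worker keeps polling the virtqueue for up to
poll_stop_idle microseconds after it goes idle, instead of immediately
re-enabling guest notifications. Names like vq_has_work() and
handle_vq_work() are placeholders; this illustrates the idea, it is
not the patch itself:)

/* Illustrative only: busy-poll the virtqueue for up to
 * poll_stop_idle_us after the last completed work, then fall back
 * to notification-driven operation. */
static void vhost_poll_idle(struct vhost_virtqueue *vq,
                            unsigned int poll_stop_idle_us)
{
        u64 deadline = ktime_get_ns() + poll_stop_idle_us * NSEC_PER_USEC;

        vhost_disable_notify(vq->dev, vq);   /* no guest kicks while polling */
        while (ktime_get_ns() < deadline) {
                if (vq_has_work(vq)) {       /* placeholder check */
                        handle_vq_work(vq);  /* placeholder handler */
                        /* new work pushes the idle deadline forward */
                        deadline = ktime_get_ns() +
                                   poll_stop_idle_us * NSEC_PER_USEC;
                }
                cpu_relax();
        }
        vhost_enable_notify(vq->dev, vq);    /* back to kick-driven mode */
}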
  
Here are the  numbers for netperf (micro benchmark):

polling|Send |Throughput|Utilization |S. Demand   |vhost|exits|throughput|throughput
mode   |Msg  |          |Send  Recv  |Send  Recv  |util |/sec | /cpu     | /cpu
       |Size |          |local remote|local remote|     |     |          | % change
       |bytes|10^6bits/s|  %     %   |us/KB us/KB |  %  |     |          |
-------------------------------------------------------------------------------------
NoPolling  64   1054.11   99.97 3.01  7.78  3.74   38.80  92K    7.60
Polling=1  64   1036.67   99.97 2.93  7.90  3.70   53.00  92K    6.78  -10.78
Polling=5  64   1079.27   99.97 3.07  7.59  3.73   83.00  90K    5.90  -22.35
Polling=7  64   1444.90   99.97 3.98  5.67  3.61   95.00  19.5K  7.41   -2.44
Polling=10 64   1521.70   99.97 4.21  5.38  3.63   98.00  8.5K   7.69    1.19
Polling=25 64   1534.24   99.97 4.18  5.34  3.57   99.00  8.5K   7.71    1.51
Polling=50 64   1534.24   99.97 4.18  5.34  3.57   99.00  8.5K   7.71    1.51

NoPolling  128  1577.39   99.97 4.09  5.19  3.40   54.00  113K   10.24
Polling=1  128  1596.08   99.97 4.22  5.13  3.47   71.00  120K   9.34   -8.88
Polling=5  128  2238.49   99.97 5.45  3.66  3.19   92.00  24K    11.66  13.82
Polling=7  128  2330.97   99.97 5.59  3.51  3.14   95.00  19.5K  11.96  16.70
Polling=10 128  2375.78   99.97 5.69  3.45  3.14   98.00  10K    12.00  17.14
Polling=25 128  2655.01   99.97 2.45  3.09  1.21   99.00  8.5K   13.34  30.25
Polling=50 128  2655.01   99.97 2.45  3.09  1.21   99.00  8.5K   13.34  30.25

NoPolling  256  2558.10   99.97 2.33  3.20  1.20   67.00  120K   15.32
Polling=1  256  2508.93   99.97 3.13  3.27  1.67   75.00  125K   14.34  -6.41
Polling=5  256  3740.34   99.97 2.70  2.19  0.95   94.00  17K    19.28  25.86
Polling=7  256  3692.69   99.97 2.80  2.22  0.99   97.00  15.5K  18.75  22.37
Polling=10 256  4036.60   99.97 2.69  2.03  0.87   99.00  8.5K   20.29  32.42
Polling=25 256  3998.89   99.97 2.64  2.05  0.87   99.00  8.5K   20.10  31.18
Polling=50 256  3998.89   99.97 2.64  2.05  0.87   99.00  8.5K   20.10  31.18

NoPolling  512  4531.50   99.90 2.75  1.81  0.79   78.00  55K    25.47
Polling=1  512  4684.19   99.95 2.69  1.75  0.75   83.00  35K    25.60   0.52
Polling=5  512  4932.65   99.75 2.75  1.68  0.74   91.00  12K    25.86   1.52
Polling=7  512  5226.14   99.86 2.80  1.57  0.70   95.00  7.5K   26.82   5.30
Polling=10 512  5464.90   99.60 2.90  1.49  0.70   96.00  8.2K   27.94   9.69
Polling=25 512  5550.44   99.58 2.84  1.47  0.67   99.00  7.5K   27.95   9.73
Polling=50 512  5550.44   99.58 2.84  1.47  0.67   99.00  7.5K   27.95   9.73


As you can see from the last column, polling improves performance in most cases.
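
(Here throughput/cpu is the Throughput column divided by the sum of the
sender and vhost utilizations, e.g. 1054.11 / (99.97 + 38.80) = 7.60
for the first row, and the last column is the change of that ratio
relative to the NoPolling row of the same message size.)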

I ran memcached (macro benchmark), where (as in the previous benchmark) the vm
and vhost each get their own dedicated core. I configured memslap with C=128,
T=8, as this configuration was required to produce enough load to saturate the
vm. I tried several other configurations, but this one produced the maximal
throughput (for the baseline).
  
The numbers for memcached (macro benchmark):

polling      time   TPS     Net    vhost  vm    exits  TPS/cpu  TPS/cpu
mode                        rate   util   util  /sec            % change
                                    %      %
Disabled     15.9s  125819   91.5   45     99   87K    873.74
polling=1    15.8s  126820   92.3   60     99   87K    797.61   -8.71
polling=5    12.8s  155799  113.4   79     99   25.5K  875.28    0.18
polling=10   11.7s  160639  116.9   83     99   16.3K  882.63    1.02
polling=15   12.4s  160897  117.2   87     99   15K    865.04   -1.00
polling=100  11.7s  170971  124.4   99     99   30     863.49   -1.17


For memcached, TPS/cpu does not show a significant difference in any of the
cases.
However, the TPS numbers did improve by up to 35%, which can be useful for
under-utilized systems that have cpu time to spare for extra throughput.

If it makes sense to you, I will continue with the other changes requested for 
the patch.

Thank you,
Razya



