Re: Fw: Benchmarking for vhost polling patch
Our suggestion would be to use the maximum (a large enough) value, so that vhost is polling 100% of the time. The polling optimization mainly addresses users who want to maximize their performance, even at the expense of wasting cpu cycles. The maximum value will produce the biggest impact on performance.

*Everyone* is interested in getting maximum performance from their systems.

Maybe so, but not everyone is willing to pay the price. That is also the reason why this optimization should not be enabled by default. However, using the maximum default value will be valuable even for users who care more about the normalized throughput/cpu criterion. Such users, interested in a finer tuning of the polling timeout, need to look for an optimal timeout value for their system. The maximum value serves as the upper limit of the range that needs to be searched for such an optimal timeout value.

The number of users who are going to do this kind of tuning can be counted on one hand.

If the optimization is not enabled by default, the default value is almost irrelevant: when users turn on the feature they should understand that there is an associated cost, and they have to tune their system if they want to get the maximum benefit (depending on how they define their maximum benefit). The maximum value is a good starting point that will work in most cases and can be used to start the tuning.

There are some cases where the networking stack already exposes low-level hardware detail to userspace, e.g. tcp polling configuration. If we can't come up with a way to abstract hardware, maybe we can at least tie it to these existing controls rather than introducing new ones?

We've spent time thinking about the possible interfaces that could be appropriate for such an optimization (including tcp polling). We think that using an ioctl as the interface to configure the virtual device/vhost, in the same manner that e.g.
SET_NET_BACKEND is configured, makes a lot of sense, and is consistent with the existing mechanism. Thanks, Razya

The guest is giving up its share of CPU for the benefit of vhost, right? So maybe exposing this to the guest is appropriate, and then add e.g. an ethtool interface for the guest admin to set this.

The decision of whether to turn polling on (and at what rate) should be made by the system administrator, who has a broad view of the system and workload, and not by the guest administrator. Polling should be a tunable parameter on the host side; the guest should not be aware of it. The guest is not necessarily giving up its time: it may be that there's just an extra dedicated core or free cpu cycles on a different cpu. We provide a mechanism and an interface that can be tuned by some other program to implement its policy. This patch is all about the mechanism, not the policy of how to use it. Thank you, Razya

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: Benchmarking for vhost polling patch
On Wed, Jan 14, 2015 at 05:01:05PM +0200, Razya Ladelsky wrote: Michael S. Tsirkin m...@redhat.com wrote on 12/01/2015 12:36:13 PM: From: Michael S. Tsirkin m...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org, Eyal Moscovici/Haifa/IBM@IBMIL Date: 12/01/2015 12:36 PM Subject: Re: Fw: Benchmarking for vhost polling patch

On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote: Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind:

- What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting?

- We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs?

The extra cpu waste vs. throughput gain depends on the polling timeout value (poll_stop_idle). The best value to choose is dependent on the workload and on the system hardware and configuration. There is nothing that we can say about this value in advance. The system's manager/administrator should use this optimization with the awareness that polling consumes extra cpu cycles, as documented.

- Should this be tunable from userspace per vhost instance? Why is it only tunable globally?

It should be tunable per vhost thread. We can do it in a subsequent patch.

So I think whether the patchset is appropriate upstream will depend exactly on coming up with a reasonable interface for enabling and tuning the functionality.

How about adding a new ioctl for each vhost device that sets poll_stop_idle (the timeout)? This should be aligned with the QEMU way of doing things.
I was hopeful some reasonable default value could be derived from e.g. the cost of the exit. If that is not the case, it becomes that much harder for users to select good default values.

Our suggestion would be to use the maximum (a large enough) value, so that vhost is polling 100% of the time. The polling optimization mainly addresses users who want to maximize their performance, even at the expense of wasting cpu cycles. The maximum value will produce the biggest impact on performance.

*Everyone* is interested in getting maximum performance from their systems.

However, using the maximum default value will be valuable even for users who care more about the normalized throughput/cpu criterion. Such users, interested in a finer tuning of the polling timeout, need to look for an optimal timeout value for their system. The maximum value serves as the upper limit of the range that needs to be searched for such an optimal timeout value.

The number of users who are going to do this kind of tuning can be counted on one hand. Polling all the time also only works well if you have dedicated CPUs for VMs, and no HT. I'm concerned you didn't really try to do something more widely useful, and easier to use, being too focused on getting your high netperf number.

There are some cases where the networking stack already exposes low-level hardware detail to userspace, e.g. tcp polling configuration. If we can't come up with a way to abstract hardware, maybe we can at least tie it to these existing controls rather than introducing new ones?

We've spent time thinking about the possible interfaces that could be appropriate for such an optimization (including tcp polling). We think that using an ioctl as the interface to configure the virtual device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured, makes a lot of sense, and is consistent with the existing mechanism. Thanks, Razya

The guest is giving up its share of CPU for the benefit of vhost, right?
So maybe exposing this to the guest is appropriate, and then add e.g. an ethtool interface for the guest admin to set this. This means we'll want virtio and qemu patches for this. But really, you want to find a way to enable it by default.

- How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart?

- What happens in overcommit scenarios? Does polling make things much worse? Clearly polling will work worse if e.g. vhost and vcpu share the host cpu. How can we avoid conflicts?

For the two last questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream
Re: Fw: Benchmarking for vhost polling patch
Michael S. Tsirkin m...@redhat.com wrote on 12/01/2015 12:36:13 PM: From: Michael S. Tsirkin m...@redhat.com To: Razya Ladelsky/Haifa/IBM@IBMIL Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org, Eyal Moscovici/Haifa/IBM@IBMIL Date: 12/01/2015 12:36 PM Subject: Re: Fw: Benchmarking for vhost polling patch

On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote: Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind:

- What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting?

- We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs?

The extra cpu waste vs. throughput gain depends on the polling timeout value (poll_stop_idle). The best value to choose is dependent on the workload and on the system hardware and configuration. There is nothing that we can say about this value in advance. The system's manager/administrator should use this optimization with the awareness that polling consumes extra cpu cycles, as documented.

- Should this be tunable from userspace per vhost instance? Why is it only tunable globally?

It should be tunable per vhost thread. We can do it in a subsequent patch.

So I think whether the patchset is appropriate upstream will depend exactly on coming up with a reasonable interface for enabling and tuning the functionality.

How about adding a new ioctl for each vhost device that sets poll_stop_idle (the timeout)? This should be aligned with the QEMU way of doing things.

I was hopeful some reasonable default value could be derived from e.g. the cost of the exit.
If that is not the case, it becomes that much harder for users to select good default values.

Our suggestion would be to use the maximum (a large enough) value, so that vhost is polling 100% of the time. The polling optimization mainly addresses users who want to maximize their performance, even at the expense of wasting cpu cycles. The maximum value will produce the biggest impact on performance. However, using the maximum default value will be valuable even for users who care more about the normalized throughput/cpu criterion. Such users, interested in a finer tuning of the polling timeout, need to look for an optimal timeout value for their system. The maximum value serves as the upper limit of the range that needs to be searched for such an optimal timeout value.

There are some cases where the networking stack already exposes low-level hardware detail to userspace, e.g. tcp polling configuration. If we can't come up with a way to abstract hardware, maybe we can at least tie it to these existing controls rather than introducing new ones?

We've spent time thinking about the possible interfaces that could be appropriate for such an optimization (including tcp polling). We think that using an ioctl as the interface to configure the virtual device/vhost, in the same manner that e.g. SET_NET_BACKEND is configured, makes a lot of sense, and is consistent with the existing mechanism. Thanks, Razya

- How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart?

- What happens in overcommit scenarios? Does polling make things much worse? Clearly polling will work worse if e.g. vhost and vcpu share the host cpu. How can we avoid conflicts?

For the two last questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream, if it goes in vhost polling can do something similar.
Any data points to shed light on these questions?

I ran a simple apache benchmark with an overcommit scenario, where both the vcpu and vhost share the same core. In some cases (c4 in my test cases) polling surprisingly produced a better throughput.

Likely because latency is hurt, so you get better batching?

Therefore, it is hard to predict how the polling will impact performance in advance.

If it's so hard, users will struggle to configure this properly. Looks like an argument for us developers to do the hard work, and expose simpler controls to users?

It is up to whoever is using this optimization to use it wisely. Thanks, Razya
Re: Fw: Benchmarking for vhost polling patch
On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote: Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind:

- What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting?

- We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs?

The extra cpu waste vs. throughput gain depends on the polling timeout value (poll_stop_idle). The best value to choose is dependent on the workload and on the system hardware and configuration. There is nothing that we can say about this value in advance. The system's manager/administrator should use this optimization with the awareness that polling consumes extra cpu cycles, as documented.

- Should this be tunable from userspace per vhost instance? Why is it only tunable globally?

It should be tunable per vhost thread. We can do it in a subsequent patch.

So I think whether the patchset is appropriate upstream will depend exactly on coming up with a reasonable interface for enabling and tuning the functionality. I was hopeful some reasonable default value could be derived from e.g. the cost of the exit. If that is not the case, it becomes that much harder for users to select good default values.

There are some cases where the networking stack already exposes low-level hardware detail to userspace, e.g. tcp polling configuration. If we can't come up with a way to abstract hardware, maybe we can at least tie it to these existing controls rather than introducing new ones?

- How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart?

- What happens in overcommit scenarios? Does polling make things much worse? Clearly polling will work worse if e.g.
vhost and vcpu share the host cpu. How can we avoid conflicts?

For the two last questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream, if it goes in vhost polling can do something similar.

Any data points to shed light on these questions?

I ran a simple apache benchmark with an overcommit scenario, where both the vcpu and vhost share the same core. In some cases (c4 in my test cases) polling surprisingly produced a better throughput.

Likely because latency is hurt, so you get better batching?

Therefore, it is hard to predict how the polling will impact performance in advance.

If it's so hard, users will struggle to configure this properly. Looks like an argument for us developers to do the hard work, and expose simpler controls to users?

It is up to whoever is using this optimization to use it wisely. Thanks, Razya
Re: Fw: Benchmarking for vhost polling patch
Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind:

- What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting?

- We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs?

The extra cpu waste vs. throughput gain depends on the polling timeout value (poll_stop_idle). The best value to choose is dependent on the workload and on the system hardware and configuration. There is nothing that we can say about this value in advance. The system's manager/administrator should use this optimization with the awareness that polling consumes extra cpu cycles, as documented.

- Should this be tunable from userspace per vhost instance? Why is it only tunable globally?

It should be tunable per vhost thread. We can do it in a subsequent patch.

- How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart?

- What happens in overcommit scenarios? Does polling make things much worse? Clearly polling will work worse if e.g. vhost and vcpu share the host cpu. How can we avoid conflicts?

For the two last questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream, if it goes in vhost polling can do something similar.

Any data points to shed light on these questions?

I ran a simple apache benchmark with an overcommit scenario, where both the vcpu and vhost share the same core. In some cases (c4 in my test cases) polling surprisingly produced a better throughput. Therefore, it is hard to predict how the polling will impact performance in advance.
It is up to whoever is using this optimization to use it wisely. Thanks, Razya
Re: Fw: Benchmarking for vhost polling patch
Hi Razya, Thanks for the update. So that's reasonable I think, and I think it makes sense to keep working on this in isolation - it's more manageable at this size. The big questions in my mind:

- What happens if the system is lightly loaded? E.g. a ping/pong benchmark. How much extra CPU are we wasting?

- We see the best performance on your system is with 10usec worth of polling. It's OK to be able to tune it for best performance, but most people don't have the time or the inclination. So what would be the best value for other CPUs?

- Should this be tunable from userspace per vhost instance? Why is it only tunable globally?

- How bad is it if you don't pin vhost and vcpu threads? Is the scheduler smart enough to pull them apart?

- What happens in overcommit scenarios? Does polling make things much worse? Clearly polling will work worse if e.g. vhost and vcpu share the host cpu. How can we avoid conflicts?

For the two last questions, better cooperation with the host scheduler will likely help here. See e.g. http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505 I'm currently looking at pushing something similar upstream, if it goes in vhost polling can do something similar.

Any data points to shed light on these questions?

On Thu, Jan 01, 2015 at 02:59:21PM +0200, Razya Ladelsky wrote: Hi Michael, Just a follow up on the polling patch numbers. Please let me know if you find these numbers satisfying enough to continue with submitting this patch. Otherwise - we'll have this patch submitted as part of the larger Elvis patch set rather than independently. Thank you, Razya

- Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM - From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Date: 25/11/2014 02:43 PM Subject: Re: Benchmarking for vhost polling patch Sent by: kvm-ow...@vger.kernel.org

Hi Michael, Hi Razya, On the netperf benchmark, it looks like polling=10 gives a modest but measurable gain.
So from that perspective it might be worth it if it's not too much code, though we'll need to spend more time checking the macro effect - we barely moved the needle on the macro benchmark and that is suspicious.

I ran memcached with various values for the key/value arguments, and managed to see a bigger impact of polling than when I used the default values. Here are the numbers:

key=250       TPS     net    vhost  vm    TPS/cpu  TPS/cpu
value=2048            rate   util   util           change
polling=0     101540  103.0  46     100   695.47
polling=5     136747  123.0  83     100   747.25   0.074440609
polling=7     140722  125.7  84     100   764.79   0.099663658
polling=10    141719  126.3  87     100   757.85   0.089688003
polling=15    142430  127.1  90     100   749.63   0.077863015
polling=25    146347  128.7  95     100   750.49   0.079107993
polling=50    150882  131.1  100    100   754.41   0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is why we can expect less impact for polling as compared to netperf. However, as shown above, we managed to get a 10% TPS/cpu improvement with the polling patch.

Is there a chance you are actually trading latency for throughput? Do you observe any effect on latency?

No.

How about trying some other benchmark, e.g. NFS?

Tried, but it didn't produce enough I/O (vhost was at most at 15% util).

OK but was there a regression in this case?

Also, I am wondering: since the vhost thread is polling in the kernel anyway, shouldn't we try and poll the host NIC? That would likely reduce at least the latency significantly, won't it?

Yes, it could be a great addition at some point, but needs a thorough investigation. In any case, not a part of this patch... Thanks, Razya
Fw: Benchmarking for vhost polling patch
Hi Michael, Just a follow up on the polling patch numbers. Please let me know if you find these numbers satisfying enough to continue with submitting this patch. Otherwise - we'll have this patch submitted as part of the larger Elvis patch set rather than independently. Thank you, Razya

- Forwarded by Razya Ladelsky/Haifa/IBM on 01/01/2015 09:37 AM - From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Date: 25/11/2014 02:43 PM Subject: Re: Benchmarking for vhost polling patch Sent by: kvm-ow...@vger.kernel.org

Hi Michael, Hi Razya, On the netperf benchmark, it looks like polling=10 gives a modest but measurable gain. So from that perspective it might be worth it if it's not too much code, though we'll need to spend more time checking the macro effect - we barely moved the needle on the macro benchmark and that is suspicious.

I ran memcached with various values for the key/value arguments, and managed to see a bigger impact of polling than when I used the default values. Here are the numbers:

key=250       TPS     net    vhost  vm    TPS/cpu  TPS/cpu
value=2048            rate   util   util           change
polling=0     101540  103.0  46     100   695.47
polling=5     136747  123.0  83     100   747.25   0.074440609
polling=7     140722  125.7  84     100   764.79   0.099663658
polling=10    141719  126.3  87     100   757.85   0.089688003
polling=15    142430  127.1  90     100   749.63   0.077863015
polling=25    146347  128.7  95     100   750.49   0.079107993
polling=50    150882  131.1  100    100   754.41   0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is why we can expect less impact for polling as compared to netperf. However, as shown above, we managed to get a 10% TPS/cpu improvement with the polling patch.

Is there a chance you are actually trading latency for throughput? Do you observe any effect on latency?

No.

How about trying some other benchmark, e.g. NFS?
Tried, but it didn't produce enough I/O (vhost was at most at 15% util).

Also, I am wondering: since the vhost thread is polling in the kernel anyway, shouldn't we try and poll the host NIC? That would likely reduce at least the latency significantly, won't it?

Yes, it could be a great addition at some point, but needs a thorough investigation. In any case, not a part of this patch... Thanks, Razya
Re: Benchmarking for vhost polling patch
Hi Michael, Hi Razya, On the netperf benchmark, it looks like polling=10 gives a modest but measurable gain. So from that perspective it might be worth it if it's not too much code, though we'll need to spend more time checking the macro effect - we barely moved the needle on the macro benchmark and that is suspicious.

I ran memcached with various values for the key/value arguments, and managed to see a bigger impact of polling than when I used the default values. Here are the numbers:

key=250       TPS     net    vhost  vm    TPS/cpu  TPS/cpu
value=2048            rate   util   util           change
polling=0     101540  103.0  46     100   695.47
polling=5     136747  123.0  83     100   747.25   0.074440609
polling=7     140722  125.7  84     100   764.79   0.099663658
polling=10    141719  126.3  87     100   757.85   0.089688003
polling=15    142430  127.1  90     100   749.63   0.077863015
polling=25    146347  128.7  95     100   750.49   0.079107993
polling=50    150882  131.1  100    100   754.41   0.084733701

Macro benchmarks are less I/O intensive than the micro benchmark, which is why we can expect less impact for polling as compared to netperf. However, as shown above, we managed to get a 10% TPS/cpu improvement with the polling patch.

Is there a chance you are actually trading latency for throughput? Do you observe any effect on latency?

No.

How about trying some other benchmark, e.g. NFS?

Tried, but it didn't produce enough I/O (vhost was at most at 15% util).

Also, I am wondering: since the vhost thread is polling in the kernel anyway, shouldn't we try and poll the host NIC? That would likely reduce at least the latency significantly, won't it?

Yes, it could be a great addition at some point, but needs a thorough investigation. In any case, not a part of this patch... Thanks, Razya
Benchmarking for vhost polling patch
Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM: From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org Date: 29/10/2014 02:38 PM Subject: Benchmarking for vhost polling patch

Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. If it makes sense to you, I will continue with the other changes requested for the patch. Thank you, Razya

Dear Michael, I'm still interested in hearing your opinion about these numbers http://marc.info/?l=kvm&m=141458631532669&w=2, and whether it is worthwhile to continue with the polling patch. Thank you, Razya
Re: Benchmarking for vhost polling patch
On Sun, Nov 16, 2014 at 02:08:49PM +0200, Razya Ladelsky wrote: Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM: From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org Date: 29/10/2014 02:38 PM Subject: Benchmarking for vhost polling patch

Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. If it makes sense to you, I will continue with the other changes requested for the patch. Thank you, Razya

Dear Michael, I'm still interested in hearing your opinion about these numbers http://marc.info/?l=kvm&m=141458631532669&w=2, and whether it is worthwhile to continue with the polling patch. Thank you, Razya

Hi Razya, On the netperf benchmark, it looks like polling=10 gives a modest but measurable gain. So from that perspective it might be worth it if it's not too much code, though we'll need to spend more time checking the macro effect - we barely moved the needle on the macro benchmark and that is suspicious.

Is there a chance you are actually trading latency for throughput? Do you observe any effect on latency? How about trying some other benchmark, e.g. NFS?

Also, I am wondering: since the vhost thread is polling in the kernel anyway, shouldn't we try and poll the host NIC? That would likely reduce at least the latency significantly, won't it?

-- MST
Re: Benchmarking for vhost polling patch
Razya Ladelsky/Haifa/IBM@IBMIL wrote on 29/10/2014 02:38:31 PM: From: Razya Ladelsky/Haifa/IBM@IBMIL To: m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, abel.gor...@gmail.com, kvm@vger.kernel.org Date: 29/10/2014 02:38 PM Subject: Benchmarking for vhost polling patch

Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. If it makes sense to you, I will continue with the other changes requested for the patch. Thank you, Razya

Hi Michael, Have you had the chance to look into these numbers? Thank you, Razya
Re: Benchmarking for vhost polling patch
Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 vm and 1 vhost, each running on their own dedicated core.

Could you provide your code changes? Thanks, Zhang Haoyu
Re: Benchmarking for vhost polling patch
Zhang Haoyu zhan...@sangfor.com wrote on 30/10/2014 01:30:08 PM: From: Zhang Haoyu zhan...@sangfor.com To: Razya Ladelsky/Haifa/IBM@IBMIL, mst m...@redhat.com Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm kvm@vger.kernel.org Date: 30/10/2014 01:30 PM Subject: Re: Benchmarking for vhost polling patch

Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 vm and 1 vhost, each running on their own dedicated core.

Could you provide your code changes? Thanks, Zhang Haoyu

Hi Zhang, Do you mean the change in code for poll_stop_idle? Thanks, Razya
Re: Benchmarking for vhost polling patch
> > > Hi Michael,
> > > Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 VM and 1 vhost thread, each running on their own dedicated core.
> >
> > Could you provide your changed code?
> > Thanks,
> > Zhang Haoyu
>
> Hi Zhang,
> Do you mean the change in code for poll_stop_idle?

Yes, it's better to provide the complete code, including the polling patch.
Thanks,
Zhang Haoyu

> Thanks,
> Razya
Benchmarking for vhost polling patch
Hi Michael,
Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 VM and 1 vhost thread, each running on their own dedicated core.

Here are the numbers for netperf (micro benchmark):

polling     Send   Throughput  Utilization    S. Demand      vhost  exits   throughput  throughput
mode        Msg                Send    Recv   Send   Recv    util   /sec    /cpu        /cpu
            Size               local   remote local  remote                             % change
            bytes  10^6bits/s  %       %      us/KB  us/KB   %
--------------------------------------------------------------------------------------------------
NoPolling    64     1054.11    99.97   3.01   7.78   3.74    38.80   92K     7.60
Polling=1    64     1036.67    99.97   2.93   7.90   3.70    53.00   92K     6.78       -10.78
Polling=5    64     1079.27    99.97   3.07   7.59   3.73    83.00   90K     5.90       -22.35
Polling=7    64     1444.90    99.97   3.98   5.67   3.61    95.00  19.5K    7.41        -2.44
Polling=10   64     1521.70    99.97   4.21   5.38   3.63    98.00   8.5K    7.69         1.19
Polling=25   64     1534.24    99.97   4.18   5.34   3.57    99.00   8.5K    7.71         1.51
Polling=50   64     1534.24    99.97   4.18   5.34   3.57    99.00   8.5K    7.71         1.51

NoPolling   128     1577.39    99.97   4.09   5.19   3.40    54.00  113K    10.24
Polling=1   128     1596.08    99.97   4.22   5.13   3.47    71.00  120K     9.34        -8.88
Polling=5   128     2238.49    99.97   5.45   3.66   3.19    92.00   24K    11.66        13.82
Polling=7   128     2330.97    99.97   5.59   3.51   3.14    95.00  19.5K   11.96        16.70
Polling=10  128     2375.78    99.97   5.69   3.45   3.14    98.00   10K    12.00        17.14
Polling=25  128     2655.01    99.97   2.45   3.09   1.21    99.00   8.5K   13.34        30.25
Polling=50  128     2655.01    99.97   2.45   3.09   1.21    99.00   8.5K   13.34        30.25

NoPolling   256     2558.10    99.97   2.33   3.20   1.20    67.00  120K    15.32
Polling=1   256     2508.93    99.97   3.13   3.27   1.67    75.00  125K    14.34        -6.41
Polling=5   256     3740.34    99.97   2.70   2.19   0.95    94.00   17K    19.28        25.86
Polling=7   256     3692.69    99.97   2.80   2.22   0.99    97.00  15.5K   18.75        22.37
Polling=10  256     4036.60    99.97   2.69   2.03   0.87    99.00   8.5K   20.29        32.42
Polling=25  256     3998.89    99.97   2.64   2.05   0.87    99.00   8.5K   20.10        31.18
Polling=50  256     3998.89    99.97   2.64   2.05   0.87    99.00   8.5K   20.10        31.18

NoPolling   512     4531.50    99.90   2.75   1.81   0.79    78.00   55K    25.47
Polling=1   512     4684.19    99.95   2.69   1.75   0.75    83.00   35K    25.60         0.52
Polling=5   512     4932.65    99.75   2.75   1.68   0.74    91.00   12K    25.86         1.52
Polling=7   512     5226.14    99.86   2.80   1.57   0.70    95.00   7.5K   26.82         5.30
Polling=10  512     5464.90    99.60   2.90   1.49   0.70    96.00   8.2K   27.94         9.69
Polling=25  512     5550.44    99.58   2.84   1.47   0.67    99.00   7.5K   27.95         9.73
Polling=50  512     5550.44    99.58   2.84   1.47   0.67    99.00   7.5K   27.95         9.73

As you can see from the last column, polling improves performance in most cases.

I ran memcached (macro benchmark), where (as in the previous benchmark) the VM and vhost each get their own dedicated core. I configured memslap with C=128, T=8, as this configuration was required to produce enough load to saturate the VM. I tried several other configurations, but this one produced the maximal throughput (for the baseline).

The numbers for memcached (macro benchmark):

polling      time   TPS     Net    vhost  vm    exits   TPS/cpu  TPS/cpu
mode                        rate   util   util  /sec             % change
                                   %      %
-------------------------------------------------------------------------
Disabled     15.9s  125819   91.5   45     99    87K    873.74
polling=1    15.8s  126820   92.3   60     99    87K    797.61   -8.71
polling=5    12.8s  155799  113.4   79     99   25.5K   875.28    0.18
polling=10   11.7s  160639  116.9   83     99   16.3K   882.63    1.02
polling=15   12.4s  160897  117.2   87     99    15K    865.04   -1.00
polling=100  11.7s  170971  124.4   99     99    30     863.49   -1.17

For memcached, TPS/cpu does not show a significant difference in any of the cases. However, TPS numbers did improve by up to 35%, which can be useful for under-utilized systems that have cpu time to spare for extra throughput.

If it makes sense to you, I will continue with the other changes requested for the patch.
Thank you,
Razya
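For what it's worth, the TPS/cpu column in the memcached results is consistent with TPS divided by the combined vhost and VM cpu utilization. A quick check, assuming that definition (my inference, not stated in the thread), reproduces both the TPS/cpu and the % change columns:

```python
# Recompute TPS/cpu and its % change from the memcached rows,
# assuming TPS/cpu = TPS / (vhost util % + vm util %).
rows = [
    # (mode, TPS, vhost util %, vm util %)
    ("Disabled",    125819, 45, 99),
    ("polling=1",   126820, 60, 99),
    ("polling=5",   155799, 79, 99),
    ("polling=10",  160639, 83, 99),
    ("polling=15",  160897, 87, 99),
    ("polling=100", 170971, 99, 99),
]

baseline = rows[0][1] / (rows[0][2] + rows[0][3])  # 873.74 for "Disabled"
for mode, tps, vhost, vm in rows:
    per_cpu = tps / (vhost + vm)
    change = (per_cpu / baseline - 1) * 100
    print(f"{mode:12s} TPS/cpu={per_cpu:7.2f}  change={change:+6.2f}%")
```

Running this reproduces the reported values (e.g. 873.74 for Disabled, 797.61 and -8.71% for polling=1), which supports reading the utilization figures as vhost% and vm% of a core each.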