Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 11:49 AM, Raghavendra K T wrote:
On 09/25/2012 08:30 PM, Dor Laor wrote:
On 09/24/2012 02:02 PM, Raghavendra K T wrote:
On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge, I thought that a small test code should be developed to test this in a predictable, deterministic way.

The idea is to have a guest kernel module that spawns a new thread each time you write to a /sys/ entry. Each such thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks, *10 times the amount of vcpus. All the threads are running a

    while (1) {
        spin_lock(my_lock);
        sum += execute_dummy_cpu_computation(time);
        spin_unlock(my_lock);
        if (sys_tells_thread_to_die())
            break;
    }
    print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock).

By Cloning you mean hierarchy of the locks?

No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection (even for a non-pvticket lock version). For instance, let's say you have N threads trying to grab the lock; you can always make the ticket go linearly from 1->2...->N. Not sure it's a good idea, just a recommendation.

Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking

Yap

This way you can easily calculate:
1. the score of a single vcpu running a single thread
2. the score of the sum of all thread scores when #thread == #vcpu, all taking the same spin lock. The overall sum should be as close as possible to #1.
3. Like #2, but #threads > #vcpus, and other versions with #total vcpus (belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus, but let each thread have its own spin lock.
5. Like 4 + 2.

Hopefully this way will allow you to judge and evaluate the exact overhead of scheduling VMs and threads, since you have the ideal result in hand and you know what the threads are doing.

My 2 cents, Dor

Thank you, I think this is an excellent idea. (Though I am trying to put all the pieces together you mentioned.) So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in the guest. The only thing I am missing is how to generate different combinations of the lock. Okay, let me see if I can come up with a solid model for this.

Do you mean the various options for PLE/pvticket/other? I haven't thought of it and assumed it's static, but it can also be controlled through the temporary /sys interface.

No, I am not there yet. So in summary: we are suffering from inconsistent benchmark results while measuring the benefit of our improvements in PLE/pvlock etc. The good points from your suggestion are:
- Giving predictability to the workload that runs in the guest, so that we get an apples-to-apples comparison of the improvement.
- We can easily tune the workload via sysfs, and we can have scripts to automate it.

What is complicated is:
- How can we simulate a workload close to what we measure with benchmarks?
- How can we mimic lock-holding time / lock hierarchy close to the way it is seen with real workloads (e.g. a highly contended zone lru lock with a similar amount of lock-holding time)?

You can spin for a similar instruction count that you're interested in.

- How close would it be once we forget about other types of spinning (e.g. flush_tlb)?

So I feel it is not as trivial as it looks.

Indeed, this is mainly a tool that can serve to optimize a few synthetic workloads. I still believe it is worth going through this exercise, since a 100% predictable and controlled case can help us purely assess the state of the PLE and pvticket code. Otherwise we're dealing w/ too many parameters and assumptions at once.
Dor
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 02:02 PM, Raghavendra K T wrote:
On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge, I thought that a small test code should be developed to test this in a predictable, deterministic way.

The idea is to have a guest kernel module that spawns a new thread each time you write to a /sys/ entry. Each such thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks, *10 times the amount of vcpus. All the threads are running a

    while (1) {
        spin_lock(my_lock);
        sum += execute_dummy_cpu_computation(time);
        spin_unlock(my_lock);
        if (sys_tells_thread_to_die())
            break;
    }
    print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock).

By Cloning you mean hierarchy of the locks?

No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection (even for a non-pvticket lock version). For instance, let's say you have N threads trying to grab the lock; you can always make the ticket go linearly from 1->2...->N. Not sure it's a good idea, just a recommendation.

Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking

Yap

This way you can easily calculate:
1. the score of a single vcpu running a single thread
2. the score of the sum of all thread scores when #thread == #vcpu, all taking the same spin lock. The overall sum should be as close as possible to #1.
3. Like #2, but #threads > #vcpus, and other versions with #total vcpus (belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus, but let each thread have its own spin lock.
5. Like 4 + 2.

Hopefully this way will allow you to judge and evaluate the exact overhead of scheduling VMs and threads, since you have the ideal result in hand and you know what the threads are doing.

My 2 cents, Dor

Thank you, I think this is an excellent idea. (Though I am trying to put all the pieces together you mentioned.) So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in the guest. The only thing I am missing is how to generate different combinations of the lock. Okay, let me see if I can come up with a solid model for this.

Do you mean the various options for PLE/pvticket/other? I haven't thought of it and assumed it's static, but it can also be controlled through the temporary /sys interface.

Thanks for following up!
Dor
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
In order to help PLE and pvticketlock converge, I thought that a small test code should be developed to test this in a predictable, deterministic way.

The idea is to have a guest kernel module that spawns a new thread each time you write to a /sys/ entry. Each such thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks, *10 times the amount of vcpus. All the threads are running a

    while (1) {
        spin_lock(my_lock);
        sum += execute_dummy_cpu_computation(time);
        spin_unlock(my_lock);
        if (sys_tells_thread_to_die())
            break;
    }
    print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock).

This way you can easily calculate:
1. the score of a single vcpu running a single thread
2. the score of the sum of all thread scores when #thread == #vcpu, all taking the same spin lock. The overall sum should be as close as possible to #1.
3. Like #2, but #threads > #vcpus, and other versions with #total vcpus (belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus, but let each thread have its own spin lock.
5. Like 4 + 2.

Hopefully this way will allow you to judge and evaluate the exact overhead of scheduling VMs and threads, since you have the ideal result in hand and you know what the threads are doing.

My 2 cents, Dor

On 09/21/2012 08:36 PM, Raghavendra K T wrote:
On 09/21/2012 06:48 PM, Chegu Vinod wrote:
On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove very costly,

Yes.

because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU. An idea to solve this is: 1) As Avi had proposed, we can modify the hardware ple_window dynamically to avoid frequent PLE exits.

Yes. We had to do this to get around some scaling issues for large (>20-way) guests (with no overcommitment).

Do you mean you already have some solution tested for this?

As part of some experimentation we even tried "switching off" PLE too :(

Honestly, your experiment and Andrew Theurer's observations were the motivation for this patch.

(IMHO, it is difficult to decide when we have mixed types of VMs.)

Agree.

Not sure if the following alternatives have also been looked at:

- Could the behavior associated with the "ple_window" be modified to be a function of some [new] per-guest attribute (which can be conveyed to the host as part of the guest launch sequence)? The user can choose to set this [new] attribute for a given guest. This would help avoid the frequent exits due to PLE (as Avi had mentioned earlier)?

Ccing Drew also. We had a good discussion on this idea last time. (Sorry that I forgot to include it in the patch series.) May be a good idea when we know the load in advance..

- Can the PLE feature (in VT) be "enhanced" to be made a per-guest attribute? IMHO, the approach of not taking a frequent exit is better than taking an exit and returning back from the handler etc.

I entirely agree on this point (though I have not tried the above approaches). Hope to see more expert opinions pouring in.
Re: [RFC 0/2] virtio: provide a way for host to monitor critical events in the device
On 07/24/2012 03:30 PM, Sasha Levin wrote:
On 07/24/2012 10:26 AM, Dor Laor wrote:
On 07/24/2012 07:55 AM, Rusty Russell wrote:
On Mon, 23 Jul 2012 22:32:39 +0200, Sasha Levin wrote:

As it was discussed recently, there's currently no way for the guest to notify the host about panics. Furthermore, there's no reasonable way to notify the host of other critical events such as an OOM kill.

I clearly missed the discussion. Is this actually useful? In practice,

Admittedly this is not a killer feature..

won't you want the log from the guest? What makes a virtual guest different from a physical guest?

Most times a virt guest can do better than a physical OS. In that sense, this is where virtualization shines (live migration, hotplug for any virtual resource including net/block/cpu/memory/..). There are plenty of niche but worthwhile small features, such as the virtio-trace series and others, that allow the host/virt-mgmt to get more insight into the guest w/o a need to configure the guest. In theory a guest OOM can trigger a host memory hotplug action. Again, I don't see it as a key feature..

Guest watchdog functionality might be useful, but that's simpler to

There is already a fully emulated watchdog device in qemu.

There is, but why emulate physical devices when you can take advantage of virtio? You could say the same about the rest of the virtio family - "There is already a fully emulated NIC device in qemu".

The single issue virtio-nic solves is performance enhancements that can't be done w/ a fully emulated NIC. The reason is that such NICs tend to access pio/mmio space a lot, while virtio is designed for virtualization. A standard watchdog device (isn't it time you'll try qemu?) isn't about performance, and if that's all the functionality you need, it should work fine.

btw: check the virtio-trace series that was just sent in a parallel thread.
Cheers, Dor
Re: [RFC 0/2] virtio: provide a way for host to monitor critical events in the device
On 07/24/2012 07:55 AM, Rusty Russell wrote:
On Mon, 23 Jul 2012 22:32:39 +0200, Sasha Levin wrote:

As it was discussed recently, there's currently no way for the guest to notify the host about panics. Furthermore, there's no reasonable way to notify the host of other critical events such as an OOM kill.

I clearly missed the discussion. Is this actually useful? In practice,

Admittedly this is not a killer feature..

won't you want the log from the guest? What makes a virtual guest different from a physical guest?

Most times a virt guest can do better than a physical OS. In that sense, this is where virtualization shines (live migration, hotplug for any virtual resource including net/block/cpu/memory/..). There are plenty of niche but worthwhile small features, such as the virtio-trace series and others, that allow the host/virt-mgmt to get more insight into the guest w/o a need to configure the guest. In theory a guest OOM can trigger a host memory hotplug action. Again, I don't see it as a key feature..

Guest watchdog functionality might be useful, but that's simpler to implement via a virtio watchdog device, and more effective to implement via a host facility that actually pings guest functionality (rather than the kernel).

Cheers, Rusty.

There is already a fully emulated watchdog device in qemu.

Cheers, Dor
Re: [kvm-devel] Guest kernel hangs in smp kvm for older kernelsprior to tsc sync cleanup
Amit Shah wrote:
On Wednesday 19 December 2007 21:02:06 Glauber de Oliveira Costa wrote:
> On Dec 19, 2007 12:27 PM, Avi Kivity <[EMAIL PROTECTED]> wrote:
> > Ingo Molnar wrote:
> > > * Avi Kivity <[EMAIL PROTECTED]> wrote:
> > >> Avi Kivity wrote:
> > >>> Testing shows wrmsr and rdtsc function normally.
> > >>>
> > >>> I'll try pinning the vcpus to cpus and see if that helps.
> > >>
> > >> It does.
> > >
> > > do we let the guest read the physical CPU's TSC? That would be trouble.
> >
> > vmx (and svm) allow us to add an offset to the physical tsc. We set it
> > on startup to -tsc (so that an rdtsc on boot would return 0), and
> > massage it on vcpu migration so that guest rdtsc is monotonic.
> >
> > The net effect is that tsc on a vcpu can experience large forward jumps
> > and changes in rate, but no negative jumps.
>
> Changes in rate does not sound good. It's possibly what's screwing up
> my paravirt clock implementation in smp.

Do you mean in the case of VM migration, or just starting them on a single host?

It's the cpu preemption stuff on the local host, not VM migration.

> Since the host updates guest time prior to putting the vcpu to run, two
> vcpus that start running at different times will have different system
> values.
>
> Now if the vcpu that started running later probes the time first,
> we'll see the time going backwards. A constant tsc rate is the only way
> around the problem that my limited mind sees (besides, obviously, _not_
> making the system time per-vcpu).
Re: Performance overhead of get_cycles_sync
Ingo Molnar wrote: * Dor Laor <[EMAIL PROTECTED]> wrote: Here [include/asm-x86/tsc.h]: /* Like get_cycles, but make sure the CPU is synchronized. */ static __always_inline cycles_t get_cycles_sync(void) { unsigned long long ret; unsigned eax, edx; /* * Use RDTSCP if possible; it is guaranteed to be synchronous * and doesn't cause a VMEXIT on Hypervisors */ alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP, ASM_OUTPUT2("=a" (eax), "=d" (edx)), "a" (0U), "d" (0U) : "ecx", "memory"); ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax); if (ret) return ret; /* * Don't do an additional sync on CPUs where we know * RDTSC is already synchronous: */ //alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC, // "=a" (eax), "0" (1) : "ebx","ecx","edx","memory"); rdtscll(ret); The patch below should resolve this - could you please test and Ack it? It works, actually I already commented it out. Acked-by: Dor Laor <[EMAIL PROTECTED]> But this CPUID was present in v2.6.23 too, so why did it only show up in 2.6.24-rc for you? I tried to figure out but all the code movements for i386 go in the way. In the previous email I reported to Andi that Fedora kernel 2.6.23-8 did not suffer from it. Thanks for the ultra fast reply :) Dor Ingo --> Subject: x86: fix get_cycles_sync() overhead From: Ingo Molnar <[EMAIL PROTECTED]> get_cycles_sync() is causing massive overhead in KVM networking: http://lkml.org/lkml/2007/12/11/54 remove the explicit CPUID serialization - it causes VM exits and is pointless: we care about GTOD coherency but that goes to user-space via a syscall, and syscalls are serialization points anyway. 
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]> --- include/asm-x86/tsc.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux-x86.q/include/asm-x86/tsc.h === --- linux-x86.q.orig/include/asm-x86/tsc.h +++ linux-x86.q/include/asm-x86/tsc.h @@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl unsigned eax, edx; /* -* Use RDTSCP if possible; it is guaranteed to be synchronous -* and doesn't cause a VMEXIT on Hypervisors +* Use RDTSCP if possible; it is guaranteed to be synchronous +* and doesn't cause a VMEXIT on Hypervisors */ alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP, ASM_OUTPUT2("=a" (eax), "=d" (edx)), @@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl return ret; /* -* Don't do an additional sync on CPUs where we know -* RDTSC is already synchronous: +* Use RDTSC on other CPUs. This might not be fully synchronous, +* but it's not a problem: the only coherency we care about is +* the GTOD output to user-space, and syscalls are synchronization +* points anyway: */ - alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC, - "=a" (eax), "0" (1) : "ebx","ecx","edx","memory"); rdtscll(ret); return ret; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Performance overhead of get_cycles_sync
Andi Kleen wrote: [headers rewritten because of gmane crosspost breakage] In the latest kernel (2.6.24-rc3) I noticed a drastic performance decrease for KVM networking. That should not have changed for quite some time. Also it depends on the CPU of course. I didn't find the exact place of the change but using fedora 2.6.23-8 there is no problem. 3aefbe0746580a710d4392a884ac1e4aac7c728f turned X86_FEATURE_SYNC_RDTSC off for most Intel cpus, but it was committed in May. The reason is many vmexits (exit reason is the cpuid instruction) caused by calls to gettimeofday that use the tsc clocksource. read_tsc calls get_cycles_sync which might call cpuid in order to serialize the cpu. Can you explain why the cpu needs to be serialized for every gettime call? Otherwise RDTSC can be speculated around and happen outside the protection of the seqlock, and that can sometimes lead to non-monotonic time reporting. What about moving the result into memory and calling mb() instead? Anyways, after a lot of discussions it turns out there are ways to achieve this without CPUID and there is a solution implemented for this in the ff tree which I will submit for .25. It's a little complicated though and not a quick fix. Do we need to be that accurate? (It will also slightly improve physical hosts.) I believe you have a reason and the answer is yes. In that case can you replace the serializing instruction with an instruction that does not trigger a vmexit? Maybe use 'ltr' for example? ltr doesn't synchronize RDTSC. According to the Intel spec it is a serializing instruction along with cpuid and others. -Andi
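The speculation hazard Andi describes follows the usual seqlock read pattern: if the counter read escapes the sequence window, a reader can pair a stale counter with fresh base values and time appears to run backwards. A toy model (not the kernel's implementation; a compiler barrier stands in for the serializing CPUID/RDTSCP):

```c
#include <stdint.h>

/* Illustrative seqlock-style reader.  The counter field stands in for
 * the hardware TSC; the barriers mark where the real code needs RDTSC
 * pinned inside the sequence window. */
struct fake_clock {
	unsigned seq;		/* even = stable, odd = writer active */
	uint64_t base_ns;
	uint64_t counter;	/* stands in for the hardware TSC */
};

static uint64_t read_clock(struct fake_clock *c)
{
	unsigned seq;
	uint64_t base, cnt;

	do {
		seq = c->seq;
		__asm__ __volatile__("" ::: "memory"); /* no hoisting up */
		base = c->base_ns;
		cnt = c->counter; /* must stay inside the seq window */
		__asm__ __volatile__("" ::: "memory"); /* no sinking down */
	} while ((seq & 1) || seq != c->seq);

	return base + cnt;
}
```

The whole thread is about making that "pin the counter read" step cheap in a guest, where CPUID costs a vmexit.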
Re: Performance overhead of get_cycles_sync
Ingo Molnar wrote: * Dor Laor <[EMAIL PROTECTED]> wrote: > Hi Ingo, Thomas, > > In the latest kernel (2.6.24-rc3) I noticed a drastic performance > decrease for KVM networking. The reason is many vmexit (exit reason is > cpuid instruction) caused by calls to gettimeofday that uses tsc > sourceclock. read_tsc calls get_cycles_sync which might call cpuid in > order to serialize the cpu. > > Can you explain why the cpu needs to be serialized for every gettime > call? Do we need to be that accurate? (It will also slightly improve > physical hosts). I believe you have a reason and the answer is yes. In > that case can you replace the serializing instruction with an > instruction that does not trigger vmexit? Maybe use 'ltr' for example? hm, where exactly does it call CPUID? Ingo Here, commented out [include/asm-x86/tsc.h]: /* Like get_cycles, but make sure the CPU is synchronized. */ static __always_inline cycles_t get_cycles_sync(void) { unsigned long long ret; unsigned eax, edx; /* * Use RDTSCP if possible; it is guaranteed to be synchronous * and doesn't cause a VMEXIT on Hypervisors */ alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP, ASM_OUTPUT2("=a" (eax), "=d" (edx)), "a" (0U), "d" (0U) : "ecx", "memory"); ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax); if (ret) return ret; /* * Don't do an additional sync on CPUs where we know * RDTSC is already synchronous: */ //alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC, // "=a" (eax), "0" (1) : "ebx","ecx","edx","memory"); rdtscll(ret); return ret; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Performance overhead of get_cycles_sync
Hi Ingo, Thomas, In the latest kernel (2.6.24-rc3) I noticed a drastic performance decrease for KVM networking. The reason is many vmexits (exit reason is the cpuid instruction) caused by calls to gettimeofday that use the tsc clocksource. read_tsc calls get_cycles_sync which might call cpuid in order to serialize the cpu. Can you explain why the cpu needs to be serialized for every gettime call? Do we need to be that accurate? (It will also slightly improve physical hosts.) I believe you have a reason and the answer is yes. In that case can you replace the serializing instruction with an instruction that does not trigger a vmexit? Maybe use 'ltr' for example? Regards, Dor.
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. While it's a little premature, we can start thinking of irq path improvements. The current patch acks a private isr and afterwards apic eoi will also be hit since it's a level-triggered irq. This means 2 vmexits per irq. We can start with regular pci irqs and move afterwards to msi. Some other ugly hack options [we'd better use msi]: - Read the eoi directly from apic and save the first private isr ack - Convert the specific irq line to edge triggered and don't share it What do you guys think? +/* A small wrapper to also acknowledge the interrupt when it's handled. + * I really need an EIO hook for the vring so I can ack the interrupt once we + * know that we'll be handling the IRQ but before we invoke the callback since + * the callback may notify the host which results in the host attempting to + * raise an interrupt that we would then mask once we acknowledged the + * interrupt. */ +static irqreturn_t vp_interrupt(int irq, void *opaque) +{ + struct virtio_pci_device *vp_dev = opaque; + struct virtio_pci_vq_info *info; + irqreturn_t ret = IRQ_NONE; + u8 isr; + + /* reading the ISR has the effect of also clearing it so it's very +* important to save off the value. */ + isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); + + /* It's definitely not us if the ISR was not high */ + if (!isr) + return IRQ_NONE; + + spin_lock(&vp_dev->lock); + list_for_each_entry(info, &vp_dev->virtqueues, node) { + if (vring_interrupt(irq, info->vq) == IRQ_HANDLED) + ret = IRQ_HANDLED; + } + spin_unlock(&vp_dev->lock); + + return ret; +}
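The read-to-clear behaviour the comment warns about can be modelled in a few lines of host-side C. This is a toy model with invented names, not the virtio code itself; it only mirrors the shape of vp_interrupt() above.

```c
#include <stdint.h>

/* Toy model of the VIRTIO_PCI_ISR semantics: the ISR register clears
 * itself when read, so the handler must save the first read's value. */
struct toy_isr {
	uint8_t val;
};

static uint8_t isr_read(struct toy_isr *r)
{
	uint8_t v = r->val;	/* hardware clears the register on read */
	r->val = 0;
	return v;
}

/* Same early-out shape as vp_interrupt(): on a shared level-triggered
 * line, a zero ISR means the interrupt was raised for another device. */
static int toy_interrupt(struct toy_isr *r)
{
	uint8_t isr = isr_read(r);	/* save it; a second read sees 0 */

	if (!isr)
		return 0;	/* IRQ_NONE */
	/* ... walk the virtqueues here, as the real handler does ... */
	return 1;		/* IRQ_HANDLED */
}
```

The second of Dor's two vmexits is the APIC EOI that follows this handler; MSI would remove the shared level-triggered line and with it the need for the ISR read altogether.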
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Anthony Liguori wrote: Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. Didn't see support for dma. Not sure what you're expecting there. Using dma_ops in virtio_ring? I think that with Amit's pvdma patches you can support dma-capable devices as well without too much fuss. What is the use case you're thinking of? A semi-paravirt driver that does dma directly to a device? Regards, Anthony Liguori You would also lose performance since pv-dma will trigger an exit for each virtio io while virtio kicks the hypervisor after several IOs were queued. - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ ___ kvm-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/kvm-devel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. While it's a little premature, we can start thinking of irq path improvements. The current patch acks a private isr and afterwards apic eoi will also be hit since its a level trig irq. This means 2 vmexits per irq. We can start with regular pci irqs and move afterwards to msi. Some other ugly hack options [we're better use msi]: - Read the eoi directly from apic and save the first private isr ack - Convert the specific irq line to edge triggered and dont share it What do you guys think? +/* A small wrapper to also acknowledge the interrupt when it's handled. + * I really need an EIO hook for the vring so I can ack the interrupt once we + * know that we'll be handling the IRQ but before we invoke the callback since + * the callback may notify the host which results in the host attempting to + * raise an interrupt that we would then mask once we acknowledged the + * interrupt. */ +static irqreturn_t vp_interrupt(int irq, void *opaque) +{ + struct virtio_pci_device *vp_dev = opaque; + struct virtio_pci_vq_info *info; + irqreturn_t ret = IRQ_NONE; + u8 isr; + + /* reading the ISR has the effect of also clearing it so it's very +* important to save off the value. */ + isr = ioread8(vp_dev-ioaddr + VIRTIO_PCI_ISR); + + /* It's definitely not us if the ISR was not high */ + if (!isr) + return IRQ_NONE; + + spin_lock(vp_dev-lock); + list_for_each_entry(info, vp_dev-virtqueues, node) { + if (vring_interrupt(irq, info-vq) == IRQ_HANDLED) + ret = IRQ_HANDLED; + } + spin_unlock(vp_dev-lock); + + return ret; +} - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] 2.6.23.1-rt4 and kvm 48
David Brown wrote: Uhm, not sure who to send this to... I thought I'd try out the realtime patch set and it didn't work at all with kvm. The console didn't dump anything and the system completely locked up. Anyone have any suggestions as to how to get more output on this issue? It got to the point of bringing up the tap interface and attaching it to the bridge but that was about it for the console messages. Thanks, - David Brown I tried to recreate your problem using 2.6.23-1 and the latest rt patch (rt5). The problem is that the kernel is not stable at all, I can't even compile the code over vnc - my connection is constantly lost. So it might not be a kvm problem? Can you try it with -no-kvm and see if it's working - then it's just a regular userspace process. Anyway, if all other things are stable on your end, can you send us dmesg/strace outputs? Also try without the good -no-kvm-irqchip. Regards, Dor.
RE: [Lguest] [V9fs-developer] [kvm-devel] [RFC] 9p: add KVM/QEMU pci transport
My current view of the IO stack is the following: -- -- -- -- - |NET_PCI_BACK| |BLK_PCI_BACK| |9P_PCI_BACK| |NET_FRONT| |BLK_FRONT| |9P_FRONT| -- -- -- -- - - --- --- |KVM_PCI_BUS| |hypercall_ops| |shared_mem_virtio| - --- --- So the 9P implementation should add the front end logic and the p9_pci_backend that glues the shared_memory, pci_bus and hypercalls together. >That's also in our plans. There was no virtio support in KVM when I >started working in the transport. > >Thanks, >Lucho > >On 8/29/07, Anthony Liguori <[EMAIL PROTECTED]> wrote: >> I think that it would be nicer to implement the p9 transport on top of >> virtio instead of directly on top of PCI. I think your PCI transport >> would make a pretty nice start of a PCI virtio transport though. >> >> Regards, >> >> Anthony Liguori >> >> On Tue, 2007-08-28 at 13:52 -0500, Eric Van Hensbergen wrote: >> > From: Latchesar Ionkov <[EMAIL PROTECTED]> >> > >> > This adds a shared memory transport for a synthetic 9p device for >> > paravirtualized file system support under KVM/QEMU. >> > >> > Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]> >> > Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]> >> > --- >> > Documentation/filesystems/9p.txt |2 + >> > net/9p/Kconfig | 10 ++- >> > net/9p/Makefile |4 + >> > net/9p/trans_pci.c | 295 >++ >> > 4 files changed, 310 insertions(+), 1 deletions(-) >> > create mode 100644 net/9p/trans_pci.c >> > >> > diff --git a/Documentation/filesystems/9p.txt >b/Documentation/filesystems/9p.txt >> > index 1a5f50d..e1879bd 100644 >> > --- a/Documentation/filesystems/9p.txt >> > +++ b/Documentation/filesystems/9p.txt >> > @@ -46,6 +46,8 @@ OPTIONS >> > tcp - specifying a normal TCP/IP connection >> > fd - used passed file descriptors for >connection >> > (see rfdno and wfdno) >> > + pci - use a PCI pseudo device for 9p >communication >> > + over shared memory between a guest and >host >> > >> >uname=name user name to attempt mount as on the remote server. 
>The >> > server may override or ignore this value. Certain >user >> > diff --git a/net/9p/Kconfig b/net/9p/Kconfig >> > index 09566ae..8517560 100644 >> > --- a/net/9p/Kconfig >> > +++ b/net/9p/Kconfig >> > @@ -16,13 +16,21 @@ menuconfig NET_9P >> > config NET_9P_FD >> > depends on NET_9P >> > default y if NET_9P >> > - tristate "9P File Descriptor Transports (Experimental)" >> > + tristate "9p File Descriptor Transports (Experimental)" >> > help >> > This builds support for file descriptor transports for 9p >> > which includes support for TCP/IP, named pipes, or passed >> > file descriptors. TCP/IP is the default transport for 9p, >> > so if you are going to use 9p, you'll likely want this. >> > >> > +config NET_9P_PCI >> > + depends on NET_9P >> > + tristate "9p PCI Shared Memory Transport (Experimental)" >> > + help >> > + This builds support for a PCI psuedo-device currently >available >> > + under KVM/QEMU which allows for 9p transactions over shared >> > + memory between the guest and the host. >> > + >> > config NET_9P_DEBUG >> > bool "Debug information" >> > depends on NET_9P >> > diff --git a/net/9p/Makefile b/net/9p/Makefile >> > index 7b2a67a..26ce89d 100644 >> > --- a/net/9p/Makefile >> > +++ b/net/9p/Makefile >> > @@ -1,5 +1,6 @@ >> > obj-$(CONFIG_NET_9P) := 9pnet.o >> > obj-$(CONFIG_NET_9P_FD) += 9pnet_fd.o >> > +obj-$(CONFIG_NET_9P_PCI) += 9pnet_pci.o >> > >> > 9pnet-objs := \ >> > mod.o \ >> > @@ -14,3 +15,6 @@ obj-$(CONFIG_NET_9P_FD) += 9pnet_fd.o >> > >> > 9pnet_fd-objs := \ >> > trans_fd.o \ >> > + >> > +9pnet_pci-objs := \ >> > + trans_pci.o \ >> > diff --git a/net/9p/trans_pci.c b/net/9p/trans_pci.c >> > new file mode 100644 >> > index 000..36ddc5f >> > --- /dev/null >> > +++ b/net/9p/trans_pci.c >> > @@ -0,0 +1,295 @@ >> > +/* >> > + * net/9p/trans_pci.c >> > + * >> > + * 9P over PCI transport layer. For use with KVM/QEMU. 
>> > + * >> > + * Copyright (C) 2007 by Latchesar Ionkov <[EMAIL PROTECTED]> >> > + * >> > + * This program is free software; you can redistribute it and/or >modify >> > + * it under the terms of the GNU General Public License version 2 >> > + * as published by the Free Software Foundation. >> > + * >> > + * This program is distributed in the hope that it will be useful, >> > + * but WITHOUT ANY WARRANTY; without even the implied
RE: [Lguest] [kvm-devel] [RFC] 9p: add KVM/QEMU pci transport
>> >> Nice driver.  I'm hoping we can do a virtio driver using a similar
>> >> concept.
>> >>
>> >> > +#define PCI_VENDOR_ID_9P 0x5002
>> >> > +#define PCI_DEVICE_ID_9P 0x000D
>> >>
>> >> Where do these numbers come from? Can we be sure they don't conflict
>> >> with actual hardware?
>>
>> I stole the VENDOR_ID from kvm's hypercall driver. There is no
>> guarantee that it doesn't conflict with actual hardware. As was
>> discussed before, there is still no ID assigned for virtual devices.

Currently 5002 is not registered to Qumranet nor KVM. We will do
something about it pretty soon.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [kvm-devel] [RFC] 9p: add KVM/QEMU pci transport
>> > This adds a shared memory transport for a synthetic 9p device for
>> > paravirtualized file system support under KVM/QEMU.
>>
>> Nice driver.  I'm hoping we can do a virtio driver using a similar
>> concept.
>
> Yes. I'm looking at the patches from Dor now, it should be pretty
> straightforward. The PCI is interesting in its own right for other
> (non-virtual) projects we've been playing with.
>
>  -eric

Great, we can add lots of pci bus shared functionality into the
kvm_pci_bus.c

--Dor
RE: [kvm-devel] [RFC] Deferred interrupt handling.
>> >> Guest0 - blocked on I/O
>> >>
>> >> IRQ14 from your hardware
>> >> Block IRQ14
>> >> Sent to guest (guest is blocked)
>> >>
>> >> IRQ14 from hard disk
>> >> Ignored (as blocked)
>>
>> But now the timer will pop and the hard disk will get its irq.
>> The guest will be released right after.
>
> How do you plan to do this ? If you unmask the interrupt then it will
> immediately jam solid with IRQs from your hardware and the line will be
> disabled.

I hope it should work like the following [please correct me if I'm wrong]:
- Make the device the last irqaction in the list.
- Our dummy handler will always return IRQ_HANDLED in case no previous
  irqaction returned it. In that case it will also arm the timer and
  mask the irq. The line is temporarily jammed but not disabled:
  note_interrupt() will not consider our irq unhandled and won't disable
  it. Btw, if I'm not mistaken, only after bad 99900/10 is the irq
  disabled.
- If the timer pops before the guest acks the irq, the timer handler
  will ack the irq and unmask it. The timer's job is only to prevent
  deadlocks.

Maybe it's better to code it first and then send an RFC. I wanted to get
feedback beforehand, to hear opinions and to know whether to use the
deferred option or the irq polarity option. Both of them can lead to the
above deadlock without the timer hack.

Best regards,
Dor.
RE: [kvm-devel] [RFC] Deferred interrupt handling.
> Alan Cox wrote:
>>> What if we will force the specific device to the end of the list.
>>> Once IRQ_NONE was returned by the other devices, we will mask the
>>> irq, forward the irq to the guest, issue a timer for 1msec.
>>> Motivation: 1msec is long enough for the guest to ack the irq + host
>>> unmask the irq
>>
>> It makes no difference. The deadlock isn't fixable by timing hacks.
>> Consider the following sequence:
>>
>> Guest0 - blocked on I/O
>>
>> IRQ14 from your hardware
>> Block IRQ14
>> Sent to guest (guest is blocked)
>>
>> IRQ14 from hard disk
>> Ignored (as blocked)

But now the timer will pop and the hard disk will get its irq. The
guest will be released right after.

>> Deadlock
>
> IMO the only reasonable solution is to disallow interrupt forwarding
> with shared irqs. If someone later comes up with a bright idea, we can
> implement it. Otherwise the problem will solve itself with hardware
> moving to msi.

I thought of that, but the problem is that we'd like to use it with
current hardware devices that are shared. :(
RE: [kvm-devel] [RFC] Deferred interrupt handling.
>> In particular, this requires interrupt handling to be done by the
>> guest -- the host shouldn't load the corresponding device driver or
>> otherwise access the device. Since the host kernel is not aware of
>> the device semantics it cannot acknowledge the interrupt at the
>> device level.
>
> Tricky indeed.
>
>> As far as the host kernel is concerned the VM is a user level
>> process. We require the ability to forward interrupt handling to such
>> entities. The current kernel interrupt handling path doesn't allow
>> deferring interrupt handling _and_ acknowledgement.
>
> We don't support this model at all, and it doesn't appear to work
> anyway.
>
>> 0. Adding an IRQ_DEFERRED mechanism to the interrupt handling path.
>> ISRs returning IRQ_DEFERRED will keep the interrupt masked until a
>> future acknowledge.
>
> Deadlock. If you get an IRQ for a guest and you block the IRQ until
> the guest handles it you may (eg if the IRQ is shared) get priority
> inversion with another interrupt source on the same line the guest
> requires first (eg disks and other I/O).

What if we force the specific device to the end of the list? Once
IRQ_NONE was returned by the other devices, we mask the irq, forward the
irq to the guest, and issue a timer for 1msec. Motivation: 1msec is long
enough for the guest to ack the irq + host unmask the irq + cancel the
timer (ping round-trip for a guest is about 100msec). If the timer
pops, it will unmask irqs + run over the device list to check whether
one of them has a pending irq. This solves the deadlock possibility at
the small price of potential latency.

...

>> Any ideas ? Thoughts ?
>
> Mask the interrupt in the main kernel, pass an event of some kind to
> the guest. You can describe most devices from guest to kernel in a
> safe form as
>
>   device, bar, offset, register size, mask, bits to set, bits to clear
>
> (or bits to test when deciding if it is the irq source)

The problem is that each device has its own bits, so it cannot be a
general solution; also, the device driver inside the guest would have to
be changed because the host already disabled the irq/status for them.

I know the above solution is not neat but we do want to contribute it.
Any other ideas are welcome,

10x, Dor.
RE: [kvm-devel] [ANNOUNCE] kvm-15 release
> Avi Kivity wrote:
>> - new userspace interface (work in progress)
>
> kvmfs in kvm-15 kernel code does not build with older kernels (2.6.16
> fails, 2.6.18 works ok), looks like the reason is some changes in
> superblock handling.
>
> Do you intend to fix that?

Did you run the "make sync" under the svn kernel directory? It uses sed
to replace f_path.dentry with the backward compatible f_dentry.

> cheers,
>   Gerd
>
> --
> Gerd Hoffmann <[EMAIL PROTECTED]>

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share
your opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php=sourceforge=DEVD EV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
RE: [kvm-devel] [PATCH 10/13] KVM: Wire up hypercall handlers toa central arch-independent location
>> Something else that came up in a conversation with Dor: the need for
>> a clean way to raise a guest interrupt. The guest may be sleeping in
>> userspace, scheduled out, or running on another cpu (and requiring an
>> ipi to get it out of guest mode).
>
> yeah it'd be nice if I could just call a function for it rather than
> poking into kvm internals ;)
>
>> Right now I'm thinking about using the signal machinery since it
>> appears to do exactly the right thing.
>
> signals are *expensive* though.
>
> If you design an interrupt interface, it'd rock if you could make it
> such that it is "raise interrupt within x milliseconds from now",
> rather than making it mostly synchronous. That way irq mitigation
> becomes part of the interface rather than having to duplicate it all
> over the virtual drivers...

Why do you need to raise an interrupt within a timeout? I thought of
just asking for a synchronous, as-fast-as-you-can-get interrupt. If you
need an interrupt that should pop within some milliseconds you can set a
timer.

> --
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via
> http://www.linuxfirmwarekit.org
RE: [kvm-devel] [PATCH 10/13] KVM: Wire up hypercall handlers to a central arch-independent location
> Pavel Machek wrote:
>> On Mon 2007-02-19 10:30:52, Avi Kivity wrote:
>>> Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>
>>
>> changelog?
>
> Well, I can't think of anything to add beyond $subject. The patch adds
> calls from the arch-dependent hypercall handlers to a new
> arch-independent function.
>
>>> +	switch (nr) {
>>> +	default:
>>> +		;
>>> +	}
>>
>> Eh?
>
> No hypercalls defined yet.

I have Ingo's network PV hypercalls to commit in my pipeline. Till then
we can just add the test hypercall:

	case __NR_hypercall_test:
		printk(KERN_DEBUG "%s __NR_hypercall_test\n", __FUNCTION__);
		ret = 0x5a5a;
		break;
	default:
		BUG();

> --
> error compiling committee.c: too many arguments to function
RE: [PATCH 2.6.20] KVM: Use ARRAY_SIZE macro when appropriate
> Hi all,
>
> A patch to use the ARRAY_SIZE macro already defined in kernel.h.
>
> Signed-off-by: Ahmed S. Darwish <[EMAIL PROTECTED]>

Applied, 10x
RE: [kvm-devel] kvm & dyntick
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
>>> dyntick-enabled guest:
>>> - reduce the load on the host when the guest is idling
>>>   (currently an idle guest consumes a few percent cpu)
>>
>> yeah. KVM under -rt already works with dynticks enabled on both the
>> host and the guest. (but it's more optimal to use a dedicated
>> hypercall to set the next guest-interrupt)
>
> using the dynticks code from the -rt kernel makes the overhead of an
> idle guest go down by a factor of 10-15:
>
>   PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2556 mingo  15   0  598m 159m 157m R  1.5  8.0  0:26.20  qemu
>
> ( for this to work on my system i have added a 'hyper' clocksource
>   hypercall API for KVM guests to use - this is needed instead of the
>   running-too-slowly TSC. )
>
> 	Ingo

This is great news for PV guests. Nevertheless we still need to improve
our fully virtualized guest support.

First we need a mechanism (can we use the timeout_granularity?) to
dynamically change the host timer frequency, so we can support guests
with 100hz that dynamically change their freq to 1000hz and back.
Afterwards we'll need to compensate for the lost alarm signals to the
guests by using one of:
- hrtimers to inject the lost interrupts for specific guests. The
  problem is this will increase the overall load.
- Injecting several virtual irqs to the guests one after another (using
  interrupt window exits). The question is how the guest will be
  affected by this unfair behavior.

Can dyntick help HVMs? Will the answer be the same for guest-dense
hosts? I understood that the main gain of dyntick is for idle time.
RE: open /dev/kvm: No such file or directory
> On linux-2.6.20-rc2, "modprobe kvm-intel" loaded the module
> successfully, but running qemu returns an error ...
>
> /usr/local/kvm/bin/qemu -hda vdisk.img -cdrom cd.iso -boot d -m 128
> open /dev/kvm: No such file or directory
> Could not initialize KVM, will disable KVM support

Are you sure the kvm_intel & kvm modules are loaded? Maybe your bios
does not support virtualization. Please check your dmesg.

> /dev/kvm does not exist, should I create this before running qemu?
> If so, what are the parameters to "mknod"?

It's a dynamic misc device, you don't need to create it.

> Thanks,
> Jeff.