Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 19:23, Michael S. Tsirkin wrote:
> On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote:
> On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote:
> On 2011-10-24 18:05, Michael S. Tsirkin wrote:
> This is what I have in mind:
> - devices set PBA bit if MSI message cannot be sent due to mask (*)
> - core checks/clears PBA bit on unmask, injects message if bit was set
> - devices clear PBA bit if message reason is resolved before unmask (*)
>
> OK, but practically, when exactly does the device clear PBA?
>
> Consider a network adapter that signals messages in a RX ring: If the
> corresponding vector is masked while the guest empties the ring, I
> strongly assume that the device is supposed to take back the pending
> bit in that case, so that there is no interrupt injection on a later
> vector unmask operation.
>
> Jan
>
> Do you mean virtio here?

Maybe, but I'm also thinking of fully emulated devices.

> Do you expect this optimization to give a significant performance gain?

Hard to assess in general. But I have a silly guest here that obviously
masks MSI vectors for each event. This currently not only kicks us into
a heavy-weight exit, it also enforces serialization on
qemu_global_mutex (while we have the rest already isolated).

> It would also be challenging to implement this in a race free manner.
> Clearing on interrupt status read seems straight-forward.

With an in-kernel MSI-X MMIO handler, this race will be naturally
unavoidable as there is no more global lock shared between table/PBA
accesses and the device model. But, when using atomic bit ops, I don't
think that will cause headaches.

Jan
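To make the lock-free handling Jan alludes to concrete: a minimal,
kernel-style sketch of the three PBA operations from the quoted
protocol, assuming a PBA bitmap reachable by both the in-kernel MSI-X
core and the device model. Function names are illustrative, not from
any patch in this thread.

/* Device: message suppressed because the vector is masked. */
#include <linux/bitops.h>
#include <linux/types.h>

static void msix_pba_set(unsigned long *pba, unsigned int vector)
{
	set_bit(vector, pba);		/* atomic, no global lock needed */
}

/* Device: interrupt reason resolved before the guest unmasked. */
static void msix_pba_withdraw(unsigned long *pba, unsigned int vector)
{
	clear_bit(vector, pba);
}

/* Core, on unmask: consume the bit atomically so device and core
 * cannot both act on it; true means "inject the MSI message now". */
static bool msix_pba_test_on_unmask(unsigned long *pba, unsigned int vector)
{
	return test_and_clear_bit(vector, pba);
}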
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 10/24/2011 02:06 PM, Jan Kiszka wrote:
> On 2011-10-24 13:09, Avi Kivity wrote:
> On 10/24/2011 12:19 PM, Jan Kiszka wrote:
> With the new feature it may be worthwhile, but I'd like to see the
> whole thing, with numbers attached.
>
> It's not a performance issue, it's a resource limitation issue: With
> the new API we can stop worrying about user space device models
> consuming limited IRQ routes of the KVM subsystem.
>
> Only if those devices are in the same process (or have access to the
> vmfd). Interrupt routing together with irqfd allows you to
> disaggregate the device model. Instead of providing a competing
> implementation with new limitations, we need to remove the
> limitations of the old implementation.
>
> That depends on where we do the cut. Currently we let the IRQ source
> signal an abstract edge on a pre-allocated pseudo IRQ line. But we
> cannot build correct MSI-X on top of the current irqfd model as we
> lack the level information (for PBA emulation). *) So we either need
> to extend the existing model anyway -- or push per-vector masking
> back to the IRQ source. In the latter case, it would be a very good
> chance to give up on limited pseudo GSIs with static routes and do
> MSI messaging from external IRQ sources to KVM directly.

Good point.

> But all those considerations affect different APIs than what I'm
> proposing here. We will always need a way to inject MSIs in the
> context of the VM, as there will always be scenarios where devices
> are better run in that very same context, for performance or
> simplicity or whatever reasons. E.g., I could imagine that one would
> like to execute an emulated IRQ remapper in the hypervisor context
> rather than over-microkernelized in a separate process.

Right.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Tue, Oct 25, 2011 at 09:24:17AM +0200, Jan Kiszka wrote:
> [...]
> Maybe, but I'm also thinking of fully emulated devices.

One thing seems certain: actual, assigned devices don't have this fake
msi-x level, so they don't notify the host when that changes.

> Hard to assess in general. But I have a silly guest here that
> obviously masks MSI vectors for each event. This currently not only
> kicks us into a heavy-weight exit, it also enforces serialization on
> qemu_global_mutex (while we have the rest already isolated).

It's easy to see how MSI-X mask support in the kernel would help. Not
sure whether it's worth it to also add special APIs to reduce the
number of spurious interrupts for such silly guests.

> With an in-kernel MSI-X MMIO handler, this race will be naturally
> unavoidable as there is no more global lock shared between table/PBA
> accesses and the device model. But, when using atomic bit ops, I
> don't think that will cause headaches.
>
> Jan

This is not the race I meant. The challenge is for the device to
determine that it can clear the PBA. Atomic accesses on the PBA won't
help here, I think.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-25 13:20, Michael S. Tsirkin wrote:
> [...]
> One thing seems certain: actual, assigned devices don't have this
> fake msi-x level, so they don't notify the host when that changes.

But they have a real PBA. We just need to replicate the emulated
vector mask state into the real hw. Doesn't this happen anyway when we
disable the IRQ on the host? If not, that may require a bit more work,
maybe a special masking mode that can be requested by the managing
backend of an assigned device from the MSI-X in-kernel service.

> It's easy to see how MSI-X mask support in the kernel would help.
> Not sure whether it's worth it to also add special APIs to reduce
> the number of spurious interrupts for such silly guests.

I do not get the latter point. What could be simplified (without
making it incorrect) when ignoring excessive mask accesses? Also, if
sane guests do not access the mask that frequently, why was in-kernel
MSI-X MMIO proposed at all?

> This is not the race I meant. The challenge is for the device to
> determine that it can clear the PBA. Atomic accesses on the PBA
> won't help here, I think.

The device knows best if the interrupt reason persists. It can
synchronize MSI assertion and PBA bit clearance. If it clears too
late, then this reflects what may happen on real hw as well when host
and device race for changing vector mask vs. device state. It's not
stated that those changes need to be serialized inside the device, is
it?

Jan
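The "special masking mode" Jan mentions would boil down to mirroring
the guest's per-vector mask bit into the physical device's MSI-X
table. A minimal sketch with entry offsets per the PCI spec; the
function name and the already-mapped table pointer are assumptions for
illustration, not code from any patch in this thread.

#include <linux/io.h>
#include <linux/types.h>

#define MSIX_ENTRY_SIZE			16
#define MSIX_ENTRY_VECTOR_CTRL		12
#define MSIX_ENTRY_CTRL_MASKBIT		1

/* Propagate the emulated mask state of one vector to the real
 * MSI-X table of the assigned device. */
static void hw_msix_set_mask(void __iomem *msix_table,
			     unsigned int vector, bool mask)
{
	void __iomem *ctrl = msix_table + vector * MSIX_ENTRY_SIZE
			   + MSIX_ENTRY_VECTOR_CTRL;
	u32 val = readl(ctrl);

	if (mask)
		val |= MSIX_ENTRY_CTRL_MASKBIT;
	else
		val &= ~MSIX_ENTRY_CTRL_MASKBIT;
	writel(val, ctrl);	/* hw now latches pending bits in its PBA */
}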
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Tue, Oct 25, 2011 at 01:41:39PM +0200, Jan Kiszka wrote:
> [...]
> But they have a real PBA. We just need to replicate the emulated
> vector mask state into the real hw. Doesn't this happen anyway when
> we disable the IRQ on the host?

Not immediately, I think.

> If not, that may require a bit more work, maybe a special masking
> mode that can be requested by the managing backend of an assigned
> device from the MSI-X in-kernel service.

True. OTOH this might have a cost (extra mmio) for the doubtful
benefit of making PBA values exact.

> I do not get the latter point. What could be simplified (without
> making it incorrect) when ignoring excessive mask accesses?

Clearing the PBA when we detect an empty ring in the host is not
required, IMO. It's an optimization.

> Also, if sane guests do not access the mask that frequently, why was
> in-kernel MSI-X MMIO proposed at all?

Apparently whether mask accesses happen a lot depends on the workload.

> The device knows best if the interrupt reason persists.

It might not know this unless notified by the driver. E.g. virtio
drivers currently don't do interrupt status reads.

> It can synchronize MSI assertion and PBA bit clearance. If it clears
> too late, then this reflects what may happen on real hw as well when
> host and device race for changing vector mask vs. device state. It's
> not stated that those changes need to be serialized inside the
> device, is it?
>
> Jan

Talking about emulated devices? It's not sure that real hardware
clears the PBA. Considering that no guests I know of use the PBA ATM,
I would not be surprised if many devices had broken PBA support.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-25 14:05, Michael S. Tsirkin wrote:
> [...]
> True. OTOH this might have a cost (extra mmio) for the doubtful
> benefit of making PBA values exact.

I think correctness comes before performance unless the latter hurts
significantly.

> Clearing the PBA when we detect an empty ring in the host is not
> required, IMO. It's an optimization.

For virtio that might be true - as we are free to define the device
behaviour to our benefit. What emulated real devices do is another
thing.

> It might not know this unless notified by the driver. E.g. virtio
> drivers currently don't do interrupt status reads.

Talking about real devices, they obviously know as they maintain the
hardware state.

> Talking about emulated devices? It's not sure that real hardware
> clears the PBA. Considering that no guests I know of use the PBA
> ATM, I would not be surprised if many devices had broken PBA
> support.

OK, if there are no conforming MSI-X devices out there, then we can
forget about all the PBA maintenance beyond "set if message hit mask,
cleared again on unmask". But I doubt that this is generally true.

Jan
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Tue, Oct 25, 2011 at 02:21:01PM +0200, Jan Kiszka wrote:
> [...]
> For virtio that might be true - as we are free to define the device
> behaviour to our benefit. What emulated real devices do is another
> thing.

Anything specific in mind?

> Talking about real devices, they obviously know as they maintain the
> hardware state.

Not necessarily. It's quite common to keep the ring in coherent memory
allocated by the driver, not within the device; the state is then
maintained by driver and device together.

> OK, if there are no conforming MSI-X devices out there,

Oh, I'm guessing some devices are conforming :)

> then we can forget about all the PBA maintenance beyond "set if
> message hit mask, cleared again on unmask". But I doubt that this is
> generally true.
>
> Jan

We seem to get by basically with what you describe, but I'm not saying
it's perfect, just that it's hard to make it perfect.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 10/21/2011 11:19 AM, Jan Kiszka wrote:
> Currently, MSI messages can only be injected into in-kernel irqchips
> by defining a corresponding IRQ route for each message. This is not
> only unhandy if the MSI messages are generated on the fly by user
> space; IRQ routes are also a limited resource that user space has to
> manage carefully.

By itself, this does not provide enough value to offset the cost of a
new ABI, especially as userspace will need to continue supporting the
old method for a very long while.

> By providing a direct injection path, we can both avoid using up
> limited resources and simplify the necessary steps for user land.
> The API already provides a channel (flags) to revoke an injected but
> not yet delivered message, which will become important for in-kernel
> MSI-X vector masking support.

With the new feature it may be worthwhile, but I'd like to see the
whole thing, with numbers attached.
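For orientation: the interface under discussion eventually landed in
mainline as KVM_SIGNAL_MSI. A minimal user-space sketch of such a
direct injection call; the ioctl and field names follow the merged API
and may differ from this RFC.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Deliver one MSI message to the in-kernel irqchip without
 * allocating a GSI route for it first. */
static int inject_msi(int vmfd, uint64_t addr, uint32_t data)
{
    struct kvm_msi msi = {
        .address_lo = (uint32_t)addr,
        .address_hi = (uint32_t)(addr >> 32),
        .data       = data,
        .flags      = 0,  /* the RFC reserves flags, e.g. for revoking
                           * a not-yet-delivered message */
    };

    /* > 0: delivered, 0: blocked by guest state (e.g. masked),
     * < 0: error */
    return ioctl(vmfd, KVM_SIGNAL_MSI, &msi);
}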
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 11:45, Avi Kivity wrote:
> On 10/21/2011 11:19 AM, Jan Kiszka wrote:
> [...]
> By itself, this does not provide enough value to offset the cost of
> a new ABI, especially as userspace will need to continue supporting
> the old method for a very long while.

Yes, but in a less sophisticated way than it has to now.

> With the new feature it may be worthwhile, but I'd like to see the
> whole thing, with numbers attached.

It's not a performance issue, it's a resource limitation issue: With
the new API we can stop worrying about user space device models
consuming limited IRQ routes of the KVM subsystem.

Jan
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 10/24/2011 12:19 PM, Jan Kiszka wrote:
> With the new feature it may be worthwhile, but I'd like to see the
> whole thing, with numbers attached.
>
> It's not a performance issue, it's a resource limitation issue: With
> the new API we can stop worrying about user space device models
> consuming limited IRQ routes of the KVM subsystem.

Only if those devices are in the same process (or have access to the
vmfd). Interrupt routing together with irqfd allows you to
disaggregate the device model.

Instead of providing a competing implementation with new limitations,
we need to remove the limitations of the old implementation.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 13:09, Avi Kivity wrote:
> On 10/24/2011 12:19 PM, Jan Kiszka wrote:
> [...]
> It's not a performance issue, it's a resource limitation issue: With
> the new API we can stop worrying about user space device models
> consuming limited IRQ routes of the KVM subsystem.
>
> Only if those devices are in the same process (or have access to the
> vmfd). Interrupt routing together with irqfd allows you to
> disaggregate the device model. Instead of providing a competing
> implementation with new limitations, we need to remove the
> limitations of the old implementation.

That depends on where we do the cut. Currently we let the IRQ source
signal an abstract edge on a pre-allocated pseudo IRQ line. But we
cannot build correct MSI-X on top of the current irqfd model as we
lack the level information (for PBA emulation). *) So we either need
to extend the existing model anyway -- or push per-vector masking back
to the IRQ source. In the latter case, it would be a very good chance
to give up on limited pseudo GSIs with static routes and do MSI
messaging from external IRQ sources to KVM directly.

But all those considerations affect different APIs than what I'm
proposing here. We will always need a way to inject MSIs in the
context of the VM, as there will always be scenarios where devices are
better run in that very same context, for performance or simplicity or
whatever reasons. E.g., I could imagine that one would like to execute
an emulated IRQ remapper in the hypervisor context rather than
over-microkernelized in a separate process.

Jan

*) Realized this while trying to generalize the proposed MSI-X MMIO
acceleration for assigned devices to arbitrary device models,
vhost-net, and specifically vfio.
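For contrast, a sketch of the static-route model Jan criticizes: user
space must burn a pseudo GSI per MSI message, program a route for it,
and then signal the GSI through an irqfd. This uses the long-standing
KVM_SET_GSI_ROUTING/KVM_IRQFD interfaces; error handling and GSI
allocation policy are trimmed.

#include <stdint.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int bind_msi_to_gsi(int vmfd, uint32_t gsi,
                           uint64_t addr, uint32_t data)
{
    struct kvm_irq_routing *r;
    int ret, efd;

    /* KVM_SET_GSI_ROUTING replaces the whole table; real users must
     * keep and re-submit every existing entry, which is part of the
     * management burden being discussed. */
    r = calloc(1, sizeof(*r) + sizeof(r->entries[0]));
    r->nr = 1;
    r->entries[0].gsi = gsi;
    r->entries[0].type = KVM_IRQ_ROUTING_MSI;
    r->entries[0].u.msi.address_lo = (uint32_t)addr;
    r->entries[0].u.msi.address_hi = (uint32_t)(addr >> 32);
    r->entries[0].u.msi.data = data;
    ret = ioctl(vmfd, KVM_SET_GSI_ROUTING, r);
    free(r);
    if (ret < 0)
        return -1;

    /* From now on, an edge on the eventfd injects the routed MSI. */
    efd = eventfd(0, 0);
    struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi };
    return ioctl(vmfd, KVM_IRQFD, &irqfd) < 0 ? -1 : efd;
}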
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Mon, Oct 24, 2011 at 02:06:08PM +0200, Jan Kiszka wrote:
> [...]
> That depends on where we do the cut. Currently we let the IRQ source
> signal an abstract edge on a pre-allocated pseudo IRQ line. But we
> cannot build correct MSI-X on top of the current irqfd model as we
> lack the level information (for PBA emulation). *)

I don't agree here. IMO PBA emulation would need to clear pending bits
on interrupt status register read. So clearing pending bits could be
done by ioctl from qemu while setting them would be done from irqfd.

> *) Realized this while trying to generalize the proposed MSI-X MMIO
> acceleration for assigned devices to arbitrary device models,
> vhost-net,

I'm actually working on a qemu patch to get pba emulation working
correctly. I think it's doable with existing irqfd.

> and specifically vfio.

Interesting. How would you clear the pseudo interrupt level?
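A sketch of the split Michael proposes: the irqfd edge sets the
pending state in the kernel, and QEMU clears it when the guest reads
the device's interrupt status register. The clear-pending ioctl shown
here is hypothetical -- no such ioctl exists; it only stands in for
whatever interface such a series would add.

#include <stdint.h>
#include <sys/ioctl.h>

#define KVM_MSIX_CLEAR_PENDING 0   /* placeholder, NOT a real ioctl */

struct msix_dev {
    int vmfd;
    uint32_t isr;   /* emulated interrupt status register */
    uint32_t gsi;   /* pseudo GSI backing this vector's irqfd */
};

/* Guest read of the interrupt status register: read-to-clear, and
 * the pending bit of the associated vector is dropped along with it.
 * Setting the pending bit happens in the kernel when the irqfd fires
 * while the vector is masked. */
static uint32_t msix_isr_read(struct msix_dev *d)
{
    uint32_t val = d->isr;

    d->isr = 0;
    ioctl(d->vmfd, KVM_MSIX_CLEAR_PENDING, &d->gsi);
    return val;
}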
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 14:43, Michael S. Tsirkin wrote:
> [...]
> I don't agree here. IMO PBA emulation would need to clear pending
> bits on interrupt status register read. So clearing pending bits
> could be done by ioctl from qemu while setting them would be done
> from irqfd.

How should QEMU know if the reason for pending has been cleared at
device level if the device is outside the scope of QEMU? This model
only works for PV devices when you agree that spurious IRQs are OK.

> I'm actually working on a qemu patch to get pba emulation working
> correctly. I think it's doable with existing irqfd.

irqfd has no notion of level. You can only communicate a rising edge
and then need a side channel for the state of the edge reason.

> Interesting. How would you clear the pseudo interrupt level?

Ideally: not at all (for MSI). If we manage the mask at device level,
we only need to send the message if there is actually something to
deliver to the interrupt controller, and masked input events would be
lost on real HW as well.

That said, we still need to address the irqfd level topic for the
finite amount of legacy interrupt lines. If a line is masked at an IRQ
controller, the device needs to keep the controller up to date w.r.t.
the line state, or the controller has to poll the current state on
unmask to avoid spurious injections.

Jan
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 15:11, Jan Kiszka wrote:
> [...]
> Ideally: not at all (for MSI). If we manage the mask at device
> level, we only need to send the message if there is actually
> something to deliver to the interrupt controller, and masked input
> events would be lost on real HW as well.

This wouldn't work out nicely either. We rather need a combined model:
Devices need to maintain the PBA actively, i.e. set/clear the bits
themselves and not rely on the core here (with the core being either
QEMU user space or an in-kernel MSI-X MMIO accelerator). The core only
checks the PBA if it is about to deliver some message and refrains
from doing so if the bit became 0 in the meantime (specifically during
the masked period). A sketch of the device-side send path follows
below.

For QEMU device models, that means no additional IOCTLs, just memory
sharing of the PBA which is required anyway. But that means
QEMU-external device models need to gain at least basic MSI-X
knowledge. And if they gain this awareness, they could also use it to
send full-blown messages directly (e.g. device-id/vector tuples)
instead of encoding them into finite GSI numbers. But that's an add-on
topic.

Moreover, we still need a corresponding side channel for line-based
interrupts.

Jan
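The device-side send path of this combined model could look as
follows: the device consults the guest-visible mask itself and pends
instead of sending. Struct layout and the injection primitive are
assumptions for illustration only.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct msix_vec {
    _Atomic uint64_t *pba_word;  /* PBA word shared with the core */
    uint64_t pba_bit;            /* this vector's bit */
    bool masked;                 /* mirrored guest mask state */
};

/* kvm_inject_msi() stands in for the actual injection primitive,
 * e.g. the direct injection ioctl proposed in this RFC. */
void kvm_inject_msi(struct msix_vec *v);

static void msix_notify(struct msix_vec *v)
{
    if (v->masked) {
        /* Cannot send now: record in the PBA. The core injects on
         * unmask if the bit is still set; the device withdraws it if
         * the reason resolves first. */
        atomic_fetch_or(v->pba_word, v->pba_bit);
        return;
    }
    kvm_inject_msi(v);
}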
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Mon, Oct 24, 2011 at 03:11:25PM +0200, Jan Kiszka wrote:
> [...]
> How should QEMU know if the reason for pending has been cleared at
> device level if the device is outside the scope of QEMU? This model
> only works for PV devices when you agree that spurious IRQs are OK.

A read of irq status clears pending in the same way it clears the irq
line for level. I don't think this generates spurious irqs. Yes, it
only works for PV. For assigned devices, the only way I see to
implement PBA correctly is by masking the vector in the device and
looking at the actual pending bit.

> irqfd has no notion of level. You can only communicate a rising edge
> and then need a side channel for the state of the edge reason.

True. But we only need that for PBA reads, which are unused ATM. So
kvm can just send the read to userspace, have qemu query vfio or
whatever.

> Ideally: not at all (for MSI). If we manage the mask at device
> level, we only need to send the message if there is actually
> something to deliver to the interrupt controller, and masked input
> events would be lost on real HW as well.

Not sure I understand. We certainly shouldn't send masked interrupts
to the APIC, if for no other reason than that the message value is
invalid while masked.

> That said, we still need to address the irqfd level topic for the
> finite amount of legacy interrupt lines. If a line is masked at an
> IRQ controller, the device needs to keep the controller up to date
> w.r.t. the line state, or the controller has to poll the current
> state on unmask to avoid spurious injections.
>
> Jan

Yes, level interrupts are tricky.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Mon, Oct 24, 2011 at 03:43:53PM +0200, Jan Kiszka wrote:
> [...]
> This wouldn't work out nicely either. We rather need a combined
> model: Devices need to maintain the PBA actively, i.e. set/clear the
> bits themselves and not rely on the core here (with the core being
> either QEMU user space or an in-kernel MSI-X MMIO accelerator). The
> core only checks the PBA if it is about to deliver some message and
> refrains from doing so if the bit became 0 in the meantime
> (specifically during the masked period).
>
> For QEMU device models, that means no additional IOCTLs, just memory
> sharing of the PBA which is required anyway.

Sorry, I don't understand the above two paragraphs. Maybe I am
confused by terminology here. We really only need to check the PBA
when it's read. Whether the message is delivered only depends on the
mask bit.

> But that means QEMU-external device models need to gain at least
> basic MSI-X knowledge. And if they gain this awareness, they could
> also use it to send full-blown messages directly (e.g.
> device-id/vector tuples) instead of encoding them into finite GSI
> numbers. But that's an add-on topic. Moreover, we still need a
> corresponding side channel for line-based interrupts.
>
> Jan

Agree on all points with the above.
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 16:40, Michael S. Tsirkin wrote:
> [...]
> For QEMU device models, that means no additional IOCTLs, just memory
> sharing of the PBA which is required anyway.
>
> Sorry, I don't understand the above two paragraphs. Maybe I am
> confused by terminology here. We really only need to check the PBA
> when it's read. Whether the message is delivered only depends on the
> mask bit.

This is what I have in mind:

- devices set PBA bit if MSI message cannot be sent due to mask (*)
- core checks/clears PBA bit on unmask, injects message if bit was set
- devices clear PBA bit if message reason is resolved before unmask (*)

The marked (*) lines differ from the current user space model where
only the core does PBA manipulation (including clearance via a special
function). Basically, the PBA becomes a communication channel also
between device and MSI core. And this model also works if core and
device run in different processes, provided they set up the PBA as
shared memory.

Jan
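The cross-process variant Jan mentions needs nothing more than a
shared mapping of the PBA that both sides touch with atomic
operations. A minimal sketch; the memfd/mmap transport and the vector
count are assumptions, the thread does not prescribe a mechanism.

#include <stdint.h>
#include <stdatomic.h>
#include <sys/mman.h>

#define MSIX_MAX_VECTORS 2048
#define PBA_BYTES (MSIX_MAX_VECTORS / 8)

/* Both processes map the same fd (e.g. a memfd handed over a unix
 * socket) and from then on only use atomic bit operations on it. */
static _Atomic uint64_t *map_shared_pba(int shared_fd)
{
    void *p = mmap(NULL, PBA_BYTES, PROT_READ | PROT_WRITE,
                   MAP_SHARED, shared_fd, 0);
    return p == MAP_FAILED ? NULL : (_Atomic uint64_t *)p;
}

/* Device side: flag a message that was suppressed by the mask. */
static void pba_set(_Atomic uint64_t *pba, unsigned int vec)
{
    atomic_fetch_or(&pba[vec / 64], UINT64_C(1) << (vec % 64));
}

/* Core side, on unmask: consume the bit; true means inject now. */
static int pba_consume(_Atomic uint64_t *pba, unsigned int vec)
{
    uint64_t bit = UINT64_C(1) << (vec % 64);
    return (atomic_fetch_and(&pba[vec / 64], ~bit) & bit) != 0;
}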
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Mon, Oct 24, 2011 at 05:00:27PM +0200, Jan Kiszka wrote:
> [...]
> This is what I have in mind:
>
> - devices set PBA bit if MSI message cannot be sent due to mask (*)
> - core checks/clears PBA bit on unmask, injects message if bit was set
> - devices clear PBA bit if message reason is resolved before unmask (*)

OK, but practically, when exactly does the device clear PBA?

> The marked (*) lines differ from the current user space model where
> only the core does PBA manipulation (including clearance via a
> special function). Basically, the PBA becomes a communication
> channel also between device and MSI core. And this model also works
> if core and device run in different processes, provided they set up
> the PBA as shared memory.
>
> Jan
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind:
- devices set PBA bit if MSI message cannot be sent due to mask (*)
- core checks and clears PBA bit on unmask, injects message if bit was set
- devices clear PBA bit if message reason is resolved before unmask (*)

OK, but practically, when exactly does the device clear PBA?

Consider a network adapter that signals messages in an RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt injection on a later vector unmask operation. Jan

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
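[Editorial note: a sketch of that take-back in an emulated NIC's receive path; nic_state, its pba field, and rx_ring_empty() are assumptions made up for illustration, not taken from any existing device model.]

	/* Sketch: device model retracting a pending MSI-X event once the
	 * interrupt reason has disappeared while the vector was masked. */
	#include <linux/bitops.h>

	struct nic_state {
		unsigned long *pba;	/* PBA shared with the MSI core */
		/* ... RX ring state ... */
	};

	bool rx_ring_empty(struct nic_state *nic);	/* assumed helper */

	static void nic_after_guest_rx(struct nic_state *nic, int vector)
	{
		if (rx_ring_empty(nic)) {
			/* Nothing left to signal: take back the pending bit
			 * so a later unmask does not inject a stale IRQ. */
			clear_bit(vector, nic->pba);
		}
	}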
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind:
- devices set PBA bit if MSI message cannot be sent due to mask (*)
- core checks and clears PBA bit on unmask, injects message if bit was set
- devices clear PBA bit if message reason is resolved before unmask (*)

OK, but practically, when exactly does the device clear PBA?

Consider a network adapter that signals messages in an RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt injection on a later vector unmask operation. Jan

Do you mean virtio here? Do you expect this optimization to give a significant performance gain?

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Mon, Oct 24, 2011 at 07:05:08PM +0200, Michael S. Tsirkin wrote: On Mon, Oct 24, 2011 at 06:10:28PM +0200, Jan Kiszka wrote: On 2011-10-24 18:05, Michael S. Tsirkin wrote: This is what I have in mind:
- devices set PBA bit if MSI message cannot be sent due to mask (*)
- core checks and clears PBA bit on unmask, injects message if bit was set
- devices clear PBA bit if message reason is resolved before unmask (*)

OK, but practically, when exactly does the device clear PBA?

Consider a network adapter that signals messages in an RX ring: If the corresponding vector is masked while the guest empties the ring, I strongly assume that the device is supposed to take back the pending bit in that case so that there is no interrupt injection on a later vector unmask operation. Jan

Do you mean virtio here? Do you expect this optimization to give a significant performance gain?

It would also be challenging to implement this in a race-free manner. Clearing on interrupt status read seems straightforward.

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
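[Editorial note: the clear-on-status-read variant could look like the following sketch; the register layout and dev_state fields are assumptions for illustration. The atomic report-and-clear per bit is what keeps it race-free against concurrent setters.]

	/* Sketch: a guest read of the interrupt status register reports
	 * and clears pending causes in one atomic step per vector. */
	#include <linux/bitops.h>
	#include <linux/types.h>

	struct dev_state {
		unsigned long *pba;
		int nr_vectors;
	};

	static u32 dev_read_isr(struct dev_state *dev)
	{
		u32 status = 0;
		int v;

		for (v = 0; v < dev->nr_vectors && v < 32; v++) {
			/* If a setter races with this read, the bit is either
			 * reported now or survives for the next read. */
			if (test_and_clear_bit(v, dev->pba))
				status |= 1u << v;
		}
		return status;
	}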
[RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated on the fly by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. The API already provides a channel (flags) to revoke an injected but not yet delivered message which will become important for in-kernel MSI-X vector masking support.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Documentation/virtual/kvm/api.txt |   23 +++
 include/linux/kvm.h               |   15 +++
 virt/kvm/kvm_main.c               |   18 ++
 3 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 7945b0b..f4c3de3 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1383,6 +1383,29 @@ The following flags are defined:
 If datamatch flag is set, the event will be signaled only if the written value
 to the registered address is equal to datamatch in struct kvm_ioeventfd.
 
+4.59 KVM_SET_MSI
+
+Capability: KVM_CAP_SET_MSI
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_msi (in)
+Returns: 0 on success, -1 on error
+
+Directly inject an MSI message. Only valid with in-kernel irqchip that handles
+MSI messages.
+
+struct kvm_msi {
+	__u32 address_lo;
+	__u32 address_hi;
+	__u32 data;
+	__u32 flags;
+	__u8  pad[16];
+};
+
+The following flags are defined:
+
+#define KVM_MSI_FLAG_RAISE (1 << 0)
+
 4.62 KVM_CREATE_SPAPR_TCE
 
 Capability: KVM_CAP_SPAPR_TCE
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 6884054..83875ed 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,9 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_S390_GMAP 71
+#ifdef __KVM_HAVE_MSI
+#define KVM_CAP_SET_MSI 72
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -636,6 +639,16 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+#define KVM_MSI_FLAG_RAISE (1 << 0)
+
+struct kvm_msi {
+	__u32 address_lo;
+	__u32 address_hi;
+	__u32 data;
+	__u32 flags;
+	__u8  pad[16];
+};
+
 /*
  * ioctls for VM fds
  */
@@ -696,6 +709,8 @@ struct kvm_clock_data {
 /* Available with KVM_CAP_TSC_CONTROL */
 #define KVM_SET_TSC_KHZ _IO(KVMIO, 0xa2)
 #define KVM_GET_TSC_KHZ _IO(KVMIO, 0xa3)
+/* Available with KVM_CAP_SET_MSI */
+#define KVM_SET_MSI _IOW(KVMIO, 0xa4, struct kvm_msi)
 
 /*
  * ioctls for vcpu fds
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..0e3a947 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2058,6 +2058,24 @@ static long kvm_vm_ioctl(struct file *filp,
 		mutex_unlock(&kvm->lock);
 		break;
 #endif
+#ifdef __KVM_HAVE_MSI
+	case KVM_SET_MSI: {
+		struct kvm_kernel_irq_routing_entry route;
+		struct kvm_msi msi;
+
+		r = -EFAULT;
+		if (copy_from_user(&msi, argp, sizeof msi))
+			goto out;
+		route.msi.address_lo = msi.address_lo;
+		route.msi.address_hi = msi.address_hi;
+		route.msi.data = msi.data;
+		r = 0;
+		if (msi.flags & KVM_MSI_FLAG_RAISE)
+			r = kvm_set_msi(&route, kvm,
+					KVM_USERSPACE_IRQ_SOURCE_ID, 1);
+		break;
+	}
+#endif
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)
--
1.7.3.4
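[Editorial note: for reference, user space consumption of the proposed ioctl could look like this sketch. Since the RFC never shipped in released headers, the definitions are repeated locally rather than taken from linux/kvm.h; the vm file descriptor is assumed to belong to a VM with an in-kernel irqchip.]

	/* Sketch: injecting an MSI message via the proposed KVM_SET_MSI. */
	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>

	/* Repeated from the RFC; not part of released kernel headers. */
	#define KVMIO 0xAE

	struct kvm_msi {
		uint32_t address_lo;
		uint32_t address_hi;
		uint32_t data;
		uint32_t flags;
		uint8_t  pad[16];
	};

	#define KVM_MSI_FLAG_RAISE (1 << 0)
	#define KVM_SET_MSI _IOW(KVMIO, 0xa4, struct kvm_msi)

	static int inject_msi(int vm_fd, uint64_t addr, uint32_t data)
	{
		struct kvm_msi msi;

		memset(&msi, 0, sizeof(msi));
		msi.address_lo = (uint32_t)addr;
		msi.address_hi = (uint32_t)(addr >> 32);
		msi.data = data;
		msi.flags = KVM_MSI_FLAG_RAISE;

		/* Returns 0 on success, -1 with errno set on error. */
		return ioctl(vm_fd, KVM_SET_MSI, &msi);
	}

Note how nothing here touches the IRQ routing table: the message is consumed directly, which is the whole point of the proposal.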
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Fri, 2011-10-21 at 11:19 +0200, Jan Kiszka wrote: Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated on the fly by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. The API already provides a channel (flags) to revoke an injected but not yet delivered message which will become important for in-kernel MSI-X vector masking support. Signed-off-by: Jan Kiszka jan.kis...@siemens.com

[...]

+struct kvm_msi {
+	__u32 address_lo;
+	__u32 address_hi;
+	__u32 data;
+	__u32 flags;
+	__u8  pad[16];
+};

How about defining it as:

struct kvm_msi {
	struct msi_msg msi;
	__u32 flags;
	__u8 pad[16];
};

It would allow keeping everything in a msi_msg all the way from userspace up to kvm_set_msi(). -- Sasha.
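[Editorial note: a sketch of what Sasha's suggestion could buy in the handler, under the assumption that a kvm_set_msi() variant taking a struct msi_msg directly existed; kvm_set_msi_msg() is hypothetical, invented here for illustration. struct msi_msg is the in-kernel type from linux/msi.h with address_lo, address_hi, and data fields.]

	/* Sketch: hypothetical handler if kvm_msi embedded struct msi_msg,
	 * avoiding the per-field copy into a routing entry. */
	#include <linux/msi.h>
	#include <linux/uaccess.h>

	struct kvm_msi_alt {
		struct msi_msg msi;	/* address_lo, address_hi, data */
		__u32 flags;
		__u8 pad[16];
	};

	static int kvm_vm_ioctl_set_msi(struct kvm *kvm, void __user *argp)
	{
		struct kvm_msi_alt umsi;

		if (copy_from_user(&umsi, argp, sizeof(umsi)))
			return -EFAULT;
		if (!(umsi.flags & KVM_MSI_FLAG_RAISE))
			return 0;
		/* Hypothetical variant of kvm_set_msi() taking a msi_msg. */
		return kvm_set_msi_msg(kvm, &umsi.msi,
				       KVM_USERSPACE_IRQ_SOURCE_ID);
	}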
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Fri, Oct 21, 2011 at 11:19:19AM +0200, Jan Kiszka wrote: Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated on the fly by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. The API already provides a channel (flags) to revoke an injected but not yet delivered message which will become important for in-kernel MSI-X vector masking support. Signed-off-by: Jan Kiszka jan.kis...@siemens.com

I would love to see how you envision extending this to add the masking support at least at the API level, not necessarily the supporting code. It would seem hard to use the flags field for that since the MSI-X mask is per device per vector, not per message. Which gets us back to a resource per vector which userspace has to manage ... interrupt remapping is also per device, so it isn't any easier with this API.

[...]
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-21 13:06, Michael S. Tsirkin wrote: On Fri, Oct 21, 2011 at 11:19:19AM +0200, Jan Kiszka wrote: Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated on the fly by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. The API already provides a channel (flags) to revoke an injected but not yet delivered message which will become important for in-kernel MSI-X vector masking support. Signed-off-by: Jan Kiszka jan.kis...@siemens.com

I would love to see how you envision extending this to add the masking support at least at the API level, not necessarily the supporting code. It would seem hard to use the flags field for that since the MSI-X mask is per device per vector, not per message. Which gets us back to a resource per vector which userspace has to manage ... interrupt remapping is also per device, so it isn't any easier with this API.

Yes, we will need an additional field to associate the message with its source device. Could be a PCI address or a handle (like the one assigned devices get) returned on MSI-X kernel region setup. We will need a flag to declare that address/handle valid, also to tell apart platform MSI messages (e.g. coming from the HPET on x86). I see no obstacles ATM that prevent doing that on top of this API, do you? Jan

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
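[Editorial note: concretely, the extension Jan describes could take a shape like the following sketch; the devid/vector fields and the KVM_MSI_FLAG_VALID_DEVID flag name are made up here for illustration and were not part of the posted patch.]

	/* Sketch: associating a directly injected MSI with a source device
	 * so per-vector masking or remapping can be applied in the kernel. */
	#include <linux/types.h>

	struct kvm_msi_v2 {
		__u32 address_lo;
		__u32 address_hi;
		__u32 data;
		__u32 flags;
	#define KVM_MSI_FLAG_RAISE		(1 << 0)
	#define KVM_MSI_FLAG_VALID_DEVID	(1 << 1) /* devid/vector valid */
		__u32 devid;	/* handle returned on MSI-X region setup */
		__u16 vector;	/* index into the device's MSI-X table */
		__u8  pad[10];	/* keeps the struct at the original 32 bytes */
	};

Messages without KVM_MSI_FLAG_VALID_DEVID (e.g. platform MSIs from an HPET) would behave exactly as in the base proposal, so the extension stays backward compatible.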
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On Fri, Oct 21, 2011 at 01:51:15PM +0200, Jan Kiszka wrote: On 2011-10-21 13:06, Michael S. Tsirkin wrote: On Fri, Oct 21, 2011 at 11:19:19AM +0200, Jan Kiszka wrote: Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated on the fly by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. The API already provides a channel (flags) to revoke an injected but not yet delivered message which will become important for in-kernel MSI-X vector masking support. Signed-off-by: Jan Kiszka jan.kis...@siemens.com

I would love to see how you envision extending this to add the masking support at least at the API level, not necessarily the supporting code. It would seem hard to use the flags field for that since the MSI-X mask is per device per vector, not per message. Which gets us back to a resource per vector which userspace has to manage ... interrupt remapping is also per device, so it isn't any easier with this API.

Yes, we will need an additional field to associate the message with its source device. Could be a PCI address or a handle (like the one assigned devices get) returned on MSI-X kernel region setup. We will need a flag to declare that address/handle valid, also to tell apart platform MSI messages (e.g. coming from the HPET on x86).

I have not thought about remapping a lot yet: HPET interrupts are not subject to remapping?

I see no obstacles ATM that prevent doing that on top of this API, do you? Jan

For masking, I think I do. We need to maintain the pending bit and the io notifiers in kernel, per vector. An MSI injected with just an address/data pair, without vector/device info, can't be masked properly. We get back to maintaining some handle per vector, right?

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [RFC][PATCH] KVM: Introduce direct MSI message injection for in-kernel irqchips
On 2011-10-21 14:04, Michael S. Tsirkin wrote: On Fri, Oct 21, 2011 at 01:51:15PM +0200, Jan Kiszka wrote: On 2011-10-21 13:06, Michael S. Tsirkin wrote: On Fri, Oct 21, 2011 at 11:19:19AM +0200, Jan Kiszka wrote: Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated on the fly by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. The API already provides a channel (flags) to revoke an injected but not yet delivered message which will become important for in-kernel MSI-X vector masking support. Signed-off-by: Jan Kiszka jan.kis...@siemens.com

I would love to see how you envision extending this to add the masking support at least at the API level, not necessarily the supporting code. It would seem hard to use the flags field for that since the MSI-X mask is per device per vector, not per message. Which gets us back to a resource per vector which userspace has to manage ... interrupt remapping is also per device, so it isn't any easier with this API.

Yes, we will need an additional field to associate the message with its source device. Could be a PCI address or a handle (like the one assigned devices get) returned on MSI-X kernel region setup. We will need a flag to declare that address/handle valid, also to tell apart platform MSI messages (e.g. coming from the HPET on x86).

I have not thought about remapping a lot yet: HPET interrupts are not subject to remapping?

Looks like it is, at least on VT-d: The related VT-d document knows two non-PCI source IDs, namely legacy pin interrupts and other MSIs. So we may want a more generic source ID that, for MSI-X in-kernel masking, can then be associated with a device vector for which we accelerate mask management.

I see no obstacles ATM that prevent doing that on top of this API, do you? Jan

For masking, I think I do. We need to maintain the pending bit and the io notifiers in kernel, per vector. An MSI injected with just an address/data pair, without vector/device info, can't be masked properly. We get back to maintaining some handle per vector, right?

First of all, the common case for in-kernel MSI-X mask management will be MSI sources that are _not_ injected as an address/data pair from user space but come from in-kernel sources (irqfd or host IRQs, i.e. assigned devices). In contrast, this API here is targeting MSI messages generated in the hypervisor process (i.e. current QEMU device emulation). Still, the new interface should allow for injecting the other vectors as well without requiring additional coordination of an in-kernel MSI-X page vs. user space's view on it. For that reason we need a per-vector handle for that special case. But that will naturally derive from defining a generic MSI-X in-kernel mask management API. You will have to specify which device shall be accelerated and how many vectors it has (at maximum). So a directly injected MSI message for those devices will have to specify that source tuple (device, vector), but only in that special case.

Maybe I will sit down now and create a draft for an MSI-X mask acceleration API. That may help us feel better about this proposal.
:) Jan

-- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
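[Editorial note: purely to make that last paragraph tangible, a registration-style API of the kind Jan hints at might look like the sketch below. Everything here, including the struct layout, ioctl number, and name, is speculative and not from any posted patch.]

	/* Sketch: hypothetical ioctl to put a device's MSI-X table and PBA
	 * under in-kernel mask management, yielding a handle that later
	 * direct injections can reference as their source device. */
	#include <linux/ioctl.h>
	#include <linux/types.h>

	#define KVMIO 0xAE	/* standard KVM ioctl type */

	struct kvm_msix_mmio {
		__u32 id;		/* out: handle naming this device */
		__u32 nr_vectors;	/* in: MSI-X table size to manage */
		__u64 table_addr;	/* in: guest-physical table address */
		__u64 pba_addr;		/* in: guest-physical PBA address */
		__u32 flags;
		__u32 pad[5];
	};

	/* Ioctl number chosen arbitrarily for the sketch. */
	#define KVM_REGISTER_MSIX_MMIO _IOWR(KVMIO, 0xa5, struct kvm_msix_mmio)

With such a registration in place, the (device, vector) source tuple discussed above would simply be (id, table index), and the kernel could handle mask and pending-bit accesses without exiting to user space.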