Re: Question on skip_emulated_instructions()
On 04/08/2010 04:42 PM, Yoshiaki Tamura wrote:
>> Yes, you can release the I/O from the iothread instead of the vcpu thread.
>> You can make virtio_net_handle_tx() disable virtio notifications,
>> initiate state sync, and return; when state sync continues you can call
>> the original virtio_net_handle_tx(). If the secondary takes over, it
>> needs to call the original virtio_net_handle_tx() as well.
>
> Agreed. Let me try it.
> Meanwhile, I'll post what I have done, including the hack that prevents
> rip from advancing. I would appreciate it if you could comment on that
> too, to keep things going in a good direction.

Certainly.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
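The gating of virtio_net_handle_tx() suggested in this message can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: struct net_dev, gated_handle_tx() and release_deferred_tx() are hypothetical names invented here, original_handle_tx() merely stands in for the real virtio_net_handle_tx(), and none of this is QEMU code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a virtio-net device; not QEMU's real structure. */
struct net_dev {
    bool notify_enabled;  /* guest-to-host TX notifications enabled? */
    bool tx_deferred;     /* a TX kick is waiting for state sync */
    int  queued;          /* packets the guest has placed on the TX ring */
    int  sent;            /* packets actually released to the wire */
};

/* Stand-in for the original virtio_net_handle_tx(): flush the TX ring. */
static void original_handle_tx(struct net_dev *d)
{
    d->sent += d->queued;
    d->queued = 0;
}

/* Gated handler, run when the guest kicks the TX queue. It sends nothing;
 * it only disables notifications and records that output is pending.
 * A real implementation would initiate the state sync here and return. */
static void gated_handle_tx(struct net_dev *d)
{
    d->notify_enabled = false;
    d->tx_deferred = true;
}

/* Run once state sync has gone through (for example from the iothread),
 * or on the secondary after it takes over: release the pending output. */
static void release_deferred_tx(struct net_dev *d)
{
    if (!d->tx_deferred)
        return;                 /* nothing was gated */
    d->tx_deferred = false;
    original_handle_tx(d);
    assert(d->queued == 0);     /* the ring has been flushed */
    d->notify_enabled = true;
}
```

In this model a guest kick leaves `sent` unchanged until release_deferred_tx() runs, which is the property the thread is after: no packet becomes externally visible before the secondary holds the matching state.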
Re: Question on skip_emulated_instructions()
2010/4/8 Avi Kivity:
> On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote:
>>>
>>> I don't think you can in the general case. But if you gate output at the
>>> device level, instead of the instruction level, the problem goes away,
>>> no?
>>
>> Yes, it should.
>> To implement this, we need to make No.3 to be called asynchronously. If
>> qemu is already handling I/O asynchronously, it would be relatively easy to
>> make this.
>
> Yes, you can release the I/O from the iothread instead of the vcpu thread.
> You can make virtio_net_handle_tx() disable virtio notifications and
> initiate state sync and return, when state sync continues you can call the
> original virtio_net_handle_tx(). If the secondary takes over, it needs to
> call the original virtio_net_handle_tx() as well.

Agreed. Let me try it.
Meanwhile, I'll post what I have done, including the hack that prevents
rip from advancing. I would appreciate it if you could comment on that
too, to keep things going in a good direction.
Re: Question on skip_emulated_instructions()
On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote:
>> I don't think you can in the general case. But if you gate output at the
>> device level, instead of the instruction level, the problem goes away,
>> no?
>
> Yes, it should.
> To implement this, we need to make No.3 to be called asynchronously. If
> qemu is already handling I/O asynchronously, it would be relatively easy
> to make this.

Yes, you can release the I/O from the iothread instead of the vcpu thread.
You can make virtio_net_handle_tx() disable virtio notifications and
initiate state sync and return, when state sync continues you can call the
original virtio_net_handle_tx(). If the secondary takes over, it needs to
call the original virtio_net_handle_tx() as well.

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
Avi Kivity wrote: > On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote: >> >> If I transferred a VM after I/O operations, let's say the VM sent an >> TCP ACK to the client, and if a hardware failure occurred to the >> primary during the VM transferring *but the client received the TCP >> ACK*, the secondary will resume from the previous state, and it may >> need to receive some data from the client. However, because the client >> has already receiver TCP ACK, it won't resend the data to the >> secondary. It looks this data is going to be dropped. Am I missing >> some point here? >> > > I think you should block I/O not at the cpu/device boundary (that's > inefficient as many cpu I/O instructions don't necessarily cause > externally visible I/O) but at the device level. Whenever the network > device wants to send out a packet, halt the guest (letting any I/O > instructions complete), synchronize the secondary, and then release the > pending I/O. This ensures that the secondary has all of the data prior > to the ack being sent out. Although I was thinking to clean up my current code, maybe I should post the current status for explanation now. As you mentioned, I'm capturing I/O at the device level, by inserting a hook inside of PIO/MMIO handler in virtio-blk, virtio-net and e1000 emulator. Since it's implemented naively, it'll stop (meaning I/O instructions will be delayed) until transferring the VM is done. So what I can do here is, 1. Let I/O instructions to complete both at qemu and kvm. 2. Transfer the guest state. # VCPU and device model thinks I/O emulation is already done. 3. Finally release the pending output to the real world. If the responses to the mmio or pio request are exactly the same, then the replay will happen exactly the same. I agree. What I'm wondering is how can we guarantee that the responses are the same... I don't think you can in the general case. 
But if you gate output at the device level, instead of the instruction level, the problem goes away, no? Yes, it should. To implement this, we need to make No.3 to be called asynchronously. If qemu is already handling I/O asynchronously, it would be relatively easy to make this.
Re: Question on skip_emulated_instructions()
On 04/08/2010 11:10 AM, Yoshiaki Tamura wrote:
>> If the responses to the mmio or pio request are exactly the same,
>> then the replay will happen exactly the same.
>
> I agree. What I'm wondering is how can we guarantee that the
> responses are the same...

I don't think you can in the general case. But if you gate output at the
device level, instead of the instruction level, the problem goes away, no?

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote:
> If I transferred a VM after I/O operations, let's say the VM sent a
> TCP ACK to the client, and if a hardware failure occurred on the
> primary during the VM transfer *but the client received the TCP
> ACK*, the secondary will resume from the previous state, and it may
> need to receive some data from the client. However, because the client
> has already received the TCP ACK, it won't resend the data to the
> secondary. It looks like this data is going to be dropped. Am I missing
> some point here?

I think you should block I/O not at the cpu/device boundary (that's
inefficient as many cpu I/O instructions don't necessarily cause
externally visible I/O) but at the device level. Whenever the network
device wants to send out a packet, halt the guest (letting any I/O
instructions complete), synchronize the secondary, and then release the
pending I/O. This ensures that the secondary has all of the data prior
to the ack being sent out.

--
error compiling committee.c: too many arguments to function
Re: Question on skip_emulated_instructions()
Avi Kivity wrote: On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote: To answer your question, it should be possible to implement. The down side is that after going into KVM to make the guest state to consistent, we need to go back to qemu to actually transfer the guest, and this bounce would introduce another overhead if I'm understanding correctly. Yes. It should be around a microsecond or so, given you will issue I/O after this I don't think this will affect performance. That is a good news. And yes, all I need is some consistent state to resume VM from, which must be able to continue I/O operations, like writing to disks and sending ack over the network. If I can guarantee this, sending the VM state after completing output is acceptable. I suggest you start with this. If it turns out performance is severely impacted, we can revisit instruction completion. If performance is satisfactory, then we'll be able to run Kemari with older kernels. I was almost to say yes here, but let me ask one more question. BTW, thank you two for taking time for this discussion which isn't a topic on KVM itself. If I transferred a VM after I/O operations, let's say the VM sent an TCP ACK to the client, and if a hardware failure occurred to the primary during the VM transferring *but the client received the TCP ACK*, the secondary will resume from the previous state, and it may need to receive some data from the client. However, because the client has already receiver TCP ACK, it won't resend the data to the secondary. It looks this data is going to be dropped. Am I missing some point here? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on skip_emulated_instructions()
Gleb Natapov wrote: On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote: On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote: The requirement is that the guest must always be able to replay at least the instruction which triggered the synchronization on the primary. You have two choices: - complete execution of the instruction in both the kernel and the device model This is what live migration does currently. Any mmio and pio requests are completed, the last instruction is finalized, and state is saved. - complete execution of the instruction in the kernel, but queue execution of mmio/pio requests This is more in line with what you describe. vcpu state will be after the instruction, device model state will be before instruction completion, when you replay the queue, the device model state will be consistent with the vcpu state. For "in" or "mmio read" you can't complete instruction without doing actual IO. So, if the mmio/pio requests in the queue are only "out" or "mmio write" Avi's suggestion No.2 would work. But if "in" or "mmio read" are mixed with these, (We don't have to think if the queue is filled with only "in" or "mmio read" because we're currently transferring only in case of "out" or "mmio write") the story gets complicated. From that point of view, I think I need to transfer the vcpu state before the instruction. If I post a signal and let the guest or emulator proceed, I'm not sure whether the guest on the secondary can be replay as expected. Please point out if I were misunderstanding. If the responses to the mmio or pio request are exactly the same, then the replay will happen exactly the same. I agree. What I'm wondering is how can we guarantee that the responses are the same... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on skip_emulated_instructions()
On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote: To answer your question, it should be possible to implement. The down side is that after going into KVM to make the guest state to consistent, we need to go back to qemu to actually transfer the guest, and this bounce would introduce another overhead if I'm understanding correctly. Yes. It should be around a microsecond or so, given you will issue I/O after this I don't think this will affect performance. And yes, all I need is some consistent state to resume VM from, which must be able to continue I/O operations, like writing to disks and sending ack over the network. If I can guarantee this, sending the VM state after completing output is acceptable. I suggest you start with this. If it turns out performance is severely impacted, we can revisit instruction completion. If performance is satisfactory, then we'll be able to run Kemari with older kernels. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on skip_emulated_instructions()
Gleb Natapov wrote: On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote: Currently we complete instructions for output operations and leave them incomplete for input operations. Deferring completion for output operations should work, except it may break the vmware backdoor port (see hw/vmport.c), which changes register state following an output instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state following a write instruction. Do you really need to transfer the vcpu state before the instruction, or do you just need a consistent state? If the latter, then you can get away by posting a signal and re-entering the guest. kvm will complete the instruction and exit immediately, and you will have fully consistent state. The requirement is that the guest must always be able to replay at least the instruction which triggered the synchronization on the primary. From that point of view, I think I need to transfer the vcpu state before the instruction. If I post a signal and let the guest or emulator proceed, I'm not sure whether the guest on the secondary can be replay as expected. Please point out if I were misunderstanding. All you need is some consistent sate to restart VM from, no? So if you transfer VM state after instruction that caused IO is completed you can restart VM on secondary from that state in case primary fails. I guess my question is: Can you make synchronization point to be immediately after IO instruction instead of before? To answer your question, it should be possible to implement. The down side is that after going into KVM to make the guest state to consistent, we need to go back to qemu to actually transfer the guest, and this bounce would introduce another overhead if I'm understanding correctly. And yes, all I need is some consistent state to resume VM from, which must be able to continue I/O operations, like writing to disks and sending ack over the network. 
If I can guarantee this, sending the VM state after completing output is acceptable.
Re: Question on skip_emulated_instructions()
On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:
> On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:
> >
> >The requirement is that the guest must always be able to replay at
> >least the instruction which triggered the synchronization on the
> >primary.
>
> You have two choices:
>
> - complete execution of the instruction in both the kernel and the
> device model
>
> This is what live migration does currently. Any mmio and pio
> requests are completed, the last instruction is finalized, and state
> is saved.
>
> - complete execution of the instruction in the kernel, but queue
> execution of mmio/pio requests
>
> This is more in line with what you describe. vcpu state will be
> after the instruction, device model state will be before instruction
> completion, when you replay the queue, the device model state will
> be consistent with the vcpu state.
>
For "in" or "mmio read" you can't complete the instruction without doing
the actual IO.

> > From that point of view, I think I need to transfer the vcpu
> >state before the instruction. If I post a signal and let the
> >guest or emulator proceed, I'm not sure whether the guest on the
> >secondary can be replay as expected. Please point out if I were
> >misunderstanding.
>
> If the responses to the mmio or pio request are exactly the same,
> then the replay will happen exactly the same.
>
> --
> error compiling committee.c: too many arguments to function

--
			Gleb.
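The second choice quoted above (complete the instruction in the kernel, queue the mmio/pio requests, replay them later) can be modelled in a few lines. This is a hedged sketch, not KVM or QEMU code: struct write_queue, struct toy_device, queue_write() and replay_writes() are hypothetical names, and the "device" is just four registers. It handles writes only, because, as noted in this message, an "in" or "mmio read" needs its result before the instruction can complete and so cannot simply be queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { QUEUE_MAX = 16, TOY_NREGS = 4 };

/* One deferred "out" / "mmio write" request (hypothetical layout). */
struct io_write {
    uint64_t addr;
    uint32_t value;
};

struct write_queue {
    struct io_write req[QUEUE_MAX];
    size_t len;
};

/* Toy device model: a handful of registers indexed by address. */
struct toy_device {
    uint32_t reg[TOY_NREGS];
};

/* Record a write instead of applying it. The vcpu can move past the
 * instruction while the device model still shows the old state. */
static bool queue_write(struct write_queue *q, uint64_t addr, uint32_t value)
{
    if (q->len == QUEUE_MAX)
        return false;           /* caller must synchronize and drain first */
    q->req[q->len++] = (struct io_write){ addr, value };
    return true;
}

/* Replay the queued writes in order, bringing the device model in line
 * with the vcpu state (on the primary after sync, or on the secondary). */
static void replay_writes(struct write_queue *q, struct toy_device *dev)
{
    for (size_t i = 0; i < q->len; i++) {
        assert(q->req[i].addr < TOY_NREGS);
        dev->reg[q->req[i].addr] = q->req[i].value;
    }
    q->len = 0;
}
```

Replaying in FIFO order matters: two queued writes to the same register must leave the later value in place, exactly as they would have without the queue.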
Re: Question on skip_emulated_instructions()
On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote: The requirement is that the guest must always be able to replay at least the instruction which triggered the synchronization on the primary. You have two choices: - complete execution of the instruction in both the kernel and the device model This is what live migration does currently. Any mmio and pio requests are completed, the last instruction is finalized, and state is saved. - complete execution of the instruction in the kernel, but queue execution of mmio/pio requests This is more in line with what you describe. vcpu state will be after the instruction, device model state will be before instruction completion, when you replay the queue, the device model state will be consistent with the vcpu state. From that point of view, I think I need to transfer the vcpu state before the instruction. If I post a signal and let the guest or emulator proceed, I'm not sure whether the guest on the secondary can be replay as expected. Please point out if I were misunderstanding. If the responses to the mmio or pio request are exactly the same, then the replay will happen exactly the same. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on skip_emulated_instructions()
On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
> >Currently we complete instructions for output operations and leave them
> >incomplete for input operations. Deferring completion for output
> >operations should work, except it may break the vmware backdoor port
> >(see hw/vmport.c), which changes register state following an output
> >instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
> >following a write instruction.
> >
> >Do you really need to transfer the vcpu state before the instruction, or
> >do you just need a consistent state? If the latter, then you can get
> >away by posting a signal and re-entering the guest. kvm will complete
> >the instruction and exit immediately, and you will have fully consistent
> >state.
>
> The requirement is that the guest must always be able to replay at
> least the instruction which triggered the synchronization on the
> primary. From that point of view, I think I need to transfer the
> vcpu state before the instruction. If I post a signal and let the
> guest or emulator proceed, I'm not sure whether the guest on the
> secondary can be replay as expected. Please point out if I were
> misunderstanding.

All you need is some consistent state to restart the VM from, no? So if you
transfer the VM state after the instruction that caused IO has completed, you
can restart the VM on the secondary from that state in case the primary fails.
I guess my question is: can you make the synchronization point immediately
after the IO instruction instead of before?

--
			Gleb.
Re: Question on skip_emulated_instructions()
Gleb Natapov wrote: On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote: Avi Kivity wrote: On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote: The problem here is that, I needed to transfer the VM state which is just *before* the output to the devices. Otherwise, the VM state has already been proceeded, and after failover, some I/O didn't work as I expected. I tracked down this issue, and figured out rip was already proceeded in KVM, and transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work and post a patch. vcpu state is undefined when an mmio operation is pending, Documentation/kvm/api.txt says the following: NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations. Thanks for the information. So the point is the vcpu state that can been observed from qemu upon KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used because it's not complete/consistent? Definitely. VCPU is in the middle of an instruction execution, so the state is undefined. One instruction may generate more then one IO exit during its execution BTW. Regarding the multiple IO exits, we're paying attention too. Although it depends on the guest behavior, if we limit the device model, one IO exit per one instruction may be practical at beggining. But thanks for pointing out. To solve the undefined VCPU state, how about keeping a copy of initial state upon VMEXIT? I guess there already is a similar shadow state in KVM. If possible we can allocate another one for this purpose. 
Re: Question on skip_emulated_instructions()
On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
> Avi Kivity wrote:
> >On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
> >>
> >>The problem here is that, I needed to transfer the VM state which is
> >>just *before* the output to the devices. Otherwise, the VM state has
> >>already been proceeded, and after failover, some I/O didn't work as I
> >>expected.
> >>I tracked down this issue, and figured out rip was already proceeded
> >>in KVM, and transferring this VCPU state was meaningless.
> >>
> >>I'm planning to post the patch set of Kemari soon, but I would like to
> >>solve this rip issue before that. If there is no drawback, I'm happy
> >>to work and post a patch.
> >
> >vcpu state is undefined when an mmio operation is pending,
> >Documentation/kvm/api.txt says the following:
> >
> >>NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
> >>operations are complete (and guest state is consistent) only after
> >>userspace has re-entered the kernel with KVM_RUN. The kernel side will
> >>first finish incomplete operations and then check for pending signals.
> >>Userspace can re-enter the guest with an unmasked signal pending to
> >>complete pending operations.
>
> Thanks for the information.
>
> So the point is the vcpu state that can be observed from qemu upon
> KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used
> because it's not complete/consistent?
>
Definitely. The VCPU is in the middle of an instruction execution, so the
state is undefined. One instruction may generate more than one IO exit
during its execution, BTW.

--
			Gleb.
Re: Question on skip_emulated_instructions()
Avi Kivity wrote: On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote: The problem here is that, I needed to transfer the VM state which is just *before* the output to the devices. Otherwise, the VM state has already been proceeded, and after failover, some I/O didn't work as I expected. I tracked down this issue, and figured out rip was already proceeded in KVM, and transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work and post a patch. vcpu state is undefined when an mmio operation is pending, Documentation/kvm/api.txt says the following: NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations. Thanks for the information. So the point is the vcpu state that can been observed from qemu upon KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used because it's not complete/consistent? Currently we complete instructions for output operations and leave them incomplete for input operations. Deferring completion for output operations should work, except it may break the vmware backdoor port (see hw/vmport.c), which changes register state following an output instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state following a write instruction. Do you really need to transfer the vcpu state before the instruction, or do you just need a consistent state? If the latter, then you can get away by posting a signal and re-entering the guest. kvm will complete the instruction and exit immediately, and you will have fully consistent state. 
The requirement is that the guest must always be able to replay at least the instruction which triggered the synchronization on the primary. From that point of view, I think I need to transfer the vcpu state before the instruction. If I post a signal and let the guest or emulator proceed, I'm not sure whether the guest on the secondary can be replay as expected. Please point out if I were misunderstanding. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on skip_emulated_instructions()
On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
> The problem here is that, I needed to transfer the VM state which is
> just *before* the output to the devices. Otherwise, the VM state has
> already been proceeded, and after failover, some I/O didn't work as I
> expected.
> I tracked down this issue, and figured out rip was already proceeded
> in KVM, and transferring this VCPU state was meaningless.
>
> I'm planning to post the patch set of Kemari soon, but I would like to
> solve this rip issue before that. If there is no drawback, I'm happy to
> work and post a patch.

vcpu state is undefined when an mmio operation is pending,
Documentation/kvm/api.txt says the following:

  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
  operations are complete (and guest state is consistent) only after
  userspace has re-entered the kernel with KVM_RUN. The kernel side will
  first finish incomplete operations and then check for pending signals.
  Userspace can re-enter the guest with an unmasked signal pending to
  complete pending operations.

Currently we complete instructions for output operations and leave them
incomplete for input operations. Deferring completion for output
operations should work, except it may break the vmware backdoor port
(see hw/vmport.c), which changes register state following an output
instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
following a write instruction.

Do you really need to transfer the vcpu state before the instruction, or
do you just need a consistent state? If the latter, then you can get
away by posting a signal and re-entering the guest. kvm will complete
the instruction and exit immediately, and you will have fully consistent
state.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
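The ordering in the api.txt note (finish incomplete operations first, then check for pending signals) is what makes the "post a signal and re-enter" trick work. The toy model below only illustrates that ordering; it is not kernel code and makes no real KVM calls. struct toy_vcpu, toy_kvm_run() and toy_reach_consistent_state() are invented names, and completing an operation is simplified to "advance rip by the instruction length".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a vcpu; every field here is a hypothetical simplification. */
struct toy_vcpu {
    uint64_t rip;            /* guest instruction pointer */
    bool io_pending;         /* exited to userspace mid-instruction */
    unsigned insn_len;       /* length of the instruction awaiting completion */
    bool signal_pending;     /* unmasked signal queued for the vcpu thread */
    unsigned long executed;  /* times the guest actually ran after re-entry */
};

enum toy_exit { TOY_EXIT_INTR, TOY_EXIT_IO };

/* Models a single KVM_RUN call, following the order given in the note. */
static enum toy_exit toy_kvm_run(struct toy_vcpu *v)
{
    if (v->io_pending) {            /* 1. finish the incomplete operation */
        assert(v->insn_len > 0);
        v->rip += v->insn_len;
        v->io_pending = false;
    }
    if (v->signal_pending) {        /* 2. only then look at pending signals */
        v->signal_pending = false;
        return TOY_EXIT_INTR;       /* back to userspace, guest did not run */
    }
    v->executed++;                  /* 3. guest runs until its next I/O exit */
    v->io_pending = true;
    return TOY_EXIT_IO;
}

/* "Post a signal and re-enter the guest": afterwards the state is
 * consistent and no further guest instruction has been executed. */
static bool toy_reach_consistent_state(struct toy_vcpu *v)
{
    v->signal_pending = true;
    return toy_kvm_run(v) == TOY_EXIT_INTR && !v->io_pending;
}
```

Under these assumptions, calling toy_reach_consistent_state() on a vcpu that has just taken an I/O exit completes the pending instruction and returns without incrementing `executed`, which is the point at which the state could be transferred.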
Re: Question on skip_emulated_instructions()
2010/4/8 Gleb Natapov : > On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote: >> 2010/4/6 Gleb Natapov : >> > On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote: >> >> Hi. >> >> >> >> When handle_io() is called, rip is currently proceeded *before* actually >> >> having >> >> I/O handled by qemu in userland. Upon implementing Kemari for >> >> KVM(http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly >> >> in >> >> userland qemu, we encountered a problem that synchronizing the content of >> >> VCPU >> >> before handling I/O in qemu is too late because rip is already proceeded >> >> in KVM, >> >> Although we avoided this issue with temporal hack, I would like to ask a >> >> few >> >> question on skip_emulated_instructions. >> >> >> >> 1. Does rip need to be proceeded before having I/O handled by qemu? >> > In current kvm.git rip is proceeded before I/O is handled by qemu only >> > in case of "out" instruction. From architecture point of view I think >> > it's OK since on real HW you can't guaranty that I/O will take effect >> > before instruction pointer is advanced. It is done like that because we >> > want "out" emulation to be real fast so we skip x86 emulator. >> >> Thanks for your reply. >> >> If proceeding rip later doesn't break the behavior of devices or >> introduce slow down, I would like that to be done. >> > Device can not care less about what value rip register currently has. > Why is it matters for you code? My code, Kemari is a mechanism to synchronize VMs to achieve fault tolerance. It transfers the whole VM state upon events such as disk or network output, so that the secondary server can keep continuing upon hardware failure. Please think it like continuous live migration. I've implemented this feature in userland qemu, which calls the live migration function when it detects any outputs from the device emulators. 
http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The problem here is that, I needed to transfer the VM state which is just *before* the output to the devices. Otherwise, the VM state has already been proceeded, and after failover, some I/O didn't work as I expected. I tracked down this issue, and figured out rip was already proceeded in KVM, and transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work and post a patch. >> >> 2. If no, is it possible to divide skip_emulated_instructions(), like >> >> rec_emulated_instructions() to remember to next_rip, and >> >> skip_emulated_instructions() to actually proceed the rip. >> > Currently only emulator can call userspace to do I/O, so after >> > userspace returns after I/O exit, control is handled back to emulator >> > unconditionally. "out" instruction skips emulator, but there is nothing >> > to do after userspace returns, so regular cpu loop is executed. If we >> > want to advance rip only after userspace executed I/O done by "out" we >> > need to distinguish who requested I/O (emulator or kvm_fast_pio_out()) >> > and call different code depending on who that was. It can be done by >> > having a callback that (if not null) is called on return from userspace. >> >> Your suggestion is to introduce a callback entry, and instead of >> calling kvm_rip_write(), set it to the entry before calling >> kvm_fast_pio_out(), >> and check the entry upon return from the userspace, correct? >> > Something like that, yes. OK. Let me work on that. >> According to the comment in x86.c, when it was "out" instruction >> vcpu->arch.pio.count is set to 0 to skip the emulator. >> To call kvm_fast_pio_out(), "!string" and "!in" must be set. 
>> If we can check, vcpu->arch.pio.count, "string" and "in" on return >> from the userspace, can't we distinguish who requested I/O, emulator >> or kvm_fast_pio_out()? >> > May be, but callback approach is much cleaner. "string" and "in" can have > stale data for instance. I see. I was thinking that can be a trade off between introducing a new variable. I'll take the callback approach first, and think again later if necessary. > >> >> 3. svm has next_rip but when it is 0, nop is emulated. Can this be >> >> modified to >>
Re: Question on skip_emulated_instructions()
On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote:
> 2010/4/6 Gleb Natapov:
> > On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
> >> Hi.
> >>
> >> When handle_io() is called, rip is currently proceeded *before* actually having
> >> I/O handled by qemu in userland. Upon implementing Kemari for
> >> KVM(http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
> >> userland qemu, we encountered a problem that synchronizing the content of VCPU
> >> before handling I/O in qemu is too late because rip is already proceeded in KVM,
> >> Although we avoided this issue with temporal hack, I would like to ask a few
> >> question on skip_emulated_instructions.
> >>
> >> 1. Does rip need to be proceeded before having I/O handled by qemu?
> > In current kvm.git rip is proceeded before I/O is handled by qemu only
> > in case of "out" instruction. From architecture point of view I think
> > it's OK since on real HW you can't guaranty that I/O will take effect
> > before instruction pointer is advanced. It is done like that because we
> > want "out" emulation to be real fast so we skip x86 emulator.
>
> Thanks for your reply.
>
> If proceeding rip later doesn't break the behavior of devices or
> introduce slow down, I would like that to be done.
>
A device could not care less about what value the rip register currently has. Why does it matter for your code?

> >> 2. If no, is it possible to divide skip_emulated_instructions(), like
> >> rec_emulated_instructions() to remember to next_rip, and
> >> skip_emulated_instructions() to actually proceed the rip.
> > Currently only emulator can call userspace to do I/O, so after
> > userspace returns after I/O exit, control is handled back to emulator
> > unconditionally. "out" instruction skips emulator, but there is nothing
> > to do after userspace returns, so regular cpu loop is executed. If we
> > want to advance rip only after userspace executed I/O done by "out" we
> > need to distinguish who requested I/O (emulator or kvm_fast_pio_out())
> > and call different code depending on who that was. It can be done by
> > having a callback that (if not null) is called on return from userspace.
>
> Your suggestion is to introduce a callback entry, and instead of
> calling kvm_rip_write(), set it to the entry before calling
> kvm_fast_pio_out(),
> and check the entry upon return from the userspace, correct?
>
Something like that, yes.

> According to the comment in x86.c, when it was "out" instruction
> vcpu->arch.pio.count is set to 0 to skip the emulator.
> To call kvm_fast_pio_out(), "!string" and "!in" must be set.
> If we can check, vcpu->arch.pio.count, "string" and "in" on return
> from the userspace, can't we distinguish who requested I/O, emulator
> or kvm_fast_pio_out()?
>
Maybe, but the callback approach is much cleaner. "string" and "in" can hold stale data, for instance.

> >> 3. svm has next_rip but when it is 0, nop is emulated. Can this be modified to
> >> continue without emulating nop when next_rip is 0?
> >>
> > I don't see where nop is emulated if next_rip is 0. As far as I see in
> > case of next_rip==0 an instruction at rip is decoded to figure out its
> > length and then rip is advanced by instruction length. Anyway next_rip
> > is svm thing only.
>
> Sorry. I wasn't understanding the code enough.
>
> static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
> {
> ...
>     if (!svm->next_rip) {
>         if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
>             EMULATE_DONE)
>             printk(KERN_DEBUG "%s: NOP\n", __func__);
>         return;
>     }
>
> Since the printk says NOP, I thought emulate_instruction was doing so...
>
> The reason I asked about next_rip is because I was hoping to use this
> entry to advance rip only after userspace executed I/O done by "out",
> like if next_rip is !0,
> call kvm_rip_write(), and introduce next_rip to vmx if it is usable
> because vmx is
> currently using local variable rip.
>
> Yoshi

--
Gleb.
Re: Question on skip_emulated_instructions()
2010/4/6 Gleb Natapov:
> On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
>> Hi.
>>
>> When handle_io() is called, rip is currently proceeded *before* actually having
>> I/O handled by qemu in userland. Upon implementing Kemari for
>> KVM(http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
>> userland qemu, we encountered a problem that synchronizing the content of VCPU
>> before handling I/O in qemu is too late because rip is already proceeded in KVM,
>> Although we avoided this issue with temporal hack, I would like to ask a few
>> question on skip_emulated_instructions.
>>
>> 1. Does rip need to be proceeded before having I/O handled by qemu?
> In current kvm.git rip is proceeded before I/O is handled by qemu only
> in case of "out" instruction. From architecture point of view I think
> it's OK since on real HW you can't guaranty that I/O will take effect
> before instruction pointer is advanced. It is done like that because we
> want "out" emulation to be real fast so we skip x86 emulator.

Thanks for your reply.

If advancing rip later doesn't break the behavior of devices or introduce a slowdown, I would like that to be done.

>> 2. If no, is it possible to divide skip_emulated_instructions(), like
>> rec_emulated_instructions() to remember to next_rip, and
>> skip_emulated_instructions() to actually proceed the rip.
> Currently only emulator can call userspace to do I/O, so after
> userspace returns after I/O exit, control is handled back to emulator
> unconditionally. "out" instruction skips emulator, but there is nothing
> to do after userspace returns, so regular cpu loop is executed. If we
> want to advance rip only after userspace executed I/O done by "out" we
> need to distinguish who requested I/O (emulator or kvm_fast_pio_out())
> and call different code depending on who that was. It can be done by
> having a callback that (if not null) is called on return from userspace.

Your suggestion is to introduce a callback entry and, instead of calling kvm_rip_write(), set the entry before calling kvm_fast_pio_out(), then check the entry upon return from userspace, correct?

According to the comment in x86.c, for an "out" instruction vcpu->arch.pio.count is set to 0 to skip the emulator. To call kvm_fast_pio_out(), "!string" and "!in" must hold. If we can check vcpu->arch.pio.count, "string" and "in" on return from userspace, can't we distinguish who requested the I/O, the emulator or kvm_fast_pio_out()?

>> 3. svm has next_rip but when it is 0, nop is emulated. Can this be modified to
>> continue without emulating nop when next_rip is 0?
>>
> I don't see where nop is emulated if next_rip is 0. As far as I see in
> case of next_rip==0 an instruction at rip is decoded to figure out its
> length and then rip is advanced by instruction length. Anyway next_rip
> is svm thing only.

Sorry. I didn't understand the code well enough.

static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
...
    if (!svm->next_rip) {
        if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
            EMULATE_DONE)
            printk(KERN_DEBUG "%s: NOP\n", __func__);
        return;
    }

Since the printk says NOP, I thought emulate_instruction() was doing so...

The reason I asked about next_rip is that I was hoping to use this field to advance rip only after userspace has executed the I/O done by "out": if next_rip is non-zero, call kvm_rip_write(). next_rip could also be introduced to vmx if it is usable, because vmx currently uses a local variable rip.

Yoshi
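For readers following the thread, the record-then-commit split that Yoshi asks about could look roughly like the user-space toy below. This is only a sketch under stated assumptions: struct toy_cpu and both toy_* functions are hypothetical names invented for illustration, nothing here is KVM code, and treating next_rip == 0 as "nothing recorded" is an assumption borrowed from the wording of the question.

```c
#include <assert.h>
#include <stddef.h>

/* Toy CPU state; a hypothetical stand-in, NOT the real struct kvm_vcpu. */
struct toy_cpu {
    unsigned long rip;
    unsigned long next_rip;   /* 0 means "no deferred advance recorded" */
};

/* Phase 1: remember where rip should go, without moving it yet. */
static void toy_rec_emulated_instruction(struct toy_cpu *cpu,
                                         unsigned long insn_len)
{
    assert(cpu != NULL);
    cpu->next_rip = cpu->rip + insn_len;
}

/*
 * Phase 2: commit the recorded advance, e.g. once userspace has finished
 * the I/O. Returns 1 if rip moved, 0 if nothing was pending.
 */
static int toy_skip_emulated_instruction(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    if (!cpu->next_rip)
        return 0;
    cpu->rip = cpu->next_rip;
    cpu->next_rip = 0;        /* consume the record so it cannot go stale */
    return 1;
}
```

Between the two calls rip still points at the "out" instruction, which is the state Kemari wants to transfer. Using 0 as the sentinel assumes a valid next rip is never 0.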
Re: Question on skip_emulated_instructions()
On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
> Hi.
>
> When handle_io() is called, rip is currently proceeded *before* actually having
> I/O handled by qemu in userland. Upon implementing Kemari for
> KVM(http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
> userland qemu, we encountered a problem that synchronizing the content of VCPU
> before handling I/O in qemu is too late because rip is already proceeded in KVM,
> Although we avoided this issue with temporal hack, I would like to ask a few
> question on skip_emulated_instructions.
>
> 1. Does rip need to be proceeded before having I/O handled by qemu?
In current kvm.git, rip is advanced before the I/O is handled by qemu only in the case of an "out" instruction. From an architectural point of view I think this is OK, since on real hardware you can't guarantee that the I/O will take effect before the instruction pointer is advanced. It is done like that because we want "out" emulation to be really fast, so we skip the x86 emulator.

> 2. If no, is it possible to divide skip_emulated_instructions(), like
> rec_emulated_instructions() to remember to next_rip, and
> skip_emulated_instructions() to actually proceed the rip.
Currently only the emulator can call userspace to do I/O, so when userspace returns after an I/O exit, control is handed back to the emulator unconditionally. The "out" instruction skips the emulator, but there is nothing to do after userspace returns, so the regular cpu loop is executed. If we want to advance rip only after userspace has executed the I/O done by "out", we need to distinguish who requested the I/O (the emulator or kvm_fast_pio_out()) and call different code depending on who that was. It can be done by having a callback that (if not null) is called on return from userspace.

> 3. svm has next_rip but when it is 0, nop is emulated. Can this be modified to
> continue without emulating nop when next_rip is 0?
>
I don't see where a nop is emulated if next_rip is 0. As far as I can see, in the case of next_rip==0 the instruction at rip is decoded to figure out its length, and then rip is advanced by the instruction length. Anyway, next_rip is an svm-only thing.

--
Gleb.
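The callback idea in the reply above (a hook that, if not null, is called on return from userspace) can be illustrated with a small user-space toy. This is a sketch, not KVM code: struct toy_vcpu, the complete_io field and all toy_* functions are hypothetical names made up for this example, and the real run loop is far more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Toy vcpu; a hypothetical stand-in, NOT the real struct kvm_vcpu. */
struct toy_vcpu {
    unsigned long rip;
    unsigned long insn_len;   /* length of the pending "out" instruction */
    /* Hypothetical hook: run once when userspace returns from an I/O exit. */
    void (*complete_io)(struct toy_vcpu *vcpu);
};

/* Deferred completion for the fast "out" path: advance rip only now. */
static void toy_complete_fast_pio_out(struct toy_vcpu *vcpu)
{
    vcpu->rip += vcpu->insn_len;
}

/* Fast "out" path: hand the I/O to userspace but leave rip untouched. */
static void toy_fast_pio_out(struct toy_vcpu *vcpu, unsigned long insn_len)
{
    assert(vcpu != NULL);
    vcpu->insn_len = insn_len;
    vcpu->complete_io = toy_complete_fast_pio_out;
}

/* Called when userspace re-enters after handling the I/O exit. */
static void toy_return_from_userspace(struct toy_vcpu *vcpu)
{
    assert(vcpu != NULL);
    if (vcpu->complete_io) {
        void (*cb)(struct toy_vcpu *) = vcpu->complete_io;
        vcpu->complete_io = NULL;   /* one-shot: clear before calling */
        cb(vcpu);
    }
}
```

Because the hook is set only by the path that requested the I/O, the return path does not have to inspect flags such as "string" or "in" to work out who asked; a path that needs no deferred work simply leaves the hook null.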
Question on skip_emulated_instructions()
Hi.

When handle_io() is called, rip is currently advanced *before* the I/O is actually handled by qemu in userland. While implementing Kemari for KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html), mainly in userland qemu, we encountered a problem: synchronizing the VCPU state before handling I/O in qemu is too late, because rip has already been advanced in KVM. Although we avoided this issue with a temporary hack, I would like to ask a few questions about skip_emulated_instruction().

1. Does rip need to be advanced before the I/O is handled by qemu?
2. If not, is it possible to split skip_emulated_instruction() in two, e.g. rec_emulated_instructions() to record next_rip, and skip_emulated_instruction() to actually advance rip?
3. svm has next_rip, but when it is 0, a nop is emulated. Can this be modified to continue without emulating a nop when next_rip is 0?

Thanks,
Yoshi