Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity


On 04/08/2010 04:42 PM, Yoshiaki Tamura wrote:
>> Yes, you can release the I/O from the iothread instead of the vcpu thread.
>> You can make virtio_net_handle_tx() disable virtio notifications,
>> initiate state sync, and return; when state sync completes you can call the
>> original virtio_net_handle_tx().  If the secondary takes over, it needs to
>> call the original virtio_net_handle_tx() as well.
>
> Agreed.  Let me try it.
> Meanwhile, I'll post what I have done, including the hack that prevents
> rip from advancing.
> I would appreciate it if you could comment on that too, to keep things
> heading in a good direction.

Certainly.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

2010/4/8 Avi Kivity :
> On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote:
>>>
>>> I don't think you can in the general case. But if you gate output at the
>>> device level, instead of the instruction level, the problem goes away,
>>> no?
>>
>> Yes, it should.
>> To implement this, we need step 3 to be called asynchronously.  If
>> qemu already handles I/O asynchronously, this would be relatively easy
>> to do.
>
> Yes, you can release the I/O from the iothread instead of the vcpu thread.
> You can make virtio_net_handle_tx() disable virtio notifications,
> initiate state sync, and return; when state sync completes you can call the
> original virtio_net_handle_tx().  If the secondary takes over, it needs to
> call the original virtio_net_handle_tx() as well.

Agreed.  Let me try it.
Meanwhile, I'll post what I have done, including the hack that prevents
rip from advancing.
I would appreciate it if you could comment on that too, to keep things
heading in a good direction.

Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity


On 04/08/2010 12:14 PM, Yoshiaki Tamura wrote:
>> I don't think you can in the general case. But if you gate output at the
>> device level, instead of the instruction level, the problem goes
>> away, no?
>
> Yes, it should.
> To implement this, we need step 3 to be called asynchronously.
> If qemu already handles I/O asynchronously, this would be relatively
> easy to do.

Yes, you can release the I/O from the iothread instead of the vcpu
thread.  You can make virtio_net_handle_tx() disable virtio
notifications, initiate state sync, and return; when state sync
completes you can call the original virtio_net_handle_tx().  If the
secondary takes over, it needs to call the original
virtio_net_handle_tx() as well.
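A minimal sketch of the deferred-transmit flow described here, written
against a toy device model rather than qemu itself.  Everything named
toy_* is hypothetical: toy_flush_tx() stands in for the original
virtio_net_handle_tx(), and setting sync_pending stands in for
"initiate state sync".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a virtio-net device; every field is hypothetical. */
struct toy_net {
    bool notify_enabled;  /* guest-to-host kick notifications */
    bool sync_pending;    /* a state sync to the secondary is in flight */
    int  queued;          /* packets the guest has placed on the tx ring */
    int  sent;            /* packets actually released to the wire */
};

/* Plays the role of the original virtio_net_handle_tx(): flush the ring. */
void toy_flush_tx(struct toy_net *n)
{
    n->sent += n->queued;
    n->queued = 0;
}

/* Gated handler (vcpu thread): hold the output, start the sync, return. */
void toy_handle_tx(struct toy_net *n)
{
    if (n->sync_pending)
        return;                 /* packets stay queued until the sync ends */
    n->notify_enabled = false;  /* no further kicks while syncing */
    n->sync_pending = true;     /* stands in for "initiate state sync" */
}

/* Sync completion (iothread): release the held output. */
void toy_sync_done(struct toy_net *n)
{
    assert(n->sync_pending);
    n->sync_pending = false;
    toy_flush_tx(n);
    n->notify_enabled = true;
}

/* Failover on the secondary: the transferred ring still holds the packets. */
void toy_takeover(struct toy_net *n)
{
    n->sync_pending = false;
    toy_flush_tx(n);
    n->notify_enabled = true;
}
```

In this model nothing reaches the wire between toy_handle_tx() and
toy_sync_done(), and a secondary that resumes from the transferred
state sends the same queued packets through toy_takeover().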


--
error compiling committee.c: too many arguments to function


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura

Avi Kivity wrote:
> On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote:
>>
>> If I transferred a VM after I/O operations, let's say the VM sent a
>> TCP ACK to the client, and if a hardware failure occurred on the
>> primary during the VM transfer *but the client received the TCP
>> ACK*, the secondary will resume from the previous state, and it may
>> need to receive some data from the client. However, because the client
>> has already received the TCP ACK, it won't resend the data to the
>> secondary. It looks like this data is going to be dropped. Am I missing
>> some point here?
>>
>
> I think you should block I/O not at the cpu/device boundary (that's
> inefficient as many cpu I/O instructions don't necessarily cause
> externally visible I/O) but at the device level. Whenever the network
> device wants to send out a packet, halt the guest (letting any I/O
> instructions complete), synchronize the secondary, and then release the
> pending I/O. This ensures that the secondary has all of the data prior
> to the ack being sent out.

Although I was planning to clean up my current code first, maybe I should post
the current status now, for explanation.  As you mentioned, I'm capturing I/O at
the device level, by inserting a hook inside the PIO/MMIO handlers of the
virtio-blk, virtio-net and e1000 emulators.  Since it's implemented naively, it
stalls (meaning I/O instructions are delayed) until the VM transfer is done.

So what I can do here is:

1. Let I/O instructions complete in both qemu and kvm.
2. Transfer the guest state.
   # The VCPU and device model think I/O emulation is already done.
3. Finally, release the pending output to the real world.

>>> If the responses to the mmio or pio request are exactly the same,
>>> then the replay will happen exactly the same.
>>
>> I agree. What I'm wondering is how can we guarantee that the responses
>> are the same...
>
> I don't think you can in the general case. But if you gate output at the
> device level, instead of the instruction level, the problem goes away, no?

Yes, it should.
To implement this, we need step 3 to be called asynchronously.  If qemu
already handles I/O asynchronously, this would be relatively easy to do.


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity


On 04/08/2010 11:10 AM, Yoshiaki Tamura wrote:
>> If the responses to the mmio or pio request are exactly the same,
>> then the replay will happen exactly the same.
>
> I agree.  What I'm wondering is how can we guarantee that the
> responses are the same...

I don't think you can in the general case.  But if you gate output at
the device level, instead of the instruction level, the problem goes
away, no?


--
error compiling committee.c: too many arguments to function


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity


On 04/08/2010 11:30 AM, Yoshiaki Tamura wrote:
> If I transferred a VM after I/O operations, let's say the VM sent a
> TCP ACK to the client, and if a hardware failure occurred on the
> primary during the VM transfer *but the client received the TCP
> ACK*, the secondary will resume from the previous state, and it may
> need to receive some data from the client. However, because the client
> has already received the TCP ACK, it won't resend the data to the
> secondary.  It looks like this data is going to be dropped.  Am I missing
> some point here?

I think you should block I/O not at the cpu/device boundary (that's
inefficient as many cpu I/O instructions don't necessarily cause
externally visible I/O) but at the device level.  Whenever the network
device wants to send out a packet, halt the guest (letting any I/O
instructions complete), synchronize the secondary, and then release the
pending I/O.  This ensures that the secondary has all of the data prior
to the ack being sent out.
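The TCP ACK scenario can be worked through with a small model.  This is
only an illustration under simplifying assumptions: one exchange, byte
counters instead of real TCP state, and a primary that fails right after
the first of the two actions (releasing the ACK, or syncing).  All toy_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_guest  { int received; };         /* bytes the guest has consumed */
struct toy_client { int sent; int acked; };  /* bytes sent / seen as ACKed */

/*
 * One "client sends `bytes`, guest ACKs" exchange.  The primary fails
 * right after the first action it takes.  Returns the number of bytes
 * that the secondary does not hold and the client will not retransmit.
 */
int toy_lost_bytes(bool gate_output, int bytes)
{
    struct toy_guest primary = {0}, secondary = {0};
    struct toy_client client = {0, 0};

    client.sent += bytes;
    primary.received += bytes;            /* guest consumed the data */

    if (gate_output) {
        secondary = primary;              /* sync first */
        /* primary fails here: the ACK was never released */
    } else {
        client.acked = primary.received;  /* ACK escapes first */
        /* primary fails here: the sync never happened */
    }

    /* The client retransmits only what was not ACKed. */
    int retransmittable = client.sent - client.acked;
    int missing = client.sent - secondary.received;
    return missing > retransmittable ? missing - retransmittable : 0;
}
```

With gate_output false the ACKed bytes are lost after failover; with
gate_output true nothing is lost, because either the secondary already
holds the data or the client still considers it unacknowledged.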


--
error compiling committee.c: too many arguments to function


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura


Avi Kivity wrote:
> On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote:
>>
>> To answer your question, it should be possible to implement.
>> The downside is that after going into KVM to make the guest state
>> consistent, we need to go back to qemu to actually transfer the guest,
>> and this bounce would introduce additional overhead, if I'm understanding
>> correctly.
>
> Yes. It should be around a microsecond or so; given that you will issue I/O
> after this, I don't think it will affect performance.

That is good news.

>> And yes, all I need is some consistent state to resume the VM from, which
>> must be able to continue I/O operations, like writing to disks and
>> sending an ACK over the network. If I can guarantee this, sending the VM
>> state after completing output is acceptable.
>
> I suggest you start with this. If it turns out performance is severely
> impacted, we can revisit instruction completion. If performance is
> satisfactory, then we'll be able to run Kemari with older kernels.

I was about to say yes here, but let me ask one more question.
BTW, thank you both for taking the time for this discussion, which isn't
strictly a KVM topic.

If I transferred a VM after I/O operations, let's say the VM sent a TCP ACK to
the client, and if a hardware failure occurred on the primary during the VM
transfer *but the client received the TCP ACK*, the secondary will resume
from the previous state, and it may need to receive some data from the client.
However, because the client has already received the TCP ACK, it won't resend
the data to the secondary.  It looks like this data is going to be dropped.
Am I missing some point here?



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura


Gleb Natapov wrote:
> On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:
>> On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:
>>>
>>> The requirement is that the guest must always be able to replay at
>>> least the instruction which triggered the synchronization on the
>>> primary.
>>
>> You have two choices:
>>
>> - complete execution of the instruction in both the kernel and the
>> device model
>>
>> This is what live migration does currently.  Any mmio and pio
>> requests are completed, the last instruction is finalized, and state
>> is saved.
>>
>> - complete execution of the instruction in the kernel, but queue
>> execution of mmio/pio requests
>>
>> This is more in line with what you describe.  vcpu state will be
>> after the instruction, device model state will be before instruction
>> completion; when you replay the queue, the device model state will
>> be consistent with the vcpu state.
>
> For "in" or "mmio read" you can't complete instruction without doing
> actual IO.

So, if the mmio/pio requests in the queue are only "out" or "mmio write", Avi's
second suggestion would work.  But if "in" or "mmio read" requests are mixed in
with these, the story gets complicated.  (We don't have to consider a queue
filled with only "in" or "mmio read", because we currently transfer only on
"out" or "mmio write".)

>>> From that point of view, I think I need to transfer the vcpu
>>> state before the instruction.  If I post a signal and let the
>>> guest or emulator proceed, I'm not sure whether the guest on the
>>> secondary can be replayed as expected.  Please point out if I am
>>> misunderstanding.
>>
>> If the responses to the mmio or pio request are exactly the same,
>> then the replay will happen exactly the same.

I agree.  What I'm wondering is how can we guarantee that the responses are the
same...


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity


On 04/08/2010 10:30 AM, Yoshiaki Tamura wrote:
> To answer your question, it should be possible to implement.
> The downside is that after going into KVM to make the guest state
> consistent, we need to go back to qemu to actually transfer the guest,
> and this bounce would introduce additional overhead, if I'm understanding
> correctly.

Yes.  It should be around a microsecond or so; given that you will issue I/O
after this, I don't think it will affect performance.

> And yes, all I need is some consistent state to resume the VM from, which
> must be able to continue I/O operations, like writing to disks and
> sending an ACK over the network.  If I can guarantee this, sending the VM
> state after completing output is acceptable.

I suggest you start with this.  If it turns out performance is severely
impacted, we can revisit instruction completion.  If performance is
satisfactory, then we'll be able to run Kemari with older kernels.


--
error compiling committee.c: too many arguments to function


Re: Question on skip_emulated_instructions()

2010-04-08 Thread Yoshiaki Tamura


Gleb Natapov wrote:
> On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
>>> Currently we complete instructions for output operations and leave them
>>> incomplete for input operations. Deferring completion for output
>>> operations should work, except it may break the vmware backdoor port
>>> (see hw/vmport.c), which changes register state following an output
>>> instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
>>> following a write instruction.
>>>
>>> Do you really need to transfer the vcpu state before the instruction, or
>>> do you just need a consistent state? If the latter, then you can get
>>> away with posting a signal and re-entering the guest. kvm will complete
>>> the instruction and exit immediately, and you will have fully consistent
>>> state.
>>
>> The requirement is that the guest must always be able to replay at
>> least the instruction which triggered the synchronization on the
>> primary.  From that point of view, I think I need to transfer the
>> vcpu state before the instruction.  If I post a signal and let the
>> guest or emulator proceed, I'm not sure whether the guest on the
>> secondary can be replayed as expected.  Please point out if I am
>> misunderstanding.
>
> All you need is some consistent state to restart the VM from, no? So if you
> transfer the VM state after the instruction that caused the IO has completed,
> you can restart the VM on the secondary from that state in case the primary
> fails. I guess my question is: can you make the synchronization point
> immediately after the IO instruction instead of before?

To answer your question, it should be possible to implement.
The downside is that after going into KVM to make the guest state
consistent, we need to go back to qemu to actually transfer the guest, and this
bounce would introduce additional overhead, if I'm understanding correctly.

And yes, all I need is some consistent state to resume the VM from, which must
be able to continue I/O operations, like writing to disks and sending an ACK
over the network.  If I can guarantee this, sending the VM state after
completing output is acceptable.



Re: Question on skip_emulated_instructions()

2010-04-08 Thread Gleb Natapov

On Thu, Apr 08, 2010 at 10:17:01AM +0300, Avi Kivity wrote:
> On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:
> >
> >The requirement is that the guest must always be able to replay at
> >least the instruction which triggered the synchronization on the
> >primary.
> 
> 
> You have two choices:
> 
>  - complete execution of the instruction in both the kernel and the
> device model
> 
> This is what live migration does currently.  Any mmio and pio
> requests are completed, the last instruction is finalized, and state
> is saved.
> 
>  - complete execution of the instruction in the kernel, but queue
> execution of mmio/pio requests
> 
> This is more in line with what you describe.  vcpu state will be
> after the instruction, device model state will be before instruction
> completion, when you replay the queue, the device model state will
> be consistent with the vcpu state.
> 
For "in" or "mmio read" you can't complete instruction without doing
actual IO.

> >  From that point of view, I think I need to transfer the vcpu
> >state before the instruction.  If I post a signal and let the
> >guest or emulator proceed, I'm not sure whether the guest on the
> >secondary can be replayed as expected.  Please point out if I am
> >misunderstanding.
> 
> If the responses to the mmio or pio request are exactly the same,
> then the replay will happen exactly the same.
> 
> -- 
> error compiling committee.c: too many arguments to function

--
Gleb.

Re: Question on skip_emulated_instructions()

2010-04-08 Thread Avi Kivity


On 04/08/2010 08:27 AM, Yoshiaki Tamura wrote:
> The requirement is that the guest must always be able to replay at
> least the instruction which triggered the synchronization on the primary.

You have two choices:

 - complete execution of the instruction in both the kernel and the
   device model

This is what live migration does currently.  Any mmio and pio requests
are completed, the last instruction is finalized, and state is saved.

 - complete execution of the instruction in the kernel, but queue
   execution of mmio/pio requests

This is more in line with what you describe.  vcpu state will be after
the instruction, device model state will be before instruction
completion; when you replay the queue, the device model state will be
consistent with the vcpu state.

> From that point of view, I think I need to transfer the vcpu state
> before the instruction.  If I post a signal and let the guest or
> emulator proceed, I'm not sure whether the guest on the secondary can
> be replayed as expected.  Please point out if I am misunderstanding.

If the responses to the mmio or pio request are exactly the same, then
the replay will happen exactly the same.
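The second choice (complete the instruction in the kernel, queue the
mmio/pio requests, replay them later) might look roughly like the toy
model below.  It is a sketch, not kvm or qemu code: the register-file
device and every toy_* name are assumptions made for illustration.  The
queue accepts only writes, because a read needs the device's answer
before the instruction can finish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_QUEUE_LEN 8
#define TOY_NUM_REGS  4

/* One pending pio/mmio request; hypothetical layout. */
struct toy_io_req { bool is_write; uint8_t reg; uint32_t val; };

/* Toy device model: a small register file. */
struct toy_device { uint32_t regs[TOY_NUM_REGS]; };

struct toy_io_queue {
    struct toy_io_req reqs[TOY_QUEUE_LEN];
    size_t len;
};

/* Queue a write for later replay.  Reads, bad registers and a full
 * queue are rejected: the caller must handle those synchronously. */
bool toy_queue_io(struct toy_io_queue *q, struct toy_io_req r)
{
    if (!r.is_write || r.reg >= TOY_NUM_REGS || q->len == TOY_QUEUE_LEN)
        return false;
    q->reqs[q->len++] = r;
    return true;
}

/* Apply the queued writes in order, then empty the queue. */
void toy_replay(struct toy_io_queue *q, struct toy_device *d)
{
    for (size_t i = 0; i < q->len; i++)
        d->regs[q->reqs[i].reg] = q->reqs[i].val;
    q->len = 0;
}
```

If the queue is transferred together with the device state, replaying
it on either side brings the device model to the same state, which is
the consistency property described above.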


--
error compiling committee.c: too many arguments to function


Re: Question on skip_emulated_instructions()

2010-04-07 Thread Gleb Natapov

On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
> >Currently we complete instructions for output operations and leave them
> >incomplete for input operations. Deferring completion for output
> >operations should work, except it may break the vmware backdoor port
> >(see hw/vmport.c), which changes register state following an output
> >instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
> >following a write instruction.
> >
> >Do you really need to transfer the vcpu state before the instruction, or
> >do you just need a consistent state? If the latter, then you can get
> >away by posting a signal and re-entering the guest. kvm will complete
> >the instruction and exit immediately, and you will have fully consistent
> >state.
> 
> The requirement is that the guest must always be able to replay at
> least the instruction which triggered the synchronization on the
> primary.  From that point of view, I think I need to transfer the
> vcpu state before the instruction.  If I post a signal and let the
> guest or emulator proceed, I'm not sure whether the guest on the
> secondary can be replayed as expected.  Please point out if I am
> misunderstanding.

All you need is some consistent state to restart the VM from, no? So if you
transfer the VM state after the instruction that caused the IO has completed,
you can restart the VM on the secondary from that state in case the primary
fails. I guess my question is: can you make the synchronization point
immediately after the IO instruction instead of before?

--
Gleb.

Re: Question on skip_emulated_instructions()

2010-04-07 Thread Yoshiaki Tamura


Gleb Natapov wrote:
> On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
>> Avi Kivity wrote:
>>> On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
>>>>
>>>> The problem here is that I needed to transfer the VM state as it is
>>>> just *before* the output to the devices. Otherwise, the VM state has
>>>> already advanced, and after failover, some I/O didn't work as I
>>>> expected.
>>>> I tracked down this issue, and figured out that rip had already been
>>>> advanced in KVM, so transferring this VCPU state was meaningless.
>>>>
>>>> I'm planning to post the Kemari patch set soon, but I would like to
>>>> solve this rip issue before that. If there is no drawback, I'm happy
>>>> to work on it and post a patch.
>>>
>>> vcpu state is undefined when an mmio operation is pending,
>>> Documentation/kvm/api.txt says the following:
>>>
>>>> NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
>>>> operations are complete (and guest state is consistent) only after
>>>> userspace has re-entered the kernel with KVM_RUN. The kernel side will
>>>> first finish incomplete operations and then check for pending signals.
>>>> Userspace can re-enter the guest with an unmasked signal pending to
>>>> complete pending operations.
>>
>> Thanks for the information.
>>
>> So the point is that the vcpu state that can be observed from qemu upon
>> KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used
>> because it's not complete/consistent?
>
> Definitely. VCPU is in the middle of an instruction execution, so the
> state is undefined. One instruction may generate more than one IO exit
> during its execution BTW.

Regarding the multiple IO exits, we're paying attention to that too.  Although
it depends on the guest behavior, if we limit the device model, one IO exit per
instruction may be practical at the beginning.  But thanks for pointing it out.

To solve the undefined VCPU state, how about keeping a copy of the initial
state upon VMEXIT?  I guess there already is a similar shadow state in KVM.  If
possible, we can allocate another one for this purpose.


Re: Question on skip_emulated_instructions()

2010-04-07 Thread Gleb Natapov

On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
> Avi Kivity wrote:
> >On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
> >>
> >>The problem here is that I needed to transfer the VM state as it is
> >>just *before* the output to the devices. Otherwise, the VM state has
> >>already advanced, and after failover, some I/O didn't work as I
> >>expected.
> >>I tracked down this issue, and figured out that rip had already been
> >>advanced in KVM, so transferring this VCPU state was meaningless.
> >>
> >>I'm planning to post the Kemari patch set soon, but I would like to
> >>solve this rip issue before that. If there is no drawback, I'm happy
> >>to work on it and post a patch.
> >
> >vcpu state is undefined when an mmio operation is pending,
> >Documentation/kvm/api.txt says the following:
> >
> >>NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
> >>operations are complete (and guest state is consistent) only after
> >>userspace
> >>has re-entered the kernel with KVM_RUN. The kernel side will first finish
> >>incomplete operations and then check for pending signals. Userspace
> >>can re-enter the guest with an unmasked signal pending to complete
> >>pending operations.
> 
> Thanks for the information.
> 
> So the point is that the vcpu state that can be observed from qemu upon
> KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used
> because it's not complete/consistent?
>
Definitely. VCPU is in the middle of an instruction execution, so the
state is undefined. One instruction may generate more than one IO exit
during its execution BTW.

--
Gleb.

Re: Question on skip_emulated_instructions()

2010-04-07 Thread Yoshiaki Tamura


Avi Kivity wrote:
> On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
>>
>> The problem here is that I needed to transfer the VM state as it is
>> just *before* the output to the devices. Otherwise, the VM state has
>> already advanced, and after failover, some I/O didn't work as I
>> expected.
>> I tracked down this issue, and figured out that rip had already been
>> advanced in KVM, so transferring this VCPU state was meaningless.
>>
>> I'm planning to post the Kemari patch set soon, but I would like to
>> solve this rip issue before that. If there is no drawback, I'm happy
>> to work on it and post a patch.
>
> vcpu state is undefined when an mmio operation is pending,
> Documentation/kvm/api.txt says the following:
>
>> NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
>> operations are complete (and guest state is consistent) only after
>> userspace has re-entered the kernel with KVM_RUN. The kernel side will
>> first finish incomplete operations and then check for pending signals.
>> Userspace can re-enter the guest with an unmasked signal pending to
>> complete pending operations.

Thanks for the information.

So the point is that the vcpu state that can be observed from qemu upon
KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used because it's not
complete/consistent?

> Currently we complete instructions for output operations and leave them
> incomplete for input operations. Deferring completion for output
> operations should work, except it may break the vmware backdoor port
> (see hw/vmport.c), which changes register state following an output
> instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
> following a write instruction.
>
> Do you really need to transfer the vcpu state before the instruction, or
> do you just need a consistent state? If the latter, then you can get
> away with posting a signal and re-entering the guest. kvm will complete
> the instruction and exit immediately, and you will have fully consistent
> state.

The requirement is that the guest must always be able to replay at least the
instruction which triggered the synchronization on the primary.  From that
point of view, I think I need to transfer the vcpu state before the
instruction.  If I post a signal and let the guest or emulator proceed, I'm not
sure whether the guest on the secondary can be replayed as expected.  Please
point out if I am misunderstanding.


Re: Question on skip_emulated_instructions()

2010-04-07 Thread Avi Kivity


On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
> The problem here is that I needed to transfer the VM state as it is
> just *before* the output to the devices.  Otherwise, the VM state has
> already advanced, and after failover, some I/O didn't work as I expected.
> I tracked down this issue, and figured out that rip had already been
> advanced in KVM, so transferring this VCPU state was meaningless.
>
> I'm planning to post the Kemari patch set soon, but I would like to solve
> this rip issue before that.  If there is no drawback, I'm happy to work
> on it and post a patch.

vcpu state is undefined when an mmio operation is pending,
Documentation/kvm/api.txt says the following:

> NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
> operations are complete (and guest state is consistent) only after
> userspace has re-entered the kernel with KVM_RUN.  The kernel side will
> first finish incomplete operations and then check for pending signals.
> Userspace can re-enter the guest with an unmasked signal pending to
> complete pending operations.

Currently we complete instructions for output operations and leave them
incomplete for input operations.  Deferring completion for output
operations should work, except it may break the vmware backdoor port
(see hw/vmport.c), which changes register state following an output
instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
following a write instruction.

Do you really need to transfer the vcpu state before the instruction, or
do you just need a consistent state?  If the latter, then you can get
away with posting a signal and re-entering the guest.  kvm will complete
the instruction and exit immediately, and you will have fully consistent
state.
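The "post a signal and re-enter" sequence can be illustrated with a toy
model of the run loop.  This is not the kvm API: the struct, the toy_*
functions and the way rip is handled are all assumptions chosen to show
the ordering the api.txt note describes (finish the incomplete
operation first, then check for pending signals).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_exit { TOY_EXIT_IO, TOY_EXIT_INTR };

/* Hypothetical vcpu model; not the kernel's data structure. */
struct toy_run_vcpu {
    uint64_t rip;
    uint64_t next_rip;    /* where rip lands once the instruction completes */
    bool io_pending;      /* I/O exit reported, instruction not yet finished */
    bool signal_pending;  /* userspace posted a signal before re-entering */
};

/* The guest hits an I/O instruction of `len` bytes: exit to userspace. */
enum toy_exit toy_io_exit(struct toy_run_vcpu *v, unsigned len)
{
    v->io_pending = true;
    v->next_rip = v->rip + len;
    return TOY_EXIT_IO;
}

/* Model of re-entering the kernel: complete the pending operation
 * first, then check for a pending signal. */
enum toy_exit toy_run(struct toy_run_vcpu *v)
{
    if (v->io_pending) {
        v->rip = v->next_rip;
        v->io_pending = false;
    }
    if (v->signal_pending) {
        v->signal_pending = false;
        return TOY_EXIT_INTR;  /* back in userspace with consistent state */
    }
    /* A real vcpu would run guest code here; the toy reports another I/O. */
    return toy_io_exit(v, 1);
}

/* True when the state is safe to transfer to the secondary. */
bool toy_state_consistent(const struct toy_run_vcpu *v)
{
    return !v->io_pending;
}
```

In the model, the state seen right after toy_io_exit() is not safe to
transfer; setting signal_pending and calling toy_run() once returns
immediately with the instruction completed.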


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


Re: Question on skip_emulated_instructions()

2010-04-07 Thread Yoshiaki Tamura

2010/4/8 Gleb Natapov :
> On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote:
>> 2010/4/6 Gleb Natapov :
>> > On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
>> >> Hi.
>> >>
>> >> When handle_io() is called, rip is currently advanced *before* the I/O
>> >> is actually handled by qemu in userland.  While implementing Kemari for
>> >> KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html), mainly
>> >> in userland qemu, we encountered a problem: synchronizing the content of
>> >> the VCPU before handling I/O in qemu is too late, because rip has already
>> >> been advanced in KVM.  Although we avoided this issue with a temporary
>> >> hack, I would like to ask a few questions on skip_emulated_instructions().
>> >>
>> >> 1. Does rip need to be advanced before having I/O handled by qemu?
>> > In current kvm.git rip is advanced before I/O is handled by qemu only
>> > in the case of the "out" instruction. From an architecture point of view I
>> > think it's OK, since on real HW you can't guarantee that I/O will take
>> > effect before the instruction pointer is advanced. It is done like that
>> > because we want "out" emulation to be really fast, so we skip the x86
>> > emulator.
>>
>> Thanks for your reply.
>>
>> If advancing rip later doesn't break the behavior of devices or
>> introduce a slowdown, I would like that to be done.
>>
> A device could not care less about what value the rip register currently has.
> Why does it matter for your code?

My code, Kemari, is a mechanism to synchronize VMs to achieve fault tolerance.
It transfers the whole VM state upon events such as disk or network output,
so that the secondary server can keep running upon hardware failure.
Please think of it as continuous live migration.
I've implemented this feature in userland qemu, which calls the live migration
function when it detects any output from the device emulators.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The problem here is that I needed to transfer the VM state as it is
just *before* the output to the devices.  Otherwise, the VM state has
already advanced, and after failover, some I/O didn't work as I expected.
I tracked down this issue, and figured out that rip had already been advanced
in KVM, so transferring this VCPU state was meaningless.

I'm planning to post the Kemari patch set soon, but I would like to solve
this rip issue before that.  If there is no drawback, I'm happy to work
on it and post a patch.

>> >> 2. If no, is it possible to split skip_emulated_instructions() into, say,
>> >> rec_emulated_instructions() to remember the next_rip, and
>> >> skip_emulated_instructions() to actually advance rip?
>> > Currently only the emulator can call userspace to do I/O, so after
>> > userspace returns from an I/O exit, control is handed back to the emulator
>> > unconditionally.  The "out" instruction skips the emulator, but there is
>> > nothing to do after userspace returns, so the regular cpu loop is executed.
>> > If we want to advance rip only after userspace has executed the I/O done by
>> > "out", we need to distinguish who requested the I/O (emulator or
>> > kvm_fast_pio_out()) and call different code depending on who that was. It
>> > can be done by having a callback that (if not null) is called on return
>> > from userspace.
>>
>> Your suggestion is to introduce a callback entry, and instead of
>> calling kvm_rip_write(), set it to the entry before calling
>> kvm_fast_pio_out(),
>> and check the entry upon return from the userspace, correct?
>>
> Something like that, yes.

OK.  Let me work on that.
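
A minimal sketch of the callback idea discussed above, as a toy model. The struct, the `toy_` functions, and the `pending_rip` and `complete_io` fields are hypothetical names invented here; only the overall shape (record a completion callback in the fast "out" path, run it if non-null on return from userspace) comes from the discussion. It is not the actual KVM implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vcpu;
typedef void (*toy_io_done_fn)(struct toy_vcpu *vcpu);

/* Hypothetical per-vcpu state; the field names are assumptions. */
struct toy_vcpu {
    uint64_t rip;
    uint64_t pending_rip;          /* rip to install once the I/O is done */
    toy_io_done_fn complete_io;    /* NULL when nothing is deferred */
};

/* Completion step for the fast "out" path: only now does rip move. */
static void toy_complete_fast_out(struct toy_vcpu *vcpu)
{
    vcpu->rip = vcpu->pending_rip;
}

/*
 * Toy stand-in for the fast "out" path: remember where rip should go and
 * register the callback, but leave rip untouched while userspace works.
 */
static void toy_fast_pio_out(struct toy_vcpu *vcpu, uint64_t insn_len)
{
    assert(vcpu != NULL);
    assert(vcpu->complete_io == NULL);   /* one outstanding I/O at a time */
    vcpu->pending_rip = vcpu->rip + insn_len;
    vcpu->complete_io = toy_complete_fast_out;
}

/* Called when control comes back from userspace after the I/O exit. */
static void toy_return_from_userspace(struct toy_vcpu *vcpu)
{
    assert(vcpu != NULL);
    if (vcpu->complete_io != NULL) {
        toy_io_done_fn fn = vcpu->complete_io;
        vcpu->complete_io = NULL;        /* clear before running it */
        fn(vcpu);
    }
}
```

With this shape, a state sync performed while userspace handles the I/O would still see rip pointing at the "out" instruction, and a request that came from the emulator simply leaves the callback null.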

>> According to the comment in x86.c, for the "out" instruction
>> vcpu->arch.pio.count is set to 0 to skip the emulator.
>> To call kvm_fast_pio_out(), "!string" and "!in" must hold.
>> If we can check vcpu->arch.pio.count, "string" and "in" on return
>> from userspace, can't we distinguish who requested the I/O, the emulator
>> or kvm_fast_pio_out()?
>>
> Maybe, but the callback approach is much cleaner. "string" and "in" can have
> stale data, for instance.

I see.  I was thinking of it as a trade-off against introducing a
new variable.
I'll take the callback approach first, and reconsider later if necessary.

>
>> >> 3. svm has next_rip but when it is 0, nop is emulated.  Can this be modified to
>>

Re: Question on skip_emulated_instructions()

2010-04-07 Thread Gleb Natapov

On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote:
> 2010/4/6 Gleb Natapov :
> > On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
> >> Hi.
> >>
> >> When handle_io() is called, rip is currently advanced *before* the I/O is
> >> actually handled by qemu in userland.  While implementing Kemari for
> >> KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly
> >> in userland qemu, we encountered a problem: synchronizing the VCPU state
> >> before handling I/O in qemu is too late, because rip has already been
> >> advanced in KVM.  Although we avoided this issue with a temporary hack, I
> >> would like to ask a few questions on skip_emulated_instructions().
> >>
> >> 1. Does rip need to be advanced before I/O is handled by qemu?
> > In current kvm.git, rip is advanced before I/O is handled by qemu only
> > in the case of the "out" instruction. From an architectural point of view
> > I think it's OK, since on real HW you can't guarantee that I/O will take
> > effect before the instruction pointer is advanced. It is done like that
> > because we want "out" emulation to be really fast, so we skip the x86 emulator.
> 
> Thanks for your reply.
> 
> If advancing rip later doesn't break the behavior of devices or
> introduce a slowdown, I would like that to be done.
> 
The device could not care less about what value the rip register currently has.
Why does it matter for your code?

> >> 2. If not, is it possible to split skip_emulated_instructions() into, e.g.,
> >> rec_emulated_instructions() to remember the next_rip, and
> >> skip_emulated_instructions() to actually advance rip?
> > Currently only the emulator can call userspace to do I/O, so after
> > userspace returns from an I/O exit, control is handed back to the emulator
> > unconditionally.  The "out" instruction skips the emulator, but there is
> > nothing to do after userspace returns, so the regular cpu loop is executed.
> > If we want to advance rip only after userspace has executed the I/O done by
> > "out", we need to distinguish who requested the I/O (the emulator or
> > kvm_fast_pio_out()) and call different code depending on who that was. It can
> > be done by having a callback that (if not null) is called on return from
> > userspace.
> 
> Your suggestion is to introduce a callback entry and, instead of
> calling kvm_rip_write(), set the entry before calling kvm_fast_pio_out(),
> and check the entry upon return from userspace, correct?
> 
Something like that, yes.

> According to the comment in x86.c, for the "out" instruction
> vcpu->arch.pio.count is set to 0 to skip the emulator.
> To call kvm_fast_pio_out(), "!string" and "!in" must hold.
> If we can check vcpu->arch.pio.count, "string" and "in" on return
> from userspace, can't we distinguish who requested the I/O, the emulator
> or kvm_fast_pio_out()?
> 
Maybe, but the callback approach is much cleaner. "string" and "in" can have
stale data, for instance.
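
To illustrate the stale-data concern, here is a toy sketch. The `toy_pio` struct, its `requester` field, and the helper functions are hypothetical, and the assumption that the fast "out" path resets only `count` is made up for the example; real KVM bookkeeping may differ. The point is only that guessing the requester from leftover fields can go wrong where an explicit record does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_requester { TOY_REQ_EMULATOR, TOY_REQ_FAST_OUT };

/* Hypothetical stand-in for per-vcpu PIO bookkeeping. */
struct toy_pio {
    int  count;
    bool string;
    bool in;
    enum toy_requester requester;   /* explicit record of who asked */
};

/* An emulator-driven string "in" request fills in every field. */
static void toy_emulator_string_in(struct toy_pio *pio, int count)
{
    assert(pio != NULL && count > 0);
    pio->count = count;
    pio->string = true;
    pio->in = true;
    pio->requester = TOY_REQ_EMULATOR;
}

/*
 * A fast "out" request that, by assumption in this toy, resets only count
 * and leaves "string" and "in" holding whatever the last request wrote.
 */
static void toy_fast_out(struct toy_pio *pio)
{
    assert(pio != NULL);
    pio->count = 0;
    pio->requester = TOY_REQ_FAST_OUT;
}

/* Inference from the flags alone, as proposed in the quoted question. */
static enum toy_requester toy_guess_requester(const struct toy_pio *pio)
{
    assert(pio != NULL);
    if (pio->count == 0 && !pio->string && !pio->in)
        return TOY_REQ_FAST_OUT;
    return TOY_REQ_EMULATOR;
}
```

If `toy_emulator_string_in()` runs first and `toy_fast_out()` second, the flag-based guess still reports the emulator, while the explicit `requester` field (or a callback pointer) identifies the fast path correctly.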

> >> 3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
> >> modified to continue without emulating a nop when next_rip is 0?
> >>
> > I don't see where a nop is emulated if next_rip is 0. As far as I can see,
> > in the case of next_rip==0 the instruction at rip is decoded to figure out
> > its length, and then rip is advanced by the instruction length. Anyway,
> > next_rip is an svm-only thing.
> 
> Sorry.  I didn't understand the code well enough.
> 
> static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
> {
> ...
>         if (!svm->next_rip) {
>                 if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
>                     EMULATE_DONE)
>                         printk(KERN_DEBUG "%s: NOP\n", __func__);
>                 return;
>         }
> 
> Since the printk says NOP, I thought emulate_instruction was emulating a nop...
> 
> The reason I asked about next_rip is that I was hoping to use this
> field to advance rip only after userspace has executed the I/O done by "out":
> for example, call kvm_rip_write() if next_rip is non-zero, and introduce
> next_rip to vmx if that is usable, because vmx currently uses a local
> variable rip.
> 
> Yoshi

--
Gleb.

Re: Question on skip_emulated_instructions()

2010-04-06 Thread Yoshiaki Tamura

2010/4/6 Gleb Natapov :
> On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
>> Hi.
>>
>> When handle_io() is called, rip is currently advanced *before* the I/O is
>> actually handled by qemu in userland.  While implementing Kemari for
>> KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
>> userland qemu, we encountered a problem: synchronizing the VCPU state
>> before handling I/O in qemu is too late, because rip has already been
>> advanced in KVM.  Although we avoided this issue with a temporary hack, I
>> would like to ask a few questions on skip_emulated_instructions().
>>
>> 1. Does rip need to be advanced before I/O is handled by qemu?
> In current kvm.git, rip is advanced before I/O is handled by qemu only
> in the case of the "out" instruction. From an architectural point of view
> I think it's OK, since on real HW you can't guarantee that I/O will take
> effect before the instruction pointer is advanced. It is done like that
> because we want "out" emulation to be really fast, so we skip the x86 emulator.

Thanks for your reply.

If advancing rip later doesn't break the behavior of devices or
introduce a slowdown, I would like that to be done.

>> 2. If not, is it possible to split skip_emulated_instructions() into, e.g.,
>> rec_emulated_instructions() to remember the next_rip, and
>> skip_emulated_instructions() to actually advance rip?
> Currently only the emulator can call userspace to do I/O, so after
> userspace returns from an I/O exit, control is handed back to the emulator
> unconditionally.  The "out" instruction skips the emulator, but there is
> nothing to do after userspace returns, so the regular cpu loop is executed.
> If we want to advance rip only after userspace has executed the I/O done by
> "out", we need to distinguish who requested the I/O (the emulator or
> kvm_fast_pio_out()) and call different code depending on who that was. It can
> be done by having a callback that (if not null) is called on return from
> userspace.

Your suggestion is to introduce a callback entry and, instead of
calling kvm_rip_write(), set the entry before calling kvm_fast_pio_out(),
and check the entry upon return from userspace, correct?

According to the comment in x86.c, for the "out" instruction
vcpu->arch.pio.count is set to 0 to skip the emulator.
To call kvm_fast_pio_out(), "!string" and "!in" must hold.
If we can check vcpu->arch.pio.count, "string" and "in" on return
from userspace, can't we distinguish who requested the I/O, the emulator
or kvm_fast_pio_out()?

>> 3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
>> modified to continue without emulating a nop when next_rip is 0?
>>
> I don't see where a nop is emulated if next_rip is 0. As far as I can see,
> in the case of next_rip==0 the instruction at rip is decoded to figure out
> its length, and then rip is advanced by the instruction length. Anyway,
> next_rip is an svm-only thing.

Sorry.  I didn't understand the code well enough.

static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
...
        if (!svm->next_rip) {
                if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
                    EMULATE_DONE)
                        printk(KERN_DEBUG "%s: NOP\n", __func__);
                return;
        }

Since the printk says NOP, I thought emulate_instruction was emulating a nop...

The reason I asked about next_rip is that I was hoping to use this
field to advance rip only after userspace has executed the I/O done by "out":
for example, call kvm_rip_write() if next_rip is non-zero, and introduce
next_rip to vmx if that is usable, because vmx currently uses a local
variable rip.

Yoshi

Re: Question on skip_emulated_instructions()

2010-04-06 Thread Gleb Natapov

On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
> Hi.
> 
> When handle_io() is called, rip is currently advanced *before* the I/O is
> actually handled by qemu in userland.  While implementing Kemari for
> KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
> userland qemu, we encountered a problem: synchronizing the VCPU state
> before handling I/O in qemu is too late, because rip has already been
> advanced in KVM.  Although we avoided this issue with a temporary hack, I
> would like to ask a few questions on skip_emulated_instructions().
> 
> 1. Does rip need to be advanced before I/O is handled by qemu?
In current kvm.git, rip is advanced before I/O is handled by qemu only
in the case of the "out" instruction. From an architectural point of view
I think it's OK, since on real HW you can't guarantee that I/O will take
effect before the instruction pointer is advanced. It is done like that
because we want "out" emulation to be really fast, so we skip the x86 emulator.

> 2. If not, is it possible to split skip_emulated_instructions() into, e.g.,
> rec_emulated_instructions() to remember the next_rip, and
> skip_emulated_instructions() to actually advance rip?
Currently only the emulator can call userspace to do I/O, so after
userspace returns from an I/O exit, control is handed back to the emulator
unconditionally.  The "out" instruction skips the emulator, but there is
nothing to do after userspace returns, so the regular cpu loop is executed.
If we want to advance rip only after userspace has executed the I/O done by
"out", we need to distinguish who requested the I/O (the emulator or
kvm_fast_pio_out()) and call different code depending on who that was. It can
be done by having a callback that (if not null) is called on return from
userspace.

> 3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
> modified to continue without emulating a nop when next_rip is 0?
> 
I don't see where a nop is emulated if next_rip is 0. As far as I can see,
in the case of next_rip==0 the instruction at rip is decoded to figure out
its length, and then rip is advanced by the instruction length. Anyway,
next_rip is an svm-only thing.
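
The fallback described here can be sketched as a toy model. The `toy_vcpu` struct, `toy_insn_length()`, and `toy_skip_instruction()` are hypothetical and stand in for the real decoder and skip routine; the tiny opcode table covers only three one-opcode x86 encodings and is nowhere near a real instruction decoder. Clearing `next_rip` after use is a choice of this sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy state; next_rip == 0 means "not supplied". */
struct toy_vcpu {
    uint64_t rip;
    uint64_t next_rip;
};

/*
 * Toy stand-in for an instruction decoder: returns the length in bytes of
 * the instruction starting at code[0], or 0 if it is not recognized.
 */
static unsigned toy_insn_length(const uint8_t *code)
{
    switch (code[0]) {
    case 0x90: return 1;   /* nop */
    case 0xEE: return 1;   /* out %al, (%dx) */
    case 0xE6: return 2;   /* out %al, $imm8 */
    default:   return 0;
    }
}

/*
 * Skip the instruction at rip.  Use next_rip when it is available,
 * otherwise decode the instruction to learn its length.
 * Returns 0 on success, -1 if the instruction could not be decoded.
 */
static int toy_skip_instruction(struct toy_vcpu *vcpu, const uint8_t *code_at_rip)
{
    assert(vcpu != NULL && code_at_rip != NULL);

    if (vcpu->next_rip != 0) {
        vcpu->rip = vcpu->next_rip;
        vcpu->next_rip = 0;          /* consume the hint (sketch's choice) */
        return 0;
    }

    unsigned len = toy_insn_length(code_at_rip);
    if (len == 0)
        return -1;                   /* leave rip untouched on failure */
    vcpu->rip += len;
    return 0;
}
```

Both paths end with rip pointing at the following instruction; the decode path is simply the slower way of finding the same address.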

--
Gleb.

Question on skip_emulated_instructions()

2010-04-05 Thread Yoshiaki Tamura

Hi.

When handle_io() is called, rip is currently advanced *before* the I/O is
actually handled by qemu in userland.  While implementing Kemari for
KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
userland qemu, we encountered a problem: synchronizing the VCPU state
before handling I/O in qemu is too late, because rip has already been
advanced in KVM.  Although we avoided this issue with a temporary hack, I
would like to ask a few questions on skip_emulated_instructions().

1. Does rip need to be advanced before I/O is handled by qemu?
2. If not, is it possible to split skip_emulated_instructions() into, e.g.,
rec_emulated_instructions() to remember the next_rip, and
skip_emulated_instructions() to actually advance rip?
3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
modified to continue without emulating a nop when next_rip is 0?

Thanks,

Yoshi
