Re: who cames from xen?

2011-02-10 Thread Nikola Ciprich
Well,

my reasons are pretty much the same as those of the people who already replied.
To emphasize the ones most important to me:
- the Xen developers didn't seem very interested in pushing everything
into mainline; in general, the KVM development process seems much more open to me
- it was difficult to use some of the new features we needed on the old
kernels Xen was based on
- after Xen was bought by Citrix, its future course was unclear
- Red Hat, which we've based our distro upon, switched to KVM as well (and bought
Qumranet)
- since KVM runs VMs as normal processes, there are better possibilities for
various kinds of "shaping" using cgroups etc.
- KVM seems simpler to debug to me, and the community here is pretty friendly

Well, that's enough I guess :)
All I have to say is that I too am pretty grateful to the KVM and QEMU
developers.
Thanks guys!

nik



On Thu, Feb 10, 2011 at 09:20:17PM +, Mauro wrote:
> On 10 February 2011 19:30, Nikola Ciprich  wrote:
> > Hi,
> > I switched from XEN to KVM a long time ago, and haven't regretted it
> > since...
> > Are you interested in something in particular?
> 
> Then I'm interested in your motivations for switching from Xen to KVM.
> In case it matters, I use Debian squeeze.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: add the __noclone attribute

2011-02-10 Thread Lai Jiangshan
The changelog of 104f226 says it "adds the __noclone attribute",
but the attribute is missing from the patch itself. I think it is still needed.

Signed-off-by: Lai Jiangshan 
---
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bf89ec2..de99a4d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3962,7 +3962,7 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 #define Q "l"
 #endif
 
-static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
+static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
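For readers unfamiliar with the attribute, here is a minimal sketch of what it does. It assumes GCC 4.5 or later and that the kernel's __noclone macro expands to __attribute__((__noclone__)); SKETCH_NOCLONE and run_guest_stub are hypothetical names used only for illustration. The attribute asks GCC not to emit specialized copies of a function, which matters when the body contains inline assembly defining a label that must exist only once.

```c
#include <assert.h>

/* Assumption: GCC >= 4.5 understands noclone; other compilers get a no-op. */
#if defined(__GNUC__) && !defined(__clang__) && \
	(__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5))
#define SKETCH_NOCLONE __attribute__((__noclone__))
#else
#define SKETCH_NOCLONE
#endif

/*
 * Hypothetical stand-in for a function such as vmx_vcpu_run(). The real
 * function contains inline assembly with a label; if the compiler cloned
 * the function (for example for constant propagation), the label would be
 * emitted twice and the assembler would reject the duplicate definition.
 */
SKETCH_NOCLONE int run_guest_stub(int launched)
{
	assert(launched == 0 || launched == 1);

	/* Placeholder for the "launch or resume" decision. */
	return launched ? 1 : 0;
}
```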


Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf

On 11.02.2011, at 01:22, Alexander Graf wrote:

> 
> On 11.02.2011, at 01:20, Alexander Graf wrote:
> 
>> 
>> On 10.02.2011, at 19:51, Scott Wood wrote:
>> 
>>> On Thu, 10 Feb 2011 12:45:38 +0100
>>> Alexander Graf  wrote:
>>> 
>>>> Ok, thinking about this a bit more. You're basically proposing a list of
>>>> tlb set calls, with each array field identifying one tlb set call. What
>>>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>>>> TLB representation identical to the ioctl layout and then just call that
>>>> one ioctl to completely overwrite all of qemu's internal data (and vice
>>>> versa).
>>> 
>>> No, this is a full sync -- the list replaces any existing TLB entries (need
>>> to make that explicit in the doc).  Basically it's an invalidate plus a
>>> list of tlb set operations.
>>> 
>>> Qemu's internal representation will want to be ordered with no missing
>>> entries.  If we require that of the transfer representation we can't do
>>> early termination.  It would also limit Qemu's flexibility in choosing its
>>> internal representation, and make it more awkward to support multiple MMU
>>> types.
>> 
>> Well, but this way it means we'll have to assemble/disassemble a list of 
>> entries multiple times:
>> 
>> SET:
>> * qemu assembles the list from its internal representation
>> * kvm disassembles the list into its internal structure
>> 
>> GET:
>> * kvm assembles the list from its internal representation
>> * qemu disassembles the list into its internal structure
>> 
>> Maybe we should go with Avi's proposal after all and simply keep the full 
>> soft-mmu synced between kernel and user space? That way we only need a setup 
>> call at first, no copying in between and simply update the user space 
>> version whenever something changes in the guest. We need to store the TLB's 
>> contents off somewhere anyways, so all we need is an additional in-kernel 
>> array with internal translation data, but that can be separate from the 
>> guest visible data, right?
> 
> If we could then keep qemu's internal representation == shared data with kvm 
> == kvm's internal data for guest visible stuff, we get this done with almost 
> no additional overhead. And I don't see any problem with this. Should be 
> easily doable.

So all we need to get the full functionality is a hint from the kernel to
user space that something changed, and vice versa.

From kernel to user space is simple. We can just document that after every
RUN, all fields can be modified.
From user space to kernel, we could modify the entries directly and then issue
an ioctl that passes a dirty bitmap to kernel space. KVM can then decide
what to do with it. I guess the easiest implementation for now would be to
ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to 
all zeros (should be done by env init anyways) and pass in the "something 
changed" call. KVM can then decide to simply drop all of its shadow state or 
loop through every shadow entry and flush it individually. Maybe we should give 
a hint on the amount of flushes, so KVM can implement some threshold.
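A minimal sketch of the dirty-bitmap idea described above, under stated assumptions: the entry layout, the array size, the threshold parameter and every sketch_* name are hypothetical and not part of any existing KVM interface. User space edits the shared array in place and marks the entry dirty; the "something changed" call then either flushes the listed shadow entries one by one or drops the whole shadow state when too many entries changed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TLB_ENTRIES 64	/* hypothetical TLB size */
#define SKETCH_WORD_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_TLB_ENTRIES + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* Hypothetical guest-visible entry; the real layout depends on the MMU type. */
struct sketch_tlb_entry {
	uint32_t valid;
	uint64_t vaddr;
	uint64_t paddr;
};

/* Array shared between user space and the kernel, plus the dirty bitmap. */
struct sketch_shared_tlb {
	struct sketch_tlb_entry entries[SKETCH_TLB_ENTRIES];
	unsigned long dirty[SKETCH_WORDS];
};

/* User space side: modify an entry in place and remember that it changed. */
void sketch_user_set_entry(struct sketch_shared_tlb *t, size_t idx,
			   struct sketch_tlb_entry e)
{
	assert(idx < SKETCH_TLB_ENTRIES);
	t->entries[idx] = e;
	t->dirty[idx / SKETCH_WORD_BITS] |= 1UL << (idx % SKETCH_WORD_BITS);
}

/*
 * Kernel side of the "something changed" call. shadow_valid[] stands in for
 * the shadow TLB. Returns -1 if the whole shadow state was dropped, otherwise
 * the number of shadow entries that were flushed individually.
 */
int sketch_kernel_sync(struct sketch_shared_tlb *t, int *shadow_valid,
		       size_t threshold)
{
	size_t idx, count = 0;

	/* Count dirty entries so we can compare against the threshold. */
	for (idx = 0; idx < SKETCH_TLB_ENTRIES; idx++)
		if (t->dirty[idx / SKETCH_WORD_BITS] &
		    (1UL << (idx % SKETCH_WORD_BITS)))
			count++;

	/* Flush everything above the threshold, only dirty entries below it. */
	for (idx = 0; idx < SKETCH_TLB_ENTRIES; idx++) {
		int is_dirty = (int)((t->dirty[idx / SKETCH_WORD_BITS] >>
				      (idx % SKETCH_WORD_BITS)) & 1UL);

		if (count > threshold || is_dirty)
			shadow_valid[idx] = 0;
	}

	/* The bitmap has been consumed. */
	for (idx = 0; idx < SKETCH_WORDS; idx++)
		t->dirty[idx] = 0;

	return count > threshold ? -1 : (int)count;
}
```

Ignoring the bitmap entirely, as suggested for a first implementation, corresponds to calling this with a threshold of 0.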

Also, please tell me you didn't implement the previous revisions already. It'd 
be a real bummer to see that work wasted only because we're still iterating 
through the spec O_o.


Alex



Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot

2011-02-10 Thread Huang Ying
On Thu, 2011-02-10 at 16:52 +0800, Jan Kiszka wrote:
> On 2011-02-10 01:27, Huang Ying wrote:
> >>> @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en
> >>>  hardware_memory_error();
> >>>  }
> >>>  }
> >>> +kvm_hwpoison_page_add(ram_addr);
> >>>  
> >>>  if (code == BUS_MCEERR_AR) {
> >>>  /* Fake an Intel architectural Data Load SRAR UCR */
> >>> @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a
> >>>  "QEMU itself instead of guest system!: %p\n", addr);
> >>>  return 0;
> >>>  }
> >>> +kvm_hwpoison_page_add(ram_addr);
> >>>  kvm_mce_inj_srao_memscrub2(first_cpu, paddr);
> >>>  } else
> >>>  #endif
> >>>
> >>>
> >>
> >> Looks fine otherwise. Unless that simplification makes sense, I could
> >> offer to include this into my MCE rework (there is some minor conflict).
> >> If all goes well, that series should be posted during this week.
> 
> Please have a look at
> 
> git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream
> 
> and tell me if it works for you and your signed-off still applies.

Thanks!  Works as expected in my testing!

Best Regards,
Huang Ying




Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf

On 11.02.2011, at 01:20, Alexander Graf wrote:

> 
> On 10.02.2011, at 19:51, Scott Wood wrote:
> 
>> On Thu, 10 Feb 2011 12:45:38 +0100
>> Alexander Graf  wrote:
>> 
>>> Ok, thinking about this a bit more. You're basically proposing a list of
>>> tlb set calls, with each array field identifying one tlb set call. What
>>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>>> TLB representation identical to the ioctl layout and then just call that
>>> one ioctl to completely overwrite all of qemu's internal data (and vice
>>> versa).
>> 
>> No, this is a full sync -- the list replaces any existing TLB entries (need
>> to make that explicit in the doc).  Basically it's an invalidate plus a
>> list of tlb set operations.
>> 
>> Qemu's internal representation will want to be ordered with no missing
>> entries.  If we require that of the transfer representation we can't do
>> early termination.  It would also limit Qemu's flexibility in choosing its
>> internal representation, and make it more awkward to support multiple MMU
>> types.
> 
> Well, but this way it means we'll have to assemble/disassemble a list of 
> entries multiple times:
> 
> SET:
> * qemu assembles the list from its internal representation
> * kvm disassembles the list into its internal structure
> 
> GET:
> * kvm assembles the list from its internal representation
> * qemu disassembles the list into its internal structure
> 
> Maybe we should go with Avi's proposal after all and simply keep the full 
> soft-mmu synced between kernel and user space? That way we only need a setup 
> call at first, no copying in between and simply update the user space version 
> whenever something changes in the guest. We need to store the TLB's contents 
> off somewhere anyways, so all we need is an additional in-kernel array with 
> internal translation data, but that can be separate from the guest visible 
> data, right?

If we could then keep qemu's internal representation == shared data with kvm == 
kvm's internal data for guest visible stuff, we get this done with almost no 
additional overhead. And I don't see any problem with this. Should be easily 
doable.


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf

On 10.02.2011, at 19:51, Scott Wood wrote:

> On Thu, 10 Feb 2011 12:45:38 +0100
> Alexander Graf  wrote:
> 
>> Ok, thinking about this a bit more. You're basically proposing a list of
>> tlb set calls, with each array field identifying one tlb set call. What
>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>> TLB representation identical to the ioctl layout and then just call that
>> one ioctl to completely overwrite all of qemu's internal data (and vice
>> versa).
> 
> No, this is a full sync -- the list replaces any existing TLB entries (need
> to make that explicit in the doc).  Basically it's an invalidate plus a
> list of tlb set operations.
> 
> Qemu's internal representation will want to be ordered with no missing
> entries.  If we require that of the transfer representation we can't do
> early termination.  It would also limit Qemu's flexibility in choosing its
> internal representation, and make it more awkward to support multiple MMU
> types.

Well, but this way it means we'll have to assemble/disassemble a list of 
entries multiple times:

SET:
 * qemu assembles the list from its internal representation
 * kvm disassembles the list into its internal structure

GET:
 * kvm assembles the list from its internal representation
 * qemu disassembles the list into its internal structure

Maybe we should go with Avi's proposal after all and simply keep the full 
soft-mmu synced between kernel and user space? That way we only need a setup 
call at first, no copying in between and simply update the user space version 
whenever something changes in the guest. We need to store the TLB's contents 
off somewhere anyways, so all we need is an additional in-kernel array with 
internal translation data, but that can be separate from the guest visible 
data, right?


Alex



Re: who cames from xen?

2011-02-10 Thread Alejandro Leyva
We switched for the same reasons a year ago. We have 50
physical servers and around 400 VMs on them, ranging from single
hosts with internal RAID 1 storage to iSCSI solutions.

On Thu, Feb 10, 2011 at 3:53 PM, Dan VerWeire  wrote:
> I switched from Xen to KVM a couple years ago for the following reasons IIRC:
>
> -Xen was stuck on an older kernel that did not have the drivers I needed
> -Ubuntu decided to switch their focus to KVM as their main
> virtualization package
> -KVM had more solid Windows network drivers
> -I personally didn't like the Dom0/DomU concept; I just want my VMs to
> be processes on the host just like any other process
> -KVM was in the kernel which gave me a good feeling about the
> longevity and support of the project
>
> I am a sys admin (among other things) for a wholesale distribution
> company. We have 28 virtual machines on 3 different hosts. They are a
> mixture of Windows and Linux. I am extremely happy with KVM and
> Ubuntu's support of KVM. It is awesome to get new features like KSM
> and Ceph block devices (which I haven't used yet but am very excited
> about) as the kernel and KVM evolve.
>
> I can say that, in my experience, our VMs run more solid on KVM than
> they did on Xen and even more solid than on bare metal, especially in
> the case of Windows.
>
> Thank you KVM developers.
>
> Dan VerWeire
>
>
> On Thu, Feb 10, 2011 at 4:20 PM, Mauro  wrote:
>> On 10 February 2011 19:30, Nikola Ciprich  wrote:
>>> Hi,
>>> I switched from XEN to KVM a long time ago, and haven't regretted it
>>> since...
>>> Are you interested in something in particular?
>>
>> Then I'm interested in your motivations for switching from Xen to KVM.
>> In case it matters, I use Debian squeeze.


Re: who cames from xen?

2011-02-10 Thread Freddie Cash
On Thu, Feb 10, 2011 at 1:53 PM, Dan VerWeire  wrote:
> I switched from Xen to KVM a couple years ago for the following reasons IIRC:

We're in the process of switching from Xen to KVM, for similar
reasons, but also to get away from the hassle that is configuring Xen,
especially with the crap that is Grub2.  With KVM, it's easy to run
"Linux Version X" on the host, and "Linux Version X+Y" as guests.

Trying to get that setup to work with Xen, especially with Grub1 on
the host, and the VMs wanting Grub2 (aka using Debian Lenny for Dom0
and Debian Squeeze for DomU) was an extreme exercise in frustration.
Doing the same with KVM is a snap.

The whole Dom0/DomU split is a hassle as well.

Now that all CPUs (well, at least all of AMD's CPUs) support hardware
virt, I honestly do not see a reason to use Xen.  It's just not worth
the hassle for a theoretical couple % better performance.

> Thank you KVM developers.

Wholeheartedly agree!!!

-- 
Freddie Cash
fjwc...@gmail.com


Re: who cames from xen?

2011-02-10 Thread Dan VerWeire
I switched from Xen to KVM a couple years ago for the following reasons IIRC:

-Xen was stuck on an older kernel that did not have the drivers I needed
-Ubuntu decided to switch their focus to KVM as their main
virtualization package
-KVM had more solid Windows network drivers
-I personally didn't like the Dom0/DomU concept; I just want my VMs to
be processes on the host just like any other process
-KVM was in the kernel which gave me a good feeling about the
longevity and support of the project

I am a sys admin (among other things) for a wholesale distribution
company. We have 28 virtual machines on 3 different hosts. They are a
mixture of Windows and Linux. I am extremely happy with KVM and
Ubuntu's support of KVM. It is awesome to get new features like KSM
and Ceph block devices (which I haven't used yet but am very excited
about) as the kernel and KVM evolve.

I can say that, in my experience, our VMs run more solid on KVM than
they did on Xen and even more solid than on bare metal, especially in
the case of Windows.

Thank you KVM developers.

Dan VerWeire


On Thu, Feb 10, 2011 at 4:20 PM, Mauro  wrote:
> On 10 February 2011 19:30, Nikola Ciprich  wrote:
>> Hi,
>> I switched from XEN to KVM a long time ago, and haven't regretted it
>> since...
>> Are you interested in something in particular?
>
> Then I'm interested in your motivations for switching from Xen to KVM.
> In case it matters, I use Debian squeeze.


Re: who cames from xen?

2011-02-10 Thread Mauro
On 10 February 2011 19:30, Nikola Ciprich  wrote:
> Hi,
> I switched from XEN to KVM a long time ago, and haven't regretted it since...
> Are you interested in something in particular?

Then I'm interested in your motivations for switching from Xen to KVM.
In case it matters, I use Debian squeeze.


RE: Does KVM use one EPT table per Guest CR3?

2011-02-10 Thread Lok Kwong Yan
Sorry for the late reply.

It seems to me that the EPTP is changing because of kvm_set_cr0.

Here is what I did; please correct me if I am doing the trace incorrectly:

- Added a trace entry in vmx_set_cr3 that emits a trace message
whenever vmcs_read64(EPT_POINTER) != eptp after construct_eptp(cr3).

I then looked at the trace log, and the message seems to show up together with

kvm_exit: reason cr_access rip 0xc0122003
kvm_cr: cr_write 0 = 0x8005003b

I also noticed that kvm_mmu_reset_context(vcpu) is being called at the end of 
kvm_set_cr0. 

The CR0 value of 0x8005003b doesn't seem to trigger any of the earlier if cases,
which would indicate that kvm_mmu_reset_context(vcpu) is being called, and that
could be the reason why eptp is changing.
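To make the CR0 value from the trace easier to read, here is a small decoding sketch. The bit positions follow the architectural CR0 layout; the helper names are hypothetical, and the assumption that only a change in PG or WP would justify an MMU context reset is mine, so it should be checked against kvm_set_cr0 in the tree being traced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural CR0 bits (names prefixed to avoid clashing with real headers). */
#define SKETCH_CR0_PE (1u << 0)
#define SKETCH_CR0_MP (1u << 1)
#define SKETCH_CR0_EM (1u << 2)
#define SKETCH_CR0_TS (1u << 3)
#define SKETCH_CR0_ET (1u << 4)
#define SKETCH_CR0_NE (1u << 5)
#define SKETCH_CR0_WP (1u << 16)
#define SKETCH_CR0_AM (1u << 18)
#define SKETCH_CR0_NW (1u << 29)
#define SKETCH_CR0_CD (1u << 30)
#define SKETCH_CR0_PG (1u << 31)

static const struct {
	uint32_t mask;
	const char *name;
} sketch_cr0_flags[] = {
	{ SKETCH_CR0_PE, "PE" }, { SKETCH_CR0_MP, "MP" }, { SKETCH_CR0_EM, "EM" },
	{ SKETCH_CR0_TS, "TS" }, { SKETCH_CR0_ET, "ET" }, { SKETCH_CR0_NE, "NE" },
	{ SKETCH_CR0_WP, "WP" }, { SKETCH_CR0_AM, "AM" }, { SKETCH_CR0_NW, "NW" },
	{ SKETCH_CR0_CD, "CD" }, { SKETCH_CR0_PG, "PG" },
};

/* Write the names of all set flags into buf, separated by spaces. */
char *sketch_cr0_describe(uint32_t cr0, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(sketch_cr0_flags) / sizeof(sketch_cr0_flags[0]); i++) {
		int n;

		if (!(cr0 & sketch_cr0_flags[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", sketch_cr0_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)n;
	}
	return buf;
}

/* Assumption: only PG/WP changes affect how guest paging must be shadowed. */
int sketch_cr0_paging_bits_changed(uint32_t old_cr0, uint32_t new_cr0)
{
	return ((old_cr0 ^ new_cr0) & (SKETCH_CR0_PG | SKETCH_CR0_WP)) != 0;
}
```

Comparing the old and new CR0 values this way shows whether a given write (for example one that only toggles TS) touches any paging-related bit at all.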

Thanks for your help again.

Enjoy,

Lok





From: Avi Kivity [a...@redhat.com]
Sent: Sunday, December 19, 2010 9:31 AM
To: Lok Kwong Yan
Cc: Anthony Liguori; kvm@vger.kernel.org
Subject: Re: Does KVM use one EPT table per Guest CR3?

On 12/17/2010 05:24 PM, Avi Kivity wrote:
> On 12/17/2010 12:14 AM, Lok Kwong Yan wrote:
>> Thanks for the reply and it makes a lot of sense.
>>
>> I am not seeing any EPT tables being zapped after the guest has fully
>> started up although the value of EPTP continuously changes as the
>> guest is running.
>
> Really strange, this is likely a bug.
>

I tried to reproduce, the only times I see eptp changes are when the
guest reprograms the vga adapter:

  qemu-system-x86-20944 [033]  1327.151819: kvm_pio: pio_write at 0x3ce size 2 count 1
  qemu-system-x86-20944 [033]  1327.151819: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
  qemu-system-x86-20944 [033]  1327.152405: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=237568 role=122881 root_count=0 unsync=0
...
  qemu-system-x86-20944 [033]  1327.153230: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=0 role=253956 root_count=2 unsync=0
  qemu-system-x86-20944 [033]  1327.153339: kvm_mmu_get_page: sp gfn 0 0/4 q0 direct --- !pge !nxe root 0sync
  qemu-system-x86-20944 [033]  1327.153344: print: a0265cde vmx_set_cr3: eptp fef14101

Under what scenario do you see eptp changing?

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Scott Wood
On Thu, 10 Feb 2011 19:22:38 +
Peter Maydell  wrote:

> On 10 February 2011 19:17, Scott Wood  wrote:
> > On Thu, 10 Feb 2011 08:16:15 +
> > Peter Maydell  wrote:
> >> On 10 February 2011 07:47, Anthony Liguori  wrote:
> >> > So very concretely, I'm suggesting we do the following to target-i386:
> >>
> >> > 2) get rid of the entire concept of machines.  Creating a i440fx is
> >> > essentially equivalent to creating a bare machine.
> >>
> >> Does that make any sense for anything other than target-i386?
> 
> > It makes a lot of sense for us on powerpc.  Maybe it has to do with a
> > longer tradition of using device trees versus opaque machine IDs -- I don't
> > think the hardware itself makes any substantial difference.  Currently we
> > end up having everything pretend to be an mpc8544ds (with some differences
> > described by the guest device tree that the user feeds in), which is ugly.
> 
> Hmm. Device tree is coming to ARM, but just at the moment it's
> generally one-kernel-one-machine still. (We've only just gained the
> ability to compile one kernel for both UP and SMP...)
> 
> I kind of think you're still defining a "machine", you're just doing it
> in your device tree blob rather than in C.

Right, that's the point -- the definition is just a definition, it's not
tied up with implementation.  This reduces the amount of duplication in
implementation (or inappropriate sharing, as in the "use mpc8544ds for
all 85xx" case).

-Scott



Re: who cames from xen?

2011-02-10 Thread Nikola Ciprich
Hi,
I switched from XEN to KVM a long time ago, and haven't regretted it since...
Are you interested in something in particular?
n.

On Thu, Feb 10, 2011 at 03:28:10PM +, Mauro wrote:
> I've been using Xen for years with no problems in my production environments.
> Now I want to try KVM.
> Does anyone here have experience moving from Xen to KVM?

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Peter Maydell
On 10 February 2011 19:17, Scott Wood  wrote:
> On Thu, 10 Feb 2011 08:16:15 +
> Peter Maydell  wrote:
>> On 10 February 2011 07:47, Anthony Liguori  wrote:
>> > So very concretely, I'm suggesting we do the following to target-i386:
>>
>> > 2) get rid of the entire concept of machines.  Creating a i440fx is
>> > essentially equivalent to creating a bare machine.
>>
>> Does that make any sense for anything other than target-i386?

> It makes a lot of sense for us on powerpc.  Maybe it has to do with a
> longer tradition of using device trees versus opaque machine IDs -- I don't
> think the hardware itself makes any substantial difference.  Currently we
> end up having everything pretend to be an mpc8544ds (with some differences
> described by the guest device tree that the user feeds in), which is ugly.

Hmm. Device tree is coming to ARM, but just at the moment it's
generally one-kernel-one-machine still. (We've only just gained the
ability to compile one kernel for both UP and SMP...)

I kind of think you're still defining a "machine", you're just doing it
in your device tree blob rather than in C.

-- PMM


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Scott Wood
On Thu, 10 Feb 2011 08:16:15 +
Peter Maydell  wrote:

> On 10 February 2011 07:47, Anthony Liguori  wrote:
> > So very concretely, I'm suggesting we do the following to target-i386:
> 
> > 2) get rid of the entire concept of machines.  Creating a i440fx is
> > essentially equivalent to creating a bare machine.
> 
> Does that make any sense for anything other than target-i386?
> The concept of a machine model seems a pretty obvious one
> for ARM boards, for instance, and I'm not sure we'd gain much
> by having i386 be different to the other architectures...

It makes a lot of sense for us on powerpc.  Maybe it has to do with a
longer tradition of using device trees versus opaque machine IDs -- I don't
think the hardware itself makes any substantial difference.  Currently we
end up having everything pretend to be an mpc8544ds (with some differences
described by the guest device tree that the user feeds in), which is ugly.

-Scott



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Scott Wood
On Thu, 10 Feb 2011 12:45:38 +0100
Alexander Graf  wrote:

> Ok, thinking about this a bit more. You're basically proposing a list of
> tlb set calls, with each array field identifying one tlb set call. What
> I was thinking of was a full TLB sync, so we could keep qemu's internal
> TLB representation identical to the ioctl layout and then just call that
> one ioctl to completely overwrite all of qemu's internal data (and vice
> versa).

No, this is a full sync -- the list replaces any existing TLB entries (need
to make that explicit in the doc).  Basically it's an invalidate plus a
list of tlb set operations.

Qemu's internal representation will want to be ordered with no missing
entries.  If we require that of the transfer representation we can't do
early termination.  It would also limit Qemu's flexibility in choosing its
internal representation, and make it more awkward to support multiple MMU
types.

Let's see if the format conversion imposes significant overhead before
imposing a less flexible/larger transfer format. :-)
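A rough sketch of the "invalidate plus a list of tlb set operations" semantics, to make the comparison concrete. Everything here is an assumption for illustration: the entry fields, the fixed TLB size, and in particular the end-of-list marker used for early termination, which may well look different in the actual proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TLB_SIZE 16		/* hypothetical number of TLB slots */
#define SKETCH_LIST_END UINT32_MAX	/* hypothetical end-of-list marker */

struct sketch_tlb_slot {
	int valid;
	uint64_t vaddr;
	uint64_t paddr;
};

struct sketch_tlb {
	struct sketch_tlb_slot slot[SKETCH_TLB_SIZE];
};

/* One element of the transfer list: "set slot <index> to this mapping". */
struct sketch_tlb_op {
	uint32_t index;
	uint64_t vaddr;
	uint64_t paddr;
};

/*
 * Full sync: drop every existing entry, then replay the list. The list may
 * be unordered and sparse, and may end early at SKETCH_LIST_END. Returns the
 * number of entries applied, or -1 on an out-of-range index (in which case
 * the TLB holds only the entries applied so far).
 */
int sketch_tlb_full_sync(struct sketch_tlb *tlb,
			 const struct sketch_tlb_op *list, size_t max_ops)
{
	size_t i;
	int applied = 0;

	memset(tlb, 0, sizeof(*tlb));		/* the "invalidate" part */

	for (i = 0; i < max_ops; i++) {
		struct sketch_tlb_slot *s;

		if (list[i].index == SKETCH_LIST_END)
			break;			/* early termination */
		if (list[i].index >= SKETCH_TLB_SIZE)
			return -1;

		s = &tlb->slot[list[i].index];
		s->valid = 1;
		s->vaddr = list[i].vaddr;
		s->paddr = list[i].paddr;
		applied++;
	}
	return applied;
}
```

The conversion cost under discussion is the loop above plus its mirror image on the GET side.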

> > MMU type ID also controls this, but could add some padding to make
> > extensions simpler (esp. since we're not making an array of it).  How much
> > would you recommend?
> >   
> 
> How about making it 64 bytes? That should leave us plenty of room.

OK.

> > The fields inside the struct should be __u32, of course. :-P
> >   
> 
> Ugh, yes :). But since we're dropping this anyways, it doesn't matter,
> right? :)

Right.

> > I assumed most MMU types would have some straightforward way of marking an
> > entry invalid (if not, it can add a software field in the struct), and that
> > it would be MMU-specific code that is processing the list.
> >   
> 
> See above :).

Which part?

-Scott



vhost disables kvm acceleration

2011-02-10 Thread Asdo
Hello,
I have set up a server with kernel 2.6.37, qemu-kvm-0.13.0 compiled from
source, and libvirt 0.8.3 patched to enable the use of netdev (and hence
vhost).

When I modprobe vhost_net and restart a VM that uses virtio networking, the
VM crawls at 1/100th of its normal speed.
That seems to be about the speed of emulation without kvm
acceleration (I tried that).
If I stop the machine, remove the vhost_net module, and restart the VM, it
runs at normal speed again.

These are the invocations by libvirt
(note that libvirt autodetects the presence of vhost and uses it; the config
of the VM hasn't changed. Also note that both specify -enable-kvm ...):

without vhost_net module (=fast)

LC_ALL=C
PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
QEMU_AUDIO_DRV=none /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc-0.13
-enable-kvm -m 4096 -smp 2,sockets=2,cores=1,threads=1 -name
uarray_server -uuid 7db77cca-addd-4cf4-f7cd-5399d217543e -nodefconfig
-nodefaults -chardev
socket,id=monitor,path=/var/lib/libvirt/qemu/uarray_server.monitor,server,nowait
-mon chardev=monitor,mode=readline -rtc base=utc -boot c -drive
file=/virtualmachines/myserver.raw,if=none,id=drive-virtio-disk0,boot=on,format=raw
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-netdev tap,fd=54,id=hostnet0 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:69:94:91:65,bus=pci.0,addr=0x3
-usb -vnc 127.0.0.1:3 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


with vhost_net module (=slow)

LC_ALL=C
PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
QEMU_AUDIO_DRV=none /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc-0.13
-enable-kvm -m 4096 -smp 2,sockets=2,cores=1,threads=1 -name
uarray_server -uuid 7db77cca-addd-4cf4-f7cd-5399d217543e -nodefconfig
-nodefaults -chardev
socket,id=monitor,path=/var/lib/libvirt/qemu/uarray_server.monitor,server,nowait
-mon chardev=monitor,mode=readline -rtc base=utc -boot c -drive
file=/virtualmachines/myserver.raw,if=none,id=drive-virtio-disk0,boot=on,format=raw
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-netdev tap,fd=52,id=hostnet0,vhost=on,vhostfd=54 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:69:94:91:65,bus=pci.0,addr=0x3
-usb -vnc 127.0.0.1:3 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


What's the problem?

Thank you


[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-02-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #27 from Marcelo Tosatti   2011-02-10 16:57:59 ---
Created an attachment (id=47152)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=47152)
kvm-debug-spte-gfn-2.patch



[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-02-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #26 from Marcelo Tosatti   2011-02-10 16:57:17 ---
Nicolas,

New debug patch attached. Please try it on top of clean 2.6.37.



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 03:20 PM, Gleb Natapov wrote:

> Judging by how well all previous conversions went we will end up with one
> more way of creating devices. One legacy, another qdev and your new one.
> And what is the problem with qdev again (not that I am a big qdev fan)?
   


We've really been arguing about probably the most minor aspect of the 
problem with qdev.


All I'm really saying is that we shouldn't tie device construction to a 
factory interface as we do with qdev.


That simply means that we should be able to do:

RTC *rtc_create(arg1, arg2, arg3);

And that a separate piece of code decides which devices are exposed 
through -device or device_add.  Which devices are exposed is really a 
minor detail.
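
A minimal sketch of the separation argued for above, written in Python for brevity and not taken from QEMU. Apart from rtc_create, every name here (RTC, USER_CREATABLE, device_add, the base_year and irq properties) is hypothetical. The point is only that the constructor is an ordinary function, and a separate table decides what is reachable by name.

```python
class RTC:
    """Toy device; the fields are illustrative, not QEMU's RTC state."""

    def __init__(self, base_year, irq):
        self.base_year = base_year
        self.irq = irq


def rtc_create(base_year, irq):
    # Plain constructor: board code can call this directly, no factory needed.
    return RTC(base_year, irq)


# Separate policy table: which devices a user may create by name.
# Leaving a device out of this table hides it without touching its constructor.
USER_CREATABLE = {
    "rtc": lambda props: rtc_create(int(props.get("base_year", 2000)),
                                    int(props.get("irq", 8))),
}


def device_add(spec):
    """Parse 'name,key=value,...' and build the device if it is exposed."""
    name, _, rest = spec.partition(",")
    props = dict(item.split("=", 1) for item in rest.split(",") if item)
    if name not in USER_CREATABLE:
        raise KeyError("device %r is not user-creatable" % name)
    return USER_CREATABLE[name](props)
```

Machine setup code would call rtc_create() directly, while something like device_add("rtc,base_year=1980") would go through the table.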


That said, qdev has a number of significant limitations in my mind.  The 
first is that the only relationship between devices is through the 
BusState interface.  I don't think we should even try to have a generic 
bus model.  When you look at how badly broken PCI hotplug currently is in 
qdev, I think it is symptomatic of this.


There's also no way in qdev to really have polymorphism.  Interfaces 
really aren't meaningful in qdev so you have things like PCIDevice where 
some methods are stored in the object instead of the class dispatch 
table and you have overuse of static class members.


And it's all unrelated to VMState.

And this is just the basic mechanisms of qdev.  The actual 
implementation is worse.  The use of qemu_irq as gpio in the base class 
and overuse of SystemBus is really quite insane.


And so far, the use of qdev has been entirely superficial.  Devices 
still don't make use of bus level interfaces to do I/O so we don't have 
any better componentization than we did before qdev.



> The fact that there is not enough interest to convert all devices to it?
   


I don't think there is any device that has been improved by qdev.  
-device is a nice feature, but it could have been implemented without qdev.


Regards,

Anthony Liguori


> How will a new way of doing things solve this?
>
> Just to be clear, I do not have a problem with not having the ability to
> compose x86 without pit or kbd controller. Basic things like RTC, pit,
> pic, ioapic, dma, kbd should be created unconditionally as part of x86
> pc machine. But IMHO you are trying to take things to the other extreme.
>
> --
> Gleb.
   




Re: EPT: Misconfiguration

2011-02-10 Thread Ruben Kerkhof
On Wed, Jan 26, 2011 at 16:00, Ruben Kerkhof  wrote:
> On Wed, Jan 26, 2011 at 10:52, Avi Kivity  wrote:
>> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>>>
>>> >  When you say "suddenly", this was with no changes to software and
>>> > hardware?
>>>
>>> The host software and hardware hasn't changed in the two months since
>>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>>>
>>> We host customer vms on it though, so virtual machines come and go.
>>> Various operating systems, a mixture of Linux, FreeBSD and Windows
>>> 2008 R2. We have other machines with the same config without these
>>> problems though.
>>
>> Are those other machines running a similar workload?
>
> Yes, similar, or they're more heavily loaded.
>
> On this machine, about half of the 48GB memory was used for virtual machines.
>
>> The traces look awfully like bad hardware, though that can also be explained
>> by random memory corruption due to a bug.
>
> Yeah, that's what I'm expecting. We already replaced the memory, next
> step is to move the disks over to another server to make sure it's not
> the board or cpu's.
>
>>> This time I have a few different messages though:
>>>
>>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
>>>  [#1] SMP
>>>
>>> RSI:  RDI: 1603a07305001568
>>>
>>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
>>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
>>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00  ff 4f 08 0f 94 c0 84
>>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb
>>
>> lock decl 0x8(%rdi)
>>
>> %rdi is completely crap, looks like corruption again.  Strangely, it is
>> similar to the bad spte from the previous trace: 0x1603a0730500d277.  The
>> upper 48 bits are identical, the lower 16 bits are different.:
>>>
>>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
>>> page table at address 7f37b37ff000
>>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
>>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067
>>
>> Here are those magic 48 bits again, in the PTE entry.
>>>
>>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
>>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
>>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
>>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
>>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
>>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1
>>
>> Again.
>>
>>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
>>> process qemu-kvm  pte:1603a0730500d067 pmd:61059f067
>>
>> Again.
>>
>> However, these all came from a single boot, yes?
>
> Correct.
>
>> If so they can be the same
>> corruption.  Please collect more traces, with reboots in between.
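
The "upper 48 bits are identical" observation in the quoted analysis can be checked mechanically. A small sketch, using only values copied from the traces quoted above; the helper names upper_48 and low_16 are made up for this illustration.

```python
# Values copied from the traces quoted in this thread.
SUSPECT_VALUES = [
    0x1603a0730500d277,  # bad spte from the previous trace
    0x1603a07305001568,  # RDI in the general protection fault
    0x1603a0730500e067,  # PTE in "Corrupted page table"
    0x1603a07305006277,  # level 1 spte in the EPT misconfiguration
    0x1603a0730500d067,  # pte in "Bad page map"
]


def upper_48(value):
    # Drop the low 16 bits of a 64-bit value.
    return value >> 16


def low_16(value):
    # Keep only the low 16 bits.
    return value & 0xFFFF


# A single element here means every value shares the same upper 48 bits.
common = {upper_48(v) for v in SUSPECT_VALUES}
print(sorted(hex(c) for c in common))
print([hex(low_16(v)) for v in SUSPECT_VALUES])
```

All five values share the prefix 0x1603a0730500 and differ only in the low 16 bits, which is what points at one recurring corrupt pattern instead of independent random damage.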

This machine has been running for a week without problems, but then we
started to get the following oopses again:

2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle
kernel paging request at ea71929180e0
2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP:
[] gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0
2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops:  [#1] SMP
2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file:
/sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4
2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun
ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb
iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded:
scsi_wait_scan]
2011-02-06T19:45:35.31+01:00 phy005 kernel:
2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm:
qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP:
0010:[]  []
gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP:
0018:88060b9bda78  EFLAGS: 00010082
2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0
RBX: 3000 RCX: 0005
2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40
RSI: 7fe54e3ff000 RDI: 1603a07305004067
2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98
R08: 880b94384560 R09: 88060b9bdb44
2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8
R11: ea00 R12: 0205
2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff
R14: 0005 R15: 
2011-02-06T19:45:35.55+01:00 phy0

Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Jan Kiszka
On 2011-02-10 15:47, Avi Kivity wrote:
> On 02/10/2011 04:34 PM, Jan Kiszka wrote:
>> On 2011-02-10 15:26, Avi Kivity wrote:
>>>  On 02/10/2011 03:47 PM, Jan Kiszka wrote:
>>>>>>
>>>>>> Except for mmu_shrink, which is write but not delete, thus works
>>>>>> without that slow synchronize_rcu.
>>>>>
>>>>> I don't really see how you can implement list_move_rcu(), it has to be
>>>>> atomic or other users will see a partial vm_list.
>>>>
>>>> Right, even if we synchronized that step cleanly, rcu-protected users
>>>> could miss the moving vm during concurrent list walks.
>>>>
>>>> What about using a separate mutex for protecting vm_list instead?
>>>> Unless I missed some detail, mmu_shrink should allow blocking.
>>>
>>>  What else does kvm_lock protect?
>>
>> Someone tried to write a locking.txt and stated that it's also
>> protecting enabling/disabling hardware virtualization. But that guy may
>> have overlooked something.
> 
> Right.  I guess splitting that lock makes sense.
> 
>>>
>>>  I think we could simply reduce the amount of time we hold kvm_lock.
>>>  Pick a vm, ref it, list_move_tail(), unlock, then do the actual
>>>  shrinking.  Of course taking a ref must be done carefully, we might
>>>  already be in kvm_destroy_vm() at that time.
>>>
>>
>> Plain mutex held across the whole mmu_shrink loop is still simpler and
>> should be sufficient - unless we also have to deal with scalability
>> issues if that handler is able to run concurrently. But based on how we
>> were using kvm_lock so far...
> 
> I don't think a mutex would work for kvmclock_cpufreq_notifier().  At 
> the very least, we'd need a preempt_disable() there.  At the worst, the 
> notifier won't like sleeping.

Damn, there was that other user. Yes, this means we need to break the
lock in mmu_shrink.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/10/2011 04:34 PM, Jan Kiszka wrote:

> On 2011-02-10 15:26, Avi Kivity wrote:
>>  On 02/10/2011 03:47 PM, Jan Kiszka wrote:
>>>>>
>>>>> Except for mmu_shrink, which is write but not delete, thus works without
>>>>> that slow synchronize_rcu.
>>>>
>>>>  I don't really see how you can implement list_move_rcu(), it has to be
>>>>  atomic or other users will see a partial vm_list.
>>>
>>>  Right, even if we synchronized that step cleanly, rcu-protected users
>>>  could miss the moving vm during concurrent list walks.
>>>
>>>  What about using a separate mutex for protecting vm_list instead?
>>>  Unless I missed some detail, mmu_shrink should allow blocking.
>>
>>  What else does kvm_lock protect?
>
> Someone tried to write a locking.txt and stated that it's also
> protecting enabling/disabling hardware virtualization. But that guy may
> have overlooked something.

Right.  I guess splitting that lock makes sense.

>>
>>  I think we could simply reduce the amount of time we hold kvm_lock.
>>  Pick a vm, ref it, list_move_tail(), unlock, then do the actual
>>  shrinking.  Of course taking a ref must be done carefully, we might
>>  already be in kvm_destroy_vm() at that time.
>>
>
> Plain mutex held across the whole mmu_shrink loop is still simpler and
> should be sufficient - unless we also have to deal with scalability
> issues if that handler is able to run concurrently. But based on how we
> were using kvm_lock so far...

I don't think a mutex would work for kvmclock_cpufreq_notifier().  At 
the very least, we'd need a preempt_disable() there.  At the worst, the 
notifier won't like sleeping.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Jan Kiszka
On 2011-02-10 15:26, Avi Kivity wrote:
> On 02/10/2011 03:47 PM, Jan Kiszka wrote:

>>>> Except for mmu_shrink, which is write but not delete, thus works without
>>>> that slow synchronize_rcu.
>>>
>>>  I don't really see how you can implement list_move_rcu(), it has to be
>>>  atomic or other users will see a partial vm_list.
>>
>> Right, even if we synchronized that step cleanly, rcu-protected users
>> could miss the moving vm during concurrent list walks.
>>
>> What about using a separate mutex for protecting vm_list instead?
>> Unless I missed some detail, mmu_shrink should allow blocking.
> 
> What else does kvm_lock protect?

Someone tried to write a locking.txt and stated that it's also
protecting enabling/disabling hardware virtualization. But that guy may
have overlooked something.

> 
> I think we could simply reduce the amount of time we hold kvm_lock.  
> Pick a vm, ref it, list_move_tail(), unlock, then do the actual 
> shrinking.  Of course taking a ref must be done carefully, we might 
> already be in kvm_destroy_vm() at that time.
> 

Plain mutex held across the whole mmu_shrink loop is still simpler and
should be sufficient - unless we also have to deal with scalability
issues if that handler is able to run concurrently. But based on how we
were using kvm_lock so far...

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/10/2011 03:47 PM, Jan Kiszka wrote:

>>>
>>>  Except for mmu_shrink, which is write but not delete, thus works without
>>>  that slow synchronize_rcu.
>>
>>  I don't really see how you can implement list_move_rcu(), it has to be
>>  atomic or other users will see a partial vm_list.
>
> Right, even if we synchronized that step cleanly, rcu-protected users
> could miss the moving vm during concurrent list walks.
>
> What about using a separate mutex for protecting vm_list instead?
> Unless I missed some detail, mmu_shrink should allow blocking.


What else does kvm_lock protect?

I think we could simply reduce the amount of time we hold kvm_lock.  
Pick a vm, ref it, list_move_tail(), unlock, then do the actual 
shrinking.  Of course taking a ref must be done carefully, we might 
already be in kvm_destroy_vm() at that time.
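
The "pick a vm, ref it, list_move_tail(), unlock, then shrink" sequence can be sketched as a user-space analogy in Python. This is not kernel code: VM, vm_list_lock, try_get, put and shrink_one are hypothetical names, a threading.Lock stands in for kvm_lock, and a refcount of zero stands in for a VM that is already inside kvm_destroy_vm().

```python
import threading
from collections import deque


class VM:
    def __init__(self, name, pages):
        self.name = name
        self.pages = pages   # stand-in for reclaimable shadow pages
        self.refcount = 1    # 0 means teardown has already started


vm_list = deque()
vm_list_lock = threading.Lock()   # plays the role of kvm_lock


def try_get(vm):
    # Caller holds vm_list_lock. Refuse VMs that are being destroyed.
    if vm.refcount == 0:
        return False
    vm.refcount += 1
    return True


def put(vm):
    with vm_list_lock:
        vm.refcount -= 1


def shrink_one(nr_to_scan):
    """Hold the list lock only long enough to choose and pin a victim."""
    with vm_list_lock:
        victim = None
        for vm in vm_list:
            if vm.pages > 0 and try_get(vm):
                victim = vm
                break
        if victim is None:
            return 0
        # Equivalent of list_move_tail(): the next call starts elsewhere.
        vm_list.remove(victim)
        vm_list.append(victim)
    # The lock is dropped here; the reference keeps the victim alive.
    freed = min(nr_to_scan, victim.pages)
    victim.pages -= freed
    put(victim)
    return freed
```

The design choice is that the expensive work happens outside the lock, so other users of the list are delayed only for the selection step. The price is the extra care in try_get().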


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 03:04:28PM +0100, Anthony Liguori wrote:
> On 02/10/2011 02:27 PM, Gleb Natapov wrote:
> >I don't care what the command line will look like, but I do not see how you
> >will support ide=off without device composition unless you put ad-hoc
> >ifs all over your i440fx device code.
> 
> Yes, in the piix3 device code, the ide property would trigger an if().
> 
> BTW, I'm extremely sceptical that you really do have machines w/o
> IDE at all.  Even the servers we ship with only SAS or SCSI support
> still have an integrated IDE controller.
> 
> Since most servers are built from the same chipset design that has
> IDE, I don't really see how you could build a modern system without
> IDE.
> 
Well, this may be true. But since I can't find IDE (or ATA) in either lspci
or dmesg, does it really matter that silicon that implements IDE
functionality is present somewhere inside the box?

> >>And that's okay, but the base modelling ought to follow real
> >>hardware closely with deviations being the exception.
> >>
> >You keep saying this without explaining why. But with device composition
> >you will have exactly that, you will compose real chipsets using config
> >files, not code.
> 
> Yeah, that's been the direction we've been going in since qdev was
> introduced.  I'm now convinced that this is overly ambitious.  By
> simply reducing the scope of conversion, we get 99% of the benefit
> with 10% of the effort.  Seems like a no brainer to me.
> 
Judging by how well all previous conversions went we will end up with one
more way of creating devices. One legacy, another qdev and your new one.
And what is the problem with qdev again (not that I am a big qdev fan)?
The fact that there is not enough interest to convert all devices to it?
How will a new way of doing things solve this?

Just to be clear, I do not have a problem with not having the ability to
compose x86 without pit or kbd controller. Basic things like RTC, pit,
pic, ioapic, dma, kbd should be created unconditionally as part of x86
pc machine. But IMHO you are trying to take things to the other extreme.

--
Gleb.


[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-02-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #25 from prochazka   2011-02-10 
14:16:51 ---
cmdline 
/usr/local/bin/qemu -name Soins_003 -vga std -net
tap,vlan=0,name=interne,ifname=vmtap5 -net
nic,vlan=0,macaddr=ac:de:48:1d:e8:2c,model=e1000 -cpu host -localtime -usb
-usbdevice tablet -vnc 10.98.98.19:120 -monitor
tcp:127.0.0.1:10120,server,nowait,nodelay -m 512 -pidfile
/var/run/qemu/Soins_003.pid -net
vde,port=70,vlan=5,sock=/tmpsafe/neoswitch_bridge,name=externe -net
nic,vlan=5,macaddr=ac:de:48:8c:cc:e0,model=e1000 -rtc base=localtime -drive
file=/mnt/vdisk/images/VM-Soins_003.1296578833.637768,index=0,media=disk,snapshot=on,cache=unsafe
-drive
file=/swapfile-guest/swap1,if=ide,index=1,media=disk,snapshot=on,boot=off -fda
fat:floppy:/mnt/vdisk/diskconf/Soins_003

KSM and transparent hugepages are enabled on this kernel.

Regards, 
Nicolas



[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-02-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #24 from prochazka   2011-02-10 
14:14:25 ---
I can now reproduce it under these circumstances on a different server:

- Windows XP SP2 guest: the guest OS seems to be important, other XP SP3
guests work fine
- connect with VNC to this guest and connect with RDP to the others
(5 or 6 guests).

kernel: 2.6.37
qemu-kvm with the hugepages option for comments #18 and #19.

/usr/local/bin/qemu -name XP_013 -vga std -net
tap,vlan=0,name=interne,ifname=vmtap28 -net
nic,vlan=0,macaddr=ac:de:48:88:e2:92,model=e1000 -cpu host -localtime -usb
-usbdevice tablet -vnc 10.98.98.13:135 -monitor
tcp:127.0.0.1:10135,server,nowait,nodelay -m 512 -pidfile
/var/run/qemu/XP_013.pid -net
vde,port=85,vlan=5,sock=/tmpsafe/neoswitch_bridge,name=externe -net
nic,vlan=5,macaddr=ac:de:48:7b:9e:ec,model=e1000 -mem-prealloc -mem-path
/hugepages -rtc base=localtime -drive
file=/mnt/vdisk/images/VM-XP_013.1297326902.381783,index=0,media=disk,snapshot=on,cache=unsafe
-drive
file=/swapfile-guest/swap1,if=ide,index=1,media=disk,snapshot=on,boot=off -fda
fat:floppy:/mnt/vdisk/diskconf/XP_013

Last kernel that works reliably: 2.6.34 (I did not test kernels between
2.6.34 and 2.6.37).


I just reproduced the bug with kernel 2.6.38-rc4, without hugepages
(kvm module from the 2.6.38-rc4 tree).


general protection fault:  [#4] SMP 
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 0 
Modules linked in: kvm_intel kvm bnx2

Pid: 15886, comm: qemu Tainted: G  D 2.6.38-rc4 #1 0P010H/PowerEdge
M600
RIP: 0010:[]  [] drop_spte+0xd5/0x1f0 [kvm]
RSP: 0018:8804d6cd5b88  EFLAGS: 00010246
RAX: c9001a2d2ff8 RBX: 88049dbc7c00 RCX: 880529dd6460
RDX:  RSI: 880529dd6460 RDI: 8807e30ba000
RBP: 8804d6cd5b98 R08:  R09: dead00200200
R10: dead00100100 R11:  R12: 8804d6efc000
R13: 8804d6cd5c08 R14:  R15: 88049dbc7c00
FS:  7f9b43455740() GS:8800bfc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 056ab000 CR3: 0004d6cfd000 CR4: 000426e0
DR0: 00a0 DR1:  DR2: 0003
DR3: 00b0 DR6: 0ff0 DR7: 0400
Process qemu (pid: 15886, threadinfo 8804d6cd4000, task 88050f22c000)
Stack:
 8804a5027f00 8804d6efc000 8804d6cd5bf8 a0031e7f
 fff5  8804d6cd5be8 0180
  8804d6efc000 8804a50276e0 8804d6cd5c08
Call Trace:
 [] kvm_mmu_prepare_zap_page+0x8f/0x2f0 [kvm]
 [] kvm_mmu_zap_all+0x4a/0x90 [kvm]
 [] kvm_arch_flush_shadow+0x16/0x30 [kvm]
 [] __kvm_set_memory_region+0x2c3/0x810 [kvm]
 [] ? hrtimer_start+0x18/0x20
 [] ? create_pit_timer+0xb7/0xd0 [kvm]
 [] ? pit_load_count+0xd3/0x120 [kvm]
 [] ? kvm_pit_load_count+0x22/0x60 [kvm]
 [] kvm_set_memory_region+0x43/0x70 [kvm]
 [] kvm_vm_ioctl_set_memory_region+0x1d/0x30 [kvm]
 [] kvm_vm_ioctl+0x1e5/0x3e0 [kvm]
 [] do_vfs_ioctl+0xa3/0x540
 [] ? sys_futex+0xce/0x170
 [] sys_ioctl+0x4f/0x80
 [] system_call_fastpath+0x16/0x1b
Code: 50 38 48 63 f6 48 8b 34 f2 0f b6 50 28 83 e2 0f eb b8 0f 1f 40 00 48 83
e6 fe 0f 84 d9 00 00 00 45 31 c0 0f 1f 00 48 89 f1 31 d2 <48> 8b 39 48 85 ff 74
10 48 39 fb 74 26 ff c2 48 83 c1 08 83 fa 
RIP  [] drop_spte+0xd5/0x1f0 [kvm]
 RSP 
---[ end trace a0f93d7b4fb495a7 ]---
general protection fault:  [#5] SMP 
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 5 
Modules linked in: kvm_intel kvm bnx2

Pid: 30332, comm: bash Tainted: G  D 2.6.38-rc4 #1 0P010H/PowerEdge
M600
RIP: 0010:[]  [] dup_fd+0x168/0x300
RSP: 0018:8805fbd03da0  EFLAGS: 00010202
RAX: 07f8 RBX: 8807e94179c0 RCX: bfff
RDX: 8807e3ef5480 RSI: 00ff RDI: 0800
RBP: 8805fbd03e00 R08: 8804f2c20280 R09: 0003
R10: 0001 R11: 4000 R12: 8804bf071000
R13: 8804f2c20540 R14: 8807dac23800 R15: 0100
FS:  7fb0a6a11700() GS:8800bfd4() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 00bf3000 CR3: 0007116cf000 CR4: 000426e0
DR0: 0003 DR1: 00b0 DR2: 0001
DR3:  DR6: 0ff0 DR7: 0400
Process bash (pid: 30332, threadinfo 8805fbd02000, task 880715cd1000)
Stack:
 88050005 00010282 0020 8806fa7dca40
 8807feaceec8 8807feacef40 7fb0a6a119d0 8807db5f7000
  01200011 7fb0a6a119d0 
Call Trace:
 [] copy_process+0xa02/0x1200
 [] do_fork+0x63/0x340
 [] ? _raw_spin_lock+0xe/0x20
 [] ? fd_install+0x67/0x90
 [] ? do_pipe_flags+0xb0/0x100
 [] sys_clone+0x28/0x30
 [] stub_clone+0x13/0x20
 [] ? system_call_fastpath+0x16/0x1b
Code: 4c 89 c2 e8 1b 35 23 00 45 85 ff 74 77

Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 02:27 PM, Gleb Natapov wrote:

> I don't care what the command line will look like, but I do not see how you
> will support ide=off without device composition unless you put ad-hoc
> ifs all over your i440fx device code.
   


Yes, in the piix3 device code, the ide property would trigger an if().

BTW, I'm extremely sceptical that you really do have machines w/o IDE at 
all.  Even the servers we ship with only SAS or SCSI support still have 
an integrated IDE controller.


Since most servers are built from the same chipset design that has IDE, 
I don't really see how you could build a modern system without IDE.



>> And that's okay, but the base modelling ought to follow real
>> hardware closely with deviations being the exception.
>
> You keep saying this without explaining why. But with device composition
> you will have exactly that, you will compose real chipsets using config
> files, not code.
   


Yeah, that's been the direction we've been going in since qdev was 
introduced.  I'm now convinced that this is overly ambitious.  By simply 
reducing the scope of conversion, we get 99% of the benefit with 10% of 
the effort.  Seems like a no brainer to me.


Regards,

Anthony Liguori


--
Gleb.

   




Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 02:00 PM, Avi Kivity wrote:

> On 02/10/2011 02:51 PM, Anthony Liguori wrote:
>> On 02/10/2011 12:13 PM, Gleb Natapov wrote:
>>>
>>> Which spec? Even in this discussion we completely mixed different
>>> things. 440FX is not a chipset.
>>
>> Yes, it is.  It's a single silicon package with a defined pinout.  If
>> you don't believe me, re-read the spec.
>>
>> It's a MCM with the PIIX3 being internally connected.   The
>> connection between the i440fx and PIIX3 happens to be PCI but that's
>> not always the case.  Sometimes it's a proprietary bus.
>
> Aren't they two distinct chips, together comprising the chip-set?
>
> One (the northbridge) converts the system bus to PCI + some extra
> wires, the other (southbridge) bridges PCI to ISA and contains some
> embedded ISA devices.  IIRC there are some wires between them that are
> not PCI.


Yes, you are correct.  So I can understand an argument for:

  -device i440fx,id=pmc -device piix3,chipset=pmc

Or something like that.

Regards,

Anthony Liguori






[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-02-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #23 from Marcelo Tosatti   2011-02-10 13:50:08 
---
Nicolas,

On comment #2 you mention the bug could not be reproduced, but in comment #3
you report it without hugepages enabled. So, were you using hugepages or not, 
in the reports #18 and #19?

Another thing, what is the last kernel version that works reliably under this
workload?



Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Jan Kiszka
On 2011-02-10 14:19, Avi Kivity wrote:
> On 02/10/2011 03:14 PM, Jan Kiszka wrote:
>> On 2011-02-10 13:57, Avi Kivity wrote:
>>>  On 02/10/2011 02:56 PM, Avi Kivity wrote:
>>>>>  What's the benefit? The downside is a bit more complexity as you need an
>>>>>  additional callback handler.
>>>>
>>>>  synchronize_rcu() can be very slow (it's a systemwide operation), and
>>>>  mmu_shrink() can be called often on a loaded system.
>>>>
>>>
>>>  In fact this just shows that vm_list is not a good candidate for rcu;
>>>  rcu is useful where most operations are reads, but if we discount stats,
>>>  most operations on vm_list are going to be writes.
>>
>> Except for mmu_shrink, which is write but not delete, thus works without
>> that slow synchronize_rcu.
> 
> I don't really see how you can implement list_move_rcu(), it has to be 
> atomic or other users will see a partial vm_list.

Right, even if we synchronized that step cleanly, rcu-protected users
could miss the moving vm during concurrent list walks.

What about using a separate mutex for protecting vm_list instead?
Unless I missed some detail, mmu_shrink should allow blocking.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Eduardo Habkost
On Thu, Feb 10, 2011 at 01:03:53PM +0200, Avi Kivity wrote:
> On 02/10/2011 12:57 PM, Michael Goldish wrote:
> >>
> >>  I can't easily think of a case where this might cause confusion.  The
> >>  purpose of this is to allow people to write:
> >>
> >>  only qcow2..raw..rtl8139
> >>
> >>  without having to remember the order in which those were defined in
> >>  tests_base.cfg.
> >
> >Sorry, I meant something like
> >
> >only qcow2..hugepages..rtl8139
> >
> >Obviously qcow2 and raw can't coexist.
> 
> The config files describe a cartesian product, in which order matters.

Mathematically speaking, the ordering in the result is different, but BA
and AB are often equivalent for the user.

In many situations, people don't care in which order (as an example)
"qcow" and "ide" are defined on the base config, they just want to
exclude the combination of "qcow" and "ide".

> 
> [A B C] x [1 2] generates [A1 A2 B1 B2 C1 C2]; no confusion here if
> you specify A..1
> 
> however
> 
> [A B C] x [A B] generates [AA AB BA BB CA CB]; A..B is ambiguous

If you do the above and reuse keywords, "A" is also ambiguous, "B" is
also ambiguous. "A..B" being ambiguous is a consequence of "A" and "B"
being ambiguous. If you don't want to be ambiguous, just use "A.B" or
"B.A".

> 
> we might require that keywords be unique.

I wouldn't be against that. At least for the use cases I see, people
have been assuming that keywords are unique on most "only" and "no"
statements.
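
A small sketch of why reused keywords make selection ambiguous. It is not the kvm_config.py parser: expand and select are hypothetical helpers, and select is deliberately order-insensitive, which is only one possible reading of what a user means by naming two keywords.

```python
from itertools import product


def expand(*axes):
    """Cartesian product of the variant axes, in definition order."""
    return [tuple(combo) for combo in product(*axes)]


def select(variants, *keywords):
    # Hypothetical order-insensitive filter: keep every variant that
    # contains all of the given keywords, wherever they appear.
    return [v for v in variants if all(k in v for k in keywords)]


unique = expand(["A", "B", "C"], ["1", "2"])   # keywords never repeat
reused = expand(["A", "B", "C"], ["A", "B"])   # "A" and "B" appear on both axes

print(select(unique, "A", "1"))   # exactly one variant
print(select(reused, "A", "B"))   # both AB and BA match
print(select(reused, "A"))        # "A" alone already matches on either axis
```

With unique keywords the order of definition stops mattering to the user, which is the case for requiring uniqueness.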

-- 
Eduardo


[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-02-10 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #22 from Marcelo Tosatti   2011-02-10 13:36:25 
---
Problem description:

Present spte is dropped while syncing 32-bit level 1 shadow page. But
sp->gfns[index] contains uninitialized value (0 or f001), so
gfn->rmap conversion in rmap_remove fails.

However, debug patch from comment #18 verifies that on present spte
instantiation, via mmu_set_spte, sp->gfns[] is initialized correctly.

From bug instances of comments 19 and 20, index == 511.



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 03:00:05PM +0200, Avi Kivity wrote:
> On 02/10/2011 02:51 PM, Anthony Liguori wrote:
> >On 02/10/2011 12:13 PM, Gleb Natapov wrote:
> >>
> >>Which spec? Even in this discussion we completely mixed different
> >>things. 440FX is not a chipset.
> >
> >Yes, it is.  It's a single silicon package with a defined pinout.
> >If you don't believe me, re-read the spec.
> >
> >It's a MCM with the PIIX3 being internally connected.   The
> >connection between the i440fx and PIIX3 happens to be PCI but
> >that's not always the case.  Sometimes it's a proprietary bus.
> 
> Aren't they two distinct chips, together comprising the chip-set?
> 
> One (the northbridge) converts the system bus to PCI + some extra
> wires, the other (southbridge) bridges PCI to ISA and contains some
> embedded ISA devices.  IIRC there are some wires between them that
> are not PCI.
> 
Yeah, 440fx is probably northbridge and PIIX3 southbridge.

--
Gleb.


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 01:51:14PM +0100, Anthony Liguori wrote:
> On 02/10/2011 12:13 PM, Gleb Natapov wrote:
> >
> >Which spec? Even in this discussion we completely mixed different
> >things. 440FX is not a chipset.
> 
> Yes, it is.  It's a single silicon package with a defined pinout.
> If you don't believe me, re-read the spec.
> 
> It's a MCM with the PIIX3 being internally connected.   The
> connection between the i440fx and PIIX3 happens to be PCI but that's
> not always the case.  Sometimes it's a proprietary bus.
> 
Which one? 29054901.pdf describes memory controller and PCI host bridge only.
 
> >Again you probably mean PIIX3. Even then removing unused ide will free
> >one more PCI slot for my cool virtio disk array. The thing is, from a
> >code point of view, it does not cost you extra to allow composition of
> >ide since it is just a regular PCI device and we need to support composing
> >those anyway.
> 
> If this is useful, and it doesn't break guests, you can always do
> -device i440fx,ide=off.  However, it's an exception where we're
> deviating from how hardware works.
> 
I don't care what the command line will look like, but I do not see how you
will support ide=off without device composition unless you put ad-hoc
ifs all over your i440fx device code.

And I don't understand what you mean by saying that this is not how
hardware works. The presence or absence of a PCI device does not change how
hardware works.

> And that's okay, but the base modelling ought to follow real
> hardware closely with deviations being the exception.
> 
You keep saying this without explaining why. But with device composition
you will have exactly that, you will compose real chipsets using config
files, not code.
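
A minimal sketch of what "config files, not code" could look like, in
Python for brevity. The description format, the device names and the
compose() helper are all made up for illustration; none of this is QEMU's
actual configuration syntax or API.

```python
# Hypothetical chipset description kept as data rather than code.
PIIX3_LIKE = {
    "name": "piix3-like",
    "devices": ["pci-host", "isa-bridge", "ide", "usb-uhci"],
}


def compose(description, without=()):
    """Return the device list for a machine, minus the devices left out."""
    unknown = set(without) - set(description["devices"])
    if unknown:
        raise ValueError(
            "not part of %s: %s" % (description["name"], sorted(unknown))
        )
    return [dev for dev in description["devices"] if dev not in without]


# Leaving out IDE is a change to the data, with no ad-hoc "if" in device code.
ide_less = compose(PIIX3_LIKE, without=["ide"])
```

With this shape, an "ide=off" machine is just another description (or the
same one with an exclusion), which is the point being argued above.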

--
Gleb.


Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/10/2011 03:14 PM, Jan Kiszka wrote:
> On 2011-02-10 13:57, Avi Kivity wrote:
> >  On 02/10/2011 02:56 PM, Avi Kivity wrote:
> >>>  What's the benefit? The downside is a bit more complexity as you need an
> >>>  additional callback handler.
> >>
> >>  synchronize_rcu() can be very slow (it's a system-wide operation), and
> >>  mmu_shrink() can be called often on a loaded system.
> >
> >  In fact this just shows that vm_list is not a good candidate for rcu;
> >  rcu is useful where most operations are reads, but if we discount stats,
> >  most operations on vm_list are going to be writes.
>
> Except for mmu_shrink, which is write but not delete, thus works without
> that slow synchronize_rcu.

I don't really see how you can implement list_move_rcu(); it has to be
atomic or other users will see a partial vm_list.

> And I don't see the need for call_rcu in the
> vm deletion path.

synchronize_rcu() is fine for vm destruction.
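
To illustrate the list_move_rcu() concern: a move is really an unlink
followed by a relink, and a lockless reader can run between the two steps.
A toy Python sketch, where a plain list stands in for vm_list (this is
not kernel code, and the helper name is made up):

```python
def move_to_tail_steps(vm_list, vm):
    """Yield a snapshot of the list after each step of a non-atomic move."""
    vm_list.remove(vm)        # step 1: unlink the entry
    yield list(vm_list)       # what a concurrent reader could observe here
    vm_list.append(vm)        # step 2: relink it at the tail
    yield list(vm_list)


vm_list = ["vm0", "vm1", "vm2"]
states = list(move_to_tail_steps(vm_list, "vm0"))
```

The first snapshot is missing vm0 entirely. With a real linked list it
can be worse: a reader standing on the moved entry follows its new next
pointer and skips whatever used to come after it.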

--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Jan Kiszka
On 2011-02-10 13:57, Avi Kivity wrote:
> On 02/10/2011 02:56 PM, Avi Kivity wrote:
>>> What's the benefit? The downside is a bit more complexity as you need an
>>> additional callback handler.
>>
>>
>> synchronize_rcu() can be very slow (it's a system-wide operation), and
>> mmu_shrink() can be called often on a loaded system.
>>
> 
> In fact this just shows that vm_list is not a good candidate for rcu; 
> rcu is useful where most operations are reads, but if we discount stats, 
> most operations on vm_list are going to be writes.

Except for mmu_shrink, which is write but not delete, thus works without
that slow synchronize_rcu. And I don't see the need for call_rcu in the
vm deletion path.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 01:47:06PM +0100, Anthony Liguori wrote:
> On 02/10/2011 11:49 AM, Gleb Natapov wrote:
> >On Thu, Feb 10, 2011 at 11:19:48AM +0100, Anthony Liguori wrote:
> >>On 02/10/2011 11:10 AM, Gleb Natapov wrote:
> >>>On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote:
> >>>>On 02/10/2011 10:07 AM, Gleb Natapov wrote:
> >>>>>So what if it is easier, it doesn't mean it is correct thing to do.
> >>>>If we spend the next 10 years trying to do the "correct thing" for
> >>>>some arbitrary definition of correct, that's not terribly useful.
> >>>Changing direction by 180 every 2 years even less useful.
> >>If we think through what we are doing and have a coherent
> >>architecture before changing direction, then we won't have this
> >>problem.
> >>
> >I'd like to believe this :)
> >
> >>>>It's really simple actually.  Let's do the least clever thing and
> >>>>model how hardware actually works.  Once we have that, we can try to
> >>>>be better than real hardware (if it's possible).
> >>>I think our understanding of how HW actually works is very different.
> >>>You are placing too much value on where a device resides physically; for me
> >>>it is a completely unimportant detail. Not worth even mentioning.
> >>No, I place value on how things are modelled in the real world.
> >Real world (physical HW) have consideration not relevant for our
> >software emulation. Such as cost, physical dimension, power consumption
> >and many other I am sure I missed.
> >
> >>There simply aren't PC's out there that lack an RTC so I have no
> >>interest in jumping through hoops in QEMU to make it possible to do
> >>this without modifying QEMU code.  It might sound nice to a
> >>developer but it's of absolutely no use to users.
> >>
> >RTC is not a good example. HPET is supposed to replace it (and PIT too).
> 
> HPET's embed RTCs to provide support for legacy implementations.
> This is extremely good example of where our modelling breaks down.
> Take a close look at how the HPET and RTC emulations interact for an
> example of why we'd be much better off just implementing an RTC
> within an HPET.
> 
Yes, HPET can provide legacy RTC timer functionality. No, I do not see why
we should implement RTC within HPET. In your model we should remove
HPET code completely since HPET is not present in the chipset emulated by
QEMU.

> >  AFAIC
> >there are PCs without RTC already.
> 
> RTC also provides CMOS functionality and no PC can boot without
> CMOS.  So no, there's nothing we'd consider a PC today that doesn't
> have an RTC.
CMOS may be present even if RTC functionality is absent. Does an EFI-based
machine still need CMOS though?

> 
> >  Good example would be PIC or IOAPIC
> >device and then I would agree with you that it is not worth it to make
> >it possible to create x86 machine without them from command line if it
> >means extra complexity. But how have you jumped from this to "lets make usb
> >mandatory"?
> 
> USB is mandatory in the PIIX3 but the only significant difference
> between the piix2 and piix3 is the addition of USB.
> Consequentially, the main difference between an i440fx and i440bx is
> the use of a piix2 vs. a piix3.  So if you really want to create the
> same PC we have today w/o USB, the right way to do it would be to
> have:
> 
> -device i440,model=fx   // with USB
> -device i440,model=bx  // w/o USB

Why not qemu -config piix2.cfg or qemu -config piix3.cfg? No need to
make data into code.

> 
> 
> >>No, we don't. It's possible to have an 'rtc=off' option but I'm
> >>tremendously opposed to doing this.  Arbitrary composition is not a
> >>useful goal IMHO.
> >IMHO is different. We should support composition where it makes sense.
> >For PIC-less x86 it doesn't make it. For usb-less or even ide-less it
> >does.
> 
> The right way to do a USB-less PC is to have an option to create an i440bx.
Why is this the right way?

> 
> An IDE-less PC is a bit more difficult because IDE is really baked
> into the concept of a PC.  Chances are, there are more than a few
> guests out there that would have issues from there being no IDE bus
> present.
> 
None of my modern PCs has IDE. Many high-end PCs had SCSI instead of IDE
in the past. If a guest can't run without IDE, you do not run it without
IDE.

> >>>  So why do you like -device i440fx over what we have now?
> >>Because I don't think tools like libvirt should be doing device
> >>composition to create an i440fx-like chipset.  I think the current
> >>path we're on is pushing too much logic that belongs in QEMU into
> >>the management stack.
> >I can agree with that. But from this it doesn't follow that we should
> >get rid of composition. We shouldn't push composition of common HW to
> >libvirt. Looking at libvirt command line I do not think we do it though.
> >Typical libvirt command line specifies disks, networks, usb, vga. How
> >will -device i440fx simplify that? Well usb could be omitted (but not
> >-usbdevice table), disks are not property of i440fx so they will stay,
> >since 

Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Peter Maydell
On 10 February 2011 12:23, Anthony Liguori  wrote:
> But something interacts with each processor and dispatches the I/O
> operations in the address space, no?  I can't believe there are 2^32 address
> lines coming off of every arm chip that each device connects.

Well, the AXI bus is kind of complicated and definitely not my
area of expertise, but as I understand it you have an interconnect
like a PL300 that effectively implements the "memory map" and
defines where the slaves (devices) appear. But unless you actually
want to be modelling bus transactions at a pretty low level this
isn't really a visible difference from "these devices appear at
this address in the memory map on this bus". (And there might
be a bridge down from AXI to AHB or APB between the core
and any particular device, but that's not programmer visible either.)

> This relationship of how I/O fans out through various devices is important
> because occasionally platforms do weird things during I/O fan out like
> implement an IOMMU.  If we don't model this I/O dispatch model within QEMU,
> then it's extremely difficult to implement things like IOMMUs.

Yes, but what does this have to do with chipsets and getting rid
of machines? Getting I/O fanout through devices is a matter of
modelling some sort of conceptual bus, and having the right APIs
so you can do it fast in the common case and still allow IOMMUs
and other interesting devices to intercept and change transactions.
Any particular board might have to wire up the bus so it goes through
an IOMMU, or it might not.

Whether you want to bundle up a collection of devices and bus wiring
and call it a "chipset" or not should be a matter of whether that makes
sense and is a usefully reusable conceptual unit for whatever board
you're modelling, I think. (For instance "an OMAP3" is an obvious
reusable unit which any OMAP3-based board model is going to want
to use.)

Some of the I/O fanout and bus wiring might be internal to
a qemu core model, for that matter -- for instance M profile
ARM cores have several output buses which deal with
different bits of the memory space (which are predefined
as being for devices, or memory, or whatever), and the
A9MP's internal timers and interrupt controller and so on ought
to all be inside the core (at the moment we rely on all A9MP
boards instantiating them as a separate device, which is ugly).
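
A rough sketch of the "conceptual bus" idea, assuming nothing about QEMU's
real memory API: devices are mapped at addresses on a bus, and an IOMMU is
just one more component that a board may or may not wire in front of that
bus. All class and method names here are hypothetical.

```python
class Ram:
    """Trivial byte-addressable device."""

    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset):
        return self.data[offset]

    def write(self, offset, value):
        self.data[offset] = value


class Bus:
    """Dispatches an access to whichever device is mapped at the address."""

    def __init__(self):
        self.regions = []  # list of (base, size, device)

    def map(self, base, size, device):
        self.regions.append((base, size, device))

    def _find(self, addr):
        for base, size, device in self.regions:
            if base <= addr < base + size:
                return device, addr - base
        raise LookupError("unmapped address 0x%x" % addr)

    def read(self, addr):
        device, offset = self._find(addr)
        return device.read(offset)

    def write(self, addr, value):
        device, offset = self._find(addr)
        device.write(offset, value)


class Iommu:
    """Sits between a bus master and a downstream bus and remaps pages."""

    def __init__(self, downstream, page_size=0x1000):
        self.downstream = downstream
        self.page_size = page_size
        self.table = {}  # input page number -> output page number

    def _translate(self, addr):
        page, offset = divmod(addr, self.page_size)
        if page not in self.table:
            raise PermissionError("IOMMU fault at 0x%x" % addr)
        return self.table[page] * self.page_size + offset

    def read(self, addr):
        return self.downstream.read(self._translate(addr))

    def write(self, addr, value):
        self.downstream.write(self._translate(addr), value)


# Board wiring: RAM at 0x80000000, and a master that goes through the IOMMU.
bus = Bus()
ram = Ram(0x2000)
bus.map(0x80000000, 0x2000, ram)
iommu = Iommu(bus)
iommu.table[0] = 0x80001          # input page 0 -> bus page 0x80001
iommu.write(0x10, 0xAB)           # lands at bus address 0x80001010
```

A board without an IOMMU would hand the master the Bus directly; nothing
about the devices themselves has to change.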

-- PMM


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Avi Kivity

On 02/10/2011 02:51 PM, Anthony Liguori wrote:
> On 02/10/2011 12:13 PM, Gleb Natapov wrote:
>> Which spec? Even in this discussion we completely mixed different
>> things. 440FX is not a chipset.
>
> Yes, it is.  It's a single silicon package with a defined pinout.  If
> you don't believe me, re-read the spec.
>
> It's a MCM with the PIIX3 being internally connected.   The connection
> between the i440fx and PIIX3 happens to be PCI but that's not always
> the case.  Sometimes it's a proprietary bus.


Aren't they two distinct chips, together comprising the chip-set?

One (the northbridge) converts the system bus to PCI + some extra wires, 
the other (southbridge) bridges PCI to ISA and contains some embedded 
ISA devices.  IIRC there are some wires between them that are not PCI.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/10/2011 02:56 PM, Avi Kivity wrote:
>> What's the benefit? The downside is a bit more complexity as you need an
>> additional callback handler.
>
> synchronize_rcu() can be very slow (it's a system-wide operation), and
> mmu_shrink() can be called often on a loaded system.

In fact this just shows that vm_list is not a good candidate for rcu;
rcu is useful where most operations are reads, but if we discount stats,
most operations on vm_list are going to be writes.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/10/2011 02:45 PM, Jan Kiszka wrote:
> >>>   There is no list_move_tail_rcu().
> >>
> >>  ...specifically not for this one.
> >
> >  Well, we can add one if needed (and if possible).
>
> I can have a look, at least at the lower hanging fruits.

Please keep rcu->parent in the loop.

> >>>   Why check kvm->deleted?  it's in the process of being torn down anyway,
> >>>   it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.
> >>
> >>  kvm_destroy_vm removes a vm from the list while mmu_shrink is running.
> >>  Then mmu_shrink's list_move_tail will re-add that vm to the list tail
> >>  again (unless already the removal in move_tail produces a crash).
> >
> >  It's too subtle.  Communication across threads with a variable needs
> >  memory barriers (even though they're nops on x86) and documentation.
>
> The barriers are provided by this spin lock we acquire for testing and
> modifying deleted.

Right.

I'm not thrilled with adding ->deleted though.

> >  btw, not even sure if it's legal: you have a mutating call within an rcu
> >  read critical section for the same object.  If synchronize_rcu() were
> >  called there, would it ever terminate?
>
> Why not? kvm_destroy_vm is not preventing blocking mmu_shrink to acquire
> the kvm_lock where we then find the vm deleted and release both kvm_lock
> and the rcu read "lock" afterwards.

synchronize_rcu() waits until all currently running rcu read-side
critical sections are completed.  But we are in the middle of one, which
isn't going to complete until synchronize_rcu() returns.

> >  (not that synchronize_rcu() is a good thing there, better do it with
> >  call_rcu()).
>
> What's the benefit? The downside is a bit more complexity as you need an
> additional callback handler.

synchronize_rcu() can be very slow (it's a system-wide operation), and
mmu_shrink() can be called often on a loaded system.
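
A toy, single-threaded model of the two points above, in Python. It is
not real RCU and the class is invented for illustration: a grace period
ends only once every reader that was active when it started has left its
read-side section, so a blocking wait issued from inside such a section
can never finish, while a call_rcu()-style callback simply runs later.

```python
class ToyRCU:
    """Bookkeeping-only model of RCU grace periods (no real concurrency)."""

    def __init__(self):
        self.active_readers = set()
        self.pending = []  # list of (readers still to wait for, callback)

    def read_lock(self, reader):
        self.active_readers.add(reader)

    def read_unlock(self, reader):
        self.active_readers.discard(reader)
        self._run_ready_callbacks()

    def call_rcu(self, callback):
        # Non-blocking: note which readers exist right now and return.
        self.pending.append((set(self.active_readers), callback))
        self._run_ready_callbacks()

    def synchronize_would_deadlock(self, caller):
        # A blocking wait needs every current reader to finish first. If the
        # caller is one of those readers, it is waiting for itself.
        return caller in self.active_readers

    def _run_ready_callbacks(self):
        pending, self.pending = self.pending, []
        for readers, callback in pending:
            readers &= self.active_readers   # drop readers that have left
            if readers:
                self.pending.append((readers, callback))
            else:
                callback()


rcu = ToyRCU()
freed = []
rcu.read_lock("mmu_shrink")
rcu.call_rcu(lambda: freed.append("vm"))     # returns immediately
freed_while_reading = list(freed)            # nothing freed yet
deadlock = rcu.synchronize_would_deadlock("mmu_shrink")
rcu.read_unlock("mmu_shrink")                # grace period can end now
```

The cost argument is the same picture from the other side: the blocking
variant stalls its caller for a whole grace period on every call, while
the callback variant lets the caller carry on.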


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 12:13 PM, Gleb Natapov wrote:
> Which spec? Even in this discussion we completely mixed different
> things. 440FX is not a chipset.

Yes, it is.  It's a single silicon package with a defined pinout.  If
you don't believe me, re-read the spec.

It's a MCM with the PIIX3 being internally connected.   The connection
between the i440fx and PIIX3 happens to be PCI but that's not always the
case.  Sometimes it's a proprietary bus.

> Again you probably mean PIIX3. Even then removing unused ide will free
> one more PCI slot for my cool virtio disk array. The thing is, from a
> code point of view, it does not cost you extra to allow composition of
> ide since it is just a regular PCI device and we need to support composing
> those anyway.

If this is useful, and it doesn't break guests, you can always do
-device i440fx,ide=off.  However, it's an exception where we're
deviating from how hardware works.

And that's okay, but the base modelling ought to follow real hardware
closely with deviations being the exception.


Regards,

Anthony Liguori


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 11:49 AM, Gleb Natapov wrote:
> On Thu, Feb 10, 2011 at 11:19:48AM +0100, Anthony Liguori wrote:
>> On 02/10/2011 11:10 AM, Gleb Natapov wrote:
>>> On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote:
>>>> On 02/10/2011 10:07 AM, Gleb Natapov wrote:
>>>>> So what if it is easier, it doesn't mean it is correct thing to do.
>>>> If we spend the next 10 years trying to do the "correct thing" for
>>>> some arbitrary definition of correct, that's not terribly useful.
>>> Changing direction by 180 every 2 years even less useful.
>> If we think through what we are doing and have a coherent
>> architecture before changing direction, then we won't have this
>> problem.
>
> I'd like to believe this :)
>
>>>> It's really simple actually.  Let's do the least clever thing and
>>>> model how hardware actually works.  Once we have that, we can try to
>>>> be better than real hardware (if it's possible).
>>> I think our understanding of how HW actually works is very different.
>>> You are placing too much value on where a device resides physically; for me
>>> it is a completely unimportant detail. Not worth even mentioning.
>> No, I place value on how things are modelled in the real world.
> Real world (physical HW) have consideration not relevant for our
> software emulation. Such as cost, physical dimension, power consumption
> and many other I am sure I missed.
>
>> There simply aren't PC's out there that lack an RTC so I have no
>> interest in jumping through hoops in QEMU to make it possible to do
>> this without modifying QEMU code.  It might sound nice to a
>> developer but it's of absolutely no use to users.
>
> RTC is not a good example. HPET is supposed to replace it (and PIT too).

HPET's embed RTCs to provide support for legacy implementations.   This
is an extremely good example of where our modelling breaks down.  Take a
close look at how the HPET and RTC emulations interact for an example of
why we'd be much better off just implementing an RTC within an HPET.

> AFAIC
> there are PCs without RTC already.

RTC also provides CMOS functionality and no PC can boot without CMOS.
So no, there's nothing we'd consider a PC today that doesn't have an RTC.

> Good example would be PIC or IOAPIC
> device and then I would agree with you that it is not worth it to make
> it possible to create x86 machine without them from command line if it
> means extra complexity. But how have you jumped from this to "lets make usb
> mandatory"?

USB is mandatory in the PIIX3 but the only significant difference
between the piix2 and piix3 is the addition of USB.  Consequentially,
the main difference between an i440fx and i440bx is the use of a piix2
vs. a piix3.  So if you really want to create the same PC we have today
w/o USB, the right way to do it would be to have:

-device i440,model=fx   // with USB
-device i440,model=bx  // w/o USB

>> No, we don't. It's possible to have an 'rtc=off' option but I'm
>> tremendously opposed to doing this.  Arbitrary composition is not a
>> useful goal IMHO.
> IMHO is different. We should support composition where it makes sense.
> For PIC-less x86 it doesn't make it. For usb-less or even ide-less it
> does.

The right way to do a USB-less PC is to have an option to create an i440bx.

An IDE-less PC is a bit more difficult because IDE is really baked into
the concept of a PC.  Chances are, there are more than a few guests out
there that would have issues from there being no IDE bus present.

>>> So why do you like -device i440fx over what we have now?
>> Because I don't think tools like libvirt should be doing device
>> composition to create an i440fx-like chipset.  I think the current
>> path we're on is pushing too much logic that belongs in QEMU into
>> the management stack.
> I can agree with that. But from this it doesn't follow that we should
> get rid of composition. We shouldn't push composition of common HW to
> libvirt. Looking at libvirt command line I do not think we do it though.
> Typical libvirt command line specifies disks, networks, usb, vga. How
> will -device i440fx simplify that? Well usb could be omitted (but not
> -usbdevice table), disks are not property of i440fx so they will stay,
> since user may want to use virtio controller (which is not part of
> i440fx) this should stay too. Network obviously will have to be
> specified by libvirt too, vga may go to i440fx, but since libvirt
> supports qxl we will have to have a way to disable default vga and
> enable qxl instead. So will we really simplify libvirt's life by
> introducing -device i440fx?
   


libvirt also uses -no-defaults which prevents much of the PC's machine 
init from creating anything but stuff that really belongs in the main 
chipset.


But I bet if you asked 5 different QEMU developers what belongs in 
machine init and what the role of -no-defaults is, you'd get different 
answers.


OTOH, skipping any notion of machine and explicitly creating a chipset 
provides a very consistent 

Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Jan Kiszka
On 2011-02-10 13:34, Avi Kivity wrote:
> On 02/10/2011 01:31 PM, Jan Kiszka wrote:
>>>

>>>>  @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>>>>    spin_unlock(&kvm->mmu_lock);
>>>>    srcu_read_unlock(&kvm->srcu, idx);
>>>>    }
>>>>  - if (kvm_freed)
>>>>  - list_move_tail(&kvm_freed->vm_list,&vm_list);
>>>>  + if (kvm_freed) {
>>>>  + raw_spin_lock(&kvm_lock);
>>>>  + if (!kvm->deleted)
>>>>  + list_move_tail(&kvm_freed->vm_list,&vm_list);
>>>
>>>  There is no list_move_tail_rcu().
>>
>> ...specifically not for this one.
> 
> Well, we can add one if needed (and if possible).

I can have a look, at least at the lower hanging fruits.

> 
>>>
>>>  Why check kvm->deleted?  it's in the process of being torn down anyway,
>>>  it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.
>>
>> kvm_destroy_vm removes a vm from the list while mmu_shrink is running.
>> Then mmu_shrink's list_move_tail will re-add that vm to the list tail
>> again (unless already the removal in move_tail produces a crash).
> 
> It's too subtle.  Communication across threads with a variable needs 
> memory barriers (even though they're nops on x86) and documentation.

The barriers are provided by this spin lock we acquire for testing and
modifying deleted.

> 
> btw, not even sure if it's legal: you have a mutating call within an rcu 
> read critical section for the same object.  If synchronize_rcu() were 
> called there, would it ever terminate?

Why not? kvm_destroy_vm is not preventing blocking mmu_shrink to acquire
the kvm_lock where we then find the vm deleted and release both kvm_lock
and the rcu read "lock" afterwards.

> 
> (not that synchronize_rcu() is a good thing there, better do it with 
> call_rcu()).

What's the benefit? The downside is a bit more complexity as you need an
additional callback handler.
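
For reference, the ->deleted check under the lock discussed above can be
sketched like this in Python. It is a simplified stand-in, not the patch
itself: threading.Lock plays the role of kvm_lock, and the class and
function names are made up.

```python
import threading


class ToyVM:
    def __init__(self, name):
        self.name = name
        self.deleted = False


kvm_lock = threading.Lock()
vm_list = []


def destroy_vm(vm):
    """Mark the VM deleted and unlink it, both under the lock."""
    with kvm_lock:
        vm.deleted = True
        vm_list.remove(vm)


def move_to_tail(vm):
    """Move a VM to the list tail unless it has been destroyed meanwhile."""
    with kvm_lock:
        if vm.deleted:
            return False          # too late, do not re-add it
        vm_list.remove(vm)
        vm_list.append(vm)
        return True


vm_a, vm_b = ToyVM("a"), ToyVM("b")
vm_list.extend([vm_a, vm_b])
moved_first = move_to_tail(vm_a)      # list is now [b, a]
destroy_vm(vm_a)                      # list is now [b]
moved_after_destroy = move_to_tail(vm_a)
```

Because the flag is only read and written with the lock held, the lock
itself supplies the ordering, which is the argument made above.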

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Lucas Meneghel Rodrigues
On Thu, 2011-02-10 at 09:18 +0800, Amos Kong wrote:
> On Wed, Feb 09, 2011 at 11:28:56AM +0200, Avi Kivity wrote:
> > On 02/09/2011 03:50 AM, Michael Goldish wrote:
> > >This is a reimplementation of the dict generator.  It is much faster than
> > >the current implementation and uses a very small amount of memory.
> > >Running time and memory usage scale polynomially with the number of
> > >defined variants, compared to exponentially in the current implementation.
> > >
> > >Instead of regular expressions in the filters, the following syntax is
> > >used:
> > >
> > >, means OR
> > >.. means AND
> > >. means IMMEDIATELY-FOLLOWED-BY
> > >
> > >Example:
> > >
> > >only qcow2..Fedora.14, RHEL.6..raw..boot, smp2..qcow2..migrate..ide
> > >
> > 
> > 
> > Is it not possible to keep the old syntax?  Breaking people's
> > scripts is bad.
> 
> we only need to convert the config file; it's not too complex

Yes, the benefits of the new format outweigh the inconveniences. As for
my opinion on the operator, .. is sufficiently clear and expressive to
do most of the stuff we need to do with configuration anyway.
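
For anyone converting filters by hand, here is a small Python sketch of
how the three operators read. It is not the kvm_config.py code; it assumes
a test name is a dot-separated list of variant names, that the blocks
joined by ".." may match in any order, and that only commas separate
alternatives. The function names are made up.

```python
def parse_filter(text):
    """Split a filter into alternatives (OR) of blocks (AND) of adjacent names."""
    alternatives = []
    for alt in text.split(","):
        alt = alt.strip()
        if not alt:
            continue
        blocks = [block.split(".") for block in alt.split("..")]
        if any(not name for block in blocks for name in block):
            raise ValueError("syntax error in filter: %r" % alt)
        alternatives.append(blocks)
    return alternatives


def contains_run(names, block):
    """True if 'block' appears in 'names' as a contiguous run."""
    n = len(block)
    return any(names[i:i + n] == block for i in range(len(names) - n + 1))


def matches(filter_text, test_name):
    """True if any alternative has all of its blocks present in the name."""
    names = test_name.split(".")
    return any(
        all(contains_run(names, block) for block in blocks)
        for blocks in parse_filter(filter_text)
    )


example = "qcow2..Fedora.14, RHEL.6..raw..boot, smp2..qcow2..migrate..ide"
```

So "Fedora.14" requires the two names next to each other, while
"qcow2..Fedora.14" only requires that both parts appear somewhere in the
test name.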



Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/10/2011 01:31 PM, Jan Kiszka wrote:
> >>  @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> >>    spin_unlock(&kvm->mmu_lock);
> >>    srcu_read_unlock(&kvm->srcu, idx);
> >>    }
> >>  - if (kvm_freed)
> >>  - list_move_tail(&kvm_freed->vm_list,&vm_list);
> >>  + if (kvm_freed) {
> >>  + raw_spin_lock(&kvm_lock);
> >>  + if (!kvm->deleted)
> >>  + list_move_tail(&kvm_freed->vm_list,&vm_list);
> >
> >  There is no list_move_tail_rcu().
>
> ...specifically not for this one.

Well, we can add one if needed (and if possible).

> >  Why check kvm->deleted?  it's in the process of being torn down anyway,
> >  it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.
>
> kvm_destroy_vm removes a vm from the list while mmu_shrink is running.
> Then mmu_shrink's list_move_tail will re-add that vm to the list tail
> again (unless already the removal in move_tail produces a crash).

It's too subtle.  Communication across threads with a variable needs
memory barriers (even though they're nops on x86) and documentation.

btw, not even sure if it's legal: you have a mutating call within an rcu
read critical section for the same object.  If synchronize_rcu() were
called there, would it ever terminate?

(not that synchronize_rcu() is a good thing there, better do it with
call_rcu()).


--
error compiling committee.c: too many arguments to function



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Edgar E. Iglesias
On Thu, Feb 10, 2011 at 12:55:22PM +0100, Alexander Graf wrote:
> Scott Wood wrote:
> > On Thu, 3 Feb 2011 10:19:06 +0100
> > Alexander Graf  wrote:
> >
> >   
> >> Yeah, that one's tricky. Usually the way the memory resolver in qemu works 
> >> is as follows:
> >>
> >>  * kvm goes to qemu
> >>  * qemu fetches all mmu and register data from kvm
> >>  * qemu runs its mmu resolution function as if the target was emulated
> >>
> >> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove 
> >> them into env and implement the MMU in qemu (at least enough of it to 
> >> enable debugging). No other target modifies this code path. But no other 
> >> target needs to copy > 30kb of data only to get the mmu data either :).
> >> 
> >
> > I guess you mean that cpu_synchronize_state() is supposed to pull in the
> > MMU state, though I don't see where it gets called for 'm'/'M' commands in
> > the gdb stub.
> >   
> 
> Well, we could also call it in get_phys_page_debug in target-ppc, but
> yes. I guess the reason it works for now is that SDR1 is pretty constant
> and was fetched earlier on. For BookE not syncing is obviously even more
> broken.
> 
> > The MMU code seems to be pretty target-specific.  It's not clear to what
> > extent there is a "normal" way, versus what book3s happens to rely on in
> > its get_physical_address() code.  I don't think there are any platforms
> > supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
> > implementation) that have a pure software-managed TLB.  x86 has page
> > tables, and book3s has the hash table (603/e300 doesn't, or more accurately
> > Linux doesn't use it, but I guess that's not supported by KVM yet?).
> >   
> 
> As for PPC, only 440, e500 and G3-5 are basically supported. It happens
> to work on POWER4 and above too and I've even got reports that it's good
> on e600 :).
> 
> > We could probably do some sort of lazy state transfer only when MMU code
> > that needs it is run.  This could initially include debug translations, for
> > testing a non-KVM-dependent get_physical_address() implementation, but
> > eventually that would use KVM_TRANSLATE (when KVM is used) and thus not
> >   
> 
> Yup :).
> 
> > trigger the state transfer.  I'd also like to add an "info tlb" command,
> > which would require the state transfer.
> >   
> 
> Very nice.
> 
> > BTW, how much other than the MMU is missing to be able to run an e500
> > target in qemu, without kvm?
> >   
> 
> The last person working on BookE emulation was Edgar. Edgar, how far did
> you get?

Hi,

TBH, I don't really know. My goal was to get linux running on a PPC-440
embedded in the Xilinx FPGAs. I managed to fix enough BookE emulation
to get that far.

After that, we've done a few more hacks to run fsboot and uboot. Also,
we've added support for some of the BookE debug registers to be able
to run gdbserver from within linux guests. Some of these patches haven't
made it upstream yet.

I haven't taken the time to compare the specs to qemu code, so I don't
really know how much is missing. My guess is that if you want to run
linux guests, the MMU won't be the limiting factor.
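
On the lazy state transfer idea quoted above, a rough Python sketch of the
shape it might take: the expensive copy of MMU state is made only when
something actually asks for it, and is dropped again once the guest has
run. The class, the fetch callback and the placeholder TLB entry are all
hypothetical; this is not qemu or KVM code.

```python
class LazyMmuState:
    """Fetch the (large) MMU state on first use and cache it until invalidated."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable doing the expensive transfer
        self._tlb = None
        self.fetch_count = 0

    def invalidate(self):
        """Call after the guest has run again, so the cached copy is stale."""
        self._tlb = None

    @property
    def tlb(self):
        if self._tlb is None:
            self._tlb = self._fetch()
            self.fetch_count += 1
        return self._tlb


def fetch_tlb_placeholder():
    # Stand-in for copying all TLB entries out of the kernel.
    return [(0x1000, 0x80001000)]


state = LazyMmuState(fetch_tlb_placeholder)
```

A debug translation or an "info tlb" style command would read state.tlb
and pay for the transfer once; ordinary execution never touches it.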

Cheers


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 11:38 AM, Peter Maydell wrote:
> On 10 February 2011 10:13, Anthony Liguori wrote:
>> On 02/10/2011 10:04 AM, Peter Maydell wrote:
>>> On 10 February 2011 08:36, Anthony Liguori wrote:
>>>> So you would model arm926ej-s as the chipset and then build up the
>>>> machines by modifying parameters of the chipset (like the board id)
>>>> and/or adding different components on top of it.
>>>
>>> Er, ARM926 is the CPU, it's not a chipset. The board ID is definitely
>>> not a property of an ARM926, it's a property of the board (clue is in
>>> the name :-)). I don't think versatile boards have a "chipset" really...
>>
>> As I said, I'm not well versed in the component names in ARM.
>>
>> But that said, an actual processor doesn't connect directly to a bunch of
>> devices.  It almost always goes through some chipset and that chipset
>> implements a lot of functionality typically.
>>
>> I think the name of the component I'm trying to refer to is PL300, which I
>> believe is the Northbridge used for the Versatile boards.
>
> PL300 is just a bus interconnect (so you can connect multiple AXI
> bus masters (cores) to multiple AXI bus slaves (devices)).
> Versatile PB doesn't have anything in the documentation that claims
> to be a Northbridge (PBX does, VExpress doesn't).
>
> This is the system diagram for the Versatile Express:
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0447d/I1007683.html
> I don't know what you'd want to claim is a "northbridge" there.
> Basically there's an FPGA with a pile of devices in it,
> and there's a test chip with the core and some other devices in
> it. But from a modelling perspective this is all completely
> irrelevant because regardless of where the hardware designer
> put the devices, they're just devices at a particular point in the
> memory map and with a particular set of interrupt wiring and so
> on.


But something interacts with each processor and dispatches the I/O 
operations in the address space, no?  I can't believe there are 2^32 
address lines coming off of every arm chip that each device connects.


This relationship of how I/O fans out through various devices is 
important because occasionally platforms do weird things during I/O fan 
out like implement an IOMMU.  If we don't model this I/O dispatch model 
within QEMU, then it's extremely difficult to implement things like IOMMUs.


It might be the case that a platform has a chipset that is a pile of 
well isolated devices that are crammed in the same silicon space but 
that otherwise have very well defined interactions with each other.  
This is the exception though, not the rule.


Particularly when looking at the relationship between certain devices on 
the PC (like the role the pckbd plays in address translation), things 
are simply not so idealized in practice.


But if it makes sense for ARM to describe every single platform device 
through a factory interface, that's fine.


Even in this case, you still want to model things like the distinction 
between the 16550A UART and the ISA bus bridge for the serial device.  In 
this case, you want to be able to do composition without going through a 
factory.



>> An n900 is a very specific hardware configuration that is best represented
>> by some sort of configuration file vs. something hard coded in QEMU.
>
> Yes, that's the whole point -- "machine" == "specific hardware
> configuration".
>
> That's not getting rid of "machine", it's just saying "we should have
> some custom scripting language to define them rather than doing
> them in C". You still want, fundamentally, to be able to say
>    qemu-system-arm -M machinename


No, qemu-system-arm -M /path/to/n900.cfg

But yeah, no disagreement there.  But today, the machine concept in QEMU 
is definitely not a specific hardware configuration.
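
A rough illustration of the "machine as a configuration file" idea, written
in Python for brevity. Everything in it is hypothetical: the board contents,
device names, addresses and IRQ numbers are made up, and this is not a format
QEMU reads. It only shows a machine described as data (devices at points in
the memory map, each with an interrupt line) plus a lookup over that
description.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Device:
    name: str   # hypothetical device model name
    base: int   # first guest-physical address the device claims
    size: int   # length of the claimed region in bytes
    irq: int    # interrupt line the device is wired to


# Made-up board description; in the proposal this would be loaded from a
# file such as /path/to/n900.cfg instead of being hard coded.
EXAMPLE_BOARD = (
    Device("uart0", base=0x1000_0000, size=0x1000, irq=5),
    Device("timer0", base=0x1000_1000, size=0x1000, irq=6),
    Device("rtc0", base=0x1000_2000, size=0x1000, irq=7),
)


def find_device(machine: Sequence[Device], addr: int) -> Optional[Device]:
    """Return the device whose region contains addr, or None if unmapped."""
    for dev in machine:
        if dev.base <= addr < dev.base + dev.size:
            return dev
    return None
```

A flat lookup like this is the simple case. A bridge or IOMMU in the I/O
path would presumably need a translation step before the lookup, which is
the part the fan-out argument above is about.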


Regards,

Anthony Liguori


> -- PMM



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf
Scott Wood wrote:
> On Thu, 3 Feb 2011 10:19:06 +0100
> Alexander Graf  wrote:
>
>   
>> Yeah, that one's tricky. Usually the way the memory resolver in qemu works 
>> is as follows:
>>
>>  * kvm goes to qemu
>>  * qemu fetches all mmu and register data from kvm
>>  * qemu runs its mmu resolution function as if the target was emulated
>>
>> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them 
>> into env and implement the MMU in qemu (at least enough of it to enable 
>> debugging). No other target modifies this code path. But no other target 
>> needs to copy > 30kb of data only to get the mmu data either :).
>> 
>
> I guess you mean that cpu_synchronize_state() is supposed to pull in the
> MMU state, though I don't see where it gets called for 'm'/'M' commands in
> the gdb stub.
>   

Well, we could also call it in get_phys_page_debug in target-ppc, but
yes. I guess the reason it works for now is that SDR1 is pretty constant
and was fetched earlier on. For BookE not syncing is obviously even more
broken.

> The MMU code seems to be pretty target-specific.  It's not clear to what
> extent there is a "normal" way, versus what book3s happens to rely on in
> its get_physical_address() code.  I don't think there are any platforms
> supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
> implementation) that have a pure software-managed TLB.  x86 has page
> tables, and book3s has the hash table (603/e300 doesn't, or more accurately
> Linux doesn't use it, but I guess that's not supported by KVM yet?).
>   

As for PPC, only 440, e500 and G3-5 are basically supported. It happens
to work on POWER4 and above too and I've even got reports that it's good
on e600 :).

> We could probably do some sort of lazy state transfer only when MMU code
> that needs it is run.  This could initially include debug translations, for
> testing a non-KVM-dependent get_physical_address() implementation, but
> eventually that would use KVM_TRANSLATE (when KVM is used) and thus not
>   

Yup :).

> trigger the state transfer.  I'd also like to add an "info tlb" command,
> which would require the state transfer.
>   

Very nice.
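
The lazy transfer idea quoted above could be modelled roughly as follows. This
is a Python sketch, not QEMU code: the class name and the fetch callback
(standing in for whatever call copies the TLB out of KVM) are assumptions
made for illustration.

```python
class LazyMmuState:
    """Cache of guest MMU state that is only fetched when something needs it."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callback: copies the TLB out of KVM
        self._tlb = None         # None means "no valid copy in userspace"
        self.fetch_count = 0     # how many expensive transfers actually happened

    def invalidate(self):
        """Call after the vcpu has run, since the cached copy may be stale."""
        self._tlb = None

    def tlb(self):
        """Return the TLB contents, transferring them only on first use."""
        if self._tlb is None:
            self._tlb = self._fetch()
            self.fetch_count += 1
        return self._tlb
```

Debug translations or an "info tlb" style command would call tlb(); a path
that uses KVM_TRANSLATE instead would never call it, so no transfer would be
triggered.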

> BTW, how much other than the MMU is missing to be able to run an e500
> target in qemu, without kvm?
>   

The last person working on BookE emulation was Edgar. Edgar, how far did
you get?


Alex



Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Michael Goldish
On 02/10/2011 01:03 PM, Avi Kivity wrote:
> On 02/10/2011 12:57 PM, Michael Goldish wrote:
>> >
>> >  I can't easily think of a case where this might cause confusion.  The
>> >  purpose of this is to allow people to write:
>> >
>> >  only qcow2..raw..rtl8139
>> >
>> >  without having to remember the order in which those were defined in
>> >  tests_base.cfg.
>>
>> Sorry, I meant something like
>>
>> only qcow2..hugepages..rtl8139
>>
>> Obviously qcow2 and raw can't coexist.
> 
> The config files describe a cartesian product, in which order matters.
> 
> [A B C] x [1 2] generates [A1 A2 B1 B2 C1 C2]; no confusion here if you
> specify A..1
> 
> however
> 
> [A B C] x [A B] generates [AA AB BA BB CA CB]; A..B is ambiguous

This is a bad idea anyway:

[A B C] x [A B] x [install boot migrate]

'only A..install' is ambiguous regardless of whether we match in-order
or not.

> we might require that keywords be unique.

Ambiguity can be resolved by prefixing a name with its immediate parent.
 If we have Fedora.9.32 and Fedora.9.64, and some test 'foo' has both a
32 bit and a 64 bit version, then the following isn't ambiguous:

only Fedora.9.32..foo.32

If we require that keywords be unique, such combinations will not be
possible.  The same applies to RHEL.3..sometest.3.
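
A minimal sketch of the matching rule as described in this thread, assuming
that "." means "immediately followed by" and ".." means "also present
somewhere, in any order". This is not the actual kvm_config.py code, and the
function names are made up.

```python
def contains_run(parts, run):
    """True if the list `run` occurs as a contiguous slice of `parts`."""
    n = len(run)
    return any(parts[i:i + n] == run for i in range(len(parts) - n + 1))


def matches(filter_expr, variant_name):
    """Check a filter such as 'Fedora.9.32..foo.32' against a variant name.

    Each '..'-separated group must appear as an adjacent run of names in the
    variant; the groups themselves may appear in any order.
    """
    parts = variant_name.split(".")
    groups = [group.split(".") for group in filter_expr.split("..")]
    return all(contains_run(parts, group) for group in groups)
```

With this rule, "Fedora.9.32..foo.32" selects Fedora.9.32.foo.32 but not
Fedora.9.64.foo.32, because the parent prefix pins down which "32" is meant.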


Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf
Scott Wood wrote:
> On Wed, 9 Feb 2011 18:21:40 +0100
> Alexander Graf  wrote:
>
>   
>> On 07.02.2011, at 21:15, Scott Wood wrote:
>>
>> 
>>> That's pretty much what the proposed API does -- except it uses a void
>>> pointer instead of uint64_t *.
>>>   
>> Oh? Did I miss something there? The proposal looked as if it only transfers 
>> a single TLB entry at a time.
>> 
>
> Right, I just meant in terms of avoiding a fixed reference to a hw-specific
> type.
>
>   
>>> How about:
>>>
>>> struct kvmppc_booke_tlb_entry {
>>>     union {
>>>         __u64 mas0_1;
>>>         struct {
>>>             __u32 mas0;
>>>             __u32 mas1;
>>>         };
>>>     };
>>>     __u64 mas2;
>>>     union {
>>>         __u64 mas7_3;
>>>         struct {
>>>             __u32 mas7;
>>>             __u32 mas3;
>>>         };
>>>     };
>>>     __u32 mas8;
>>>     __u32 pad;
>>>   
>> Would it make sense to add some reserved fields or would we just bump up the 
>> mmu id?
>> 
>
> I was thinking we'd just bump the ID.  I only stuck "pad" in there for
> alignment.  And we're making a large array of it, so padding could hurt.
>   

Ok, thinking about this a bit more. You're basically proposing a list of
tlb set calls, with each array field identifying one tlb set call. What
I was thinking of was a full TLB sync, so we could keep qemu's internal
TLB representation identical to the ioctl layout and then just call that
one ioctl to completely overwrite all of qemu's internal data (and vice
versa).

>>> struct kvmppc_booke_tlb_params {
>>>     /*
>>>      * book3e defines 4 TLBs.  Individual implementations may have
>>>      * fewer.  TLBs that do not exist on the target must be configured
>>>      * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
>>>      * configured here, though arrays greater than 2048 entries will
>>>      * have TLBnCFG[NENTRY] set to zero.
>>>      */
>>>     __u32 tlb_sizes[4];
>>>   
>> Add some reserved fields?
>> 
>
> MMU type ID also controls this, but could add some padding to make
> extensions simpler (esp. since we're not making an array of it).  How much
> would you recommend?
>   

How about making it 64 bytes? That should leave us plenty of room.
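
To make the sizes concrete, here is a Python ctypes model of the two
structures as proposed so far. It is only a layout check, not kernel code: the
"reserved" field and its length are an assumption chosen to reach the 64
bytes suggested above, and the Python class names are invented.

```python
import ctypes


class _Mas01Words(ctypes.Structure):
    _fields_ = [("mas0", ctypes.c_uint32), ("mas1", ctypes.c_uint32)]


class _Mas01(ctypes.Union):
    _anonymous_ = ("words",)
    _fields_ = [("mas0_1", ctypes.c_uint64), ("words", _Mas01Words)]


class _Mas73Words(ctypes.Structure):
    _fields_ = [("mas7", ctypes.c_uint32), ("mas3", ctypes.c_uint32)]


class _Mas73(ctypes.Union):
    _anonymous_ = ("words",)
    _fields_ = [("mas7_3", ctypes.c_uint64), ("words", _Mas73Words)]


class TlbEntry(ctypes.Structure):
    """Models the proposed struct kvmppc_booke_tlb_entry."""
    _anonymous_ = ("u01", "u73")
    _fields_ = [
        ("u01", _Mas01),
        ("mas2", ctypes.c_uint64),
        ("u73", _Mas73),
        ("mas8", ctypes.c_uint32),
        ("pad", ctypes.c_uint32),   # keeps the size a multiple of 8
    ]


class TlbParams(ctypes.Structure):
    """Models struct kvmppc_booke_tlb_params padded out to 64 bytes."""
    _fields_ = [
        ("tlb_sizes", ctypes.c_uint32 * 4),   # 16 bytes, from the proposal
        ("reserved", ctypes.c_uint32 * 12),   # 48 bytes, assumed padding
    ]
```

Under these assumptions ctypes.sizeof(TlbEntry) is 32 and
ctypes.sizeof(TlbParams) is 64, so a large array of entries carries only the
4 bytes of "pad" per element.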

>   
>>> struct kvmppc_booke_tlb_search {
>>>   
>> Search? I thought we agreed on having a search later, after the full get/set 
>> is settled?
>> 
>
> We agreed on having a full array-like get/set... my preference was to keep
> it all under one capability, which implies adding it at the same time.
> But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB.  I'm
> skeptical that array-only will not be a performance issue under any usage
> pattern, but we can implement it and try it out before finalizing any of
> this.
>   

Yup. We can even implement it, measure what exactly is slow and then
decide on how to implement it. I'd bet that only the emulation stub is
slow - and for that KVM_TRANSLATE seems like a good fit.

>   
>>>     struct kvmppc_booke_tlb_entry entry;
>>>     union {
>>>         __u64 mas5_6;
>>>         struct {
>>>             __u64 mas5;
>>>             __u64 mas6;
>>>         };
>>>     };
>>> };
>>>   
>
> The fields inside the struct should be __u32, of course. :-P
>   

Ugh, yes :). But since we're dropping this anyway, it doesn't matter,
right? :)

>   
>>> - An entry with MAS1[V] = 0 terminates the list early (but there will
>>>   be no terminating entry if the full array is valid).  On a call to
>>>   KVM_GET_TLB, the contents of elements after the terminator are undefined.
>>>   On a call to KVM_SET_TLB, excess elements beyond the terminating
>>>   entry may not be accessed by KVM.
>>>   
>> Very implementation specific, but ok with me. 
>> 
>
> I assumed most MMU types would have some straightforward way of marking an
> entry invalid (if not, it can add a software field in the struct), and that
> it would be MMU-specific code that is processing the list.
>   

See above :).

>   
>> It's constrained to the BOOKE implementation of that GET/SET anyway. Is
>> this how the hardware works too?
>> 
>
> Hardware doesn't process lists of entries.  But MAS1[V] is the valid
> bit in hardware.
>
>   
>>> [Note: Once we implement sregs, Qemu can determine which TLBs are
>>> implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
>>> unsupported by KVM if its existence is implied by the target CPU]
>>>
>>> KVM_SET_TLB
>>> ---
>>>
>>> Capability: KVM_CAP_SW_TLB
>>> Type: vcpu ioctl
>>> Parameters: struct kvm_set_tlb (in)
>>> Returns: 0 on success
>>> -1 on error
>>>
>>> struct kvm_set_tlb {
>>>     __u64 params;
>>>     __u64 array;
>>>     __u32 mmu_type;
>>> };
>>>
>>> [Note: I used __u64 rather than void * to avoid the need for special
>>> compat handling with 32-bit userspace on a 64-bit kernel -- if the other
>>> way is preferred, that's fin

Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Jan Kiszka
On 2011-02-10 11:16, Avi Kivity wrote:
> On 02/08/2011 01:55 PM, Jan Kiszka wrote:
>> Only for walking the list of VMs, we do not need to hold the preemption
>> disabling kvm_lock. Convert stat services, the cpufreq callback and
>> mmu_shrink to RCU. For the latter, special care is required to
>> synchronize its list_move_tail with kvm_destroy_vm.
>>
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index b6a9963..e9d0ed8 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3587,9 +3587,9 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>>  if (nr_to_scan == 0)
>>  goto out;
>>
>> -raw_spin_lock(&kvm_lock);
>> +rcu_read_lock();
>>
>> -list_for_each_entry(kvm,&vm_list, vm_list) {
>> +list_for_each_entry_rcu(kvm,&vm_list, vm_list) {
>>  int idx, freed_pages;
>>  LIST_HEAD(invalid_list);
> 
> Have to #include rculist.h,

OK.

> and to change all list operations on vm_list 
> to rcu variants.

Not sure if we have such variants for all cases...

> 
>>
>> @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>>  spin_unlock(&kvm->mmu_lock);
>>  srcu_read_unlock(&kvm->srcu, idx);
>>  }
>> -if (kvm_freed)
>> -list_move_tail(&kvm_freed->vm_list,&vm_list);
>> +if (kvm_freed) {
>> +raw_spin_lock(&kvm_lock);
>> +if (!kvm->deleted)
>> +list_move_tail(&kvm_freed->vm_list,&vm_list);
> 
> There is no list_move_tail_rcu().

...specifically not for this one.

> 
> Why check kvm->deleted?  it's in the process of being torn down anyway, 
> it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.

kvm_destroy_vm removes a vm from the list while mmu_shrink is running.
Then mmu_shrink's list_move_tail will re-add that vm to the list tail
again (unless already the removal in move_tail produces a crash).
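
The race can be modelled in a few lines of Python. This is a toy model, not
kernel code: a plain Python list stands in for vm_list, a threading.Lock for
kvm_lock, and a missing element is silently skipped where the real
list_move_tail might crash. The "deleted" flag is the one from the patch;
the class and method names are made up.

```python
import threading


class VM:
    def __init__(self, name):
        self.name = name
        self.deleted = False  # mirrors the flag checked in the patch


class VMList:
    def __init__(self, vms):
        self.lock = threading.Lock()  # plays the role of kvm_lock
        self.vms = list(vms)

    def destroy_vm(self, vm):
        """Model of kvm_destroy_vm: mark the VM and unlink it."""
        with self.lock:
            vm.deleted = True
            self.vms.remove(vm)

    def move_tail(self, vm, check_deleted):
        """Model of the list_move_tail call at the end of mmu_shrink."""
        with self.lock:
            if check_deleted and vm.deleted:
                return  # VM is being torn down, leave the list alone
            if vm in self.vms:
                self.vms.remove(vm)
            self.vms.append(vm)

    def names(self):
        return [vm.name for vm in self.vms]
```

If destroy_vm(a) runs between the RCU walk and move_tail(a, False), the
destroyed VM ends up back at the tail of the list; with the check enabled
the list stays as destroy_vm left it.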

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 10:38:53AM +, Peter Maydell wrote:
> This is the system diagram for the Versatile Express:
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0447d/I1007683.html
> I don't know what you'd want to claim is a "northbridge" there.
> Basically there's an FPGA with a pile of devices in it,
> and there's a test chip with the core and some other devices in
> it. But from a modelling perspective this is all completely
> irrelevant because regardless of where the hardware designer
> put the devices, they're just devices at a particular point in the
> memory map and with a particular set of interrupt wiring and so
> on. I don't see the point in modelling a concept that has no
> user-visible effects and doesn't actually make the model any
> clearer or simpler.
> 
Exactly. This is really the same with x86. The fact that some company
put several devices on the same chip and gave it a commercial name
shouldn't govern our design.

> 
> > A machine today is basically the northbridge, southbridge, plus a bunch of
> > default components to make the virtual hardware useful.
> 
> This doesn't really correspond to ARM boards I've looked at,
> by and large (for instance there's no mention of the word "northbridge"
> in the whole 3700 page OMAP3 TRM). PCs may be best modelled
> that way, sure, but I don't think you can cram everything into that mould.
> 
Even on x86 this model is falling apart. The memory controller is moving
into the CPU. The PCI controller will follow.

> >> If you mean that you want machines to be implemented under the
> >> hood as a single huge "device" you can only have one of that spans
> >> the entire memory map, well I guess that's an implementation
> >> detail. But conceptually machines really do exist, and we definitely
> >> still want users to be able to say "I want a beagle machine; I want
> >> a versatile; I want an n900".
> 
> > An n900 is a very specific hardware configuration that is best represented
> > by some sort of configuration file vs. something hard coded in QEMU.
> 
> Yes, that's the whole point -- "machine" == "specific hardware
> configuration".
> 
> That's not getting rid of "machine", it's just saying "we should have
> some custom scripting language to define them rather than doing
> them in C". You still want, fundamentally, to be able to say
>   qemu-system-arm -M machinename
> 
+1

--
Gleb.


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 12:25:38PM +0200, Avi Kivity wrote:
> On 02/10/2011 11:07 AM, Gleb Natapov wrote:
> >On Thu, Feb 10, 2011 at 08:47:12AM +0100, Anthony Liguori wrote:
> >>  On 02/09/2011 09:15 PM, Blue Swirl wrote:
> >>  >On Wed, Feb 9, 2011 at 9:59 PM, Anthony Liguori wrote:
> >>  >>On 02/09/2011 06:48 PM, Blue Swirl wrote:
> >>  ISASerialState dev;
> >>  
> >>  isa_serial_init(&dev, 0, 0x274, 0x07, NULL, NULL);
> >>  
> >>  >>>Do you mean that there should be a generic way of doing that, like
> >>  >>>sysbus_create_varargs() for qdev, or just add inline functions which
> >>  >>>hide qdev property setup?
> >>  >>>
> >>  >>>I still think that FDT should be used in the future. That would
> >>  >>>require that the properties can be set up mechanically, and I don't
> >>  >>>see how your proposal would help that.
> >>  >>>
> >>  >>Yeah, I don't think that is a good idea anymore.  I think this is part of
> >>  >>why we're having so many problems with qdev.
> >>  >>
> >>  >>While (most?) hardware hierarchies can be represented by device tree syntax,
> >>  >>not all valid device trees correspond to interface and/or useful hardware
> >>  >>hierarchies.
> >>  >User creates a non-working machine and so gets to fix the problems?
> >>  >How is that a problem for us?
> >>
> >>  It's not about creating a non-working machine.  It's about what
> >>  user-level abstraction we need to provide.
> >>
> >>  It's a whole lot easier to implement an i440fx device with a fixed
> >>  set of parameters than it is to make every possible subdevice have a
> >>  proper factory interface along with mechanisms to hook everything
> >>  together.
> >>
> >So what if it is easier? It doesn't mean it is the correct thing to do. What
> >you are proposing is just a huge step backwards. Maybe we shouldn't
> >support hooking everything together in completely arbitrary ways, but we
> >shouldn't force isa/pci devices upon our users just because they are
> >non-removable on a real chip.
> 
> I disagree.  We don't want to deviate from the spec any more than we
> already do.
> 
Which spec? Even in this discussion we have completely mixed up different
things. The 440FX is not a chipset. It is a memory controller/PCI host bridge.
PIIX3/4 is the chipset, which is just an arbitrary combination of devices
put on the same chip. We do not deviate from the spec when we implement
those devices.

> The reason for wanting flexibility is because the code for the PIC
> or RTC, for example, can be used in other Super-IO chipsets or even
> standalone.  If qemu only supported the 440FX chipset, we'd have no
> reason to make things flexible.
Again, you probably mean PIIX3. Even then, removing unused IDE will free
one more PCI slot for my cool virtio disk array. The thing is, from
the code point of view, it does not cost you extra to allow composition of
IDE, since it is just a regular PCI device and we need to support composing
those anyway.

> 
> >>
> >>  So very concretely, I'm suggesting we do the following to target-i386:
> >>
> >>  1) make the i440fx device have an embedded ide controller, piix3,
> >>  and usb controller that get initialized automatically.  The piix3
> >>  embeds the PCI-to-ISA bridge along with all of the default ISA
> >>  devices (rtc, serial, etc.).
> >This may be a problem even from security point of view. What if usb code
> >(ide, serial, parallel) has guest exploitable bug? Currently I can happily
> >continue running guests if they do not need affected subsystem. If we'll
> >get it your way I will no longer be able to do so.
> 
> You can't just remove a device from a guest.  You have to shut it
> down.  When you power it back up, you may end up with different IRQ
> assignments or expose some guest bug.
As I already answered Anthony, I am not talking about changing the HW
configuration after the guest is created, but rather about creating a minimal
HW setup for the task from the start. This means no soundcard or USB for
a Windows Exchange server, for instance.

> 
> If you have a security issue in code that is exposed to the guest,
> you have to fix it.
> 
Of course. That is why it is a good idea to expose as little code to
the guest as possible. Don't you think so?

--
Gleb.


Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Avi Kivity

On 02/10/2011 12:57 PM, Michael Goldish wrote:
> >
> >  I can't easily think of a case where this might cause confusion.  The
> >  purpose of this is to allow people to write:
> >
> >  only qcow2..raw..rtl8139
> >
> >  without having to remember the order in which those were defined in
> >  tests_base.cfg.
>
> Sorry, I meant something like
>
> only qcow2..hugepages..rtl8139
>
> Obviously qcow2 and raw can't coexist.

The config files describe a cartesian product, in which order matters.

[A B C] x [1 2] generates [A1 A2 B1 B2 C1 C2]; no confusion here if you 
specify A..1


however

[A B C] x [A B] generates [AA AB BA BB CA CB]; A..B is ambiguous

we might require that keywords be unique.
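
The two products above can be reproduced with a few lines of Python. This is
only an illustration of the ambiguity, assuming the order-independent
matching discussed in this thread; it is not the kvm_config.py parser and
the helper names are invented.

```python
from itertools import product


def expand(*axes):
    """Cartesian product of the axes, joined into dotted variant names."""
    return [".".join(combo) for combo in product(*axes)]


def select(variants, *names):
    """Keep variants that contain every requested name, in any position."""
    return [v for v in variants if all(n in v.split(".") for n in names)]


unique = expand(["A", "B", "C"], ["1", "2"])
clashing = expand(["A", "B", "C"], ["A", "B"])
```

Here select(unique, "A", "1") picks out a single variant, while
select(clashing, "A", "B") returns both A.B and B.A, since nothing says
which axis each name refers to.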

--
error compiling committee.c: too many arguments to function



Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Michael Goldish
On 02/10/2011 12:55 PM, Michael Goldish wrote:
> On 02/10/2011 12:47 PM, Avi Kivity wrote:
>> On 02/10/2011 12:46 PM, Michael Goldish wrote:
>>> On 02/10/2011 12:34 PM, Avi Kivity wrote:
>>>> On 02/10/2011 11:14 AM, Michael Goldish wrote:
>>>>> only Fedora..boot
>>>>>
>>>>
>>>> So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
>>>> Windows.XP.32.boot or Fedora.9.32.migrate?  seems reasonable.
>>>
>>> Correct, and it would also include boot.Fedora.9.32 and
>>> boot.9.32.Fedora, if there were such things.
>>
>> That's counterintuitive and requires careful planning.
> 
> I can't easily think of a case where this might cause confusion.  The
> purpose of this is to allow people to write:
> 
> only qcow2..raw..rtl8139
> 
> without having to remember the order in which those were defined in
> tests_base.cfg.

Sorry, I meant something like

only qcow2..hugepages..rtl8139

Obviously qcow2 and raw can't coexist.


Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Michael Goldish
On 02/10/2011 12:47 PM, Avi Kivity wrote:
> On 02/10/2011 12:46 PM, Michael Goldish wrote:
>> On 02/10/2011 12:34 PM, Avi Kivity wrote:
>> >  On 02/10/2011 11:14 AM, Michael Goldish wrote:
>> >>  only Fedora..boot
>> >>
>> >
>> >  So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
>> >  Windows.XP.32.boot or Fedora.9.32.migrate?  seems reasonable.
>>
>> Correct, and it would also include boot.Fedora.9.32 and
>> boot.9.32.Fedora, if there were such things.
> 
> That's counterintuitive and requires careful planning.

I can't easily think of a case where this might cause confusion.  The
purpose of this is to allow people to write:

only qcow2..raw..rtl8139

without having to remember the order in which those were defined in
tests_base.cfg.


Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Yoshiaki Tamura
2011/2/10 Daniel P. Berrange :
> On Thu, Feb 10, 2011 at 07:23:33PM +0900, Yoshiaki Tamura wrote:
>> 2011/2/10 Daniel P. Berrange :
>> > On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
>> >> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
>> >> >Currently FdMigrationState doesn't support read(), and this patch
>> >> >introduces it to get response from the other side.
>> >> >
>> >> >Signed-off-by: Yoshiaki Tamura
>> >>
>> >> Migration is unidirectional.  Changing this is fundamental and not
>> >> something to be done lightly.
>> >
>> > Making it bi-directional might break libvirt's save/restore
>> > to file support which uses migration, passing a unidirectional
>> > FD for the file. It could also break libvirt's secure tunnelled
>> > migration support which is currently only expecting to have
>> > data sent in one direction on the socket.
>>
>> Hi Daniel,
>>
>> IIUC, this patch isn't something to make existing live migration
>> bi-directional.  Just opens up a way for Kemari to use it.  Do
>> you think it's dangerous for libvirt still?
>
> The key is for it to be a no-op for any usage of the existing
> 'migrate' command. I had thought this was wiring up read into
> the event loop too, so it would be poll()ing for reads, but
> after re-reading I see this isn't the case here.

It's a no-op for existing migration related code.  Anthony, did
you have the same concern?

Yoshi

>
> Regards,
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 11:19:48AM +0100, Anthony Liguori wrote:
> On 02/10/2011 11:10 AM, Gleb Natapov wrote:
> >On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote:
> >>On 02/10/2011 10:07 AM, Gleb Natapov wrote:
> >>>So what if it is easier, it doesn't mean it is correct thing to do.
> >>If we spend the next 10 years trying to do the "correct thing" for
> >>some arbitrary definition of correct, that's not terribly useful.
> >Changing direction by 180 degrees every 2 years is even less useful.
> 
> If we think through what we are doing and have a coherent
> architecture before changing direction, then we won't have this
> problem.
> 
I'd like to believe this :)

> >>It's really simple actually.  Let's do the least clever thing and
> >>model how hardware actual works.  Once we have that, we can try to
> >>be better than real hardware (if it's possible).
> >I think our understanding of how HW actually works is very different.
> >You are placing too much value on where a device resides physically; for me
> >it is a completely unimportant detail. Not even worth mentioning.
> 
> No, I place value on how things are modelled in the real world.
The real world (physical HW) has considerations that are not relevant to our
software emulation, such as cost, physical dimensions, power consumption
and many others I am sure I missed.

> 
> There simply aren't PC's out there that lack an RTC so I have no
> interest in jumping through hoops in QEMU to make it possible to do
> this without modifying QEMU code.  It might sound nice to a
> developer but it's of absolutely no use to users.
> 
RTC is not a good example. HPET is supposed to replace it (and the PIT too). AFAIK
there are PCs without an RTC already. A good example would be the PIC or IOAPIC
device, and then I would agree with you that it is not worth making
it possible to create an x86 machine without them from the command line if it
means extra complexity. But how have you jumped from this to "let's make usb
mandatory"?

> If all composition is done through a factory interface, it doesn't.
> But my main argument here is that we shouldn't try to make all
> composition done through a factory interface--only where it makes
> sense.
> 
> So very concretely, I'm suggesting we do the following to target-i386:
> 
> 1) make the i440fx device have an embedded ide controller, piix3,
> and usb controller that get initialized automatically.  The piix3
> embeds the PCI-to-ISA bridge along with all of the default ISA
> devices (rtc, serial, etc.).
> >>>This may be a problem even from security point of view. What if usb code
> >>>(ide, serial, parallel) has guest exploitable bug? Currently I can happily
> >>>continue running guests if they do not need affected subsystem. If we'll
> >>>get it your way I will no longer be able to do so.
> >>qemu -device i440fx,ide=off
> >>
> >So you still need to support arbitrary composition. What's the
> >difference?
> 
> No, we don't.  It's possible to have an 'rtc=off' option but I'm
> tremendously opposed to doing this.  Arbitrary composition is not a
> useful goal IMHO.
My opinion is different. We should support composition where it makes sense.
For PIC-less x86 it doesn't. For usb-less or even ide-less it
does.

> 
> >  So why do you like -device i440fx over what we have now?
> 
> Because I don't think tools like libvirt should be doing device
> composition to create an i440fx-like chipset.  I think the current
> path we're on is pushing too much logic that belongs in QEMU into
> the management stack.
I can agree with that. But it doesn't follow from this that we should
get rid of composition. We shouldn't push composition of common HW to
libvirt. Looking at the libvirt command line, I do not think we do, though.
A typical libvirt command line specifies disks, networks, usb, vga. How
will -device i440fx simplify that? Well, usb could be omitted (but not
-usbdevice tablet); disks are not a property of i440fx, so they will stay;
since the user may want to use a virtio controller (which is not part of
i440fx), this should stay too. Network obviously will have to be
specified by libvirt too. VGA may go to i440fx, but since libvirt
supports qxl we will have to have a way to disable the default vga and
enable qxl instead. So will we really simplify libvirt's life by
introducing -device i440fx?

> 
> >In current terms, what you propose would be implemented by using an i440fx
> >machine type. Qdev will build it for you.
> 
> If you had an i440fx machine type, that had no non-optional
> components added, and you could specify options to the machine type,
> yes.  But I think you'll agree that there's no reason to not just
> treat the i440fx as a device.
I do not agree. There is no such device as i440fx. This is just
packaging.

> 
> >>If you really care to do this.  But this desire to remove devices is
> >>silly IMHO.  Concerns about security are misplaced.  If you have to
> >>change the way a guest is invoked in order to eliminate security
> >>problems, then there's someth

Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Avi Kivity

On 02/10/2011 12:46 PM, Michael Goldish wrote:
> On 02/10/2011 12:34 PM, Avi Kivity wrote:
> >  On 02/10/2011 11:14 AM, Michael Goldish wrote:
> >>  only Fedora..boot
> >>
> >
> >  So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
> >  Windows.XP.32.boot or Fedora.9.32.migrate?  seems reasonable.
>
> Correct, and it would also include boot.Fedora.9.32 and
> boot.9.32.Fedora, if there were such things.

That's counterintuitive and requires careful planning.

--
error compiling committee.c: too many arguments to function



Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Michael Goldish
On 02/10/2011 12:34 PM, Avi Kivity wrote:
> On 02/10/2011 11:14 AM, Michael Goldish wrote:
>> only Fedora..boot
>>
> 
> So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
> Windows.XP.32.boot or Fedora.9.32.migrate?  seems reasonable.

Correct, and it would also include boot.Fedora.9.32 and
boot.9.32.Fedora, if there were such things.
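
The matching rule described above can be sketched as follows. This is a
minimal illustration and not the kvm_config.py implementation: it assumes
that ".." only requires every listed name to appear somewhere in the
dot-separated test name, in any order, which is what the boot.Fedora.9.32
case implies. The function name matches() and the sample list are made up
for the example, and operands that themselves contain a single "." are not
handled.

```python
def matches(filter_expr, test_name):
    """Return True if every operand of an 'A..B' filter is a component of test_name."""
    wanted = [part for part in filter_expr.split("..") if part]
    components = test_name.split(".")
    return all(part in components for part in wanted)


# Sample test names taken from the discussion (the last one is hypothetical).
names = [
    "Fedora.9.32.boot",
    "Fedora.9.64.boot",
    "Windows.XP.32.boot",
    "Fedora.9.32.migrate",
    "boot.Fedora.9.32",
]

# Emulates "only Fedora..boot": keep names containing both components.
selected = [name for name in names if matches("Fedora..boot", name)]
print(selected)
```

Because the check ignores order, a name such as boot.Fedora.9.32 is
selected as well, which is the behaviour being questioned in this thread.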


Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Daniel P. Berrange
On Thu, Feb 10, 2011 at 07:23:33PM +0900, Yoshiaki Tamura wrote:
> 2011/2/10 Daniel P. Berrange :
> > On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
> >> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
> >> >Currently FdMigrationState doesn't support read(), and this patch
> >> >introduces it to get response from the other side.
> >> >
> >> >Signed-off-by: Yoshiaki Tamura
> >>
> >> Migration is unidirectional.  Changing this is fundamental and not
> >> something to be done lightly.
> >
> > Making it bi-directional might break libvirt's save/restore
> > to file support which uses migration, passing a unidirectional
> > FD for the file. It could also break libvirt's secure tunnelled
> > migration support which is currently only expecting to have
> > data sent in one direction on the socket.
> 
> Hi Daniel,
> 
> IIUC, this patch isn't something to make existing live migration
> bi-directional.  Just opens up a way for Kemari to use it.  Do
> you think it's dangerous for libvirt still?

The key is for it to be a no-op for any usage of the existing
'migrate' command. I had thought this was wiring up read into
the event loop too, so it would be poll()ing for reads, but
after re-reading I see this isn't the case here.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCH] KVM: x86: Convert tsc_write_lock to raw_spinlock

2011-02-10 Thread Avi Kivity

On 02/04/2011 11:49 AM, Jan Kiszka wrote:
> Code under this lock requires non-preemptibility. Ensure this also over
> -rt by converting it to raw spinlock.


Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Peter Maydell
On 10 February 2011 10:13, Anthony Liguori  wrote:
> On 02/10/2011 10:04 AM, Peter Maydell wrote:
>>
>> On 10 February 2011 08:36, Anthony Liguori  wrote:
>>> So you would model arm926ej-s as the chipset and then build up the
>>> machines
>>> by modifying parameters of the chipset (like the board id) and/or adding
>>> different components on top of it.
>>>
>>
>> Er, ARM926 is the CPU, it's not a chipset. The board ID is definitely
>> not a property of an ARM926, it's a property of the board (clue is in
>> the name :-)). I don't think versatile boards have a "chipset" really...
>>
>
> As I said, I'm not well versed in the component names in ARM.
>
> But that said, an actual processor doesn't connect directly to a bunch of
> devices.  It almost always goes through some chipset and that chipset
> typically implements a lot of functionality.
>
> I think the name of the component I'm trying to refer to is PL300, which I
> believe is the Northbridge used for the Versatile boards.

PL300 is just a bus interconnect (so you can connect multiple AXI
bus masters (cores) to multiple AXI bus slaves (devices)).
Versatile PB doesn't have anything in the documentation that claims
to be a Northbridge (PBX does, VExpress doesn't).

This is the system diagram for the Versatile Express:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0447d/I1007683.html
I don't know what you'd want to claim is a "northbridge" there.
Basically there's an FPGA with a pile of devices in it,
and there's a test chip with the core and some other devices in
it. But from a modelling perspective this is all completely
irrelevant because regardless of where the hardware designer
put the devices, they're just devices at a particular point in the
memory map and with a particular set of interrupt wiring and so
on. I don't see the point in modelling a concept that has no
user-visible effects and doesn't actually make the model any
clearer or simpler.

>> In my understanding the "machine" is the thing that says "I need a
>> 926, and an MMC controller at this address, and some UARTS,
>> and..." ie it is the thing that does the "modifying parameters"
>> and "adding different components". So if we'd still be doing that
>> I don't see how we've "got rid of the concept". I guess I'm missing
>> the point somehow.

> A machine today is basically the northbridge, southbridge, plus a bunch of
> default components to make the virtual hardware useful.

This doesn't really correspond to ARM boards I've looked at,
by and large (for instance there's no mention of the word "northbridge"
in the whole 3700 page OMAP3 TRM). PCs may be best modelled
that way, sure, but I don't think you can cram everything into that mould.

>> If you mean that you want machines to be implemented under the
>> hood as a single huge "device" you can only have one of that spans
>> the entire memory map, well I guess that's an implementation
>> detail. But conceptually machines really do exist, and we definitely
>> still want users to be able to say "I want a beagle machine; I want
>> a versatile; I want an n900".

> An n900 is a very specific hardware configuration that is best represented
> by some sort of configuration file vs. something hard coded in QEMU.

Yes, that's the whole point -- "machine" == "specific hardware
configuration".

That's not getting rid of "machine", it's just saying "we should have
some custom scripting language to define them rather than doing
them in C". You still want, fundamentally, to be able to say
  qemu-system-arm -M machinename
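
As a rough sketch of "machine" == "specific hardware configuration", a
machine can be pictured as plain data: a named list of devices, each at a
point in the memory map with some interrupt wiring. Everything below is
invented for illustration (the board name example-board, the device kinds,
addresses and IRQ numbers); it does not describe a real board or any QEMU
interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    kind: str            # device model name (illustrative)
    base: int            # base address in the memory map
    irq: Optional[int]   # interrupt line, or None if unwired


# Hypothetical machine table: name -> fixed device configuration.
MACHINES = {
    "example-board": (
        Device("uart", 0x1000, 1),
        Device("uart", 0x2000, 2),
        Device("mmc", 0x3000, 3),
    ),
}


def build_machine(name):
    """Look up a machine by name, as '-M machinename' would."""
    try:
        return MACHINES[name]
    except KeyError:
        raise ValueError(f"unknown machine: {name}") from None


def memory_map(name):
    """Return {base address: device kind} for the named machine."""
    return {dev.base: dev.kind for dev in build_machine(name)}


print(memory_map("example-board"))
```

Whether such a table lives in C or in a configuration file, the user-facing
concept stays the same: a name that selects one fixed configuration.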

-- PMM


Re: [PATCH] KVM: remove isr_ack logic from PIC

2011-02-10 Thread Avi Kivity

On 02/09/2011 12:09 PM, Gleb Natapov wrote:
> isr_ack logic was added by e48258009d to avoid unnecessary IPIs. Back
> then it made sense, but now the code checks that vcpu is ready to accept
> interrupt before sending IPI, so this logic is no longer needed. The
> patch removes it.



Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py

2011-02-10 Thread Avi Kivity

On 02/10/2011 11:14 AM, Michael Goldish wrote:
> only Fedora..boot


So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude 
Windows.XP.32.boot or Fedora.9.32.migrate?  seems reasonable.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Avi Kivity

On 02/10/2011 09:47 AM, Anthony Liguori wrote:
> So very concretely, I'm suggesting we do the following to target-i386:
>
> 1) make the i440fx device have an embedded ide controller, piix3, and
> usb controller that get initialized automatically.  The piix3 embeds
> the PCI-to-ISA bridge along with all of the default ISA devices (rtc,
> serial, etc.).

This I like.

> 2) get rid of the entire concept of machines.  Creating a i440fx is
> essentially equivalent to creating a bare machine.

No, it's not.  The 440fx does not include an IOAPIC, for example.  There
may be other optional components, or differences in wiring, that make
two machines with i440fx not identical.

> 4) model the CPUs as devices that take a pointer to a host controller,
> for x86, the normal case would be giving it a pointer to i440fx.

Surely the connection is via a bus?  An x86 cpu talks to the bus, and
there happens to be an 440fx north bridge at the end of it.  It could
also be a Q35 or something else.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Avi Kivity

On 02/10/2011 11:07 AM, Gleb Natapov wrote:

> On Thu, Feb 10, 2011 at 08:47:12AM +0100, Anthony Liguori wrote:
> >  On 02/09/2011 09:15 PM, Blue Swirl wrote:
> >  >On Wed, Feb 9, 2011 at 9:59 PM, Anthony Liguori wrote:
> >  >>On 02/09/2011 06:48 PM, Blue Swirl wrote:
> >  ISASerialState dev;
> >
> >  isa_serial_init(&dev, 0, 0x274, 0x07, NULL, NULL);
> >
> >  >>>Do you mean that there should be a generic way of doing that, like
> >  >>>sysbus_create_varargs() for qdev, or just add inline functions which
> >  >>>hide qdev property setup?
> >  >>>
> >  >>>I still think that FDT should be used in the future. That would
> >  >>>require that the properties can be set up mechanically, and I don't
> >  >>>see how your proposal would help that.
> >  >>>
> >  >>Yeah, I don't think that is a good idea anymore.  I think this is part of
> >  >>why we're having so many problems with qdev.
> >  >>
> >  >>While (most?) hardware hierarchies can be represented by device tree syntax,
> >  >>not all valid device trees correspond to interface and/or useful hardware
> >  >>hierarchies.
> >  >User creates a non-working machine and so gets to fix the problems?
> >  >How is that a problem for us?
> >
> >  It's not about creating a non-working machine.  It's about what
> >  user-level abstraction we need to provide.
> >
> >  It's a whole lot easier to implement an i440fx device with a fixed
> >  set of parameters than it is to make every possible subdevice have a
> >  proper factory interface along with mechanisms to hook everything
> >  together.
> >
> So what if it is easier, it doesn't mean it is correct thing to do. What
> you are proposing is just a huge step backwards. Maybe we shouldn't
> support hooking everything together in completely arbitrary ways, but we
> shouldn't force isa/pci devices upon our users just because they are
> non-removable on a real chip.

I disagree.  We don't want to deviate from the spec any more than we
already do.

The reason for wanting flexibility is because the code for the PIC or
RTC, for example, can be used in other Super-IO chipsets or even
standalone.  If qemu only supported the 440FX chipset, we'd have no
reason to make things flexible.

> >
> >  So very concretely, I'm suggesting we do the following to target-i386:
> >
> >  1) make the i440fx device have an embedded ide controller, piix3,
> >  and usb controller that get initialized automatically.  The piix3
> >  embeds the PCI-to-ISA bridge along with all of the default ISA
> >  devices (rtc, serial, etc.).
> This may be a problem even from security point of view. What if usb code
> (ide, serial, parallel) has guest exploitable bug? Currently I can happily
> continue running guests if they do not need affected subsystem. If we'll
> get it your way I will no longer be able to do so.


You can't just remove a device from a guest.  You have to shut it down.  
When you power it back up, you may end up with different IRQ assignments 
or expose some guest bug.


If you have a security issue in code that is exposed to the guest, you 
have to fix it.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Yoshiaki Tamura
2011/2/10 Daniel P. Berrange :
> On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
>> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
>> >Currently FdMigrationState doesn't support read(), and this patch
>> >introduces it to get response from the other side.
>> >
>> >Signed-off-by: Yoshiaki Tamura
>>
>> Migration is unidirectional.  Changing this is fundamental and not
>> something to be done lightly.
>
> Making it bi-directional might break libvirt's save/restore
> to file support which uses migration, passing a unidirectional
> FD for the file. It could also break libvirt's secure tunnelled
> migration support which is currently only expecting to have
> data sent in one direction on the socket.

Hi Daniel,

IIUC, this patch isn't something to make existing live migration
bi-directional.  Just opens up a way for Kemari to use it.  Do
you think it's dangerous for libvirt still?

Thanks,

Yoshi

>
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 11:10 AM, Gleb Natapov wrote:
> On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote:
>> On 02/10/2011 10:07 AM, Gleb Natapov wrote:
>>> So what if it is easier, it doesn't mean it is correct thing to do.
>>
>> If we spend the next 10 years trying to do the "correct thing" for
>> some arbitrary definition of correct, that's not terribly useful.
>
> Changing direction by 180 degrees every 2 years is even less useful.

If we think through what we are doing and have a coherent architecture
before changing direction, then we won't have this problem.

>> It's really simple actually.  Let's do the least clever thing and
>> model how hardware actually works.  Once we have that, we can try to
>> be better than real hardware (if it's possible).
>
> I think our understanding of how HW actually works is very different.
> You are placing too much value on where a device resides physically; for me
> it is a completely unimportant detail. Not worth even mentioning.

No, I place value on how things are modelled in the real world.

There simply aren't PCs out there that lack an RTC so I have no
interest in jumping through hoops in QEMU to make it possible to do this
without modifying QEMU code.  It might sound nice to a developer but
it's of absolutely no use to users.

>>>> If all composition is done through a factory interface, it doesn't.
>>>> But my main argument here is that we shouldn't try to make all
>>>> composition done through a factory interface--only where it makes
>>>> sense.
>>>>
>>>> So very concretely, I'm suggesting we do the following to target-i386:
>>>>
>>>> 1) make the i440fx device have an embedded ide controller, piix3,
>>>> and usb controller that get initialized automatically.  The piix3
>>>> embeds the PCI-to-ISA bridge along with all of the default ISA
>>>> devices (rtc, serial, etc.).
>>>
>>> This may be a problem even from security point of view. What if usb code
>>> (ide, serial, parallel) has guest exploitable bug? Currently I can happily
>>> continue running guests if they do not need affected subsystem. If we'll
>>> get it your way I will no longer be able to do so.
>>
>> qemu -device i440fx,ide=off
>
> So you still need to support arbitrary composition. What's the
> difference?


No, we don't.  It's possible to have an 'rtc=off' option but I'm 
tremendously opposed to doing this.  Arbitrary composition is not a 
useful goal IMHO.
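
The distinction being drawn here, a fixed chipset with a few switchable
parts versus arbitrary composition, can be sketched roughly as below. This
is only an illustration under stated assumptions: the option spelling
i440fx,ide=off comes from this thread, but the function names, the list of
fixed parts and the "usb" switch are hypothetical and do not correspond to
existing QEMU code or options.

```python
# Parts that are always present in this sketch and cannot be switched off.
FIXED_PARTS = ("piix3", "rtc", "serial")
# Parts an option may turn off; "usb" is an assumption made for the example.
OPTIONAL_PARTS = {"ide": True, "usb": True}


def parse_device_arg(arg):
    """Split 'name,key=on|off,...' into (name, {key: bool})."""
    name, *props = arg.split(",")
    options = {}
    for prop in props:
        key, sep, value = prop.partition("=")
        if not sep or value not in ("on", "off"):
            raise ValueError(f"bad option: {prop!r}")
        options[key] = value == "on"
    return name, options


def compose_i440fx(arg):
    """Return the list of parts created for a hypothetical i440fx device."""
    name, options = parse_device_arg(arg)
    if name != "i440fx":
        raise ValueError(f"unsupported device: {name!r}")
    unknown = sorted(set(options) - set(OPTIONAL_PARTS))
    if unknown:
        raise ValueError(f"not switchable: {', '.join(unknown)}")
    enabled = {**OPTIONAL_PARTS, **options}
    return list(FIXED_PARTS) + [part for part, on in enabled.items() if on]


print(compose_i440fx("i440fx,ide=off"))
```

In this sketch a request such as i440fx,rtc=off is rejected rather than
honoured, which mirrors the position that not every component needs an
off switch.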



> So why do you like -device i440fx over what we have now?

Because I don't think tools like libvirt should be doing device
composition to create an i440fx-like chipset.  I think the current path
we're on is pushing too much logic that belongs in QEMU into the
management stack.

> In current terms, what you propose would be implemented by using an i440fx
> machine type. Qdev will build it for you.

If you had an i440fx machine type, that had no non-optional components
added, and you could specify options to the machine type, yes.  But I
think you'll agree that there's no reason to not just treat the i440fx
as a device.

>> If you really care to do this.  But this desire to remove devices is
>> silly IMHO.  Concerns about security are misplaced.  If you have to
>> change the way a guest is invoked in order to eliminate security
>> problems, then there's something seriously wrong.
>
> No I do not.  I do not create guest with unneeded devices from the
> beginning.

There is very little that isn't 'unneeded'.

Regards,

Anthony Liguori

> --
> Gleb.




Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Daniel P. Berrange
On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
> >Currently FdMigrationState doesn't support read(), and this patch
> >introduces it to get response from the other side.
> >
> >Signed-off-by: Yoshiaki Tamura
> 
> Migration is unidirectional.  Changing this is fundamental and not
> something to be done lightly.

Making it bi-directional might break libvirt's save/restore
to file support which uses migration, passing a unidirectional
FD for the file. It could also break libvirt's secure tunnelled
migration support which is currently only expecting to have
data sent in one direction on the socket.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU

2011-02-10 Thread Avi Kivity

On 02/08/2011 01:55 PM, Jan Kiszka wrote:
> Only for walking the list of VMs, we do not need to hold the preemption
> disabling kvm_lock. Convert stat services, the cpufreq callback and
> mmu_shrink to RCU. For the latter, special care is required to
> synchronize its list_move_tail with kvm_destroy_vm.
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index b6a9963..e9d0ed8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3587,9 +3587,9 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>      if (nr_to_scan == 0)
>          goto out;
>
> -    raw_spin_lock(&kvm_lock);
> +    rcu_read_lock();
>
> -    list_for_each_entry(kvm, &vm_list, vm_list) {
> +    list_for_each_entry_rcu(kvm, &vm_list, vm_list) {
>          int idx, freed_pages;
>          LIST_HEAD(invalid_list);

Have to #include rculist.h, and to change all list operations on vm_list
to rcu variants.

> @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>          spin_unlock(&kvm->mmu_lock);
>          srcu_read_unlock(&kvm->srcu, idx);
>      }
> -    if (kvm_freed)
> -        list_move_tail(&kvm_freed->vm_list, &vm_list);
> +    if (kvm_freed) {
> +        raw_spin_lock(&kvm_lock);
> +        if (!kvm->deleted)
> +            list_move_tail(&kvm_freed->vm_list, &vm_list);

There is no list_move_tail_rcu().

Why check kvm->deleted?  it's in the process of being torn down anyway,
it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.

> +        raw_spin_unlock(&kvm_lock);
> +    }
>
> -    raw_spin_unlock(&kvm_lock);
> +    rcu_read_unlock();





--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 10:04 AM, Peter Maydell wrote:
> On 10 February 2011 08:36, Anthony Liguori wrote:
>> On 02/10/2011 09:16 AM, Peter Maydell wrote:
>>> On 10 February 2011 07:47, Anthony Liguori wrote:
>>>> 2) get rid of the entire concept of machines.  Creating an i440fx is
>>>> essentially equivalent to creating a bare machine.
>>>
>>> Does that make any sense for anything other than target-i386?
>>> The concept of a machine model seems a pretty obvious one
>>> for ARM boards, for instance, and I'm not sure we'd gain much
>>> by having i386 be different to the other architectures...
>>
>> Yes, it makes a lot of sense, I just don't know the component names as well
>> so bear with me :-)
>>
>> There are two types of Versatile machines today, Versatile/AB and
>> Versatile/PB.  They are both made with the same core, ARM926EJ-S, with
>> different expansions.
>>
>> So you would model arm926ej-s as the chipset and then build up the machines
>> by modifying parameters of the chipset (like the board id) and/or adding
>> different components on top of it.
>
> Er, ARM926 is the CPU, it's not a chipset. The board ID is definitely
> not a property of an ARM926, it's a property of the board (clue is in
> the name :-)). I don't think versatile boards have a "chipset" really...

As I said, I'm not well versed in the component names in ARM.

But that said, an actual processor doesn't connect directly to a bunch
of devices.  It almost always goes through some chipset and that chipset
typically implements a lot of functionality.

I think the name of the component I'm trying to refer to is PL300, which I
believe is the Northbridge used for the Versatile boards.

> In my understanding the "machine" is the thing that says "I need a
> 926, and an MMC controller at this address, and some UARTS,
> and..." ie it is the thing that does the "modifying parameters"
> and "adding different components". So if we'd still be doing that
> I don't see how we've "got rid of the concept". I guess I'm missing
> the point somehow.

A machine today is basically the northbridge, southbridge, plus a bunch
of default components to make the virtual hardware useful.

I'm suggesting that we model a proper northbridge/southbridge.

>> A good way to think about what I'm proposing is that machine->init really
>> should be a constructor for a device object.
>
> If you mean that you want machines to be implemented under the
> hood as a single huge "device" you can only have one of that spans
> the entire memory map, well I guess that's an implementation
> detail. But conceptually machines really do exist, and we definitely
> still want users to be able to say "I want a beagle machine; I want
> a versatile; I want an n900".

An n900 is a very specific hardware configuration that is best
represented by some sort of configuration file vs. something hard coded
in QEMU.

The question is, what level of component modelling do we need to do in
order to make it practical to create such configurations from a file.

Regards,

Anthony Liguori

> -- PMM




Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Gleb Natapov
On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote:
> On 02/10/2011 10:07 AM, Gleb Natapov wrote:
> >So what if it is easier, it doesn't mean it is correct thing to do.
> 
> If we spend the next 10 years trying to do the "correct thing" for
> some arbitrary definition of correct, that's not terribly useful.
Changing direction by 180 degrees every 2 years is even less useful.

> 
> It's really simple actually.  Let's do the least clever thing and
> model how hardware actually works.  Once we have that, we can try to
> be better than real hardware (if it's possible).
I think our understanding of how HW actually works is very different.
You are placing too much value on where a device resides physically; for me
it is a completely unimportant detail. Not worth even mentioning.

> 
> >
> >>If all composition is done through a factory interface, it doesn't.
> >>But my main argument here is that we shouldn't try to make all
> >>composition done through a factory interface--only where it makes
> >>sense.
> >>
> >>So very concretely, I'm suggesting we do the following to target-i386:
> >>
> >>1) make the i440fx device have an embedded ide controller, piix3,
> >>and usb controller that get initialized automatically.  The piix3
> >>embeds the PCI-to-ISA bridge along with all of the default ISA
> >>devices (rtc, serial, etc.).
> >This may be a problem even from security point of view. What if usb code
> >(ide, serial, parallel) has guest exploitable bug? Currently I can happily
> >continue running guests if they do not need affected subsystem. If we'll
> >get it your way I will no longer be able to do so.
> 
> qemu -device i440fx,ide=off
> 
So you still need to support arbitrary composition. What's the
difference? So why do you like -device i440fx over what we have now?
In current terms, what you propose would be implemented by using an i440fx
machine type. Qdev will build it for you.

> If you really care to do this.  But this desire to remove devices is
> silly IMHO.  Concerns about security are misplaced.  If you have to
> change the way a guest is invoked in order to eliminate security
> problems, then there's something seriously wrong.
> 
No I do not.  I do not create guest with unneeded devices from the
beginning.

--
Gleb.


Re: [PATCH] kvm: fix detection of BIOS disabling VMX

2011-02-10 Thread Avi Kivity

On 02/08/2011 09:45 PM, Joseph Cihula wrote:
> This patch fixes the logic used to detect whether BIOS has disabled VMX.



Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call minutes for Feb 8

2011-02-10 Thread Anthony Liguori

On 02/10/2011 10:07 AM, Gleb Natapov wrote:
> So what if it is easier, it doesn't mean it is correct thing to do.

If we spend the next 10 years trying to do the "correct thing" for some
arbitrary definition of correct, that's not terribly useful.

It's really simple actually.  Let's do the least clever thing and model
how hardware actually works.  Once we have that, we can try to be better
than real hardware (if it's possible).

>> If all composition is done through a factory interface, it doesn't.
>> But my main argument here is that we shouldn't try to make all
>> composition done through a factory interface--only where it makes
>> sense.
>>
>> So very concretely, I'm suggesting we do the following to target-i386:
>>
>> 1) make the i440fx device have an embedded ide controller, piix3,
>> and usb controller that get initialized automatically.  The piix3
>> embeds the PCI-to-ISA bridge along with all of the default ISA
>> devices (rtc, serial, etc.).
>
> This may be a problem even from security point of view. What if usb code
> (ide, serial, parallel) has guest exploitable bug? Currently I can happily
> continue running guests if they do not need affected subsystem. If we'll
> get it your way I will no longer be able to do so.


qemu -device i440fx,ide=off

If you really care to do this.  But this desire to remove devices is 
silly IMHO.  Concerns about security are misplaced.  If you have to 
change the way a guest is invoked in order to eliminate security 
problems, then there's something seriously wrong.


Regards,

Anthony Liguori


Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Yoshiaki Tamura
2011/2/10 Anthony Liguori :
> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
>>
>> Currently FdMigrationState doesn't support read(), and this patch
>> introduces it to get response from the other side.
>>
>> Signed-off-by: Yoshiaki Tamura
>>
>
> Migration is unidirectional.  Changing this is fundamental and not something
> to be done lightly.
>
> I thought we previously discussed using a protocol wrapper around the
> existing migration protocol?

AFAIR, I don't think we had that discussion before.  I applied
comments from Stefan though.  If I missed the discussion, could
you please give me the link?

Thanks,

Yoshi

>
> Regards,
>
> Anthony Liguori
>
>> ---
>>  migration-tcp.c |   15 +++
>>  migration.c     |   13 +
>>  migration.h     |    3 +++
>>  3 files changed, 31 insertions(+), 0 deletions(-)
>>
>> diff --git a/migration-tcp.c b/migration-tcp.c
>> index b55f419..55777c8 100644
>> --- a/migration-tcp.c
>> +++ b/migration-tcp.c
>> @@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void
>> * buf, size_t size)
>>      return send(s->fd, buf, size, 0);
>>  }
>>
>> +static int socket_read(FdMigrationState *s, const void * buf, size_t
>> size)
>> +{
>> +    ssize_t len;
>> +
>> +    do {
>> +        len = recv(s->fd, (void *)buf, size, 0);
>> +    } while (len == -1 && socket_error() == EINTR);
>> +    if (len == -1) {
>> +        len = -socket_error();
>> +    }
>> +
>> +    return len;
>> +}
>> +
>>  static int tcp_close(FdMigrationState *s)
>>  {
>>      DPRINTF("tcp_close\n");
>> @@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor
>> *mon,
>>
>>      s->get_error = socket_errno;
>>      s->write = socket_write;
>> +    s->read = socket_read;
>>      s->close = tcp_close;
>>      s->mig_state.cancel = migrate_fd_cancel;
>>      s->mig_state.get_status = migrate_fd_get_status;
>> diff --git a/migration.c b/migration.c
>> index 3612572..f0df5fc 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const
>> void *data, size_t size)
>>      return ret;
>>  }
>>
>> +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos,
>> size_t size)
>> +{
>> +    FdMigrationState *s = opaque;
>> +    int ret;
>> +
>> +    ret = s->read(s, data, size);
>> +    if (ret == -1) {
>> +        ret = -(s->get_error(s));
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>  void migrate_fd_connect(FdMigrationState *s)
>>  {
>>      int ret;
>> diff --git a/migration.h b/migration.h
>> index 2170792..88a6987 100644
>> --- a/migration.h
>> +++ b/migration.h
>> @@ -48,6 +48,7 @@ struct FdMigrationState
>>      int (*get_error)(struct FdMigrationState*);
>>      int (*close)(struct FdMigrationState*);
>>      int (*write)(struct FdMigrationState*, const void *, size_t);
>> +    int (*read)(struct FdMigrationState *, const void *, size_t);
>>      void *opaque;
>>  };
>>
>> @@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque);
>>
>>  ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t
>> size);
>>
>> +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos,
>> size_t size);
>> +
>>  void migrate_fd_connect(FdMigrationState *s);
>>
>>  void migrate_fd_put_ready(void *opaque);
>>
>


Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Anthony Liguori

On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:

Currently FdMigrationState doesn't support read(), and this patch
introduces it to get response from the other side.

Signed-off-by: Yoshiaki Tamura
   


Migration is unidirectional.  Changing this is fundamental and not 
something to be done lightly.


I thought we previously discussed using a protocol wrapper around the 
existing migration protocol?


Regards,

Anthony Liguori


---
  migration-tcp.c |   15 +++
  migration.c |   13 +
  migration.h |3 +++
  3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..55777c8 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * 
buf, size_t size)
  return send(s->fd, buf, size, 0);
  }

+static int socket_read(FdMigrationState *s, const void * buf, size_t size)
+{
+ssize_t len;
+
+do {
+len = recv(s->fd, (void *)buf, size, 0);
+} while (len == -1 && socket_error() == EINTR);
+if (len == -1) {
+len = -socket_error();
+}
+
+return len;
+}
+
  static int tcp_close(FdMigrationState *s)
  {
  DPRINTF("tcp_close\n");
@@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,

  s->get_error = socket_errno;
  s->write = socket_write;
+s->read = socket_read;
  s->close = tcp_close;
  s->mig_state.cancel = migrate_fd_cancel;
  s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index 3612572..f0df5fc 100644
--- a/migration.c
+++ b/migration.c
@@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void 
*data, size_t size)
  return ret;
  }

+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t 
size)
+{
+FdMigrationState *s = opaque;
+int ret;
+
+ret = s->read(s, data, size);
+if (ret == -1) {
+ret = -(s->get_error(s));
+}
+
+return ret;
+}
+
  void migrate_fd_connect(FdMigrationState *s)
  {
  int ret;
diff --git a/migration.h b/migration.h
index 2170792..88a6987 100644
--- a/migration.h
+++ b/migration.h
@@ -48,6 +48,7 @@ struct FdMigrationState
  int (*get_error)(struct FdMigrationState*);
  int (*close)(struct FdMigrationState*);
  int (*write)(struct FdMigrationState*, const void *, size_t);
+int (*read)(struct FdMigrationState *, const void *, size_t);
  void *opaque;
  };

@@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque);

  ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);

+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t 
size);
+
  void migrate_fd_connect(FdMigrationState *s);

  void migrate_fd_put_ready(void *opaque);
   




Re: [PATCH 18/18] Introduce "kemari:" to enable FT migration mode (Kemari).

2011-02-10 Thread Paolo Bonzini

On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:

When "kemari:" is prepended to the URI of the migrate command, ft_mode
is turned on to start FT migration mode (Kemari).  On the receiver
side, the option looks like: -incoming kemari:::
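
As a rough illustration of the prefix handling described above, the
sketch below peels an optional "kemari:" off a URI before the ordinary
transport parsing would run.  The names strip_prefix(),
parse_ft_prefix() and enum ft_mode_sketch are hypothetical and only
stand in for QEMU's own helper and the global ft_mode; the address in
the usage note is likewise made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the global ft_mode used by the patch. */
enum ft_mode_sketch { SKETCH_FT_OFF, SKETCH_FT_INIT };

/*
 * If str begins with prefix, store a pointer to the remainder in *rest
 * and return 1; otherwise leave *rest untouched and return 0.
 */
int strip_prefix(const char *str, const char *prefix, const char **rest)
{
    size_t n;

    assert(str != NULL && prefix != NULL && rest != NULL);
    n = strlen(prefix);
    if (strncmp(str, prefix, n) != 0) {
        return 0;
    }
    *rest = str + n;
    return 1;
}

/*
 * Peel an optional "kemari:" prefix off a migration URI.  Returns the
 * URI the ordinary transport parser should see, and reports through
 * *mode whether FT mode was requested.
 */
const char *parse_ft_prefix(const char *uri, enum ft_mode_sketch *mode)
{
    const char *rest;

    *mode = SKETCH_FT_OFF;
    if (strip_prefix(uri, "kemari:", &rest)) {
        *mode = SKETCH_FT_INIT;
        return rest;
    }
    return uri;
}
```

With this sketch, "kemari:tcp:192.168.0.1:4444" would yield
SKETCH_FT_INIT plus the remainder "tcp:192.168.0.1:4444", while a plain
"tcp:..." URI passes through unchanged with FT mode off.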

Signed-off-by: Yoshiaki Tamura
---
  hmp-commands.hx |4 +++-
  migration.c |   12 
  qmp-commands.hx |4 +++-
  3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 38e1eb7..ee14344 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -760,7 +760,9 @@ ETEXI
  "\n\t\t\t -b for migration without shared storage with"
  " full copy of disk\n\t\t\t -i for migration without "
  "shared storage with incremental copy of disk "
- "(base image shared between src and destination)",
+ "(base image shared between src and destination)"
+ "\n\t\t\t put \"kemari:\" in front of URI to enable "
+ "Fault Tolerance mode (Kemari protocol)",
  .user_print = monitor_user_noop,  
.mhandler.cmd_new = do_migrate,
  },
diff --git a/migration.c b/migration.c
index 7837c55..a3f7722 100644
--- a/migration.c
+++ b/migration.c
@@ -48,6 +48,12 @@ int qemu_start_incoming_migration(const char *uri)
  const char *p;
  int ret;

+/* check ft_mode (Kemari protocol) */
+if (strstart(uri, "kemari:",&p)) {
+ft_mode = FT_INIT;
+uri = p;
+}
+
  if (strstart(uri, "tcp:",&p))
  ret = tcp_start_incoming_migration(p);
  #if !defined(WIN32)
@@ -99,6 +105,12 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject 
**ret_data)
  return -1;
  }

+/* check ft_mode (Kemari protocol) */
+if (strstart(uri, "kemari:",&p)) {
+ft_mode = FT_INIT;
+uri = p;
+}
+
  if (strstart(uri, "tcp:",&p)) {
  s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
   blk, inc);
diff --git a/qmp-commands.hx b/qmp-commands.hx
index df40a3d..68ca48a 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -437,7 +437,9 @@ EQMP
  "\n\t\t\t -b for migration without shared storage with"
  " full copy of disk\n\t\t\t -i for migration without "
  "shared storage with incremental copy of disk "
- "(base image shared between src and destination)",
+ "(base image shared between src and destination)"
+ "\n\t\t\t put \"kemari:\" in front of URI to enable "
+ "Fault Tolerance mode (Kemari protocol)",
  .user_print = monitor_user_noop,  
.mhandler.cmd_new = do_migrate,
  },


Acked-by: Paolo Bonzini 

Paolo


[GIT PULL] KVM fix for 2.6.38-rc4

2011-02-10 Thread Avi Kivity

Linus, please pull a KVM fix from

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/2.6.38

This closes a small window during which an NMI could kill an AMD host.

Joerg Roedel (1):
  KVM: SVM: Make sure KERNEL_GS_BASE is valid when loading gs_index

 arch/x86/kvm/svm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

--
error compiling committee.c: too many arguments to function



[PATCH 04/18] qemu-char: export socket_set_nodelay().

2011-02-10 Thread Yoshiaki Tamura
Signed-off-by: Yoshiaki Tamura 
---
 qemu-char.c   |2 +-
 qemu_socket.h |1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index ee4f4ca..7286aeb 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2111,7 +2111,7 @@ static void tcp_chr_telnet_init(int fd)
 send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_nodelay(int fd)
 {
 int val = 1;
 setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
diff --git a/qemu_socket.h b/qemu_socket.h
index 897a8ae..b7f8465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_nodelay(int fd);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.1.2



[PATCH 16/18] migration: introduce migrate_ft_trans_{put,get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.

2011-02-10 Thread Yoshiaki Tamura
Introduce migrate_ft_trans_put_ready() which kicks the FT transaction
cycle.  When ft_mode is on, migrate_fd_put_ready() opens
ft_trans_file and turns on event_tap.  To end or cancel an FT
transaction, ft_mode and event_tap are turned off.
migrate_ft_trans_get_ready() is called to receive the ack from the
receiver.
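
The cycle can be read as a small state machine.  The sketch below is
only one reading of the transitions visible in this patch: the state
names are taken from the diff, but the enum itself (its order and
values) and the helpers ft_after_flush() and ft_after_commit() are
hypothetical, and treating a positive commit return as "ack still
pending" is an assumption.

```c
#include <assert.h>

/* State names as they appear in the patch; the numbering is assumed. */
enum ft_state_sketch {
    FT_OFF,
    FT_INIT,
    FT_TRANSACTION_BEGIN,
    FT_TRANSACTION_ATOMIC,
    FT_TRANSACTION_COMMIT,
    FT_TRANSACTION_RECV,
    FT_ERROR
};

/*
 * State after flushing one queued event.
 *   flush_ret < 0 : the flush failed
 *   flush_ret > 0 : events remain, so wait for the next kick
 *   flush_ret == 0: queue drained, so commit again right away
 */
enum ft_state_sketch ft_after_flush(int flush_ret)
{
    if (flush_ret < 0) {
        return FT_ERROR;
    }
    return flush_ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC;
}

/*
 * State after a commit attempt.
 *   commit_ret < 0 : the commit failed
 *   commit_ret > 0 : assumed to mean the ack has not arrived yet
 *   commit_ret == 0: ack received, continue with the flush result
 */
enum ft_state_sketch ft_after_commit(int commit_ret, int flush_ret)
{
    if (commit_ret < 0) {
        return FT_ERROR;
    }
    if (commit_ret > 0) {
        return FT_TRANSACTION_RECV;
    }
    return ft_after_flush(flush_ret);
}
```

Keeping the transitions in pure functions like this is just a
convenience for reasoning about the cycle; the patch itself updates the
global ft_mode in place.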

Signed-off-by: Yoshiaki Tamura 
---
 migration.c |  261 ++-
 1 files changed, 260 insertions(+), 1 deletions(-)

diff --git a/migration.c b/migration.c
index c5e0146..7837c55 100644
--- a/migration.c
+++ b/migration.c
@@ -21,6 +21,7 @@
 #include "qemu_socket.h"
 #include "block-migration.h"
 #include "qemu-objects.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION
 
@@ -283,6 +284,14 @@ void migrate_fd_error(FdMigrationState *s)
 migrate_fd_cleanup(s);
 }
 
+static void migrate_ft_trans_error(FdMigrationState *s)
+{
+ft_mode = FT_ERROR;
+qemu_savevm_state_cancel(s->mon, s->file);
+migrate_fd_error(s);
+event_tap_unregister();
+}
+
 int migrate_fd_cleanup(FdMigrationState *s)
 {
 int ret = 0;
@@ -318,6 +327,17 @@ void migrate_fd_put_notify(void *opaque)
 qemu_file_put_notify(s->file);
 }
 
+static void migrate_fd_get_notify(void *opaque)
+{
+FdMigrationState *s = opaque;
+
+qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+qemu_file_get_notify(s->file);
+if (qemu_file_has_error(s->file)) {
+migrate_ft_trans_error(s);
+}
+}
+
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
 {
 FdMigrationState *s = opaque;
@@ -353,6 +373,10 @@ int migrate_fd_get_buffer(void *opaque, uint8_t *data, 
int64_t pos, size_t size)
 ret = -(s->get_error(s));
 }
 
+if (ret == -EAGAIN) {
+qemu_set_fd_handler2(s->fd, NULL, migrate_fd_get_notify, NULL, s);
+}
+
 return ret;
 }
 
@@ -379,6 +403,230 @@ void migrate_fd_connect(FdMigrationState *s)
 migrate_fd_put_ready(s);
 }
 
+static int migrate_ft_trans_commit(void *opaque)
+{
+FdMigrationState *s = opaque;
+int ret = -1;
+
+if (ft_mode != FT_TRANSACTION_COMMIT && ft_mode != FT_TRANSACTION_ATOMIC) {
+fprintf(stderr,
+"migrate_ft_trans_commit: invalid ft_mode %d\n", ft_mode);
+goto out;
+}
+
+do {
+if (ft_mode == FT_TRANSACTION_ATOMIC) {
+if (qemu_ft_trans_begin(s->file) < 0) {
+fprintf(stderr, "qemu_ft_trans_begin failed\n");
+goto out;
+}
+
+ret = qemu_savevm_trans_begin(s->mon, s->file, 0);
+if (ret < 0) {
+fprintf(stderr, "qemu_savevm_trans_begin failed\n");
+goto out;
+}
+
+ft_mode = FT_TRANSACTION_COMMIT;
+if (ret) {
+/* don't proceed if the fd isn't ready yet */
+goto out;
+}
+}
+
+/* make the VM state consistent by flushing outstanding events */
+vm_stop(0);
+
+/* send at full speed */
+qemu_file_set_rate_limit(s->file, 0);
+
+ret = qemu_savevm_trans_complete(s->mon, s->file);
+if (ret < 0) {
+fprintf(stderr, "qemu_savevm_trans_complete failed\n");
+goto out;
+}
+
+ret = qemu_ft_trans_commit(s->file);
+if (ret < 0) {
+fprintf(stderr, "qemu_ft_trans_commit failed\n");
+goto out;
+}
+
+if (ret) {
+ft_mode = FT_TRANSACTION_RECV;
+ret = 1;
+goto out;
+}
+
+/* flush and check if events are remaining */
+vm_start();
+ret = event_tap_flush_one();
+if (ret < 0) {
+fprintf(stderr, "event_tap_flush_one failed\n");
+goto out;
+}
+
+ft_mode =  ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC;
+} while (ft_mode != FT_TRANSACTION_BEGIN);
+
+vm_start();
+ret = 0;
+
+out:
+return ret;
+}
+
+static int migrate_ft_trans_get_ready(void *opaque)
+{
+FdMigrationState *s = opaque;
+int ret = -1;
+
+if (ft_mode != FT_TRANSACTION_RECV) {
+fprintf(stderr,
+"migrate_ft_trans_get_ready: invalid ft_mode %d\n", ft_mode);
+goto error_out;
+}
+
+/* flush and check if events are remaining */
+vm_start();
+ret = event_tap_flush_one();
+if (ret < 0) {
+fprintf(stderr, "event_tap_flush_one failed\n");
+goto error_out;
+}
+
+if (ret) {
+ft_mode = FT_TRANSACTION_BEGIN;
+} else {
+ft_mode = FT_TRANSACTION_ATOMIC;
+
+ret = migrate_ft_trans_commit(s);
+if (ret < 0) {
+goto error_out;
+}
+if (ret) {
+goto out;
+}
+}
+
+vm_start();
+ret = 0;
+goto out;
+
+error_out:
+migrate_ft_trans_error(s);
+
+out:
+return ret;
+}
+
+static int migrate_ft_trans_put_ready(void)
+{
+FdMigrationState *s = migrate_to_fms(current_m

[PATCH 09/18] Introduce event-tap.

2011-02-10 Thread Yoshiaki Tamura
event-tap controls when to start an FT transaction, and provides proxy
functions to be called from net/block devices.  During an FT
transaction, it queues up net/block requests, and flushes them when the
transaction is completed.
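
To make the queue-then-flush idea concrete, here is a heavily
simplified sketch.  It is not the event-tap code: TapQueue, TapEvent,
tap_submit(), tap_flush() and the trace sink are hypothetical names, an
int payload stands in for a net or block request, and a plain singly
linked list replaces QTAILQ.  It keeps only two ideas from the patch,
namely queueing while a transaction is open and recycling nodes through
a pool.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued request; payload stands in for a net packet or block I/O. */
typedef struct TapEvent {
    int payload;
    struct TapEvent *next;
} TapEvent;

typedef struct TapQueue {
    TapEvent *head;   /* oldest pending event */
    TapEvent *tail;   /* newest pending event */
    TapEvent *pool;   /* recycled nodes, reused before calling malloc() */
    int tapping;      /* nonzero while a transaction is in flight */
} TapQueue;

typedef void (*TapSink)(int payload, void *opaque);

/* Example sink recording the order in which requests reach the device. */
typedef struct TapTrace {
    int seen[8];
    int count;
} TapTrace;

void tap_trace_sink(int payload, void *opaque)
{
    TapTrace *t = opaque;

    if (t->count < 8) {
        t->seen[t->count++] = payload;
    }
}

/* Proxy entry point: pass through when tapping is off, otherwise queue.
 * Returns 0 on success, -1 if a node could not be allocated. */
int tap_submit(TapQueue *q, int payload, TapSink sink, void *opaque)
{
    TapEvent *ev;

    assert(q != NULL && sink != NULL);
    if (!q->tapping) {
        sink(payload, opaque);
        return 0;
    }

    ev = q->pool;
    if (ev) {
        q->pool = ev->next;
    } else {
        ev = malloc(sizeof(*ev));
        if (!ev) {
            return -1;
        }
    }
    ev->payload = payload;
    ev->next = NULL;
    if (q->tail) {
        q->tail->next = ev;
    } else {
        q->head = ev;
    }
    q->tail = ev;
    return 0;
}

/* Transaction completed: release queued requests in FIFO order and
 * return the nodes to the pool.  Returns the number flushed. */
int tap_flush(TapQueue *q, TapSink sink, void *opaque)
{
    int n = 0;

    assert(q != NULL && sink != NULL);
    while (q->head) {
        TapEvent *ev = q->head;

        q->head = ev->next;
        sink(ev->payload, opaque);
        ev->next = q->pool;
        q->pool = ev;
        n++;
    }
    q->tail = NULL;
    return n;
}

/* Free both the pending list and the pool. */
void tap_destroy(TapQueue *q)
{
    TapEvent *lists[2];
    int i;

    lists[0] = q->head;
    lists[1] = q->pool;
    for (i = 0; i < 2; i++) {
        TapEvent *ev = lists[i];

        while (ev) {
            TapEvent *next = ev->next;

            free(ev);
            ev = next;
        }
    }
    q->head = q->tail = q->pool = NULL;
}
```

With tapping off, tap_submit() calls the sink immediately, which
mirrors the "just pass through" behaviour when Kemari is disabled.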

Signed-off-by: Yoshiaki Tamura 
Signed-off-by: OHMURA Kei 
---
 Makefile.target |1 +
 event-tap.c |  939 +++
 event-tap.h |   44 +++
 qemu-tool.c |   28 ++
 trace-events|   10 +
 5 files changed, 1022 insertions(+), 0 deletions(-)
 create mode 100644 event-tap.c
 create mode 100644 event-tap.h

diff --git a/Makefile.target b/Makefile.target
index b0ba95f..edbdbee 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,7 @@ obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_NO_KVM) += kvm-stub.o
 LIBS+=-lz
+obj-y += event-tap.o
 
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
 QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
diff --git a/event-tap.c b/event-tap.c
new file mode 100644
index 000..f44d835
--- /dev/null
+++ b/event-tap.c
@@ -0,0 +1,939 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block.h"
+#include "block_int.h"
+#include "ioport.h"
+#include "osdep.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "net.h"
+#include "event-tap.h"
+#include "trace.h"
+
+enum EVENT_TAP_STATE {
+EVENT_TAP_OFF,
+EVENT_TAP_ON,
+EVENT_TAP_SUSPEND,
+EVENT_TAP_FLUSH,
+EVENT_TAP_LOAD,
+EVENT_TAP_REPLAY,
+};
+
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+
+typedef struct EventTapIOport {
+uint32_t address;
+uint32_t data;
+int  index;
+} EventTapIOport;
+
+#define MMIO_BUF_SIZE 8
+
+typedef struct EventTapMMIO {
+uint64_t address;
+uint8_t  buf[MMIO_BUF_SIZE];
+int  len;
+} EventTapMMIO;
+
+typedef struct EventTapNetReq {
+char *device_name;
+int iovcnt;
+int vlan_id;
+bool vlan_needed;
+bool async;
+struct iovec *iov;
+NetPacketSent *sent_cb;
+} EventTapNetReq;
+
+#define MAX_BLOCK_REQUEST 32
+
+typedef struct EventTapAIOCB EventTapAIOCB;
+
+typedef struct EventTapBlkReq {
+char *device_name;
+int num_reqs;
+int num_cbs;
+bool is_flush;
+BlockRequest reqs[MAX_BLOCK_REQUEST];
+EventTapAIOCB *acb[MAX_BLOCK_REQUEST];
+} EventTapBlkReq;
+
+#define EVENT_TAP_IOPORT (1 << 0)
+#define EVENT_TAP_MMIO   (1 << 1)
+#define EVENT_TAP_NET(1 << 2)
+#define EVENT_TAP_BLK(1 << 3)
+
+#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1)
+
+typedef struct EventTapLog {
+int mode;
+union {
+EventTapIOport ioport;
+EventTapMMIO mmio;
+};
+union {
+EventTapNetReq net_req;
+EventTapBlkReq blk_req;
+};
+QTAILQ_ENTRY(EventTapLog) node;
+} EventTapLog;
+
+struct EventTapAIOCB {
+BlockDriverAIOCB common;
+BlockDriverAIOCB *acb;
+bool is_canceled;
+};
+
+static EventTapLog *last_event_tap;
+
+static QTAILQ_HEAD(, EventTapLog) event_list;
+static QTAILQ_HEAD(, EventTapLog) event_pool;
+
+static int (*event_tap_cb)(void);
+static QEMUBH *event_tap_bh;
+static VMChangeStateEntry *vmstate;
+
+static void event_tap_bh_cb(void *p)
+{
+if (event_tap_cb) {
+event_tap_cb();
+}
+
+qemu_bh_delete(event_tap_bh);
+event_tap_bh = NULL;
+}
+
+static void event_tap_schedule_bh(void)
+{
+trace_event_tap_ignore_bh(!!event_tap_bh);
+
+/* if bh is already set, we ignore it for now */
+if (event_tap_bh) {
+return;
+}
+
+event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL);
+qemu_bh_schedule(event_tap_bh);
+
+return;
+}
+
+static void *event_tap_alloc_log(void)
+{
+EventTapLog *log;
+
+if (QTAILQ_EMPTY(&event_pool)) {
+log = qemu_mallocz(sizeof(EventTapLog));
+} else {
+log = QTAILQ_FIRST(&event_pool);
+QTAILQ_REMOVE(&event_pool, log, node);
+}
+
+return log;
+}
+
+static void event_tap_free_net_req(EventTapNetReq *net_req);
+static void event_tap_free_blk_req(EventTapBlkReq *blk_req);
+
+static void event_tap_free_log(EventTapLog *log)
+{
+int mode = log->mode & ~EVENT_TAP_TYPE_MASK;
+
+if (mode == EVENT_TAP_NET) {
+event_tap_free_net_req(&log->net_req);
+} else if (mode == EVENT_TAP_BLK) {
+event_tap_free_blk_req(&log->blk_req);
+}
+
+log->mode = 0;
+
+/* return the log to event_pool */
+QTAILQ_INSERT_HEAD(&event_pool, log, node);
+}
+
+static void event_tap_free_pool(void)
+{
+EventTapLog *log, *next;
+
+QTAILQ_FOREACH_SAFE(log, &event_pool, node, next) {
+QTAILQ_REMOVE(&event_pool, log, node);
+qemu_free(log);
+}
+}
+
+static void event_tap_free_net_req(EventTapNetReq *net_req)
+{
+int i;
+
+if (!net_req->async) {
+for (i = 0; i <

[PATCH 06/18] virtio: decrement last_avail_idx with inuse before saving.

2011-02-10 Thread Yoshiaki Tamura
For regular migration inuse == 0 always as requests are flushed before
save. However, the event-tap log, when enabled, introduces an extra
queue for requests which is not flushed, thus the last inuse requests
are left in the event-tap queue.  Move the last_avail_idx value sent
to the remote back so that it repeats the last inuse requests.
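
A small worked example of the index arithmetic may help here.  The
helper rewind_avail_idx() is a hypothetical name; the point is only
that the subtraction is done modulo 2^16, so it stays correct when
last_avail_idx has already wrapped around.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Move a 16-bit ring index back by the number of in-flight requests.
 * Both operands are promoted to int for the subtraction; the cast back
 * to uint16_t reduces the result modulo 65536, so a "negative"
 * intermediate value wraps to the top of the index space.
 */
uint16_t rewind_avail_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}
```

For instance, rewinding index 10 by 3 gives 7, and rewinding index 2 by
5 gives 65533, which is the same position three entries before the wrap
point.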

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Yoshiaki Tamura 
---
 hw/virtio.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 31bd9e3..f05d1b6 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -673,12 +673,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 qemu_put_be32(f, i);
 
 for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+/* For regular migration inuse == 0 always as
+ * requests are flushed before save. However,
+ * event-tap log when enabled introduces an extra
+ * queue for requests which is not being flushed,
+ * thus the last inuse requests are left in the event-tap queue.
+ * Move the last_avail_idx value sent to the remote back
+ * to make it repeat the last inuse requests. */
+uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
 if (vdev->vq[i].vring.num == 0)
 break;
 
 qemu_put_be32(f, vdev->vq[i].vring.num);
 qemu_put_be64(f, vdev->vq[i].pa);
-qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+qemu_put_be16s(f, &last_avail);
 if (vdev->binding->save_queue)
 vdev->binding->save_queue(vdev->binding_opaque, i, f);
 }
-- 
1.7.1.2



[PATCH 00/18] Kemari for KVM v0.2.10

2011-02-10 Thread Yoshiaki Tamura
Hi,

This patch series is a revised version of Kemari for KVM, which
incorporates comments on the previous post.  The current code is based
on qemu.git f26e5a54f0554798a2e6f7a074b809b13635d007.

The changes from v0.2.9 -> v0.2.10 are:

- change migrate format to kemari::: (Paolo)

The changes from v0.2.8 -> v0.2.9 are:

- abstract common code between qemu_savevm_{state,trans}_* (Paolo)
- change incoming format to kemari::: (Paolo)

The changes from v0.2.7 -> v0.2.8 are:

- fixed calling wrong cb in event-tap
- add missing qemu_aio_release in event-tap

The changes from v0.2.6 -> v0.2.7 are:

- add AIOCB, AIOPool and cancel functions (Kevin)
- insert event-tap for bdrv_flush (Kevin)
- add error handling when calling bdrv functions (Kevin)
- fix usage of qemu_aio_flush and bdrv_flush (Kevin)
- use bs in AIOCB on the primary (Kevin)
- reorder event-tap functions to gather with block/net (Kevin)
- fix checking bs->device_name (Kevin)

The changes from v0.2.5 -> v0.2.6 are:

- use qemu_{put,get}_be32() to save/load niov in event-tap

The changes from v0.2.4 -> v0.2.5 are:

- fixed braces and trailing spaces by using Blue's checkpatch.pl (Blue)
- event-tap: don't try to send blk_req if it's a bdrv_aio_flush event

The changes from v0.2.3 -> v0.2.4 are:

- call vm_start() before event_tap_flush_one() to avoid failure in
  virtio-net assertion
- add vm_change_state_handler to turn off ft_mode
- use qemu_iovec functions in event-tap
- remove duplicated code in migration
- remove unnecessary new line for error_report in ft_trans_file

The changes from v0.2.2 -> v0.2.3 are:

- queue async net requests without copying (MST)
-- if not async, contents of the packets are sent to the secondary
- better description for option -k (MST)
- fix memory transfer failure
- fix ft transaction initiation failure

The changes from v0.2.1 -> v0.2.2 are:

- decrement last_avail_idx with inuse before saving (MST)
- remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit()

The changes from v0.2 -> v0.2.1 are:

- Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin)
- Tap bdrv_aio_flush (Marcelo)
- Remove multiwrite interface in event-tap (Stefan)
- Fix event-tap to use pio/mmio to replay both net/block (Stefan)
- Improve error handling in event-tap (Stefan)
- Fix leak in event-tap (Stefan)
- Revise virtio last_avail_idx manipulation (MST)
- Clean up migration.c hook (Marcelo)
- Make deleting change state handler robust (Isaku, Anthony)

The changes from v0.1.1 -> v0.2 are:

- Introduce a queue in event-tap to make VM sync live.
- Change transaction receiver to a state machine for async receiving.
- Replace net/block layer functions with event-tap proxy functions.
- Remove dirty bitmap optimization for now.
- convert DPRINTF() in ft_trans_file to trace functions.
- convert fprintf() in ft_trans_file to error_report().
- improved error handling in ft_trans_file.
- add a tmp pointer to qemu_del_vm_change_state_handler.

The changes from v0.1 -> v0.1.1 are:

- events are tapped in net/block layer instead of device emulation layer.
- Introduce a new option for -incoming to accept FT transaction.

- Removed writev() support to QEMUFile and FdMigrationState for now.
  I would post this work in a different series.

- Modified virtio-blk save/load handler to send inuse variable to
  correctly replay.

- Removed configure --enable-ft-mode.
- Removed unnecessary check for qemu_realloc().

The first 6 patches modify several functions of qemu to prepare for
introducing Kemari specific components.

The next 6 patches are the components of Kemari.  They introduce
event-tap and the FT transaction protocol file based on buffered file.
The design document of FT transaction protocol can be found at,
http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf

Then the following 2 patches modify net/block layer functions with
event-tap functions.  Please note that if Kemari is off, event-tap
will just pass through, and there is almost no intrusion into existing
functions, including normal live migration.

Finally, the migration layer is modified to support Kemari in the
last 4 patches.  Again, there shouldn't be any effect if a user
doesn't specify Kemari specific options.  The transaction is now async
on both sender and receiver side.  The sender side respects the
max_downtime to decide when to switch from async to sync mode.

The repository contains all patches I'm sending with this message.
For those who want to try, please pull the following repository.  It
also includes the dirty bitmap optimization, which isn't ready for
posting yet.  To remove the dirty bitmap optimization, please look at
HEAD~4 of the tree.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari next

Thanks,

Yoshi

Yoshiaki Tamura (18):
  Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and
qemu_clear_buffer().
  Introduce read() to FdMigrationState.
  Introduce skip_header parameter to qemu_loadvm_state().
  qemu-char: export soc

[PATCH 17/18] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.

2011-02-10 Thread Yoshiaki Tamura
When ft_mode is set in the header, tcp_accept_incoming_migration()
sets ft_trans_incoming() as a callback, and calls
qemu_file_get_notify() to receive FT transactions iteratively.  We also
need a hack not to close the fd before moving to ft_transaction mode,
so that we can reuse the fd for it.  vm_change_state_handler is added
to turn off ft_mode when cont is pressed.

Signed-off-by: Yoshiaki Tamura 
---
 migration-tcp.c |   67 ++-
 1 files changed, 66 insertions(+), 1 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index 55777c8..84076d6 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -18,6 +18,8 @@
 #include "sysemu.h"
 #include "buffered_file.h"
 #include "block.h"
+#include "ft_trans_file.h"
+#include "event-tap.h"
 
 //#define DEBUG_MIGRATION_TCP
 
@@ -29,6 +31,8 @@
 do { } while (0)
 #endif
 
+static VMChangeStateEntry *vmstate;
+
 static int socket_errno(FdMigrationState *s)
 {
 return socket_error();
@@ -56,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, 
size_t size)
 static int tcp_close(FdMigrationState *s)
 {
 DPRINTF("tcp_close\n");
-if (s->fd != -1) {
+/* FIX ME: accessing ft_mode here isn't clean */
+if (s->fd != -1 && ft_mode != FT_INIT) {
 close(s->fd);
 s->fd = -1;
 }
@@ -150,6 +155,36 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 return &s->mig_state;
 }
 
+static void ft_trans_incoming(void *opaque)
+{
+QEMUFile *f = opaque;
+
+qemu_file_get_notify(f);
+if (qemu_file_has_error(f)) {
+ft_mode = FT_ERROR;
+qemu_fclose(f);
+}
+}
+
+static void ft_trans_reset(void *opaque, int running, int reason)
+{
+QEMUFile *f = opaque;
+
+if (running) {
+if (ft_mode != FT_ERROR) {
+qemu_fclose(f);
+}
+ft_mode = FT_OFF;
+qemu_del_vm_change_state_handler(vmstate);
+}
+}
+
+static void ft_trans_schedule_replay(QEMUFile *f)
+{
+event_tap_schedule_replay();
+vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f);
+}
+
 static void tcp_accept_incoming_migration(void *opaque)
 {
 struct sockaddr_in addr;
@@ -175,8 +210,38 @@ static void tcp_accept_incoming_migration(void *opaque)
 goto out;
 }
 
+if (ft_mode == FT_INIT) {
+autostart = 0;
+}
+
 process_incoming_migration(f);
+
+if (ft_mode == FT_INIT) {
+int ret;
+
+socket_set_nodelay(c);
+
+f = qemu_fopen_ft_trans(s, c);
+if (f == NULL) {
+fprintf(stderr, "could not qemu_fopen_ft_trans\n");
+goto out;
+}
+
+/* need to wait sender to setup */
+ret = qemu_ft_trans_begin(f);
+if (ret < 0) {
+goto out;
+}
+
+qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f);
+ft_trans_schedule_replay(f);
+ft_mode = FT_TRANSACTION_RECV;
+
+return;
+}
+
 qemu_fclose(f);
+
 out:
 close(c);
 out2:
-- 
1.7.1.2



[PATCH 11/18] ioport: insert event_tap_ioport() to ioport_write().

2011-02-10 Thread Yoshiaki Tamura
Record ioport event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura 
---
 ioport.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index aa4188a..74aebf5 100644
--- a/ioport.c
+++ b/ioport.c
@@ -27,6 +27,7 @@
 
 #include "ioport.h"
 #include "trace.h"
+#include "event-tap.h"
 
 /***/
 /* IO Port */
@@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, 
uint32_t data)
 default_ioport_writel
 };
 IOPortWriteFunc *func = ioport_write_table[index][address];
+event_tap_ioport(index, address, data);
 if (!func)
 func = default_func[index];
 func(ioport_opaque[address], address, data);
-- 
1.7.1.2



[PATCH 10/18] Call init handler of event-tap at main() in vl.c.

2011-02-10 Thread Yoshiaki Tamura
Signed-off-by: Yoshiaki Tamura 
---
 vl.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 00155fb..f4d4abf 100644
--- a/vl.c
+++ b/vl.c
@@ -162,6 +162,7 @@ int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
+#include "event-tap.h"
 
 #include "ui/qemu-spice.h"
 
@@ -2919,6 +2920,8 @@ int main(int argc, char **argv, char **envp)
 
 blk_mig_init();
 
+event_tap_init();
+
 /* open the virtual block devices */
 if (snapshot)
 qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, 
NULL, 0);
-- 
1.7.1.2



[PATCH 02/18] Introduce read() to FdMigrationState.

2011-02-10 Thread Yoshiaki Tamura
Currently FdMigrationState doesn't support read(), and this patch
introduces it to receive a response from the other side.

Signed-off-by: Yoshiaki Tamura 
---
 migration-tcp.c |   15 +++
 migration.c |   13 +
 migration.h |3 +++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..55777c8 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * 
buf, size_t size)
 return send(s->fd, buf, size, 0);
 }
 
+static int socket_read(FdMigrationState *s, const void * buf, size_t size)
+{
+ssize_t len;
+
+do {
+len = recv(s->fd, (void *)buf, size, 0);
+} while (len == -1 && socket_error() == EINTR);
+if (len == -1) {
+len = -socket_error();
+}
+
+return len;
+}
+
 static int tcp_close(FdMigrationState *s)
 {
 DPRINTF("tcp_close\n");
@@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
 
 s->get_error = socket_errno;
 s->write = socket_write;
+s->read = socket_read;
 s->close = tcp_close;
 s->mig_state.cancel = migrate_fd_cancel;
 s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index 3612572..f0df5fc 100644
--- a/migration.c
+++ b/migration.c
@@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void 
*data, size_t size)
 return ret;
 }
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t 
size)
+{
+FdMigrationState *s = opaque;
+int ret;
+
+ret = s->read(s, data, size);
+if (ret == -1) {
+ret = -(s->get_error(s));
+}
+
+return ret;
+}
+
 void migrate_fd_connect(FdMigrationState *s)
 {
 int ret;
diff --git a/migration.h b/migration.h
index 2170792..88a6987 100644
--- a/migration.h
+++ b/migration.h
@@ -48,6 +48,7 @@ struct FdMigrationState
 int (*get_error)(struct FdMigrationState*);
 int (*close)(struct FdMigrationState*);
 int (*write)(struct FdMigrationState*, const void *, size_t);
+int (*read)(struct FdMigrationState *, const void *, size_t);
 void *opaque;
 };
 
@@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque);
 
 ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
 
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t 
size);
+
 void migrate_fd_connect(FdMigrationState *s);
 
 void migrate_fd_put_ready(void *opaque);
-- 
1.7.1.2



[PATCH 08/18] savevm: introduce util functions to control ft_trans_file from savevm layer.

2011-02-10 Thread Yoshiaki Tamura
To use the ft_trans_file functions, the savevm layer needs to export
interfaces that control them.

Signed-off-by: Yoshiaki Tamura 
---
 hw/hw.h  |5 ++
 savevm.c |  149 ++
 2 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index a168a37..a9eff5a 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_ft_trans_begin(QEMUFile *f);
+int qemu_ft_trans_commit(QEMUFile *f);
+int qemu_ft_trans_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
@@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f);
  * halted due to rate limiting or EAGAIN errors occur as it can be used to
  * resume output. */
 void qemu_file_put_notify(QEMUFile *f);
+void qemu_file_get_notify(void *opaque);
 
 static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
 {
diff --git a/savevm.c b/savevm.c
index 58e48e3..e44eccd 100644
--- a/savevm.c
+++ b/savevm.c
@@ -82,6 +82,7 @@
 #include "migration.h"
 #include "qemu_socket.h"
 #include "qemu-queue.h"
+#include "ft_trans_file.h"
 
 #define SELF_ANNOUNCE_ROUNDS 5
 
@@ -189,6 +190,13 @@ typedef struct QEMUFileSocket
 QEMUFile *file;
 } QEMUFileSocket;
 
+typedef struct QEMUFileSocketTrans
+{
+int fd;
+QEMUFileSocket *s;
+VMChangeStateEntry *e;
+} QEMUFileSocketTrans;
+
 static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileSocket *s = opaque;
@@ -204,6 +212,22 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+QEMUFileSocket *s = opaque;
+ssize_t len;
+
+do {
+len = send(s->fd, (void *)buf, size, 0);
+} while (len == -1 && socket_error() == EINTR);
+
+if (len == -1) {
+len = -socket_error();
+}
+
+return len;
+}
+
 static int socket_close(void *opaque)
 {
 QEMUFileSocket *s = opaque;
@@ -211,6 +235,70 @@ static int socket_close(void *opaque)
 return 0;
 }
 
+static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+ssize_t len;
+
+len = socket_get_buffer(s, buf, pos, size);
+
+return len;
+}
+
+static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size)
+{
+QEMUFileSocketTrans *t = opaque;
+
+return socket_put_buffer(t->s, buf, size);
+}
+
+
+static int socket_trans_get_ready(void *opaque)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+QEMUFile *f = s->file;
+int ret = 0;
+
+ret = qemu_loadvm_state(f, 1);
+if (ret < 0) {
+fprintf(stderr,
+"socket_trans_get_ready: error while loading vmstate\n");
+}
+
+return ret;
+}
+
+static int socket_trans_close(void *opaque)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+
+qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+qemu_set_fd_handler2(t->fd, NULL, NULL, NULL, NULL);
+qemu_del_vm_change_state_handler(t->e);
+close(s->fd);
+close(t->fd);
+qemu_free(s);
+qemu_free(t);
+
+return 0;
+}
+
+static void socket_trans_resume(void *opaque, int running, int reason)
+{
+QEMUFileSocketTrans *t = opaque;
+QEMUFileSocket *s = t->s;
+
+if (!running) {
+return;
+}
+
+qemu_announce_self();
+qemu_fclose(s->file);
+}
+
 static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
 {
 QEMUFileStdio *s = opaque;
@@ -333,6 +421,26 @@ QEMUFile *qemu_fopen_socket(int fd)
 return s->file;
 }
 
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd)
+{
+QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans));
+QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+t->s = s;
+t->fd = s_fd;
+t->e = qemu_add_vm_change_state_handler(socket_trans_resume, t);
+
+s->fd = c_fd;
+s->file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer,
+  socket_trans_get_buffer, NULL,
+  socket_trans_get_ready,
+  migrate_fd_wait_for_unfreeze,
+  socket_trans_close, 0

[PATCH 05/18] vl.c: add deleted flag for deleting the handler.

2011-02-10 Thread Yoshiaki Tamura
Make handler deletion robust against entries being removed while the
handler list is traversed, by marking entries with a deleted flag as is
done for file descriptor handlers.

Signed-off-by: Yoshiaki Tamura 
---
 vl.c |   13 +
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/vl.c b/vl.c
index ed2cdfa..00155fb 100644
--- a/vl.c
+++ b/vl.c
@@ -1158,6 +1158,7 @@ static void nographic_update(void *opaque)
 struct vm_change_state_entry {
 VMChangeStateHandler *cb;
 void *opaque;
+int deleted;
 QLIST_ENTRY (vm_change_state_entry) entries;
 };
 
@@ -1178,8 +1179,7 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
 
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
 {
-QLIST_REMOVE (e, entries);
-qemu_free (e);
+e->deleted = 1;
 }
 
 void vm_state_notify(int running, int reason)
@@ -1188,8 +1188,13 @@ void vm_state_notify(int running, int reason)
 
 trace_vm_state_notify(running, reason);
 
-for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
-e->cb(e->opaque, running, reason);
+QLIST_FOREACH(e, &vm_change_state_head, entries) {
+if (e->deleted) {
+QLIST_REMOVE(e, entries);
+qemu_free(e);
+} else {
+e->cb(e->opaque, running, reason);
+}
 }
 }
 
-- 
1.7.1.2
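
The deleted-flag approach in this patch defers the actual unlink and free to the next notification pass, so a callback can safely request deletion of any entry, including itself. A minimal sketch of the idea follows, assuming a hand-rolled singly linked list instead of QEMU's QLIST macros. All names here are hypothetical. The point worth noting is that the walk must not touch an entry after freeing it, so the sketch unlinks through a pointer to the previous link; as far as I know, QEMU's queue header offers a _SAFE iteration variant for the same purpose.

```c
#include <assert.h>
#include <stdlib.h>

typedef void handler_fn(void *opaque);

/* Hypothetical minimal handler list; the patch itself uses QLIST. */
struct entry {
    handler_fn *cb;
    void *opaque;
    int deleted;          /* set by handler_del(), reaped on notify */
    struct entry *next;
};

static struct entry *head;

static struct entry *handler_add(handler_fn *cb, void *opaque)
{
    struct entry *e = calloc(1, sizeof(*e));

    if (!e) {
        return NULL;
    }
    e->cb = cb;
    e->opaque = opaque;
    e->next = head;
    head = e;
    return e;
}

/* Only mark the entry, so this is safe to call from inside a callback. */
static void handler_del(struct entry *e)
{
    e->deleted = 1;
}

static void handlers_notify(void)
{
    struct entry **link = &head;

    while (*link) {
        struct entry *e = *link;

        if (e->deleted) {
            *link = e->next;   /* unlink first, then free */
            free(e);
        } else {
            e->cb(e->opaque);
            link = &e->next;
        }
    }
}

static int handlers_count(void)
{
    int n = 0;

    for (struct entry *e = head; e; e = e->next) {
        n++;
    }
    return n;
}

static void count_call(void *opaque)
{
    ++*(int *)opaque;
}
```

An entry marked during a pass is skipped or reaped on the following pass at the latest, which is the trade-off of deferring the free.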



[PATCH 14/18] block: insert event-tap to bdrv_aio_writev(), bdrv_aio_flush() and bdrv_flush().

2011-02-10 Thread Yoshiaki Tamura
The event-tap function is called only when event-tap is on and the
request was sent from a device emulator.

Signed-off-by: Yoshiaki Tamura 
Acked-by: Kevin Wolf 
---
 block.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index b476479..8ddce13 100644
--- a/block.c
+++ b/block.c
@@ -28,6 +28,7 @@
 #include "block_int.h"
 #include "module.h"
 #include "qemu-objects.h"
+#include "event-tap.h"
 
 #ifdef CONFIG_BSD
 #include 
@@ -1482,6 +1483,10 @@ int bdrv_flush(BlockDriverState *bs)
 }
 
 if (bs->drv && bs->drv->bdrv_flush) {
+if (*bs->device_name && event_tap_is_on()) {
+event_tap_bdrv_flush();
+}
+
 return bs->drv->bdrv_flush(bs);
 }
 
@@ -2117,6 +2122,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
 if (bdrv_check_request(bs, sector_num, nb_sectors))
 return NULL;
 
+if (*bs->device_name && event_tap_is_on()) {
+return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
+ cb, opaque);
+}
+
 if (bs->dirty_bitmap) {
 blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
  opaque);
@@ -2380,6 +2390,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
 
 if (!drv)
 return NULL;
+
+if (*bs->device_name && event_tap_is_on()) {
+return event_tap_bdrv_aio_flush(bs, cb, opaque);
+}
+
 return drv->bdrv_aio_flush(bs, cb, opaque);
 }
 
-- 
1.7.1.2
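
The three hunks in this patch share one guard: a request is diverted to event-tap only when the block state has a non-empty device name (that is, it belongs to an emulated device) and event-tap is switched on. A small sketch of that dispatch follows. Every type and function in it is a hypothetical stand-in, not the QEMU or event-tap API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block device state. */
struct blk_state {
    char device_name[32];  /* empty unless attached to an emulated device */
    int direct_writes;     /* writes that went straight to the driver */
};

static bool tap_on;        /* hypothetical global switch */
static int tapped_writes;  /* writes captured by the tap */

static int tap_write(struct blk_state *bs)
{
    (void)bs;
    tapped_writes++;
    return 0;
}

static int driver_write(struct blk_state *bs)
{
    bs->direct_writes++;
    return 0;
}

/* Divert to the tap only for device-originated requests while it is on. */
static int blk_write(struct blk_state *bs)
{
    if (bs->device_name[0] != '\0' && tap_on) {
        return tap_write(bs);
    }
    return driver_write(bs);
}
```

Internal requests with an empty device name always bypass the tap, so only guest-visible I/O would be recorded.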



[PATCH 07/18] Introduce fault tolerant VM transaction QEMUFile and ft_mode.

2011-02-10 Thread Yoshiaki Tamura
This code implements the VM transaction protocol.  Like buffered_file, it
sits between the savevm and migration layers.  With this architecture, the
VM transaction protocol is implemented mostly independently of other
existing code.

Signed-off-by: Yoshiaki Tamura 
Signed-off-by: OHMURA Kei 
---
 Makefile.objs   |1 +
 ft_trans_file.c |  624 +++
 ft_trans_file.h |   72 +++
 migration.c |3 +
 trace-events|   15 ++
 5 files changed, 715 insertions(+), 0 deletions(-)
 create mode 100644 ft_trans_file.c
 create mode 100644 ft_trans_file.h

diff --git a/Makefile.objs b/Makefile.objs
index 353b1a8..04148b5 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -100,6 +100,7 @@ common-obj-y += msmouse.o ps2.o
 common-obj-y += qdev.o qdev-properties.o
 common-obj-y += block-migration.o
 common-obj-y += pflib.o
+common-obj-y += ft_trans_file.o
 
 common-obj-$(CONFIG_BRLAPI) += baum.o
 common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_trans_file.c b/ft_trans_file.c
new file mode 100644
index 000..2b42b95
--- /dev/null
+++ b/ft_trans_file.c
@@ -0,0 +1,624 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "trace.h"
+#include "ft_trans_file.h"
+
+typedef struct FtTransHdr
+{
+uint16_t cmd;
+uint16_t id;
+uint32_t seq;
+uint32_t payload_len;
+} FtTransHdr;
+
+typedef struct QEMUFileFtTrans
+{
+FtTransPutBufferFunc *put_buffer;
+FtTransGetBufferFunc *get_buffer;
+FtTransPutReadyFunc *put_ready;
+FtTransGetReadyFunc *get_ready;
+FtTransWaitForUnfreezeFunc *wait_for_unfreeze;
+FtTransCloseFunc *close;
+void *opaque;
+QEMUFile *file;
+
+enum QEMU_VM_TRANSACTION_STATE state;
+uint32_t seq;
+uint16_t id;
+
+int has_error;
+
+bool freeze_output;
+bool freeze_input;
+bool rate_limit;
+bool is_sender;
+bool is_payload;
+
+uint8_t *buf;
+size_t buf_max_size;
+size_t put_offset;
+size_t get_offset;
+
+FtTransHdr header;
+size_t header_offset;
+} QEMUFileFtTrans;
+
+#define IO_BUF_SIZE 32768
+
+static void ft_trans_append(QEMUFileFtTrans *s,
+const uint8_t *buf, size_t size)
+{
+if (size > (s->buf_max_size - s->put_offset)) {
+trace_ft_trans_realloc(s->buf_max_size, size + 1024);
+s->buf_max_size += size + 1024;
+s->buf = qemu_realloc(s->buf, s->buf_max_size);
+}
+
+trace_ft_trans_append(size);
+memcpy(s->buf + s->put_offset, buf, size);
+s->put_offset += size;
+}
+
+static void ft_trans_flush(QEMUFileFtTrans *s)
+{
+size_t offset = 0;
+
+if (s->has_error) {
+error_report("flush when error %d, bailing", s->has_error);
+return;
+}
+
+while (offset < s->put_offset) {
+ssize_t ret;
+
+ret = s->put_buffer(s->opaque, s->buf + offset, s->put_offset - offset);
+if (ret == -EAGAIN) {
+break;
+}
+
+if (ret <= 0) {
+error_report("error flushing data, %s", strerror(errno));
+s->has_error = FT_TRANS_ERR_FLUSH;
+break;
+} else {
+offset += ret;
+}
+}
+
+trace_ft_trans_flush(offset, s->put_offset);
+memmove(s->buf, s->buf + offset, s->put_offset - offset);
+s->put_offset -= offset;
+s->freeze_output = !!s->put_offset;
+}
+
+static ssize_t ft_trans_put(void *opaque, void *buf, int size)
+{
+QEMUFileFtTrans *s = opaque;
+size_t offset = 0;
+ssize_t len;
+
+/* flush buffered data before putting next */
+if (s->put_offset) {
+ft_trans_flush(s);
+}
+
+while (!s->freeze_output && offset < size) {
+len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+if (len == -EAGAIN) {
+trace_ft_trans_freeze_output();
+s->freeze_output = 1;
+break;
+}
+
+if (len <= 0) {
+error_report("putting data failed, %s", strerror(errno));
+s->has_error = 1;
+offset = -EINVAL;
+break;
+}
+
+offset += len;
+}
+
+if (s->freeze_output) {
+ft_trans_append(s, buf + offset, size - offset);
+offset = size;
+}
+
+return offset;
+}
+
+static int ft_trans_send_header(QEMUFileFtTrans *s,
+enum QEMU_VM_TRANSACTION_STATE state,
+uint32_t payload_len)
+{
+int ret;
+FtTransHdr *hdr = &s->h

[PATCH 12/18] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.

2011-02-10 Thread Yoshiaki Tamura
Record MMIO write events so that they can be replayed upon failover.

Signed-off-by: Yoshiaki Tamura 
---
 exec.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index e950df2..c81fd09 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,7 @@
 #include "osdep.h"
 #include "kvm.h"
 #include "qemu-timer.h"
+#include "event-tap.h"
 #if defined(CONFIG_USER_ONLY)
 #include 
 #include 
@@ -3632,6 +3633,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
 io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
 if (p)
 addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+event_tap_mmio(addr, buf, len);
+
 /* XXX: could force cpu_single_env to NULL to avoid
potential bugs */
 if (l >= 4 && ((addr1 & 3) == 0)) {
-- 
1.7.1.2



[PATCH 15/18] savevm: introduce qemu_savevm_trans_{begin,commit}.

2011-02-10 Thread Yoshiaki Tamura
Introduce qemu_savevm_trans_{begin,commit} to send the memory and
device info together, while avoiding cancelling memory state tracking.
This patch also abstracts common code between
qemu_savevm_state_{begin,iterate,commit}.

Signed-off-by: Yoshiaki Tamura 
---
 savevm.c |  157 +++---
 sysemu.h |2 +
 2 files changed, 101 insertions(+), 58 deletions(-)

diff --git a/savevm.c b/savevm.c
index e44eccd..1c2a7fb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1601,29 +1601,68 @@ bool qemu_savevm_state_blocked(Monitor *mon)
 return false;
 }
 
-int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-int shared)
+/*
+ * section: header to write
+ * inc: if true, forces to pass SECTION_PART instead of SECTION_START
+ * pause: if true, breaks the loop when live handler returned 0
+ */
+static int qemu_savevm_state_live(Monitor *mon, QEMUFile *f, int section,
+  bool inc, bool pause)
 {
 SaveStateEntry *se;
+int skip = 0, ret;
 
 QTAILQ_FOREACH(se, &savevm_handlers, entry) {
-if(se->set_params == NULL) {
+int len, stage;
+
+if (se->save_live_state == NULL) {
 continue;
-   }
-   se->set_params(blk_enable, shared, se->opaque);
+}
+
+/* Section type */
+qemu_put_byte(f, section);
+qemu_put_be32(f, se->section_id);
+
+if (section == QEMU_VM_SECTION_START) {
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+
+stage = inc ? QEMU_VM_SECTION_PART : QEMU_VM_SECTION_START;
+} else {
+assert(inc);
+stage = section;
+}
+
+ret = se->save_live_state(mon, f, stage, se->opaque);
+if (!ret) {
+skip++;
+if (pause) {
+break;
+}
+}
 }
-
-qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
-qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+return skip;
+}
+
+static void qemu_savevm_state_full(QEMUFile *f)
+{
+SaveStateEntry *se;
 
 QTAILQ_FOREACH(se, &savevm_handlers, entry) {
 int len;
 
-if (se->save_live_state == NULL)
+if (se->save_state == NULL && se->vmsd == NULL) {
 continue;
+}
 
 /* Section type */
-qemu_put_byte(f, QEMU_VM_SECTION_START);
+qemu_put_byte(f, QEMU_VM_SECTION_FULL);
 qemu_put_be32(f, se->section_id);
 
 /* ID string */
@@ -1634,9 +1673,29 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 qemu_put_be32(f, se->instance_id);
 qemu_put_be32(f, se->version_id);
 
-se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+vmstate_save(f, se);
+}
+
+qemu_put_byte(f, QEMU_VM_EOF);
+}
+
+int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
+int shared)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+if (se->set_params == NULL) {
+continue;
+}
+se->set_params(blk_enable, shared, se->opaque);
 }
 
+qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_START, 0, 0);
+
 if (qemu_file_has_error(f)) {
 qemu_savevm_state_cancel(mon, f);
 return -EIO;
@@ -1647,29 +1706,16 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
 {
-SaveStateEntry *se;
 int ret = 1;
 
-QTAILQ_FOREACH(se, &savevm_handlers, entry) {
-if (se->save_live_state == NULL)
-continue;
-
-/* Section type */
-qemu_put_byte(f, QEMU_VM_SECTION_PART);
-qemu_put_be32(f, se->section_id);
-
-ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
-if (!ret) {
-/* Do not proceed to the next vmstate before this one reported
-   completion of the current stage. This serializes the migration
-   and reduces the probability that a faster changing state is
-   synchronized over and over again. */
-break;
-}
-}
-
-if (ret)
+/* Do not proceed to the next vmstate before this one reported
+   completion of the current stage. This serializes the migration
+   and reduces the probability that a faster changing state is
+   synchronized over and over again. */
+ret = qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_PART, 1, 1);
+if (!ret) {
 return 1;
+}
 
 if (qemu_file_has_error(f)) {
 qemu_savevm_state_cancel(mon, f);
@@ -1681,46 +1727,41 @@ int qemu_save

[PATCH 13/18] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().

2011-02-10 Thread Yoshiaki Tamura
The event-tap function is called only when event-tap is on.

Signed-off-by: Yoshiaki Tamura 
---
 net.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 9ba5be2..1176124 100644
--- a/net.c
+++ b/net.c
@@ -36,6 +36,7 @@
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "hw/qdev.h"
+#include "event-tap.h"
 
 static QTAILQ_HEAD(, VLANState) vlans;
 static QTAILQ_HEAD(, VLANClientState) non_vlan_clients;
@@ -559,6 +560,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender,
 
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
 {
+if (event_tap_is_on()) {
+return event_tap_send_packet(vc, buf, size);
+}
+
 qemu_send_packet_async(vc, buf, size, NULL);
 }
 
@@ -657,6 +662,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender,
 {
 NetQueue *queue;
 
+if (event_tap_is_on()) {
+return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb);
+}
+
 if (sender->link_down || (!sender->peer && !sender->vlan)) {
 return calc_iov_length(iov, iovcnt);
 }
-- 
1.7.1.2



[PATCH 01/18] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2011-02-10 Thread Yoshiaki Tamura
Currently the buf size is fixed at 32KB.  It would be useful if it could
be resized.

Signed-off-by: Yoshiaki Tamura 
---
 hw/hw.h  |2 ++
 savevm.c |   20 +++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 5e24329..a168a37 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 6d83b0f..6c4c72b 100644
--- a/savevm.c
+++ b/savevm.c
@@ -171,7 +171,8 @@ struct QEMUFile {
when reading */
 int buf_index;
 int buf_size; /* 0 when writing */
-uint8_t buf[IO_BUF_SIZE];
+int buf_max_size;
+uint8_t *buf;
 
 int has_error;
 };
@@ -422,6 +423,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 f->get_rate_limit = get_rate_limit;
 f->is_write = 0;
 
+f->buf_max_size = IO_BUF_SIZE;
+f->buf = qemu_malloc(sizeof(uint8_t) * f->buf_max_size);
+
 return f;
 }
 
@@ -452,6 +456,19 @@ void qemu_fflush(QEMUFile *f)
 }
 }
 
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+f->buf_max_size = size;
+f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+f->buf_size = f->buf_index = f->buf_offset = 0;
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
 int len;
@@ -477,6 +494,7 @@ int qemu_fclose(QEMUFile *f)
 qemu_fflush(f);
 if (f->close)
 ret = f->close(f->opaque);
+qemu_free(f->buf);
 qemu_free(f);
 return ret;
 }
-- 
1.7.1.2
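
This patch replaces the fixed array with a heap buffer that starts at IO_BUF_SIZE and can be resized or reset later. A minimal sketch of the same lifecycle follows, assuming the plain C allocator. The struct and function names are hypothetical. Unlike qemu_realloc(), which as far as I know aborts on allocation failure, plain realloc() can return NULL, so the sketch keeps the old buffer in that case.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define IO_BUF_SIZE 32768

/* Hypothetical stand-in for the buffer fields of QEMUFile. */
struct stream_buf {
    uint8_t *buf;
    int buf_max_size;
    int buf_index;
    int buf_size;
};

static int stream_buf_init(struct stream_buf *f)
{
    f->buf = malloc(IO_BUF_SIZE);
    if (!f->buf) {
        return -1;
    }
    f->buf_max_size = IO_BUF_SIZE;
    f->buf_index = 0;
    f->buf_size = 0;
    return 0;
}

/* Resize the buffer; on failure the old buffer and size stay valid. */
static uint8_t *stream_buf_resize(struct stream_buf *f, int size)
{
    uint8_t *p;

    if (size <= 0) {
        return NULL;
    }
    p = realloc(f->buf, (size_t)size);
    if (!p) {
        return NULL;
    }
    f->buf = p;
    f->buf_max_size = size;
    return p;
}

/* Forget any buffered data without releasing the memory. */
static void stream_buf_clear(struct stream_buf *f)
{
    f->buf_index = 0;
    f->buf_size = 0;
}

static void stream_buf_free(struct stream_buf *f)
{
    free(f->buf);
    f->buf = NULL;
    f->buf_max_size = 0;
}
```

A caller that shrinks the buffer would also need to make sure the read and write indexes still fall inside the new size; the sketch leaves that check out, as the patch does.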



[PATCH 03/18] Introduce skip_header parameter to qemu_loadvm_state().

2011-02-10 Thread Yoshiaki Tamura
Introduce skip_header parameter to qemu_loadvm_state() so that it can
be called iteratively without reading the header.

Signed-off-by: Yoshiaki Tamura 
---
 migration.c |2 +-
 savevm.c|   24 +---
 sysemu.h|2 +-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/migration.c b/migration.c
index f0df5fc..dd3bf94 100644
--- a/migration.c
+++ b/migration.c
@@ -63,7 +63,7 @@ int qemu_start_incoming_migration(const char *uri)
 
 void process_incoming_migration(QEMUFile *f)
 {
-if (qemu_loadvm_state(f) < 0) {
+if (qemu_loadvm_state(f, 0) < 0) {
 fprintf(stderr, "load of migration failed\n");
 exit(0);
 }
diff --git a/savevm.c b/savevm.c
index 6c4c72b..58e48e3 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1716,7 +1716,7 @@ typedef struct LoadStateEntry {
 int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
 QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
 QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1729,17 +1729,19 @@ int qemu_loadvm_state(QEMUFile *f)
 return -EINVAL;
 }
 
-v = qemu_get_be32(f);
-if (v != QEMU_VM_FILE_MAGIC)
-return -EINVAL;
+if (!skip_header) {
+v = qemu_get_be32(f);
+if (v != QEMU_VM_FILE_MAGIC)
+return -EINVAL;
 
-v = qemu_get_be32(f);
-if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-return -ENOTSUP;
+v = qemu_get_be32(f);
+if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
+return -ENOTSUP;
+}
+if (v != QEMU_VM_FILE_VERSION)
+return -ENOTSUP;
 }
-if (v != QEMU_VM_FILE_VERSION)
-return -ENOTSUP;
 
 while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
 uint32_t instance_id, version_id, section_id;
@@ -2062,7 +2064,7 @@ int load_vmstate(const char *name)
 return -EINVAL;
 }
 
-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
 
 qemu_fclose(f);
 if (ret < 0) {
diff --git a/sysemu.h b/sysemu.h
index 23ae17e..c86b4e8 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -81,7 +81,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
-- 
1.7.1.2
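
The skip_header flag lets a loader validate the magic and version once, on the first call, and then parse further section groups from the same stream without expecting another header. A toy sketch of that control flow follows. The stream format, constants and function names are all hypothetical and far simpler than the real savevm format.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAGIC   0x44454d4fu  /* "DEMO", a made-up magic value */
#define DEMO_VERSION 3u
#define DEMO_EOF     0x00

struct reader {
    const uint8_t *data;
    size_t len;
    size_t pos;
};

/* Read a big-endian 32-bit value, or fail if the stream is too short. */
static int get_be32(struct reader *r, uint32_t *out)
{
    const uint8_t *p;

    if (r->len - r->pos < 4) {
        return -EINVAL;
    }
    p = r->data + r->pos;
    *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    r->pos += 4;
    return 0;
}

/* Returns the number of one-byte "sections" consumed before the EOF
 * marker, or a negative errno value. The header is checked only when
 * skip_header is zero. */
static int load_stream(struct reader *r, int skip_header)
{
    uint32_t v;
    int sections = 0;

    if (!skip_header) {
        if (get_be32(r, &v) < 0 || v != DEMO_MAGIC) {
            return -EINVAL;
        }
        if (get_be32(r, &v) < 0) {
            return -EINVAL;
        }
        if (v != DEMO_VERSION) {
            return -ENOTSUP;
        }
    }

    while (r->pos < r->len && r->data[r->pos] != DEMO_EOF) {
        r->pos++;
        sections++;
    }
    if (r->pos == r->len) {
        return -EINVAL;  /* stream ended without an EOF marker */
    }
    r->pos++;            /* consume the EOF marker */
    return sections;
}
```

The first call passes 0 and later calls pass 1, which mirrors how the existing callers in this patch pass 0 for an ordinary load.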


