Re: who cames from xen?
Well, my reasons are pretty much the same as those of the people who already replied. To emphasize the ones most important to me:

- Xen developers didn't seem very interested in pushing everything into mainline; in general, the KVM development process seems much more open to me.
- It was problematic for me to use some of the new features we needed on the old kernels Xen had been based on.
- After Xen was bought by Citrix, its future course was unclear.
- Red Hat, which we've based our distro upon, switched to KVM as well (and bought Qumranet).
- Since KVM runs VMs as normal processes, there are better possibilities for various types of "shaping" using cgroups etc.
- KVM seems simpler to debug to me, and the community here is pretty friendly.

Well, that's enough, I guess :)
All I have to say is that I too am pretty grateful to the KVM and also QEMU developers. Thanks, guys!

nik

On Thu, Feb 10, 2011 at 09:20:17PM +, Mauro wrote:
> On 10 February 2011 19:30, Nikola Ciprich wrote:
> > Hi,
> > I switched from XEN to KVM long time ago, and haven't felt sorry since
> > then...
> > Are You interestid in something in particular?
>
> Then.I'm interested on your motivations to switch from xen to kvm.
> If it's important I use debian squeeze.

--
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: add the __noclone attribute
The changelog of 104f226 said "adds the __noclone attribute",
but it was missing in its patch. I think it is still needed.

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bf89ec2..de99a4d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3962,7 +3962,7 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 #define Q "l"
 #endif
 
-static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
+static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
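For readers unfamiliar with the attribute, here is a minimal user-space sketch of why a function containing a named assembler label would want it. It assumes that `__noclone` maps to GCC's `noclone` function attribute and that the concern is the compiler emitting a second copy of the function body, which would define the label twice. The names `demo_run`, `demo_noclone`, `demo_noinline` and `demo_run_return_label` are made up for illustration and are not taken from vmx.c.

```c
#include <assert.h>
#include <stdint.h>

/* Older compilers lack __has_attribute; treat every attribute as absent. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Hypothetical stand-in for the kernel's __noclone (assumed to be GCC's
 * noclone attribute). Expands to nothing where it is not supported. */
#if __has_attribute(noclone)
#define demo_noclone __attribute__((noclone))
#else
#define demo_noclone
#endif

#if defined(__GNUC__)
#define demo_noinline __attribute__((noinline))
#else
#define demo_noinline
#endif

/*
 * The asm statement defines a named label. If the compiler inlined this
 * function into two callers, or created a specialised clone of it, the
 * label would be emitted twice and the assembler would reject the file.
 * noinline and noclone together ask for exactly one copy of the body.
 */
demo_noinline demo_noclone uint32_t demo_run(uint32_t x)
{
    assert(x != UINT32_MAX); /* keep the increment from wrapping */
#if defined(__GNUC__) && !defined(__clang__) && defined(__linux__) && \
    (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("demo_run_return_label:" ::: "memory");
#endif
    return x + 1;
}
```

Calling `demo_run()` from several places with constant arguments is the kind of pattern that can tempt an optimizer into cloning, which is the case the attribute is meant to rule out.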
Re: RFC: New API for PPC for vcpu mmu access
On 11.02.2011, at 01:22, Alexander Graf wrote: > > On 11.02.2011, at 01:20, Alexander Graf wrote: > >> >> On 10.02.2011, at 19:51, Scott Wood wrote: >> >>> On Thu, 10 Feb 2011 12:45:38 +0100 >>> Alexander Graf wrote: >>> Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa). >>> >>> No, this is a full sync -- the list replaces any existing TLB entries (need >>> to make that explicit in the doc). Basically it's an invalidate plus a >>> list of tlb set operations. >>> >>> Qemu's internal representation will want to be ordered with no missing >>> entries. If we require that of the transfer representation we can't do >>> early termination. It would also limit Qemu's flexibility in choosing its >>> internal representation, and make it more awkward to support multiple MMU >>> types. >> >> Well, but this way it means we'll have to assemble/disassemble a list of >> entries multiple times: >> >> SET: >> * qemu assembles the list from its internal representation >> * kvm disassembles the list into its internal structure >> >> GET: >> * kvm assembles the list from its internal representation >> * qemu disassembles the list into its internal structure >> >> Maybe we should go with Avi's proposal after all and simply keep the full >> soft-mmu synced between kernel and user space? That way we only need a setup >> call at first, no copying in between and simply update the user space >> version whenever something changes in the guest. We need to store the TLB's >> contents off somewhere anyways, so all we need is an additional in-kernel >> array with internal translation data, but that can be separate from the >> guest visible data, right? 
>
> If we could then keep qemu's internal representation == shared data with kvm
> == kvm's internal data for guest visible stuff, we get this done with almost
> no additional overhead. And I don't see any problem with this. Should be
> easily doable.

So then everything we need to get all the functionality we need is a hint from kernel to user space that something changed and vice versa.

From kernel to user space is simple. We can just document that after every RUN, all fields can be modified.

From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

Also, please tell me you didn't implement the previous revisions already. It'd be a real bummer to see that work wasted only because we're still iterating through the spec O_o.

Alex
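To make the dirty-bitmap idea concrete, here is a rough sketch of what the kernel side of such a "something changed" call might look like. Everything in it is an assumption for illustration: the 64-entry TLB size, the `demo_shadow` bookkeeping structure and the `demo_tlb_dirty()` name are hypothetical, and the threshold is just one way to choose between dropping all shadow state and flushing entries individually.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TLB_ENTRIES 64 /* hypothetical guest TLB size */

/* Hypothetical stand-in for KVM's shadow state: one valid flag per entry. */
struct demo_shadow {
    uint8_t  valid[DEMO_TLB_ENTRIES];
    unsigned full_flushes;   /* times all shadow state was dropped */
    unsigned single_flushes; /* individually flushed entries */
};

static int demo_bit_set(const uint64_t *bitmap, size_t i)
{
    return (bitmap[i / 64] >> (i % 64)) & 1u;
}

/*
 * Handle a "user space changed these entries" notification.
 * Returns 1 if everything was flushed, 0 if only dirty entries were.
 * A threshold of 0 flushes everything as soon as any entry is dirty,
 * which is close to the "ignore the bitmap" behaviour.
 */
int demo_tlb_dirty(struct demo_shadow *s, const uint64_t *bitmap,
                   unsigned threshold)
{
    unsigned dirty = 0;

    assert(s != NULL && bitmap != NULL);

    for (size_t i = 0; i < DEMO_TLB_ENTRIES; i++)
        dirty += (unsigned)demo_bit_set(bitmap, i);

    if (dirty > threshold) {
        /* Cheaper to drop all shadow state at once. */
        memset(s->valid, 0, sizeof(s->valid));
        s->full_flushes++;
        return 1;
    }

    for (size_t i = 0; i < DEMO_TLB_ENTRIES; i++) {
        if (demo_bit_set(bitmap, i)) {
            s->valid[i] = 0;
            s->single_flushes++;
        }
    }
    return 0;
}
```

Whether a threshold pays off depends on how expensive a single shadow flush is compared to rebuilding the whole shadow TLB, which this sketch does not model.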
Re: [PATCH uq/master -v2 2/2] KVM, MCE, unpoison memory address across reboot
On Thu, 2011-02-10 at 16:52 +0800, Jan Kiszka wrote: > On 2011-02-10 01:27, Huang Ying wrote: > >>> @@ -1882,6 +1919,7 @@ int kvm_arch_on_sigbus_vcpu(CPUState *en > >>> hardware_memory_error(); > >>> } > >>> } > >>> +kvm_hwpoison_page_add(ram_addr); > >>> > >>> if (code == BUS_MCEERR_AR) { > >>> /* Fake an Intel architectural Data Load SRAR UCR */ > >>> @@ -1926,6 +1964,7 @@ int kvm_arch_on_sigbus(int code, void *a > >>> "QEMU itself instead of guest system!: %p\n", addr); > >>> return 0; > >>> } > >>> +kvm_hwpoison_page_add(ram_addr); > >>> kvm_mce_inj_srao_memscrub2(first_cpu, paddr); > >>> } else > >>> #endif > >>> > >>> > >> > >> Looks fine otherwise. Unless that simplification makes sense, I could > >> offer to include this into my MCE rework (there is some minor conflict). > >> If all goes well, that series should be posted during this week. > > Please have a look at > > git://git.kiszka.org/qemu-kvm.git queues/kvm-upstream > > and tell me if it works for you and your signed-off still applies. Thanks! Works as expected in my testing! Best Regards, Huang Ying -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: New API for PPC for vcpu mmu access
On 11.02.2011, at 01:20, Alexander Graf wrote: > > On 10.02.2011, at 19:51, Scott Wood wrote: > >> On Thu, 10 Feb 2011 12:45:38 +0100 >> Alexander Graf wrote: >> >>> Ok, thinking about this a bit more. You're basically proposing a list of >>> tlb set calls, with each array field identifying one tlb set call. What >>> I was thinking of was a full TLB sync, so we could keep qemu's internal >>> TLB representation identical to the ioctl layout and then just call that >>> one ioctl to completely overwrite all of qemu's internal data (and vice >>> versa). >> >> No, this is a full sync -- the list replaces any existing TLB entries (need >> to make that explicit in the doc). Basically it's an invalidate plus a >> list of tlb set operations. >> >> Qemu's internal representation will want to be ordered with no missing >> entries. If we require that of the transfer representation we can't do >> early termination. It would also limit Qemu's flexibility in choosing its >> internal representation, and make it more awkward to support multiple MMU >> types. > > Well, but this way it means we'll have to assemble/disassemble a list of > entries multiple times: > > SET: > * qemu assembles the list from its internal representation > * kvm disassembles the list into its internal structure > > GET: > * kvm assembles the list from its internal representation > * qemu disassembles the list into its internal structure > > Maybe we should go with Avi's proposal after all and simply keep the full > soft-mmu synced between kernel and user space? That way we only need a setup > call at first, no copying in between and simply update the user space version > whenever something changes in the guest. We need to store the TLB's contents > off somewhere anyways, so all we need is an additional in-kernel array with > internal translation data, but that can be separate from the guest visible > data, right? 
If we could then keep qemu's internal representation == shared data with kvm == kvm's internal data for guest visible stuff, we get this done with almost no additional overhead. And I don't see any problem with this. Should be easily doable. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: New API for PPC for vcpu mmu access
On 10.02.2011, at 19:51, Scott Wood wrote:

> On Thu, 10 Feb 2011 12:45:38 +0100
> Alexander Graf wrote:
>
>> Ok, thinking about this a bit more. You're basically proposing a list of
>> tlb set calls, with each array field identifying one tlb set call. What
>> I was thinking of was a full TLB sync, so we could keep qemu's internal
>> TLB representation identical to the ioctl layout and then just call that
>> one ioctl to completely overwrite all of qemu's internal data (and vice
>> versa).
>
> No, this is a full sync -- the list replaces any existing TLB entries (need
> to make that explicit in the doc). Basically it's an invalidate plus a
> list of tlb set operations.
>
> Qemu's internal representation will want to be ordered with no missing
> entries. If we require that of the transfer representation we can't do
> early termination. It would also limit Qemu's flexibility in choosing its
> internal representation, and make it more awkward to support multiple MMU
> types.

Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times:

SET:
 * qemu assembles the list from its internal representation
 * kvm disassembles the list into its internal structure

GET:
 * kvm assembles the list from its internal representation
 * qemu disassembles the list into its internal structure

Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

Alex
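The assemble/disassemble round trip discussed above can be sketched roughly as follows. This is only an illustration of the conversion cost under assumed data layouts: `demo_slot` (an ordered internal array with no missing entries), `demo_xfer` (a compact transfer list that stops early) and both function names are hypothetical and do not come from any posted revision of the API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SLOTS 16 /* hypothetical number of TLB slots */

/* Internal representation: ordered, one slot per entry, no gaps. */
struct demo_slot {
    uint8_t  valid;
    uint32_t epn; /* effective page number (illustrative field) */
    uint32_t rpn; /* real page number (illustrative field) */
};

/* Transfer representation: only valid entries, each naming its slot. */
struct demo_xfer {
    uint32_t index;
    uint32_t epn;
    uint32_t rpn;
};

/* SET path, sender side: build the list. Returns the entry count, so
 * the list can terminate early instead of carrying empty slots. */
size_t demo_assemble(const struct demo_slot *slots, struct demo_xfer *list)
{
    size_t n = 0;

    assert(slots != NULL && list != NULL);
    for (uint32_t i = 0; i < DEMO_SLOTS; i++) {
        if (!slots[i].valid)
            continue;
        list[n].index = i;
        list[n].epn = slots[i].epn;
        list[n].rpn = slots[i].rpn;
        n++;
    }
    return n;
}

/* SET path, receiver side: a full sync is an invalidate of everything
 * followed by one "tlb set" per list entry. */
void demo_disassemble(const struct demo_xfer *list, size_t n,
                      struct demo_slot *slots)
{
    assert(list != NULL && slots != NULL);
    for (size_t i = 0; i < DEMO_SLOTS; i++)
        slots[i].valid = 0;
    for (size_t k = 0; k < n; k++) {
        if (list[k].index >= DEMO_SLOTS)
            continue; /* ignore out-of-range entries */
        slots[list[k].index].valid = 1;
        slots[list[k].index].epn = list[k].epn;
        slots[list[k].index].rpn = list[k].rpn;
    }
}
```

The GET direction would run the same two steps with the roles of qemu and kvm swapped, which is the duplication that a shared array would avoid.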
Re: who cames from xen?
We have switched for the same reasons one year ago. We have 50 physical servers and around 400 VM's on them, ranging from single hosts with RAID 1 internal storage to iSCSI solutions. On Thu, Feb 10, 2011 at 3:53 PM, Dan VerWeire wrote: > I switched from Xen to KVM a couple years ago for the following reasons IIRC: > > -Xen was stuck on an older kernel that did not have the drivers I needed > -Ubuntu decided to switch their focus to KVM as their main > virtualization package > -KVM had more solid Windows network drivers > -I personally didn't like the Dom0/DomU concept I just want my VMs to > be processes on the host just like any other process > -KVM was in the kernel which gave me a good feeling about the > longevity and support of the project > > I am a sys admin (among other things) for a wholesale distribution > company. We have 28 virtual machines on 3 different hosts. They are a > mixture of Windows and Linux. I am extremely happy with KVM and > Ubuntu's support of KVM. It is awesome to get new features like KSM > and Ceph block devices (which I haven't used yet but am very excited > about) as the kernel and KVM evolve. > > I can say that, in my experience, our VMs run more solid on KVM than > they did on Xen and even more solid than on bare metal, especially in > the case of Windows. > > Thank you KVM developers. > > Dan VerWeire > > > On Thu, Feb 10, 2011 at 4:20 PM, Mauro wrote: >> On 10 February 2011 19:30, Nikola Ciprich wrote: >>> Hi, >>> I switched from XEN to KVM long time ago, and haven't felt sorry since >>> then... >>> Are You interestid in something in particular? >> >> Then.I'm interested on your motivations to switch from xen to kvm. >> If it's important I use debian squeeze. 
>> -- >> To unsubscribe from this list: send the line "unsubscribe kvm" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: who cames from xen?
On Thu, Feb 10, 2011 at 1:53 PM, Dan VerWeire wrote: > I switched from Xen to KVM a couple years ago for the following reasons IIRC: We're in the process of switching from Xen to KVM, for similar reasons, but also to get away from the hassle that is configuring Xen, especially with the crap that is Grub2. With KVM, it's easy to run "Linux Version X" on the host, and "Linux Version X+Y" as guests. Trying to get that setup to work with Xen, especially with Grub1 on the host, and the VMs wanting Grub2 (aka using Debian Lenny for Dom0 and Debian Squeeze for DomU) was an extreme exercise in frustration. Doing the same with KVM is a snap. The whole Dom0/DomU split is a hassle as well. Now that all CPUs (well, at least all of AMD's CPUs) support hardware virt, I honestly do not see a reason to use Xen. It's just not worth the hassle for a theoretical couple % better performance. > Thank you KVM developers. Wholeheartedly agree!!! -- Freddie Cash fjwc...@gmail.com -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: who cames from xen?
I switched from Xen to KVM a couple years ago for the following reasons IIRC:

-Xen was stuck on an older kernel that did not have the drivers I needed
-Ubuntu decided to switch their focus to KVM as their main virtualization package
-KVM had more solid Windows network drivers
-I personally didn't like the Dom0/DomU concept; I just want my VMs to be processes on the host just like any other process
-KVM was in the kernel, which gave me a good feeling about the longevity and support of the project

I am a sys admin (among other things) for a wholesale distribution company. We have 28 virtual machines on 3 different hosts. They are a mixture of Windows and Linux. I am extremely happy with KVM and Ubuntu's support of KVM. It is awesome to get new features like KSM and Ceph block devices (which I haven't used yet but am very excited about) as the kernel and KVM evolve.

I can say that, in my experience, our VMs run more solid on KVM than they did on Xen and even more solid than on bare metal, especially in the case of Windows.

Thank you KVM developers.

Dan VerWeire

On Thu, Feb 10, 2011 at 4:20 PM, Mauro wrote:
> On 10 February 2011 19:30, Nikola Ciprich wrote:
>> Hi,
>> I switched from XEN to KVM long time ago, and haven't felt sorry since
>> then...
>> Are You interestid in something in particular?
>
> Then.I'm interested on your motivations to switch from xen to kvm.
> If it's important I use debian squeeze.
Re: who cames from xen?
On 10 February 2011 19:30, Nikola Ciprich wrote:
> Hi,
> I switched from XEN to KVM long time ago, and haven't felt sorry since then...
> Are You interestid in something in particular?

Then I'm interested in your motivations for switching from Xen to KVM.
If it matters, I use Debian squeeze.
RE: Does KVM use one EPT table per Guest CR3?
Sorry for the late reply. Seems to me that the EPTP pointer is changing because of kvm_set_cr0. Here is what I did and please correct me if I am doing the trace incorrectly: - Added a trace entry in vmx_set_cr3 where a trace message is outputted whenever vmcs_read64(EPT_POINTER) != eptp after construct_eptp(cr3). I then looked at the trace log and seems to show up with kvm_exit: reason cr_access rip 0xc0122003 kvm_cr: cr_write 0 = 0x8005003b I also noticed that kvm_mmu_reset_context(vcpu) is being called at the end of kvm_set_cr0. The CR0 value of 0x8005003b doesn't seem to trigger any of the if cases which would indicate that kvm_mmu_reset_context(vcpu) is being called and could be the reason why eptp is changing. Thanks for your help again. Enjoy, Lok From: Avi Kivity [a...@redhat.com] Sent: Sunday, December 19, 2010 9:31 AM To: Lok Kwong Yan Cc: Anthony Liguori; kvm@vger.kernel.org Subject: Re: Does KVM use one EPT table per Guest CR3? On 12/17/2010 05:24 PM, Avi Kivity wrote: > On 12/17/2010 12:14 AM, Lok Kwong Yan wrote: >> Thanks for the reply and it makes a lot of sense. >> >> I am not seeing any EPT tables being zapped after the guest has fully >> started up although the value of EPTP continuously changes as the >> guest is running. > > Really strange, this is likely a bug. > I tried to reproduce, the only times I see eptp changes are when the guest reprograms the vga adapter: qemu-system-x86-20944 [033] 1327.151819: kvm_pio: pio_write at 0x3ce size 2 count 1 qemu-system-x86-20944 [033] 1327.151819: kvm_userspace_exit: reason KVM_EXIT_IO (2) qemu-system-x86-20944 [033] 1327.152405: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=237568 role=122881 root_count=0 unsync=0 ... 
qemu-system-x86-20944 [033] 1327.153230: kvm_mmu_prepare_zap_page: [FAILED TO PARSE] gfn=0 role=253956 root_count=2 unsync=0 qemu-system-x86-20944 [033] 1327.153339: kvm_mmu_get_page: sp gfn 0 0/4 q0 direct --- !pge !nxe root 0sync qemu-system-x86-20944 [033] 1327.153344: print: a0265cde vmx_set_cr3: eptp fef14101 Under what scenario do you see eptp changing? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
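As an aid for reading the `cr_write 0 = 0x8005003b` trace line quoted earlier in this message, the following sketch decodes a CR0 value into its architectural flag names. The bit positions are the standard x86 CR0 layout; the helper name `demo_cr0_decode` and the table are made up for this example and are not KVM code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Architecturally defined x86 CR0 flags. */
struct demo_cr0_bit {
    uint32_t    mask;
    const char *name;
};

static const struct demo_cr0_bit demo_cr0_bits[] = {
    { 1u << 0,  "PE" }, /* protection enable */
    { 1u << 1,  "MP" }, /* monitor coprocessor */
    { 1u << 2,  "EM" }, /* x87 emulation */
    { 1u << 3,  "TS" }, /* task switched */
    { 1u << 4,  "ET" }, /* extension type */
    { 1u << 5,  "NE" }, /* numeric error */
    { 1u << 16, "WP" }, /* write protect */
    { 1u << 18, "AM" }, /* alignment mask */
    { 1u << 29, "NW" }, /* not write-through */
    { 1u << 30, "CD" }, /* cache disable */
    { 1u << 31, "PG" }, /* paging */
};

/*
 * Write the names of the set flags into buf, separated by spaces.
 * Returns how many flag names were written in full; stops early if
 * the buffer is too small, leaving the last name truncated.
 */
unsigned demo_cr0_decode(uint32_t cr0, char *buf, size_t len)
{
    unsigned count = 0;
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(demo_cr0_bits) / sizeof(demo_cr0_bits[0]); i++) {
        int n;

        if (!(cr0 & demo_cr0_bits[i].mask))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", demo_cr0_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of space */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For 0x8005003b this lists PE, MP, TS, ET, NE, WP, AM and PG, so the traced value has TS set while EM, NW and CD are clear.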
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, 10 Feb 2011 19:22:38 + Peter Maydell wrote: > On 10 February 2011 19:17, Scott Wood wrote: > > On Thu, 10 Feb 2011 08:16:15 + > > Peter Maydell wrote: > >> On 10 February 2011 07:47, Anthony Liguori wrote: > >> > So very concretely, I'm suggesting we do the following to target-i386: > >> > >> > 2) get rid of the entire concept of machines. Creating a i440fx is > >> > essentially equivalent to creating a bare machine. > >> > >> Does that make any sense for anything other than target-i386? > > > It makes a lot of sense for us on powerpc. Maybe it has to do with a > > longer tradition of using device trees versus opaque machine IDs -- I don't > > think the hardware itself makes any substantial difference. Currently we > > end up having everything pretend to be an mpc8544ds (with some differences > > described by the guest device tree that the user feeds in), which is ugly. > > Hmm. Device tree is coming to ARM, but just at the moment it's > generally one-kernel-one-machine still. (We've only just gained the > ability to compile one kernel for both UP and SMP...) > > I kind of think you're still defining a "machine", you're just doing it > in your device tree blob rather than in C. Right, that's the point -- the definition is just a definition, it's not tied up with implementation. This reduces the amount of duplication in implementation (or inappropriate sharing, as in the "use mpc8544ds for all 85xx" case). -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: who cames from xen?
Hi, I switched from XEN to KVM long time ago, and haven't felt sorry since then... Are You interestid in something in particular? n. On Thu, Feb 10, 2011 at 03:28:10PM +, Mauro wrote: > I'm using xen for years with no problems in my production environments. > Now I want to try kvm. > Any experiences here from xen to kvm? > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28. rijna 168, 709 01 Ostrava tel.: +420 596 603 142 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes for Feb 8
On 10 February 2011 19:17, Scott Wood wrote: > On Thu, 10 Feb 2011 08:16:15 + > Peter Maydell wrote: >> On 10 February 2011 07:47, Anthony Liguori wrote: >> > So very concretely, I'm suggesting we do the following to target-i386: >> >> > 2) get rid of the entire concept of machines. Creating a i440fx is >> > essentially equivalent to creating a bare machine. >> >> Does that make any sense for anything other than target-i386? > It makes a lot of sense for us on powerpc. Maybe it has to do with a > longer tradition of using device trees versus opaque machine IDs -- I don't > think the hardware itself makes any substantial difference. Currently we > end up having everything pretend to be an mpc8544ds (with some differences > described by the guest device tree that the user feeds in), which is ugly. Hmm. Device tree is coming to ARM, but just at the moment it's generally one-kernel-one-machine still. (We've only just gained the ability to compile one kernel for both UP and SMP...) I kind of think you're still defining a "machine", you're just doing it in your device tree blob rather than in C. -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, 10 Feb 2011 08:16:15 + Peter Maydell wrote: > On 10 February 2011 07:47, Anthony Liguori wrote: > > So very concretely, I'm suggesting we do the following to target-i386: > > > 2) get rid of the entire concept of machines. Creating a i440fx is > > essentially equivalent to creating a bare machine. > > Does that make any sense for anything other than target-i386? > The concept of a machine model seems a pretty obvious one > for ARM boards, for instance, and I'm not sure we'd gain much > by having i386 be different to the other architectures... It makes a lot of sense for us on powerpc. Maybe it has to do with a longer tradition of using device trees versus opaque machine IDs -- I don't think the hardware itself makes any substantial difference. Currently we end up having everything pretend to be an mpc8544ds (with some differences described by the guest device tree that the user feeds in), which is ugly. -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: New API for PPC for vcpu mmu access
On Thu, 10 Feb 2011 12:45:38 +0100
Alexander Graf wrote:

> Ok, thinking about this a bit more. You're basically proposing a list of
> tlb set calls, with each array field identifying one tlb set call. What
> I was thinking of was a full TLB sync, so we could keep qemu's internal
> TLB representation identical to the ioctl layout and then just call that
> one ioctl to completely overwrite all of qemu's internal data (and vice
> versa).

No, this is a full sync -- the list replaces any existing TLB entries (need to make that explicit in the doc). Basically it's an invalidate plus a list of tlb set operations.

Qemu's internal representation will want to be ordered with no missing entries. If we require that of the transfer representation we can't do early termination. It would also limit Qemu's flexibility in choosing its internal representation, and make it more awkward to support multiple MMU types.

Let's see if the format conversion imposes significant overhead before imposing a less flexible/larger transfer format. :-)

> > MMU type ID also controls this, but could add some padding to make
> > extensions simpler (esp. since we're not making an array of it). How much
> > would you recommend?
>
> How about making it 64 bytes? That should leave us plenty of room.

OK.

> > The fields inside the struct should be __u32, of course. :-P
>
> Ugh, yes :). But since we're dopping this anyways, it doesn't matter,
> right? :)

Right.

> > I assumed most MMU types would have some straightforward way of marking an
> > entry invalid (if not, it can add a software field in the struct), and that
> > it would be MMU-specific code that is processing the list.
>
> See above :).

Which part?

-Scott
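As an illustration of the "pad it to 64 bytes" suggestion, here is one way a fixed-size, extensible parameter structure is commonly laid out. The thread does not show the structure under discussion, so the name `demo_mmu_params`, its two fields and the zero-padding check are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PARAMS_SIZE 64 /* total size agreed on in the thread */

/* Hypothetical parameter block: a few defined fields, the rest reserved. */
struct demo_mmu_params {
    uint32_t mmu_type; /* selects how the remaining data is interpreted */
    uint32_t flags;    /* illustrative second field */
    uint8_t  pad[DEMO_PARAMS_SIZE - 2 * sizeof(uint32_t)];
};

/* Catch accidental size changes at compile time (C11). */
_Static_assert(sizeof(struct demo_mmu_params) == DEMO_PARAMS_SIZE,
               "demo_mmu_params must stay exactly 64 bytes");

/*
 * Reject blocks whose reserved bytes are not zero, so that a later
 * extension can give those bytes a meaning without misreading old
 * callers. Returns 0 if the block is acceptable, -1 otherwise.
 */
int demo_params_check(const struct demo_mmu_params *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < sizeof(p->pad); i++) {
        if (p->pad[i] != 0)
            return -1;
    }
    return 0;
}
```

Because the structure is passed once rather than as an array, the unused bytes cost little, which matches the reasoning given for adding padding here.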
vhost disables kvm acceleration
Hello I have set up a server with kernel 2.6.37, qemu-kvm-0.13.0 compiled from source, and libvirt 0.8.3 patched so to enable use of netdev (and hence vhost). When I modprobe vhost_net and restart a VM having virtio networking, the VM crawls at 1/100th of its normal speed. It seems to me it's about the speed of emulation without kvm acceleration (I tried that). If I stop the machine, remove vhost_net module, and restart the VM, it is normal speed again. These are the invocations by libvirt: (note that libvirt autodetects presence of vhost and uses it, the config of the VM hasn't changed ; also note that they both specify -enable-kvm ...) without vhost_net module (=fast) LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc-0.13 -enable-kvm -m 4096 -smp 2,sockets=2,cores=1,threads=1 -name uarray_server -uuid 7db77cca-addd-4cf4-f7cd-5399d217543e -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/uarray_server.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot c -drive file=/virtualmachines/myserver.raw,if=none,id=drive-virtio-disk0,boot=on,format=raw -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=54,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:69:94:91:65,bus=pci.0,addr=0x3 -usb -vnc 127.0.0.1:3 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 with vhost_net module (=slow) LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/local/kvm/bin/qemu-system-x86_64 -S -M pc-0.13 -enable-kvm -m 4096 -smp 2,sockets=2,cores=1,threads=1 -name uarray_server -uuid 7db77cca-addd-4cf4-f7cd-5399d217543e -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/uarray_server.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot c -drive 
file=/virtualmachines/myserver.raw,if=none,id=drive-virtio-disk0,boot=on,format=raw -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=52,id=hostnet0,vhost=on,vhostfd=54 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:69:94:91:65,bus=pci.0,addr=0x3 -usb -vnc 127.0.0.1:3 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 What's the problem? Thank you -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052

--- Comment #27 from Marcelo Tosatti 2011-02-10 16:57:59 ---
Created an attachment (id=47152)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=47152)
kvm-debug-spte-gfn-2.patch
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052

--- Comment #26 from Marcelo Tosatti 2011-02-10 16:57:17 ---
Nicolas,

New debug patch attached. Please try it on top of clean 2.6.37.
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 03:20 PM, Gleb Natapov wrote:
> Judging by how well all previous conversions went, we will end up with one more way of creating devices. One legacy, another qdev and your new one. And what is the problem with qdev again (not that I am a big qdev fan)?

We've really been arguing about probably the most minor aspect of the problem with qdev. All I'm really saying is that we shouldn't tie device construction to a factory interface as we do with qdev. That simply means that we should be able to do:

RTC *rtc_create(arg1, arg2, arg2);

And that a separate piece of code decides which devices are exposed through -device or device_add. Which devices are exposed is really a minor detail.

That said, qdev has a number of significant limitations in my mind. The first is that the only relationship between devices is through the BusState interface. I don't think we should even try to have a generic bus model. When you look at how badly broken PCI hotplug currently is in qdev, I think this is symptomatic of this.

There's also no way in qdev to really have polymorphism. Interfaces really aren't meaningful in qdev, so you have things like PCIDevice where some methods are stored in the object instead of the class dispatch table, and you have overuse of static class members. And it's all unrelated to VMState.

And this is just the basic mechanisms of qdev. The actual implementation is worse. The use of qemu_irq as gpio in the base class and overuse of SystemBus is really quite insane.

And so far, the use of qdev has been entirely superficial. Devices still don't make use of bus level interfaces to do I/O, so we don't have any better componentization than we did before qdev.

> The fact that there is not enough interest to convert all devices to it?

I don't think there is any device that has been improved by qdev. -device is a nice feature, but it could have been implemented without qdev.

Regards,
Anthony Liguori

> How will the new way of doing things solve this?
>
> Just to be clear, I do not have a problem with not having the ability to compose x86 without pit or kbd controller. Basic things like RTC, pit, pic, ioapic, dma, kbd should be created unconditionally as part of the x86 pc machine. But IMHO you are trying to take things to the other extreme.
>
> -- Gleb.
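The separation Anthony describes (a plain constructor on one side, a separate policy deciding what is reachable from -device on the other) could be sketched roughly like this. It is only an illustration and not QEMU code: the RTC fields, the rtc_create() arguments, the exposed_devices table and device_is_exposed() are all hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device type; the fields are invented for illustration. */
typedef struct RTC {
    int base_year;
    int irq;
} RTC;

/* Plain constructor: board code can call this directly, with no
 * factory or registry involved. */
RTC *rtc_create(int base_year, int irq)
{
    RTC *rtc = malloc(sizeof(*rtc));

    if (rtc == NULL) {
        return NULL;
    }
    rtc->base_year = base_year;
    rtc->irq = irq;
    return rtc;
}

/* Separate policy: the device names a user may request through
 * something like -device or device_add (hypothetical table). */
static const char *const exposed_devices[] = { "e1000", "virtio-blk" };

int device_is_exposed(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(exposed_devices) / sizeof(exposed_devices[0]); i++) {
        if (strcmp(exposed_devices[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}
```

In this sketch the machine code would call rtc_create() unconditionally, while only the names listed in the table are visible to the user.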
Re: EPT: Misconfiguration
On Wed, Jan 26, 2011 at 16:00, Ruben Kerkhof wrote: > On Wed, Jan 26, 2011 at 10:52, Avi Kivity wrote: >> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote: >>> >>> > When you say "suddenly", this was with no changes to software and >>> > hardware? >>> >>> The host software and hardware hasn't changed in the two months since >>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13. >>> >>> We host customer vms on it though, so virtual machines come and go. >>> Various operating systems, a mixture of Linux, FreeBSD and Windows >>> 2008 R2. We have other machines with the same config without these >>> problems though. >> >> Are those other machines running a similar workload? > > Yes, similar, or they're more heavily loaded. > > On this machine, about half of the 48GB memory was used for virtual machines. > >> The traces look awfully like bad hardware, though that can also be explained >> by random memory corruption due to a bug. > > Yeah, that's what I'm expecting. We already replaced the memory, next > step is to move the disks over to another server to make sure it's not > the board or cpu's. > >>> This time I have a few different messages though: >>> >>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault: >>> [#1] SMP >>> >>> RSI: RDI: 1603a07305001568 >>> >>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46 >>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d >>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 4f 08 0f 94 c0 84 >>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb >> >> lock decl 0x8(%rdi) >> >> %rdi is completely crap, looks like corruption again. Strangely, it is >> similar to the bad spte from the previous trace: 0x1603a0730500d277. 
The >> upper 48 bits are identical, the lower 16 bits are different.: >>> >>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted >>> page table at address 7f37b37ff000 >>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD >>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067 >> >> Here are those magic 48 bits again, in the PTE entry. >>> >>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration. >>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038 >>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4 >>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3 >>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2 >>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1 >> >> Again. >> >>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in >>> process qemu-kvm pte:1603a0730500d067 pmd:61059f067 >> >> Again. >> >> However, these all came from a single boot, yes? > > Correct. > >> If so they can be the same >> corruption. Please collect more traces, with reboots in between. 
This machine has been running for a week without problems, but then we started to get the following oopses again: 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle kernel paging request at ea71929180e0 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP: [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops: [#1] SMP 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded: scsi_wait_scan] 2011-02-06T19:45:35.31+01:00 phy005 kernel: 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm: qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP: 0010:[] [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP: 0018:88060b9bda78 EFLAGS: 00010082 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0 RBX: 3000 RCX: 0005 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40 RSI: 7fe54e3ff000 RDI: 1603a07305004067 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98 R08: 880b94384560 R09: 88060b9bdb44 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8 R11: ea00 R12: 0205 2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff R14: 0005 R15: 2011-02-06T19:45:35.55+01:00 phy0
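Avi's observation that the corrupted values share their upper 48 bits can be checked mechanically. The sketch below uses only the values quoted in the traces in this thread; the array and helper names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Corrupted values as they appear in the traces quoted in this thread. */
static const uint64_t bad_values[] = {
    0x1603a0730500d277ULL, /* bad spte from the previous trace */
    0x1603a07305001568ULL, /* RDI in the general protection fault */
    0x1603a0730500e067ULL, /* PTE in "Corrupted page table" */
    0x1603a07305006277ULL, /* level 1 spte in the EPT misconfiguration */
    0x1603a0730500d067ULL, /* pte in "Bad page map" */
    0x1603a07305004067ULL, /* RDI in the 2011-02-06 oops */
};

#define NUM_BAD_VALUES (sizeof(bad_values) / sizeof(bad_values[0]))

/* Returns 1 if every value has the same upper 48 bits as the first one. */
int same_upper_48(const uint64_t *v, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++) {
        if ((v[i] >> 16) != (v[0] >> 16)) {
            return 0;
        }
    }
    return 1;
}

/* Returns the lower 16 bits, the only part that differs between values. */
unsigned low_16(uint64_t v)
{
    return (unsigned)(v & 0xffff);
}
```

All six values, including the one from the new oops after the reboot, share the prefix 0x1603a0730500.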
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 2011-02-10 15:47, Avi Kivity wrote: > On 02/10/2011 04:34 PM, Jan Kiszka wrote: >> On 2011-02-10 15:26, Avi Kivity wrote: >>> On 02/10/2011 03:47 PM, Jan Kiszka wrote: >> >> Accept for mmu_shrink, which is write but not delete, thus works >> without >> that slow synchronize_rcu. > > I don't really see how you can implement list_move_rcu(), it has to be > atomic or other users will see a partial vm_list. Right, even if we synchronized that step cleanly, rcu-protected users could miss the moving vm during concurrent list walks. What about using a separate mutex for protecting vm_list instead? Unless I missed some detail, mmu_shrink should allow blocking. >>> >>> What else does kvm_lock protect? >> >> Someone tried to write a locking.txt and stated that it's also >> protecting enabling/disabling hardware virtualization. But that guy may >> have overlooked something. > > Right. I guess splitting that lock makes sense. > >>> >>> I think we could simply reduce the amount of time we hold kvm_lock. >>> Pick a vm, ref it, list_move_tail(), unlock, then do the actual >>> shrinking. Of course taking a ref must be done carefully, we might >>> already be in kvm_destroy_vm() at that time. >>> >> >> Plain mutex held across the whole mmu_shrink loop is still simpler and >> should be sufficient - unless we also have to deal with scalability >> issues if that handler is able to run concurrently. But based on how we >> were using kvm_lock so far... > > I don't think a mutex would work for kvmclock_cpufreq_notifier(). At > the very least, we'd need a preempt_disable() there. At the worst, the > notifier won't like sleeping. Damn, there was that other user. Yes, this means we need to break the lock in mmu_shrink. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
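The "pick a vm, ref it, list_move_tail(), unlock, then do the actual shrinking" idea from this thread can be modelled in a few lines of plain C. This is a single-threaded toy, not kernel code: struct vm, struct vm_list, the integer "lock" and shrink_one() are all hypothetical, and a real implementation would presumably need an atomic "increment unless zero" style reference grab to cope with kvm_destroy_vm() racing against it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; all names are hypothetical. */
struct vm {
    int refcount;      /* 0 models a VM that is already being destroyed */
    int shadow_pages;  /* pages the shrinker could free */
    struct vm *next;
};

struct vm_list {
    int locked;        /* models the list lock in this single-threaded toy */
    struct vm *head;
    struct vm *tail;
};

static void list_lock(struct vm_list *l)   { assert(!l->locked); l->locked = 1; }
static void list_unlock(struct vm_list *l) { assert(l->locked);  l->locked = 0; }

void vm_list_add(struct vm_list *l, struct vm *vm)
{
    vm->next = NULL;
    if (l->tail != NULL) {
        l->tail->next = vm;
    } else {
        l->head = vm;
    }
    l->tail = vm;
}

/* Shrink one VM: the lock covers only picking, referencing and rotating. */
int shrink_one(struct vm_list *l, int nr_to_free)
{
    struct vm *prev = NULL;
    struct vm *vm;
    int freed;

    list_lock(l);
    /* Skip VMs whose refcount already dropped to zero. */
    for (vm = l->head; vm != NULL; prev = vm, vm = vm->next) {
        if (vm->refcount > 0) {
            break;
        }
    }
    if (vm == NULL) {
        list_unlock(l);
        return 0;
    }
    vm->refcount++;                 /* keep the VM alive after unlocking */
    if (vm != l->tail) {            /* rotate it to the tail of the list */
        if (prev != NULL) {
            prev->next = vm->next;
        } else {
            l->head = vm->next;
        }
        vm->next = NULL;
        l->tail->next = vm;
        l->tail = vm;
    }
    list_unlock(l);

    /* The expensive part runs without the list lock held. */
    freed = vm->shadow_pages < nr_to_free ? vm->shadow_pages : nr_to_free;
    vm->shadow_pages -= freed;

    vm->refcount--;                 /* drop the reference taken above */
    return freed;
}
```

Because the lock is dropped before any pages are freed, a non-sleeping user of the list would only ever wait for the short pick-and-rotate section.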
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/10/2011 04:34 PM, Jan Kiszka wrote: On 2011-02-10 15:26, Avi Kivity wrote: > On 02/10/2011 03:47 PM, Jan Kiszka wrote: Accept for mmu_shrink, which is write but not delete, thus works without that slow synchronize_rcu. >>> >>> I don't really see how you can implement list_move_rcu(), it has to be >>> atomic or other users will see a partial vm_list. >> >> Right, even if we synchronized that step cleanly, rcu-protected users >> could miss the moving vm during concurrent list walks. >> >> What about using a separate mutex for protecting vm_list instead? >> Unless I missed some detail, mmu_shrink should allow blocking. > > What else does kvm_lock protect? Someone tried to write a locking.txt and stated that it's also protecting enabling/disabling hardware virtualization. But that guy may have overlooked something. Right. I guess splitting that lock makes sense. > > I think we could simply reduce the amount of time we hold kvm_lock. > Pick a vm, ref it, list_move_tail(), unlock, then do the actual > shrinking. Of course taking a ref must be done carefully, we might > already be in kvm_destroy_vm() at that time. > Plain mutex held across the whole mmu_shrink loop is still simpler and should be sufficient - unless we also have to deal with scalability issues if that handler is able to run concurrently. But based on how we were using kvm_lock so far... I don't think a mutex would work for kvmclock_cpufreq_notifier(). At the very least, we'd need a preempt_disable() there. At the worst, the notifier won't like sleeping. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 2011-02-10 15:26, Avi Kivity wrote: > On 02/10/2011 03:47 PM, Jan Kiszka wrote: Accept for mmu_shrink, which is write but not delete, thus works without that slow synchronize_rcu. >>> >>> I don't really see how you can implement list_move_rcu(), it has to be >>> atomic or other users will see a partial vm_list. >> >> Right, even if we synchronized that step cleanly, rcu-protected users >> could miss the moving vm during concurrent list walks. >> >> What about using a separate mutex for protecting vm_list instead? >> Unless I missed some detail, mmu_shrink should allow blocking. > > What else does kvm_lock protect? Someone tried to write a locking.txt and stated that it's also protecting enabling/disabling hardware virtualization. But that guy may have overlooked something. > > I think we could simply reduce the amount of time we hold kvm_lock. > Pick a vm, ref it, list_move_tail(), unlock, then do the actual > shrinking. Of course taking a ref must be done carefully, we might > already be in kvm_destroy_vm() at that time. > Plain mutex held across the whole mmu_shrink loop is still simpler and should be sufficient - unless we also have to deal with scalability issues if that handler is able to run concurrently. But based on how we were using kvm_lock so far... Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/10/2011 03:47 PM, Jan Kiszka wrote:
>>> Except for mmu_shrink, which is write but not delete, thus works without
>>> that slow synchronize_rcu.
>>
>> I don't really see how you can implement list_move_rcu(), it has to be
>> atomic or other users will see a partial vm_list.
>
> Right, even if we synchronized that step cleanly, rcu-protected users
> could miss the moving vm during concurrent list walks.
>
> What about using a separate mutex for protecting vm_list instead?
> Unless I missed some detail, mmu_shrink should allow blocking.

What else does kvm_lock protect?

I think we could simply reduce the amount of time we hold kvm_lock. Pick a vm, ref it, list_move_tail(), unlock, then do the actual shrinking. Of course taking a ref must be done carefully, we might already be in kvm_destroy_vm() at that time.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 03:04:28PM +0100, Anthony Liguori wrote: > On 02/10/2011 02:27 PM, Gleb Natapov wrote: > >I don't care how command line will look like, but I do not see how you > >will support ide=off without device composition unless you put ad-hoc > >ifs all over your i440fx device code. > > Yes, in the piix3 device code, the ide property would trigger an if(). > > BTW, I'm extremely sceptical that you really do have machines w/o > IDE at all. Even the servers we ship with only SAS or SCSI support > still have an integrated IDE controller. > > Since most servers are built from the same chipset design that has > IDE, I don't really see how you could build a modern system without > IDE. > Well, this may be true. But since I can't find IDE (or ATA) nor in lspci neither in dmesg does it really matter that silicon that implement IDE functionality is present somewhere inside the box? > >>And that's okay, but the base modelling ought to follow rea > >>hardware closely with deviations being the exception. > >> > >You keep saying this without explaining why. But with device composition > >you will have exactly that, you will compose real chipsets using config > >files, not code. > > Yeah, that's been the direction we've been going in since qdev was > introduced. I'm now convinced that this is overly ambitious. By > simply reducing the scope of conversion, we get 99% of the benefit > with 10% of the effort. Seems like a no brainer to me. > Jugging by how well all previous conversion went we will end up with one more way of creating devices. One legacy, another qdev and your new one. And what is the problem with qdev again (not that I am a big qdev fan)? The fact that there is no enough interest to convert all devices to it? How new way of doing things will solve this? Just to be clear I do not have problem with not having ability to compose x86 without pit or kbd controller. 
Basic things like RTC, pit, pic, ioapic, dma, kbd should be created unconditionally as part of x86 pc machine. But IMHO you are trying to take things to other extreme. -- Gleb.
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052 --- Comment #25 from prochazka 2011-02-10 14:16:51 --- cmdline /usr/local/bin/qemu -name Soins_003 -vga std -net tap,vlan=0,name=interne,ifname=vmtap5 -net nic,vlan=0,macaddr=ac:de:48:1d:e8:2c,model=e1000 -cpu host -localtime -usb -usbdevice tablet -vnc 10.98.98.19:120 -monitor tcp:127.0.0.1:10120,server,nowait,nodelay -m 512 -pidfile /var/run/qemu/Soins_003.pid -net vde,port=70,vlan=5,sock=/tmpsafe/neoswitch_bridge,name=externe -net nic,vlan=5,macaddr=ac:de:48:8c:cc:e0,model=e1000 -rtc base=localtime -drive file=/mnt/vdisk/images/VM-Soins_003.1296578833.637768,index=0,media=disk,snapshot=on,cache=unsafe -drive file=/swapfile-guest/swap1,if=ide,index=1,media=disk,snapshot=on,boot=off -fda fat:floppy:/mnt/vdisk/diskconf/Soins_003 KSM and transparent hugepage is activated on this kernel. Regards, Nicolas -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052 --- Comment #24 from prochazka 2011-02-10 14:14:25 --- I can now reproduce it under this circonstance on different server - Windows XP guest SP2 : guest OS seems to be important, other XP sp3 works fine - connect with vnc to this guest and connect with RDP on other ( 5 or 6 guests ) . kernel : 2.6.37 qemu-kvm with hugepages option for #18 #19 . /usr/local/bin/qemu -name XP_013 -vga std -net tap,vlan=0,name=interne,ifname=vmtap28 -net nic,vlan=0,macaddr=ac:de:48:88:e2:92,model=e1000 -cpu host -localtime -usb -usbdevice tablet -vnc 10.98.98.13:135 -monitor tcp:127.0.0.1:10135,server,nowait,nodelay -m 512 -pidfile /var/run/qemu/XP_013.pid -net vde,port=85,vlan=5,sock=/tmpsafe/neoswitch_bridge,name=externe -net nic,vlan=5,macaddr=ac:de:48:7b:9e:ec,model=e1000 -mem-prealloc -mem-path /hugepages -rtc base=localtime -drive file=/mnt/vdisk/images/VM-XP_013.1297326902.381783,index=0,media=disk,snapshot=on,cache=unsafe -drive file=/swapfile-guest/swap1,if=ide,index=1,media=disk,snapshot=on,boot=off -fda fat:floppy:/mnt/vdisk/diskconf/XP_013 Last Kernel that works reliably : 2.6.34 ( I do not test with kernel between 2.6.34 and 2.6.37 ) I just reproduce bug, with kernel 2.6.38rc4 + without hugepage ( kvm module from 2.6.38rc4 tree) general protection fault: [#4] SMP last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map CPU 0 Modules linked in: kvm_intel kvm bnx2 Pid: 15886, comm: qemu Tainted: G D 2.6.38-rc4 #1 0P010H/PowerEdge M600 RIP: 0010:[] [] drop_spte+0xd5/0x1f0 [kvm] RSP: 0018:8804d6cd5b88 EFLAGS: 00010246 RAX: c9001a2d2ff8 RBX: 88049dbc7c00 RCX: 880529dd6460 RDX: RSI: 880529dd6460 RDI: 8807e30ba000 RBP: 8804d6cd5b98 R08: R09: dead00200200 R10: dead00100100 R11: R12: 8804d6efc000 R13: 8804d6cd5c08 R14: R15: 88049dbc7c00 FS: 7f9b43455740() GS:8800bfc0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 056ab000 CR3: 0004d6cfd000 CR4: 000426e0 DR0: 00a0 DR1: DR2: 0003 DR3: 00b0 DR6: 0ff0 DR7: 0400 Process 
qemu (pid: 15886, threadinfo 8804d6cd4000, task 88050f22c000) Stack: 8804a5027f00 8804d6efc000 8804d6cd5bf8 a0031e7f fff5 8804d6cd5be8 0180 8804d6efc000 8804a50276e0 8804d6cd5c08 Call Trace: [] kvm_mmu_prepare_zap_page+0x8f/0x2f0 [kvm] [] kvm_mmu_zap_all+0x4a/0x90 [kvm] [] kvm_arch_flush_shadow+0x16/0x30 [kvm] [] __kvm_set_memory_region+0x2c3/0x810 [kvm] [] ? hrtimer_start+0x18/0x20 [] ? create_pit_timer+0xb7/0xd0 [kvm] [] ? pit_load_count+0xd3/0x120 [kvm] [] ? kvm_pit_load_count+0x22/0x60 [kvm] [] kvm_set_memory_region+0x43/0x70 [kvm] [] kvm_vm_ioctl_set_memory_region+0x1d/0x30 [kvm] [] kvm_vm_ioctl+0x1e5/0x3e0 [kvm] [] do_vfs_ioctl+0xa3/0x540 [] ? sys_futex+0xce/0x170 [] sys_ioctl+0x4f/0x80 [] system_call_fastpath+0x16/0x1b Code: 50 38 48 63 f6 48 8b 34 f2 0f b6 50 28 83 e2 0f eb b8 0f 1f 40 00 48 83 e6 fe 0f 84 d9 00 00 00 45 31 c0 0f 1f 00 48 89 f1 31 d2 <48> 8b 39 48 85 ff 74 10 48 39 fb 74 26 ff c2 48 83 c1 08 83 fa RIP [] drop_spte+0xd5/0x1f0 [kvm] RSP ---[ end trace a0f93d7b4fb495a7 ]--- general protection fault: [#5] SMP last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map CPU 5 Modules linked in: kvm_intel kvm bnx2 Pid: 30332, comm: bash Tainted: G D 2.6.38-rc4 #1 0P010H/PowerEdge M600 RIP: 0010:[] [] dup_fd+0x168/0x300 RSP: 0018:8805fbd03da0 EFLAGS: 00010202 RAX: 07f8 RBX: 8807e94179c0 RCX: bfff RDX: 8807e3ef5480 RSI: 00ff RDI: 0800 RBP: 8805fbd03e00 R08: 8804f2c20280 R09: 0003 R10: 0001 R11: 4000 R12: 8804bf071000 R13: 8804f2c20540 R14: 8807dac23800 R15: 0100 FS: 7fb0a6a11700() GS:8800bfd4() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 00bf3000 CR3: 0007116cf000 CR4: 000426e0 DR0: 0003 DR1: 00b0 DR2: 0001 DR3: DR6: 0ff0 DR7: 0400 Process bash (pid: 30332, threadinfo 8805fbd02000, task 880715cd1000) Stack: 88050005 00010282 0020 8806fa7dca40 8807feaceec8 8807feacef40 7fb0a6a119d0 8807db5f7000 01200011 7fb0a6a119d0 Call Trace: [] copy_process+0xa02/0x1200 [] do_fork+0x63/0x340 [] ? _raw_spin_lock+0xe/0x20 [] ? fd_install+0x67/0x90 [] ? 
do_pipe_flags+0xb0/0x100 [] sys_clone+0x28/0x30 [] stub_clone+0x13/0x20 [] ? system_call_fastpath+0x16/0x1b Code: 4c 89 c2 e8 1b 35 23 00 45 85 ff 74 77
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 02:27 PM, Gleb Natapov wrote:
> I don't care how command line will look like, but I do not see how you will support ide=off without device composition unless you put ad-hoc ifs all over your i440fx device code.

Yes, in the piix3 device code, the ide property would trigger an if().

BTW, I'm extremely sceptical that you really do have machines w/o IDE at all. Even the servers we ship with only SAS or SCSI support still have an integrated IDE controller.

Since most servers are built from the same chipset design that has IDE, I don't really see how you could build a modern system without IDE.

>> And that's okay, but the base modelling ought to follow real hardware closely with deviations being the exception.
>
> You keep saying this without explaining why. But with device composition you will have exactly that, you will compose real chipsets using config files, not code.

Yeah, that's been the direction we've been going in since qdev was introduced. I'm now convinced that this is overly ambitious. By simply reducing the scope of conversion, we get 99% of the benefit with 10% of the effort. Seems like a no brainer to me.

Regards,
Anthony Liguori

> -- Gleb.
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 02:00 PM, Avi Kivity wrote:
> On 02/10/2011 02:51 PM, Anthony Liguori wrote:
>> On 02/10/2011 12:13 PM, Gleb Natapov wrote:
>>> Which spec? Even in this discussion we completely mixed different things. 440FX is not a chipset.
>>
>> Yes, it is. It's a single silicon package with a defined pinout. If you don't believe me, re-read the spec.
>>
>> It's a MCM with the PIIX3 being internally connected. The connection between the i440fx and PIIX3 happens to be PCI but that's not always the case. Sometimes it's a proprietary bus.
>
> Aren't they two distinct chips, together comprising the chip-set?
>
> One (the northbridge) converts the system bus to PCI + some extra wires, the other (southbridge) bridges PCI to ISA and contains some embedded ISA devices. IIRC there are some wires between them that are not PCI.

Yes, you are correct. So I can understand an argument for:

-device i440fx,id=pmc -device piix3,chipset=pmc

Or something like that.

Regards,
Anthony Liguori
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052 --- Comment #23 from Marcelo Tosatti 2011-02-10 13:50:08 --- Nicolas, On comment #2 you mention the bug could not be reproduced, but in comment #3 you report it without hugepages enabled. So, were you using hugepages or not, in the reports #18 and #19? Another thing, what is the last kernel version that works reliably under this workload?
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 2011-02-10 14:19, Avi Kivity wrote: > On 02/10/2011 03:14 PM, Jan Kiszka wrote: >> On 2011-02-10 13:57, Avi Kivity wrote: >>> On 02/10/2011 02:56 PM, Avi Kivity wrote: > What's the benefit? The downside is a bit more complexity as you need an > additional callback handler. synchronize_rcu() can be very slow (its a systemwide operation), and mmu_shrink() can be called often on a loaded system. >>> >>> In fact this just shows that vm_list is not a good candidate for rcu; >>> rcu is useful where most operations are reads, but if we discount stats, >>> most operations on vm_list are going to be writes. >> >> Accept for mmu_shrink, which is write but not delete, thus works without >> that slow synchronize_rcu. > > I don't really see how you can implement list_move_rcu(), it has to be > atomic or other users will see a partial vm_list. Right, even if we synchronized that step cleanly, rcu-protected users could miss the moving vm during concurrent list walks. What about using a separate mutex for protecting vm_list instead? Unless I missed some detail, mmu_shrink should allow blocking. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On Thu, Feb 10, 2011 at 01:03:53PM +0200, Avi Kivity wrote: > On 02/10/2011 12:57 PM, Michael Goldish wrote: > >> > >> I can't easily think of a case where this might cause confusion. The > >> purpose of this is to allow people to write: > >> > >> only qcow2..raw..rtl8139 > >> > >> without having to remember the order in which those were defined in > >> tests_base.cfg. > > > >Sorry, I meant something like > > > >only qcow2..hugepages..rtl8139 > > > >Obviously qcow2 and raw can't coexist. > > The config files describe a cartesian product, in which order matters. Mathematically speaking, the ordering in the result is different, but BA and AB are often equivalent for the user. In many situations, people don't care in which order (as an example) "qcow" and "ide" are defined on the base config, they just want to exclude the combination of "qcow" and "ide". > > [A B C] x [1 2] generates [A1 A2 B1 B2 C1 C2]; no confusion here if > you specify A..1 > > however > > [A B C] x [A B] generates [AA AB BA BB CA CB]; A..B is ambiguous If you do the above and reuse keywords, "A" is also ambiguous, "B" is also ambiguous. "A..B" being ambiguous is a consequence of "A" and "B" being ambiguous. If you don't want to be ambiguous, just use "A.B" or "B.A". > > we might require that keywords be unique. I wouldn't be against that. At least for the use cases I see, people have been assuming that keywords are unique on most "only" and "no" statements. -- Eduardo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
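The ambiguity Avi points out can be made concrete with a small sketch. It is not the kvm_config.py implementation: it assumes, as Eduardo's reply implies, that "A..B" matches any variant containing both keywords in either order, while "A.B" requires that exact order. The struct and function names are hypothetical, and variants are limited to two keywords to keep the example short.

```c
#include <assert.h>
#include <string.h>

#define MAX_VARIANTS 16

/* A generated variant made of two keywords, e.g. "A" and "1" for A1. */
struct variant {
    const char *first;
    const char *second;
};

/* Cartesian product of two keyword lists; the left list varies slowest,
 * so [A B C] x [1 2] yields A1 A2 B1 B2 C1 C2. */
size_t product(const char **xs, size_t nx, const char **ys, size_t ny,
               struct variant *out)
{
    size_t i, j, n = 0;

    for (i = 0; i < nx; i++) {
        for (j = 0; j < ny; j++) {
            out[n].first = xs[i];
            out[n].second = ys[j];
            n++;
        }
    }
    return n;
}

/* "a.b": a followed by b, order matters. */
int match_ordered(const struct variant *v, const char *a, const char *b)
{
    return strcmp(v->first, a) == 0 && strcmp(v->second, b) == 0;
}

/* "a..b" under the assumed order-independent reading. */
int match_unordered(const struct variant *v, const char *a, const char *b)
{
    return match_ordered(v, a, b) || match_ordered(v, b, a);
}

/* Number of variants selected by "a..b". */
size_t count_unordered(const struct variant *vs, size_t n,
                       const char *a, const char *b)
{
    size_t i, hits = 0;

    for (i = 0; i < n; i++) {
        hits += (size_t)match_unordered(&vs[i], a, b);
    }
    return hits;
}
```

With unique keywords, "A..1" selects exactly one variant; once "A" and "B" appear in both lists, "A..B" selects both AB and BA, which is the ambiguity under discussion.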
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052 --- Comment #22 from Marcelo Tosatti 2011-02-10 13:36:25 --- Problem description: Present spte is dropped while syncing 32-bit level 1 shadow page. But sp->gfns[index] contains uninitialized value (0 or f001), so gfn->rmap conversion in rmap_remove fails. However, debug patch from comment #18 verifies that on present spte instantiation, via mmu_set_spte, sp->gfns[] is initialized correctly. From bug instances of comments 19 and 20, index == 511.
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 03:00:05PM +0200, Avi Kivity wrote: > On 02/10/2011 02:51 PM, Anthony Liguori wrote: > >On 02/10/2011 12:13 PM, Gleb Natapov wrote: > >> > >>Which spec? Even in this discussion we completely mixed different > >>things. 440FX is not a chipset. > > > >Yes, it is. It's a single silicon package with a defined pinout. > >If you don't believe me, re-read the spec. > > > >It's a MCM with the PIIX3 being internally connected. The > >connection between the i440fx and PIIX3 happens to be PCI but > >that's not always the case. Sometimes it's a proprietary bus. > > Aren't they two distinct chips, together comprising the chip-set? > > One (the northbridge) converts the system bus to PCI + some extra > wires, the other (southbridge) bridges PCI to ISA and contains some > embedded ISA devices. IIRC there are some wires between them that > are not PCI. > Yeah, 440fx is probably northbridge and PIIX3 southbridge. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 01:51:14PM +0100, Anthony Liguori wrote: > On 02/10/2011 12:13 PM, Gleb Natapov wrote: > > > >Which spec? Even in this discussion we completely mixed different > >things. 440FX is not a chipset. > > Yes, it is. It's a single silicon package with a defined pinout. > If you don't believe me, re-read the spec. > > It's a MCM with the PIIX3 being internally connected. The > connection between the i440fx and PIIX3 happens to be PCI but that's > not always the case. Sometimes it's a proprietary bus. > Which one? 29054901.pdf describes memory controller and PCI host bridge only. > >Again you probably mean PIIX3. Even then removing unused ide will free > >one more PCI slot for my cool virtio disk array. The things is, from > >code point of view, it does not cost you extra to allow composition of > >ide since it is just a regular PCI device and we need to support composing > >those anyway. > > If this is useful, and it doesn't break guests, you can always do > -device i440fx,ide=off. However, it's an exception where we're > deviating from how hardware works. > I don't care how command line will look like, but I do not see how you will support ide=off without device composition unless you put ad-hoc ifs all over your i440fx device code. And I don't understand what do you mean by saying that this is not how hardware works. Presence or absence of PCI device does not change how hardware works. > And that's okay, but the base modelling ought to follow real > hardware closely with deviations being the exception. > You keep saying this without explaining why. But with device composition you will have exactly that, you will compose real chipsets using config files, not code. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/10/2011 03:14 PM, Jan Kiszka wrote:
> On 2011-02-10 13:57, Avi Kivity wrote:
>> On 02/10/2011 02:56 PM, Avi Kivity wrote:
>>>> What's the benefit? The downside is a bit more complexity as you need an
>>>> additional callback handler.
>>>
>>> synchronize_rcu() can be very slow (it's a systemwide operation), and
>>> mmu_shrink() can be called often on a loaded system.
>>
>> In fact this just shows that vm_list is not a good candidate for rcu;
>> rcu is useful where most operations are reads, but if we discount stats,
>> most operations on vm_list are going to be writes.
>
> Except for mmu_shrink, which is write but not delete, thus works without
> that slow synchronize_rcu.

I don't really see how you can implement list_move_rcu(), it has to be atomic or other users will see a partial vm_list.

> And I don't see the need for call_rcu in the vm deletion path.

synchronize_rcu() is fine for vm destruction.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 2011-02-10 13:57, Avi Kivity wrote:
> On 02/10/2011 02:56 PM, Avi Kivity wrote:
>>> What's the benefit? The downside is a bit more complexity as you need an
>>> additional callback handler.
>>
>> synchronize_rcu() can be very slow (it's a systemwide operation), and
>> mmu_shrink() can be called often on a loaded system.
>
> In fact this just shows that vm_list is not a good candidate for rcu;
> rcu is useful where most operations are reads, but if we discount stats,
> most operations on vm_list are going to be writes.

Except for mmu_shrink, which is write but not delete, thus works without that slow synchronize_rcu.

And I don't see the need for call_rcu in the vm deletion path.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 01:47:06PM +0100, Anthony Liguori wrote: > On 02/10/2011 11:49 AM, Gleb Natapov wrote: > >On Thu, Feb 10, 2011 at 11:19:48AM +0100, Anthony Liguori wrote: > >>On 02/10/2011 11:10 AM, Gleb Natapov wrote: > >>>On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote: > On 02/10/2011 10:07 AM, Gleb Natapov wrote: > >So what if it is easier, it doesn't mean it is correct thing to do. > If we spend the next 10 years trying to do the "correct thing" for > some arbitrary definition of correct, that's not terribly useful. > >>>Changing direction by 180 every 2 years even less useful. > >>If we think through what we are doing and have a coherent > >>architecture before changing direction, then we won't have this > >>problem. > >> > >I'd like to believe this :) > > > It's really simple actually. Let's do the least clever thing and > model how hardware actual works. Once we have that, we can try to > be better than real hardware (if it's possible). > >>>I think out understanding on how HW actually works is very different. > >>>You are placing to much value on were device resides physically, for me > >>>it is completely unimportant detail. Not worth even mentioning. > >>No, I place value on how things are modelled in the real world. > >Real world (physical HW) have consideration not relevant for our > >software emulation. Such as cost, physical dimension, power consumption > >and many other I am sure I missed. > > > >>There simply aren't PC's out there that lack an RTC so I have no > >>interest in jumping through hoops in QEMU to make it possible to do > >>this without modifying QEMU code. It might sound nice to a > >>developer but it's of absolutely no use to users. > >> > >RTC is not good example. HPET suppose to replace it (and PIT too). > > HPET's embed RTCs to provide support for legacy implementations. > This is extremely good example of where our modelling breaks down. 
> Take a close look at how the HPET and RTC emulations interact for an > example of why we'd be much better off just implementing an RTC > within an HPET. > Yes HPET can provide legacy RTC timer functionality. No I do not see why we should implement RTC withing HPET. In your model we should remove HPET code completely since HPET is not present in chipset emulated by QEMU. > > AFAIC > >there are PCs without RTC already. > > RTC also provides CMOS functionality and no PC can boot without > CMOS. So no, there's nothing we'd consider a PC today that doesn't > have an RTC. CMOS may be present even if RTC functionality is absent. Does EFI base machine still need CMOS though? > > > Good example would be PIC or IOAPIC > >device and then I would agree with you that it is not worth it to make > >it possible to create x86 machine without them from command line if it > >means extra complexity. But how have you jumped from this to "lets make usb > >mandatory"? > > USB is mandatory in the PIIX3 but the only significant difference > between the piix2 and piix3 is the addition of USB. > Consequentially, the main difference between an i440fx and i440bx is > the use of a piix2 vs. a piix3. So if you really want to create the > same PC we have today w/o USB, the right way to do it would be to > have: > > -device i440,model=fx // with USB > -device i440,model=bx // w/o USB Why not qemu -config piix2.cfg or qemu -config piix3.cfg? No need to make data into code. > > > >>No, we don't. It's possible to have an 'rtc=off' option but I'm > >>tremendously opposed to doing this. Arbitrary composition is not a > >>useful goal IMHO. > >IMHO is different. We should support composition where it makes sense. > >For PIC-less x86 it doesn't make it. For usb-less or even ide-less it > >does. > > The right way to do a USB-less PC is to have an option to create an i440bx. Why is this the right way? > > An IDE-less PC is a bit more difficult because IDE is really baked > into the concept of a PC. 
Chances are, there are more than a few > guests out there that would have issues from there being no IDE bus > present. > Non of my modern PCs have IDE. Many high end PC had SCSI instead of IDE in the past. If guest can't run without IDE you do not run it without IDE. > >>> So why do you like -device i440fx over what we have now? > >>Because I don't think tools like libvirt should be doing device > >>composition to create an i440fx-like chipset. I think the current > >>path we're on is pushing too much logic that belongs in QEMU into > >>the management stack. > >I can agree with that. But from this it doesn't follow that we should > >get rid of composition. We shouldn't push composition of common HW to > >libvirt. Looking at libvirt command line I do not think we do it though. > >Typical libvirt command line specifies disks, networks, usb, vga. How > >-device i440fx will simplified that? Well usb could be omitted (but not > >-usbdevice table), disks are not property of i440fx so they will stay, > >since
Re: [Qemu-devel] KVM call minutes for Feb 8
On 10 February 2011 12:23, Anthony Liguori wrote: > But something interacts with each processor and dispatches the I/O > operations in the address space, no? I can't believe there are 2^32 address > lines coming off of every arm chip that each device connects. Well, the AXI bus is kind of complicated and definitely not my area of expertise, but as I understand it you have an interconnect like a PL300 that effectively implements the "memory map" and defines where the slaves (devices) appear. But unless you actually want to be modelling bus transactions at a pretty low level this isn't really a visible difference from "these devices appear at this address in the memory map on this bus". (And there might be a bridge down from AXI to AHB or APB between the core and any particular device, but that's not programmer visible either.) > This relationship of how I/O fans out through various devices is important > because occasionally platforms do weird things during I/O fan out like > implement an IOMMU. If we don't model this I/O dispatch model within QEMU, > then it's extremely difficult to implement things like IOMMUs. Yes, but what does this have to do with chipsets and getting rid of machines? Getting I/O fanout through devices is a matter of modelling some sort of conceptual bus, and having the right APIs so you can do it fast in the common case and still allow IOMMUs and other interesting devices to intercept and change transactions. Any particular board might have to wire up the bus so it goes through an IOMMU, or it might not. Whether you want to bundle up a collection of devices and bus wiring and call it a "chipset" or not should be a matter of whether that makes sense and is a usefully reusable conceptual unit for whatever board you're modelling, I think. (For instance "an OMAP3" is an obvious reusable unit which any OMAP3-based board model is going to want to use.) 
Some of the I/O fanout and bus wiring might be internal to a qemu core model, for that matter -- for instance M profile ARM cores have several output buses which deal with different bits of the memory space (which are predefined as being for devices, or memory, or whatever), and the A9MP's internal timers and interrupt controller and so on ought to all be inside the core (at the moment we rely on all A9MP boards instantiating them as a separate device, which is ugly). -- PMM
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 02:51 PM, Anthony Liguori wrote:
> On 02/10/2011 12:13 PM, Gleb Natapov wrote:
>> Which spec? Even in this discussion we completely mixed different things.
>> 440FX is not a chipset.
>
> Yes, it is. It's a single silicon package with a defined pinout. If you
> don't believe me, re-read the spec. It's a MCM with the PIIX3 being
> internally connected. The connection between the i440fx and PIIX3 happens
> to be PCI but that's not always the case. Sometimes it's a proprietary bus.

Aren't they two distinct chips, together comprising the chip-set? One (the northbridge) converts the system bus to PCI + some extra wires, the other (southbridge) bridges PCI to ISA and contains some embedded ISA devices. IIRC there are some wires between them that are not PCI.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/10/2011 02:56 PM, Avi Kivity wrote:
>> What's the benefit? The downside is a bit more complexity as you need an
>> additional callback handler.
>
> synchronize_rcu() can be very slow (it's a systemwide operation), and
> mmu_shrink() can be called often on a loaded system.

In fact this just shows that vm_list is not a good candidate for rcu; rcu is useful where most operations are reads, but if we discount stats, most operations on vm_list are going to be writes.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/10/2011 02:45 PM, Jan Kiszka wrote:
>>>> There is no list_move_tail_rcu().
>>>
>>> ...specifically not for this one.
>>
>> Well, we can add one if needed (and if possible).
>
> I can have a look, at least at the lower hanging fruits.

Please keep rcu->parent in the loop.

>>>> Why check kvm->deleted? it's in the process of being torn down anyway,
>>>> it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.
>>>
>>> kvm_destroy_vm removes a vm from the list while mmu_shrink is running.
>>> Then mmu_shrink's list_move_tail will re-add that vm to the list tail
>>> again (unless already the removal in move_tail produces a crash).
>>
>> It's too subtle. Communication across threads with a variable needs
>> memory barriers (even though they're nops on x86) and documentation.
>
> The barriers are provided by this spin lock we acquire for testing or
> modifying deleted.

Right. I'm not thrilled with adding ->deleted though.

>> btw, not even sure if it's legal: you have a mutating call within an rcu
>> read critical section for the same object. If synchronize_rcu() were
>> called there, would it ever terminate?
>
> Why not? kvm_destroy_vm is not preventing blocking mmu_shrink to acquire
> the kvm_lock where we then find the vm deleted and release both kvm_lock
> and the rcu read "lock" afterwards.

synchronize_rcu() waits until all currently running rcu read-side critical sections are completed. But we are in the middle of one, which isn't going to complete until synchronize_rcu() returns.

>> (not that synchronize_rcu() is a good thing there, better do it with
>> call_rcu()).
>
> What's the benefit? The downside is a bit more complexity as you need an
> additional callback handler.

synchronize_rcu() can be very slow (it's a systemwide operation), and mmu_shrink() can be called often on a loaded system.
-- error compiling committee.c: too many arguments to function
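The termination argument in this message can be pictured with a toy simulation. This is only a sketch under simplifying assumptions and has nothing to do with how kernel RCU is actually implemented: readers are array slots, a "grace period" ends once every reader has left its critical section, and a reader that waits for the grace period from inside its own section can never leave it. All names (reader_state, toy_step, toy_synchronize) are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reader_state {
    READER_IDLE,                /* not in a read-side critical section */
    READER_IN_SECTION,          /* in a section, will leave on its own */
    READER_WAITING_IN_SECTION   /* in a section and blocked waiting for the grace period */
};

/* One simulated scheduling step: ordinary readers finish their sections. */
static void toy_step(enum reader_state *readers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (readers[i] == READER_IN_SECTION)
            readers[i] = READER_IDLE;
    /* A READER_WAITING_IN_SECTION entry makes no progress: it is blocked. */
}

/* The toy grace period is over once no reader is inside a section. */
static bool toy_grace_period_over(const enum reader_state *readers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (readers[i] != READER_IDLE)
            return false;
    return true;
}

/* Returns true if the toy grace period ends within max_steps steps. */
static bool toy_synchronize(enum reader_state *readers, size_t n, int max_steps)
{
    assert(max_steps >= 0);
    for (int step = 0; step < max_steps; step++) {
        if (toy_grace_period_over(readers, n))
            return true;
        toy_step(readers, n);
    }
    return toy_grace_period_over(readers, n);
}
```

In this model a wait issued from outside any read-side section finishes after one step, while a wait issued by a reader still inside its section never finishes, however many steps are allowed.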
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 12:13 PM, Gleb Natapov wrote:
> Which spec? Even in this discussion we completely mixed different things.
> 440FX is not a chipset.

Yes, it is. It's a single silicon package with a defined pinout. If you don't believe me, re-read the spec. It's a MCM with the PIIX3 being internally connected. The connection between the i440fx and PIIX3 happens to be PCI but that's not always the case. Sometimes it's a proprietary bus.

> Again you probably mean PIIX3. Even then removing unused ide will free one
> more PCI slot for my cool virtio disk array. The thing is, from code point
> of view, it does not cost you extra to allow composition of ide since it is
> just a regular PCI device and we need to support composing those anyway.

If this is useful, and it doesn't break guests, you can always do -device i440fx,ide=off. However, it's an exception where we're deviating from how hardware works. And that's okay, but the base modelling ought to follow real hardware closely with deviations being the exception.

Regards, Anthony Liguori
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 11:49 AM, Gleb Natapov wrote: On Thu, Feb 10, 2011 at 11:19:48AM +0100, Anthony Liguori wrote: On 02/10/2011 11:10 AM, Gleb Natapov wrote: On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote: On 02/10/2011 10:07 AM, Gleb Natapov wrote: So what if it is easier, it doesn't mean it is correct thing to do. If we spend the next 10 years trying to do the "correct thing" for some arbitrary definition of correct, that's not terribly useful. Changing direction by 180 every 2 years even less useful. If we think through what we are doing and have a coherent architecture before changing direction, then we won't have this problem. I'd like to believe this :) It's really simple actually. Let's do the least clever thing and model how hardware actual works. Once we have that, we can try to be better than real hardware (if it's possible). I think out understanding on how HW actually works is very different. You are placing to much value on were device resides physically, for me it is completely unimportant detail. Not worth even mentioning. No, I place value on how things are modelled in the real world. Real world (physical HW) have consideration not relevant for our software emulation. Such as cost, physical dimension, power consumption and many other I am sure I missed. There simply aren't PC's out there that lack an RTC so I have no interest in jumping through hoops in QEMU to make it possible to do this without modifying QEMU code. It might sound nice to a developer but it's of absolutely no use to users. RTC is not good example. HPET suppose to replace it (and PIT too). HPET's embed RTCs to provide support for legacy implementations. This is extremely good example of where our modelling breaks down. Take a close look at how the HPET and RTC emulations interact for an example of why we'd be much better off just implementing an RTC within an HPET. AFAIC there are PCs without RTC already. 
RTC also provides CMOS functionality and no PC can boot without CMOS. So no, there's nothing we'd consider a PC today that doesn't have an RTC. Good example would be PIC or IOAPIC device and then I would agree with you that it is not worth it to make it possible to create x86 machine without them from command line if it means extra complexity. But how have you jumped from this to "lets make usb mandatory"? USB is mandatory in the PIIX3 but the only significant difference between the piix2 and piix3 is the addition of USB. Consequentially, the main difference between an i440fx and i440bx is the use of a piix2 vs. a piix3. So if you really want to create the same PC we have today w/o USB, the right way to do it would be to have: -device i440,model=fx // with USB -device i440,model=bx // w/o USB No, we don't. It's possible to have an 'rtc=off' option but I'm tremendously opposed to doing this. Arbitrary composition is not a useful goal IMHO. IMHO is different. We should support composition where it makes sense. For PIC-less x86 it doesn't make it. For usb-less or even ide-less it does. The right way to do a USB-less PC is to have an option to create an i440bx. An IDE-less PC is a bit more difficult because IDE is really baked into the concept of a PC. Chances are, there are more than a few guests out there that would have issues from there being no IDE bus present. So why do you like -device i440fx over what we have now? Because I don't think tools like libvirt should be doing device composition to create an i440fx-like chipset. I think the current path we're on is pushing too much logic that belongs in QEMU into the management stack. I can agree with that. But from this it doesn't follow that we should get rid of composition. We shouldn't push composition of common HW to libvirt. Looking at libvirt command line I do not think we do it though. Typical libvirt command line specifies disks, networks, usb, vga. How -device i440fx will simplified that? 
Well usb could be omitted (but not -usbdevice table), disks are not property of i440fx so they will stay, since user may want to use virtio controller (which is not part of i440fx) this should stay too. Network obviously will have to be specified by libvirt too, vga may go to i440fx, but since libvirt supports qxl we will have to have a way to disable default vga and enable qxl instead. So will we really simplify libvirt's life by introducing -device i440fx? libvirt also uses -no-defaults which prevents much of the PC's machine init from creating anything but stuff that really belongs in the main chipset. But I bet if you asked 5 different QEMU developers what belongs in machine init and what the role of -no-defaults is, you'd get different answers. OTOH, skipping any notion of machine and explicitly creating a chipset provides a very consistent
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 2011-02-10 13:34, Avi Kivity wrote: > On 02/10/2011 01:31 PM, Jan Kiszka wrote: >>> @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask) spin_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, idx); } - if (kvm_freed) - list_move_tail(&kvm_freed->vm_list,&vm_list); + if (kvm_freed) { + raw_spin_lock(&kvm_lock); + if (!kvm->deleted) + list_move_tail(&kvm_freed->vm_list,&vm_list); >>> >>> There is no list_move_tail_rcu(). >> >> ...specifically not for this one. > > Well, we can add one if needed (and if possible). I can have a look, at least at the lower hanging fruits. > >>> >>> Why check kvm->deleted? it's in the process of being torn down anyway, >>> it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger. >> >> kvm_destroy_vm removes a vm from the list while mmu_shrink is running. >> Then mmu_shrink's list_move_tail will re-add that vm to the list tail >> again (unless already the removal in move_tail produces a crash). > > It's too subtle. Communication across threads with a variable needs > memory barriers (even though they're nops on x86) and documentation. The barriers are provided by this spin lock we acquire for testing are modifying deleted. > > btw, not even sure if it's legal: you have a mutating call within an rcu > read critical section for the same object. If synchronize_rcu() were > called there, would it ever terminate? Why not? kvm_destroy_vm is not preventing blocking mmu_shrink to acquire the kvm_lock where we then find the vm deleted and release both kvm_lock and the rcu read "lock" afterwards. > > (not that synchronize_rcu() is a good thing there, better do it with > call_rcu()). What's the benefit? The downside is a bit more complexity as you need an additional callback handler. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On Thu, 2011-02-10 at 09:18 +0800, Amos Kong wrote:
> On Wed, Feb 09, 2011 at 11:28:56AM +0200, Avi Kivity wrote:
> > On 02/09/2011 03:50 AM, Michael Goldish wrote:
> > > This is a reimplementation of the dict generator. It is much faster than
> > > the current implementation and uses a very small amount of memory. Running
> > > time and memory usage scale polynomially with the number of defined
> > > variants, compared to exponentially in the current implementation.
> > >
> > > Instead of regular expressions in the filters, the following syntax is
> > > used:
> > >
> > > ,  means OR
> > > .. means AND
> > > .  means IMMEDIATELY-FOLLOWED-BY
> > >
> > > Example:
> > >
> > > only qcow2..Fedora.14, RHEL.6..raw..boot, smp2..qcow2..migrate..ide
> >
> > Is it not possible to keep the old syntax? Breaking people's
> > scripts is bad.
>
> we only need to convert the config file, it's not too complex

Yes, the benefits of the new format outweigh the inconveniences. As for my opinion on the operator, .. is sufficiently clear and expressive to do most of the stuff we need to do with configuration anyway.
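The three operators quoted above can be sketched as a small matcher. This is not the kvm_config.py implementation (which is Python); it is an illustration in C under several assumptions: a test name is a dot-joined list of keywords such as smp1.qcow2.ide.Fedora.14.boot, groups joined by ".." may match in any order, and names fit in a fixed buffer. The function names and MAX_LEN are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEN 256   /* arbitrary limit for this sketch */

/* "." (IMMEDIATELY-FOLLOWED-BY): the dotted keywords in group must appear
 * as a contiguous run of whole keywords inside name. */
static bool group_matches(const char *group, size_t glen, const char *name)
{
    char hay[MAX_LEN], needle[MAX_LEN];

    if (glen == 0 || glen + 3 > sizeof(needle) || strlen(name) + 3 > sizeof(hay))
        return false;
    /* Surround both with dots so only whole keywords can match. */
    snprintf(hay, sizeof(hay), ".%s.", name);
    snprintf(needle, sizeof(needle), ".%.*s.", (int)glen, group);
    return strstr(hay, needle) != NULL;
}

/* ".." (AND): every group of the alternative must match somewhere in name. */
static bool alternative_matches(const char *alt, size_t alen, const char *name)
{
    size_t start = 0;

    for (size_t i = 0; i <= alen; i++) {
        bool at_end = (i == alen);
        bool at_sep = !at_end && i + 1 < alen && alt[i] == '.' && alt[i + 1] == '.';

        if (at_end || at_sep) {
            if (!group_matches(alt + start, i - start, name))
                return false;
            start = i + 2;   /* next group begins after the ".." */
            i++;             /* skip the second dot of the separator */
        }
    }
    return true;
}

/* "," (OR): the filter matches if any comma-separated alternative matches. */
static bool filter_matches(const char *filter, const char *name)
{
    const char *p = filter;

    assert(filter != NULL && name != NULL);
    while (*p) {
        size_t len = strcspn(p, ",");
        const char *alt = p;
        size_t alen = len;

        while (alen > 0 && *alt == ' ') { alt++; alen--; }     /* trim left */
        while (alen > 0 && alt[alen - 1] == ' ') alen--;       /* trim right */
        if (alen > 0 && alternative_matches(alt, alen, name))
            return true;
        p += len;
        if (*p == ',')
            p++;
    }
    return false;
}
```

With the example filter from the patch description, a name like smp1.qcow2.ide.Fedora.14.boot is selected by the first alternative, while qcow2.Fedora.140.boot is not, because "14" has to match a whole keyword.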
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/10/2011 01:31 PM, Jan Kiszka wrote: > >> >> @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask) >>spin_unlock(&kvm->mmu_lock); >>srcu_read_unlock(&kvm->srcu, idx); >>} >> - if (kvm_freed) >> - list_move_tail(&kvm_freed->vm_list,&vm_list); >> + if (kvm_freed) { >> + raw_spin_lock(&kvm_lock); >> + if (!kvm->deleted) >> + list_move_tail(&kvm_freed->vm_list,&vm_list); > > There is no list_move_tail_rcu(). ...specifically not for this one. Well, we can add one if needed (and if possible). > > Why check kvm->deleted? it's in the process of being torn down anyway, > it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger. kvm_destroy_vm removes a vm from the list while mmu_shrink is running. Then mmu_shrink's list_move_tail will re-add that vm to the list tail again (unless already the removal in move_tail produces a crash). It's too subtle. Communication across threads with a variable needs memory barriers (even though they're nops on x86) and documentation. btw, not even sure if it's legal: you have a mutating call within an rcu read critical section for the same object. If synchronize_rcu() were called there, would it ever terminate? (not that synchronize_rcu() is a good thing there, better do it with call_rcu()). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: New API for PPC for vcpu mmu access
On Thu, Feb 10, 2011 at 12:55:22PM +0100, Alexander Graf wrote: > Scott Wood wrote: > > On Thu, 3 Feb 2011 10:19:06 +0100 > > Alexander Graf wrote: > > > > > >> Yeah, that one's tricky. Usually the way the memory resolver in qemu works > >> is as follows: > >> > >> * kvm goes to qemu > >> * qemu fetches all mmu and register data from kvm > >> * qemu runs its mmu resolution function as if the target was emulated > >> > >> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove > >> them into env and implement the MMU in qemu (at least enough of it to > >> enable debugging). No other target modifies this code path. But no other > >> target needs to copy > 30kb of data only to get the mmu data either :). > >> > > > > I guess you mean that cpu_synchronize_state() is supposed to pull in the > > MMU state, though I don't see where it gets called for 'm'/'M' commands in > > the gdb stub. > > > > Well, we could also call it in get_phys_page_debug in target-ppc, but > yes. I guess the reason it works for now is that SDR1 is pretty constant > and was fetched earlier on. For BookE not syncing is obviously even more > broken. > > > The MMU code seems to be pretty target-specific. It's not clear to what > > extent there is a "normal" way, versus what book3s happens to rely on in > > its get_physical_address() code. I don't think there are any platforms > > supported yet (with both KVM and a non-empty cpu_get_phys_page_debug() > > implementation) that have a pure software-managed TLB. x86 has page > > tables, and book3s has the hash table (603/e300 doesn't, or more accurately > > Linux doesn't use it, but I guess that's not supported by KVM yet?). > > > > As for PPC, only 440, e500 and G3-5 are basically supported. It happens > to work on POWER4 and above too and I've even got reports that it's good > on e600 :). > > > We could probably do some sort of lazy state transfer only when MMU code > > that needs it is run. 
This could initially include debug translations, for > > testing a non-KVM-dependent get_physical_address() implementation, but > > eventually that would use KVM_TRANSLATE (when KVM is used) and thus not > > > > Yup :). > > > trigger the state transfer. I'd also like to add an "info tlb" command, > > which would require the state transfer. > > > > Very nice. > > > BTW, how much other than the MMU is missing to be able to run an e500 > > target in qemu, without kvm? > > > > The last person working on BookE emulation was Edgar. Edgar, how far did > you get? Hi, TBH, I don't really know. My goal was to get linux running on a PPC-440 embedded in the Xilinx FPGAs. I managed to fix enough BookE emulation to get that far. After that, we've done a few more hacks to run fsboot and uboot. Also, we've added support for some of the BookE debug registers to be able to run gdbserver from within linux guests. Some of these patches haven't made it upstream yet. I haven't taken the time to compare the specs to qemu code, so I don't really know how much is missing. My guess is that if you want to run linux guests, the MMU won't be the limiting factor. Cheers
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 11:38 AM, Peter Maydell wrote: On 10 February 2011 10:13, Anthony Liguori wrote: On 02/10/2011 10:04 AM, Peter Maydell wrote: On 10 February 2011 08:36, Anthony Liguoriwrote: So you would model arm926ej-s as the chipset and then build up the machines by modifying parameters of the chipset (like the board id) and/or adding different components on top of it. Er, ARM926 is the CPU, it's not a chipset. The board ID is definitely not a property of an ARM926, it's a property of the board (clue is in the name :-)). I don't think versatile boards have a "chipset" really... As I said, I'm not well versed in the component names in ARM. But that said, an actual processor doesn't connect directly to a bunch of devices. It almost always go through some chipset and that chipset implements a lot of functionality typically. I think the name of the component I'm trying to refer to PL300 which I believe is the Northbridge used for the Versatile boards. PL300 is just a bus interconnect (so you can connect multiple AXI bus masters (cores) to multiple AXI bus slaves (devices)). Versatile PB doesn't have anything in the documentation that claims to be a Northbridge (PBX does, VExpress doesn't). This is the system diagram for the Versatile Express: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0447d/I1007683.html I don't know what you'd want to claim is a "northbridge" there. Basically there's an FPGA with a pile of devices in it, and there's a test chip with the core and some other devices in it. But from a modelling perspective this is all completely irrelevant because regardless of where the hardware designer put the devices, they're just devices at a particular point in the memory map and with a particular set of interrupt wiring and so on. But something interacts with each processor and dispatches the I/O operations in the address space, no? I can't believe there are 2^32 address lines coming off of every arm chip that each device connects. 
This relationship of how I/O fans out through various devices is important because occasionally platforms do weird things during I/O fan out like implement an IOMMU. If we don't model this I/O dispatch model within QEMU, then it's extremely difficult to implement things like IOMMUs. It might be the case that a platform has a chipset that is a pile of well isolated devices that are crammed in the same silicon space but that otherwise have very well defined interactions with each other. This is the exception though, not the rule. Particularly when looking at the relationship between certain devices on the PC (like the role the pckbd plays in address translation), things are simply not so idealized in practice. But if it makes sense for ARM to describe every single platform device through a factory interface, that's fine. Even in this case, you still want to model things like the distinction between the UART16650A and the ISA bus bridge for the serial device. In this case, you want to be able to do composition without going through a factory. An n900 is a very specific hardware configuration that is best represented by some sort of configuration file vs. something hard coded in QEMU. Yes, that's the whole point -- "machine" == "specific hardware configuration". That's not getting rid of "machine", it's just saying "we should have some custom scripting language to define them rather than doing them in C". You still want, fundamentally, to be able to say qemu-system-arm -M machinename No, qemu-system-arm -M /path/to/n900.cfg But yeah, no disagreement there. But today, the machine concept in QEMU is definitely not a specific hardware configuration. Regards, Anthony Liguori -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: New API for PPC for vcpu mmu access
Scott Wood wrote: > On Thu, 3 Feb 2011 10:19:06 +0100 > Alexander Graf wrote: > > >> Yeah, that one's tricky. Usually the way the memory resolver in qemu works >> is as follows: >> >> * kvm goes to qemu >> * qemu fetches all mmu and register data from kvm >> * qemu runs its mmu resolution function as if the target was emulated >> >> So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them >> into env and implement the MMU in qemu (at least enough of it to enable >> debugging). No other target modifies this code path. But no other target >> needs to copy > 30kb of data only to get the mmu data either :). >> > > I guess you mean that cpu_synchronize_state() is supposed to pull in the > MMU state, though I don't see where it gets called for 'm'/'M' commands in > the gdb stub. > Well, we could also call it in get_phys_page_debug in target-ppc, but yes. I guess the reason it works for now is that SDR1 is pretty constant and was fetched earlier on. For BookE not syncing is obviously even more broken. > The MMU code seems to be pretty target-specific. It's not clear to what > extent there is a "normal" way, versus what book3s happens to rely on in > its get_physical_address() code. I don't think there are any platforms > supported yet (with both KVM and a non-empty cpu_get_phys_page_debug() > implementation) that have a pure software-managed TLB. x86 has page > tables, and book3s has the hash table (603/e300 doesn't, or more accurately > Linux doesn't use it, but I guess that's not supported by KVM yet?). > As for PPC, only 440, e500 and G3-5 are basically supported. It happens to work on POWER4 and above too and I've even got reports that it's good on e600 :). > We could probably do some sort of lazy state transfer only when MMU code > that needs it is run. 
This could initially include debug translations, for > testing a non-KVM-dependent get_physical_address() implementation, but > eventually that would use KVM_TRANSLATE (when KVM is used) and thus not > Yup :). > trigger the state transfer. I'd also like to add an "info tlb" command, > which would require the state transfer. > Very nice. > BTW, how much other than the MMU is missing to be able to run an e500 > target in qemu, without kvm? > The last person working on BookE emulation was Edgar. Edgar, how far did you get? Alex
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 01:03 PM, Avi Kivity wrote:
> On 02/10/2011 12:57 PM, Michael Goldish wrote:
>> >
>> > I can't easily think of a case where this might cause confusion. The
>> > purpose of this is to allow people to write:
>> >
>> > only qcow2..raw..rtl8139
>> >
>> > without having to remember the order in which those were defined in
>> > tests_base.cfg.
>>
>> Sorry, I meant something like
>>
>> only qcow2..hugepages..rtl8139
>>
>> Obviously qcow2 and raw can't coexist.
>
> The config files describe a cartesian product, in which order matters.
>
> [A B C] x [1 2] generates [A1 A2 B1 B2 C1 C2]; no confusion here if you
> specify A..1
>
> however
>
> [A B C] x [A B] generates [AA AB BA BB CA CB]; A..B is ambiguous

This is a bad idea anyway:

[A B C] x [A B] x [install boot migrate]

'only A..install' is ambiguous regardless of whether we match in-order or not.

> we might require that keywords be unique.

Ambiguity can be resolved by prefixing a name with its immediate parent. If we have Fedora.9.32 and Fedora.9.64, and some test 'foo' has both a 32 bit and a 64 bit version, then the following isn't ambiguous:

only Fedora.9.32..foo.32

If we require that keywords be unique, such combinations will not be possible. The same applies to RHEL.3..sometest.3.
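One possible reading of the matching rule discussed above, sketched in Python. This is not the code in kvm_config.py. The helper names `contains_run` and `matches` are invented for illustration, and the sketch assumes that `.` joins keywords that must sit next to each other in a variant name, while `..` separates groups that may appear anywhere, in any order.

```python
def contains_run(variant, run):
    """True if the keyword list `run` appears contiguously inside `variant`."""
    n = len(run)
    return any(variant[i:i + n] == run for i in range(len(variant) - n + 1))


def matches(variant_name, filter_expr):
    """Assumed semantics: '..' splits the filter into groups, '.' joins
    keywords inside a group that must be adjacent in the variant name.
    Groups may match in any order (the order-independent proposal)."""
    variant = variant_name.split(".")
    groups = [group.split(".") for group in filter_expr.split("..")]
    return all(contains_run(variant, group) for group in groups)


# Hypothetical variant names: a test 'foo' with 32 and 64 bit versions,
# run on two Fedora guests.
names = [
    "Fedora.9.32.foo.32",
    "Fedora.9.32.foo.64",
    "Fedora.9.64.foo.32",
    "Fedora.9.64.foo.64",
]

# Prefixing each keyword with its immediate parent picks exactly one variant.
selected = [n for n in names if matches(n, "Fedora.9.32..foo.32")]
```

Under these assumptions `selected` holds only `Fedora.9.32.foo.32`, and `Fedora..boot` would also accept a name such as `boot.Fedora.9.32`, which is the order-independence being debated.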
Re: RFC: New API for PPC for vcpu mmu access
Scott Wood wrote: > On Wed, 9 Feb 2011 18:21:40 +0100 > Alexander Graf wrote: > > >> On 07.02.2011, at 21:15, Scott Wood wrote: >> >> >>> That's pretty much what the proposed API does -- except it uses a void >>> pointer instead of uint64_t *. >>> >> Oh? Did I miss something there? The proposal looked as if it only transfers >> a single TLB entry at a time. >> > > Right, I just meant in terms of avoiding a fixed reference to a hw-specific > type. > > >>> How about: >>> >>> struct kvmppc_booke_tlb_entry { >>> union { >>> __u64 mas0_1; >>> struct { >>> __u32 mas0; >>> __u32 mas1; >>> }; >>> }; >>> __u64 mas2; >>> union { >>> __u64 mas7_3 >>> struct { >>> __u32 mas7; >>> __u32 mas3; >>> }; >>> }; >>> __u32 mas8; >>> __u32 pad; >>> >> Would it make sense to add some reserved fields or would we just bump up the >> mmu id? >> > > I was thinking we'd just bump the ID. I only stuck "pad" in there for > alignment. And we're making a large array of it, so padding could hurt. > Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa). >>> struct kvmppc_booke_tlb_params { >>> /* >>> * book3e defines 4 TLBs. Individual implementations may have >>> * fewer. TLBs that do not exist on the target must be configured >>> * with a size of zero. KVM will adjust TLBnCFG based on the sizes >>> * configured here, though arrays greater than 2048 entries will >>> * have TLBnCFG[NENTRY] set to zero. >>> */ >>> __u32 tlb_sizes[4]; >>> >> Add some reserved fields? >> > > MMU type ID also controls this, but could add some padding to make > extensions simpler (esp. since we're not making an array of it). How much > would you recommend? > How about making it 64 bytes? 
That should leave us plenty of room.

>
>>> struct kvmppc_booke_tlb_search {
>>>
>> Search? I thought we agreed on having a search later, after the full get/set
>> is settled?
>>
>
> We agreed on having a full array-like get/set... my preference was to keep
> it all under one capability, which implies adding it at the same time.
> But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB. I'm
> skeptical that array-only will not be a performance issue under any usage
> pattern, but we can implement it and try it out before finalizing any of
> this.
>

Yup. We can even implement it, measure what exactly is slow and then decide on how to implement it. I'd bet that only the emulation stub is slow - and for that KVM_TRANSLATE seems like a good fit.

>
>>> struct kvmppc_booke_tlb_entry entry;
>>> union {
>>> __u64 mas5_6;
>>> struct {
>>> __u64 mas5;
>>> __u64 mas6;
>>> };
>>> };
>>> };
>>>
>
> The fields inside the struct should be __u32, of course. :-P
>

Ugh, yes :). But since we're dropping this anyway, it doesn't matter, right? :)

>
>>> - An entry with MAS1[V] = 0 terminates the list early (but there will
>>> be no terminating entry if the full array is valid). On a call to
>>> KVM_GET_TLB, the contents of elements after the terminator are undefined.
>>> On a call to KVM_SET_TLB, excess elements beyond the terminating
>>> entry may not be accessed by KVM.
>>>
>> Very implementation specific, but ok with me.
>>
>
> I assumed most MMU types would have some straightforward way of marking an
> entry invalid (if not, it can add a software field in the struct), and that
> it would be MMU-specific code that is processing the list.
>

See above :).

>
>> It's constrained to the BOOKE implementation of that GET/SET anyway. Is
>> this how the hardware works too?
>>
>
> Hardware doesn't process lists of entries. But MAS1[V] is the valid
> bit in hardware.
> > >>> [Note: Once we implement sregs, Qemu can determine which TLBs are >>> implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be >>> unsupported by KVM if its existence is implied by the target CPU] >>> >>> KVM_SET_TLB >>> --- >>> >>> Capability: KVM_CAP_SW_TLB >>> Type: vcpu ioctl >>> Parameters: struct kvm_set_tlb (in) >>> Returns: 0 on success >>> -1 on error >>> >>> struct kvm_set_tlb { >>> __u64 params; >>> __u64 array; >>> __u32 mmu_type; >>> }; >>> >>> [Note: I used __u64 rather than void * to avoid the need for special >>> compat handling with 32-bit userspace on a 64-bit kernel -- if the other >>> way is preferred, that's fin
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 2011-02-10 11:16, Avi Kivity wrote:
> On 02/08/2011 01:55 PM, Jan Kiszka wrote:
>> Only for walking the list of VMs, we do not need to hold the preemption
>> disabling kvm_lock. Convert stat services, the cpufreq callback and
>> mmu_shrink to RCU. For the latter, special care is required to
>> synchronize its list_move_tail with kvm_destroy_vm.
>>
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index b6a9963..e9d0ed8 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -3587,9 +3587,9 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>>      if (nr_to_scan == 0)
>>          goto out;
>>
>> -    raw_spin_lock(&kvm_lock);
>> +    rcu_read_lock();
>>
>> -    list_for_each_entry(kvm, &vm_list, vm_list) {
>> +    list_for_each_entry_rcu(kvm, &vm_list, vm_list) {
>>          int idx, freed_pages;
>>          LIST_HEAD(invalid_list);
>
> Have to #include rculist.h,

OK.

> and to change all list operations on vm_list
> to rcu variants.

Not sure if we have such variants for all cases...

>>
>> @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>>          spin_unlock(&kvm->mmu_lock);
>>          srcu_read_unlock(&kvm->srcu, idx);
>>      }
>> -    if (kvm_freed)
>> -        list_move_tail(&kvm_freed->vm_list, &vm_list);
>> +    if (kvm_freed) {
>> +        raw_spin_lock(&kvm_lock);
>> +        if (!kvm->deleted)
>> +            list_move_tail(&kvm_freed->vm_list, &vm_list);
>
> There is no list_move_tail_rcu().

...specifically not for this one.

> Why check kvm->deleted? it's in the process of being torn down anyway,
> it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.

kvm_destroy_vm can remove a vm from the list while mmu_shrink is running. mmu_shrink's list_move_tail would then re-add that vm to the tail of the list (unless the removal step inside list_move_tail already causes a crash).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
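The race described in the last reply can be modelled in a few lines of Python. This is a toy model only, not kernel code: a plain Python list stands in for vm_list, a `threading.Lock` stands in for kvm_lock, and the class and function names are invented for illustration. The interleaving is replayed sequentially (destroy first, then the shrinker's move) so the outcome is deterministic.

```python
import threading


class VM:
    """Stand-in for struct kvm; `deleted` mirrors the flag added by the patch."""

    def __init__(self, name):
        self.name = name
        self.deleted = False


kvm_lock = threading.Lock()   # models kvm_lock
vm_list = []                  # models the global vm_list


def destroy_vm(vm):
    """Models kvm_destroy_vm: mark the VM deleted and unlink it under the lock."""
    with kvm_lock:
        vm.deleted = True
        vm_list.remove(vm)


def move_tail_unchecked(vm):
    """Models list_move_tail without the check: unlink if present, then
    append unconditionally, which puts a destroyed VM back on the list."""
    if vm in vm_list:
        vm_list.remove(vm)
    vm_list.append(vm)


def move_tail_checked(vm):
    """Models the patched path: take the lock and skip VMs already deleted."""
    with kvm_lock:
        if not vm.deleted:
            vm_list.remove(vm)
            vm_list.append(vm)


def replay(move):
    """Replay the problematic ordering: destroy runs before the shrinker's move."""
    a, b = VM("a"), VM("b")
    vm_list[:] = [a, b]
    destroy_vm(a)
    move(a)
    return [vm.name for vm in vm_list]


without_check = replay(move_tail_unchecked)
with_check = replay(move_tail_checked)
```

In this model `without_check` ends up containing the destroyed VM again, while `with_check` does not. The real list_del on an already unlinked entry may instead fault, as the reply notes.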
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 10:38:53AM +, Peter Maydell wrote:
> This is the system diagram for the Versatile Express:
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0447d/I1007683.html
> I don't know what you'd want to claim is a "northbridge" there.
> Basically there's an FPGA with a pile of devices in it,
> and there's a test chip with the core and some other devices in
> it. But from a modelling perspective this is all completely
> irrelevant because regardless of where the hardware designer
> put the devices, they're just devices at a particular point in the
> memory map and with a particular set of interrupt wiring and so
> on. I don't see the point in modelling a concept that has no
> user-visible effects and doesn't actually make the model any
> clearer or simpler.
>

Exactly. This is really the same with x86. The fact that some company put several devices on the same chip and gave it a commercial name shouldn't govern our design.

> > A machine today is basically the northbridge, southbridge, plus a bunch of
> > default components to make the virtual hardware useful.
>
> This doesn't really correspond to ARM boards I've looked at,
> by and large (for instance there's no mention of the word "northbridge"
> in the whole 3700 page OMAP3 TRM). PCs may be best modelled
> that way, sure, but I don't think you can cram everything into that mould.
>

Even on x86 this model is falling apart. The memory controller moves to the CPU; the PCI controller will follow.

> >> If you mean that you want machines to be implemented under the
> >> hood as a single huge "device" you can only have one of that spans
> >> the entire memory map, well I guess that's an implementation
> >> detail. But conceptually machines really do exist, and we definitely
> >> still want users to be able to say "I want a beagle machine; I want
> >> a versatile; I want an n900".
>
> > An n900 is a very specific hardware configuration that is best represented
> > by some sort of configuration file vs. something hard coded in QEMU.
>
> Yes, that's the whole point -- "machine" == "specific hardware
> configuration".
>
> That's not getting rid of "machine", it's just saying "we should have
> some custom scripting language to define them rather than doing
> them in C". You still want, fundamentally, to be able to say
> qemu-system-arm -M machinename
>

+1

--
Gleb.
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 12:25:38PM +0200, Avi Kivity wrote: > On 02/10/2011 11:07 AM, Gleb Natapov wrote: > >On Thu, Feb 10, 2011 at 08:47:12AM +0100, Anthony Liguori wrote: > >> On 02/09/2011 09:15 PM, Blue Swirl wrote: > >> >On Wed, Feb 9, 2011 at 9:59 PM, Anthony Liguori > >> wrote: > >> >>On 02/09/2011 06:48 PM, Blue Swirl wrote: > >> ISASerialState dev; > >> > >> isa_serial_init(&dev, 0, 0x274, 0x07, NULL, NULL); > >> > >> >>>Do you mean that there should be a generic way of doing that, like > >> >>>sysbus_create_varargs() for qdev, or just add inline functions which > >> >>>hide qdev property setup? > >> >>> > >> >>>I still think that FDT should be used in the future. That would > >> >>>require that the properties can be set up mechanically, and I don't > >> >>>see how your proposal would help that. > >> >>> > >> >>Yeah, I don't think that is a good idea anymore. I think this is part > >> of > >> >>why we're having so many problems with qdev. > >> >> > >> >>While (most?) hardware hierarchies can be represented by device tree > >> syntax, > >> >>not all valid device trees correspond to interface and/or useful > >> hardware > >> >>hierarchies. > >> >User creates a non-working machine and so gets to fix the problems? > >> >How is that a problem for us? > >> > >> It's not about creating a non-working machine. It's about what > >> user-level abstraction we need to provide. > >> > >> It's a whole lot easier to implement an i440fx device with a fixed > >> set of parameters than it is to make every possible subdevice have a > >> proper factory interface along with mechanisms to hook everything > >> together. > >> > >So what if it is easier, it doesn't mean it is correct thing to do. What > >you are proposing is just a huge step backwards. May be we shouldn't > >support hooking everything together in completely arbitrary ways, but we > >shouldn't force isa/pci devices upon our users just because they are > >non-removable on real chip. > > I disagree. 
We don't want to deviate from the spec any more than we
> already do.
>

Which spec? Even in this discussion we have completely mixed up different things. 440FX is not a chipset. It is a memory controller/PCI host bridge. PIIX3/4 is the chipset, which is just an arbitrary combination of devices put on the same chip. We do not deviate from the spec when we implement those devices.

> The reason for wanting flexibility is because the code for the PIC
> or RTC, for example, can be used in other Super-IO chipsets or even
> standalone. If qemu only supported the 440FX chipset, we'd have no
> reason to make things flexible.

Again, you probably mean PIIX3. Even then, removing unused ide will free one more PCI slot for my cool virtio disk array. The thing is, from the code point of view, it does not cost you extra to allow composition of ide, since it is just a regular PCI device and we need to support composing those anyway.

>
> >>
> >> So very concretely, I'm suggesting we do the following to target-i386:
> >>
> >> 1) make the i440fx device have an embedded ide controller, piix3,
> >> and usb controller that get initialized automatically. The piix3
> >> embeds the PCI-to-ISA bridge along with all of the default ISA
> >> devices (rtc, serial, etc.).
> >This may be a problem even from security point of view. What if usb code
> >(ide, serial, parallel) has guest exploitable bug? Currently I can happily
> >continue running guests if they do not need affected subsystem. If we'll
> >get it your way I will no longer be able to do so.
>
> You can't just remove a device from a guest. You have to shut it
> down. When you power it back up, you may end up with different IRQ
> assignments or expose some guest bug.

As I already answered Anthony, I am not talking about changing the HW configuration after the guest is created, but rather about creating a minimal HW setup for the task from the start. This means no soundcard or usb for a Windows Exchange server, for instance.

>
> If you have a security issue in code that is exposed to the guest,
> you have to fix it.
>

Of course. That is why it is a good idea to expose as little code to the guest as possible. Don't you think so?

--
Gleb.
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 12:57 PM, Michael Goldish wrote:
> >
> > I can't easily think of a case where this might cause confusion. The
> > purpose of this is to allow people to write:
> >
> > only qcow2..raw..rtl8139
> >
> > without having to remember the order in which those were defined in
> > tests_base.cfg.
>
> Sorry, I meant something like
>
> only qcow2..hugepages..rtl8139
>
> Obviously qcow2 and raw can't coexist.

The config files describe a cartesian product, in which order matters.

[A B C] x [1 2] generates [A1 A2 B1 B2 C1 C2]; no confusion here if you specify A..1

however

[A B C] x [A B] generates [AA AB BA BB CA CB]; A..B is ambiguous

we might require that keywords be unique.

-- 
error compiling committee.c: too many arguments to function
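The cartesian product and the A..B ambiguity can be reproduced with a short Python sketch. It is only an illustration of the argument above, not the kvm_config.py parser; `expand`, `matches_unordered` and `matches_ordered` are hypothetical helpers, and each variant is represented as a tuple of keywords.

```python
from itertools import product


def expand(*axes):
    """Cartesian product of keyword lists, first axis varying slowest."""
    return [tuple(combo) for combo in product(*axes)]


def matches_unordered(variant, keywords):
    """Every keyword appears somewhere in the variant, in any order."""
    return all(k in variant for k in keywords)


def matches_ordered(variant, keywords):
    """Keywords appear in the variant in the given order, gaps allowed."""
    remaining = iter(variant)
    # `k in remaining` consumes the iterator up to the first hit, so each
    # keyword has to be found after the previous one.
    return all(k in remaining for k in keywords)


def names(variants):
    return ["".join(v) for v in variants]


distinct = expand(["A", "B", "C"], ["1", "2"])
clashing = expand(["A", "B", "C"], ["A", "B"])

any_order = names(v for v in clashing if matches_unordered(v, ["A", "B"]))
in_order = names(v for v in clashing if matches_ordered(v, ["A", "B"]))
```

With distinct keywords on each axis, `A..1` selects a single variant either way. With the clashing axes, order-independent matching selects both AB and BA, while in-order matching selects AB only, so the two readings of `A..B` disagree.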
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 12:55 PM, Michael Goldish wrote:
> On 02/10/2011 12:47 PM, Avi Kivity wrote:
>> On 02/10/2011 12:46 PM, Michael Goldish wrote:
>>> On 02/10/2011 12:34 PM, Avi Kivity wrote:
>>>> On 02/10/2011 11:14 AM, Michael Goldish wrote:
>>>>> only Fedora..boot
>>>>
>>>> So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
>>>> Windows.XP.32.boot or Fedora.9.32.migrate? seems reasonable.
>>>
>>> Correct, and it would also include boot.Fedora.9.32 and
>>> boot.9.32.Fedora, if there were such things.
>>
>> That's counterintuitive and requires careful planning.
>
> I can't easily think of a case where this might cause confusion. The
> purpose of this is to allow people to write:
>
> only qcow2..raw..rtl8139
>
> without having to remember the order in which those were defined in
> tests_base.cfg.

Sorry, I meant something like

only qcow2..hugepages..rtl8139

Obviously qcow2 and raw can't coexist.
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 12:47 PM, Avi Kivity wrote:
> On 02/10/2011 12:46 PM, Michael Goldish wrote:
>> On 02/10/2011 12:34 PM, Avi Kivity wrote:
>> > On 02/10/2011 11:14 AM, Michael Goldish wrote:
>> >> only Fedora..boot
>> >>
>> >
>> > So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
>> > Windows.XP.32.boot or Fedora.9.32.migrate? seems reasonable.
>>
>> Correct, and it would also include boot.Fedora.9.32 and
>> boot.9.32.Fedora, if there were such things.
>
> That's counterintuitive and requires careful planning.

I can't easily think of a case where this might cause confusion. The purpose of this is to allow people to write:

only qcow2..raw..rtl8139

without having to remember the order in which those were defined in tests_base.cfg.
Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
2011/2/10 Daniel P. Berrange :
> On Thu, Feb 10, 2011 at 07:23:33PM +0900, Yoshiaki Tamura wrote:
>> 2011/2/10 Daniel P. Berrange :
>> > On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
>> >> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
>> >> >Currently FdMigrationState doesn't support read(), and this patch
>> >> >introduces it to get response from the other side.
>> >> >
>> >> >Signed-off-by: Yoshiaki Tamura
>> >>
>> >> Migration is unidirectional. Changing this is fundamental and not
>> >> something to be done lightly.
>> >
>> > Making it bi-directional might break libvirt's save/restore
>> > to file support which uses migration, passing a unidirectional
>> > FD for the file. It could also break libvirt's secure tunnelled
>> > migration support which is currently only expecting to have
>> > data sent in one direction on the socket.
>>
>> Hi Daniel,
>>
>> IIUC, this patch isn't something to make existing live migration
>> bi-directional. Just opens up a way for Kemari to use it. Do
>> you think it's dangerous for libvirt still?
>
> The key is for it to be a no-op for any usage of the existing
> 'migrate' command. I had thought this was wiring up read into
> the event loop too, so it would be poll()ing for reads, but
> after re-reading I see this isn't the case here.

It's a no-op for existing migration related code. Anthony, did you have the same concern?

Yoshi

>
> Regards,
> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 11:19:48AM +0100, Anthony Liguori wrote: > On 02/10/2011 11:10 AM, Gleb Natapov wrote: > >On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote: > >>On 02/10/2011 10:07 AM, Gleb Natapov wrote: > >>>So what if it is easier, it doesn't mean it is correct thing to do. > >>If we spend the next 10 years trying to do the "correct thing" for > >>some arbitrary definition of correct, that's not terribly useful. > >Changing direction by 180 every 2 years even less useful. > > If we think through what we are doing and have a coherent > architecture before changing direction, then we won't have this > problem. > I'd like to believe this :) > >>It's really simple actually. Let's do the least clever thing and > >>model how hardware actual works. Once we have that, we can try to > >>be better than real hardware (if it's possible). > >I think out understanding on how HW actually works is very different. > >You are placing to much value on were device resides physically, for me > >it is completely unimportant detail. Not worth even mentioning. > > No, I place value on how things are modelled in the real world. Real world (physical HW) have consideration not relevant for our software emulation. Such as cost, physical dimension, power consumption and many other I am sure I missed. > > There simply aren't PC's out there that lack an RTC so I have no > interest in jumping through hoops in QEMU to make it possible to do > this without modifying QEMU code. It might sound nice to a > developer but it's of absolutely no use to users. > RTC is not good example. HPET suppose to replace it (and PIT too). AFAIC there are PCs without RTC already. Good example would be PIC or IOAPIC device and then I would agree with you that it is not worth it to make it possible to create x86 machine without them from command line if it means extra complexity. But how have you jumped from this to "lets make usb mandatory"? 
> If all composition is done through a factory interface, it doesn't. > But my main argument here is that we shouldn't try to make all > composition done through a factory interface--only where it makes > sense. > > So very concretely, I'm suggesting we do the following to target-i386: > > 1) make the i440fx device have an embedded ide controller, piix3, > and usb controller that get initialized automatically. The piix3 > embeds the PCI-to-ISA bridge along with all of the default ISA > devices (rtc, serial, etc.). > >>>This may be a problem even from security point of view. What if usb code > >>>(ide, serial, parallel) has guest exploitable bug? Currently I can happily > >>>continue running guests if they do not need affected subsystem. If we'll > >>>get it your way I will no longer be able to do so. > >>qemu -device i440fx,ide=off > >> > >So you still need to support arbitrary composition. What's the > >difference? > > No, we don't. It's possible to have an 'rtc=off' option but I'm > tremendously opposed to doing this. Arbitrary composition is not a > useful goal IMHO. IMHO is different. We should support composition where it makes sense. For PIC-less x86 it doesn't make it. For usb-less or even ide-less it does. > > > So why do you like -device i440fx over what we have now? > > Because I don't think tools like libvirt should be doing device > composition to create an i440fx-like chipset. I think the current > path we're on is pushing too much logic that belongs in QEMU into > the management stack. I can agree with that. But from this it doesn't follow that we should get rid of composition. We shouldn't push composition of common HW to libvirt. Looking at libvirt command line I do not think we do it though. Typical libvirt command line specifies disks, networks, usb, vga. How -device i440fx will simplified that? 
Well usb could be omitted (but not -usbdevice table), disks are not property of i440fx so they will stay, since user may want to use virtio controller (which is not part of i440fx) this should stay too. Network obviously will have to be specified by libvirt too, vga may go to i440fx, but since libvirt supports qxl we will have to have a way to disable default vga and enable qxl instead. So will we really simplify libvirt's life by introducing -device i440fx? > > >In current speak you propose will be implement by using i440fx machine > >type. Qdev will build it for you. > > If you had an i440fx machine type, that had no non-optional > components added, and you could specify options to the machine type, > yes. But I think you'll agree that there's no reason to not just > treat the i440fx as a device. I do not agree. There is not such device as i440fx. This is just packaging. > > >>If you really care to do this. But this desire to remove devices is > >>silly IMHO. Concerns about security are misplaced. If you have to > >>change the way a guest is invoked in order to eliminate security > >>problems, then there's someth
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 12:46 PM, Michael Goldish wrote:
> On 02/10/2011 12:34 PM, Avi Kivity wrote:
> > On 02/10/2011 11:14 AM, Michael Goldish wrote:
> >> only Fedora..boot
> >>
> >
> > So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
> > Windows.XP.32.boot or Fedora.9.32.migrate? seems reasonable.
>
> Correct, and it would also include boot.Fedora.9.32 and
> boot.9.32.Fedora, if there were such things.

That's counterintuitive and requires careful planning.

-- 
error compiling committee.c: too many arguments to function
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 12:34 PM, Avi Kivity wrote:
> On 02/10/2011 11:14 AM, Michael Goldish wrote:
>> only Fedora..boot
>>
>
> So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude
> Windows.XP.32.boot or Fedora.9.32.migrate? seems reasonable.

Correct, and it would also include boot.Fedora.9.32 and boot.9.32.Fedora, if there were such things.
Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
On Thu, Feb 10, 2011 at 07:23:33PM +0900, Yoshiaki Tamura wrote:
> 2011/2/10 Daniel P. Berrange :
> > On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
> >> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
> >> >Currently FdMigrationState doesn't support read(), and this patch
> >> >introduces it to get response from the other side.
> >> >
> >> >Signed-off-by: Yoshiaki Tamura
> >>
> >> Migration is unidirectional. Changing this is fundamental and not
> >> something to be done lightly.
> >
> > Making it bi-directional might break libvirt's save/restore
> > to file support which uses migration, passing a unidirectional
> > FD for the file. It could also break libvirt's secure tunnelled
> > migration support which is currently only expecting to have
> > data sent in one direction on the socket.
>
> Hi Daniel,
>
> IIUC, this patch isn't something to make existing live migration
> bi-directional. Just opens up a way for Kemari to use it. Do
> you think it's dangerous for libvirt still?

The key is for it to be a no-op for any usage of the existing 'migrate' command. I had thought this was wiring up read into the event loop too, so it would be poll()ing for reads, but after re-reading I see this isn't the case here.

Regards,
Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [PATCH] KVM: x86: Convert tsc_write_lock to raw_spinlock
On 02/04/2011 11:49 AM, Jan Kiszka wrote:
> Code under this lock requires non-preemptibility. Ensure this also over
> -rt by converting it to raw spinlock.

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes for Feb 8
On 10 February 2011 10:13, Anthony Liguori wrote: > On 02/10/2011 10:04 AM, Peter Maydell wrote: >> >> On 10 February 2011 08:36, Anthony Liguori wrote: >>> So you would model arm926ej-s as the chipset and then build up the >>> machines >>> by modifying parameters of the chipset (like the board id) and/or adding >>> different components on top of it. >>> >> >> Er, ARM926 is the CPU, it's not a chipset. The board ID is definitely >> not a property of an ARM926, it's a property of the board (clue is in >> the name :-)). I don't think versatile boards have a "chipset" really... >> > > As I said, I'm not well versed in the component names in ARM. > > But that said, an actual processor doesn't connect directly to a bunch of > devices. It almost always go through some chipset and that chipset > implements a lot of functionality typically. > > I think the name of the component I'm trying to refer to PL300 which I > believe is the Northbridge used for the Versatile boards. PL300 is just a bus interconnect (so you can connect multiple AXI bus masters (cores) to multiple AXI bus slaves (devices)). Versatile PB doesn't have anything in the documentation that claims to be a Northbridge (PBX does, VExpress doesn't). This is the system diagram for the Versatile Express: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0447d/I1007683.html I don't know what you'd want to claim is a "northbridge" there. Basically there's an FPGA with a pile of devices in it, and there's a test chip with the core and some other devices in it. But from a modelling perspective this is all completely irrelevant because regardless of where the hardware designer put the devices, they're just devices at a particular point in the memory map and with a particular set of interrupt wiring and so on. I don't see the point in modelling a concept that has no user-visible effects and doesn't actually make the model any clearer or simpler. 
>> In my understanding the "machine" is the thing that says "I need a >> 926, and an MMC controller at this address, and some UARTS, >> and..." ie it is the thing that does the "modifying parameters" >> and "adding different components". So if we'd still be doing that >> I don't see how we've "got rid of the concept". I guess I'm missing >> the point somehow. > A machine today is basically the northbridge, southbridge, plus a bunch of > default components to make the virtual hardware useful. This doesn't really correspond to ARM boards I've looked at, by and large (for instance there's no mention of the word "northbridge" in the whole 3700 page OMAP3 TRM). PCs may be best modelled that way, sure, but I don't think you can cram everything into that mould. >> If you mean that you want machines to be implemented under the >> hood as a single huge "device" you can only have one of that spans >> the entire memory map, well I guess that's an implementation >> detail. But conceptually machines really do exist, and we definitely >> still want users to be able to say "I want a beagle machine; I want >> a versatile; I want an n900". > An n900 is a very specific hardware configuration that is best represented > by some sort of configuration file vs. something hard coded in QEMU. Yes, that's the whole point -- "machine" == "specific hardware configuration". That's not getting rid of "machine", it's just saying "we should have some custom scripting language to define them rather than doing them in C". You still want, fundamentally, to be able to say qemu-system-arm -M machinename -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: remove isr_ack logic from PIC
On 02/09/2011 12:09 PM, Gleb Natapov wrote:
> isr_ack logic was added by e48258009d to avoid unnecessary IPIs. Back then it made sense, but now the code checks that vcpu is ready to accept interrupt before sending IPI, so this logic is no longer needed. The patch removes it.

Applied, thanks.

--
error compiling committee.c: too many arguments to function
Re: [Autotest] [KVM-AUTOTEST PATCH] KVM test: refactor kvm_config.py
On 02/10/2011 11:14 AM, Michael Goldish wrote:
> only Fedora..boot

So this would include Fedora.9.32.boot and Fedora.9.64.boot, but exclude Windows.XP.32.boot or Fedora.9.32.migrate? Seems reasonable.

--
error compiling committee.c: too many arguments to function
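The filter semantics asked about above can be sketched as a small matcher. This is only an illustration in C, not the kvm_config.py implementation: it assumes that "A..B" means "a dot-delimited component equal to A, followed somewhere later by a component equal to B", which is consistent with the four example names in this message but is not confirmed beyond them. The function names `find_component` and `matches_followed_by` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan the dot-delimited components of `name` and
 * return a pointer just past the first one equal to `word`, or NULL. */
static const char *find_component(const char *name, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = name;

    while (*p) {
        const char *end = strchr(p, '.');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len == wlen && strncmp(p, word, wlen) == 0) {
            return p + len;
        }
        if (!end) {
            break;
        }
        p = end + 1;
    }
    return NULL;
}

/* Assumed meaning of "first..last": `first` appears as a component and
 * `last` appears as a later component, with anything in between. */
static bool matches_followed_by(const char *name, const char *first,
                                const char *last)
{
    const char *rest;

    assert(name != NULL && first != NULL && last != NULL);

    rest = find_component(name, first);
    if (!rest) {
        return false;
    }
    if (*rest == '.') {
        rest++; /* step over the separator after `first` */
    }
    return find_component(rest, last) != NULL;
}
```

Under that assumption, `matches_followed_by("Fedora.9.32.boot", "Fedora", "boot")` holds, while the Windows.XP.32.boot and Fedora.9.32.migrate names are rejected.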
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 09:47 AM, Anthony Liguori wrote:
> So very concretely, I'm suggesting we do the following to target-i386:
>
> 1) make the i440fx device have an embedded ide controller, piix3, and usb controller that get initialized automatically. The piix3 embeds the PCI-to-ISA bridge along with all of the default ISA devices (rtc, serial, etc.).

This I like.

> 2) get rid of the entire concept of machines. Creating a i440fx is essentially equivalent to creating a bare machine.

No, it's not. The 440fx does not include an IOAPIC, for example. There may be other optional components, or differences in wiring, that make two machines with i440fx not identical.

> 4) model the CPUs as devices that take a pointer to a host controller, for x86, the normal case would be giving it a pointer to i440fx.

Surely the connection is via a bus? An x86 cpu talks to the bus, and there happens to be a 440fx north bridge at the end of it. It could also be a Q35 or something else.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 11:07 AM, Gleb Natapov wrote: On Thu, Feb 10, 2011 at 08:47:12AM +0100, Anthony Liguori wrote: > On 02/09/2011 09:15 PM, Blue Swirl wrote: > >On Wed, Feb 9, 2011 at 9:59 PM, Anthony Liguori wrote: > >>On 02/09/2011 06:48 PM, Blue Swirl wrote: > ISASerialState dev; > > isa_serial_init(&dev, 0, 0x274, 0x07, NULL, NULL); > > >>>Do you mean that there should be a generic way of doing that, like > >>>sysbus_create_varargs() for qdev, or just add inline functions which > >>>hide qdev property setup? > >>> > >>>I still think that FDT should be used in the future. That would > >>>require that the properties can be set up mechanically, and I don't > >>>see how your proposal would help that. > >>> > >>Yeah, I don't think that is a good idea anymore. I think this is part of > >>why we're having so many problems with qdev. > >> > >>While (most?) hardware hierarchies can be represented by device tree syntax, > >>not all valid device trees correspond to interface and/or useful hardware > >>hierarchies. > >User creates a non-working machine and so gets to fix the problems? > >How is that a problem for us? > > It's not about creating a non-working machine. It's about what > user-level abstraction we need to provide. > > It's a whole lot easier to implement an i440fx device with a fixed > set of parameters than it is to make every possible subdevice have a > proper factory interface along with mechanisms to hook everything > together. > So what if it is easier, it doesn't mean it is correct thing to do. What you are proposing is just a huge step backwards. May be we shouldn't support hooking everything together in completely arbitrary ways, but we shouldn't force isa/pci devices upon our users just because they are non-removable on real chip. I disagree. We don't want to deviate from the spec any more than we already do. The reason for wanting flexibility is because the code for the PIC or RTC, for example, can be used in other Super-IO chipsets or even standalone. 
If qemu only supported the 440FX chipset, we'd have no reason to make things flexible. > > So very concretely, I'm suggesting we do the following to target-i386: > > 1) make the i440fx device have an embedded ide controller, piix3, > and usb controller that get initialized automatically. The piix3 > embeds the PCI-to-ISA bridge along with all of the default ISA > devices (rtc, serial, etc.). This may be a problem even from security point of view. What if usb code (ide, serial, parallel) has guest exploitable bug? Currently I can happily continue running guests if they do not need affected subsystem. If we'll get it your way I will no longer be able to do so. You can't just remove a device from a guest. You have to shut it down. When you power it back up, you may end up with different IRQ assignments or expose some guest bug. If you have a security issue in code that is exposed to the guest, you have to fix it. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
2011/2/10 Daniel P. Berrange:
> On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
>> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
>> >Currently FdMigrationState doesn't support read(), and this patch introduces it to get response from the other side.
>> >
>> >Signed-off-by: Yoshiaki Tamura
>>
>> Migration is unidirectional. Changing this is fundamental and not something to be done lightly.
>
> Making it bi-directional might break libvirt's save/restore to file support which uses migration, passing a unidirectional FD for the file. It could also break libvirt's secure tunnelled migration support which is currently only expecting to have data sent in one direction on the socket.

Hi Daniel,

IIUC, this patch doesn't make the existing live migration bi-directional. It just opens up a way for Kemari to use it. Do you think it's still dangerous for libvirt?

Thanks,

Yoshi

> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 11:10 AM, Gleb Natapov wrote: On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote: On 02/10/2011 10:07 AM, Gleb Natapov wrote: So what if it is easier, it doesn't mean it is correct thing to do. If we spend the next 10 years trying to do the "correct thing" for some arbitrary definition of correct, that's not terribly useful. Changing direction by 180 every 2 years even less useful. If we think through what we are doing and have a coherent architecture before changing direction, then we won't have this problem. It's really simple actually. Let's do the least clever thing and model how hardware actual works. Once we have that, we can try to be better than real hardware (if it's possible). I think out understanding on how HW actually works is very different. You are placing to much value on were device resides physically, for me it is completely unimportant detail. Not worth even mentioning. No, I place value on how things are modelled in the real world. There simply aren't PC's out there that lack an RTC so I have no interest in jumping through hoops in QEMU to make it possible to do this without modifying QEMU code. It might sound nice to a developer but it's of absolutely no use to users. If all composition is done through a factory interface, it doesn't. But my main argument here is that we shouldn't try to make all composition done through a factory interface--only where it makes sense. So very concretely, I'm suggesting we do the following to target-i386: 1) make the i440fx device have an embedded ide controller, piix3, and usb controller that get initialized automatically. The piix3 embeds the PCI-to-ISA bridge along with all of the default ISA devices (rtc, serial, etc.). This may be a problem even from security point of view. What if usb code (ide, serial, parallel) has guest exploitable bug? Currently I can happily continue running guests if they do not need affected subsystem. 
If we'll get it your way I will no longer be able to do so. qemu -device i440fx,ide=off So you still need to support arbitrary composition. What's the difference? No, we don't. It's possible to have an 'rtc=off' option but I'm tremendously opposed to doing this. Arbitrary composition is not a useful goal IMHO. So why do you like -device i440fx over what we have now? Because I don't think tools like libvirt should be doing device composition to create an i440fx-like chipset. I think the current path we're on is pushing too much logic that belongs in QEMU into the management stack. In current speak you propose will be implement by using i440fx machine type. Qdev will build it for you. If you had an i440fx machine type, that had no non-optional components added, and you could specify options to the machine type, yes. But I think you'll agree that there's no reason to not just treat the i440fx as a device. If you really care to do this. But this desire to remove devices is silly IMHO. Concerns about security are misplaced. If you have to change the way a guest is invoked in order to eliminate security problems, then there's something seriously wrong. No I do not. I do not create guest with unneeded devices from the beginning. There is very little that isn't 'unneeded'. Regards, Anthony Liguori -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
On Thu, Feb 10, 2011 at 10:54:01AM +0100, Anthony Liguori wrote:
> On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
> >Currently FdMigrationState doesn't support read(), and this patch introduces it to get response from the other side.
> >
> >Signed-off-by: Yoshiaki Tamura
>
> Migration is unidirectional. Changing this is fundamental and not something to be done lightly.

Making it bi-directional might break libvirt's save/restore to file support which uses migration, passing a unidirectional FD for the file. It could also break libvirt's secure tunnelled migration support which is currently only expecting to have data sent in one direction on the socket.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [PATCH 2/2] KVM: Convert read-only users of vm_list to RCU
On 02/08/2011 01:55 PM, Jan Kiszka wrote:
> Only for walking the list of VMs, we do not need to hold the preemption disabling kvm_lock. Convert stat services, the cpufreq callback and mmu_shrink to RCU. For the latter, special care is required to synchronize its list_move_tail with kvm_destroy_vm.
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index b6a9963..e9d0ed8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3587,9 +3587,9 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>          if (nr_to_scan == 0)
>                  goto out;
>
> -        raw_spin_lock(&kvm_lock);
> +        rcu_read_lock();
>
> -        list_for_each_entry(kvm, &vm_list, vm_list) {
> +        list_for_each_entry_rcu(kvm, &vm_list, vm_list) {
>                  int idx, freed_pages;
>                  LIST_HEAD(invalid_list);

Have to #include rculist.h, and to change all list operations on vm_list to rcu variants.

> @@ -3607,10 +3607,14 @@ static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>                  spin_unlock(&kvm->mmu_lock);
>                  srcu_read_unlock(&kvm->srcu, idx);
>          }
> -        if (kvm_freed)
> -                list_move_tail(&kvm_freed->vm_list, &vm_list);
> +        if (kvm_freed) {
> +                raw_spin_lock(&kvm_lock);
> +                if (!kvm->deleted)
> +                        list_move_tail(&kvm_freed->vm_list, &vm_list);

There is no list_move_tail_rcu().

Why check kvm->deleted? It's in the process of being torn down anyway, it doesn't matter if mmu_shrink or kvm_destroy_vm pulls the trigger.

> +                raw_spin_unlock(&kvm_lock);
> +        }
>
> -        raw_spin_unlock(&kvm_lock);
> +        rcu_read_unlock();

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 10:04 AM, Peter Maydell wrote: On 10 February 2011 08:36, Anthony Liguori wrote: On 02/10/2011 09:16 AM, Peter Maydell wrote: On 10 February 2011 07:47, Anthony Liguoriwrote: 2) get rid of the entire concept of machines. Creating a i440fx is essentially equivalent to creating a bare machine. Does that make any sense for anything other than target-i386? The concept of a machine model seems a pretty obvious one for ARM boards, for instance, and I'm not sure we'd gain much by having i386 be different to the other architectures... Yes, it makes a lot of sense, I just don't know the component names as well so bear with me :-) There are two types of Versatile machines today, Versatile/AB and Versatile/PB. They are both made with the same core, ARM926EJ-S, with different expansions. So you would model arm926ej-s as the chipset and then build up the machines by modifying parameters of the chipset (like the board id) and/or adding different components on top of it. Er, ARM926 is the CPU, it's not a chipset. The board ID is definitely not a property of an ARM926, it's a property of the board (clue is in the name :-)). I don't think versatile boards have a "chipset" really... As I said, I'm not well versed in the component names in ARM. But that said, an actual processor doesn't connect directly to a bunch of devices. It almost always go through some chipset and that chipset implements a lot of functionality typically. I think the name of the component I'm trying to refer to PL300 which I believe is the Northbridge used for the Versatile boards. In my understanding the "machine" is the thing that says "I need a 926, and an MMC controller at this address, and some UARTS, and..." ie it is the thing that does the "modifying parameters" and "adding different components". So if we'd still be doing that I don't see how we've "got rid of the concept". I guess I'm missing the point somehow. 
A machine today is basically the northbridge, southbridge, plus a bunch of default components to make the virtual hardware useful. I'm suggesting that we model a proper northbridge/southbridge. A good way to think about what I'm proposing is that machine->init really should be a constructor for a device object. If you mean that you want machines to be implemented under the hood as a single huge "device" you can only have one of that spans the entire memory map, well I guess that's an implementation detail. But conceptually machines really do exist, and we definitely still want users to be able to say "I want a beagle machine; I want a versatile; I want an n900". An n900 is a very specific hardware configuration that is best represented by some sort of configuration file vs. something hard coded in QEMU. The question is, what level of component modelling do we need to do in order to make it practical to create such configurations from a file. Regards, Anthony Liguori -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call minutes for Feb 8
On Thu, Feb 10, 2011 at 11:00:50AM +0100, Anthony Liguori wrote:
> On 02/10/2011 10:07 AM, Gleb Natapov wrote:
> >So what if it is easier, it doesn't mean it is correct thing to do.
>
> If we spend the next 10 years trying to do the "correct thing" for some arbitrary definition of correct, that's not terribly useful.

Changing direction by 180 degrees every 2 years is even less useful.

> It's really simple actually. Let's do the least clever thing and model how hardware actual works. Once we have that, we can try to be better than real hardware (if it's possible).

I think our understanding of how HW actually works is very different. You are placing too much value on where a device resides physically; for me it is a completely unimportant detail. Not worth even mentioning.

> >>If all composition is done through a factory interface, it doesn't. But my main argument here is that we shouldn't try to make all composition done through a factory interface--only where it makes sense.
> >>
> >>So very concretely, I'm suggesting we do the following to target-i386:
> >>
> >>1) make the i440fx device have an embedded ide controller, piix3, and usb controller that get initialized automatically. The piix3 embeds the PCI-to-ISA bridge along with all of the default ISA devices (rtc, serial, etc.).
> >This may be a problem even from security point of view. What if usb code (ide, serial, parallel) has guest exploitable bug? Currently I can happily continue running guests if they do not need affected subsystem. If we'll get it your way I will no longer be able to do so.
>
> qemu -device i440fx,ide=off

So you still need to support arbitrary composition. What's the difference? So why do you like -device i440fx over what we have now? In current speak, what you propose would be implemented by using an i440fx machine type. Qdev will build it for you.

> If you really care to do this. But this desire to remove devices is silly IMHO. Concerns about security are misplaced. If you have to change the way a guest is invoked in order to eliminate security problems, then there's something seriously wrong.

No I do not. I do not create guests with unneeded devices from the beginning.

--
Gleb.
Re: [PATCH] kvm: fix detection of BIOS disabling VMX
On 02/08/2011 09:45 PM, Joseph Cihula wrote:
> This patch fixes the logic used to detect whether BIOS has disabled VMX.

Applied, thanks.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] KVM call minutes for Feb 8
On 02/10/2011 10:07 AM, Gleb Natapov wrote:
> So what if it is easier, it doesn't mean it is correct thing to do.

If we spend the next 10 years trying to do the "correct thing" for some arbitrary definition of correct, that's not terribly useful.

It's really simple actually. Let's do the least clever thing and model how hardware actually works. Once we have that, we can try to be better than real hardware (if it's possible).

>> If all composition is done through a factory interface, it doesn't. But my main argument here is that we shouldn't try to make all composition done through a factory interface--only where it makes sense.
>>
>> So very concretely, I'm suggesting we do the following to target-i386:
>>
>> 1) make the i440fx device have an embedded ide controller, piix3, and usb controller that get initialized automatically. The piix3 embeds the PCI-to-ISA bridge along with all of the default ISA devices (rtc, serial, etc.).
> This may be a problem even from security point of view. What if usb code (ide, serial, parallel) has guest exploitable bug? Currently I can happily continue running guests if they do not need affected subsystem. If we'll get it your way I will no longer be able to do so.

qemu -device i440fx,ide=off

If you really care to do this. But this desire to remove devices is silly IMHO. Concerns about security are misplaced. If you have to change the way a guest is invoked in order to eliminate security problems, then there's something seriously wrong.

Regards,

Anthony Liguori
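As a rough picture of proposal 1) and the ide=off switch, the idea of a parent device that owns and initialises its sub-devices might look like the sketch below. Every type, field, and function name here is hypothetical and chosen only for illustration; none of it is QEMU code, and real qdev devices carry far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All types below are hypothetical stand-ins, not QEMU structures. */
typedef struct { bool present; } IdeController;
typedef struct { bool present; } UsbController;

typedef struct {
    /* Default ISA devices that sit behind the PCI-to-ISA bridge. */
    bool rtc_present;
    bool serial_present;
} Piix3;

/* The parent embeds its children by value instead of having a machine
 * init function create and wire them one by one. */
typedef struct {
    Piix3 piix3;
    IdeController ide;
    UsbController usb;
} I440fx;

/* Hypothetical option block standing in for "-device i440fx,ide=off". */
typedef struct {
    bool ide;
} I440fxOptions;

/* The parent's constructor initialises every embedded child; the only
 * knob in this sketch is whether the IDE controller is exposed. */
static void i440fx_init(I440fx *s, const I440fxOptions *opts)
{
    assert(s != NULL && opts != NULL);

    s->piix3.rtc_present = true;
    s->piix3.serial_present = true;
    s->usb.present = true;
    s->ide.present = opts->ide;
}
```

The contrast with a factory-style interface is that callers here never assemble the chipset themselves; they only pass the few options the parent chooses to expose.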
Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
2011/2/10 Anthony Liguori : > On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote: >> >> Currently FdMigrationState doesn't support read(), and this patch >> introduces it to get response from the other side. >> >> Signed-off-by: Yoshiaki Tamura >> > > Migration is unidirectional. Changing this is fundamental and not something > to be done lightly. > > I thought we previously discussed using a protocol wrapper around the > existing migration protocol? AFAIR, I don't think we had that discussion before. I applied comments from Stefan though. If I missed the discussion, could you please give me the link? Thanks, Yoshi > > Regards, > > Anthony Liguori > >> --- >> migration-tcp.c | 15 +++ >> migration.c | 13 + >> migration.h | 3 +++ >> 3 files changed, 31 insertions(+), 0 deletions(-) >> >> diff --git a/migration-tcp.c b/migration-tcp.c >> index b55f419..55777c8 100644 >> --- a/migration-tcp.c >> +++ b/migration-tcp.c >> @@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void >> * buf, size_t size) >> return send(s->fd, buf, size, 0); >> } >> >> +static int socket_read(FdMigrationState *s, const void * buf, size_t >> size) >> +{ >> + ssize_t len; >> + >> + do { >> + len = recv(s->fd, (void *)buf, size, 0); >> + } while (len == -1&& socket_error() == EINTR); >> + if (len == -1) { >> + len = -socket_error(); >> + } >> + >> + return len; >> +} >> + >> static int tcp_close(FdMigrationState *s) >> { >> DPRINTF("tcp_close\n"); >> @@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor >> *mon, >> >> s->get_error = socket_errno; >> s->write = socket_write; >> + s->read = socket_read; >> s->close = tcp_close; >> s->mig_state.cancel = migrate_fd_cancel; >> s->mig_state.get_status = migrate_fd_get_status; >> diff --git a/migration.c b/migration.c >> index 3612572..f0df5fc 100644 >> --- a/migration.c >> +++ b/migration.c >> @@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const >> void *data, size_t size) >> return ret; >> } >> >> +int 
migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, >> size_t size) >> +{ >> + FdMigrationState *s = opaque; >> + int ret; >> + >> + ret = s->read(s, data, size); >> + if (ret == -1) { >> + ret = -(s->get_error(s)); >> + } >> + >> + return ret; >> +} >> + >> void migrate_fd_connect(FdMigrationState *s) >> { >> int ret; >> diff --git a/migration.h b/migration.h >> index 2170792..88a6987 100644 >> --- a/migration.h >> +++ b/migration.h >> @@ -48,6 +48,7 @@ struct FdMigrationState >> int (*get_error)(struct FdMigrationState*); >> int (*close)(struct FdMigrationState*); >> int (*write)(struct FdMigrationState*, const void *, size_t); >> + int (*read)(struct FdMigrationState *, const void *, size_t); >> void *opaque; >> }; >> >> @@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque); >> >> ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t >> size); >> >> +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, >> size_t size); >> + >> void migrate_fd_connect(FdMigrationState *s); >> >> void migrate_fd_put_ready(void *opaque); >> > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 02/18] Introduce read() to FdMigrationState.
On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
> Currently FdMigrationState doesn't support read(), and this patch introduces it to get response from the other side.
>
> Signed-off-by: Yoshiaki Tamura

Migration is unidirectional. Changing this is fundamental and not something to be done lightly.

I thought we previously discussed using a protocol wrapper around the existing migration protocol?

Regards,

Anthony Liguori

> ---
>  migration-tcp.c |   15 +++++++++++++++
>  migration.c     |   13 +++++++++++++
>  migration.h     |    3 +++
>  3 files changed, 31 insertions(+), 0 deletions(-)
>
> diff --git a/migration-tcp.c b/migration-tcp.c
> index b55f419..55777c8 100644
> --- a/migration-tcp.c
> +++ b/migration-tcp.c
> @@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size)
>      return send(s->fd, buf, size, 0);
>  }
>
> +static int socket_read(FdMigrationState *s, const void * buf, size_t size)
> +{
> +    ssize_t len;
> +
> +    do {
> +        len = recv(s->fd, (void *)buf, size, 0);
> +    } while (len == -1 && socket_error() == EINTR);
> +    if (len == -1) {
> +        len = -socket_error();
> +    }
> +
> +    return len;
> +}
> +
>  static int tcp_close(FdMigrationState *s)
>  {
>      DPRINTF("tcp_close\n");
> @@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
>
>      s->get_error = socket_errno;
>      s->write = socket_write;
> +    s->read = socket_read;
>      s->close = tcp_close;
>      s->mig_state.cancel = migrate_fd_cancel;
>      s->mig_state.get_status = migrate_fd_get_status;
> diff --git a/migration.c b/migration.c
> index 3612572..f0df5fc 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
>      return ret;
>  }
>
> +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size)
> +{
> +    FdMigrationState *s = opaque;
> +    int ret;
> +
> +    ret = s->read(s, data, size);
> +    if (ret == -1) {
> +        ret = -(s->get_error(s));
> +    }
> +
> +    return ret;
> +}
> +
>  void migrate_fd_connect(FdMigrationState *s)
>  {
>      int ret;
> diff --git a/migration.h b/migration.h
> index 2170792..88a6987 100644
> --- a/migration.h
> +++ b/migration.h
> @@ -48,6 +48,7 @@ struct FdMigrationState
>      int (*get_error)(struct FdMigrationState*);
>      int (*close)(struct FdMigrationState*);
>      int (*write)(struct FdMigrationState*, const void *, size_t);
> +    int (*read)(struct FdMigrationState *, const void *, size_t);
>      void *opaque;
>  };
>
> @@ -116,6 +117,8 @@ void migrate_fd_put_notify(void *opaque);
>
>  ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
>
> +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size);
> +
>  void migrate_fd_connect(FdMigrationState *s);
>
>  void migrate_fd_put_ready(void *opaque);
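The retry loop in the quoted socket_read() can be tried outside QEMU with a small standalone sketch. This is only an approximation under stated assumptions: it is POSIX-only and reads errno directly where the patch goes through QEMU's socket_error() wrapper, and the name `recv_retry_eintr` is hypothetical. It takes a non-const buffer because recv() writes into it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper, not QEMU code: receive up to `size` bytes from `fd`,
 * restarting the call whenever a signal interrupts it. Returns the number
 * of bytes read (0 on orderly shutdown) or a negative errno value. */
static ssize_t recv_retry_eintr(int fd, void *buf, size_t size)
{
    ssize_t len;

    assert(buf != NULL);

    do {
        len = recv(fd, buf, size, 0);
    } while (len == -1 && errno == EINTR);

    if (len == -1) {
        return -errno; /* e.g. -EAGAIN on a non-blocking socket */
    }
    return len;
}
```

Returning the negated errno keeps "would block" (-EAGAIN) distinguishable from other failures, which matters to a caller that wants to wait for the descriptor to become readable rather than give up.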
Re: [PATCH 18/18] Introduce "kemari:" to enable FT migration mode (Kemari).
On 02/10/2011 10:30 AM, Yoshiaki Tamura wrote:
> When "kemari:" is set in front of URI of migrate command, it will turn on ft_mode to start FT migration mode (Kemari). On the receiver side, the option looks like, -incoming kemari:::
>
> Signed-off-by: Yoshiaki Tamura
> ---
>  hmp-commands.hx |    4 +++-
>  migration.c     |   12 ++++++++++++
>  qmp-commands.hx |    4 +++-
>  3 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 38e1eb7..ee14344 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -760,7 +760,9 @@ ETEXI
>                        "\n\t\t\t -b for migration without shared storage with"
>                        " full copy of disk\n\t\t\t -i for migration without "
>                        "shared storage with incremental copy of disk "
> -                      "(base image shared between src and destination)",
> +                      "(base image shared between src and destination)"
> +                      "\n\t\t\t put \"kemari:\" in front of URI to enable "
> +                      "Fault Tolerance mode (Kemari protocol)",
>          .user_print = monitor_user_noop,
>          .mhandler.cmd_new = do_migrate,
>      },
> diff --git a/migration.c b/migration.c
> index 7837c55..a3f7722 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -48,6 +48,12 @@ int qemu_start_incoming_migration(const char *uri)
>      const char *p;
>      int ret;
>
> +    /* check ft_mode (Kemari protocol) */
> +    if (strstart(uri, "kemari:", &p)) {
> +        ft_mode = FT_INIT;
> +        uri = p;
> +    }
> +
>      if (strstart(uri, "tcp:", &p))
>          ret = tcp_start_incoming_migration(p);
>  #if !defined(WIN32)
> @@ -99,6 +105,12 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
>          return -1;
>      }
>
> +    /* check ft_mode (Kemari protocol) */
> +    if (strstart(uri, "kemari:", &p)) {
> +        ft_mode = FT_INIT;
> +        uri = p;
> +    }
> +
>      if (strstart(uri, "tcp:", &p)) {
>          s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
>                                           blk, inc);
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index df40a3d..68ca48a 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -437,7 +437,9 @@ EQMP
>                        "\n\t\t\t -b for migration without shared storage with"
>                        " full copy of disk\n\t\t\t -i for migration without "
>                        "shared storage with incremental copy of disk "
> -                      "(base image shared between src and destination)",
> +                      "(base image shared between src and destination)"
> +                      "\n\t\t\t put \"kemari:\" in front of URI to enable "
> +                      "Fault Tolerance mode (Kemari protocol)",
>          .user_print = monitor_user_noop,
>          .mhandler.cmd_new = do_migrate,
>      },

Acked-by: Paolo Bonzini

Paolo
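The prefix handling that the patch adds in two places can be illustrated in isolation. The sketch below is standalone C and makes some assumptions: `strip_prefix` is a hypothetical stand-in for the strstart() helper the patch calls, `parse_ft_prefix` is an invented wrapper, and a boolean return value replaces the patch's global ft_mode assignment. The example address is a documentation-range placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the prefix test used in the patch: if `str`
 * begins with `prefix`, store a pointer to the remainder in *rest. */
static bool strip_prefix(const char *str, const char *prefix,
                         const char **rest)
{
    size_t n;

    assert(str != NULL && prefix != NULL);

    n = strlen(prefix);
    if (strncmp(str, prefix, n) != 0) {
        return false;
    }
    if (rest) {
        *rest = str + n;
    }
    return true;
}

/* Invented wrapper: returns true when the URI asked for FT mode and
 * advances *uri past "kemari:", so the usual "tcp:" handling that
 * follows sees an ordinary migration URI. */
static bool parse_ft_prefix(const char **uri)
{
    const char *p;

    if (strip_prefix(*uri, "kemari:", &p)) {
        *uri = p;
        return true;
    }
    return false;
}
```

With this shape, "kemari:tcp:192.0.2.1:4444" yields true and leaves "tcp:192.0.2.1:4444" for the transport check, while a plain "tcp:" URI passes through unchanged. Sharing one helper between the incoming and outgoing paths would also avoid repeating the same five lines in both functions.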
[GIT PULL] KVM fix for 2.6.38-rc4
Linus, please pull a KVM fix from

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm-updates/2.6.38

This closes a small window during which an NMI could kill an AMD host.

Joerg Roedel (1):
      KVM: SVM: Make sure KERNEL_GS_BASE is valid when loading gs_index

 arch/x86/kvm/svm.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

--
error compiling committee.c: too many arguments to function
[PATCH 04/18] qemu-char: export socket_set_nodelay().
Signed-off-by: Yoshiaki Tamura
---
 qemu-char.c   |    2 +-
 qemu_socket.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/qemu-char.c b/qemu-char.c
index ee4f4ca..7286aeb 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2111,7 +2111,7 @@ static void tcp_chr_telnet_init(int fd)
     send(fd, (char *)buf, 3, 0);
 }
 
-static void socket_set_nodelay(int fd)
+void socket_set_nodelay(int fd)
 {
     int val = 1;
     setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
diff --git a/qemu_socket.h b/qemu_socket.h
index 897a8ae..b7f8465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
 int qemu_socket(int domain, int type, int protocol);
 int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
 void socket_set_nonblock(int fd);
+void socket_set_nodelay(int fd);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
-- 
1.7.1.2
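For readers unfamiliar with the option being exported here, the following standalone sketch shows TCP_NODELAY being set and read back on a POSIX system. It is not part of the patch: `set_nodelay_checked` and `get_nodelay` are hypothetical names, and unlike the void socket_set_nodelay() above, this variant returns the setsockopt() result so a caller could notice failure.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical variant of the exported helper: disable Nagle's algorithm
 * on `fd` and report whether setsockopt() succeeded (0) or failed (-1). */
static int set_nodelay_checked(int fd)
{
    int val = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val));
}

/* Hypothetical companion: returns 1 if TCP_NODELAY is enabled on `fd`,
 * 0 if it is disabled, or -1 if the option could not be read. */
static int get_nodelay(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) != 0) {
        return -1;
    }
    return val != 0;
}
```

Disabling Nagle's algorithm sends small writes immediately instead of coalescing them, which presumably matters for the short acknowledgement messages exchanged later in this series.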
[PATCH 16/18] migration: introduce migrate_ft_trans_{put,get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.
Introduce migrate_ft_trans_put_ready() which kicks the FT transaction cycle. When ft_mode is on, migrate_fd_put_ready() would open ft_trans_file and turn on event_tap. To end or cancel FT transaction, ft_mode and event_tap is turned off. migrate_ft_trans_get_ready() is called to receive ack from the receiver. Signed-off-by: Yoshiaki Tamura --- migration.c | 261 ++- 1 files changed, 260 insertions(+), 1 deletions(-) diff --git a/migration.c b/migration.c index c5e0146..7837c55 100644 --- a/migration.c +++ b/migration.c @@ -21,6 +21,7 @@ #include "qemu_socket.h" #include "block-migration.h" #include "qemu-objects.h" +#include "event-tap.h" //#define DEBUG_MIGRATION @@ -283,6 +284,14 @@ void migrate_fd_error(FdMigrationState *s) migrate_fd_cleanup(s); } +static void migrate_ft_trans_error(FdMigrationState *s) +{ +ft_mode = FT_ERROR; +qemu_savevm_state_cancel(s->mon, s->file); +migrate_fd_error(s); +event_tap_unregister(); +} + int migrate_fd_cleanup(FdMigrationState *s) { int ret = 0; @@ -318,6 +327,17 @@ void migrate_fd_put_notify(void *opaque) qemu_file_put_notify(s->file); } +static void migrate_fd_get_notify(void *opaque) +{ +FdMigrationState *s = opaque; + +qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); +qemu_file_get_notify(s->file); +if (qemu_file_has_error(s->file)) { +migrate_ft_trans_error(s); +} +} + ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) { FdMigrationState *s = opaque; @@ -353,6 +373,10 @@ int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size) ret = -(s->get_error(s)); } +if (ret == -EAGAIN) { +qemu_set_fd_handler2(s->fd, NULL, migrate_fd_get_notify, NULL, s); +} + return ret; } @@ -379,6 +403,230 @@ void migrate_fd_connect(FdMigrationState *s) migrate_fd_put_ready(s); } +static int migrate_ft_trans_commit(void *opaque) +{ +FdMigrationState *s = opaque; +int ret = -1; + +if (ft_mode != FT_TRANSACTION_COMMIT && ft_mode != FT_TRANSACTION_ATOMIC) { +fprintf(stderr, +"migrate_ft_trans_commit: 
invalid ft_mode %d\n", ft_mode); +goto out; +} + +do { +if (ft_mode == FT_TRANSACTION_ATOMIC) { +if (qemu_ft_trans_begin(s->file) < 0) { +fprintf(stderr, "qemu_ft_trans_begin failed\n"); +goto out; +} + +ret = qemu_savevm_trans_begin(s->mon, s->file, 0); +if (ret < 0) { +fprintf(stderr, "qemu_savevm_trans_begin failed\n"); +goto out; +} + +ft_mode = FT_TRANSACTION_COMMIT; +if (ret) { +/* don't proceed until if fd isn't ready */ +goto out; +} +} + +/* make the VM state consistent by flushing outstanding events */ +vm_stop(0); + +/* send at full speed */ +qemu_file_set_rate_limit(s->file, 0); + +ret = qemu_savevm_trans_complete(s->mon, s->file); +if (ret < 0) { +fprintf(stderr, "qemu_savevm_trans_complete failed\n"); +goto out; +} + +ret = qemu_ft_trans_commit(s->file); +if (ret < 0) { +fprintf(stderr, "qemu_ft_trans_commit failed\n"); +goto out; +} + +if (ret) { +ft_mode = FT_TRANSACTION_RECV; +ret = 1; +goto out; +} + +/* flush and check if events are remaining */ +vm_start(); +ret = event_tap_flush_one(); +if (ret < 0) { +fprintf(stderr, "event_tap_flush_one failed\n"); +goto out; +} + +ft_mode = ret ? 
FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC; +} while (ft_mode != FT_TRANSACTION_BEGIN); + +vm_start(); +ret = 0; + +out: +return ret; +} + +static int migrate_ft_trans_get_ready(void *opaque) +{ +FdMigrationState *s = opaque; +int ret = -1; + +if (ft_mode != FT_TRANSACTION_RECV) { +fprintf(stderr, +"migrate_ft_trans_get_ready: invalid ft_mode %d\n", ft_mode); +goto error_out; +} + +/* flush and check if events are remaining */ +vm_start(); +ret = event_tap_flush_one(); +if (ret < 0) { +fprintf(stderr, "event_tap_flush_one failed\n"); +goto error_out; +} + +if (ret) { +ft_mode = FT_TRANSACTION_BEGIN; +} else { +ft_mode = FT_TRANSACTION_ATOMIC; + +ret = migrate_ft_trans_commit(s); +if (ret < 0) { +goto error_out; +} +if (ret) { +goto out; +} +} + +vm_start(); +ret = 0; +goto out; + +error_out: +migrate_ft_trans_error(s); + +out: +return ret; +} + +static int migrate_ft_trans_put_ready(void) +{ +FdMigrationState *s = migrate_to_fms(current_m
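The mode transitions at the end of one commit attempt in migrate_ft_trans_commit() can be condensed as follows. This is only my reading of the quoted code, not part of the patch: the enum and next_ft_mode() are hypothetical names, and the meaning given to the return values is an assumption taken from the surrounding error handling.

```c
/* Hypothetical mode names modelled on the ft_mode values used in the patch. */
enum ft_mode_sketch {
    SK_ERROR,
    SK_TRANSACTION_BEGIN,
    SK_TRANSACTION_ATOMIC,
    SK_TRANSACTION_RECV
};

/*
 * Condensed decision taken after one commit attempt.
 *   commit_ret: result of sending the commit
 *               (negative = failure, positive = wait for the receiver)
 *   flush_ret:  result of flushing one queued event (negative = failure)
 */
enum ft_mode_sketch next_ft_mode(int commit_ret, int flush_ret)
{
    if (commit_ret < 0) {
        return SK_ERROR;
    }
    if (commit_ret > 0) {
        /* the ack has not arrived yet, so the get_ready path takes over */
        return SK_TRANSACTION_RECV;
    }
    if (flush_ret < 0) {
        return SK_ERROR;
    }
    /* same ternary as in the patch: non-zero goes back to BEGIN,
     * zero starts another transaction right away */
    return flush_ret ? SK_TRANSACTION_BEGIN : SK_TRANSACTION_ATOMIC;
}
```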
[PATCH 09/18] Introduce event-tap.
event-tap controls when to start FT transaction, and provides proxy functions to called from net/block devices. While FT transaction, it queues up net/block requests, and flush them when the transaction gets completed. Signed-off-by: Yoshiaki Tamura Signed-off-by: OHMURA Kei --- Makefile.target |1 + event-tap.c | 939 +++ event-tap.h | 44 +++ qemu-tool.c | 28 ++ trace-events| 10 + 5 files changed, 1022 insertions(+), 0 deletions(-) create mode 100644 event-tap.c create mode 100644 event-tap.h diff --git a/Makefile.target b/Makefile.target index b0ba95f..edbdbee 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,6 +199,7 @@ obj-y += rwhandler.o obj-$(CONFIG_KVM) += kvm.o kvm-all.o obj-$(CONFIG_NO_KVM) += kvm-stub.o LIBS+=-lz +obj-y += event-tap.o QEMU_CFLAGS += $(VNC_TLS_CFLAGS) QEMU_CFLAGS += $(VNC_SASL_CFLAGS) diff --git a/event-tap.c b/event-tap.c new file mode 100644 index 000..f44d835 --- /dev/null +++ b/event-tap.c @@ -0,0 +1,939 @@ +/* + * Event Tap functions for QEMU + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include "qemu-common.h" +#include "qemu-error.h" +#include "block.h" +#include "block_int.h" +#include "ioport.h" +#include "osdep.h" +#include "sysemu.h" +#include "hw/hw.h" +#include "net.h" +#include "event-tap.h" +#include "trace.h" + +enum EVENT_TAP_STATE { +EVENT_TAP_OFF, +EVENT_TAP_ON, +EVENT_TAP_SUSPEND, +EVENT_TAP_FLUSH, +EVENT_TAP_LOAD, +EVENT_TAP_REPLAY, +}; + +static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF; + +typedef struct EventTapIOport { +uint32_t address; +uint32_t data; +int index; +} EventTapIOport; + +#define MMIO_BUF_SIZE 8 + +typedef struct EventTapMMIO { +uint64_t address; +uint8_t buf[MMIO_BUF_SIZE]; +int len; +} EventTapMMIO; + +typedef struct EventTapNetReq { +char *device_name; +int iovcnt; +int vlan_id; +bool vlan_needed; +bool async; +struct iovec *iov; +NetPacketSent *sent_cb; +} EventTapNetReq; + +#define MAX_BLOCK_REQUEST 32 + +typedef struct EventTapAIOCB EventTapAIOCB; + +typedef struct EventTapBlkReq { +char *device_name; +int num_reqs; +int num_cbs; +bool is_flush; +BlockRequest reqs[MAX_BLOCK_REQUEST]; +EventTapAIOCB *acb[MAX_BLOCK_REQUEST]; +} EventTapBlkReq; + +#define EVENT_TAP_IOPORT (1 << 0) +#define EVENT_TAP_MMIO (1 << 1) +#define EVENT_TAP_NET(1 << 2) +#define EVENT_TAP_BLK(1 << 3) + +#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1) + +typedef struct EventTapLog { +int mode; +union { +EventTapIOport ioport; +EventTapMMIO mmio; +}; +union { +EventTapNetReq net_req; +EventTapBlkReq blk_req; +}; +QTAILQ_ENTRY(EventTapLog) node; +} EventTapLog; + +struct EventTapAIOCB { +BlockDriverAIOCB common; +BlockDriverAIOCB *acb; +bool is_canceled; +}; + +static EventTapLog *last_event_tap; + +static QTAILQ_HEAD(, EventTapLog) event_list; +static QTAILQ_HEAD(, EventTapLog) event_pool; + +static int (*event_tap_cb)(void); +static QEMUBH *event_tap_bh; +static VMChangeStateEntry *vmstate; + +static void event_tap_bh_cb(void *p) +{ +if (event_tap_cb) { +event_tap_cb(); +} + +qemu_bh_delete(event_tap_bh); +event_tap_bh 
= NULL; +} + +static void event_tap_schedule_bh(void) +{ +trace_event_tap_ignore_bh(!!event_tap_bh); + +/* if bh is already set, we ignore it for now */ +if (event_tap_bh) { +return; +} + +event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL); +qemu_bh_schedule(event_tap_bh); + +return; +} + +static void *event_tap_alloc_log(void) +{ +EventTapLog *log; + +if (QTAILQ_EMPTY(&event_pool)) { +log = qemu_mallocz(sizeof(EventTapLog)); +} else { +log = QTAILQ_FIRST(&event_pool); +QTAILQ_REMOVE(&event_pool, log, node); +} + +return log; +} + +static void event_tap_free_net_req(EventTapNetReq *net_req); +static void event_tap_free_blk_req(EventTapBlkReq *blk_req); + +static void event_tap_free_log(EventTapLog *log) +{ +int mode = log->mode & ~EVENT_TAP_TYPE_MASK; + +if (mode == EVENT_TAP_NET) { +event_tap_free_net_req(&log->net_req); +} else if (mode == EVENT_TAP_BLK) { +event_tap_free_blk_req(&log->blk_req); +} + +log->mode = 0; + +/* return the log to event_pool */ +QTAILQ_INSERT_HEAD(&event_pool, log, node); +} + +static void event_tap_free_pool(void) +{ +EventTapLog *log, *next; + +QTAILQ_FOREACH_SAFE(log, &event_pool, node, next) { +QTAILQ_REMOVE(&event_pool, log, node); +qemu_free(log); +} +} + +static void event_tap_free_net_req(EventTapNetReq *net_req) +{ +int i; + +if (!net_req->async) { +for (i = 0; i <
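The event_pool handling quoted above (take a log from the pool if one is available, otherwise allocate; return logs to the pool instead of freeing them) is a plain free-list recycler. A minimal sketch of the same idea without the QTAILQ macros follows; LogEntry, LogPool and the pool_* functions are hypothetical names chosen for the example.

```c
#include <stdlib.h>

typedef struct LogEntry {
    int mode;
    struct LogEntry *next;   /* link used only while the entry sits in the pool */
} LogEntry;

typedef struct LogPool {
    LogEntry *free_list;
} LogPool;

/* Reuse a pooled entry if there is one, otherwise allocate a zeroed one. */
LogEntry *pool_alloc(LogPool *pool)
{
    LogEntry *e = pool->free_list;

    if (e) {
        pool->free_list = e->next;
        e->next = NULL;
        return e;
    }
    return calloc(1, sizeof(*e));
}

/* Reset the entry and put it at the head of the pool for later reuse. */
void pool_release(LogPool *pool, LogEntry *e)
{
    e->mode = 0;
    e->next = pool->free_list;
    pool->free_list = e;
}

/* Free every pooled entry; read the next pointer before freeing. */
void pool_destroy(LogPool *pool)
{
    LogEntry *e = pool->free_list;

    while (e) {
        LogEntry *next = e->next;
        free(e);
        e = next;
    }
    pool->free_list = NULL;
}
```

The point of the pool is to avoid a malloc/free pair per tapped request on the hot path; the cost is that memory is only given back when the pool is torn down.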
[PATCH 06/18] virtio: decrement last_avail_idx with inuse before saving.
For regular migration inuse == 0 always as requests are flushed before
save. However, event-tap log when enabled introduces an extra queue
for requests which is not being flushed, thus the last inuse requests
are left in the event-tap queue. Move the last_avail_idx value sent
to the remote back to make it repeat the last inuse requests.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Yoshiaki Tamura
---
 hw/virtio.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index 31bd9e3..f05d1b6 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -673,12 +673,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
     qemu_put_be32(f, i);

     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+        /* For regular migration inuse == 0 always as
+         * requests are flushed before save. However,
+         * event-tap log when enabled introduces an extra
+         * queue for requests which is not being flushed,
+         * thus the last inuse requests are left in the event-tap queue.
+         * Move the last_avail_idx value sent to the remote back
+         * to make it repeat the last inuse requests. */
+        uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
         if (vdev->vq[i].vring.num == 0)
             break;

         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
-        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_put_be16s(f, &last_avail);
         if (vdev->binding->save_queue)
             vdev->binding->save_queue(vdev->binding_opaque, i, f);
     }
-- 
1.7.1.2
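The subtraction in this patch is stored in a uint16_t, so it wraps modulo 2^16 when inuse is larger than last_avail_idx. That is the behaviour wanted for a free-running 16-bit ring index. A small sketch of the arithmetic, with rewind_avail_idx() as a hypothetical name and inuse taken as uint16_t for simplicity:

```c
#include <stdint.h>

/* Index to send to the destination: step back over the requests that are
 * still in flight. The result is truncated to 16 bits, so stepping back
 * past zero wraps around to the top of the index space. */
uint16_t rewind_avail_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}
```

For example, last_avail_idx = 2 with inuse = 5 gives 65533, which is three steps before index 0 in the 16-bit space, so the destination re-processes exactly the five in-flight requests.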
[PATCH 00/18] Kemari for KVM v0.2.10
Hi,

This patch series is a revised version of Kemari for KVM, which addresses the comments on the previous post. The current code is based on qemu.git f26e5a54f0554798a2e6f7a074b809b13635d007.

The changes from v0.2.9 -> v0.2.10 are:

- change migrate format to kemari::: (Paolo)

The changes from v0.2.8 -> v0.2.9 are:

- abstract common code between qemu_savevm_{state,trans}_* (Paolo)
- change incoming format to kemari::: (Paolo)

The changes from v0.2.7 -> v0.2.8 are:

- fixed calling wrong cb in event-tap
- add missing qemu_aio_release in event-tap

The changes from v0.2.6 -> v0.2.7 are:

- add AIOCB, AIOPool and cancel functions (Kevin)
- insert event-tap for bdrv_flush (Kevin)
- add error handling when calling bdrv functions (Kevin)
- fix usage of qemu_aio_flush and bdrv_flush (Kevin)
- use bs in AIOCB on the primary (Kevin)
- reorder event-tap functions to gather with block/net (Kevin)
- fix checking bs->device_name (Kevin)

The changes from v0.2.5 -> v0.2.6 are:

- use qemu_{put,get}_be32() to save/load niov in event-tap

The changes from v0.2.4 -> v0.2.5 are:

- fixed braces and trailing spaces by using Blue's checkpatch.pl (Blue)
- event-tap: don't try to send blk_req if it's a bdrv_aio_flush event

The changes from v0.2.3 -> v0.2.4 are:

- call vm_start() before event_tap_flush_one() to avoid failure in virtio-net assertion
- add vm_change_state_handler to turn off ft_mode
- use qemu_iovec functions in event-tap
- remove duplicated code in migration
- remove unnecessary new line for error_report in ft_trans_file

The changes from v0.2.2 -> v0.2.3 are:

- queue async net requests without copying (MST)
  -- if not async, contents of the packets are sent to the secondary
- better description for option -k (MST)
- fix memory transfer failure
- fix ft transaction initiation failure

The changes from v0.2.1 -> v0.2.2 are:

- decrement last_avail_idx with inuse before saving (MST)
- remove qemu_aio_flush() and bdrv_flush_all() in migrate_ft_trans_commit()

The changes from v0.2 -> v0.2.1 are:

- Move event-tap to net/block layer and use stubs (Blue, Paul, MST, Kevin)
- Tap bdrv_aio_flush (Marcelo)
- Remove multiwrite interface in event-tap (Stefan)
- Fix event-tap to use pio/mmio to replay both net/block (Stefan)
- Improve error handling in event-tap (Stefan)
- Fix leak in event-tap (Stefan)
- Revise virtio last_avail_idx manipulation (MST)
- Clean up migration.c hook (Marcelo)
- Make deleting change state handler robust (Isaku, Anthony)

The changes from v0.1.1 -> v0.2 are:

- Introduce a queue in event-tap to make VM sync live.
- Change transaction receiver to a state machine for async receiving.
- Replace net/block layer functions with event-tap proxy functions.
- Remove dirty bitmap optimization for now.
- convert DPRINTF() in ft_trans_file to trace functions.
- convert fprintf() in ft_trans_file to error_report().
- improved error handling in ft_trans_file.
- add a tmp pointer to qemu_del_vm_change_state_handler.

The changes from v0.1 -> v0.1.1 are:

- events are tapped in net/block layer instead of device emulation layer.
- Introduce a new option for -incoming to accept FT transaction.
- Removed writev() support to QEMUFile and FdMigrationState for now. I will post this work in a separate series.
- Modified virtio-blk save/load handler to send inuse variable to correctly replay.
- Removed configure --enable-ft-mode.
- Removed unnecessary check for qemu_realloc().

The first 6 patches modify several functions of qemu to prepare for introducing Kemari specific components.

The next 6 patches are the components of Kemari. They introduce event-tap and the FT transaction protocol file based on buffered file. The design document of the FT transaction protocol can be found at:

http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf

Then the following 2 patches modify net/block layer functions with event-tap functions. Please note that if Kemari is off, event-tap will just pass through, and there is almost no intrusion into existing functions, including normal live migration.

Finally, the migration layer is modified to support Kemari in the last 4 patches. Again, there shouldn't be any effect if a user doesn't specify Kemari specific options. The transaction is now async on both the sender and receiver side. The sender side respects max_downtime to decide when to switch from async to sync mode.

The repository contains all patches I'm sending with this message. For those who want to try, please pull the following repository. It also includes the dirty bitmap optimization, which isn't ready for posting yet. To remove the dirty bitmap optimization, please look at HEAD~4 of the tree.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari next

Thanks,

Yoshi

Yoshiaki Tamura (18):
  Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
  Introduce read() to FdMigrationState.
  Introduce skip_header parameter to qemu_loadvm_state().
  qemu-char: export soc
[PATCH 17/18] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
When ft_mode is set in the header, tcp_accept_incoming_migration() sets ft_trans_incoming() as a callback, and call qemu_file_get_notify() to receive FT transaction iteratively. We also need a hack no to close fd before moving to ft_transaction mode, so that we can reuse the fd for it. vm_change_state_handler is added to turn off ft_mode when cont is pressed. Signed-off-by: Yoshiaki Tamura --- migration-tcp.c | 67 ++- 1 files changed, 66 insertions(+), 1 deletions(-) diff --git a/migration-tcp.c b/migration-tcp.c index 55777c8..84076d6 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -18,6 +18,8 @@ #include "sysemu.h" #include "buffered_file.h" #include "block.h" +#include "ft_trans_file.h" +#include "event-tap.h" //#define DEBUG_MIGRATION_TCP @@ -29,6 +31,8 @@ do { } while (0) #endif +static VMChangeStateEntry *vmstate; + static int socket_errno(FdMigrationState *s) { return socket_error(); @@ -56,7 +60,8 @@ static int socket_read(FdMigrationState *s, const void * buf, size_t size) static int tcp_close(FdMigrationState *s) { DPRINTF("tcp_close\n"); -if (s->fd != -1) { +/* FIX ME: accessing ft_mode here isn't clean */ +if (s->fd != -1 && ft_mode != FT_INIT) { close(s->fd); s->fd = -1; } @@ -150,6 +155,36 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, return &s->mig_state; } +static void ft_trans_incoming(void *opaque) +{ +QEMUFile *f = opaque; + +qemu_file_get_notify(f); +if (qemu_file_has_error(f)) { +ft_mode = FT_ERROR; +qemu_fclose(f); +} +} + +static void ft_trans_reset(void *opaque, int running, int reason) +{ +QEMUFile *f = opaque; + +if (running) { +if (ft_mode != FT_ERROR) { +qemu_fclose(f); +} +ft_mode = FT_OFF; +qemu_del_vm_change_state_handler(vmstate); +} +} + +static void ft_trans_schedule_replay(QEMUFile *f) +{ +event_tap_schedule_replay(); +vmstate = qemu_add_vm_change_state_handler(ft_trans_reset, f); +} + static void tcp_accept_incoming_migration(void *opaque) { struct sockaddr_in addr; @@ -175,8 +210,38 @@ static void 
tcp_accept_incoming_migration(void *opaque) goto out; } +if (ft_mode == FT_INIT) { +autostart = 0; +} + process_incoming_migration(f); + +if (ft_mode == FT_INIT) { +int ret; + +socket_set_nodelay(c); + +f = qemu_fopen_ft_trans(s, c); +if (f == NULL) { +fprintf(stderr, "could not qemu_fopen_ft_trans\n"); +goto out; +} + +/* need to wait sender to setup */ +ret = qemu_ft_trans_begin(f); +if (ret < 0) { +goto out; +} + +qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f); +ft_trans_schedule_replay(f); +ft_mode = FT_TRANSACTION_RECV; + +return; +} + qemu_fclose(f); + out: close(c); out2: -- 1.7.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/18] ioport: insert event_tap_ioport() to ioport_write().
Record ioport event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura
---
 ioport.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index aa4188a..74aebf5 100644
--- a/ioport.c
+++ b/ioport.c
@@ -27,6 +27,7 @@

 #include "ioport.h"
 #include "trace.h"
+#include "event-tap.h"

 /***/
 /* IO Port */
@@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
         default_ioport_writel
     };
     IOPortWriteFunc *func = ioport_write_table[index][address];
+    event_tap_ioport(index, address, data);
     if (!func)
         func = default_func[index];
     func(ioport_opaque[address], address, data);
-- 
1.7.1.2
[PATCH 10/18] Call init handler of event-tap at main() in vl.c.
Signed-off-by: Yoshiaki Tamura
---
 vl.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 00155fb..f4d4abf 100644
--- a/vl.c
+++ b/vl.c
@@ -162,6 +162,7 @@ int main(int argc, char **argv)
 #include "qemu-queue.h"
 #include "cpus.h"
 #include "arch_init.h"
+#include "event-tap.h"

 #include "ui/qemu-spice.h"

@@ -2919,6 +2920,8 @@ int main(int argc, char **argv, char **envp)

     blk_mig_init();

+    event_tap_init();
+
     /* open the virtual block devices */
     if (snapshot)
         qemu_opts_foreach(qemu_find_opts("drive"), drive_enable_snapshot, NULL, 0);
-- 
1.7.1.2
[PATCH 02/18] Introduce read() to FdMigrationState.
Currently FdMigrationState doesn't support read(), and this patch introduces it to get response from the other side. Signed-off-by: Yoshiaki Tamura --- migration-tcp.c | 15 +++ migration.c | 13 + migration.h |3 +++ 3 files changed, 31 insertions(+), 0 deletions(-) diff --git a/migration-tcp.c b/migration-tcp.c index b55f419..55777c8 100644 --- a/migration-tcp.c +++ b/migration-tcp.c @@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size) return send(s->fd, buf, size, 0); } +static int socket_read(FdMigrationState *s, const void * buf, size_t size) +{ +ssize_t len; + +do { +len = recv(s->fd, (void *)buf, size, 0); +} while (len == -1 && socket_error() == EINTR); +if (len == -1) { +len = -socket_error(); +} + +return len; +} + static int tcp_close(FdMigrationState *s) { DPRINTF("tcp_close\n"); @@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon, s->get_error = socket_errno; s->write = socket_write; +s->read = socket_read; s->close = tcp_close; s->mig_state.cancel = migrate_fd_cancel; s->mig_state.get_status = migrate_fd_get_status; diff --git a/migration.c b/migration.c index 3612572..f0df5fc 100644 --- a/migration.c +++ b/migration.c @@ -340,6 +340,19 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) return ret; } +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size) +{ +FdMigrationState *s = opaque; +int ret; + +ret = s->read(s, data, size); +if (ret == -1) { +ret = -(s->get_error(s)); +} + +return ret; +} + void migrate_fd_connect(FdMigrationState *s) { int ret; diff --git a/migration.h b/migration.h index 2170792..88a6987 100644 --- a/migration.h +++ b/migration.h @@ -48,6 +48,7 @@ struct FdMigrationState int (*get_error)(struct FdMigrationState*); int (*close)(struct FdMigrationState*); int (*write)(struct FdMigrationState*, const void *, size_t); +int (*read)(struct FdMigrationState *, const void *, size_t); void *opaque; }; @@ -116,6 +117,8 
@@ void migrate_fd_put_notify(void *opaque); ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size); +int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size); + void migrate_fd_connect(FdMigrationState *s); void migrate_fd_put_ready(void *opaque); -- 1.7.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
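The socket_read() helper in this patch follows the usual pattern of retrying recv() on EINTR and returning a negative errno on failure. Below is a standalone sketch of that pattern, assuming POSIX sockets; recv_retry() is a hypothetical name and plain errno stands in for QEMU's socket_error().

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Receive up to size bytes from fd.
 * Retries when the call is interrupted by a signal.
 * Returns the number of bytes read, 0 on orderly shutdown,
 * or a negative errno value on failure. */
ssize_t recv_retry(int fd, void *buf, size_t size)
{
    ssize_t len;

    do {
        len = recv(fd, buf, size, 0);
    } while (len == -1 && errno == EINTR);

    if (len == -1) {
        len = -errno;
    }
    return len;
}
```

Because the helper already folds the error code into its return value, a caller only needs to test for a negative result; on a non-blocking socket, -EAGAIN is the value that tells it to wait for the fd to become readable.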
[PATCH 08/18] savevm: introduce util functions to control ft_trans_file from savevm layer.
To utilize ft_trans_file function, savevm needs interfaces to be exported. Signed-off-by: Yoshiaki Tamura --- hw/hw.h |5 ++ savevm.c | 149 ++ 2 files changed, 154 insertions(+), 0 deletions(-) diff --git a/hw/hw.h b/hw/hw.h index a168a37..a9eff5a 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, QEMUFile *qemu_fopen(const char *filename, const char *mode); QEMUFile *qemu_fdopen(int fd, const char *mode); QEMUFile *qemu_fopen_socket(int fd); +QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd); QEMUFile *qemu_popen(FILE *popen_file, const char *mode); QEMUFile *qemu_popen_cmd(const char *command, const char *mode); int qemu_stdio_fd(QEMUFile *f); @@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size); void qemu_put_byte(QEMUFile *f, int v); void *qemu_realloc_buffer(QEMUFile *f, int size); void qemu_clear_buffer(QEMUFile *f); +int qemu_ft_trans_begin(QEMUFile *f); +int qemu_ft_trans_commit(QEMUFile *f); +int qemu_ft_trans_cancel(QEMUFile *f); static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v) { @@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f); * halted due to rate limiting or EAGAIN errors occur as it can be used to * resume output. 
*/ void qemu_file_put_notify(QEMUFile *f); +void qemu_file_get_notify(void *opaque); static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv) { diff --git a/savevm.c b/savevm.c index 58e48e3..e44eccd 100644 --- a/savevm.c +++ b/savevm.c @@ -82,6 +82,7 @@ #include "migration.h" #include "qemu_socket.h" #include "qemu-queue.h" +#include "ft_trans_file.h" #define SELF_ANNOUNCE_ROUNDS 5 @@ -189,6 +190,13 @@ typedef struct QEMUFileSocket QEMUFile *file; } QEMUFileSocket; +typedef struct QEMUFileSocketTrans +{ +int fd; +QEMUFileSocket *s; +VMChangeStateEntry *e; +} QEMUFileSocketTrans; + static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size) { QEMUFileSocket *s = opaque; @@ -204,6 +212,22 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size) return len; } +static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size) +{ +QEMUFileSocket *s = opaque; +ssize_t len; + +do { +len = send(s->fd, (void *)buf, size, 0); +} while (len == -1 && socket_error() == EINTR); + +if (len == -1) { +len = -socket_error(); +} + +return len; +} + static int socket_close(void *opaque) { QEMUFileSocket *s = opaque; @@ -211,6 +235,70 @@ static int socket_close(void *opaque) return 0; } +static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t->s; +ssize_t len; + +len = socket_get_buffer(s, buf, pos, size); + +return len; +} + +static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size) +{ +QEMUFileSocketTrans *t = opaque; + +return socket_put_buffer(t->s, buf, size); +} + + +static int socket_trans_get_ready(void *opaque) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t->s; +QEMUFile *f = s->file; +int ret = 0; + +ret = qemu_loadvm_state(f, 1); +if (ret < 0) { +fprintf(stderr, +"socket_trans_get_ready: error while loading vmstate\n"); +} + +return ret; +} + +static int socket_trans_close(void 
*opaque) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t->s; + +qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL); +qemu_set_fd_handler2(t->fd, NULL, NULL, NULL, NULL); +qemu_del_vm_change_state_handler(t->e); +close(s->fd); +close(t->fd); +qemu_free(s); +qemu_free(t); + +return 0; +} + +static void socket_trans_resume(void *opaque, int running, int reason) +{ +QEMUFileSocketTrans *t = opaque; +QEMUFileSocket *s = t->s; + +if (!running) { +return; +} + +qemu_announce_self(); +qemu_fclose(s->file); +} + static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size) { QEMUFileStdio *s = opaque; @@ -333,6 +421,26 @@ QEMUFile *qemu_fopen_socket(int fd) return s->file; } +QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd) +{ +QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans)); +QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket)); + +t->s = s; +t->fd = s_fd; +t->e = qemu_add_vm_change_state_handler(socket_trans_resume, t); + +s->fd = c_fd; +s->file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer, + socket_trans_get_buffer, NULL, + socket_trans_get_ready, + migrate_fd_wait_for_unfreeze, + socket_trans_close, 0
[PATCH 05/18] vl.c: add deleted flag for deleting the handler.
Make deleting handlers robust against deletion of any elements in a
handler by using a deleted flag like in file descriptors.

Signed-off-by: Yoshiaki Tamura
---
 vl.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/vl.c b/vl.c
index ed2cdfa..00155fb 100644
--- a/vl.c
+++ b/vl.c
@@ -1158,6 +1158,7 @@ static void nographic_update(void *opaque)
 struct vm_change_state_entry {
     VMChangeStateHandler *cb;
     void *opaque;
+    int deleted;
     QLIST_ENTRY (vm_change_state_entry) entries;
 };

@@ -1178,8 +1179,7 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,

 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
 {
-    QLIST_REMOVE (e, entries);
-    qemu_free (e);
+    e->deleted = 1;
 }

 void vm_state_notify(int running, int reason)
@@ -1188,8 +1188,13 @@ void vm_state_notify(int running, int reason)

     trace_vm_state_notify(running, reason);

-    for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
-        e->cb(e->opaque, running, reason);
+    QLIST_FOREACH(e, &vm_change_state_head, entries) {
+        if (e->deleted) {
+            QLIST_REMOVE(e, entries);
+            qemu_free(e);
+        } else {
+            e->cb(e->opaque, running, reason);
+        }
     }
 }

-- 
1.7.1.2
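The deleted-flag idea can be tried outside QEMU with a plain singly linked list. The sketch below is not the patch's code: all names are hypothetical, and the walk unlinks a dead entry through the link that points at it before freeing, so the loop never reads a freed node. A forward-iteration macro that fetches the next pointer from the current element after the loop body would need the same care.

```c
#include <stdlib.h>

typedef void Handler(void *opaque, int running);

typedef struct Entry {
    Handler *cb;
    void *opaque;
    int deleted;
    struct Entry *next;
} Entry;

typedef struct HandlerList {
    Entry *head;
} HandlerList;

/* Register a handler at the head of the list; returns NULL on allocation failure. */
Entry *handler_add(HandlerList *l, Handler *cb, void *opaque)
{
    Entry *e = calloc(1, sizeof(*e));

    if (!e) {
        return NULL;
    }
    e->cb = cb;
    e->opaque = opaque;
    e->next = l->head;
    l->head = e;
    return e;
}

/* Only mark the entry; it is safe to call this from inside a callback. */
void handler_del(Entry *e)
{
    e->deleted = 1;
}

/* Call live handlers and reap the ones marked as deleted. */
void handler_notify(HandlerList *l, int running)
{
    Entry **link = &l->head;

    while (*link) {
        Entry *e = *link;

        if (e->deleted) {
            *link = e->next;   /* unlink first, then free */
            free(e);
        } else {
            e->cb(e->opaque, running);
            link = &e->next;   /* e is still valid: handler_del only flags */
        }
    }
}

/* Example callback: counts invocations through its opaque pointer. */
void count_cb(void *opaque, int running)
{
    (void)running;
    ++*(int *)opaque;
}
```

A handler that deletes itself from inside its own callback is simply skipped and reaped on the next notification, which is the robustness the commit message is after.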
[PATCH 14/18] block: insert event-tap to bdrv_aio_writev(), bdrv_aio_flush() and bdrv_flush().
event-tap function is called only when it is on, and requests were sent from device emulators. Signed-off-by: Yoshiaki Tamura Acked-by: Kevin Wolf --- block.c | 15 +++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index b476479..8ddce13 100644 --- a/block.c +++ b/block.c @@ -28,6 +28,7 @@ #include "block_int.h" #include "module.h" #include "qemu-objects.h" +#include "event-tap.h" #ifdef CONFIG_BSD #include @@ -1482,6 +1483,10 @@ int bdrv_flush(BlockDriverState *bs) } if (bs->drv && bs->drv->bdrv_flush) { +if (*bs->device_name && event_tap_is_on()) { +event_tap_bdrv_flush(); +} + return bs->drv->bdrv_flush(bs); } @@ -2117,6 +2122,11 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num, if (bdrv_check_request(bs, sector_num, nb_sectors)) return NULL; +if (*bs->device_name && event_tap_is_on()) { +return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors, + cb, opaque); +} + if (bs->dirty_bitmap) { blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb, opaque); @@ -2380,6 +2390,11 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs, if (!drv) return NULL; + +if (*bs->device_name && event_tap_is_on()) { +return event_tap_bdrv_aio_flush(bs, cb, opaque); +} + return drv->bdrv_aio_flush(bs, cb, opaque); } -- 1.7.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/18] Introduce fault tolerant VM transaction QEMUFile and ft_mode.
This code implements VM transaction protocol. Like buffered_file, it sits between savevm and migration layer. With this architecture, VM transaction protocol is implemented mostly independent from other existing code. Signed-off-by: Yoshiaki Tamura Signed-off-by: OHMURA Kei --- Makefile.objs |1 + ft_trans_file.c | 624 +++ ft_trans_file.h | 72 +++ migration.c |3 + trace-events| 15 ++ 5 files changed, 715 insertions(+), 0 deletions(-) create mode 100644 ft_trans_file.c create mode 100644 ft_trans_file.h diff --git a/Makefile.objs b/Makefile.objs index 353b1a8..04148b5 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -100,6 +100,7 @@ common-obj-y += msmouse.o ps2.o common-obj-y += qdev.o qdev-properties.o common-obj-y += block-migration.o common-obj-y += pflib.o +common-obj-y += ft_trans_file.o common-obj-$(CONFIG_BRLAPI) += baum.o common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o diff --git a/ft_trans_file.c b/ft_trans_file.c new file mode 100644 index 000..2b42b95 --- /dev/null +++ b/ft_trans_file.c @@ -0,0 +1,624 @@ +/* + * Fault tolerant VM transaction QEMUFile + * + * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * This source code is based on buffered_file.c. + * Copyright IBM, Corp. 
2008
+ * Authors:
+ *  Anthony Liguori
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "trace.h"
+#include "ft_trans_file.h"
+
+typedef struct FtTransHdr
+{
+    uint16_t cmd;
+    uint16_t id;
+    uint32_t seq;
+    uint32_t payload_len;
+} FtTransHdr;
+
+typedef struct QEMUFileFtTrans
+{
+    FtTransPutBufferFunc *put_buffer;
+    FtTransGetBufferFunc *get_buffer;
+    FtTransPutReadyFunc *put_ready;
+    FtTransGetReadyFunc *get_ready;
+    FtTransWaitForUnfreezeFunc *wait_for_unfreeze;
+    FtTransCloseFunc *close;
+    void *opaque;
+    QEMUFile *file;
+
+    enum QEMU_VM_TRANSACTION_STATE state;
+    uint32_t seq;
+    uint16_t id;
+
+    int has_error;
+
+    bool freeze_output;
+    bool freeze_input;
+    bool rate_limit;
+    bool is_sender;
+    bool is_payload;
+
+    uint8_t *buf;
+    size_t buf_max_size;
+    size_t put_offset;
+    size_t get_offset;
+
+    FtTransHdr header;
+    size_t header_offset;
+} QEMUFileFtTrans;
+
+#define IO_BUF_SIZE 32768
+
+static void ft_trans_append(QEMUFileFtTrans *s,
+                            const uint8_t *buf, size_t size)
+{
+    if (size > (s->buf_max_size - s->put_offset)) {
+        trace_ft_trans_realloc(s->buf_max_size, size + 1024);
+        s->buf_max_size += size + 1024;
+        s->buf = qemu_realloc(s->buf, s->buf_max_size);
+    }
+
+    trace_ft_trans_append(size);
+    memcpy(s->buf + s->put_offset, buf, size);
+    s->put_offset += size;
+}
+
+static void ft_trans_flush(QEMUFileFtTrans *s)
+{
+    size_t offset = 0;
+
+    if (s->has_error) {
+        error_report("flush when error %d, bailing", s->has_error);
+        return;
+    }
+
+    while (offset < s->put_offset) {
+        ssize_t ret;
+
+        ret = s->put_buffer(s->opaque, s->buf + offset, s->put_offset - offset);
+        if (ret == -EAGAIN) {
+            break;
+        }
+
+        if (ret <= 0) {
+            error_report("error flushing data, %s", strerror(errno));
+            s->has_error = FT_TRANS_ERR_FLUSH;
+            break;
+        } else {
+            offset += ret;
+        }
+    }
+
+    trace_ft_trans_flush(offset, s->put_offset);
+    memmove(s->buf, s->buf + offset, s->put_offset - offset);
+    s->put_offset -= offset;
+    s->freeze_output = !!s->put_offset;
+}
+
+static ssize_t ft_trans_put(void *opaque, void *buf, int size)
+{
+    QEMUFileFtTrans *s = opaque;
+    size_t offset = 0;
+    ssize_t len;
+
+    /* flush buffered data before putting next */
+    if (s->put_offset) {
+        ft_trans_flush(s);
+    }
+
+    while (!s->freeze_output && offset < size) {
+        len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+        if (len == -EAGAIN) {
+            trace_ft_trans_freeze_output();
+            s->freeze_output = 1;
+            break;
+        }
+
+        if (len <= 0) {
+            error_report("putting data failed, %s", strerror(errno));
+            s->has_error = 1;
+            offset = -EINVAL;
+            break;
+        }
+
+        offset += len;
+    }
+
+    if (s->freeze_output) {
+        ft_trans_append(s, buf + offset, size - offset);
+        offset = size;
+    }
+
+    return offset;
+}
+
+static int ft_trans_send_header(QEMUFileFtTrans *s,
+                                enum QEMU_VM_TRANSACTION_STATE state,
+                                uint32_t payload_len)
+{
+    int ret;
+    FtTransHdr *hdr = &s->h
[PATCH 12/18] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.
Record mmio write event to replay it upon failover.

Signed-off-by: Yoshiaki Tamura
---
 exec.c |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index e950df2..c81fd09 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,7 @@
 #include "osdep.h"
 #include "kvm.h"
 #include "qemu-timer.h"
+#include "event-tap.h"
 #if defined(CONFIG_USER_ONLY)
 #include
 #include
@@ -3632,6 +3633,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                 io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
                 if (p)
                     addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+                event_tap_mmio(addr, buf, len);
+
                 /* XXX: could force cpu_single_env to NULL to avoid
                    potential bugs */
                 if (l >= 4 && ((addr1 & 3) == 0)) {
--
1.7.1.2
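For readers who want a feel for what "record and replay" means here, below is a minimal sketch of an MMIO write log. It is not the event-tap implementation, which this patch does not show: MmioEvent, MmioLog, mmio_log_record(), mmio_log_replay() and mmio_apply_to_ram() are hypothetical names invented for the illustration. The assumption is that each write is copied when it happens and that the copies are re-issued in order after failover.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of one MMIO write; not the event-tap.h layout. */
typedef struct MmioEvent {
    uint64_t addr;
    size_t len;
    uint8_t *data;              /* private copy of the written bytes */
} MmioEvent;

typedef struct MmioLog {
    MmioEvent *events;
    size_t count;
    size_t capacity;
} MmioLog;

typedef void MmioWriteFunc(void *opaque, uint64_t addr,
                           const uint8_t *buf, size_t len);

/* Copy one write into the log. Returns 0 on success, -1 if out of memory. */
int mmio_log_record(MmioLog *log, uint64_t addr,
                    const uint8_t *buf, size_t len)
{
    uint8_t *copy;

    assert(log != NULL && buf != NULL);
    if (log->count == log->capacity) {
        size_t new_cap = log->capacity ? log->capacity * 2 : 8;
        MmioEvent *grown = realloc(log->events, new_cap * sizeof(*grown));
        if (!grown) {
            return -1;
        }
        log->events = grown;
        log->capacity = new_cap;
    }
    copy = malloc(len ? len : 1);
    if (!copy) {
        return -1;
    }
    memcpy(copy, buf, len);
    log->events[log->count].addr = addr;
    log->events[log->count].len = len;
    log->events[log->count].data = copy;
    log->count++;
    return 0;
}

/* Re-issue the logged writes in order, then empty the log.
 * Returns the number of writes replayed. */
size_t mmio_log_replay(MmioLog *log, MmioWriteFunc *write_fn, void *opaque)
{
    size_t i, n = log->count;

    for (i = 0; i < n; i++) {
        write_fn(opaque, log->events[i].addr,
                 log->events[i].data, log->events[i].len);
        free(log->events[i].data);
    }
    free(log->events);
    memset(log, 0, sizeof(*log));
    return n;
}

/* Sample replay target: treat opaque as a flat byte array. */
void mmio_apply_to_ram(void *opaque, uint64_t addr,
                       const uint8_t *buf, size_t len)
{
    memcpy((uint8_t *)opaque + addr, buf, len);
}
```

Replaying in recording order matters when two writes overlap, since the later write has to win.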
[PATCH 15/18] savevm: introduce qemu_savevm_trans_{begin,commit}.
Introduce qemu_savevm_trans_{begin,commit} to send the memory and
device info together, while avoiding cancelling memory state tracking.
This patch also abstracts common code between
qemu_savevm_state_{begin,iterate,commit}.

Signed-off-by: Yoshiaki Tamura
---
 savevm.c |  157 +++---
 sysemu.h |    2 +
 2 files changed, 101 insertions(+), 58 deletions(-)

diff --git a/savevm.c b/savevm.c
index e44eccd..1c2a7fb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1601,29 +1601,68 @@ bool qemu_savevm_state_blocked(Monitor *mon)
     return false;
 }
 
-int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared)
+/*
+ * section: header to write
+ * inc: if true, forces to pass SECTION_PART instead of SECTION_START
+ * pause: if true, breaks the loop when live handler returned 0
+ */
+static int qemu_savevm_state_live(Monitor *mon, QEMUFile *f, int section,
+                                  bool inc, bool pause)
 {
     SaveStateEntry *se;
+    int skip = 0, ret;
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
-        if(se->set_params == NULL) {
+        int len, stage;
+
+        if (se->save_live_state == NULL) {
             continue;
-        }
-        se->set_params(blk_enable, shared, se->opaque);
+        }
+
+        /* Section type */
+        qemu_put_byte(f, section);
+        qemu_put_be32(f, se->section_id);
+
+        if (section == QEMU_VM_SECTION_START) {
+            /* ID string */
+            len = strlen(se->idstr);
+            qemu_put_byte(f, len);
+            qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+            qemu_put_be32(f, se->instance_id);
+            qemu_put_be32(f, se->version_id);
+
+            stage = inc ? QEMU_VM_SECTION_PART : QEMU_VM_SECTION_START;
+        } else {
+            assert(inc);
+            stage = section;
+        }
+
+        ret = se->save_live_state(mon, f, stage, se->opaque);
+        if (!ret) {
+            skip++;
+            if (pause) {
+                break;
+            }
+        }
     }
-
-    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
-    qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+    return skip;
+}
+
+static void qemu_savevm_state_full(QEMUFile *f)
+{
+    SaveStateEntry *se;
 
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         int len;
 
-        if (se->save_live_state == NULL)
+        if (se->save_state == NULL && se->vmsd == NULL) {
             continue;
+        }
 
         /* Section type */
-        qemu_put_byte(f, QEMU_VM_SECTION_START);
+        qemu_put_byte(f, QEMU_VM_SECTION_FULL);
         qemu_put_be32(f, se->section_id);
 
         /* ID string */
@@ -1634,9 +1673,29 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
         qemu_put_be32(f, se->instance_id);
         qemu_put_be32(f, se->version_id);
 
-        se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+        vmstate_save(f, se);
+    }
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+}
+
+int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
+                            int shared)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+        if (se->set_params == NULL) {
+            continue;
+        }
+        se->set_params(blk_enable, shared, se->opaque);
     }
 
+    qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+    qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+    qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_START, 0, 0);
+
     if (qemu_file_has_error(f)) {
         qemu_savevm_state_cancel(mon, f);
         return -EIO;
@@ -1647,29 +1706,16 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f)
 {
-    SaveStateEntry *se;
     int ret = 1;
 
-    QTAILQ_FOREACH(se, &savevm_handlers, entry) {
-        if (se->save_live_state == NULL)
-            continue;
-
-        /* Section type */
-        qemu_put_byte(f, QEMU_VM_SECTION_PART);
-        qemu_put_be32(f, se->section_id);
-
-        ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
-        if (!ret) {
-            /* Do not proceed to the next vmstate before this one reported
-               completion of the current stage. This serializes the migration
-               and reduces the probability that a faster changing state is
-               synchronized over and over again. */
-            break;
-        }
-    }
-
-    if (ret)
+    /* Do not proceed to the next vmstate before this one reported
+       completion of the current stage. This serializes the migration
+       and reduces the probability that a faster changing state is
+       synchronized over and over again. */
+    ret = qemu_savevm_state_live(mon, f, QEMU_VM_SECTION_PART, 1, 1);
+    if (!ret) {
         return 1;
+    }
 
     if (qemu_file_has_error(f)) {
         qemu_savevm_state_cancel(mon, f);
@@ -1681,46 +1727,41 @@ int qemu_save
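The control flow of the shared helper (walk the handlers, count those that are not finished, optionally stop at the first such one) can be tried outside QEMU. The sketch below is only a model under assumptions: DemoHandler, demo_state_live() and demo_countdown() are made-up names, the section headers written to the stream are left out, and a handler is assumed to return nonzero once its stage is complete, as the comment in the patch suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical live-save handler: nonzero means "stage complete". */
typedef int DemoLiveFunc(void *opaque);

typedef struct DemoHandler {
    DemoLiveFunc *save_live;    /* may be NULL, like save_live_state */
    void *opaque;
} DemoHandler;

/* Returns how many handlers reported "not finished yet".
 * With pause set, stop at the first unfinished handler so that later
 * handlers are not visited before it completes. */
int demo_state_live(const DemoHandler *h, size_t n, bool pause)
{
    int skip = 0;
    size_t i;

    assert(h != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (h[i].save_live == NULL) {
            continue;
        }
        if (!h[i].save_live(h[i].opaque)) {
            skip++;
            if (pause) {
                break;
            }
        }
    }
    return skip;
}

/* Sample handler: opaque points at a count of remaining rounds. */
int demo_countdown(void *opaque)
{
    int *remaining = opaque;

    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}
```

A return value of 0 from the helper then means every live handler finished, which is the case where qemu_savevm_state_iterate() in the patch returns 1.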
[PATCH 13/18] net: insert event-tap to qemu_send_packet() and qemu_sendv_packet_async().
event-tap function is called only when it is on.

Signed-off-by: Yoshiaki Tamura
---
 net.c |    9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/net.c b/net.c
index 9ba5be2..1176124 100644
--- a/net.c
+++ b/net.c
@@ -36,6 +36,7 @@
 #include "qemu-common.h"
 #include "qemu_socket.h"
 #include "hw/qdev.h"
+#include "event-tap.h"
 
 static QTAILQ_HEAD(, VLANState) vlans;
 static QTAILQ_HEAD(, VLANClientState) non_vlan_clients;
@@ -559,6 +560,10 @@ ssize_t qemu_send_packet_async(VLANClientState *sender,
 
 void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
 {
+    if (event_tap_is_on()) {
+        return event_tap_send_packet(vc, buf, size);
+    }
+
     qemu_send_packet_async(vc, buf, size, NULL);
 }
 
@@ -657,6 +662,10 @@ ssize_t qemu_sendv_packet_async(VLANClientState *sender,
 {
     NetQueue *queue;
 
+    if (event_tap_is_on()) {
+        return event_tap_sendv_packet_async(sender, iov, iovcnt, sent_cb);
+    }
+
     if (sender->link_down || (!sender->peer && !sender->vlan)) {
         return calc_iov_length(iov, iovcnt);
     }
--
1.7.1.2
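The hook pattern used in both functions, an early return into the tap when it is enabled and the normal send path otherwise, can be modelled in a few lines. This is a sketch, not the event-tap code: DemoNet and the demo_* functions are hypothetical, and holding the packet until an explicit release is an assumption about what the tap does with it, since this patch only shows the call sites.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_PKT 64

typedef struct DemoNet {
    bool tap_on;
    size_t delivered;               /* packets that reached the send path */
    size_t held;                    /* 0 or 1 in this sketch */
    uint8_t held_buf[DEMO_MAX_PKT];
    size_t held_len;
} DemoNet;

/* Stand-in for the normal send path: only counts the packet. */
long demo_deliver(DemoNet *net, const uint8_t *buf, size_t size)
{
    (void)buf;
    net->delivered++;
    return (long)size;
}

/* Stand-in for the tap: keep a copy instead of sending. */
long demo_tap_capture(DemoNet *net, const uint8_t *buf, size_t size)
{
    assert(size <= DEMO_MAX_PKT);
    memcpy(net->held_buf, buf, size);
    net->held_len = size;
    net->held = 1;
    return (long)size;              /* report the packet as accepted */
}

/* Same shape as the patched send functions: tap first, else send. */
long demo_send_packet(DemoNet *net, const uint8_t *buf, size_t size)
{
    if (net->tap_on) {
        return demo_tap_capture(net, buf, size);
    }
    return demo_deliver(net, buf, size);
}

/* Release the held packet, for example once state has been synchronized. */
long demo_tap_release(DemoNet *net)
{
    if (!net->held) {
        return 0;
    }
    net->held = 0;
    return demo_deliver(net, net->held_buf, net->held_len);
}
```

Note that the release path calls demo_deliver() directly rather than demo_send_packet(); going back through the hooked entry point while the tap is still on would capture the packet again.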
[PATCH 01/18] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
Currently buf size is fixed at 32KB. It would be useful if it could be
flexible.

Signed-off-by: Yoshiaki Tamura
---
 hw/hw.h  |    2 ++
 savevm.c |   20 +++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 5e24329..a168a37 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 6d83b0f..6c4c72b 100644
--- a/savevm.c
+++ b/savevm.c
@@ -171,7 +171,8 @@ struct QEMUFile {
                            when reading */
     int buf_index;
     int buf_size; /* 0 when writing */
-    uint8_t buf[IO_BUF_SIZE];
+    int buf_max_size;
+    uint8_t *buf;
 
     int has_error;
 };
@@ -422,6 +423,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
     f->get_rate_limit = get_rate_limit;
     f->is_write = 0;
 
+    f->buf_max_size = IO_BUF_SIZE;
+    f->buf = qemu_malloc(sizeof(uint8_t) * f->buf_max_size);
+
     return f;
 }
 
@@ -452,6 +456,19 @@ void qemu_fflush(QEMUFile *f)
     }
 }
 
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+    f->buf_max_size = size;
+    f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+    return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+    f->buf_size = f->buf_index = f->buf_offset = 0;
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
@@ -477,6 +494,7 @@ int qemu_fclose(QEMUFile *f)
     qemu_fflush(f);
     if (f->close)
         ret = f->close(f->opaque);
+    qemu_free(f->buf);
     qemu_free(f);
     return ret;
 }
--
1.7.1.2
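The change from an embedded array to a heap buffer with a separate capacity field can be tried in isolation with plain libc. The sketch below uses hypothetical names (DemoBuf, demo_buf_*) and differs from the patch in two ways that are choices of the sketch, not of the patch: realloc() failure is reported to the caller, and a resize that would cut off bytes still in use is refused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoBuf {
    uint8_t *data;
    size_t max_size;    /* allocated bytes, like buf_max_size in the patch */
    size_t used;        /* valid bytes currently held */
} DemoBuf;

/* Allocate the initial buffer. Returns 0 on success, -1 on failure. */
int demo_buf_init(DemoBuf *b, size_t size)
{
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
    if (size == 0) {
        return -1;
    }
    b->data = malloc(size);
    if (!b->data) {
        return -1;
    }
    b->max_size = size;
    return 0;
}

/* Change the capacity. Refuses sizes that would drop bytes in use. */
int demo_buf_resize(DemoBuf *b, size_t size)
{
    uint8_t *grown;

    if (size == 0 || size < b->used) {
        return -1;
    }
    grown = realloc(b->data, size);
    if (!grown) {
        return -1;      /* the old block is still valid and unchanged */
    }
    b->data = grown;
    b->max_size = size;
    return 0;
}

/* Forget the contents but keep the allocation for reuse. */
void demo_buf_clear(DemoBuf *b)
{
    b->used = 0;
}

void demo_buf_free(DemoBuf *b)
{
    free(b->data);
    memset(b, 0, sizeof(*b));
}
```

Assigning the result of realloc() to a temporary first avoids losing the original pointer when the call fails.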
[PATCH 03/18] Introduce skip_header parameter to qemu_loadvm_state().
Introduce skip_header parameter to qemu_loadvm_state() so that it can
be called iteratively without reading the header.

Signed-off-by: Yoshiaki Tamura
---
 migration.c |    2 +-
 savevm.c    |   24 +---
 sysemu.h    |    2 +-
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/migration.c b/migration.c
index f0df5fc..dd3bf94 100644
--- a/migration.c
+++ b/migration.c
@@ -63,7 +63,7 @@ int qemu_start_incoming_migration(const char *uri)
 
 void process_incoming_migration(QEMUFile *f)
 {
-    if (qemu_loadvm_state(f) < 0) {
+    if (qemu_loadvm_state(f, 0) < 0) {
         fprintf(stderr, "load of migration failed\n");
         exit(0);
     }
diff --git a/savevm.c b/savevm.c
index 6c4c72b..58e48e3 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1716,7 +1716,7 @@ typedef struct LoadStateEntry {
     int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
     QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
         QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1729,17 +1729,19 @@ int qemu_loadvm_state(QEMUFile *f)
         return -EINVAL;
     }
 
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
 
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
     }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;
 
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
@@ -2062,7 +2064,7 @@ int load_vmstate(const char *name)
         return -EINVAL;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
 
     qemu_fclose(f);
     if (ret < 0) {
diff --git a/sysemu.h b/sysemu.h
index 23ae17e..c86b4e8 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -81,7 +81,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 /* SLIRP */
 void do_info_slirp(Monitor *mon);
--
1.7.1.2
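To see why the flag makes iterative loading possible, here is a small stand-alone model of a loader that reads a magic/version header once and then consumes several EOF-terminated runs of sections from the same stream. Everything in it is an assumption made for the illustration: DemoStream and the demo_* functions are invented, the DEMO_* constants are placeholders, and each section is reduced to a single type byte with no payload.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAGIC   0x5145564dU    /* placeholder values for the sketch */
#define DEMO_VERSION 3U
#define DEMO_EOF     0x00

typedef struct DemoStream {
    const uint8_t *data;
    size_t len;
    size_t pos;
} DemoStream;

/* Read one byte; an exhausted stream reads as EOF markers. */
int demo_get_byte(DemoStream *s)
{
    assert(s != NULL);
    return s->pos < s->len ? s->data[s->pos++] : DEMO_EOF;
}

/* Read a 32-bit big-endian value. */
uint32_t demo_get_be32(DemoStream *s)
{
    uint32_t v = 0;
    int i;

    for (i = 0; i < 4; i++) {
        v = (v << 8) | (uint32_t)demo_get_byte(s);
    }
    return v;
}

/* Returns the number of sections read up to the next EOF marker,
 * or a negative errno value if the header check fails. */
int demo_load(DemoStream *s, int skip_header)
{
    int sections = 0;

    if (!skip_header) {
        if (demo_get_be32(s) != DEMO_MAGIC) {
            return -EINVAL;
        }
        if (demo_get_be32(s) != DEMO_VERSION) {
            return -ENOTSUP;
        }
    }
    while (demo_get_byte(s) != DEMO_EOF) {
        sections++;
    }
    return sections;
}
```

The first call passes skip_header = 0 and validates the stream; later calls on the same stream pass 1 and start directly at the next section byte.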