Re: [PATCH 2/3] Refactor zone_reclaim (v2)
* MinChan Kim [2010-12-14 19:01:26]: > Hi Balbir, > > On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh > wrote: > > Move reusable functionality outside of zone_reclaim. > > Make zone_reclaim_unmapped_pages modular > > > > Signed-off-by: Balbir Singh > > --- > > mm/vmscan.c | 35 +++ > > 1 files changed, 23 insertions(+), 12 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index e841cae..4e2ad05 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone > > *zone) > > } > > > > /* > > + * Helper function to reclaim unmapped pages, we might add something > > + * similar to this for slab cache as well. Currently this function > > + * is shared with __zone_reclaim() > > + */ > > +static inline void > > +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, > > + unsigned long nr_pages) > > +{ > > + int priority; > > + /* > > + * Free memory by calling shrink zone with increasing > > + * priorities until we have enough memory freed. > > + */ > > + priority = ZONE_RECLAIM_PRIORITY; > > + do { > > + shrink_zone(priority, zone, sc); > > + priority--; > > + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); > > +} > > As I said previous version, zone_reclaim_unmapped_pages doesn't have > any functions related to reclaim unmapped pages. > The function name is rather strange. > It would be better to add scan_control setup in function inner to > reclaim only unmapped pages. OK, that is an idea worth looking at, I'll revisit this function. Thanks for the review! -- Three Cheers, Balbir -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
soft lockup
What does "soft lockup" mean?

Dec 14 02:35:18 hp1 kernel: [1492483.960150] BUG: soft lockup - CPU#1 stuck for 61s! [kvm:32398]

It's associated with a loss of OCFS2 connectivity and other problems following.

Best regards,
Andreas Rittershofer

--
There could be no signature here.
KVM Test report, kernel d335b15... qemu cb1983b8...
Hi, all,

This is the KVM test result against kvm.git d335b156f9fafd177d0606cf845d9a2df2dc5431 and qemu-kvm.git cb1983b8809d0e06a97384a40bad1194a32fc814.

Currently the qemu-kvm build fails on RHEL5 with an undeclared "PCI_PM_CTRL_NO_SOFT_RST" error. I saw that a fix patch has already been posted to the mailing list.

Two bugs got fixed.

Fixed issues:
1. Guest qemu processor will be defunct process by be killed
   https://bugzilla.kernel.org/show_bug.cgi?id=23612
2. [SR] qemu return form "migrate " command spend long time
   https://sourceforge.net/tracker/?func=detail&aid=2942079&group_id=180599&atid=893831

Four old issues:
1. ltp diotest running time is 2.54 times than before
   https://sourceforge.net/tracker/?func=detail&aid=2723366&group_id=180599&atid=893831
2. 32bits Rhel5/FC6 guest may fail to reboot after installation
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1991647&group_id=180599
3. perfctr wrmsr warning when booting 64bit RHEl5.3
   https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831
4. [KVM] Noacpi Windows guest can not boot up on 32bit KVM host
   https://bugzilla.kernel.org/show_bug.cgi?id=21402

Best Regards,
Xudong Hao
[ kvm-Bugs-2942079 ] [SR] qemu return form "migrate " command spend long time
Bugs item #2942079, was opened at 2010-01-29 17:13 Message generated for change (Comment added) made by haoxudong You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2942079&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open >Resolution: Fixed Priority: 5 Private: No Submitted By: Xudong Hao (haoxudong) Assigned to: Nobody/Anonymous (nobody) Summary: [SR] qemu return form "migrate " command spend long time Initial Comment: Environment: kvm.git Commit: 51ef04ce3219d05c88f204342b2db294b5590d0a qemu-kvm Commit: 3e6f07b0c86b7fabfce72c1a42e54b2ad79dc587 Host Kernel Version: 2.6.33-rc4 Bug detailed description: -- KVM guest Save-Restore command changed, the new command function can work. However, when we do in guest qemu console, it will cost much long time to return(~2 minutes of a 256M memory guest), the speed of saving only has ~1MB/s 344640+0 records in 344640+0 records out 176455680 bytes (176 MB) copied, 183.226 s, 963 kB/s Reproduce steps: 1) qemu-system-x86_64 -m 256 -smp 4 -net nic,macaddr=00:16:3e:57:87:39,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup -hda /share/xvs/var/guest.img 2) Ctrl+Alt+2migrate "exec:dd of=test.img" This step takes ~2 minutes with a 256MB memory guest, it will be more long for a >256MB memory guest. 3) qemu-system-x86_64 -m 256 -smp 4 -net nic,macaddr=00:16:3e:57:87:39,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup -hda /share/xvs/var/guest.img --incoming "exec:dd if=test.img" -- >Comment By: Xudong Hao (haoxudong) Date: 2010-12-15 14:02 Message: On kvm 66fc6be8d2b04153b753182610f919faf9c705bc and qemu-kvm 53b6d3d5c2522e881c8d194f122de3114f6f76eb, the issue is not exist. it will take <10s to save, speed: ~45MB/s mark this bug fixed and verified. 
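The figures in the report can be cross-checked with a little arithmetic. The sketch below assumes dd ran with its default 512-byte block size (the report does not say so, but 344640 records of 512 bytes give exactly the reported byte count) and that dd's "kB" means 1000 bytes. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes moved by dd: record count times block size. */
static uint64_t dd_bytes(uint64_t records, uint64_t block_size)
{
    return records * block_size;
}

/* Average transfer rate in kB/s, with 1 kB = 1000 bytes. */
static double rate_kb_per_s(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1000.0;
}
```

With the numbers from the initial comment, dd_bytes(344640, 512) gives 176455680 bytes, and dividing by 183.226 s lands at roughly 963 kB/s, which matches the dd output quoted in the report.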
KVM PCI passthrough issues, RTL-8169 PCI NICs
Hi, I have been trying to get the PCI pass through working on my Asus Crosshair IV Formula motherboard. The motherboard does support IOMMU (AMD-Vi), and IOMMU as well as SVM are enabled in the BIOS. PCI pass through does seem to work just fine when it comes to with built in network device (Marvel 8059 Yukon), however I can't get it working with two PCI RTL-8169 network cards. Both devices are bound to pci-stub. Every time I attempt I get a message that the device is busy, as shown below: PCI region 1 at address 0xf9dff800 has size 0x100, which is not a multiple of 4K. You might experience some performance hit due to that. Failed to assign device "(null)" : Device or resource busy *** The driver 'pci-stub' is occupying your device :01:05.0. The kernel logs show the following: Dec 14 10:19:55 phalsenet kernel: [ 1718.806644] pci-stub :01:05.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20 Dec 14 10:19:55 phalsenet kernel: [ 1718.836741] pci-stub :01:05.0: restoring config space at offset 0x1 (was 0x2b00400, writing 0x2b00103) Dec 14 10:19:55 phalsenet kernel: [ 1718.903118] assign device 0:1:5.0 failed Dec 14 10:19:55 phalsenet kernel: [ 1718.903161] pci-stub :01:05.0: PCI INT A disabled The box is running Gentoo Linux. I have tested with 2.6.34 and 2.6.36 kernels, with kvm and kvm-amd modules that came with the kernels as well as with the latest kvm-kmod sources (2.6.36.1). The qemu-kvm version is 0.13.0. Here lspci -v output from one of the network devices: 01:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10) Subsystem: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet Flags: 66MHz, medium devsel, IRQ 20 I/O ports at b400 [size=256] Memory at f9dff800 (32-bit, non-prefetchable) [size=256] Expansion ROM at f9da [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Kernel driver in use: pci-stub Any help would be appreciated. 
Regards,
Andrew
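As a side note on the "not a multiple of 4K" warning quoted above: it appears to be only a performance hint and is separate from the "Device or resource busy" failure. A minimal sketch of the kind of check behind such a warning could look like the following. The helper name and the extra address-alignment test are assumptions for illustration, not qemu-kvm's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a PCI memory region starts on a page
 * boundary and spans whole pages, so it could be mapped page by page.
 */
static bool region_is_page_granular(uint64_t addr, uint64_t size,
                                    uint64_t page_size)
{
    /* page_size must be a non-zero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    return size != 0 &&
           (addr & (page_size - 1)) == 0 &&
           (size & (page_size - 1)) == 0;
}
```

For the region from the log (address 0xf9dff800, size 0x100) the check fails on both counts with 4K pages, which is consistent with the warning.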
AW: Freezing Windows 2008 x64bit guest
Vadim Rozenfeld redhat.com> writes: > > On Mon, 2010-12-13 at 22:12 +0200, Dor Laor wrote: > > On 12/13/2010 09:42 PM, Manfred Heubach wrote: > > > > > > I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly because > > > of performance problems which were solved by the upgrade. > > > > > > After the upgrade the system became extremly unstable. It was crashing as > > > soon > > > as disk io and network io load was growing. 100% reproduceable with > > > windows > > > server backup to an iscsi volume. > > > > > > i had virtio drivers for storage and network installed (redhat/fedora > > > 1.1.11). > > > > Which fedora/rhel release is that? The host is Ubuntu 10.10 x64 The drivers are from http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/ 1.1.11-0 released on 17-Aug-2010 - are there any newer drivers? > > What's the windows virtio driver version? The virtio storage version shown in Windows is 6.0.0.10 > > > > Have you tried using virt-manager/virhs instead of raw cmdline? 
I'm starting it with libvirt/virsh cmd-line copied from the log (and some log entries): LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name sbs2008 -uuid 933c2ef2-e5b0-0b39-db60-016b5d226534 -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/sbs2008.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=localtime -boot c -drive file=/var/lib/libvirt/images/olscanner/virtio-win-1.1.11-0.iso,if=none,media=cdrom, id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/lib/libvirt/images/sbs2008/sbs2008.img,if=none,id=drive-virtio-disk0, boot=on,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/dev/volg1/sbsdata,if=none,id=drive-virtio-disk1,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/dev/volg1/wsus,if=none,id=drive-virtio-disk2,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2 -device e1000,vlan=0,id=net0,mac=52:54:00:8a:bc:c9,bus=pci.0,addr=0x6 -net tap,fd=107,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k de -vga std -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 07:12:02.715: debug : qemudInitCpuAffinity:2423 : Setting CPU affinity 07:12:02.717: debug : qemuSecurityDACSetProcessLabel:547 : Dropping privileges of VM to 105:114 char device redirected to /dev/pts/0 pci_add_option_rom: failed to find romfile "pxe-e1000.bin" > > About e1000, some windows comes with buggy driver and an update e1000 > > from Intel fixes some issues. 
> > I'm running latest drivers from Intel: 8.3.15.0 > > > > > At each BSOD I had the following line in the log of the guest: > > > > > > virtio_ioport_write: unexpected address 0x13 value 0x1 > > > > > > I changed the network interface back to e1000. What I experience now (and I had > > > that a the very beginning before i switched to virtio network) are freezes. The > > > guest doesn't respond anymore (doesn't answer to pings and doesn't interact via > > > mouse/keyboard anymore). Host CPU usage of the kvm process is 100% on as > > > many > > > cores as there are virtual cpus (in this case 4). I had a crash today but no logentry on the host - but that could be because I had to restart syslog (ran out of diskspace after turning on debug logging ob libvirtd - didn't think that it would generate 6 GB of logs per day :-) > > > > Sounds like an interrupt storm to me. Can you try to ping your VM? No responds to ping. > Anyway the best way to start debugging a stalled system is just to crash > it with BSOD. For doing it you will need: > - enable NMICrashDump (please see http://support.microsoft.com/kb/927069 > for more information > - enable Kernel Memory Dump (actually Complete is much better, but it > can be too big) http://support.microsoft.com/kb/969028 > - you only will need to type "nmi 0" in the qemu monitor to crash the > system, when the system hangs next time. I prepared this. When the system crashed today I didn't have the complete memory dump ready - so I only have a minidump. The intersting point is that the system today crashed with a BSOD and didn't freeze. The result of dumpchk.exe is as follows: Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64 Copyright (c) Microsoft Corporation. All rights reserved. 
Loading Dump File [c:\Windows\Minidump\Mini121410-01.dmp] Mini Kernel Dump File: Only registers and stack trace are available Symbol search path is: SRV*http://msdl.microsoft.com/download/symbols Executable search path is: Windows Server 2008/Windows Vista Kernel Version 6002 (Service Pack 2) MP (4 procs) Free x64 Product: LanManNt, suite
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/10 14:46, Anthony Liguori wrote:
> On 12/14/2010 01:54 PM, David S. Ahern wrote:
>>
>> On 12/14/10 12:49, Anthony Liguori wrote:
>>
>>> But that doesn't tell you what the impact is in real world workloads.
>>> Before we start pushing all device emulation into the kernel, we need to
>>> quantify how often gettimeofday() is really called in real workloads.
>>>
>> The workload that inspired that example program at its current max load
>> calls gtod upwards of 1000 times per second. The overhead of
>> gettimeofday was the biggest factor when comparing performance to bare
>> metal and esx. That's why I wrote the test program --- boils a complex
>> product/program to a single system call.
>>
>
> So the absolute performance impact was on the order of what?

At the time I did the investigations (18-24 months ago) KVM was on the order of 15-20% worse for a RHEL4 based workload and the overhead appeared to be due to the PIT or PM timer as the clock source. Switching the clock to the TSC brought the performance on par with bare metal, but that route has other issues.

>
> The difference in CPU time of a light weight vs. heavy weight exit
> should be something like 2-3us. That would mean 2-3ms of CPU time at a
> rate of 1000 per second.

The PIT causes 3 VMEXITs for each gettimeofday (get_offset_pit in RHEL4):

    /* timer count may underflow right here */
    outb_p(0x00, PIT_MODE); /* latch the count ASAP */
    ...
    count = inb_p(PIT_CH0); /* read the latched count */
    ...
    count |= inb_p(PIT_CH0) << 8;
    ...

David

>
> That should be pretty much in the noise.
>
> There are possibly second order effects that might make a large impact
> such as contention with the qemu_mutex. It's worth doing
> experimentation to see if a non-mutex acquiring fast path in userspace
> also resulted in a significant performance boost.
>
> Regards,
>
> Anthony Liguori
>
>> David
>>
>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>>> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>>>>
>>>> David
>>>
>
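The two estimates in this exchange differ only in how many exits one gettimeofday() costs, so they can be put side by side with simple arithmetic. The sketch below takes the thread's own figures as inputs (1000 calls per second, 2-3us extra per heavy weight exit, 3 exits per call on the PIT path). The function names are invented for illustration and nothing here measures a real system.

```c
#include <assert.h>

/*
 * Extra CPU time, in microseconds per second of wall time, spent on
 * heavy weight exits for a given call rate.
 */
static double exit_overhead_us_per_s(double calls_per_s,
                                     double exits_per_call,
                                     double extra_us_per_exit)
{
    assert(calls_per_s >= 0 && exits_per_call >= 0 && extra_us_per_exit >= 0);
    return calls_per_s * exits_per_call * extra_us_per_exit;
}

/* The same overhead expressed as a percentage of one CPU. */
static double as_cpu_percent(double us_per_s)
{
    return us_per_s / 1e6 * 100.0;
}
```

With one exit per call this reproduces the 2-3ms per second quoted above. With three exits per call it becomes 6-9ms per second, still below 1% of one CPU under these assumptions, which would leave room for the second order effects mentioned in the thread.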
Re: USB Passthrough 1.1 performance problem...
2010/12/14 Erik Brakkee : > Daniel P. Berrange wrote: >> >> On Tue, Dec 14, 2010 at 12:55:04PM +0100, Kenni Lund wrote: >> >>> >>> 2010/12/14 Erik Brakkee: >>> > > From: Kenni Lund > 2010/12/14 Erik Brakkee: > >>> >>> From: Kenni Lund >>> Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500 might work now? >>> >>> Passthrough of a PVR-500 has been working for a long time. I've been >>> running with passthrough of a PVR-500 in my HTPC, since >>> November/December 2009...so it should work with any recent kernel and >>> any recent version of qemu-kvm you can find today - No patching >>> needed. The only issue I had with the PVR-500 card, was when *I* >>> didn't free up the shared interrupts...once I fixed that, it "just >>> worked". >>> >> >> How did you free up those shared interrupts then? I tried different >> slots >> but always get conflicts with the USB irqs. >> > > I did an unbind of the conflicting device (eg. disabled it). I moved > the PVR-500 card around in the different slots and once I got a > conflict with the integrated sound card, I left the PVR-500 card in > that slot (it's a headless machine, so no need for sound) and > configured unbind of the sound card at boot time. On my old system I > think it was conflicting with one of the USB controllers as well, but > it didn't really matter, as I only lost a few of the ports on the back > of the computer for that particular USB controller - I still had > plenty of USB ports left and if I really needed more ports, I could > just plug in an extra USB PCI card. 
> > My /etc/rc.local boot script looks like the following today: > -- > #Remove HDA conflicting with ivtv1 > echo ":00:1b.0"> /sys/bus/pci/drivers/HDA\ Intel/unbind > > # ivtv0 > echo " 0016"> /sys/bus/pci/drivers/pci-stub/new_id > echo ":04:08.0"> /sys/bus/pci/drivers/ivtv/unbind > echo ":04:08.0"> /sys/bus/pci/drivers/pci-stub/bind > echo " 0016"> /sys/bus/pci/drivers/pci-stub/remove_id > > # ivtv1 > echo " 0016"> /sys/bus/pci/drivers/pci-stub/new_id > echo ":04:09.0"> /sys/bus/pci/drivers/ivtv/unbind > echo ":04:09.0"> /sys/bus/pci/drivers/pci-stub/bind > echo " 0016"> /sys/bus/pci/drivers/pci-stub/remove_id > I did not try unbinding the usb device so I can also try that. I don'.t understand what is happening with the 0016. I configured the pci card in kvm and I believe kvm does the binding to pci-stub in recent versions. Where is the 0016%oming from? >>> >>> Okay, qemu-kvm might do it today, I don't know - I haven't changed >>> that script for the past year. But are you sure that it's not >>> libvirt/virsh/virt-manager which does that for you? >>> >> >> If you use the managed="yes" attribute on the in libvirt >> XML, then libvirt will automatically do the pcistub bind/unbind, >> followed by a device reset at guest startup& the reverse at shutdown. >> If you have conflicting devices on the bus though, libvirt won't >> attempt to unbind them, unless you had also explicitly assigned all >> those conflicting devices to the same guest. >> >> Daniel >> > > I definitely have to try again (right now having some stability problems on > the server that I am debugging). 
>
> The shared IRQs are as follows:
>
>  16:       0      0   0      0   0   0   0   0   IO-APIC-fasteoi   uhci_hcd:usb3
>  18:  252995      0   0      0   0   0   0   0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb8, ivtv0
>  19:   58281      0   0      0   0   0   0   0   IO-APIC-fasteoi   ata_piix, ata_piix, uhci_hcd:usb5, uhci_hcd:usb7, ivtv1
>  21:       0      0   0      0   0   0   0   0   IO-APIC-fasteoi   uhci_hcd:usb4
>  23:     713   6906   0  76919   0   0   0   0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6
>
> So I have IRQ sharing with usb1, usb8, usb5, usb7.

Uff, and your ata HDD controller. I guess I was much luckier than you are, my ivtv0 didn't conflict at all and ivtv1 only conflicted with USB.

> I have also read that
> ehci refers to USB 2.0 and uhci to USB 1.1 is that correct? Anyway, how
> would I now identify the USB PCI devices that I would need to unbind to get
> rid of the sharing with the USB ports?

Play around with:
lspci -v
lspci -n
lsusb -v
lsusb -t

You can also just start by unbinding the first one and take note when you hit the right ones...once you unbind one, it will disappear from cat /proc/interrupts. When you're down to having only ivtv0 on one interrupt and only ivtv1 on another interrupt, then you're ready to bind with
Re: [PATCH 5/5] pci-assign: Use PCI-2.3-based shared legacy interrupts
Am 14.12.2010 01:16, Alex Williamson wrote: > On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote: >> From: Jan Kiszka >> >> Enable the new KVM feature that allows legacy interrupt sharing for >> PCI-2.3-compliant devices. This requires to synchronize any guest >> change of the INTx mask bit to the kernel. >> >> Signed-off-by: Jan Kiszka >> --- >> hw/device-assignment.c | 38 +- >> qemu-kvm.c |8 >> qemu-kvm.h |3 +++ >> 3 files changed, 44 insertions(+), 5 deletions(-) >> >> diff --git a/hw/device-assignment.c b/hw/device-assignment.c >> index 26d3bd7..cf75c52 100644 >> --- a/hw/device-assignment.c >> +++ b/hw/device-assignment.c >> @@ -423,12 +423,21 @@ static uint8_t pci_find_cap_offset(PCIDevice *d, >> uint8_t cap, uint8_t start) >> return 0; >> } >> >> +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t >> devfn) >> +{ >> +return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn; >> +} >> + >> static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, >>uint32_t val, int len) >> { >> int fd; >> ssize_t ret; >> AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); >> +struct kvm_assigned_pci_dev assigned_dev_data; >> +#ifdef KVM_CAP_PCI_2_3 >> +bool intx_masked, update_intx_mask; >> +#endif /* KVM_CAP_PCI_2_3 */ >> >> DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", >>((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), >> @@ -439,6 +448,26 @@ static void assigned_dev_pci_write_config(PCIDevice *d, >> uint32_t address, >> } >> >> if (ranges_overlap(address, len, PCI_COMMAND, 2)) { >> +#ifdef KVM_CAP_PCI_2_3 >> +update_intx_mask = false; >> +if (address == PCI_COMMAND+1) { >> +intx_masked = val & (PCI_COMMAND_INTX_DISABLE >> 8); >> +update_intx_mask = true; >> +} else if (len >= 2) { >> +intx_masked = val & PCI_COMMAND_INTX_DISABLE; >> +update_intx_mask = true; >> +} > > I wonder if this might be a little cleaner as something like this. 
>
> if (ranges_overlap(address, len, PCI_COMMAND + 1, 1)) {
>     update_intx_mask = true;
>     intx_masked = (len == 1 ? val << 8 : val) & PCI_COMMAND_INTX_DISABLE;
> }

That should even obsolete update_intx_mask - will look into this, and also the merge bits thing.

Thanks!
Jan
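To make the suggested simplification concrete, here is a self-contained sketch of the decision it encodes: a config write touches the INTx disable bit only if it covers the upper byte of the command register, and a one-byte write has to be shifted into place before testing the bit. The local ranges_overlap() is a stand-in written for this example, intx_mask_from_write() is a hypothetical name, and the register constants follow the PCI specification. None of this is the code that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND              0x04   /* command register offset */
#define PCI_COMMAND_INTX_DISABLE 0x400  /* bit 10 of the command register */

/* Stand-in: true if [first1, first1+len1) and [first2, first2+len2) intersect. */
static bool ranges_overlap(uint32_t first1, uint32_t len1,
                           uint32_t first2, uint32_t len2)
{
    return first1 < first2 + len2 && first2 < first1 + len1;
}

/*
 * Returns true if the write covers the byte holding the INTx disable
 * bit; in that case *intx_masked receives the new state of the bit.
 */
static bool intx_mask_from_write(uint32_t address, int len, uint32_t val,
                                 bool *intx_masked)
{
    assert(intx_masked != NULL);
    assert(len == 1 || len == 2 || len == 4);

    if (!ranges_overlap(address, (uint32_t)len, PCI_COMMAND + 1, 1))
        return false;

    /* A byte write to PCI_COMMAND + 1 carries bits 8..15 in val[7:0]. */
    *intx_masked = ((len == 1 ? val << 8 : val) & PCI_COMMAND_INTX_DISABLE) != 0;
    return true;
}
```

Because the function's return value already says whether the mask needs updating, a separate update_intx_mask flag becomes unnecessary in this shape, which is the point made in the reply. Like the one-liner it is based on, the sketch assumes naturally aligned accesses.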
Re: [Qemu-devel] Re: [PATCHv8 00/16] boot order specification
On 14.12.2010, at 21:31, Benjamin Herrenschmidt wrote:

>
>> The only working system emulation we have are Macs (G3 beige, G4, G5),
>> so we can't just ignore Apple.
>> Alex even made me stick to their odd 0x41 rtas-version property. ;)
>
> Hah :-) Nothing ever used RTAS on these... afaik, it didn't even work
> properly.

Then let's not use rtas for the Mac machine, but rather go with Andreas' new machine. Changing the value there to what real FW uses on that machine is more than reasonable :)

Alex
Re: [PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word
Am 14.12.2010 21:47, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> +/**
>> + * get_irq_status - read interrupt line status word
>> + * @irq: Interrupt line of the status word
>> + *
>> + * This returns the current content of the status word associated with
>> + * the given interrupt line. See IRQS_* flags for details.
>> + */
>> +unsigned long get_irq_status(unsigned int irq)
>> +{
>> +    struct irq_desc *desc = irq_to_desc(irq);
>> +
>> +    return desc ? desc->irq_data.drv_status : 0;
>> +}
>> +EXPORT_SYMBOL_GPL(get_irq_status);
>
> We should document that this is a snapshot and in no way serialized
> against modifications of drv_status. I'll fix up the kernel doc.

Yeah, I think I had some hint on this in the previous version but apparently dropped it for this round.

Thanks,
Jan
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
Am 14.12.2010 22:46, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> From: Jan Kiszka
>>      chip_bus_lock(desc);
>>      retval = __setup_irq(irq, desc, action);
>>      chip_bus_sync_unlock(desc);
>>
>> -    if (retval)
>> +    if (retval) {
>> +        if (desc->action && !desc->action->next)
>> +            desc->irq_data.drv_status &= ~IRQS_SHARED;
>
> This is redundant. IRQS_SHARED gets set in a code path where all
> checks are done already.

Nope, it's also set before entry of __setup_irq in case we call an IRQF_ADAPTIVE handler.

We need to set it that early as we may race with IRQ events for the already registered handler happening between the sharing notification and the actual registration of the second handler.

Jan
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
Am 14.12.2010 21:54, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>>      /* Make sure it's not being used on another CPU: */
>>      synchronize_irq(irq);
>>
>> +    if (single_handler)
>> +        desc->irq_data.drv_status &= ~IRQS_SHARED;
>> +
>
> What's the reason to clear this flag outside of the desc->lock held
> region.

We need to synchronize the irq first before clearing the flag.

The problematic scenario behind this: An IRQ started in shared mode, thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED before calling into the threaded handler. And that handler may now think that the line is still masked as IRQS_SHARED is set.

> I need this status for other purposes as well, where I
> definitely need serialization.

Well, two options: wrap all bit manipulations with desc->lock acquisition/release or turn drv_status into an atomic. I don't know what your plans with drv_status are, so...

>
>> +    mutex_lock(&register_lock);
>> +
>> +    old_action = desc->action;
>> +    if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
>> +        !(desc->irq_data.drv_status & IRQS_SHARED)) {
>> +        /*
>> +         * Signal the old handler that it has to switch to shareable
>> +         * handling mode. Disable the line to avoid any conflict with
>> +         * a real IRQ.
>> +         */
>> +        disable_irq(irq);
>> +        local_irq_disable();
>> +
>> +        desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;
>
> Unserialized access as well.

Will think about it.

>
>> +        old_action->handler(irq, old_action->dev_id);
>> +        desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;
>
> Thanks,
>
>     tglx

Jan
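Of the two options mentioned for serializing drv_status, the "turn it into an atomic" variant can be illustrated in user space with C11 atomics. This is only an analogy: the kernel has its own atomic and bitop primitives, the struct and helper names below are invented, and the bit positions chosen for IRQS_SHARED and IRQS_MAKE_SHAREABLE are made up for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Bit positions are hypothetical; only the flag names come from the patch. */
#define IRQS_SHARED         (1UL << 0)
#define IRQS_MAKE_SHAREABLE (1UL << 1)

/* Hypothetical stand-in for the status word kept in irq_data. */
struct irq_status_word {
    atomic_ulong drv_status;
};

/* Atomically set bits; concurrent setters and clearers cannot lose updates. */
static void status_set_bits(struct irq_status_word *w, unsigned long bits)
{
    assert(w != NULL);
    atomic_fetch_or(&w->drv_status, bits);
}

/* Atomically clear bits. */
static void status_clear_bits(struct irq_status_word *w, unsigned long bits)
{
    assert(w != NULL);
    atomic_fetch_and(&w->drv_status, ~bits);
}

/* Read a snapshot; it may be stale as soon as it is returned. */
static unsigned long status_snapshot(struct irq_status_word *w)
{
    assert(w != NULL);
    return atomic_load(&w->drv_status);
}
```

Atomic read-modify-write only protects the word itself. Any ordering against the line's mask state, such as the synchronize_irq() concern raised above, would still need to be handled separately, which is presumably why taking desc->lock was listed as the alternative.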
Re: [PATCH 2/3] Refactor zone_reclaim (v2)
On Tue, Dec 14, 2010 at 8:45 PM, Balbir Singh wrote: > * MinChan Kim [2010-12-14 19:01:26]: > >> Hi Balbir, >> >> On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh >> wrote: >> > Move reusable functionality outside of zone_reclaim. >> > Make zone_reclaim_unmapped_pages modular >> > >> > Signed-off-by: Balbir Singh >> > --- >> > mm/vmscan.c | 35 +++ >> > 1 files changed, 23 insertions(+), 12 deletions(-) >> > >> > diff --git a/mm/vmscan.c b/mm/vmscan.c >> > index e841cae..4e2ad05 100644 >> > --- a/mm/vmscan.c >> > +++ b/mm/vmscan.c >> > @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone >> > *zone) >> > } >> > >> > /* >> > + * Helper function to reclaim unmapped pages, we might add something >> > + * similar to this for slab cache as well. Currently this function >> > + * is shared with __zone_reclaim() >> > + */ >> > +static inline void >> > +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, >> > + unsigned long nr_pages) >> > +{ >> > + int priority; >> > + /* >> > + * Free memory by calling shrink zone with increasing >> > + * priorities until we have enough memory freed. >> > + */ >> > + priority = ZONE_RECLAIM_PRIORITY; >> > + do { >> > + shrink_zone(priority, zone, sc); >> > + priority--; >> > + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); >> > +} >> >> As I said previous version, zone_reclaim_unmapped_pages doesn't have >> any functions related to reclaim unmapped pages. > > The scan control point has the right arguments for implementing > reclaim of unmapped pages. I mean you should set up scan_control setup in this function. Current zone_reclaim_unmapped_pages doesn't have any specific routine related to reclaim unmapped pages. Otherwise, change the function name with just "zone_reclaim_pages". I think you don't want it. 
--
Kind regards,
Minchan Kim
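The review comment can be pictured with a small user-space mock: if the helper builds its own scan_control with mapped pages excluded, the name zone_reclaim_unmapped_pages describes what it does. Everything below is a simplified stand-in. The real struct zone, struct scan_control and shrink_zone() are far larger, the mock's page accounting is invented, and only ZONE_RECLAIM_PRIORITY, the may_unmap field name and the priority loop are taken from the kernel source and the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ZONE_RECLAIM_PRIORITY 4   /* as defined in mm/vmscan.c */

/* Mock zone: just two page counters. */
struct zone {
    unsigned long unmapped_pages;
    unsigned long mapped_pages;
};

/* Mock scan_control with only the fields this sketch needs. */
struct scan_control {
    unsigned long nr_reclaimed;
    bool may_unmap;               /* false: leave mapped pages alone */
};

/* Mock shrink_zone(): reclaims a priority-scaled share of the zone. */
static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
    unsigned long n = zone->unmapped_pages >> priority;

    zone->unmapped_pages -= n;
    sc->nr_reclaimed += n;

    if (sc->may_unmap) {
        n = zone->mapped_pages >> priority;
        zone->mapped_pages -= n;
        sc->nr_reclaimed += n;
    }
}

/*
 * Variant that sets up scan_control internally so that only unmapped
 * pages are eligible, then runs the priority loop from the patch.
 */
static unsigned long zone_reclaim_unmapped_pages(struct zone *zone,
                                                 unsigned long nr_pages)
{
    struct scan_control sc = { .nr_reclaimed = 0, .may_unmap = false };
    int priority = ZONE_RECLAIM_PRIORITY;

    assert(zone != NULL);
    do {
        shrink_zone(priority, zone, &sc);
        priority--;
    } while (priority >= 0 && sc.nr_reclaimed < nr_pages);

    return sc.nr_reclaimed;
}
```

In this shape the caller no longer passes a scan_control at all, so it cannot accidentally allow mapped pages to be reclaimed. Whether that fits the other caller, __zone_reclaim(), is exactly the open question in the thread.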
Re: [PATCH v3 0/4] KVM & genirq: Enable adaptive IRQ sharing for passed-through devices
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> This addresses the review comments of the previous round:
>  - renamed irq_data::status to drv_status
>  - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
>  - fixed signature of get_irq_status (irq is now unsigned int)
>  - converted register_lock into a global one
>  - fixed critical white space breakage (that I just left in to check if
>    anyone is actually reading the code, of course...)

Just for the record, you either missed or introduced some new white space noise :)
Re: USB Passthrough 1.1 performance problem...
Daniel P. Berrange wrote: On Tue, Dec 14, 2010 at 12:55:04PM +0100, Kenni Lund wrote: 2010/12/14 Erik Brakkee: From: Kenni Lund 2010/12/14 Erik Brakkee: From: Kenni Lund Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500 might work now? Passthrough of a PVR-500 has been working for a long time. I've been running with passthrough of a PVR-500 in my HTPC, since November/December 2009...so it should work with any recent kernel and any recent version of qemu-kvm you can find today - No patching needed. The only issue I had with the PVR-500 card, was when *I* didn't free up the shared interrupts...once I fixed that, it "just worked". How did you free up those shared interrupts then? I tried different slots but always get conflicts with the USB irqs. I did an unbind of the conflicting device (eg. disabled it). I moved the PVR-500 card around in the different slots and once I got a conflict with the integrated sound card, I left the PVR-500 card in that slot (it's a headless machine, so no need for sound) and configured unbind of the sound card at boot time. On my old system I think it was conflicting with one of the USB controllers as well, but it didn't really matter, as I only lost a few of the ports on the back of the computer for that particular USB controller - I still had plenty of USB ports left and if I really needed more ports, I could just plug in an extra USB PCI card. 
My /etc/rc.local boot script looks like the following today: -- #Remove HDA conflicting with ivtv1 echo ":00:1b.0"> /sys/bus/pci/drivers/HDA\ Intel/unbind # ivtv0 echo " 0016"> /sys/bus/pci/drivers/pci-stub/new_id echo ":04:08.0"> /sys/bus/pci/drivers/ivtv/unbind echo ":04:08.0"> /sys/bus/pci/drivers/pci-stub/bind echo " 0016"> /sys/bus/pci/drivers/pci-stub/remove_id # ivtv1 echo " 0016"> /sys/bus/pci/drivers/pci-stub/new_id echo ":04:09.0"> /sys/bus/pci/drivers/ivtv/unbind echo ":04:09.0"> /sys/bus/pci/drivers/pci-stub/bind echo " 0016"> /sys/bus/pci/drivers/pci-stub/remove_id I did not try unbinding the usb device so I can also try that. I don'.t understand what is happening with the 0016. I configured the pci card in kvm and I believe kvm does the binding to pci-stub in recent versions. Where is the 0016%oming from? Okay, qemu-kvm might do it today, I don't know - I haven't changed that script for the past year. But are you sure that it's not libvirt/virsh/virt-manager which does that for you? If you use the managed="yes" attribute on the in libvirt XML, then libvirt will automatically do the pcistub bind/unbind, followed by a device reset at guest startup& the reverse at shutdown. If you have conflicting devices on the bus though, libvirt won't attempt to unbind them, unless you had also explicitly assigned all those conflicting devices to the same guest. Daniel I definitely have to try again (right now having some stability problems on the server that I am debugging). The shared IRQs are as follows: 16: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3 18: 252995 0 0 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb8, ivtv0 19: 58281 0 0 0 0 0 0 0 IO-APIC-fasteoi ata_piix, ata_piix, uhci_hcd:usb5, uhci_hcd:usb7, ivtv1 21: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb4 23:713 6906 0 76919 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb2, uhci_hcd:usb6 So I have IRQ sharing with usb1, usb8, usb5, usb7. 
I have also read that ehci refers to USB 2.0 and uhci to USB 1.1. Is that correct? Anyway, how would I now identify the USB PCI devices that I would need to unbind to get rid of the sharing with the USB ports?

It also doesn't really matter in which slot I put the PVR-500 card, because both cards share IRQs with USB in all cases. I have also used an add-on USB PCI card but still got these conflicts. I was considering getting a PCIe USB card instead, in the hope that it would use different IRQs. Is that a realistic expectation? That way, I could disable all on-board USB (in the BIOS even) and use the add-on USB only.
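As a sketch of one way to answer the "which PCI device is usbN" question: on Linux, /sys/bus/usb/devices/usbN is normally a symlink into the PCI device tree, so the path component just before usbN is the PCI address of the host controller. The helper below only does the string handling. Its name and the sample path in the note are made up for illustration, and the real link target on a given machine has to be read with readlink(2) or ls -l.

```c
#include <string.h>

/*
 * Hypothetical helper: given the target of a /sys/bus/usb/devices/usbN
 * symlink, copy the parent path component (the PCI address of the host
 * controller) into out. Returns 0 on success, -1 if the path has no
 * parent component or out is too small.
 */
int usb_root_hub_pci_addr(const char *link_target, char *out, size_t outlen)
{
    const char *last = strrchr(link_target, '/');
    const char *start;
    size_t len;

    if (last == NULL || last == link_target)
        return -1;

    /* Walk back to the '/' that precedes the parent component. */
    start = last;
    while (start > link_target && start[-1] != '/')
        start--;

    len = (size_t)(last - start);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

For a link target such as "../../../devices/pci0000:00/0000:00:1d.7/usb1" (an invented example) this yields "0000:00:1d.7". That address is the kind of value that would be written to the unbind file of the USB controller's driver, in the same way the rc.local script earlier in this message unbinds the sound card.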
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 01:54 PM, David S. Ahern wrote:
On 12/14/10 12:49, Anthony Liguori wrote:

But that doesn't tell you what the impact is in real world workloads. Before we start pushing all device emulation into the kernel, we need to quantify how often gettimeofday() is really called in real workloads.

The workload that inspired that example program at its current max load calls gtod upwards of 1000 times per second. The overhead of gettimeofday was the biggest factor when comparing performance to bare metal and esx. That's why I wrote the test program --- boils a complex product/program to a single system call.

So the absolute performance impact was on the order of what?

The difference in CPU time of a light weight vs. heavy weight exit should be something like 2-3us. That would mean 2-3ms of CPU time at a rate of 1000 per second. That should be pretty much in the noise.

There are possibly second order effects that might make a large impact such as contention with the qemu_mutex. It's worth doing experimentation to see if a non-mutex acquiring fast path in userspace also resulted in a significant performance boost.

Regards,

Anthony Liguori

David

Regards,

Anthony Liguori

What's the relative speed of the in-kernel pmtimer compared to the PIT?

David
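The back-of-the-envelope estimate in this message can be written out as a small sketch. The function name is made up, and the inputs are the figures quoted in the thread (about 1000 gettimeofday() calls per second, and a 2-3us difference between a light-weight and a heavy-weight exit), not measurements.

```c
/*
 * Illustrative only: fraction of one CPU spent on the extra exit cost.
 *   calls_per_sec - how often the guest triggers the exit
 *   extra_us      - additional cost of a heavy-weight exit, in microseconds
 */
double exit_overhead_fraction(double calls_per_sec, double extra_us)
{
    return calls_per_sec * extra_us / 1e6;
}
```

With 1000 calls per second this gives 0.002 to 0.003, i.e. 2-3ms of CPU time per second, or roughly 0.2-0.3% of one CPU. A larger measured gap would therefore have to come from second-order effects such as the lock contention mentioned in the message.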
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka
> 	chip_bus_lock(desc);
> 	retval = __setup_irq(irq, desc, action);
> 	chip_bus_sync_unlock(desc);
>
> -	if (retval)
> +	if (retval) {
> +		if (desc->action && !desc->action->next)
> +			desc->irq_data.drv_status &= ~IRQS_SHARED;

This is redundant. IRQS_SHARED gets set in a code path where all checks are done already.

To make that more obvious we can set it right before

	raw_spin_unlock_irqrestore(&desc->lock, flags);

conditionally on (shared). That way we can also move the kfree out of the mutex locked section.

Thanks,

	tglx
Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq,
> void *dev_id)
> 	/* Make sure it's not being used on another CPU: */
> 	synchronize_irq(irq);
>
> +	if (single_handler)
> +		desc->irq_data.drv_status &= ~IRQS_SHARED;
> +

What's the reason to clear this flag outside of the desc->lock held region. I need this status for other purposes as well, where I definitely need serialization.

> +	mutex_lock(&register_lock);
> +
> +	old_action = desc->action;
> +	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> +	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
> +		/*
> +		 * Signal the old handler that is has to switch to shareable
> +		 * handling mode. Disable the line to avoid any conflict with
> +		 * a real IRQ.
> +		 */
> +		disable_irq(irq);
> +		local_irq_disable();
> +
> +		desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;

Unserialized access as well. Will think about it.

> +		old_action->handler(irq, old_action->dev_id);
> +		desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;

Thanks,

	tglx
Re: [PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +/**
> + * get_irq_status - read interrupt line status word
> + * @irq: Interrupt line of the status word
> + *
> + * This returns the current content of the status word associated with
> + * the given interrupt line. See IRQS_* flags for details.
> + */
> +unsigned long get_irq_status(unsigned int irq)
> +{
> +	struct irq_desc *desc = irq_to_desc(irq);
> +
> +	return desc ? desc->irq_data.drv_status : 0;
> +}
> +EXPORT_SYMBOL_GPL(get_irq_status);

We should document that this is a snapshot and in no way serialized against modifications of drv_status. I'll fix up the kernel doc.

Thanks,

	tglx
Re: [Qemu-devel] Re: [PATCHv8 00/16] boot order specification
> The only working system emulation we have are Macs (G3 beige, G4, G5), > so we can't just ignore Apple. > Alex even made me stick to their odd 0x41 rtas-version property. ;) Hah :-) Nothing ever used RTAS on these... afaik, it didn't even work properly. > No, but that may be OpenBIOS' fault. Here's its reg, in case it helps: > > reg 1800 > 01001810 0008 > 01001814 0004 > 01001818 0008 > 0100181c 0004 > 01001820 0010 > That looks like PCI odd to keep a PCI addressing scheme below a PCI device... Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCHv8 00/16] boot order specification
On 12.12.2010 at 00:22, Benjamin Herrenschmidt wrote:
On Sat, 2010-12-11 at 18:06 +0200, Gleb Natapov wrote:

http://playground.sun.com/pub/p1275/bindings/pci/pci2_1.pdf has table on page 10 that defines how pci class code should be translated into OF name. This is what my patch is using. pci-ata does not look spec compliant (or is there more up-to-date spec?)

What should we do with at...@600 vs dr...@1?

There is no available IDE OF binding spec, so I went with the way OpenBIOS reports ata on qemu-x86. I have no idea what 600 in at...@600 may mean, but looking at g3_beige_300.html there is no such node there and looking at any other device tree in http://penguinppc.org/historical/dev-trees-html/

Those are old and I wouldn't look too closely at what Apple does.

The only working system emulation we have are Macs (G3 beige, G4, G5), so we can't just ignore Apple. Alex even made me stick to their odd 0x41 rtas-version property. ;)

ATA doesn't really need anything complex, mostly the ata controller, generally named "ata" nowadays with a #address-cells of 1 and a #size-cells of 0. Children are then typically disk, cdrom, ... (ie block devices) with a unit address of 0 for master and 1 for slave. In the case of controllers with multiple ports, typically you have one such "ata" node per bus.

"pci-ata" is a liberal use by Apple here representing the actual host controller PCI device.

In any case, what matters is the "compatible" property. This is what defines the programming interface of a device.

I haven't found one that uses this kind of addressing for pci-ata. http://penguinppc.org/historical/dev-trees-html/g3bw_400.html for instance has p...@8000/pci-bri...@d/pci-...@1/ata-4. at...@600 kind of addressing is used by devices on mac-io bus which I do not think we emulate in qemu. So it looks like OpenBIOS is wrong here.
Well, it's possible that the @600 represents a register offset within pci-ata; it is entirely up to pci-ata to define its own internal binding there as it wishes.

Is there a "ranges" property defining translation across "pci-ata"?

No, but that may be OpenBIOS' fault. Here's its reg, in case it helps:

reg 1800
01001810 0008
01001814 0004
01001818 0008
0100181c 0004
01001820 0010

Regards,
Andreas
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/10 12:49, Anthony Liguori wrote:
> But that doesn't tell you what the impact is in real world workloads.
> Before we start pushing all device emulation into the kernel, we need to
> quantify how often gettimeofday() is really called in real workloads.

The workload that inspired that example program at its current max load calls gtod upwards of 1000 times per second. The overhead of gettimeofday was the biggest factor when comparing performance to bare metal and esx. That's why I wrote the test program --- boils a complex product/program to a single system call.

David

>
> Regards,
>
> Anthony Liguori
>
>> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>>
>> David
>>
>
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 12:00 PM, David S. Ahern wrote: On 12/14/10 08:29, Anthony Liguori wrote: I recently used to investigate the performance benefit. In a Linux guest, I was running a program that calls gettimeofday() 'n' times in a loop (the PM Timer register is read during each call). With in-kernel PM Timer, I observed a significant reduction of program execution time. I've played with this in the past. Can you post real numbers, preferably, with a real work load? 2 years ago I posted relative comparisons of the time sources for older RHEL guests: http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html Any time you write a program in userspace that effectively equates to a single PIO operation that is easy to emulate, it's going to be remarkably faster to implement that PIO emulation in the kernel than in userspace because vmexit exit cost dominates the execution path. But that doesn't tell you what the impact is in real world workloads. Before we start pushing all device emulation into the kernel, we need to quantify how often gettimeofday() is really called in real workloads. Regards, Anthony Liguori What's the relative speed of the in-kernel pmtimer compared to the PIT? David -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations
From: "Michael S. Tsirkin"
Date: Tue, 14 Dec 2010 14:23:26 +0200

> On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
>> Please merge the following tree for 2.6.38.
>> Thanks!
>
> Rusty Acked it as is, so please pull the below.
> Thanks very much!
>
>> The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:
>>
>>   net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)
>>
>> are available in the git repository at:
>>   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

Pulled, thanks a lot.
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/10 08:29, Anthony Liguori wrote:
>> I recently used to investigate the performance benefit. In a Linux
>> guest, I was running a program that calls gettimeofday() 'n' times
>> in a loop (the PM Timer register is read during each call). With
>> in-kernel PM Timer, I observed a significant reduction of program
>> execution time.
>>
>
> I've played with this in the past. Can you post real numbers,
> preferably, with a real work load?

2 years ago I posted relative comparisons of the time sources for older RHEL guests:
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

What's the relative speed of the in-kernel pmtimer compared to the PIT?

David
Re: [Qemu-devel] [PATCH] RFC: delay pci_update_mappings for 64-bit BARs
On Mon, Dec 13, 2010 at 8:00 PM, Isaku Yamahata wrote:
> On Mon, Dec 13, 2010 at 03:43:44PM -0700, Cam Macdonell wrote:
>> Do not call pci_update_mappings on the lower 32-bits of a 64-bit bar. Wait
>> for the upper 32 or else Qemu will try to map on just the lower 32 which is
>> probably going to corrupt memory.
>>
>> I was encountering crashes when mapping certain PCI region sizes. The
>> problem turns out that pci_update_mappings is being called without all
>> 64-bits in the BAR. For example when mapping to 0x18000, once the lower
>> 32-bits were written the remapping happened (mapping to 0x800) which
>> would overwrite something.
>>
>> I'm not certain if this is completely correct, I'm simply testing the lower
>> 4-bits to only be MEM_TYPE_64 flag. Upper 32-bit address parts can be
>> values like 0xff which is tricky to test against.
>
> You're assuming that guest OS always write lower 32bit and then upper 32bit.
> Is the assumption correct?
> I found Linux does, but I don't know about other OSes.
> And I couldn't find any sentence about how to update (64bit) BAR in the specs.
> (Please correct me if I missed it)

I think you're right, we probably can't assume the order.

>
> Some work around would be necessary regardless of 32bit-or-64bit.
> because qemu doesn't emulate bus accurately at the moment.
> How about the followings?
> If BAR overlaps with RAM, don't map BAR.
> If BAR overlaps with other BARs, record the overlapping and
> when updating one of the BARs, update all the overlapping BARs.
> Which BAR wins depends on the order of updating, it doesn't matter because
> it's anomaly case.

But the addresses in the BARs may not overlap. For example, Linux allocates memory from top down, so I recently had the mapping of a BAR to address 0xffc000

So BAR 0x18 sees 0xc004
Then BAR 0x1c sees 0xff

So if I understand what you mean by overlapping BARs, 0xc000 and 0xffc000 will not be detected as overlapping and so we can't record it.
But, we can allow harmless mappings of the incomplete lower-32 to proceed and then get remapped when the upper bits are written. (This is what happens currently, but fails when the lower-32 overwrite RAM). Case of writing upper-then-lower (non-Linux case): The addresses in the upper 32-bits are going to be limited to 16-bits (at most 48-bit addresses currently) and so those shouldn't update mappings because they will overlap with RAM. When the lower-bits are written, we have the full 64-bit address and can update mappings. Case of writing lower-then-upper: If the lower 32-bit BAR address doesn't conflict with RAM, map it. When the upper bits are written, update to the correct mapping. We would just have to ensure the first mapping is indeed harmless. Would that work? Cam > > This way, 32bit BAR case is also covered. > > thanks, > >> >> Cam >> --- >> hw/pci.c | 5 - >> 1 files changed, 4 insertions(+), 1 deletions(-) >> >> diff --git a/hw/pci.c b/hw/pci.c >> index 438c0d1..3b81792 100644 >> --- a/hw/pci.c >> +++ b/hw/pci.c >> @@ -1000,6 +1000,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t >> addr, uint32_t val, int l) >> { >> int i, was_irq_disabled = pci_irq_disabled(d); >> uint32_t config_size = pci_config_size(d); >> + int is_64 = 0; >> + >> + is_64 = ((val & 0xf) == PCI_BASE_ADDRESS_MEM_TYPE_64); >> >> for (i = 0; i < l && addr + i < config_size; val >>= 8, ++i) { >> uint8_t wmask = d->wmask[addr + i]; >> @@ -1008,7 +1011,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t >> addr, uint32_t val, int l) >> d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & >> wmask); >> d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */ >> } >> - if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) || >> + if ((ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) && (!is_64)) || >> ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) || >> ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) || >> range_covers_byte(addr, l, PCI_COMMAND)) >> -- >> 1.7.0.4 >> 
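To make the two write orders discussed in this message concrete, here is a minimal sketch, not QEMU code. The constants use the values from the Linux pci_regs.h header under local names; the struct, the function names and the sample BAR values used with them are hypothetical. The sketch masks with the memory-type bits (0x06) when testing for a 64-bit BAR. A test of the form (val & 0xf) == 0x4, as in the quoted patch, would not match a prefetchable 64-bit BAR, where bit 3 is also set.

```c
#include <stdint.h>

/* Same values as PCI_BASE_ADDRESS_* in Linux pci_regs.h, local names. */
#define BAR_SPACE_IO      0x01u
#define BAR_MEM_TYPE_MASK 0x06u
#define BAR_MEM_TYPE_64   0x04u

/* Return 1 if the low BAR dword describes a 64-bit memory BAR. */
int bar_is_mem64(uint32_t lo)
{
    if (lo & BAR_SPACE_IO)
        return 0;
    return (lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}

/* Combine both halves; the low four bits of a memory BAR are flags. */
uint64_t bar64_address(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (lo & ~(uint32_t)0xf);
}

/* Hypothetical bookkeeping: remap only once both halves were written. */
struct bar64_state {
    uint32_t lo, hi;
    int lo_written, hi_written;
};

/* Record one half, in either order; returns 1 when the address is complete. */
int bar64_write(struct bar64_state *s, int upper, uint32_t val)
{
    if (upper) {
        s->hi = val;
        s->hi_written = 1;
    } else {
        s->lo = val;
        s->lo_written = 1;
    }
    return s->lo_written && s->hi_written;
}
```

Tracking both halves this way is independent of whether the guest writes lower-then-upper or upper-then-lower. It does not by itself decide when the flags should be reset, or what to do when a guest rewrites only one half, so it is a starting point for the discussion rather than a fix.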
>> > > -- > yamahata > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 09:38 AM, Avi Kivity wrote: Fortunately, we have a very good bytecode interpreter that's accelerated in the kernel called KVM ;-) We have exactly the same bytecode interpreter under a different name, it's called userspace. If you can afford to make the transition back to the guest for emulation, you might as well transition to userspace. If you re-entered the guest and setup a stack that had the RIP of the source of the exit, then there's no additional need to exit the guest. The handler can just do an iret. Or am I missing something? Why not have the equivalent of a paravirtual SMM mode where we can reflect IO exits back to the guest in a well defined way? It could then implement PM timer in terms of HPET or something like that. More exits. Yeah, I should have said, implement in terms of kvmclock so no additional exits. We already have a virtual address space that works for most guests thanks to the TPR optimization. It only works for Windows XP and Windows XP with the /3GB extension. Is this a fundamental limitation or just a statement of today's heuristics? Does any guest not keep the BIOS in virtual memory in a static location? Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH unit-tests 1/4] Move idt.c into lib code.
Make it compilable in 32 and 64 bit mode. --- config-x86-common.mak |7 +- lib/x86/idt.c | 32 --- lib/x86/idt.h |1 + x86/idt.c | 148 - 4 files changed, 16 insertions(+), 172 deletions(-) delete mode 100644 x86/idt.c diff --git a/config-x86-common.mak b/config-x86-common.mak index c5508b3..2269c4a 100644 --- a/config-x86-common.mak +++ b/config-x86-common.mak @@ -11,6 +11,7 @@ cflatobjs += \ cflatobjs += lib/x86/fwcfg.o cflatobjs += lib/x86/apic.o cflatobjs += lib/x86/atomic.o +cflatobjs += lib/x86/idt.o $(libcflat): LDFLAGS += -nostdlib $(libcflat): CFLAGS += -ffreestanding -I lib @@ -50,7 +51,7 @@ $(TEST_DIR)/vmexit.elf: $(cstart.o) $(TEST_DIR)/vmexit.o $(TEST_DIR)/smptest.elf: $(cstart.o) $(TEST_DIR)/smptest.o $(TEST_DIR)/emulator.elf: $(cstart.o) $(TEST_DIR)/emulator.o \ - $(TEST_DIR)/vm.o $(TEST_DIR)/idt.o + $(TEST_DIR)/vm.o $(TEST_DIR)/port80.elf: $(cstart.o) $(TEST_DIR)/port80.o @@ -65,9 +66,9 @@ $(TEST_DIR)/realmode.o: bits = 32 $(TEST_DIR)/msr.elf: $(cstart.o) $(TEST_DIR)/msr.o -$(TEST_DIR)/idt_test.elf: $(cstart.o) $(TEST_DIR)/idt.o $(TEST_DIR)/idt_test.o +$(TEST_DIR)/idt_test.elf: $(cstart.o) $(TEST_DIR)/idt_test.o -$(TEST_DIR)/xsave.elf: $(cstart.o) $(TEST_DIR)/idt.o $(TEST_DIR)/xsave.o +$(TEST_DIR)/xsave.elf: $(cstart.o) $(TEST_DIR)/xsave.o $(TEST_DIR)/rmap_chain.elf: $(cstart.o) $(TEST_DIR)/rmap_chain.o \ $(TEST_DIR)/vm.o diff --git a/lib/x86/idt.c b/lib/x86/idt.c index ed2f4b0..b3e47d4 100644 --- a/lib/x86/idt.c +++ b/lib/x86/idt.c @@ -1,5 +1,6 @@ #include "idt.h" #include "libcflat.h" +#include "processor.h" typedef struct { unsigned short offset0; @@ -19,30 +20,19 @@ typedef struct { static idt_entry_t idt[256]; -typedef struct { -unsigned short limit; -unsigned long linear_addr; -} __attribute__((packed)) descriptor_table_t; - -void lidt(idt_entry_t *idt, int nentries) +void load_lidt(idt_entry_t *idt, int nentries) { -descriptor_table_t dt; +struct descriptor_table_ptr dt; dt.limit = nentries * sizeof(*idt) - 1; -dt.linear_addr = (unsigned 
long)idt; +dt.base = (unsigned long)idt; +lidt(&dt); asm volatile ("lidt %0" : : "m"(dt)); } -unsigned short read_cs() -{ -unsigned short r; - -asm volatile ("mov %%cs, %0" : "=r"(r)); -return r; -} - -void set_idt_entry(idt_entry_t *e, void *addr, int dpl) +void set_idt_entry(int vec, void *addr, int dpl) { +idt_entry_t *e = &idt[vec]; memset(e, 0, sizeof *e); e->offset0 = (unsigned long)addr; e->selector = read_cs(); @@ -146,10 +136,10 @@ void setup_idt(void) { extern char ud_fault, gp_fault, de_fault; -lidt(idt, 256); -set_idt_entry(&idt[0], &de_fault, 0); -set_idt_entry(&idt[6], &ud_fault, 0); -set_idt_entry(&idt[13], &gp_fault, 0); +load_lidt(idt, 256); +set_idt_entry(0, &de_fault, 0); +set_idt_entry(6, &ud_fault, 0); +set_idt_entry(13, &gp_fault, 0); } unsigned exception_vector(void) diff --git a/lib/x86/idt.h b/lib/x86/idt.h index 6babcb4..81b8944 100644 --- a/lib/x86/idt.h +++ b/lib/x86/idt.h @@ -15,5 +15,6 @@ void setup_idt(void); unsigned exception_vector(void); unsigned exception_error_code(void); +void set_idt_entry(int vec, void *addr, int dpl); #endif diff --git a/x86/idt.c b/x86/idt.c deleted file mode 100644 index 4480833..000 --- a/x86/idt.c +++ /dev/null @@ -1,148 +0,0 @@ -#include "idt.h" -#include "libcflat.h" - -typedef struct { -unsigned short offset0; -unsigned short selector; -unsigned short ist : 3; -unsigned short : 5; -unsigned short type : 4; -unsigned short : 1; -unsigned short dpl : 2; -unsigned short p : 1; -unsigned short offset1; -unsigned offset2; -unsigned reserved; -} idt_entry_t; - -static idt_entry_t idt[256]; - -typedef struct { -unsigned short limit; -unsigned long linear_addr; -} __attribute__((packed)) descriptor_table_t; - -void lidt(idt_entry_t *idt, int nentries) -{ -descriptor_table_t dt; - -dt.limit = nentries * sizeof(*idt) - 1; -dt.linear_addr = (unsigned long)idt; -asm volatile ("lidt %0" : : "m"(dt)); -} - -unsigned short read_cs() -{ -unsigned short r; - -asm volatile ("mov %%cs, %0" : "=r"(r)); -return r; -} - 
-void set_idt_entry(idt_entry_t *e, void *addr, int dpl) -{ -memset(e, 0, sizeof *e); -e->offset0 = (unsigned long)addr; -e->selector = read_cs(); -e->ist = 0; -e->type = 14; -e->dpl = dpl; -e->p = 1; -e->offset1 = (unsigned long)addr >> 16; -e->offset2 = (unsigned long)addr >> 32; -} - -struct ex_regs { -unsigned long rax, rcx, rdx, rbx; -unsigned long dummy, rbp, rsi, rdi; -unsigned long r8, r9, r10, r11; -unsigned long r12, r13, r14, r15; -unsigned long vector; -unsigned long error_code; -unsigned long rip
[PATCH unit-tests 2/4] Make access.c use library functions.
access.c has functions that are provided by library code. Remove them and use library functions instead. --- x86/access.c | 92 +++-- 1 files changed, 5 insertions(+), 87 deletions(-) diff --git a/x86/access.c b/x86/access.c index 067565b..df943d9 100644 --- a/x86/access.c +++ b/x86/access.c @@ -1,5 +1,7 @@ #include "libcflat.h" +#include "idt.h" +#include "processor.h" #define smp_id() 0 @@ -98,34 +100,6 @@ static inline void *va(pt_element_t phys) return (void *)phys; } -static unsigned long read_cr0() -{ -unsigned long cr0; - -asm volatile ("mov %%cr0, %0" : "=r"(cr0)); - -return cr0; -} - -static void write_cr0(unsigned long cr0) -{ -asm volatile ("mov %0, %%cr0" : : "r"(cr0)); -} - -typedef struct { -unsigned short offset0; -unsigned short selector; -unsigned short ist : 3; -unsigned short : 5; -unsigned short type : 4; -unsigned short : 1; -unsigned short dpl : 2; -unsigned short p : 1; -unsigned short offset1; -unsigned offset2; -unsigned reserved; -} idt_entry_t; - typedef struct { pt_element_t pt_pool; unsigned pt_pool_size; @@ -143,7 +117,6 @@ typedef struct { pt_element_t ignore_pde; int expected_fault; unsigned expected_error; -idt_entry_t idt[256]; } ac_test_t; typedef struct { @@ -154,51 +127,6 @@ typedef struct { static void ac_test_show(ac_test_t *at); -void lidt(idt_entry_t *idt, int nentries) -{ -descriptor_table_t dt; - -dt.limit = nentries * sizeof(*idt) - 1; -dt.linear_addr = (unsigned long)idt; -asm volatile ("lidt %0" : : "m"(dt)); -} - -unsigned short read_cs() -{ -unsigned short r; - -asm volatile ("mov %%cs, %0" : "=r"(r)); -return r; -} - -unsigned long long rdmsr(unsigned index) -{ -unsigned a, d; - -asm volatile("rdmsr" : "=a"(a), "=d"(d) : "c"(index)); -return ((unsigned long long)d << 32) | a; -} - -void wrmsr(unsigned index, unsigned long long val) -{ -unsigned a = val, d = val >> 32; - -asm volatile("wrmsr" : : "a"(a), "d"(d), "c"(index)); -} - -void set_idt_entry(idt_entry_t *e, void *addr, int dpl) -{ -memset(e, 0, sizeof *e); 
-e->offset0 = (unsigned long)addr; -e->selector = read_cs(); -e->ist = 0; -e->type = 14; -e->dpl = dpl; -e->p = 1; -e->offset1 = (unsigned long)addr >> 16; -e->offset2 = (unsigned long)addr >> 32; -} - void set_cr0_wp(int wp) { unsigned long cr0 = read_cr0(); @@ -222,13 +150,11 @@ void set_efer_nx(int nx) static void ac_env_int(ac_pool_t *pool) { -static idt_entry_t idt[256]; +setup_idt(); -memset(idt, 0, sizeof(idt)); -lidt(idt, 256); extern char page_fault, kernel_entry; -set_idt_entry(&idt[14], &page_fault, 0); -set_idt_entry(&idt[0x20], &kernel_entry, 3); +set_idt_entry(14, &page_fault, 0); +set_idt_entry(0x20, &kernel_entry, 3); pool->pt_pool = 33 * 1024 * 1024; pool->pt_pool_size = 120 * 1024 * 1024 - pool->pt_pool; @@ -273,14 +199,6 @@ int ac_test_bump(ac_test_t *at) return ret; } -unsigned long read_cr3() -{ -unsigned long cr3; - -asm volatile ("mov %%cr3, %0" : "=r"(cr3)); -return cr3; -} - void invlpg(void *addr) { asm volatile ("invlpg (%0)" : : "r"(addr)); -- 1.7.2.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
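The set_idt_entry() helper that this series consolidates spreads the handler address over three fields of the gate descriptor (offset0, offset1, offset2). The following user-space sketch shows only that split and its inverse. The struct is a simplified stand-in that omits the selector, type, dpl and present fields, and both function names are made up.

```c
#include <stdint.h>

/* Simplified stand-in for the address fields of a 64-bit IDT gate. */
struct gate_offsets {
    uint16_t offset0;   /* handler address bits 15:0  */
    uint16_t offset1;   /* handler address bits 31:16 */
    uint32_t offset2;   /* handler address bits 63:32 */
};

/* Split a handler address the way set_idt_entry() fills the fields. */
struct gate_offsets split_handler_addr(uint64_t addr)
{
    struct gate_offsets g;

    g.offset0 = (uint16_t)addr;
    g.offset1 = (uint16_t)(addr >> 16);
    g.offset2 = (uint32_t)(addr >> 32);
    return g;
}

/* Reassemble the address from the three fields. */
uint64_t join_handler_addr(struct gate_offsets g)
{
    return (uint64_t)g.offset0 |
           ((uint64_t)g.offset1 << 16) |
           ((uint64_t)g.offset2 << 32);
}
```

On a 32-bit build only the first two fields carry address bits, which matches the #ifdef __x86_64__ guard around offset2 in the apic test's copy of the structure.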
[PATCH unit-tests 3/4] Remove duplicated idt code from apic test.
Use library idt code instead. --- x86/apic.c | 49 +++-- 1 files changed, 11 insertions(+), 38 deletions(-) diff --git a/x86/apic.c b/x86/apic.c index 2207040..6d06f9f 100644 --- a/x86/apic.c +++ b/x86/apic.c @@ -2,22 +2,7 @@ #include "apic.h" #include "vm.h" #include "smp.h" - -typedef struct { -unsigned short offset0; -unsigned short selector; -unsigned short ist : 3; -unsigned short : 5; -unsigned short type : 4; -unsigned short : 1; -unsigned short dpl : 2; -unsigned short p : 1; -unsigned short offset1; -#ifdef __x86_64__ -unsigned offset2; -unsigned reserved; -#endif -} idt_entry_t; +#include "idt.h" typedef struct { ulong regs[sizeof(ulong)*2]; @@ -90,8 +75,6 @@ asm ( #endif ); -static idt_entry_t *idt = 0; - static int g_fail; static int g_tests; @@ -128,22 +111,12 @@ void test_enable_x2apic(void) } } -static void set_idt_entry(unsigned vec, void (*func)(isr_regs_t *regs)) +static void handle_irq(unsigned vec, void (*func)(isr_regs_t *regs)) { u8 *thunk = vmalloc(50); -ulong ptr = (ulong)thunk; -idt_entry_t ent = { -.offset0 = ptr, -.selector = read_cs(), -.ist = 0, -.type = 14, -.dpl = 0, -.p = 1, -.offset1 = ptr >> 16, -#ifdef __x86_64__ -.offset2 = ptr >> 32, -#endif -}; + +set_idt_entry(vec, thunk, 0); + #ifdef __x86_64__ /* sub $8, %rsp */ *thunk++ = 0x48; *thunk++ = 0x83; *thunk++ = 0xec; *thunk++ = 0x08; @@ -164,7 +137,6 @@ static void set_idt_entry(unsigned vec, void (*func)(isr_regs_t *regs)) *thunk ++ = 0xe9; *(u32 *)thunk = (ulong)isr_entry_point - (ulong)(thunk + 4); #endif -idt[vec] = ent; } static void irq_disable(void) @@ -194,7 +166,7 @@ static void test_self_ipi(void) { int vec = 0xf1; -set_idt_entry(vec, self_ipi_isr); +handle_irq(vec, self_ipi_isr); irq_enable(); apic_icr_write(APIC_DEST_SELF | APIC_DEST_PHYSICAL | APIC_DM_FIXED | vec, 0); @@ -234,7 +206,7 @@ static void ioapic_isr_77(isr_regs_t *regs) static void test_ioapic_intr(void) { -set_idt_entry(0x77, ioapic_isr_77); +handle_irq(0x77, ioapic_isr_77); set_ioapic_redir(0x10, 0x77); 
toggle_irq_line(0x10); asm volatile ("nop"); @@ -262,8 +234,8 @@ static void ioapic_isr_66(isr_regs_t *regs) static void test_ioapic_simultaneous(void) { -set_idt_entry(0x78, ioapic_isr_78); -set_idt_entry(0x66, ioapic_isr_66); +handle_irq(0x78, ioapic_isr_78); +handle_irq(0x66, ioapic_isr_66); set_ioapic_redir(0x10, 0x78); set_ioapic_redir(0x11, 0x66); irq_disable(); @@ -323,7 +295,7 @@ static void test_sti_nmi(void) return; } -set_idt_entry(2, nmi_handler); +handle_irq(2, nmi_handler); on_cpu(1, update_cr3, (void *)read_cr3()); sti_loop_active = 1; @@ -343,6 +315,7 @@ int main() { setup_vm(); smp_init(); +setup_idt(); test_lapic_existence(); -- 1.7.2.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH unit-tests 4/4] Remove unused function from apic test.
---
 x86/apic.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/x86/apic.c b/x86/apic.c
index 6d06f9f..bcb9fc1 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -78,11 +78,6 @@ asm (
 static int g_fail;
 static int g_tests;
 
-static void outb(unsigned char data, unsigned short port)
-{
-    asm volatile ("out %0, %1" : : "a"(data), "d"(port));
-}
-
 static void report(const char *msg, int pass)
 {
     ++g_tests;
-- 
1.7.2.3
Re: [Qemu-devel] KVM call agenda for Dec 14
* Jes Sorensen (jes.soren...@redhat.com) wrote:
> Any chance you could fix your cronjob to send out the CFA a day earlier?
> 15 hrs before is a bit short notice.

Sure.
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 05:32 PM, Anthony Liguori wrote: > If anything I'd expect hpet or the Microsoft synthetic timers to be a > lot more important. True. But also a lot more work. Implementing just the pm timer counter - not the whole of it - in kernel, gives us a lot of gain with not very much effort. Patch is pretty simple, as you can see, and most of it is even code to turn it on/off, etc. Partial emulation is not something I like since it causes a fuzzy kernel/user boundary. In this case, transitioning to userspace when interrupts are enabled doesn't look so hot. Are you sure all guests that benefit from this don't enable the pmtimer interrupt? What about the transition? Will we have a time discontinuity when that happens? What I'd really like to see is this stuff implemented in bytecode, unfortunately that's a lot of work which will be very hard to upstream. Fortunately, we have a very good bytecode interpreter that's accelerated in the kernel called KVM ;-) We have exactly the same bytecode interpreter under a different name, it's called userspace. If you can afford to make the transition back to the guest for emulation, you might as well transition to userspace. Why not have the equivalent of a paravirtual SMM mode where we can reflect IO exits back to the guest in a well defined way? It could then implement PM timer in terms of HPET or something like that. More exits. We already have a virtual address space that works for most guests thanks to the TPR optimization. It only works for Windows XP and Windows XP with the /3GB extension. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 07:49 AM, Avi Kivity wrote: On 12/14/2010 03:40 PM, Glauber Costa wrote: > > What is the motivation for this? Are there any important guests that > use the pmtimer? Avi, All older RHEL and Windows, for example, would benefit for this. They only benefit from it because we don't provide HPET. If we did, the guests would use HPET in preference to pmtimer, since HPET is so much better than pmtimer (yet still sucks in an absolute sense). > If anything I'd expect hpet or the Microsoft synthetic timers to be a > lot more important. True. But also a lot more work. Implementing just the pm timer counter - not the whole of it - in kernel, gives us a lot of gain with not very much effort. Patch is pretty simple, as you can see, and most of it is even code to turn it on/off, etc. Partial emulation is not something I like since it causes a fuzzy kernel/user boundary. In this case, transitioning to userspace when interrupts are enabled doesn't look so hot. Are you sure all guests that benefit from this don't enable the pmtimer interrupt? What about the transition? Will we have a time discontinuity when that happens? What I'd really like to see is this stuff implemented in bytecode, unfortunately that's a lot of work which will be very hard to upstream. Fortunately, we have a very good bytecode interpreter that's accelerated in the kernel called KVM ;-) Why not have the equivalent of a paravirtual SMM mode where we can reflect IO exits back to the guest in a well defined way? It could then implement PM timer in terms of HPET or something like that. We already have a virtual address space that works for most guests thanks to the TPR optimization. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 06:09 AM, Ulrich Obergfell wrote: Hi, This is an RFC through which I would like to get feedback on how the idea of in-kernel PM Timer would be received. The current implementation of PM Timer emulation is 'heavy-weight' because the code resides in qemu userspace. Guest operating systems that use PM Timer as a clock source (for example, older versions of Linux that do not have paravirtualized clock) would benefit from an in-kernel PM Timer emulation. Parts 1 thru 4 of this RFC contain experimental source code which I recently used to investigate the performance benefit. In a Linux guest, I was running a program that calls gettimeofday() 'n' times in a loop (the PM Timer register is read during each call). With in-kernel PM Timer, I observed a significant reduction of program execution time. I've played with this in the past. Can you post real numbers, preferably, with a real work load? Regards, Anthony Liguori The experimental code emulates the PM Timer register in KVM kernel. All other components of ACPI PM remain in qemu userspace. Also, the 'timer carry interrupt' feature is not implemented in-kernel. If a guest operating system needs to enable the 'timer carry interrupt', the code takes care that PM Timer emulation falls back to userspace. However, I think the design of the code has sufficient flexibility, so that anyone who would want to add the 'timer carry interrupt' feature in-kernel could try to do so later on. Please review and please comment. Regards, Uli Obergfell
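For reference, the microbenchmark described in the RFC (calling gettimeofday() 'n' times in a loop) could look roughly like the sketch below. This is not the program Uli used; the function names time_gettimeofday_loop() and report() are made up for illustration, and the claim that each call reads the PM Timer register only holds on a guest whose clock source is the ACPI PM Timer.

```c
#include <stdio.h>
#include <sys/time.h>

/*
 * Call gettimeofday() n times and return the elapsed wall-clock time
 * in seconds. Hypothetical helper, not taken from the RFC.
 */
double time_gettimeofday_loop(long n)
{
    struct timeval start, end, tmp;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < n; i++)
        gettimeofday(&tmp, NULL);   /* the call being measured */
    gettimeofday(&end, NULL);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_usec - start.tv_usec) / 1e6;
}

/* Print the total and the average per-call cost for n calls. */
void report(long n)
{
    double secs = time_gettimeofday_loop(n);

    printf("%ld calls: %.6f s total, %.1f ns per call\n",
           n, secs, n > 0 ? secs * 1e9 / (double)n : 0.0);
}
```

Calling report(10000000) from main() inside the guest, once with the userspace PM Timer and once with the in-kernel one, would give the kind of per-call figure asked for above. It is still a synthetic loop, not a real workload.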
Re: [PATCH v4 2/2] RAM API: Make use of it for x86 PC
On 12/14/2010 09:16 AM, Alex Williamson wrote: On Tue, 2010-12-14 at 11:18 +0200, Avi Kivity wrote: On 12/13/2010 11:24 PM, Alex Williamson wrote: Register the actual VM RAM using the new API @@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size, /* allocate RAM */ ram_addr = qemu_ram_alloc(NULL, "pc.ram", below_4g_mem_size + above_4g_mem_size); -cpu_register_physical_memory(0, 0xa0000, ram_addr); -cpu_register_physical_memory(0x100000, - below_4g_mem_size - 0x100000, - ram_addr + 0x100000); +ram_register(0, below_4g_mem_size, ram_addr); What's the impact of this? Won't it conflict with BIOS memory registration? What about VGA? There is no "conflict". Memory registration can punch through previous registrations. And the QEMU SMM code switches the VGA area back and forth between memory mapped and normal ram depending on the mode. This presents no functional change, just structures RAM allocation to more closely reflect the way things actually work. Regards, Anthony Liguori In terms of patch hygiene, it should be in a separate patch titled "register 0xa0000-0x100000 as RAM" or something. It's a much more drastic change than making use of the new RAM API. As we discussed in the v2 patch, the chipset can selectively switch regions within this range to point at VGA, ROM, or RAM, but there's always physical RAM backing the space, even when its mapping isn't active. VGA and ROM will overlay the RAM mapping. I'm fine with splitting this into two patches for debug-ability, but the change is reflective of following the RAM API and registering all of "RAM". Maybe it would be sufficient to make such a note explicit in this commit log? Thanks, Alex
Re: [PATCH v4 2/2] RAM API: Make use of it for x86 PC
On Tue, 2010-12-14 at 11:18 +0200, Avi Kivity wrote: > On 12/13/2010 11:24 PM, Alex Williamson wrote: > > Register the actual VM RAM using the new API > > > > > > @@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size, > > /* allocate RAM */ > > ram_addr = qemu_ram_alloc(NULL, "pc.ram", > > below_4g_mem_size + above_4g_mem_size); > > -cpu_register_physical_memory(0, 0xa0000, ram_addr); > > -cpu_register_physical_memory(0x100000, > > - below_4g_mem_size - 0x100000, > > - ram_addr + 0x100000); > > +ram_register(0, below_4g_mem_size, ram_addr); > > > > What's the impact of this? Won't it conflict with BIOS memory > registration? What about VGA? > > In terms of patch hygiene, it should be in a separate patch titled > "register 0xa0000-0x100000 as RAM" or something. It's a much more > drastic change than making use of the new RAM API. As we discussed in the v2 patch, the chipset can selectively switch regions within this range to point at VGA, ROM, or RAM, but there's always physical RAM backing the space, even when its mapping isn't active. VGA and ROM will overlay the RAM mapping. I'm fine with splitting this into two patches for debug-ability, but the change is reflective of following the RAM API and registering all of "RAM". Maybe it would be sufficient to make such a note explicit in this commit log? Thanks, Alex
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 04:44 PM, Ulrich Obergfell wrote: > > Partial emulation is not something I like since it causes a fuzzy > kernel/user boundary. In this case, transitioning to userspace when > interrupts are enabled doesn't look so hot. Are you sure all guests > that benefit from this don't enable the pmtimer interrupt? What about > the transition? Will we have a time discontinuity when that happens? Avi, the idea is to use the '-kvm-pmtmr' option (in code part 4) only with guests that do not enable the 'timer carry interrupt'. Guests that need to enable the 'timer carry interrupt' should rather use the PM Timer emulation in qemu userspace (i.e. they should not be started with this option). If a guest is accidentally started with this option, the in-kernel PM Timer (in code part 1) detects if the guest attempts to enable the 'timer carry interrupt' and falls back to PM Timer emulation in qemu userspace (in-kernel PM Timer disables itself automatically). So, this is not a combination of in-kernel PM Timer register emulation and qemu userspace PM Timer interrupt emulation. We really try to avoid guest specific parameters. Having to decide if the guest has virtio is bad enough, but going into low level details like that is really bad. The host admin might not even know what operating systems its guests run. A guest might even dual boot two different operating systems. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for Dec 14
On 12/14/10 01:12, Chris Wright wrote: > Please send in any agenda items you are interested in covering. > > thanks, > -chris > Chris, Any chance you could fix your cronjob to send out the CFA a day earlier? 15 hrs before is a bit short notice. Cheers, Jes
Re: KVM call agenda for Dec 14
* Chris Wright (chr...@redhat.com) wrote: > Please send in any agenda items you are interested in covering. No agenda, today's call is cancelled.
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
- "Avi Kivity" wrote: > On 12/14/2010 03:40 PM, Glauber Costa wrote: > > > > > > What is the motivation for this? Are there any important guests that > > > use the pmtimer? > > Avi, > > > > All older RHEL and Windows, for example, would benefit for this. > > They only benefit from it because we don't provide HPET. If we did, the > guests would use HPET in preference to pmtimer, since HPET is so much > better than pmtimer (yet still sucks in an absolute sense). > > > > If anything I'd expect hpet or the Microsoft synthetic timers to be a > > > lot more important. > > > > True. But also a lot more work. > > Implementing just the pm timer counter - not the whole of it - in > > kernel, gives us a lot of gain with not very much effort. Patch is > > pretty simple, as you can see, and most of it is even code to turn it > > on/off, etc. > > > > Partial emulation is not something I like since it causes a fuzzy > kernel/user boundary. In this case, transitioning to userspace when > interrupts are enabled doesn't look so hot. Are you sure all guests > that benefit from this don't enable the pmtimer interrupt? What about > the transition? Will we have a time discontinuity when that happens? Avi, the idea is to use the '-kvm-pmtmr' option (in code part 4) only with guests that do not enable the 'timer carry interrupt'. Guests that need to enable the 'timer carry interrupt' should rather use the PM Timer emulation in qemu userspace (i.e. they should not be started with this option). If a guest is accidentally started with this option, the in-kernel PM Timer (in code part 1) detects if the guest attempts to enable the 'timer carry interrupt' and falls back to PM Timer emulation in qemu userspace (in-kernel PM Timer disables itself automatically). So, this is not a combination of in-kernel PM Timer register emulation and qemu userspace PM Timer interrupt emulation. 
Regards, Uli
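To make the fallback described above concrete, here is a minimal sketch of what emulating only the PM Timer register amounts to, plus the "disable itself when the guest enables the timer carry interrupt" rule. It is not code from the RFC. The 3.579545 MHz frequency and the TMR_EN bit (bit 0 of the PM1 enable register) come from the ACPI specification; the 24-bit counter width is an assumption (ACPI also allows 32 bits), and struct pmtmr_sketch and both function names are hypothetical.

```c
#include <stdint.h>

#define PM_TIMER_FREQUENCY 3579545ULL   /* Hz, fixed by the ACPI specification */
#define NSEC_PER_SEC       1000000000ULL
#define PMTMR_MASK         0x00ffffffU  /* assumes the 24-bit counter variant */
#define PM1_EN_TMR_EN      0x0001U      /* TMR_EN, bit 0 of the PM1 enable register */

/* Hypothetical state for this sketch; not a structure from the RFC. */
struct pmtmr_sketch {
    int in_kernel;   /* 1 while the fast path serves PM Timer reads */
};

/* Convert elapsed guest time in ns to the PM Timer counter value. */
uint32_t pmtmr_counter(uint64_t guest_ns)
{
    uint64_t secs = guest_ns / NSEC_PER_SEC;
    uint64_t rem  = guest_ns % NSEC_PER_SEC;
    /* Split the multiplication so the intermediate product cannot overflow. */
    uint64_t ticks = secs * PM_TIMER_FREQUENCY +
                     rem * PM_TIMER_FREQUENCY / NSEC_PER_SEC;

    return (uint32_t)(ticks & PMTMR_MASK);
}

/* A write that sets TMR_EN switches the sketch back to the slow path. */
void pmtmr_pm1_enable_write(struct pmtmr_sketch *s, uint16_t val)
{
    if (val & PM1_EN_TMR_EN)
        s->in_kernel = 0;
}
```

The read path is a pure function of guest time, which is why it is cheap to serve without leaving the kernel. The interrupt side needs a timer armed for the next counter wrap, and that is the part the RFC leaves in qemu userspace.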
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On Tue, Dec 14, 2010 at 03:49:37PM +0200, Avi Kivity wrote: > On 12/14/2010 03:40 PM, Glauber Costa wrote: > >> > >> What is the motivation for this? Are there any important guests that > >> use the pmtimer? > >Avi, > > > >All older RHEL and Windows, for example, would benefit for this. > > They only benefit from it because we don't provide HPET. If we did, > the guests would use HPET in preference to pmtimer, since HPET is so > much better than pmtimer (yet still sucks in an absolute sense). > > >> If anything I'd expect hpet or the Microsoft synthetic timers to be a > >> lot more important. > > > >True. But also a lot more work. > >Implementing just the pm timer counter - not the whole of it - in > >kernel, gives us a lot of gain with not very much effort. Patch is > >pretty simple, as you can see, and most of it is even code to turn it > >on/off, etc. > > > > Partial emulation is not something I like since it causes a fuzzy > kernel/user boundary. In this case, transitioning to userspace when > interrupts are enabled doesn't look so hot. Are you sure all guests > that benefit from this don't enable the pmtimer interrupt? What > about the transition? Will we have a time discontinuity when that > happens? > > What I'd really like to see is this stuff implemented in bytecode, > unfortunately that's a lot of work which will be very hard to > upstream. > Just use ACPI bytecode. It is upstream already. -- Gleb.
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 03:40 PM, Glauber Costa wrote: > > What is the motivation for this? Are there any important guests that > use the pmtimer? Avi, All older RHEL and Windows, for example, would benefit for this. They only benefit from it because we don't provide HPET. If we did, the guests would use HPET in preference to pmtimer, since HPET is so much better than pmtimer (yet still sucks in an absolute sense). > If anything I'd expect hpet or the Microsoft synthetic timers to be a > lot more important. True. But also a lot more work. Implementing just the pm timer counter - not the whole of it - in kernel, gives us a lot of gain with not very much effort. Patch is pretty simple, as you can see, and most of it is even code to turn it on/off, etc. Partial emulation is not something I like since it causes a fuzzy kernel/user boundary. In this case, transitioning to userspace when interrupts are enabled doesn't look so hot. Are you sure all guests that benefit from this don't enable the pmtimer interrupt? What about the transition? Will we have a time discontinuity when that happens? What I'd really like to see is this stuff implemented in bytecode, unfortunately that's a lot of work which will be very hard to upstream. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On Tue, 2010-12-14 at 15:34 +0200, Avi Kivity wrote: > On 12/14/2010 02:09 PM, Ulrich Obergfell wrote: > > Hi, > > > > This is an RFC through which I would like to get feedback on how the > > idea of in-kernel PM Timer would be received. > > > > The current implementation of PM Timer emulation is 'heavy-weight' > > because the code resides in qemu userspace. Guest operating systems > > that use PM Timer as a clock source (for example, older versions of > > Linux that do not have paravirtualized clock) would benefit from an > > in-kernel PM Timer emulation. > > > > Parts 1 thru 4 of this RFC contain experimental source code which > > I recently used to investigate the performance benefit. In a Linux > > guest, I was running a program that calls gettimeofday() 'n' times > > in a loop (the PM Timer register is read during each call). With > > in-kernel PM Timer, I observed a significant reduction of program > > execution time. > > > > The experimental code emulates the PM Timer register in KVM kernel. > > All other components of ACPI PM remain in qemu userspace. Also, the > > 'timer carry interrupt' feature is not implemented in-kernel. If a > > guest operating system needs to enable the 'timer carry interrupt', > > the code takes care that PM Timer emulation falls back to userspace. > > However, I think the design of the code has sufficient flexibility, > > so that anyone who would want to add the 'timer carry interrupt' > > feature in-kernel could try to do so later on. > > > > What is the motivation for this? Are there any important guests that > use the pmtimer? Avi, All older RHEL and Windows, for example, would benefit for this. > If anything I'd expect hpet or the Microsoft synthetic timers to be a > lot more important. True. But also a lot more work. Implementing just the pm timer counter - not the whole of it - in kernel, gives us a lot of gain with not very much effort. 
Patch is pretty simple, as you can see, and most of it is even code to turn it on/off, etc.
Re: [PATCH][unit-tests] fix i386 arch compilation.
On 12/14/2010 02:26 PM, Gleb Natapov wrote: Commit 750bbdb forgot to convert i386 arch to .elf. Applied, thanks. -- error compiling committee.c: too many arguments to function
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
On 12/14/2010 02:09 PM, Ulrich Obergfell wrote: Hi, This is an RFC through which I would like to get feedback on how the idea of in-kernel PM Timer would be received. The current implementation of PM Timer emulation is 'heavy-weight' because the code resides in qemu userspace. Guest operating systems that use PM Timer as a clock source (for example, older versions of Linux that do not have paravirtualized clock) would benefit from an in-kernel PM Timer emulation. Parts 1 thru 4 of this RFC contain experimental source code which I recently used to investigate the performance benefit. In a Linux guest, I was running a program that calls gettimeofday() 'n' times in a loop (the PM Timer register is read during each call). With in-kernel PM Timer, I observed a significant reduction of program execution time. The experimental code emulates the PM Timer register in KVM kernel. All other components of ACPI PM remain in qemu userspace. Also, the 'timer carry interrupt' feature is not implemented in-kernel. If a guest operating system needs to enable the 'timer carry interrupt', the code takes care that PM Timer emulation falls back to userspace. However, I think the design of the code has sufficient flexibility, so that anyone who would want to add the 'timer carry interrupt' feature in-kernel could try to do so later on. What is the motivation for this? Are there any important guests that use the pmtimer? If anything I'd expect hpet or the Microsoft synthetic timers to be a lot more important. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v2 PATCH 2/3] sched: add yield_to function
On Tue, 2010-12-14 at 16:56 +0530, Srivatsa Vaddagiri wrote: > On Tue, Dec 14, 2010 at 12:03:58PM +0100, Mike Galbraith wrote: > > On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote: > > > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote: > > > > > > That part looks ok, except for the yield cross cpu bit. Trying to yield > > > > a resource you don't have doesn't make much sense to me. > > > > > > So another (crazy) idea is to move the "yieldee" task on another cpu over > > > to > > > yielding task's cpu, let it run till the end of yielding tasks slice and > > > then > > > let it go back to the original cpu at the same vruntime position! > > > > Yeah, pulling the intended recipient makes fine sense. If he doesn't > > preempt you, you can try to swap vruntimes or whatever makes arithmetic > > sense and will help. Dunno how you tell him how long he can keep the > > cpu though, > > can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is > preempted at the end of yielding task's timeslice? And dork up accounting. Why? Besides, it won't work because you have no idea who may preempt whom, when, and for how long. (Why do people keep talking about timeslice? The only thing that exists is lag that changes the instant anyone does anything of interest.) > > and him somehow going back home needs to be a plain old > > migration, no fancy restoration of ancient history vruntime. > > What is the issue if it gets queued at the old vruntime (assuming fair stick > is > still behind that)? Without that it will hurt fairness for the yieldee (and > perhaps of the overall VM in this case). Who all are you placing this task in front of or behind based upon a non-existent relationship? Your recipient may well have been preempted, and is now further behind than the completely irrelevant to the current situation stored vruntime would indicate, so why would you want to move it rightward? Certainly not in the interest of fairness. 
-Mike
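The vruntime values argued over in this thread are weighted quantities, not wall-clock time, which is part of why "timeslice" arithmetic does not carry over directly. A minimal sketch, assuming the usual CFS rule that a task's virtual runtime advances by delta_exec * NICE_0_LOAD / weight: the kernel helper for this is calc_delta_fair(), and the sketch ignores its fixed-point arithmetic. The function name wall_to_virtual() is made up, and weights other than 1024 are illustrative only.

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL   /* load weight of a nice-0 task in CFS */

/*
 * Scale a wall-clock runtime delta (ns) into virtual runtime for a
 * task with the given load weight. Heavier tasks accumulate vruntime
 * more slowly, lighter tasks more quickly.
 */
uint64_t wall_to_virtual(uint64_t delta_exec_ns, uint64_t weight)
{
    if (weight == 0)          /* guard for the sketch; real tasks have nonzero weight */
        return delta_exec_ns;
    return delta_exec_ns * NICE_0_LOAD / weight;
}
```

With this rule, 1 ms of CPU time is 1 ms of vruntime for a nice-0 task, but only 0.5 ms for a task with twice the weight. A wall-clock figure therefore cannot be added to or subtracted from a vruntime without that conversion.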
[PATCH][unit-tests] fix i386 arch compilation.
Commit 750bbdb forgot to convert i386 arch to .elf. diff --git a/config-i386.mak b/config-i386.mak index 6dbd19f..c1b6e08 100644 --- a/config-i386.mak +++ b/config-i386.mak @@ -9,4 +9,4 @@ tests = $(TEST_DIR)/taskswitch.flat include config-x86-common.mak -$(TEST_DIR)/taskswitch.flat: $(cstart.o) $(TEST_DIR)/taskswitch.o +$(TEST_DIR)/taskswitch.elf: $(cstart.o) $(TEST_DIR)/taskswitch.o -- Gleb.
[GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations
On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote: > Please merge the following tree for 2.6.38. > Thanks! Rusty Acked it as is, so please pull the below. Thanks very much! > The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0: > > net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800) > > are available in the git repository at: > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next > > Jason Wang (1): > vhost: fix typos in comment > > Julia Lawall (1): > drivers/vhost/vhost.c: delete double assignment > > Michael S. Tsirkin (9): > vhost: put mm after thread stop > vhost-net: batch use/unuse mm > vhost: copy_to_user -> __copy_to_user > vhost: get/put_user -> __get/__put_user > vhost: remove unused include > vhost: correctly set bits of dirty pages > vhost: better variable name in logging > vhost test module > tools/virtio: virtio_test tool > > drivers/vhost/net.c |9 +- > drivers/vhost/test.c | 320 > ++ > drivers/vhost/test.h |7 + > drivers/vhost/vhost.c| 44 +++--- > drivers/vhost/vhost.h|2 +- > tools/virtio/Makefile| 12 ++ > tools/virtio/linux/device.h |2 + > tools/virtio/linux/slab.h|2 + > tools/virtio/linux/virtio.h | 223 +++ > tools/virtio/vhost_test/Makefile |2 + > tools/virtio/vhost_test/vhost_test.c |1 + > tools/virtio/virtio_test.c | 248 ++ > 12 files changed, 842 insertions(+), 30 deletions(-) > create mode 100644 drivers/vhost/test.c > create mode 100644 drivers/vhost/test.h > create mode 100644 tools/virtio/Makefile > create mode 100644 tools/virtio/linux/device.h > create mode 100644 tools/virtio/linux/slab.h > create mode 100644 tools/virtio/linux/virtio.h > create mode 100644 tools/virtio/vhost_test/Makefile > create mode 100644 tools/virtio/vhost_test/vhost_test.c > create mode 100644 tools/virtio/virtio_test.c
Re: [RFC -v2 PATCH 2/3] sched: add yield_to function
On Mon, 2010-12-13 at 22:46 -0500, Rik van Riel wrote: > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 2c79e92..408326f 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1086,6 +1086,8 @@ struct sched_class { > #ifdef CONFIG_FAIR_GROUP_SCHED > void (*task_move_group) (struct task_struct *p, int on_rq); > #endif > + > + void (*yield_to) (struct rq *rq, struct task_struct *p); > }; > > struct load_weight { > @@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long > nice); > extern int task_prio(const struct task_struct *p); > extern int task_nice(const struct task_struct *p); > extern int can_nice(const struct task_struct *p, const int nice); > +extern void requeue_task(struct rq *rq, struct task_struct *p); That definitely doesn't want to be a globally visible symbol. > extern int task_curr(const struct task_struct *p); > extern int idle_cpu(int cpu); > extern int sched_setscheduler(struct task_struct *, int, struct sched_param > *); > @@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, > unsigned int state); > extern int wake_up_process(struct task_struct *tsk); > extern void wake_up_new_task(struct task_struct *tsk, > unsigned long clone_flags); > + > +extern u64 slice_remain(struct task_struct *); idem. > +void yield_to(struct task_struct *p) > +{ > + unsigned long flags; > + struct rq *rq, *p_rq; > + > + local_irq_save(flags); > + rq = this_rq(); > +again: > + p_rq = task_rq(p); > + double_rq_lock(rq, p_rq); > + if (p_rq != task_rq(p)) { > + double_rq_unlock(rq, p_rq); > + goto again; > + } > + > + /* We can't yield to a process that doesn't want to run. */ > + if (!p->se.on_rq) > + goto out; > + > + /* > +* We can only yield to a runnable task, in the same schedule class > +* as the current task, if the schedule class implements > yield_to_task. 
> +*/ > + if (!task_running(rq, p) && current->sched_class == p->sched_class && > + current->sched_class->yield_to) > + current->sched_class->yield_to(rq, p); rq and p don't match, see below. > + > +out: > + double_rq_unlock(rq, p_rq); > + local_irq_restore(flags); > + yield(); That wants to be plain: schedule(), possibly conditional on having called sched_class::yield_to. > +} > +EXPORT_SYMBOL_GPL(yield_to); > +u64 slice_remain(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct cfs_rq *cfs_rq; > + struct rq *rq; > + u64 slice, ran; > + s64 delta; > + > + rq = task_rq_lock(p, &flags); You're calling this from yield_to()->sched_class::yield_to()->yield_to_fair()->slice_remain(), yield_to() already holds p's rq lock. > + cfs_rq = cfs_rq_of(se); > + slice = sched_slice(cfs_rq, se); > + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime; > + delta = slice - ran; > + task_rq_unlock(rq, &flags); > + > + return max(delta, 0LL); > +} Like Mike said, the returned figure doesn't really mean anything, its definitely not the remaining time of a slice. It might qualify for a weak random number generator though.. :-) > +static void yield_to_fair(struct rq *rq, struct task_struct *p) > +{ > + struct sched_entity *se = &p->se; > + struct cfs_rq *cfs_rq = cfs_rq_of(se); > + u64 remain = slice_remain(current); > + > + dequeue_task(rq, p, 0); Here you assume @p lives on @rq, but you passed: + current->sched_class->yield_to(rq, p); and rq = this_rq(), so this will go splat. > + se->vruntime -= remain; You cannot simply subtract wall-time from virtual time, see the usage of calc_delta_fair() in the proposal below. > + if (se->vruntime < cfs_rq->min_vruntime) > + se->vruntime = cfs_rq->min_vruntime; Then clipping it to min_vruntime doesn't make any sense at all. 
> + enqueue_task(rq, p, 0); > + check_preempt_curr(rq, p, 0); > +} Also, modifying the vruntime of one task without also modifying the vruntime of the other task breaks stuff. You're injecting time into p without taking time out of current. Maybe something like: static void yield_to_fair(struct rq *p_rq, struct task_struct *p) { struct rq *rq = this_rq(); struct sched_entity *se = &current->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); struct sched_entity *pse = &p->se; struct cfs_rq *p_cfs_rq = cfs_rq_of(pse); /* * Transfer wakeup_gran worth of time from current to @p, * this should ensure current is no longer eligible to run. */ unsigned long wakeup_gran = ACCESS_ONCE(sysctl_sched_wakeup_granularity); update_rq_clock(rq); update_curr(cfs_rq); if (pse !=
[RFC 4/4] KVM in-kernel PM Timer implementation (experimental code part 4)
experimental code part 4 (qemu userspace) - This code introduces the new qemu command line option '-kvm-pmtmr'. qemu only creates and configures in-kernel PM Timer if this option is specified on the command line. diff -up ./qemu-kvm.c.orig4 ./qemu-kvm.c --- ./qemu-kvm.c.orig4 2010-12-10 10:50:42.857811776 +0100 +++ ./qemu-kvm.c2010-12-10 11:45:23.783748044 +0100 @@ -54,6 +54,9 @@ int kvm_irqchip = 1; int kvm_pit = 1; int kvm_pit_reinject = 1; int kvm_nested = 0; +#ifdef KVM_CAP_PMTMR +int kvm_pmtmr = 0; +#endif KVMState *kvm_state; @@ -186,7 +189,7 @@ int kvm_init(int smp_cpus) kvm_context->no_irqchip_creation = 0; kvm_context->no_pit_creation = 0; #ifdef KVM_CAP_PMTMR -kvm_context->no_pmtmr_creation = 0; +kvm_context->no_pmtmr_creation = 1; #endif #ifdef KVM_CAP_SET_GUEST_DEBUG @@ -241,6 +244,11 @@ void kvm_disable_pit_creation(kvm_contex } #ifdef KVM_CAP_PMTMR +void kvm_enable_pmtmr_creation(kvm_context_t kvm) +{ +kvm->no_pmtmr_creation = 0; +} + void (*kvm_arch_pmtmr_handler)(kvm_context_t kvm); /* * This handler is called by @@ -1654,6 +1662,11 @@ static int kvm_create_context(void) if (!kvm_pit) { kvm_disable_pit_creation(kvm_context); } +#ifdef KVM_CAP_PMTMR +if (kvm_pmtmr) { +kvm_enable_pmtmr_creation(kvm_context); +} +#endif if (kvm_create(kvm_context, 0, NULL) < 0) { kvm_finalize(kvm_state); return -1; diff -up ./qemu-kvm.h.orig4 ./qemu-kvm.h --- ./qemu-kvm.h.orig4 2010-12-10 11:26:43.726790319 +0100 +++ ./qemu-kvm.h2010-12-10 11:47:50.074805792 +0100 @@ -124,6 +124,18 @@ void kvm_disable_irqchip_creation(kvm_co */ void kvm_disable_pit_creation(kvm_context_t kvm); +#ifdef KVM_CAP_PMTMR +/*! + * \brief Enable the in-kernel ACPI PM Timer register creation + * + * In-kernel ACPI PM Timer register is disabled by default. + * If in-kernel is to be used, this should be called prior to kvm_create(). + * + * \param kvm Pointer to the kvm_context + */ +void kvm_enable_pmtmr_creation(kvm_context_t kvm); +#endif + /*! 
* \brief Create new virtual machine * @@ -706,6 +718,9 @@ extern int kvm_irqchip; extern int kvm_pit; extern int kvm_pit_reinject; extern int kvm_nested; +#ifdef KVM_CAP_PMTMR +extern int kvm_pmtmr; +#endif extern kvm_context_t kvm_context; struct ioperm_data { diff -up ./qemu-options.hx.orig4 ./qemu-options.hx --- ./qemu-options.hx.orig4 2010-12-02 15:15:20.0 +0100 +++ ./qemu-options.hx 2010-12-06 11:27:57.273648509 +0100 @@ -2330,6 +2330,9 @@ DEF("no-kvm-pit-reinjection", 0, QEMU_OP QEMU_ARCH_I386) DEF("enable-nesting", 0, QEMU_OPTION_enable_nesting, "-enable-nesting enable support for running a VM inside the VM (AMD only)\n", QEMU_ARCH_I386) +DEF("kvm-pmtmr", 0, QEMU_OPTION_kvm_pmtmr, +"-kvm-pmtmr enable KVM kernel mode ACPI PM Timer register emulation\n", +QEMU_ARCH_I386) DEF("nvram", HAS_ARG, QEMU_OPTION_nvram, "-nvram FILE provide ia64 nvram contents\n", QEMU_ARCH_ALL) DEF("tdf", 0, QEMU_OPTION_tdf, diff -up ./vl.c.orig4 ./vl.c --- ./vl.c.orig42010-12-10 10:34:55.388997058 +0100 +++ ./vl.c 2010-12-10 11:50:20.566810444 +0100 @@ -2474,6 +2474,12 @@ int main(int argc, char **argv, char **e kvm_nested = 1; break; } +#ifdef KVM_CAP_PMTMR + case QEMU_OPTION_kvm_pmtmr: { + kvm_pmtmr = 1; + break; + } +#endif #endif case QEMU_OPTION_usb: usb_enabled = 1; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
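The opt-in flow in part 4 (creation disabled by default, enabled only when '-kvm-pmtmr' is on the command line) can be modelled in isolation as below. This is a simplified stand-alone sketch, not qemu code: struct vm_config and the three functions are hypothetical stand-ins for kvm_context, kvm_init(), the option parsing in vl.c, and kvm_enable_pmtmr_creation().

```c
#include <string.h>

/* Hypothetical stand-in for the relevant part of kvm_context. */
struct vm_config {
    int no_pmtmr_creation;   /* 1 = do not create the in-kernel PM Timer */
};

/* Mirrors the default set at init time: in-kernel PM Timer is off. */
void vm_config_init(struct vm_config *c)
{
    c->no_pmtmr_creation = 1;
}

/* Return 1 if "-kvm-pmtmr" appears among the arguments, else 0. */
int pmtmr_requested(int argc, const char *const argv[])
{
    int i;

    for (i = 1; i < argc; i++)
        if (strcmp(argv[i], "-kvm-pmtmr") == 0)
            return 1;
    return 0;
}

/* Enable creation only when the option was given. */
void vm_config_apply(struct vm_config *c, int requested)
{
    if (requested)
        c->no_pmtmr_creation = 0;
}
```

Keeping the default at "no creation" means existing command lines behave as before, which is the property the RFC relies on. The review comments elsewhere in this thread object to the other side of that choice: the admin has to know whether the guest enables the timer carry interrupt.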
[RFC 3/4] KVM in-kernel PM Timer implementation (experimental code part 3)
experimental code part 3 (qemu userspace)

This code utilizes the new ioctl commands introduced by code part 2.
The KVM_CREATE_PMTMR ioctl command is simply called once when a virtual
machine is being created. However, calling KVM_CONFIGURE_PMTMR is more
challenging because it involves ...

- passing the base address of PM I/O port range to code part 1
- passing the clock offset to code part 1

'timers_state.cpu_clock_offset' gets updated at each vm_start() call.
However, the PM I/O port base address is not available at the first
vm_start() call. So, configuring the in-kernel PM Timer needs to be
postponed until the PIIX4 PCI configuration is initialized. This is
facilitated by the new function kvm_pmtmr_handler() which is called by
vm_start() and by pm_io_space_update().

kvm_pmtmr_handler() calls architecture-specific code through a function
pointer 'kvm_arch_pmtmr_handler'. kvm_pmtmr_handler() is a 'no-op' if an
architecture does not provide or clears this function pointer. The
architecture-specific code is responsible for configuring the in-kernel
PM Timer. The experimental code provides
kvm_arch_configure_pmtmr_wrapper() in qemu-kvm-x86.c.
kvm_arch_create_pmtmr() sets 'kvm_arch_pmtmr_handler' to
'kvm_arch_configure_pmtmr_wrapper' after successful completion of the
KVM_CREATE_PMTMR ioctl command.

kvm_arch_configure_pmtmr_wrapper() requires ACPI PM code to provide a
function pointer 'kvm_arch_get_pm_io_base' through which the PM I/O port
base address can be obtained. kvm_arch_configure_pmtmr_wrapper() is a
'no-op' too if ACPI PM code does not provide or clears this function
pointer. The experimental code provides piix4_get_pm_io_base() in
hw/acpi_piix4.c. pm_io_space_update() sets 'kvm_arch_get_pm_io_base' to
'piix4_get_pm_io_base'.

Consider two scenarios ...
- during virtual machine creation and startup kvm_arch_create kvm_arch_create_pmtmr ioctl(KVM_CREATE_PMTMR) kvm_arch_pmtmr_handler = kvm_arch_configure_pmtmr_wrapper : vm_start kvm_pmtmr_handler kvm_arch_configure_pmtmr_wrapper 'no-op' because kvm_arch_get_pm_io_base not set yet : pm_io_space_update kvm_arch_get_pm_io_base = piix4_get_pm_io_base kvm_pmtmr_handler kvm_arch_configure_pmtmr_wrapper obtain PM I/O port base thru kvm_arch_get_pm_io_base kvm_arch_configure_pmtmr ioctl(KVM_CONFIGURE_PMTMR) - any other vm_start() call, for example after migration vm_start kvm_pmtmr_handler kvm_arch_configure_pmtmr_wrapper obtain PM I/O port base thru kvm_arch_get_pm_io_base kvm_arch_configure_pmtmr ioctl(KVM_CONFIGURE_PMTMR) diff -up ./hw/acpi_piix4.c.orig3 ./hw/acpi_piix4.c --- ./hw/acpi_piix4.c.orig3 2010-12-02 15:15:20.0 +0100 +++ ./hw/acpi_piix4.c 2010-12-10 11:26:53.943753235 +0100 @@ -23,6 +23,7 @@ #include "acpi.h" #include "sysemu.h" #include "range.h" +#include "qemu-kvm.h" //#define DEBUG @@ -80,6 +81,9 @@ typedef struct PIIX4PMState { static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s); +/* for cpu hotadd (and in-kernel PM Timer if KVM_CAP_PMTMR is defined) */ +static PIIX4PMState *global_piix4_pm_state; + #define ACPI_ENABLE 0xf1 #define ACPI_DISABLE 0xf0 @@ -250,6 +254,19 @@ static void acpi_dbg_writel(void *opaque PIIX4_DPRINTF("ACPI: DBG: 0x%08x\n", val); } +#ifdef KVM_CAP_PMTMR +static uint64_t piix4_get_pm_io_base(void) +{ +PIIX4PMState *s = global_piix4_pm_state; +uint32_t pm_io_base; + +pm_io_base = le32_to_cpu(*(uint32_t *)(s->dev.config + 0x40)); +pm_io_base &= 0xffc0; + +return (uint64_t)pm_io_base; +} +#endif + static void pm_io_space_update(PIIX4PMState *s) { uint32_t pm_io_base; @@ -262,6 +279,16 @@ static void pm_io_space_update(PIIX4PMSt PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base); iorange_init(&s->ioport, &pm_iorange_ops, pm_io_base, 64); ioport_register(&s->ioport); +#ifdef KVM_CAP_PMTMR +kvm_arch_get_pm_io_base = 
piix4_get_pm_io_base; +/* + * The base address of the PM I/O port address range is now known. + * The following call is needed to pass the base address to the + * in-kernel PM Timer emulation. Note that 'kvm_arch_get_pm_io_base' + * must be set _before_ this call. + */ +kvm_pmtmr_handler(); +#endif } } @@ -354,14 +381,12 @@ static void piix4_powerdown(void *opaque } } -static PIIX4PMState *global_piix4_pm_state; /* cpu hotadd */ - static int piix4_pm_initfn(PCIDevice *dev) { PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, dev); uint8_t *pci_conf; -/* for cpu hotadd */ +/* for cpu hotadd and in-kernel PM Timer */ global_piix4_pm_state = s; pci_conf = s->dev.config; diff -up ./kvm/include/linux/kvm.h.orig3 ./kvm/include/linux/kvm.h --- ./kvm/include/
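
The deferred configuration described in part 3 can be sketched in plain C as below. This is only an illustration of the two function pointers and the 'no-op' behaviour, not the qemu code: the names prefixed with sim_ or fake_, the counter, and the example base address 0xb000 are assumptions made for the sketch, and the ioctl is replaced by a counter increment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for 'kvm_arch_pmtmr_handler' and 'kvm_arch_get_pm_io_base'.
 * NULL means "not provided (yet)". */
static void (*arch_pmtmr_handler)(void);
static uint64_t (*arch_get_pm_io_base)(void);

static int configure_calls;      /* simulated KVM_CONFIGURE_PMTMR ioctls */
static uint64_t configured_base; /* base passed by the last simulated ioctl */

/* Plays the role of kvm_arch_configure_pmtmr_wrapper(). */
static void configure_pmtmr_wrapper(void)
{
    if (arch_get_pm_io_base == NULL)
        return;                  /* no-op: PM I/O base not known yet */
    configured_base = arch_get_pm_io_base();
    configure_calls++;           /* real code would issue the ioctl here */
}

/* Plays the role of kvm_pmtmr_handler(). */
static void pmtmr_handler(void)
{
    if (arch_pmtmr_handler != NULL)
        arch_pmtmr_handler();
}

/* Hypothetical ACPI PM provider; 0xb000 is an example value only. */
static uint64_t fake_get_pm_io_base(void)
{
    return 0xb000;
}

static void sim_vm_create(void)
{
    /* after a successful KVM_CREATE_PMTMR */
    arch_pmtmr_handler = configure_pmtmr_wrapper;
}

static void sim_vm_start(void)
{
    pmtmr_handler();
}

static void sim_pm_io_space_update(void)
{
    arch_get_pm_io_base = fake_get_pm_io_base; /* must be set before the call */
    pmtmr_handler();
}

/* First scenario: VM creation and startup. Returns the number of
 * configure calls issued so far. */
static int scenario_startup(void)
{
    sim_vm_create();
    sim_vm_start();              /* no-op: base not known yet */
    assert(configure_calls == 0);
    sim_pm_io_space_update();    /* base known: configure now */
    return configure_calls;
}
```

After scenario_startup(), any later sim_vm_start() call (the second scenario, for example after migration) reaches the configure step directly because both pointers are already set.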
[RFC 2/4] KVM in-kernel PM Timer implementation (experimental code part 2)
experimental code part 2 (KVM kernel) - This code introduces two new ioctl commands KVM_CREATE_PMTMR and KVM_CONFIGURE_PMTMR plus the new capability KVM_CAP_PMTMR to the ioctl infrastructure of the KVM kernel. This code utilizes some helper functions introduced by code part 1. diff -up ./arch/x86/include/asm/kvm.h.orig2 ./arch/x86/include/asm/kvm.h --- ./arch/x86/include/asm/kvm.h.orig2 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/include/asm/kvm.h2010-12-10 12:32:47.067686432 +0100 @@ -24,6 +24,7 @@ #define __KVM_HAVE_DEBUGREGS #define __KVM_HAVE_XSAVE #define __KVM_HAVE_XCRS +#define __KVM_HAVE_PMTMR /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 diff -up ./arch/x86/kvm/x86.c.orig2 ./arch/x86/kvm/x86.c --- ./arch/x86/kvm/x86.c.orig2 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/kvm/x86.c2010-12-10 12:24:58.083739549 +0100 @@ -26,6 +26,9 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "x86.h" +#ifdef KVM_CAP_PMTMR +#include "pmtmr.h" +#endif #include #include @@ -1965,6 +1968,9 @@ int kvm_dev_ioctl_check_extension(long e case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: +#ifdef KVM_CAP_PMTMR + case KVM_CAP_PMTMR: +#endif r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3274,6 +3280,7 @@ long kvm_arch_vm_ioctl(struct file *filp struct kvm_pit_state ps; struct kvm_pit_state2 ps2; struct kvm_pit_config pit_config; + struct kvm_pmtmr_config pmtmr_config; } u; switch (ioctl) { @@ -3541,6 +3548,23 @@ long kvm_arch_vm_ioctl(struct file *filp r = 0; break; } +#ifdef KVM_CAP_PMTMR + case KVM_CREATE_PMTMR: { + mutex_lock(&kvm->slots_lock); + r = kvm_create_pmtmr(kvm); + mutex_unlock(&kvm->slots_lock); + break; + } + case KVM_CONFIGURE_PMTMR: { + r = -EFAULT; + if (copy_from_user(&u.pmtmr_config, argp, + sizeof(struct kvm_pmtmr_config))) + goto out; + + r = kvm_configure_pmtmr(kvm, &u.pmtmr_config); + break; + } +#endif default: ; diff -up ./include/linux/kvm.h.orig2 ./include/linux/kvm.h --- ./include/linux/kvm.h.orig2 
2010-12-05 09:35:17.0 +0100 +++ ./include/linux/kvm.h 2010-12-10 12:30:13.677745093 +0100 @@ -140,6 +140,12 @@ struct kvm_pit_config { __u32 pad[15]; }; +/* for KVM_CONFIGURE_PMTMR */ +struct kvm_pmtmr_config { + __u64 pm_io_base; + __s64 clock_offset; +}; + #define KVM_PIT_SPEAKER_DUMMY 1 #define KVM_EXIT_UNKNOWN 0 @@ -541,6 +547,9 @@ struct kvm_ppc_pvinfo { #define KVM_CAP_PPC_GET_PVINFO 57 #define KVM_CAP_PPC_IRQ_LEVEL 58 #define KVM_CAP_ASYNC_PF 59 +#ifdef __KVM_HAVE_PMTMR +#define KVM_CAP_PMTMR 60 +#endif #ifdef KVM_CAP_IRQ_ROUTING @@ -672,6 +681,8 @@ struct kvm_clock_data { #define KVM_XEN_HVM_CONFIG_IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config) #define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define KVM_CREATE_PMTMR _IO(KVMIO, 0x7d) +#define KVM_CONFIGURE_PMTMR _IOW(KVMIO, 0x7e, struct kvm_pmtmr_config) /* Available with KVM_CAP_PIT_STATE2 */ #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2) #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
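
For readers who want to see what the two request numbers in part 2 encode, here is a small sketch. The struct and the numbers 0x7d/0x7e are copied from the RFC diff; the stdint types standing in for __u64/__s64 and the helper make_pmtmr_config() are assumptions for illustration, and nothing here talks to a real /dev/kvm.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define KVMIO 0xAE  /* KVM's ioctl type byte, as in include/linux/kvm.h */

/* Mirrors struct kvm_pmtmr_config from the patch. */
struct kvm_pmtmr_config {
    uint64_t pm_io_base;
    int64_t  clock_offset;
};

/* Request numbers as proposed by the RFC. */
#define KVM_CREATE_PMTMR    _IO(KVMIO, 0x7d)
#define KVM_CONFIGURE_PMTMR _IOW(KVMIO, 0x7e, struct kvm_pmtmr_config)

/* Hypothetical helper: build the argument for KVM_CONFIGURE_PMTMR. */
static struct kvm_pmtmr_config make_pmtmr_config(uint64_t pm_io_base,
                                                 int64_t clock_offset)
{
    struct kvm_pmtmr_config cfg;

    /* part 3 masks the PIIX4 PM base with 0xffc0, so it is 64-byte aligned */
    assert((pm_io_base & 0x3f) == 0);
    cfg.pm_io_base = pm_io_base;
    cfg.clock_offset = clock_offset;
    return cfg;
}
```

KVM_CREATE_PMTMR carries no payload, while KVM_CONFIGURE_PMTMR is a write-direction request whose size field is the 16 bytes of the config struct; that is what the copy_from_user() in the kvm_arch_vm_ioctl() hunk reads.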
[RFC 1/4] KVM in-kernel PM Timer implementation (experimental code part 1)
experimental code part 1 (KVM kernel) - This code introduces the actual emulation of the PM Timer register plus some helper functions to create and configure the in-kernel PM Timer. The emulation utilizes the 'kvm_io_bus' infrastructure. diff -up ./arch/x86/include/asm/kvm_host.h.orig1 ./arch/x86/include/asm/kvm_host.h --- ./arch/x86/include/asm/kvm_host.h.orig1 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/include/asm/kvm_host.h 2010-12-10 12:14:29.282686691 +0100 @@ -459,6 +459,10 @@ struct kvm_arch { /* fields used by HYPER-V emulation */ u64 hv_guest_os_id; u64 hv_hypercall; + +#ifdef KVM_CAP_PMTMR + struct kvm_pmtmr *vpmtmr; +#endif }; struct kvm_vm_stat { diff -up ./arch/x86/kvm/i8254.c.orig1 ./arch/x86/kvm/i8254.c --- ./arch/x86/kvm/i8254.c.orig12010-12-05 09:35:17.0 +0100 +++ ./arch/x86/kvm/i8254.c 2010-12-10 12:09:36.877729064 +0100 @@ -51,7 +51,7 @@ #define RW_STATE_WORD1 4 /* Compute with 96 bit intermediate result: (a*b)/c */ -static u64 muldiv64(u64 a, u32 b, u32 c) +u64 muldiv64(u64 a, u32 b, u32 c) { union { u64 ll; diff -up ./arch/x86/kvm/Makefile.orig1 ./arch/x86/kvm/Makefile --- ./arch/x86/kvm/Makefile.orig1 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/kvm/Makefile 2010-12-10 12:07:14.379811121 +0100 @@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API) += $(addprefix . 
kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ - i8254.o timer.o + i8254.o timer.o pmtmr.o kvm-intel-y+= vmx.o kvm-amd-y += svm.o diff -up ./arch/x86/kvm/pmtmr.c.orig1 ./arch/x86/kvm/pmtmr.c --- ./arch/x86/kvm/pmtmr.c.orig12010-12-10 12:05:39.878691941 +0100 +++ ./arch/x86/kvm/pmtmr.c 2010-12-10 12:06:00.987738524 +0100 @@ -0,0 +1,151 @@ +/* + * in-kernel ACPI PM Timer emulation + * + * Note: 'timer carry interrupt' is not implemented + */ + +#include + +#ifdef KVM_CAP_PMTMR + +#include "pmtmr.h" + +static int emulate_acpi_reg_pmtmr(struct kvm_pmtmr *pmtmr, void *data, int len) +{ + s64 tmp; + u32 running_count; + + if (len != 4) + return -EOPNOTSUPP; + + tmp = ktime_to_ns(ktime_get()) + pmtmr->clock_offset; + running_count = (u32)muldiv64(tmp, KVM_ACPI_PMTMR_FREQ, NSEC_PER_SEC); + *(u32 *)data = running_count & KVM_ACPI_PMTMR_MASK; + +#ifdef KVM_ACPI_PMTMR_STATS + pmtmr->read_count++; +#endif + return 0; +} + +/* + * This function returns true for I/O ports in the range from 'PM base' + * to 'PM Timer' (this range contains the PM1 Status and the PM1 Enable + * registers). 
+ */ +static inline int pmtmr_in_range(struct kvm_pmtmr *pmtmr, gpa_t ioport) +{ + return ((ioport >= pmtmr->pm_io_base) && + (ioport <= pmtmr->pm_io_base + KVM_ACPI_REG_PMTMR)); +} + +static inline struct kvm_pmtmr *dev_to_pmtmr(struct kvm_io_device *dev) +{ +return container_of(dev, struct kvm_pmtmr, dev); +} + +static int pmtmr_ioport_read(struct kvm_io_device *this, +gpa_t ioport, int len, void *data) +{ + struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this); + + if (!pmtmr_in_range(pmtmr, ioport)) + return -EOPNOTSUPP; + + switch (ioport - pmtmr->pm_io_base) { + case KVM_ACPI_REG_PMTMR: + /* emulate PM Timer read if in-kernel emulation is enabled */ + if (pmtmr->state == KVM_PMTMR_STATE_ENABLED) + return(emulate_acpi_reg_pmtmr(pmtmr, data, len)); + + /* fall thru */ + default: + /* let qemu userspace handle everything else */ + return -EOPNOTSUPP; + } +} + +static int pmtmr_ioport_write(struct kvm_io_device *this, + gpa_t ioport, int len, const void *data) +{ + struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this); + + if (!pmtmr_in_range(pmtmr, ioport)) + return -EOPNOTSUPP; + + switch (ioport - pmtmr->pm_io_base) { + case KVM_ACPI_REG_PMTMR: + /* ignore PM Timer write */ + return 0; + case KVM_ACPI_REG_PMEN: + if (len == 2) { + u16 val = *(u16 *)data; + /* +* Fall back to qemu userspace PM Timer emulation if +* the VM sets the 'timer carry interrupt enable' bit +* in the PM1 Enable register. +*/ + if (val & KVM_ACPI_PMTMR_TMR_EN) + /* disable in-kernel PM Timer emulation */ + pmtmr->state = KVM_PMTMR_STATE_DISABLED; + } + /* fall thru */ + default: + /* let qemu userspace handle everything
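
The arithmetic in emulate_acpi_reg_pmtmr() can be tried out in userspace with the sketch below. It assumes the ACPI PM Timer frequency of 3.579545 MHz and a 24-bit counter mask; the patch takes both from pmtmr.h (KVM_ACPI_PMTMR_FREQ, KVM_ACPI_PMTMR_MASK), which is not shown in this mail, so the constants here are assumptions. The muldiv64() below follows the same (a*b)/c idea as the i8254.c helper but is written independently.

```c
#include <assert.h>
#include <stdint.h>

#define PMTMR_FREQ_HZ 3579545u    /* ACPI PM Timer frequency */
#define NSEC_PER_SEC  1000000000u
#define PMTMR_MASK_24 0x00ffffffu /* assumption: 24-bit counter */

/* (a * b) / c computed with a 96-bit intermediate result. */
static uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
{
    uint64_t lo = (a & 0xffffffffu) * b;      /* low 32 bits of a times b */
    uint64_t hi = (a >> 32) * b + (lo >> 32); /* high part plus carry */

    return ((hi / c) << 32) +
           ((((hi % c) << 32) + (lo & 0xffffffffu)) / c);
}

/* Counter value for a host time in ns plus the configured offset. */
static uint32_t pmtmr_count(int64_t now_ns, int64_t clock_offset_ns)
{
    int64_t t = now_ns + clock_offset_ns;

    assert(t >= 0); /* the sketch does not model negative guest time */
    return (uint32_t)muldiv64((uint64_t)t, PMTMR_FREQ_HZ, NSEC_PER_SEC)
           & PMTMR_MASK_24;
}
```

With these constants one second of guest time corresponds to 3579545 ticks, and the 24-bit counter wraps after 16777216 ticks, i.e. a little under 4.7 seconds.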
[RFC 0/4] KVM in-kernel PM Timer implementation
Hi,

This is an RFC through which I would like to get feedback on how the
idea of in-kernel PM Timer would be received.

The current implementation of PM Timer emulation is 'heavy-weight'
because the code resides in qemu userspace. Guest operating systems
that use PM Timer as a clock source (for example, older versions of
Linux that do not have paravirtualized clock) would benefit from an
in-kernel PM Timer emulation.

Parts 1 thru 4 of this RFC contain experimental source code which
I recently used to investigate the performance benefit. In a Linux
guest, I was running a program that calls gettimeofday() 'n' times
in a loop (the PM Timer register is read during each call). With
in-kernel PM Timer, I observed a significant reduction of program
execution time.

The experimental code emulates the PM Timer register in KVM kernel.
All other components of ACPI PM remain in qemu userspace. Also, the
'timer carry interrupt' feature is not implemented in-kernel. If a
guest operating system needs to enable the 'timer carry interrupt',
the code takes care that PM Timer emulation falls back to userspace.
However, I think the design of the code has sufficient flexibility,
so that anyone who would want to add the 'timer carry interrupt'
feature in-kernel could try to do so later on.

Please review and please comment.

Regards,

Uli Obergfell
Re: USB Passthrough 1.1 performance problem...
On Tue, Dec 14, 2010 at 12:55:04PM +0100, Kenni Lund wrote: > 2010/12/14 Erik Brakkee : > >> From: Kenni Lund > >> 2010/12/14 Erik Brakkee : > > From: Kenni Lund > > > > Does this mean I have a chance now that PCI passthrough of my WinTV > > PVR-500 > > might work now? > > Passthrough of a PVR-500 has been working for a long time. I've been > running with passthrough of a PVR-500 in my HTPC, since > November/December 2009...so it should work with any recent kernel and > any recent version of qemu-kvm you can find today - No patching > needed. The only issue I had with the PVR-500 card, was when *I* > didn't free up the shared interrupts...once I fixed that, it "just > worked". > >>> > >>> How did you free up those shared interrupts then? I tried different slots > >>> but always get conflicts with the USB irqs. > >> > >> I did an unbind of the conflicting device (eg. disabled it). I moved > >> the PVR-500 card around in the different slots and once I got a > >> conflict with the integrated sound card, I left the PVR-500 card in > >> that slot (it's a headless machine, so no need for sound) and > >> configured unbind of the sound card at boot time. On my old system I > >> think it was conflicting with one of the USB controllers as well, but > >> it didn't really matter, as I only lost a few of the ports on the back > >> of the computer for that particular USB controller - I still had > >> plenty of USB ports left and if I really needed more ports, I could > >> just plug in an extra USB PCI card. 
> >>
> >> My /etc/rc.local boot script looks like the following today:
> >> --
> >> #Remove HDA conflicting with ivtv1
> >> echo ":00:1b.0" > /sys/bus/pci/drivers/HDA\ Intel/unbind
> >>
> >> # ivtv0
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
> >> echo ":04:08.0" > /sys/bus/pci/drivers/ivtv/unbind
> >> echo ":04:08.0" > /sys/bus/pci/drivers/pci-stub/bind
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
> >>
> >> # ivtv1
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
> >> echo ":04:09.0" > /sys/bus/pci/drivers/ivtv/unbind
> >> echo ":04:09.0" > /sys/bus/pci/drivers/pci-stub/bind
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
> >
> > I did not try unbinding the usb device so I can also try that.
> >
> > I don't understand what is happening with the 0016. I configured the
> > pci card in kvm and I believe kvm does the binding to pci-stub in recent
> > versions. Where is the 0016 coming from?
>
> Okay, qemu-kvm might do it today, I don't know - I haven't changed
> that script for the past year. But are you sure that it's not
> libvirt/virsh/virt-manager which does that for you?

If you use the managed="yes" attribute on the <hostdev> element in the
libvirt XML, then libvirt will automatically do the pci-stub bind/unbind,
followed by a device reset at guest startup & the reverse at shutdown.
If you have conflicting devices on the bus though, libvirt won't attempt
to unbind them, unless you had also explicitly assigned all those
conflicting devices to the same guest.

Daniel
Re: USB Passthrough 1.1 performance problem...
2010/12/14 Erik Brakkee : >> From: Kenni Lund >> 2010/12/14 Erik Brakkee : From: Kenni Lund > > Does this mean I have a chance now that PCI passthrough of my WinTV > PVR-500 > might work now? Passthrough of a PVR-500 has been working for a long time. I've been running with passthrough of a PVR-500 in my HTPC, since November/December 2009...so it should work with any recent kernel and any recent version of qemu-kvm you can find today - No patching needed. The only issue I had with the PVR-500 card, was when *I* didn't free up the shared interrupts...once I fixed that, it "just worked". >>> >>> How did you free up those shared interrupts then? I tried different slots >>> but always get conflicts with the USB irqs. >> >> I did an unbind of the conflicting device (eg. disabled it). I moved >> the PVR-500 card around in the different slots and once I got a >> conflict with the integrated sound card, I left the PVR-500 card in >> that slot (it's a headless machine, so no need for sound) and >> configured unbind of the sound card at boot time. On my old system I >> think it was conflicting with one of the USB controllers as well, but >> it didn't really matter, as I only lost a few of the ports on the back >> of the computer for that particular USB controller - I still had >> plenty of USB ports left and if I really needed more ports, I could >> just plug in an extra USB PCI card. 
>>
>> My /etc/rc.local boot script looks like the following today:
>> --
>> #Remove HDA conflicting with ivtv1
>> echo ":00:1b.0" > /sys/bus/pci/drivers/HDA\ Intel/unbind
>>
>> # ivtv0
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
>> echo ":04:08.0" > /sys/bus/pci/drivers/ivtv/unbind
>> echo ":04:08.0" > /sys/bus/pci/drivers/pci-stub/bind
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
>>
>> # ivtv1
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
>> echo ":04:09.0" > /sys/bus/pci/drivers/ivtv/unbind
>> echo ":04:09.0" > /sys/bus/pci/drivers/pci-stub/bind
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
>
> I did not try unbinding the usb device so I can also try that.
>
> I don't understand what is happening with the 0016. I configured the
> pci card in kvm and I believe kvm does the binding to pci-stub in recent
> versions. Where is the 0016 coming from?

Okay, qemu-kvm might do it today, I don't know - I haven't changed
that script for the past year. But are you sure that it's not
libvirt/virsh/virt-manager which does that for you?

Anyway, it's coming from lspci -n. See the wiki page:
http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM

I can't remember why I run remove_id at the end; it's probably unneeded
(but it works, so I don't touch it).

Best regards
Kenni
Re: [PATCH 2/3] Refactor zone_reclaim (v2)
* MinChan Kim [2010-12-14 19:01:26]: > Hi Balbir, > > On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh > wrote: > > Move reusable functionality outside of zone_reclaim. > > Make zone_reclaim_unmapped_pages modular > > > > Signed-off-by: Balbir Singh > > --- > > mm/vmscan.c | 35 +++ > > 1 files changed, 23 insertions(+), 12 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index e841cae..4e2ad05 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone > > *zone) > > } > > > > /* > > + * Helper function to reclaim unmapped pages, we might add something > > + * similar to this for slab cache as well. Currently this function > > + * is shared with __zone_reclaim() > > + */ > > +static inline void > > +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, > > + unsigned long nr_pages) > > +{ > > + int priority; > > + /* > > + * Free memory by calling shrink zone with increasing > > + * priorities until we have enough memory freed. > > + */ > > + priority = ZONE_RECLAIM_PRIORITY; > > + do { > > + shrink_zone(priority, zone, sc); > > + priority--; > > + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); > > +} > > As I said previous version, zone_reclaim_unmapped_pages doesn't have > any functions related to reclaim unmapped pages. The scan control point has the right arguments for implementing reclaim of unmapped pages. > The function name is rather strange. > It would be better to add scan_control setup in function inner to > reclaim only unmapped pages. > > -- > Kind regards, > Minchan Kim -- Three Cheers, Balbir -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v2 PATCH 2/3] sched: add yield_to function
On Tue, Dec 14, 2010 at 12:03:58PM +0100, Mike Galbraith wrote: > On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote: > > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote: > > > > That part looks ok, except for the yield cross cpu bit. Trying to yield > > > a resource you don't have doesn't make much sense to me. > > > > So another (crazy) idea is to move the "yieldee" task on another cpu over > > to > > yielding task's cpu, let it run till the end of yielding tasks slice and > > then > > let it go back to the original cpu at the same vruntime position! > > Yeah, pulling the intended recipient makes fine sense. If he doesn't > preempt you, you can try to swap vruntimes or whatever makes arithmetic > sense and will help. Dunno how you tell him how long he can keep the > cpu though, can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is preempted at the end of yielding task's timeslice? > and him somehow going back home needs to be a plain old > migration, no fancy restoration of ancient history vruntime. What is the issue if it gets queued at the old vruntime (assuming fair stick is still behind that)? Without that it will hurt fairness for the yieldee (and perhaps of the overall VM in this case). - vatsa -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v2 PATCH 2/3] sched: add yield_to function
On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote: > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote: > > That part looks ok, except for the yield cross cpu bit. Trying to yield > > a resource you don't have doesn't make much sense to me. > > So another (crazy) idea is to move the "yieldee" task on another cpu over to > yielding task's cpu, let it run till the end of yielding tasks slice and then > let it go back to the original cpu at the same vruntime position! Yeah, pulling the intended recipient makes fine sense. If he doesn't preempt you, you can try to swap vruntimes or whatever makes arithmetic sense and will help. Dunno how you tell him how long he can keep the cpu though, and him somehow going back home needs to be a plain old migration, no fancy restoration of ancient history vruntime. -Mike -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Provide control over unmapped pages (v2)
On Fri, Dec 10, 2010 at 11:32 PM, Balbir Singh wrote: > Changelog v2 > 1. Use a config option to enable the code (Andrew Morton) > 2. Explain the magic tunables in the code or at-least attempt > to explain them (General comment) > 3. Hint uses of the boot parameter with unlikely (Andrew Morton) > 4. Use better names (balanced is not a good naming convention) > 5. Updated Documentation/kernel-parameters.txt (Andrew Morton) > > Provide control using zone_reclaim() and a boot parameter. The > code reuses functionality from zone_reclaim() to isolate unmapped > pages and reclaim them as a priority, ahead of other mapped pages. > > Signed-off-by: Balbir Singh > --- > Documentation/kernel-parameters.txt | 8 +++ > include/linux/swap.h | 21 ++-- > init/Kconfig | 12 > kernel/sysctl.c | 2 + > mm/page_alloc.c | 9 +++ > mm/vmscan.c | 97 > +++ > 6 files changed, 142 insertions(+), 7 deletions(-) > > diff --git a/Documentation/kernel-parameters.txt > b/Documentation/kernel-parameters.txt > index dd8fe2b..f52b0bd 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined > in the file > [X86] > Set unknown_nmi_panic=1 early on boot. > > + unmapped_page_control > + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL > + is enabled. It controls the amount of unmapped memory > + that is present in the system. This boot option plus > + vm.min_unmapped_ratio (sysctl) provide granular > control > + over how much unmapped page cache can exist in the > system > + before kswapd starts reclaiming unmapped page cache > pages. > + > usbcore.autosuspend= > [USB] The autosuspend time delay (in seconds) used > for newly-detected USB devices (default 2). 
This > diff --git a/include/linux/swap.h b/include/linux/swap.h > index ac5c06e..773d7e5 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -253,19 +253,32 @@ extern int vm_swappiness; > extern int remove_mapping(struct address_space *mapping, struct page *page); > extern long vm_total_pages; > > +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) > extern int sysctl_min_unmapped_ratio; > extern int zone_reclaim(struct zone *, gfp_t, unsigned int); > -#ifdef CONFIG_NUMA > -extern int zone_reclaim_mode; > -extern int sysctl_min_slab_ratio; > #else > -#define zone_reclaim_mode 0 > static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int > order) > { > return 0; > } > #endif > > +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) > +extern bool should_reclaim_unmapped_pages(struct zone *zone); > +#else > +static inline bool should_reclaim_unmapped_pages(struct zone *zone) > +{ > + return false; > +} > +#endif > + > +#ifdef CONFIG_NUMA > +extern int zone_reclaim_mode; > +extern int sysctl_min_slab_ratio; > +#else > +#define zone_reclaim_mode 0 > +#endif > + > extern int page_evictable(struct page *page, struct vm_area_struct *vma); > extern void scan_mapping_unevictable_pages(struct address_space *); > > diff --git a/init/Kconfig b/init/Kconfig > index 3eb22ad..78c9169 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -782,6 +782,18 @@ endif # NAMESPACES > config MM_OWNER > bool > > +config UNMAPPED_PAGECACHE_CONTROL > + bool "Provide control over unmapped page cache" > + default n > + help > + This option adds support for controlling unmapped page cache > + via a boot parameter (unmapped_page_control). The boot parameter > + with sysctl (vm.min_unmapped_ratio) control the total number > + of unmapped pages in the system. This feature is useful if > + you want to limit the amount of unmapped page cache or want > + to reduce page cache duplication in a virtualized environment. 
> + If unsure say 'N' > + > config SYSFS_DEPRECATED > bool "enable deprecated sysfs features to support old userspace tools" > depends on SYSFS > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index e40040e..ab2c60a 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = { > .extra1 = &zero, > }, > #endif > +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) > { > .procname = "min_unmapped_ratio", > .data = &sysctl_min_unmapped_ratio, > @@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = { > .extra1 = &zero, > .extra2
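
As a reading aid for the Kconfig and kernel-parameters text quoted above, the threshold idea can be sketched as follows. The hunk that implements should_reclaim_unmapped_pages() is not part of the quoted text here, so the comparison below is an assumption; it only shows how a percentage sysctl such as vm.min_unmapped_ratio translates into a page count for a zone. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: pages corresponding to ratio_percent of a zone. */
static unsigned long min_unmapped_pages(unsigned long zone_pages,
                                        int ratio_percent)
{
    assert(ratio_percent >= 0 && ratio_percent <= 100);
    return zone_pages * (unsigned long)ratio_percent / 100;
}

/* Assumed policy: reclaim once unmapped page cache exceeds the threshold. */
static bool over_unmapped_threshold(unsigned long unmapped_pages,
                                    unsigned long zone_pages,
                                    int ratio_percent)
{
    return unmapped_pages > min_unmapped_pages(zone_pages, ratio_percent);
}
```

For a zone of 1000000 pages and a ratio of 1, the threshold is 10000 pages; whether the patch compares with '>' or '>=' cannot be told from the quoted part.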
Re: [RFC -v2 PATCH 2/3] sched: add yield_to function
On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote: > > +/* > > + * Yield the CPU, giving the remainder of our time slice to task p. > > + * Typically used to hand CPU time to another thread inside the same > > + * process, eg. when p holds a resource other threads are waiting for. > > + * Giving priority to p may help get that resource released sooner. > > + */ > > +void yield_to(struct task_struct *p) > > +{ > > + unsigned long flags; > > + struct rq *rq, *p_rq; > > + > > + local_irq_save(flags); > > + rq = this_rq(); > > +again: > > + p_rq = task_rq(p); > > + double_rq_lock(rq, p_rq); > > + if (p_rq != task_rq(p)) { > > + double_rq_unlock(rq, p_rq); > > + goto again; > > + } > > + > > + /* We can't yield to a process that doesn't want to run. */ > > + if (!p->se.on_rq) > > + goto out; > > + > > + /* > > +* We can only yield to a runnable task, in the same schedule class > > +* as the current task, if the schedule class implements yield_to_task. > > +*/ > > + if (!task_running(rq, p) && current->sched_class == p->sched_class && > > + current->sched_class->yield_to) > > + current->sched_class->yield_to(rq, p); > > + > > +out: > > + double_rq_unlock(rq, p_rq); > > + local_irq_restore(flags); > > + yield(); > > +} > > +EXPORT_SYMBOL_GPL(yield_to); > > That part looks ok, except for the yield cross cpu bit. Trying to yield > a resource you don't have doesn't make much sense to me. So another (crazy) idea is to move the "yieldee" task on another cpu over to yielding task's cpu, let it run till the end of yielding tasks slice and then let it go back to the original cpu at the same vruntime position! - vatsa -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: USB Passthrough 1.1 performance problem...
On 14.12.2010 at 11:02, Avi Kivity wrote:

> On 12/13/2010 10:25 AM, Alexander Graf wrote:
>>
>> > Is your point in this case that USB in a VM based on PCI passthrough will
>> > always have problems when it comes to more real-time issues or does this
>> > only apply to USB passthrough? I can imagine that PCI passthrough is
>> > better since it uses hardware support. By the way, I have seen issues in
>> > the past whereby the tv card stopped working because of high load on the
>> > server running natively so real-time issues also exist apart from
>> > virtualization.
>>
>> IIRC the reason that PCI passthrough with EHCI performs as badly as it does
>> is that BARs < 4k get passed through using the slow path (trap to qemu,
>> issue MMIO in user space). Unfortunately, EHCI seems to have a 256 byte BAR
>> region usually that is used for some handshaking:
>>
>> 00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller (prog-if 20 [EHCI])
>>     Subsystem: ATI Technologies Inc SB700/SB800 USB EHCI Controller
>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>     Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium>TAbort-SERR-
>>     Latency: 64, Cache Line Size: 64 bytes
>>     Interrupt: pin B routed to IRQ 17
>>     Region 0: Memory at c8014400 (32-bit, non-prefetchable) [size=256]
>>
>
> That could certainly be optimized. If the BAR is all alone in its page, both
> on guest and host (if not, we can migrate it, at least on the host), we can
> use the same offset within the page on the host as it appears on the guest,
> and assign the entire page.
>
> We should make sure SeaBIOS uses a minimum alignment of 4k for mmio BARs.

Yep, I agree :). Back when I tried that, it seemed rather hard to change
BAR mappings after init from user space. But it's certainly a thing the
vfio stuff could easily tackle!

Alex
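To make the page arithmetic behind this idea concrete, here is a small sketch, assuming 4 KiB pages. It checks whether a sub-page BAR stays within a single page and computes where it would appear in the guest if the whole host page is assigned and the in-page offset is kept. The helper names (page_base, page_offset, bar_fits_in_one_page, guest_bar_addr) are made up for illustration. Whether other devices share the same page cannot be decided from one BAR alone, so that check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000u   /* assumption: 4 KiB pages on host and guest */

/* Start address of the page containing addr. */
static uint64_t page_base(uint64_t addr)
{
    return addr & ~(uint64_t)(PAGE_SIZE_4K - 1);
}

/* Offset of addr within its page. */
static uint64_t page_offset(uint64_t addr)
{
    return addr & (uint64_t)(PAGE_SIZE_4K - 1);
}

/* True if the BAR [addr, addr + size) does not cross a page boundary. */
static bool bar_fits_in_one_page(uint64_t addr, uint64_t size)
{
    assert(size > 0);
    return page_base(addr) == page_base(addr + size - 1);
}

/*
 * Guest address of the BAR when the host page is mapped at guest_page
 * and the BAR keeps the same offset within the page as on the host.
 */
static uint64_t guest_bar_addr(uint64_t guest_page, uint64_t host_bar)
{
    return page_base(guest_page) + page_offset(host_bar);
}
```

For the EHCI BAR in the lspci output above (c8014400, size 256), the offset within the page is 0x400 and the BAR fits in one page, so mapping the page at a hypothetical guest address 0xf2000000 would place the BAR at 0xf2000400.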
Re: USB Passthrough 1.1 performance problem...
On 12/13/2010 10:25 AM, Alexander Graf wrote:
> > Is your point in this case that USB in a VM based on PCI passthrough will
> > always have problems when it comes to more real-time issues or does this
> > only apply to USB passthrough? I can imagine that PCI passthrough is
> > better since it uses hardware support. By the way, I have seen issues in
> > the past whereby the tv card stopped working because of high load on the
> > server running natively so real-time issues also exist apart from
> > virtualization.
>
> IIRC the reason that PCI passthrough with EHCI performs as badly as it does
> is that BARs < 4k get passed through using the slow path (trap to qemu,
> issue MMIO in user space). Unfortunately, EHCI seems to have a 256 byte BAR
> region usually that is used for some handshaking:
>
> 00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller (prog-if 20 [EHCI])
>     Subsystem: ATI Technologies Inc SB700/SB800 USB EHCI Controller
>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>     Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium>TAbort-SERR-

That could certainly be optimized. If the BAR is all alone in its page,
both on guest and host (if not, we can migrate it, at least on the host),
we can use the same offset within the page on the host as it appears on
the guest, and assign the entire page.

We should make sure SeaBIOS uses a minimum alignment of 4k for mmio BARs.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 2/3] Refactor zone_reclaim (v2)
Hi Balbir,

On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh wrote:
> Move reusable functionality outside of zone_reclaim.
> Make zone_reclaim_unmapped_pages modular
>
> Signed-off-by: Balbir Singh
> ---
>  mm/vmscan.c | 35 +++
>  1 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e841cae..4e2ad05 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
>  }
>
>  /*
> + * Helper function to reclaim unmapped pages, we might add something
> + * similar to this for slab cache as well. Currently this function
> + * is shared with __zone_reclaim()
> + */
> +static inline void
> +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
> +                               unsigned long nr_pages)
> +{
> +       int priority;
> +       /*
> +        * Free memory by calling shrink zone with increasing
> +        * priorities until we have enough memory freed.
> +        */
> +       priority = ZONE_RECLAIM_PRIORITY;
> +       do {
> +               shrink_zone(priority, zone, sc);
> +               priority--;
> +       } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
> +}

As I said for the previous version, zone_reclaim_unmapped_pages doesn't
do anything specific to reclaiming unmapped pages, so the function name
is rather strange.
It would be better to set up scan_control inside the function so that it
reclaims only unmapped pages.

-- 
Kind regards,
Minchan Kim
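One way to read Minchan's suggestion is that the helper should own the scan_control setup, so that its name matches what it does. The following userspace sketch models that shape under stated assumptions: struct zone and shrink_zone() are toy stand-ins (the mock frees "unmapped pages >> priority" per pass and ignores the may_* flags), and reclaim_unmapped_pages() is a hypothetical name. The may_unmap, may_swap and may_writepage fields and the value 4 for ZONE_RECLAIM_PRIORITY follow my understanding of mm/vmscan.c at the time and should be checked against the tree.

```c
#include <assert.h>

#define ZONE_RECLAIM_PRIORITY 4   /* assumed to match mm/vmscan.c */

/* Toy stand-in: only tracks how many unmapped pages are left. */
struct zone {
    unsigned long unmapped_pages;
};

/* Reduced scan_control holding only the fields this sketch uses. */
struct scan_control {
    unsigned long nr_reclaimed;
    int may_unmap;
    int may_swap;
    int may_writepage;
};

/* Mock shrink_zone(): frees more pages as the priority value drops. */
static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
    unsigned long freed = zone->unmapped_pages >> priority;

    zone->unmapped_pages -= freed;
    sc->nr_reclaimed += freed;
}

/*
 * Hypothetical variant of the helper: the scan_control is built here,
 * restricted to unmapped pages, instead of being passed in by the caller.
 * Returns the number of pages reclaimed.
 */
static unsigned long reclaim_unmapped_pages(struct zone *zone,
                                            unsigned long nr_pages)
{
    struct scan_control sc = {
        .nr_reclaimed = 0,
        .may_unmap = 0,      /* leave mapped pages alone */
        .may_swap = 0,       /* no swapping */
        .may_writepage = 0,  /* no writeback */
    };
    int priority = ZONE_RECLAIM_PRIORITY;

    assert(zone != 0);
    do {
        shrink_zone(priority, zone, &sc);
        priority--;
    } while (priority >= 0 && sc.nr_reclaimed < nr_pages);

    return sc.nr_reclaimed;
}
```

With this shape the caller only states how many pages it wants, and the restriction to unmapped pages lives in one place.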
[PATCH] kvm,x86: return true when user space query KVM_CAP_USER_NMI extension
Userspace may check this extension at runtime.

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cdac9e5..3d6b9ec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1909,6 +1909,7 @@ int kvm_dev_ioctl_check_extension(long ext)
        case KVM_CAP_NOP_IO_DELAY:
        case KVM_CAP_MP_STATE:
        case KVM_CAP_SYNC_MMU:
+       case KVM_CAP_USER_NMI:
        case KVM_CAP_REINJECT_CONTROL:
        case KVM_CAP_IRQ_INJECT_STATUS:
        case KVM_CAP_ASSIGN_DEV_IRQ:
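For context, this is roughly how userspace would perform the runtime check that the patch makes succeed. It is a minimal sketch, assuming a Linux host with /dev/kvm and kernel headers that define KVM_CAP_USER_NMI. The function name kvm_has_user_nmi() is made up for illustration; KVM_CHECK_EXTENSION is the standard ioctl on the /dev/kvm file descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the kernel whether KVM_CAP_USER_NMI is available.
 * Returns 1 if supported, 0 if not, and -1 if /dev/kvm cannot be
 * opened or the ioctl itself fails.
 */
static int kvm_has_user_nmi(void)
{
    int fd = open("/dev/kvm", O_RDWR);
    int ret;

    if (fd < 0)
        return -1;

    /* KVM_CHECK_EXTENSION returns 0 for "unsupported", > 0 otherwise. */
    ret = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_USER_NMI);
    close(fd);

    if (ret < 0)
        return -1;
    return ret > 0;
}
```

A VMM would typically call this once at startup and only use NMI injection from userspace when it returns 1.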
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Rik van Riel [2010-12-13 12:02:51]:

> On 12/11/2010 08:57 AM, Balbir Singh wrote:
>
> >If the vcpu holding the lock runs more and capped, the timeslice
> >transfer is a heuristic that will not help.
>
> That indicates you really need the cap to be per guest, and
> not per VCPU.
>

Yes, I personally think so too, but I suspect there needs to be a
larger agreement on the semantics. The VCPU semantics in terms of
power apply to each VCPU as opposed to the entire system (per guest).

> Having one VCPU spin on a lock (and achieve nothing), because
> the other one cannot give up the lock due to hitting its CPU
> cap could lead to showstoppingly bad performance.

Yes, that seems right!

-- 
Three Cheers,
Balbir
Re: [PATCH v4 2/2] RAM API: Make use of it for x86 PC
On 12/13/2010 11:24 PM, Alex Williamson wrote:
> Register the actual VM RAM using the new API
>
> @@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
>      /* allocate RAM */
>      ram_addr = qemu_ram_alloc(NULL, "pc.ram",
>                                below_4g_mem_size + above_4g_mem_size);
> -    cpu_register_physical_memory(0, 0xa0000, ram_addr);
> -    cpu_register_physical_memory(0x100000,
> -                                 below_4g_mem_size - 0x100000,
> -                                 ram_addr + 0x100000);
> +    ram_register(0, below_4g_mem_size, ram_addr);

What's the impact of this? Won't it conflict with BIOS memory
registration? What about VGA?

In terms of patch hygiene, it should be in a separate patch titled
"register 0xa0000-0x100000 as RAM" or something. It's a much more
drastic change than making use of the new RAM API.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable
On 12/14/2010 03:53 AM, Xiao Guangrong wrote:
> > I just sent a patch to fix this in a different way, please review it.
>
> Your patch is good for me, please ignore this one :-)
>
> Umm, do we need to move "access &= ~ACC_WRITE_MASK" into set_spte() so
> that we can remove the same code in the caller?

I guess set_spte() is the better place for this.

-- 
error compiling committee.c: too many arguments to function
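As a rough illustration of what moving the mask into set_spte() would mean, here is a heavily reduced sketch. It is not the KVM function: set_spte_access() is a hypothetical helper that only models the access bits, and the ACC_* values are assumptions modelled on my reading of the KVM MMU headers. The point is that the callee clears write access when the mapping is not writable, so callers no longer need to do it themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout, for illustration only. */
#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK (1u << 1)
#define ACC_USER_MASK  (1u << 2)
#define ACC_ALL        (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

/*
 * Hypothetical reduction of set_spte(): returns the access bits that
 * would end up in the shadow PTE. Write access is dropped here, in one
 * place, whenever the host mapping is not writable.
 */
static unsigned int set_spte_access(unsigned int access, bool map_writable)
{
    assert((access & ~ACC_ALL) == 0);

    if (!map_writable)
        access &= ~ACC_WRITE_MASK;

    return access;
}
```

A caller would then pass its unmodified access bits together with map_writable and rely on the callee for the masking.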
Re: [PATCH v3 0/4] KVM & genirq: Enable adaptive IRQ sharing for passed-through devices
On 12/14/2010 12:59 AM, Jan Kiszka wrote:
> Final but critical question: Who will pick up which bits?

The procedure which has served us well in the past is that tip picks up
the irq stuff and sticks them in a fast-forward-only branch; kvm merges
the branch and applies the kvm bits on top.

-- 
error compiling committee.c: too many arguments to function