Re: [PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-14 Thread Balbir Singh
* MinChan Kim  [2010-12-14 19:01:26]:

> Hi Balbir,
> 
> On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh
>  wrote:
> > Move reusable functionality outside of zone_reclaim.
> > Make zone_reclaim_unmapped_pages modular
> >
> > Signed-off-by: Balbir Singh 
> > ---
> >  mm/vmscan.c |   35 +++
> >  1 files changed, 23 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e841cae..4e2ad05 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
> >  }
> >
> >  /*
> > + * Helper function to reclaim unmapped pages, we might add something
> > + * similar to this for slab cache as well. Currently this function
> > + * is shared with __zone_reclaim()
> > + */
> > +static inline void
> > +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
> > +                               unsigned long nr_pages)
> > +{
> > +       int priority;
> > +       /*
> > +        * Free memory by calling shrink zone with increasing
> > +        * priorities until we have enough memory freed.
> > +        */
> > +       priority = ZONE_RECLAIM_PRIORITY;
> > +       do {
> > +               shrink_zone(priority, zone, sc);
> > +               priority--;
> > +       } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
> > +}
> 
> As I said previous version, zone_reclaim_unmapped_pages doesn't have
> any functions related to reclaim unmapped pages.
> The function name is rather strange.
> It would be better to add scan_control setup in function inner to
> reclaim only unmapped pages.

OK, that is an idea worth looking at, I'll revisit this function.

Thanks for the review!

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


soft lockup

2010-12-14 Thread Andreas Rittershofer
What does "soft lockup" mean?

Dec 14 02:35:18 hp1 kernel: [1492483.960150] BUG: soft lockup - CPU#1 stuck for 61s! [kvm:32398]

It is associated with a loss of OCFS2 connectivity, followed by other problems.


Best regards

Andreas Rittershofer

-- 
There could be no signature here.



KVM Test report, kernel d335b15... qemu cb1983b8...

2010-12-14 Thread Hao, Xudong
Hi, all,
This is the KVM test result against kvm.git 
d335b156f9fafd177d0606cf845d9a2df2dc5431, and qemu-kvm.git 
cb1983b8809d0e06a97384a40bad1194a32fc814.

Currently the qemu-kvm build fails on RHEL5 with an undeclared 
"PCI_PM_CTRL_NO_SOFT_RST" error. I saw that a fix patch has already been posted 
to the mailing list.
Two bugs got fixed.

Fixed issues:
1. Guest qemu processor will be defunct process by be killed
https://bugzilla.kernel.org/show_bug.cgi?id=23612
2. [SR] qemu return form "migrate " command spend long time 
https://sourceforge.net/tracker/?func=detail&aid=2942079&group_id=180599&atid=893831


Four old Issues:

1. ltp diotest running time is 2.54 times than before
https://sourceforge.net/tracker/?func=detail&aid=2723366&group_id=180599&atid=893831
2. 32bits Rhel5/FC6 guest may fail to reboot after installation
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1991647&group_id=180599
3. perfctr wrmsr warning when booting 64bit RHEl5.3
https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831
4. [KVM] Noacpi Windows guest can not boot up on 32bit KVM host
https://bugzilla.kernel.org/show_bug.cgi?id=21402


Best Regards,
Xudong Hao


[ kvm-Bugs-2942079 ] [SR] qemu return form "migrate " command spend long time

2010-12-14 Thread SourceForge.net
Bugs item #2942079, was opened at 2010-01-29 17:13
Message generated for change (Comment added) made by haoxudong
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2942079&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Xudong Hao (haoxudong)
Assigned to: Nobody/Anonymous (nobody)
Summary: [SR] qemu return form "migrate " command spend long time 

Initial Comment:
Environment:

kvm.git Commit: 51ef04ce3219d05c88f204342b2db294b5590d0a
qemu-kvm Commit: 3e6f07b0c86b7fabfce72c1a42e54b2ad79dc587
Host Kernel Version: 2.6.33-rc4


Bug detailed description:
--
The KVM guest save-restore command changed, and the new command works.
However, when we do  in the guest qemu console, it 
takes a very long time to return (~2 minutes for a 256M memory guest); the 
saving speed is only ~1MB/s

344640+0 records in
344640+0 records out 
176455680 bytes (176 MB) copied, 183.226 s, 963 kB/s


Reproduce steps:

1) qemu-system-x86_64  -m 256 -smp 4  -net
nic,macaddr=00:16:3e:57:87:39,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup
-hda /share/xvs/var/guest.img
2) Ctrl+Alt+2, then: migrate "exec:dd of=test.img"   This step takes ~2 minutes 
with a 256MB memory guest; it will take longer for a >256MB memory guest.
3) qemu-system-x86_64  -m 256 -smp 4  -net
nic,macaddr=00:16:3e:57:87:39,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup
-hda /share/xvs/var/guest.img --incoming "exec:dd if=test.img"



--

>Comment By: Xudong Hao (haoxudong)
Date: 2010-12-15 14:02

Message:
On kvm 66fc6be8d2b04153b753182610f919faf9c705bc and qemu-kvm
53b6d3d5c2522e881c8d194f122de3114f6f76eb, the issue no longer exists.

it will take <10s to save, speed: ~45MB/s

mark this bug fixed and verified.


--



KVM PCI passthrough issues, RTL-8169 PCI NICs

2010-12-14 Thread Andrew Useckas
Hi,

I have been trying to get the PCI pass through working on my Asus Crosshair IV 
Formula motherboard.

The motherboard does support IOMMU (AMD-Vi), and IOMMU as well as SVM are 
enabled in the BIOS.

PCI pass through does seem to work just fine when it comes to the built-in 
network device (Marvell 8059 Yukon), however I can't get it working with two 
PCI RTL-8169 network cards. Both devices are bound to pci-stub.

Every time I attempt I get a message that the device is busy, as shown below:

PCI region 1 at address 0xf9dff800 has size 0x100, which is not a multiple of 
4K. You might experience some performance hit due to that.
Failed to assign device "(null)" : Device or resource busy
*** The driver 'pci-stub' is occupying your device :01:05.0.


The kernel logs show the following:

Dec 14 10:19:55 phalsenet kernel: [ 1718.806644] pci-stub :01:05.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
Dec 14 10:19:55 phalsenet kernel: [ 1718.836741] pci-stub :01:05.0: restoring config space at offset 0x1 (was 0x2b00400, writing 0x2b00103)
Dec 14 10:19:55 phalsenet kernel: [ 1718.903118] assign device 0:1:5.0 failed
Dec 14 10:19:55 phalsenet kernel: [ 1718.903161] pci-stub :01:05.0: PCI INT A disabled


The box is running Gentoo Linux. I have tested with 2.6.34 and 2.6.36 kernels, 
with kvm and kvm-amd modules that came with the kernels as well as with the 
latest kvm-kmod sources (2.6.36.1).

The qemu-kvm version is 0.13.0.

Here is the lspci -v output from one of the network devices:

01:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit 
Ethernet (rev 10)
Subsystem: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet
Flags: 66MHz, medium devsel, IRQ 20
I/O ports at b400 [size=256]
Memory at f9dff800 (32-bit, non-prefetchable) [size=256]
Expansion ROM at f9da [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Kernel driver in use: pci-stub


Any help would be appreciated.

Regards,
Andrew


Re: Freezing Windows 2008 x64bit guest

2010-12-14 Thread Manfred Heubach
Vadim Rozenfeld  redhat.com> writes:

>
> On Mon, 2010-12-13 at 22:12 +0200, Dor Laor wrote:
> > On 12/13/2010 09:42 PM, Manfred Heubach wrote:
> > >
> > > I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly
> > > because of performance problems which were solved by the upgrade.
> > >
> > > After the upgrade the system became extremely unstable. It was crashing as
> > > soon as disk io and network io load was growing. 100% reproducible with
> > > windows server backup to an iscsi volume.
> > >
> > > i had virtio drivers for storage and network installed (redhat/fedora
> > > 1.1.11).
> >
> > Which fedora/rhel release is that?


The host is Ubuntu 10.10 x64

The drivers are from
http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/ 1.1.11-0
released on 17-Aug-2010 - are there any newer drivers?


> > What's the windows virtio driver version?

The virtio storage version shown in Windows is 6.0.0.10

> >
> > Have you tried using virt-manager/virsh instead of raw cmdline?

I'm starting it with libvirt/virsh

cmd-line copied from the log (and some log entries):

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
QEMU_AUDIO_DRV=none /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 8192 -smp
4,sockets=4,cores=1,threads=1 -name sbs2008 -uuid
933c2ef2-e5b0-0b39-db60-016b5d226534 -nodefaults -chardev
socket,id=monitor,path=/var/lib/libvirt/qemu/sbs2008.monitor,server,nowait -mon
chardev=monitor,mode=readline -rtc base=localtime -boot c -drive
file=/var/lib/libvirt/images/olscanner/virtio-win-1.1.11-0.iso,if=none,media=cdrom,
id=drive-ide0-1-0,readonly=on,format=raw
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive
file=/var/lib/libvirt/images/sbs2008/sbs2008.img,if=none,id=drive-virtio-disk0,
boot=on,format=qcow2 -device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive
file=/dev/volg1/sbsdata,if=none,id=drive-virtio-disk1,format=raw,cache=none
-device
virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
-drive file=/dev/volg1/wsus,if=none,id=drive-virtio-disk2,format=raw,cache=none
-device
virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2
-device e1000,vlan=0,id=net0,mac=52:54:00:8a:bc:c9,bus=pci.0,addr=0x6 -net
tap,fd=107,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device
isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k
de -vga std -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
07:12:02.715: debug : qemudInitCpuAffinity:2423 : Setting CPU affinity
07:12:02.717: debug : qemuSecurityDACSetProcessLabel:547 : Dropping privileges
of VM to 105:114
char device redirected to /dev/pts/0
pci_add_option_rom: failed to find romfile "pxe-e1000.bin"


> > About e1000, some windows comes with buggy driver and an update e1000
> > from Intel fixes some issues.
> >

I'm running latest drivers from Intel: 8.3.15.0

> >
> > > At each BSOD I had the following line in the log of the guest:
> > >
> > >   virtio_ioport_write: unexpected address 0x13 value 0x1
> > >
> > > I changed the network interface back to e1000. What I experience now (and
> > > I had that at the very beginning before i switched to virtio network) are
> > > freezes. The guest doesn't respond anymore (doesn't answer to pings and
> > > doesn't interact via mouse/keyboard anymore). Host CPU usage of the kvm
> > > process is 100% on as many cores as there are virtual cpus (in this case 4).

I had a crash today but no log entry on the host - but that could be because I
had to restart syslog (ran out of disk space after turning on debug logging of
libvirtd - didn't think that it would generate 6 GB of logs per day :-)

> > >
> Sounds like an interrupt storm to me. Can you try to ping your VM?

No response to ping.

> Anyway the best way to start debugging a stalled system is just to crash
> it with BSOD. For doing it you will need:
> - enable NMICrashDump (please see http://support.microsoft.com/kb/927069
> for more information)
> - enable Kernel Memory Dump (actually Complete is much better, but it
> can be too big)  http://support.microsoft.com/kb/969028
> - you only will need to type "nmi 0" in the qemu monitor to crash the
> system, when the system hangs next time.

I prepared this. When the system crashed today I didn't have the complete
memory dump ready - so I only have a minidump. The interesting point is that
the system today crashed with a BSOD and didn't freeze.

The result of dumpchk.exe is as follows:


Microsoft (R) Windows Debugger Version 6.12.0002.633 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [c:\Windows\Minidump\Mini121410-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SRV*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Server 2008/Windows Vista Kernel Version 6002 (Service Pack 2) MP (4
procs) Free x64
Product: LanManNt, suite

Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread David S. Ahern


On 12/14/10 14:46, Anthony Liguori wrote:
> On 12/14/2010 01:54 PM, David S. Ahern wrote:
>>
>> On 12/14/10 12:49, Anthony Liguori wrote:
>>   
>>> But that doesn't tell you what the impact is in real world workloads.
>>> Before we start pushing all device emulation into the kernel, we need to
>>> quantify how often gettimeofday() is really called in real workloads.
>>>  
>> The workload that inspired that example program at its current max load
>> calls gtod upwards of 1000 times per second. The overhead of
>> gettimeofday was the biggest factor when comparing performance to bare
>> metal and esx. That's why I wrote the test program --- boils a complex
>> product/program to a single system call.
>>
> 
> So the absolute performance impact was on the order of what?

At the time I did the investigations (18-24 months ago) KVM was on the
order of 15-20% worse for a RHEL4 based workload and the overhead
appeared to be due to the PIT or PM timer as the clock source. Switching
the clock to the TSC brought the performance on par with bare metal, but
that route has other issues.

> 
> The difference in CPU time of a light weight vs. heavy weight exit
> should be something like 2-3us.  That would mean 2-3ms of CPU time at a
> rate of 1000 per second.

The PIT causes 3 VMEXITs for each gettimeofday (get_offset_pit in RHEL4):

/* timer count may underflow right here */
outb_p(0x00, PIT_MODE); /* latch the count ASAP */
...
count = inb_p(PIT_CH0); /* read the latched count */
...
count |= inb_p(PIT_CH0) << 8;
...
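
To put rough numbers on the figures in this thread, here is a
back-of-the-envelope sketch, not measured data. It assumes that each of the
three PIT port accesses costs one heavyweight exit and that the 2-3us quoted
above is the extra cost per exit. The function and parameter names are made up
for illustration.

```c
#include <stdio.h>

/*
 * Fraction of one CPU spent on the extra exit cost.
 * Assumption: every port access is one heavyweight exit, and
 * extra_us_per_exit is the heavyweight-minus-lightweight delta per exit.
 */
static double exit_overhead_fraction(double calls_per_sec,
                                     double exits_per_call,
                                     double extra_us_per_exit)
{
    double busy_us_per_sec = calls_per_sec * exits_per_call * extra_us_per_exit;

    return busy_us_per_sec / 1e6; /* 1e6 microseconds in one second */
}

/* Print the estimate for the 2us and 3us ends of the quoted range. */
static void print_pit_estimate(void)
{
    printf("low:  %.2f%%\n", 100.0 * exit_overhead_fraction(1000, 3, 2.0));
    printf("high: %.2f%%\n", 100.0 * exit_overhead_fraction(1000, 3, 3.0));
}
```

With 1000 gettimeofday calls per second and 3 exits per call this works out to
6-9 ms of CPU time per second, or roughly 0.6-0.9% of one CPU, under the
assumptions stated above.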


David


> 
> That should be pretty much in the noise.
> 
> There are possibly second order effects that might make a large impact
> such as contention with the qemu_mutex.  It's worth doing
> experimentation to see if a non-mutex acquiring fast path in userspace
> also resulted in a significant performance boost.
> 
> Regards,
> 
> Anthony Liguori
> 
>> David
>>
>>   
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>> 
 What's the relative speed of the in-kernel pmtimer compared to the PIT?

 David


>>>  
> 


Re: USB Passthrough 1.1 performance problem...

2010-12-14 Thread Kenni Lund
2010/12/14 Erik Brakkee :
> Daniel P. Berrange wrote:
>>
>> On Tue, Dec 14, 2010 at 12:55:04PM +0100, Kenni Lund wrote:
>>
>>>
>>> 2010/12/14 Erik Brakkee:
>>>
>
> From: Kenni Lund
> 2010/12/14 Erik Brakkee:
>
>>>
>>> From: Kenni Lund
>>>

 Does this mean I have a chance now that PCI passthrough of my WinTV
 PVR-500
 might work now?

>>>
>>> Passthrough of a PVR-500 has been working for a long time. I've been
>>> running with passthrough of a PVR-500 in my HTPC, since
>>> November/December 2009...so it should work with any recent kernel and
>>> any recent version of qemu-kvm you can find today - No patching
>>> needed. The only issue I had with the PVR-500 card, was when *I*
>>> didn't free up the shared interrupts...once I fixed that, it "just
>>> worked".
>>>
>>
>> How did you free up those shared interrupts then? I tried different
>> slots
>> but always get conflicts with the USB irqs.
>>
>
> I did an unbind of the conflicting device (eg. disabled it). I moved
> the PVR-500 card around in the different slots and once I got a
> conflict with the integrated sound card, I left the PVR-500 card in
> that slot (it's a headless machine, so no need for sound) and
> configured unbind of the sound card at boot time. On my old system I
> think it was conflicting with one of the USB controllers as well, but
> it didn't really matter, as I only lost a few of the ports on the back
> of the computer for that particular USB controller - I still had
> plenty of USB ports left and if I really needed more ports, I could
> just plug in an extra USB PCI card.
>
> My /etc/rc.local boot script looks like the following today:
> --
> #Remove HDA conflicting with ivtv1
> echo ":00:1b.0">  /sys/bus/pci/drivers/HDA\ Intel/unbind
>
> # ivtv0
> echo " 0016">  /sys/bus/pci/drivers/pci-stub/new_id
> echo ":04:08.0">  /sys/bus/pci/drivers/ivtv/unbind
> echo ":04:08.0">  /sys/bus/pci/drivers/pci-stub/bind
> echo " 0016">  /sys/bus/pci/drivers/pci-stub/remove_id
>
> # ivtv1
> echo " 0016">  /sys/bus/pci/drivers/pci-stub/new_id
> echo ":04:09.0">  /sys/bus/pci/drivers/ivtv/unbind
> echo ":04:09.0">  /sys/bus/pci/drivers/pci-stub/bind
> echo " 0016">  /sys/bus/pci/drivers/pci-stub/remove_id
>

 I did not try unbinding the usb device so I can also try that.

 I don't understand what is happening with the  0016. I configured the
 pci card in kvm and I believe kvm does the binding to pci-stub in recent
 versions. Where is the  0016 coming from?

>>>
>>> Okay, qemu-kvm might do it today, I don't know - I haven't changed
>>> that script for the past year. But are you sure that it's not
>>> libvirt/virsh/virt-manager which does that for you?
>>>
>>
>> If you use the managed="yes" attribute on the  in libvirt
>> XML, then libvirt will automatically do the pcistub bind/unbind,
>> followed by a device reset at guest startup & the reverse at shutdown.
>> If you have conflicting devices on the bus though, libvirt won't
>> attempt to unbind them, unless you had also explicitly assigned all
>> those conflicting devices to the same guest.
>>
>> Daniel
>>
>
> I definitely have to try again (right now having some stability problems on
> the server that I am debugging).
>
> The shared IRQs are as follows:
>
>  16:          0          0          0          0          0          0
>    0          0   IO-APIC-fasteoi   uhci_hcd:usb3
>  18:     252995          0          0          0          0          0
>    0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb8, ivtv0
>  19:      58281          0          0          0          0          0
>    0          0   IO-APIC-fasteoi   ata_piix, ata_piix, uhci_hcd:usb5,
> uhci_hcd:usb7, ivtv1
>  21:          0          0          0          0          0          0
>    0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>  23:        713       6906          0      76919          0          0
>    0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6
>
> So I have IRQ sharing with usb1, usb8, usb5, usb7.

Uff... and your ATA HDD controller. I guess I was much luckier than
you are; my ivtv0 didn't conflict at all and ivtv1 only conflicted
with USB.

> I have also read that
> ehci refers to USB 2.0 and uhci to USB 1.1 is that correct? Anyway, how
> would I now identify the USB PCI devices that I would need to unbind to get
> rid of the sharing with the USB ports?

Play around with:
lspci -v
lspci -n
lsusb -v
lsusb -t
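
The last column of /proc/interrupts can also be checked mechanically. The
following is a minimal sketch, assuming handler names are separated by commas
and no other field on the line contains a comma. The function name is
hypothetical.

```c
#include <string.h>

/*
 * Given one line of /proc/interrupts, return how many handlers are
 * registered on that IRQ if `dev` is one of them, or 0 if `dev` does
 * not appear on the line. A result of 1 means `dev` has the line to
 * itself.
 */
static int handlers_on_line_with(const char *line, const char *dev)
{
    int count = 1;
    const char *p;

    if (strstr(line, dev) == NULL)
        return 0;

    /* Each comma separates two handler names. */
    for (p = line; *p != '\0'; p++)
        if (*p == ',')
            count++;

    return count;
}
```

Any result above 1 for ivtv0 or ivtv1 means something else still shares that
interrupt line.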

You can also just start by unbinding the first one and take note when
you hit the right ones...once you unbind one, it will disappear from
cat /proc/interrupts. When you're down to having only ivtv0 on one
interrupt and only ivtv1 on another interrupt, then you're ready to
bind with 

Re: [PATCH 5/5] pci-assign: Use PCI-2.3-based shared legacy interrupts

2010-12-14 Thread Jan Kiszka
On 14.12.2010 01:16, Alex Williamson wrote:
> On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> Enable the new KVM feature that allows legacy interrupt sharing for
>> PCI-2.3-compliant devices. This requires to synchronize any guest
>> change of the INTx mask bit to the kernel.
>>
>> Signed-off-by: Jan Kiszka 
>> ---
>>  hw/device-assignment.c |   38 +-
>>  qemu-kvm.c |8 
>>  qemu-kvm.h |3 +++
>>  3 files changed, 44 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
>> index 26d3bd7..cf75c52 100644
>> --- a/hw/device-assignment.c
>> +++ b/hw/device-assignment.c
>> @@ -423,12 +423,21 @@ static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
>>  return 0;
>>  }
>>  
>> +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
>> +{
>> +return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
>> +}
>> +
>>  static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>>uint32_t val, int len)
>>  {
>>  int fd;
>>  ssize_t ret;
>>  AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
>> +struct kvm_assigned_pci_dev assigned_dev_data;
>> +#ifdef KVM_CAP_PCI_2_3
>> +bool intx_masked, update_intx_mask;
>> +#endif /* KVM_CAP_PCI_2_3 */
>>  
>>  DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>>((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
>> @@ -439,6 +448,26 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>>  }
>>  
>>  if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
>> +#ifdef KVM_CAP_PCI_2_3
>> +update_intx_mask = false;
>> +if (address == PCI_COMMAND+1) {
>> +intx_masked = val & (PCI_COMMAND_INTX_DISABLE >> 8);
>> +update_intx_mask = true;
>> +} else if (len >= 2) {
>> +intx_masked = val & PCI_COMMAND_INTX_DISABLE;
>> +update_intx_mask = true;
>> +}
> 
> I wonder if this might be a little cleaner as something like this.
> 
> if (ranges_overlap(address, len, PCI_COMMAND + 1, 1)) {
>     update_intx_mask = true;
>     intx_masked = (len == 1 ? val << 8 : val) & PCI_COMMAND_INTX_DISABLE;
> }

That should even obsolete update_intx_mask - will look into this, and
also the merge bits thing.
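
For reference, a standalone sketch of the byte-lane arithmetic under
discussion. It is not the code from the patch: it generalizes the idea to any
write of 1 to 4 bytes that covers the upper byte of the command register. The
helper name is hypothetical; the two register constants are the standard PCI
values.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND              0x04   /* 16-bit command register */
#define PCI_COMMAND_INTX_DISABLE 0x400  /* INTx disable, bit 10 */

/*
 * If a config write of `len` bytes at `address` covers the upper byte of
 * the command register, store the new INTx-disable state in *masked and
 * return true. Otherwise leave *masked untouched and return false.
 */
static bool intx_mask_from_write(uint32_t address, int len, uint32_t val,
                                 bool *masked)
{
    uint32_t hi = PCI_COMMAND + 1;   /* the byte that holds bit 10 */
    uint32_t byte;

    if (len < 1 || len > 4)
        return false;
    if (address > hi || address + (uint32_t)len <= hi)
        return false;

    /* Pick the byte of val that lands on PCI_COMMAND + 1. */
    byte = (val >> (8 * (hi - address))) & 0xff;
    *masked = (byte & (PCI_COMMAND_INTX_DISABLE >> 8)) != 0;
    return true;
}
```

Because the function returns whether the mask byte was touched at all, a
separate update flag is not needed by the caller.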

Thanks!
Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [PATCHv8 00/16] boot order specification

2010-12-14 Thread Alexander Graf

On 14.12.2010, at 21:31, Benjamin Herrenschmidt wrote:

> 
>> The only working system emulation we have are Macs (G3 beige, G4, G5),  
>> so we can't just ignore Apple.
>> Alex even made me stick to their odd 0x41 rtas-version property. ;)
> 
> Hah :-) Nothing ever used RTAS on these... afaik, it didn't even work
> properly.

Then let's not use rtas for the Mac machine, but rather go with Andreas' new 
machine. Changing the value there to what real FW uses on that machine is more 
than reasonable :)


Alex



Re: [PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word

2010-12-14 Thread Jan Kiszka
On 14.12.2010 21:47, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> +/**
>> + *  get_irq_status - read interrupt line status word
>> + *  @irq: Interrupt line of the status word
>> + *
>> + *  This returns the current content of the status word associated with
>> + *  the given interrupt line. See IRQS_* flags for details.
>> + */
>> +unsigned long get_irq_status(unsigned int irq)
>> +{
>> +struct irq_desc *desc = irq_to_desc(irq);
>> +
>> +return desc ? desc->irq_data.drv_status : 0;
>> +}
>> +EXPORT_SYMBOL_GPL(get_irq_status);
> 
> We should document that this is a snapshot and in no way serialized
> against modifications of drv_status. I'll fix up the kernel doc.

Yeah, I think I had some hint on this in the previous version but
apparently dropped it for this round.

Thanks,
Jan





Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state

2010-12-14 Thread Jan Kiszka
On 14.12.2010 22:46, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>  chip_bus_lock(desc);
>>  retval = __setup_irq(irq, desc, action);
>>  chip_bus_sync_unlock(desc);
>>  
>> -if (retval)
>> +if (retval) {
>> +if (desc->action && !desc->action->next)
>> +desc->irq_data.drv_status &= ~IRQS_SHARED;
> 
> This is redundant. IRQS_SHARED gets set in a code path where all
> checks are done already.

Nope, it's also set before entry of __setup_irq in case we call an
IRQF_ADAPTIVE handler.

We need to set it that early as we may race with IRQ events for the
already registered handler happening between the sharing notification
and the actual registration of the second handler.

Jan





Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state

2010-12-14 Thread Jan Kiszka
On 14.12.2010 21:54, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>>  /* Make sure it's not being used on another CPU: */
>>  synchronize_irq(irq);
>>  
>> +if (single_handler)
>> +desc->irq_data.drv_status &= ~IRQS_SHARED;
>> +
> 
> What's the reason to clear this flag outside of the desc->lock held
> region.

We need to synchronize the irq first before clearing the flag.

The problematic scenario behind this: An IRQ started in shared mode,
thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
before calling into the threaded handler. And that handler may now think
that the line is still masked as IRQS_SHARED is set.

> I need this status for other purposes as well, where I
> definitely need serialization.

Well, two options: wrap all bit manipulations with desc->lock
acquisition/release or turn drv_status into an atomic. I don't know what
your plans with drv_status are, so...
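
As an illustration of the second option only, here is a userspace sketch using
C11 atomics. It is not kernel code, and the kernel has its own atomic and
bitop primitives. The bit value given to IRQS_SHARED and the helper names are
assumptions made for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define IRQS_SHARED 0x1UL  /* hypothetical bit value, for illustration only */

/* Userspace stand-in for a driver-visible status word. */
static atomic_ulong drv_status;

/* Set bits without losing concurrent updates to other bits. */
static void status_set(unsigned long bits)
{
    atomic_fetch_or(&drv_status, bits);
}

/* Clear bits without losing concurrent updates to other bits. */
static void status_clear(unsigned long bits)
{
    atomic_fetch_and(&drv_status, ~bits);
}

/* Snapshot test: the result may be stale by the time the caller acts on it. */
static bool status_test(unsigned long bits)
{
    return (atomic_load(&drv_status) & bits) != 0;
}
```

Atomic read-modify-write keeps individual bit updates from clobbering each
other, but it does not order them against other state protected by a lock, so
it only covers part of what the first option would give.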

> 
>> +mutex_lock(&register_lock);
>> +
>> +old_action = desc->action;
>> +if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
>> +!(desc->irq_data.drv_status & IRQS_SHARED)) {
>> +/*
>> + * Signal the old handler that is has to switch to shareable
>> + * handling mode. Disable the line to avoid any conflict with
>> + * a real IRQ.
>> + */
>> +disable_irq(irq);
>> +local_irq_disable();
>> +
>> +desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;
> 
> Unserialized access as well. Will think about it.
> 
>> +old_action->handler(irq, old_action->dev_id);
>> +desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;
> 
> Thanks,
> 
>   tglx

Jan





Re: [PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-14 Thread Minchan Kim
On Tue, Dec 14, 2010 at 8:45 PM, Balbir Singh  wrote:
> * MinChan Kim  [2010-12-14 19:01:26]:
>
>> Hi Balbir,
>>
>> On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh
>>  wrote:
>> > Move reusable functionality outside of zone_reclaim.
>> > Make zone_reclaim_unmapped_pages modular
>> >
>> > Signed-off-by: Balbir Singh 
>> > ---
>> >  mm/vmscan.c |   35 +++
>> >  1 files changed, 23 insertions(+), 12 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index e841cae..4e2ad05 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
>> >  }
>> >
>> >  /*
>> > + * Helper function to reclaim unmapped pages, we might add something
>> > + * similar to this for slab cache as well. Currently this function
>> > + * is shared with __zone_reclaim()
>> > + */
>> > +static inline void
>> > +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
>> > +                               unsigned long nr_pages)
>> > +{
>> > +       int priority;
>> > +       /*
>> > +        * Free memory by calling shrink zone with increasing
>> > +        * priorities until we have enough memory freed.
>> > +        */
>> > +       priority = ZONE_RECLAIM_PRIORITY;
>> > +       do {
>> > +               shrink_zone(priority, zone, sc);
>> > +               priority--;
>> > +       } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
>> > +}
>>
>> As I said previous version, zone_reclaim_unmapped_pages doesn't have
>> any functions related to reclaim unmapped pages.
>
> The scan control point has the right arguments for implementing
> reclaim of unmapped pages.

I mean you should do the scan_control setup in this function.
The current zone_reclaim_unmapped_pages doesn't have any specific routine
related to reclaiming unmapped pages.
Otherwise, change the function name to just "zone_reclaim_pages". I
think you don't want that.
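
One possible reading of this suggestion, as a userspace sketch with mock
types. The structs and shrink_zone() below are stand-ins that only model the
loop, not the kernel versions, and the value of ZONE_RECLAIM_PRIORITY is
assumed from mm/vmscan.c of that period.

```c
#include <stdbool.h>

#define ZONE_RECLAIM_PRIORITY 4  /* assumed value, see lead-in */

/* Mock types: only the fields this sketch touches. */
struct zone { unsigned long reclaimable; };
struct scan_control {
    unsigned long nr_reclaimed;
    bool may_unmap;
    bool may_swap;
};

/* Mock shrink_zone(): scans a larger share as the priority value drops.
 * It ignores may_unmap/may_swap; the real function would honour them. */
static void shrink_zone(int priority, struct zone *zone, struct scan_control *sc)
{
    unsigned long got = zone->reclaimable >> priority;

    zone->reclaimable -= got;
    sc->nr_reclaimed += got;
}

/* The unmapped-only policy lives inside the helper, matching its name. */
static unsigned long zone_reclaim_unmapped_pages(struct zone *zone,
                                                 unsigned long nr_pages)
{
    struct scan_control sc = {
        .nr_reclaimed = 0,
        .may_unmap = false,  /* leave mapped pages alone */
        .may_swap = false,   /* do not swap anonymous memory */
    };
    int priority = ZONE_RECLAIM_PRIORITY;

    do {
        shrink_zone(priority, zone, &sc);
        priority--;
    } while (priority >= 0 && sc.nr_reclaimed < nr_pages);

    return sc.nr_reclaimed;
}
```

With the setup inside the helper, a caller only passes the zone and the target,
and the function name describes what the scan is allowed to touch.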

-- 
Kind regards,
Minchan Kim


Re: [PATCH v3 0/4] KVM & genirq: Enable adaptive IRQ sharing for passed-through devices

2010-12-14 Thread Thomas Gleixner
On Mon, 13 Dec 2010, Jan Kiszka wrote:

> This addresses the review comments of the previous round:
>  - renamed irq_data::status to drv_status
>  - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
>  - fixed signature of get_irq_status (irq is now unsigned int)
>  - converted register_lock into a global one
>  - fixed critical white space breakage (that I just left in to check if
>    anyone is actually reading the code, of course...)

Just for the record, you either missed or introduced some new white
space noise :)


Re: USB Passthrough 1.1 performance problem...

2010-12-14 Thread Erik Brakkee

Daniel P. Berrange wrote:

On Tue, Dec 14, 2010 at 12:55:04PM +0100, Kenni Lund wrote:
   

2010/12/14 Erik Brakkee:
 

From: Kenni Lund
2010/12/14 Erik Brakkee:
 

From: Kenni Lund
 

Does this mean I have a chance now that PCI passthrough of my WinTV
PVR-500
might work now?
   

Passthrough of a PVR-500 has been working for a long time. I've been
running with passthrough of a PVR-500 in my HTPC, since
November/December 2009...so it should work with any recent kernel and
any recent version of qemu-kvm you can find today - No patching
needed. The only issue I had with the PVR-500 card, was when *I*
didn't free up the shared interrupts...once I fixed that, it "just
worked".
 

How did you free up those shared interrupts then? I tried different slots
but always get conflicts with the USB irqs.
   

I did an unbind of the conflicting device (eg. disabled it). I moved
the PVR-500 card around in the different slots and once I got a
conflict with the integrated sound card, I left the PVR-500 card in
that slot (it's a headless machine, so no need for sound) and
configured unbind of the sound card at boot time. On my old system I
think it was conflicting with one of the USB controllers as well, but
it didn't really matter, as I only lost a few of the ports on the back
of the computer for that particular USB controller - I still had
plenty of USB ports left and if I really needed more ports, I could
just plug in an extra USB PCI card.

My /etc/rc.local boot script looks like the following today:
--
#Remove HDA conflicting with ivtv1
echo ":00:1b.0">  /sys/bus/pci/drivers/HDA\ Intel/unbind

# ivtv0
echo " 0016">  /sys/bus/pci/drivers/pci-stub/new_id
echo ":04:08.0">  /sys/bus/pci/drivers/ivtv/unbind
echo ":04:08.0">  /sys/bus/pci/drivers/pci-stub/bind
echo " 0016">  /sys/bus/pci/drivers/pci-stub/remove_id

# ivtv1
echo " 0016">  /sys/bus/pci/drivers/pci-stub/new_id
echo ":04:09.0">  /sys/bus/pci/drivers/ivtv/unbind
echo ":04:09.0">  /sys/bus/pci/drivers/pci-stub/bind
echo " 0016">  /sys/bus/pci/drivers/pci-stub/remove_id
 

I did not try unbinding the usb device so I can also try that.

I don't understand what is happening with the  0016. I configured the
pci card in kvm and I believe kvm does the binding to pci-stub in recent
versions. Where is the  0016 coming from?
   

Okay, qemu-kvm might do it today, I don't know - I haven't changed
that script for the past year. But are you sure that it's not
libvirt/virsh/virt-manager which does that for you?
 

If you use the managed="yes" attribute on the <hostdev> in libvirt
XML, then libvirt will automatically do the pcistub bind/unbind,
followed by a device reset at guest startup & the reverse at shutdown.
If you have conflicting devices on the bus though, libvirt won't
attempt to unbind them, unless you had also explicitly assigned all
those conflicting devices to the same guest.

Daniel
   
I definitely have to try again (right now having some stability problems 
on the server that I am debugging).


The shared IRQs are as follows:

  16:      0      0      0      0      0      0      0      0   IO-APIC-fasteoi   uhci_hcd:usb3
  18: 252995      0      0      0      0      0      0      0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb8, ivtv0
  19:  58281      0      0      0      0      0      0      0   IO-APIC-fasteoi   ata_piix, ata_piix, uhci_hcd:usb5, uhci_hcd:usb7, ivtv1
  21:      0      0      0      0      0      0      0      0   IO-APIC-fasteoi   uhci_hcd:usb4
  23:    713   6906      0  76919      0      0      0      0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb6


So I have IRQ sharing with usb1, usb8, usb5 and usb7. I have also read that
ehci refers to USB 2.0 and uhci to USB 1.1; is that correct? Anyway, how
would I now identify the USB PCI devices that I need to unbind to
get rid of the sharing with the USB ports? It doesn't really matter
in which slot I put the PVR-500 card, because both cards share IRQs with
USB in all cases.


I have also used an add-on USB PCI card but still got these conflicts. I
was considering getting a PCIe USB card instead, in the hope
that it would use different IRQs. Is that a realistic expectation?
That way, I could disable all on-board USB (even in the BIOS) and use
the add-on USB only.







Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Anthony Liguori

On 12/14/2010 01:54 PM, David S. Ahern wrote:
> On 12/14/10 12:49, Anthony Liguori wrote:
>> But that doesn't tell you what the impact is in real world workloads.
>> Before we start pushing all device emulation into the kernel, we need to
>> quantify how often gettimeofday() is really called in real workloads.
>
> The workload that inspired that example program at its current max load
> calls gtod upwards of 1000 times per second. The overhead of
> gettimeofday was the biggest factor when comparing performance to bare
> metal and esx. That's why I wrote the test program --- boils a complex
> product/program to a single system call.
   


So the absolute performance impact was on the order of what?

The difference in CPU time of a light weight vs. heavy weight exit 
should be something like 2-3us.  That would mean 2-3ms of CPU time at a 
rate of 1000 per second.


That should be pretty much in the noise.
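
The arithmetic behind that estimate can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and the 2-3us figure is the rough estimate from this mail, not a measurement. At 1000 exits per second and 2-3us of extra cost per exit, the result is 2000-3000us of CPU time per second, i.e. roughly 0.2-0.3% of one CPU.

```c
#include <assert.h>

/* Extra CPU time, in microseconds, spent on exits during one second of wall time. */
static double exit_overhead_us_per_sec(double exits_per_sec, double extra_us_per_exit)
{
    assert(exits_per_sec >= 0.0 && extra_us_per_exit >= 0.0);
    return exits_per_sec * extra_us_per_exit;
}

/* The same overhead expressed as a percentage of one CPU (1 s = 1e6 us). */
static double exit_overhead_percent(double exits_per_sec, double extra_us_per_exit)
{
    return exit_overhead_us_per_sec(exits_per_sec, extra_us_per_exit) / 1e6 * 100.0;
}
```

Plugging in a measured exit rate and a measured per-exit cost difference would show quickly whether a given workload is outside the noise.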

There are possibly second order effects that might make a large impact,
such as contention with the qemu_mutex.  It's worth experimenting
to see if a fast path in userspace that does not acquire the mutex
also results in a significant performance boost.


Regards,

Anthony Liguori


> David
>
>> Regards,
>>
>> Anthony Liguori
>>
>>> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>>>
>>> David




Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state

2010-12-14 Thread Thomas Gleixner
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka 
>   chip_bus_lock(desc);
>   retval = __setup_irq(irq, desc, action);
>   chip_bus_sync_unlock(desc);
>  
> - if (retval)
> + if (retval) {
> + if (desc->action && !desc->action->next)
> + desc->irq_data.drv_status &= ~IRQS_SHARED;

This is redundant. IRQS_SHARED gets set in a code path where all
checks are done already.

To make that more obvious we can set it right before

   raw_spin_unlock_irqrestore(&desc->lock, flags);

conditionally on (shared).

That way we can also move the kfree out of the mutex locked section.
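
The ordering suggested here can be modelled outside the kernel. The following is a single-threaded userspace sketch, not genirq code: `struct fake_desc`, `fake_setup_irq()`, `FAKE_IRQS_SHARED` and the `lock_held` field standing in for `desc->lock` are all hypothetical names. It only illustrates the idea that the flag is set as the last step before the lock is dropped, and only on the path where the new action is really shared, so no error path has to clear it again.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_IRQS_SHARED 0x1UL          /* stand-in for IRQS_SHARED */

struct fake_action {
    struct fake_action *next;
};

struct fake_desc {
    int lock_held;                      /* stand-in for desc->lock */
    struct fake_action *action;         /* chain of installed handlers */
    unsigned long drv_status;
};

static void fake_lock(struct fake_desc *desc)
{
    assert(!desc->lock_held);
    desc->lock_held = 1;
}

static void fake_unlock(struct fake_desc *desc)
{
    assert(desc->lock_held);
    desc->lock_held = 0;
}

/* Returns 0 on success, -1 if the line is busy and sharing is not allowed. */
static int fake_setup_irq(struct fake_desc *desc, struct fake_action *new_action,
                          int allow_share)
{
    struct fake_action **p;
    int shared = 0;

    fake_lock(desc);
    p = &desc->action;
    if (*p) {
        if (!allow_share) {
            /* All checks run before the flag is touched: nothing to undo. */
            fake_unlock(desc);
            return -1;
        }
        while (*p)
            p = &(*p)->next;
        shared = 1;
    }
    new_action->next = NULL;
    *p = new_action;

    /* Last step before dropping the lock, and only on the shared path. */
    if (shared)
        desc->drv_status |= FAKE_IRQS_SHARED;
    fake_unlock(desc);
    return 0;
}
```

In this model a failed registration returns before the status word is modified, which is why the caller needs no cleanup of the flag.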

Thanks,

tglx


Re: [PATCH v3 2/4] genirq: Inform handler about line sharing state

2010-12-14 Thread Thomas Gleixner
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, 
> void *dev_id)
>   /* Make sure it's not being used on another CPU: */
>   synchronize_irq(irq);
>  
> + if (single_handler)
> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> +

What's the reason to clear this flag outside of the desc->lock held
region? I need this status for other purposes as well, where I
definitely need serialization.

> + mutex_lock(&register_lock);
> +
> + old_action = desc->action;
> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
> + /*
> +  * Signal the old handler that is has to switch to shareable
> +  * handling mode. Disable the line to avoid any conflict with
> +  * a real IRQ.
> +  */
> + disable_irq(irq);
> + local_irq_disable();
> +
> + desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;

Unserialized access as well. Will think about it.

> + old_action->handler(irq, old_action->dev_id);
> + desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;

Thanks,

tglx


Re: [PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word

2010-12-14 Thread Thomas Gleixner
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +/**
> + *   get_irq_status - read interrupt line status word
> + *   @irq: Interrupt line of the status word
> + *
> + *   This returns the current content of the status word associated with
> + *   the given interrupt line. See IRQS_* flags for details.
> + */
> +unsigned long get_irq_status(unsigned int irq)
> +{
> + struct irq_desc *desc = irq_to_desc(irq);
> +
> + return desc ? desc->irq_data.drv_status : 0;
> +}
> +EXPORT_SYMBOL_GPL(get_irq_status);

We should document that this is a snapshot and in no way serialized
against modifications of drv_status. I'll fix up the kernel doc.
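
What "snapshot" means for a caller can be shown with a userspace model. This is a sketch with hypothetical names (`fake_irq_data`, `fake_get_irq_status()`, `FAKE_IRQS_SHARED`), not the kernel function and not the final kernel-doc wording: the value is copied without any lock, so it can already be stale by the time the caller looks at it.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_IRQS_SHARED 0x1UL          /* stand-in for IRQS_SHARED */

struct fake_irq_data {
    unsigned long drv_status;
};

/*
 * Return a copy of the status word. No lock is taken, so the result is
 * only a snapshot: the word may change right after it has been read.
 */
static unsigned long fake_get_irq_status(const struct fake_irq_data *d)
{
    return d ? d->drv_status : 0;
}

/* Shows a snapshot going stale after a later update of the status word. */
static void fake_snapshot_demo(void)
{
    struct fake_irq_data d = { 0 };
    unsigned long snap = fake_get_irq_status(&d);

    d.drv_status |= FAKE_IRQS_SHARED;   /* update that races with the reader */

    assert(snap == 0);                                     /* old value */
    assert(fake_get_irq_status(&d) == FAKE_IRQS_SHARED);   /* new value */
}
```

A driver that has to act on the current state therefore cannot rely on an earlier return value alone.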

Thanks,

tglx


Re: [Qemu-devel] Re: [PATCHv8 00/16] boot order specification

2010-12-14 Thread Benjamin Herrenschmidt

> The only working system emulation we have are Macs (G3 beige, G4, G5),  
> so we can't just ignore Apple.
> Alex even made me stick to their odd 0x41 rtas-version property. ;)

Hah :-) Nothing ever used RTAS on these... afaik, it didn't even work
properly.

> No, but that may be OpenBIOS' fault. Here's its reg, in case it helps:
> 
> reg   1800      
>   01001810      0008
>   01001814      0004
>   01001818      0008
>   0100181c      0004
>   01001820      0010
> 

That looks like PCI. Odd to keep a PCI addressing scheme below a PCI
device...

Cheers,
Ben.




Re: [Qemu-devel] Re: [PATCHv8 00/16] boot order specification

2010-12-14 Thread Andreas Färber

On 12.12.2010 at 00:22, Benjamin Herrenschmidt wrote:

> On Sat, 2010-12-11 at 18:06 +0200, Gleb Natapov wrote:
>> http://playground.sun.com/pub/p1275/bindings/pci/pci2_1.pdf has table on
>> page 10 that defines how pci class code should be translated into OF
>> name. This is what my patch is using. pci-ata does not look spec
>> compliant (or is there more up-to-date spec?)
>>
>>> What should we do with at...@600 vs dr...@1?
>>
>> There is no available IDE OF binding spec, so I went with the way
>> OpenBIOS reports ata on qemu-x86. I have no idea what 600 in at...@600
>> may mean, but looking at g3_beige_300.html there is no such node there
>> and looking at any other device tree in
>> http://penguinppc.org/historical/dev-trees-html/
>
> Those are old and I wouldn't look too closely at what Apple does.


The only working system emulation we have are Macs (G3 beige, G4, G5),  
so we can't just ignore Apple.

Alex even made me stick to their odd 0x41 rtas-version property. ;)


> ATA doesn't really need anything complex, mostly the ata controller,
> generally named "ata" nowadays with a #address-cells of 1 and a
> #size-cells of 0. Children are then typically disk, cdrom, ... (ie block
> devices) with a unit address of 0 for master and 1 for slave.
>
> In the case of controllers with multiple ports, typically you have one
> such "ata" node per bus. "pci-ata" is a liberal use by Apple here
> representing the actual host controller PCI device.
>
> In any case, what matters is the "compatible" property. This is what
> defines the programming interface of a device.
>
>> I haven't found one that uses this kind of addressing for pci-ata.
>> http://penguinppc.org/historical/dev-trees-html/g3bw_400.html for
>> instance has p...@8000/pci-bri...@d/pci-...@1/ata-4. at...@600 kind of
>> addressing is used by devices on mac-io bus which I do not think we
>> emulate in qemu. So it looks like OpenBIOS is wrong here.
>
> Well, it's possible that the @600 represents a register offset within
> pci-ata, this is entirely up to pci-ata to do as it wishes there to
> define its own internal binding. Is there a "ranges" property defining
> translation across "pci-ata" ?


No, but that may be OpenBIOS' fault. Here's its reg, in case it helps:

reg   1800      
 01001810      0008
 01001814      0004
 01001818      0008
 0100181c      0004
 01001820      0010

Regards,
Andreas


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread David S. Ahern


On 12/14/10 12:49, Anthony Liguori wrote:
> But that doesn't tell you what the impact is in real world workloads. 
> Before we start pushing all device emulation into the kernel, we need to
> quantify how often gettimeofday() is really called in real workloads.

The workload that inspired that example program at its current max load
calls gtod upwards of 1000 times per second. The overhead of
gettimeofday was the biggest factor when comparing performance to bare
metal and esx. That's why I wrote the test program --- boils a complex
product/program to a single system call.

David

> 
> Regards,
> 
> Anthony Liguori
> 
>> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>>
>> David
>>
> 


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Anthony Liguori

On 12/14/2010 12:00 PM, David S. Ahern wrote:
> On 12/14/10 08:29, Anthony Liguori wrote:
>>> I recently used to investigate the performance benefit. In a Linux
>>> guest, I was running a program that calls gettimeofday() 'n' times
>>> in a loop (the PM Timer register is read during each call). With
>>> in-kernel PM Timer, I observed a significant reduction of program
>>> execution time.
>>
>> I've played with this in the past.  Can you post real numbers,
>> preferably, with a real work load?
>
> 2 years ago I posted relative comparisons of the time sources for older
> RHEL guests:
> http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html
   


Any time you write a program in userspace that effectively equates to a
single PIO operation that is easy to emulate, it's going to be
remarkably faster to implement that PIO emulation in the kernel than in
userspace because the vmexit cost dominates the execution path.


But that doesn't tell you what the impact is in real world workloads.  
Before we start pushing all device emulation into the kernel, we need to 
quantify how often gettimeofday() is really called in real workloads.


Regards,

Anthony Liguori


> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>
> David
   




Re: [GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations

2010-12-14 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Tue, 14 Dec 2010 14:23:26 +0200

> On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
>> Please merge the following tree for 2.6.38.
>> Thanks!
> 
> Rusty Acked it as is, so please pull the below.
> Thanks very much!
> 
>> The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:
>> 
>>   net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)
>> 
>> are available in the git repository at:
>>   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

Pulled, thanks a lot.


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread David S. Ahern


On 12/14/10 08:29, Anthony Liguori wrote:

>> I recently used to investigate the performance benefit. In a Linux
>> guest, I was running a program that calls gettimeofday() 'n' times
>> in a loop (the PM Timer register is read during each call). With
>> in-kernel PM Timer, I observed a significant reduction of program
>> execution time.
>>
> 
> I've played with this in the past.  Can you post real numbers,
> preferably, with a real work load?

2 years ago I posted relative comparisons of the time sources for older
RHEL guests:
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

What's the relative speed of the in-kernel pmtimer compared to the PIT?
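
For reference, the kind of loop described in the quoted text can be sketched as below. This is not the test program mentioned in this thread, only a minimal illustration with a made-up function name (`time_gtod_loop`); whether each call really reads the PM Timer register depends on the clocksource the guest kernel selected.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Call gettimeofday() n times and return the elapsed wall time in microseconds. */
static long long time_gtod_loop(long n)
{
    struct timeval start, now;
    long i;

    assert(n > 0);
    gettimeofday(&start, NULL);
    now = start;
    for (i = 0; i < n; i++)
        gettimeofday(&now, NULL);

    return (long long)(now.tv_sec - start.tv_sec) * 1000000LL
           + (now.tv_usec - start.tv_usec);
}
```

Dividing the result by n gives an approximate per-call cost that can be compared between time sources, or between guest and bare metal.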

David


Re: [Qemu-devel] [PATCH] RFC: delay pci_update_mappings for 64-bit BARs

2010-12-14 Thread Cam Macdonell
On Mon, Dec 13, 2010 at 8:00 PM, Isaku Yamahata  wrote:
> On Mon, Dec 13, 2010 at 03:43:44PM -0700, Cam Macdonell wrote:
>> Do not call pci_update_mappings on the lower 32-bits of a 64-bit bar.  Wait 
>> for the upper 32 or else Qemu will try to map on just the lower 32 which is 
>> probably going to corrupt memory.
>>
>> I was encountering crashes when mapping certain PCI region sizes.  The 
>> problem turns out that pci_update_mappings is being called without all 
>> 64-bits in the BAR.  For example when mapping to 0x18000, once the lower 
>> 32-bits were written the remapping happened (mapping to 0x800) which 
>> would overwrite something.
>>
>> I'm not certain if this is completely correct, I'm simply testing the lower 
>> 4-bits to only be MEM_TYPE_64 flag.  Upper 32-bit address parts can be 
>> values like 0xff which is tricky to test against.
>
> You're assuming that the guest OS always writes the lower 32 bits and then the upper 32 bits.
> Is the assumption correct?
> I found Linux does, but I don't know about other OSes.
> And I couldn't find any sentence about how to update a (64bit) BAR in the specs.
> (Please correct me if I missed it)

I think you're right, we probably can't assume the order.

>
> Some workaround would be necessary regardless of 32bit-or-64bit,
> because qemu doesn't emulate the bus accurately at the moment.
> How about the following?
> If a BAR overlaps with RAM, don't map the BAR.
> If a BAR overlaps with other BARs, record the overlap and
> when updating one of the BARs, update all the overlapping BARs.
> Which BAR wins depends on the order of updating, but it doesn't matter because
> it's an anomalous case.

But the addresses in the BARs may not overlap.  For example, Linux
allocates memory from top down, so I recently had the mapping of a BAR
to address 0xffc000

So BAR 0x18 sees 0xc004
Then BAR 0x1c sees 0xff

So if I understand what you mean by overlapping BARs, 0xc000 and
0xffc000 will not be detected as overlapping and so we can't
record it.  But, we can allow harmless mappings of the incomplete
lower-32 to proceed and then get remapped when the upper bits are
written.  (This is what happens currently, but fails when the lower-32
overwrite RAM).

Case of writing upper-then-lower (non-Linux case):
The addresses in the upper 32-bits are going to be limited to 16-bits
(at most 48-bit addresses currently) and so those shouldn't update
mappings because they will overlap with RAM.  When the lower-bits are
written, we have the full 64-bit address and can update mappings.

Case of writing lower-then-upper:
If the lower 32-bit BAR address doesn't conflict with RAM, map it.
When the upper bits are written, update to the correct mapping.

We would just have to ensure the first mapping is indeed harmless.
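
To make the proposal concrete, here is a rough sketch of the check, written as standalone C rather than as a patch against hw/pci.c. All names (`bar64_state`, `bar64_addr()`, `bar64_should_map()`) are hypothetical, and guest RAM is simplified to one range starting at address 0, which ignores the PCI hole and any RAM above 4G. After each dword write the two halves are combined, and the mapping is only updated if the resulting (possibly partial) address does not intersect RAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BAR_FLAG_MASK 0xfu      /* low four bits of a memory BAR are type flags */

struct bar64_state {
    uint32_t lo;                /* value last written to the lower BAR dword */
    uint32_t hi;                /* value last written to the upper BAR dword */
};

/* Combine both halves into the address the guest currently appears to request. */
static uint64_t bar64_addr(const struct bar64_state *b)
{
    return ((uint64_t)b->hi << 32) | (b->lo & ~BAR_FLAG_MASK);
}

/* True if [a, a + a_len) and [b, b + b_len) intersect. */
static bool ranges_intersect(uint64_t a, uint64_t a_len, uint64_t b, uint64_t b_len)
{
    return a < b + b_len && b < a + a_len;
}

/*
 * Decide after a dword write whether the current combined address is
 * harmless to map. Simplification: RAM is modelled as [0, ram_size).
 */
static bool bar64_should_map(const struct bar64_state *b, uint64_t bar_size,
                             uint64_t ram_size)
{
    assert(bar_size > 0);
    return !ranges_intersect(bar64_addr(b), bar_size, 0, ram_size);
}
```

With 2G of RAM in this model, writing a lower dword of 0x40000004 alone is refused because 0x40000000 lies inside RAM, and the mapping only happens once the upper dword has moved the address above RAM.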

Would that work?
Cam

>
> This way, 32bit BAR case is also covered.
>
> thanks,
>
>>
>> Cam
>> ---
>>  hw/pci.c |    5 -
>>  1 files changed, 4 insertions(+), 1 deletions(-)
>>
>> diff --git a/hw/pci.c b/hw/pci.c
>> index 438c0d1..3b81792 100644
>> --- a/hw/pci.c
>> +++ b/hw/pci.c
>> @@ -1000,6 +1000,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t 
>> addr, uint32_t val, int l)
>>  {
>>      int i, was_irq_disabled = pci_irq_disabled(d);
>>      uint32_t config_size = pci_config_size(d);
>> +    int is_64 = 0;
>> +
>> +    is_64 = ((val & 0xf) == PCI_BASE_ADDRESS_MEM_TYPE_64);
>>
>>      for (i = 0; i < l && addr + i < config_size; val >>= 8, ++i) {
>>          uint8_t wmask = d->wmask[addr + i];
>> @@ -1008,7 +1011,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t 
>> addr, uint32_t val, int l)
>>          d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & 
>> wmask);
>>          d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
>>      }
>> -    if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
>> +    if ((ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) && (!is_64)) ||
>>          ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
>>          ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
>>          range_covers_byte(addr, l, PCI_COMMAND))
>> --
>> 1.7.0.4
>>
>>
>
> --
> yamahata
>
>


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Anthony Liguori

On 12/14/2010 09:38 AM, Avi Kivity wrote:
>> Fortunately, we have a very good bytecode interpreter that's
>> accelerated in the kernel called KVM ;-)
>
> We have exactly the same bytecode interpreter under a different name,
> it's called userspace.
>
> If you can afford to make the transition back to the guest for
> emulation, you might as well transition to userspace.

If you re-entered the guest and setup a stack that had the RIP of the
source of the exit, then there's no additional need to exit the guest.
The handler can just do an iret.  Or am I missing something?

>> Why not have the equivalent of a paravirtual SMM mode where we can
>> reflect IO exits back to the guest in a well defined way?  It could
>> then implement PM timer in terms of HPET or something like that.
>
> More exits.

Yeah, I should have said, implement in terms of kvmclock so no
additional exits.

>> We already have a virtual address space that works for most guests
>> thanks to the TPR optimization.
>
> It only works for Windows XP and Windows XP with the /3GB extension.

Is this a fundamental limitation or just a statement of today's
heuristics?  Does any guest not keep the BIOS in virtual memory in a
static location?


Regards,

Anthony Liguori




[PATCH unit-tests 1/4] Move idt.c into lib code.

2010-12-14 Thread Gleb Natapov
Make it compilable in 32 and 64 bit mode.
---
 config-x86-common.mak |7 +-
 lib/x86/idt.c |   32 ---
 lib/x86/idt.h |1 +
 x86/idt.c |  148 -
 4 files changed, 16 insertions(+), 172 deletions(-)
 delete mode 100644 x86/idt.c

diff --git a/config-x86-common.mak b/config-x86-common.mak
index c5508b3..2269c4a 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -11,6 +11,7 @@ cflatobjs += \
 cflatobjs += lib/x86/fwcfg.o
 cflatobjs += lib/x86/apic.o
 cflatobjs += lib/x86/atomic.o
+cflatobjs += lib/x86/idt.o
 
 $(libcflat): LDFLAGS += -nostdlib
 $(libcflat): CFLAGS += -ffreestanding -I lib
@@ -50,7 +51,7 @@ $(TEST_DIR)/vmexit.elf: $(cstart.o) $(TEST_DIR)/vmexit.o
 $(TEST_DIR)/smptest.elf: $(cstart.o) $(TEST_DIR)/smptest.o
 
 $(TEST_DIR)/emulator.elf: $(cstart.o) $(TEST_DIR)/emulator.o \
-  $(TEST_DIR)/vm.o $(TEST_DIR)/idt.o
+  $(TEST_DIR)/vm.o
 
 $(TEST_DIR)/port80.elf: $(cstart.o) $(TEST_DIR)/port80.o
 
@@ -65,9 +66,9 @@ $(TEST_DIR)/realmode.o: bits = 32
 
 $(TEST_DIR)/msr.elf: $(cstart.o) $(TEST_DIR)/msr.o
 
-$(TEST_DIR)/idt_test.elf: $(cstart.o) $(TEST_DIR)/idt.o $(TEST_DIR)/idt_test.o
+$(TEST_DIR)/idt_test.elf: $(cstart.o) $(TEST_DIR)/idt_test.o
 
-$(TEST_DIR)/xsave.elf: $(cstart.o) $(TEST_DIR)/idt.o $(TEST_DIR)/xsave.o
+$(TEST_DIR)/xsave.elf: $(cstart.o) $(TEST_DIR)/xsave.o
 
 $(TEST_DIR)/rmap_chain.elf: $(cstart.o) $(TEST_DIR)/rmap_chain.o \
 $(TEST_DIR)/vm.o
diff --git a/lib/x86/idt.c b/lib/x86/idt.c
index ed2f4b0..b3e47d4 100644
--- a/lib/x86/idt.c
+++ b/lib/x86/idt.c
@@ -1,5 +1,6 @@
 #include "idt.h"
 #include "libcflat.h"
+#include "processor.h"
 
 typedef struct {
 unsigned short offset0;
@@ -19,30 +20,19 @@ typedef struct {
 
 static idt_entry_t idt[256];
 
-typedef struct {
-unsigned short limit;
-unsigned long linear_addr;
-} __attribute__((packed)) descriptor_table_t;
-
-void lidt(idt_entry_t *idt, int nentries)
+void load_lidt(idt_entry_t *idt, int nentries)
 {
-descriptor_table_t dt;
+struct descriptor_table_ptr dt;
 
 dt.limit = nentries * sizeof(*idt) - 1;
-dt.linear_addr = (unsigned long)idt;
+dt.base = (unsigned long)idt;
+lidt(&dt);
 asm volatile ("lidt %0" : : "m"(dt));
 }
 
-unsigned short read_cs()
-{
-unsigned short r;
-
-asm volatile ("mov %%cs, %0" : "=r"(r));
-return r;
-}
-
-void set_idt_entry(idt_entry_t *e, void *addr, int dpl)
+void set_idt_entry(int vec, void *addr, int dpl)
 {
+idt_entry_t *e = &idt[vec];
 memset(e, 0, sizeof *e);
 e->offset0 = (unsigned long)addr;
 e->selector = read_cs();
@@ -146,10 +136,10 @@ void setup_idt(void)
 {
 extern char ud_fault, gp_fault, de_fault;
 
-lidt(idt, 256);
-set_idt_entry(&idt[0], &de_fault, 0);
-set_idt_entry(&idt[6], &ud_fault, 0);
-set_idt_entry(&idt[13], &gp_fault, 0);
+load_lidt(idt, 256);
+set_idt_entry(0, &de_fault, 0);
+set_idt_entry(6, &ud_fault, 0);
+set_idt_entry(13, &gp_fault, 0);
 }
 
 unsigned exception_vector(void)
diff --git a/lib/x86/idt.h b/lib/x86/idt.h
index 6babcb4..81b8944 100644
--- a/lib/x86/idt.h
+++ b/lib/x86/idt.h
@@ -15,5 +15,6 @@ void setup_idt(void);
 
 unsigned exception_vector(void);
 unsigned exception_error_code(void);
+void set_idt_entry(int vec, void *addr, int dpl);
 
 #endif
diff --git a/x86/idt.c b/x86/idt.c
deleted file mode 100644
index 4480833..000
--- a/x86/idt.c
+++ /dev/null
@@ -1,148 +0,0 @@
-#include "idt.h"
-#include "libcflat.h"
-
-typedef struct {
-unsigned short offset0;
-unsigned short selector;
-unsigned short ist : 3;
-unsigned short : 5;
-unsigned short type : 4;
-unsigned short : 1;
-unsigned short dpl : 2;
-unsigned short p : 1;
-unsigned short offset1;
-unsigned offset2;
-unsigned reserved;
-} idt_entry_t;
-
-static idt_entry_t idt[256];
-
-typedef struct {
-unsigned short limit;
-unsigned long linear_addr;
-} __attribute__((packed)) descriptor_table_t;
-
-void lidt(idt_entry_t *idt, int nentries)
-{
-descriptor_table_t dt;
-
-dt.limit = nentries * sizeof(*idt) - 1;
-dt.linear_addr = (unsigned long)idt;
-asm volatile ("lidt %0" : : "m"(dt));
-}
-
-unsigned short read_cs()
-{
-unsigned short r;
-
-asm volatile ("mov %%cs, %0" : "=r"(r));
-return r;
-}
-
-void set_idt_entry(idt_entry_t *e, void *addr, int dpl)
-{
-memset(e, 0, sizeof *e);
-e->offset0 = (unsigned long)addr;
-e->selector = read_cs();
-e->ist = 0;
-e->type = 14;
-e->dpl = dpl;
-e->p = 1;
-e->offset1 = (unsigned long)addr >> 16;
-e->offset2 = (unsigned long)addr >> 32;
-}
-
-struct ex_regs {
-unsigned long rax, rcx, rdx, rbx;
-unsigned long dummy, rbp, rsi, rdi;
-unsigned long r8, r9, r10, r11;
-unsigned long r12, r13, r14, r15;
-unsigned long vector;
-unsigned long error_code;
-unsigned long rip

[PATCH unit-tests 2/4] Make access.c use library functions.

2010-12-14 Thread Gleb Natapov
access.c has functions that are provided by library code. Remove them
and use library functions instead.
---
 x86/access.c |   92 +++--
 1 files changed, 5 insertions(+), 87 deletions(-)

diff --git a/x86/access.c b/x86/access.c
index 067565b..df943d9 100644
--- a/x86/access.c
+++ b/x86/access.c
@@ -1,5 +1,7 @@
 
 #include "libcflat.h"
+#include "idt.h"
+#include "processor.h"
 
 #define smp_id() 0
 
@@ -98,34 +100,6 @@ static inline void *va(pt_element_t phys)
 return (void *)phys;
 }
 
-static unsigned long read_cr0()
-{
-unsigned long cr0;
-
-asm volatile ("mov %%cr0, %0" : "=r"(cr0));
-
-return cr0;
-}
-
-static void write_cr0(unsigned long cr0)
-{
-asm volatile ("mov %0, %%cr0" : : "r"(cr0));
-}
-
-typedef struct {
-unsigned short offset0;
-unsigned short selector;
-unsigned short ist : 3;
-unsigned short : 5;
-unsigned short type : 4;
-unsigned short : 1;
-unsigned short dpl : 2;
-unsigned short p : 1;
-unsigned short offset1;
-unsigned offset2;
-unsigned reserved;
-} idt_entry_t;
-
 typedef struct {
 pt_element_t pt_pool;
 unsigned pt_pool_size;
@@ -143,7 +117,6 @@ typedef struct {
 pt_element_t ignore_pde;
 int expected_fault;
 unsigned expected_error;
-idt_entry_t idt[256];
 } ac_test_t;
 
 typedef struct {
@@ -154,51 +127,6 @@ typedef struct {
 
 static void ac_test_show(ac_test_t *at);
 
-void lidt(idt_entry_t *idt, int nentries)
-{
-descriptor_table_t dt;
-
-dt.limit = nentries * sizeof(*idt) - 1;
-dt.linear_addr = (unsigned long)idt;
-asm volatile ("lidt %0" : : "m"(dt));
-}
-
-unsigned short read_cs()
-{
-unsigned short r;
-
-asm volatile ("mov %%cs, %0" : "=r"(r));
-return r;
-}
-
-unsigned long long rdmsr(unsigned index)
-{
-unsigned a, d;
-
-asm volatile("rdmsr" : "=a"(a), "=d"(d) : "c"(index));
-return ((unsigned long long)d << 32) | a;
-}
-
-void wrmsr(unsigned index, unsigned long long val)
-{
-unsigned a = val, d = val >> 32;
-
-asm volatile("wrmsr" : : "a"(a), "d"(d), "c"(index));
-}
-
-void set_idt_entry(idt_entry_t *e, void *addr, int dpl)
-{
-memset(e, 0, sizeof *e);
-e->offset0 = (unsigned long)addr;
-e->selector = read_cs();
-e->ist = 0;
-e->type = 14;
-e->dpl = dpl;
-e->p = 1;
-e->offset1 = (unsigned long)addr >> 16;
-e->offset2 = (unsigned long)addr >> 32;
-}
-
 void set_cr0_wp(int wp)
 {
 unsigned long cr0 = read_cr0();
@@ -222,13 +150,11 @@ void set_efer_nx(int nx)
 
 static void ac_env_int(ac_pool_t *pool)
 {
-static idt_entry_t idt[256];
+setup_idt();
 
-memset(idt, 0, sizeof(idt));
-lidt(idt, 256);
 extern char page_fault, kernel_entry;
-set_idt_entry(&idt[14], &page_fault, 0);
-set_idt_entry(&idt[0x20], &kernel_entry, 3);
+set_idt_entry(14, &page_fault, 0);
+set_idt_entry(0x20, &kernel_entry, 3);
 
 pool->pt_pool = 33 * 1024 * 1024;
 pool->pt_pool_size = 120 * 1024 * 1024 - pool->pt_pool;
@@ -273,14 +199,6 @@ int ac_test_bump(ac_test_t *at)
 return ret;
 }
 
-unsigned long read_cr3()
-{
-unsigned long cr3;
-
-asm volatile ("mov %%cr3, %0" : "=r"(cr3));
-return cr3;
-}
-
 void invlpg(void *addr)
 {
 asm volatile ("invlpg (%0)" : : "r"(addr));
-- 
1.7.2.3



[PATCH unit-tests 3/4] Remove duplicated idt code from apic test.

2010-12-14 Thread Gleb Natapov
Use library idt code instead.
---
 x86/apic.c |   49 +++--
 1 files changed, 11 insertions(+), 38 deletions(-)

diff --git a/x86/apic.c b/x86/apic.c
index 2207040..6d06f9f 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -2,22 +2,7 @@
 #include "apic.h"
 #include "vm.h"
 #include "smp.h"
-
-typedef struct {
-unsigned short offset0;
-unsigned short selector;
-unsigned short ist : 3;
-unsigned short : 5;
-unsigned short type : 4;
-unsigned short : 1;
-unsigned short dpl : 2;
-unsigned short p : 1;
-unsigned short offset1;
-#ifdef __x86_64__
-unsigned offset2;
-unsigned reserved;
-#endif
-} idt_entry_t;
+#include "idt.h"
 
 typedef struct {
 ulong regs[sizeof(ulong)*2];
@@ -90,8 +75,6 @@ asm (
 #endif
 );
 
-static idt_entry_t *idt = 0;
-
 static int g_fail;
 static int g_tests;
 
@@ -128,22 +111,12 @@ void test_enable_x2apic(void)
 }
 }
 
-static void set_idt_entry(unsigned vec, void (*func)(isr_regs_t *regs))
+static void handle_irq(unsigned vec, void (*func)(isr_regs_t *regs))
 {
 u8 *thunk = vmalloc(50);
-ulong ptr = (ulong)thunk;
-idt_entry_t ent = {
-.offset0 = ptr,
-.selector = read_cs(),
-.ist = 0,
-.type = 14,
-.dpl = 0,
-.p = 1,
-.offset1 = ptr >> 16,
-#ifdef __x86_64__
-.offset2 = ptr >> 32,
-#endif
-};
+
+set_idt_entry(vec, thunk, 0);
+
 #ifdef __x86_64__
 /* sub $8, %rsp */
 *thunk++ = 0x48; *thunk++ = 0x83; *thunk++ = 0xec; *thunk++ = 0x08;
@@ -164,7 +137,6 @@ static void set_idt_entry(unsigned vec, void 
(*func)(isr_regs_t *regs))
 *thunk ++ = 0xe9;
 *(u32 *)thunk = (ulong)isr_entry_point - (ulong)(thunk + 4);
 #endif
-idt[vec] = ent;
 }
 
 static void irq_disable(void)
@@ -194,7 +166,7 @@ static void test_self_ipi(void)
 {
 int vec = 0xf1;
 
-set_idt_entry(vec, self_ipi_isr);
+handle_irq(vec, self_ipi_isr);
 irq_enable();
 apic_icr_write(APIC_DEST_SELF | APIC_DEST_PHYSICAL | APIC_DM_FIXED | vec,
0);
@@ -234,7 +206,7 @@ static void ioapic_isr_77(isr_regs_t *regs)
 
 static void test_ioapic_intr(void)
 {
-set_idt_entry(0x77, ioapic_isr_77);
+handle_irq(0x77, ioapic_isr_77);
 set_ioapic_redir(0x10, 0x77);
 toggle_irq_line(0x10);
 asm volatile ("nop");
@@ -262,8 +234,8 @@ static void ioapic_isr_66(isr_regs_t *regs)
 
 static void test_ioapic_simultaneous(void)
 {
-set_idt_entry(0x78, ioapic_isr_78);
-set_idt_entry(0x66, ioapic_isr_66);
+handle_irq(0x78, ioapic_isr_78);
+handle_irq(0x66, ioapic_isr_66);
 set_ioapic_redir(0x10, 0x78);
 set_ioapic_redir(0x11, 0x66);
 irq_disable();
@@ -323,7 +295,7 @@ static void test_sti_nmi(void)
return;
 }
 
-set_idt_entry(2, nmi_handler);
+handle_irq(2, nmi_handler);
 on_cpu(1, update_cr3, (void *)read_cr3());
 
 sti_loop_active = 1;
@@ -343,6 +315,7 @@ int main()
 {
 setup_vm();
 smp_init();
+setup_idt();
 
 test_lapic_existence();
 
-- 
1.7.2.3
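
For readers following the refactoring: the initializer removed from apic.c split the thunk address across the offset0/offset1/offset2 fields of the gate descriptor. A minimal sketch of that split is below. It assumes the x86_64 layout shown in the removed typedef; the names example_gate_t, example_make_gate and example_gate_handler are hypothetical and are not the helpers provided by idt.h.

```c
#include <stdint.h>

/* Hypothetical 64-bit interrupt gate, mirroring the fields of the
 * idt_entry_t that the patch moves out of apic.c. */
typedef struct {
	unsigned short offset0;		/* handler address bits 0..15 */
	unsigned short selector;	/* code segment selector */
	unsigned short ist : 3;
	unsigned short : 5;
	unsigned short type : 4;	/* 14 = interrupt gate */
	unsigned short : 1;
	unsigned short dpl : 2;
	unsigned short p : 1;		/* present bit */
	unsigned short offset1;		/* handler address bits 16..31 */
	uint32_t offset2;		/* handler address bits 32..63 */
	uint32_t reserved;
} example_gate_t;

/* Build a present, DPL 0 interrupt gate pointing at 'handler'. */
static example_gate_t example_make_gate(uint64_t handler, unsigned short cs)
{
	example_gate_t g = {
		.offset0 = (unsigned short)handler,
		.selector = cs,
		.ist = 0,
		.type = 14,
		.dpl = 0,
		.p = 1,
		.offset1 = (unsigned short)(handler >> 16),
		.offset2 = (uint32_t)(handler >> 32),
	};
	return g;
}

/* Reassemble the handler address from the three offset fields. */
static uint64_t example_gate_handler(const example_gate_t *g)
{
	return (uint64_t)g->offset0 |
	       ((uint64_t)g->offset1 << 16) |
	       ((uint64_t)g->offset2 << 32);
}
```

Reassembling the address from a gate is a cheap way to check that no bits were dropped when the descriptor was filled in.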

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH unit-tests 4/4] Remove unused function from apic test.

2010-12-14 Thread Gleb Natapov
---
 x86/apic.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/x86/apic.c b/x86/apic.c
index 6d06f9f..bcb9fc1 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -78,11 +78,6 @@ asm (
 static int g_fail;
 static int g_tests;
 
-static void outb(unsigned char data, unsigned short port)
-{
-asm volatile ("out %0, %1" : : "a"(data), "d"(port));
-}
-
 static void report(const char *msg, int pass)
 {
 ++g_tests;
-- 
1.7.2.3



Re: [Qemu-devel] KVM call agenda for Dec 14

2010-12-14 Thread Chris Wright
* Jes Sorensen (jes.soren...@redhat.com) wrote:
> Any chance you could fix your cronjob to send out the CFA a day earlier?
> 15 hrs before is a bit short notice.

Sure.


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Avi Kivity

On 12/14/2010 05:32 PM, Anthony Liguori wrote:


>>>> If anything I'd expect hpet or the Microsoft synthetic timers to be a
>>>> lot more important.
>>>
>>> True. But also a lot more work.
>>> Implementing just the pm timer counter - not the whole of it - in
>>> kernel, gives us a lot of gain with not very much effort. Patch is
>>> pretty simple, as you can see, and most of it is even code to turn it
>>> on/off, etc.
>>
>> Partial emulation is not something I like since it causes a fuzzy
>> kernel/user boundary.  In this case, transitioning to userspace when
>> interrupts are enabled doesn't look so hot.  Are you sure all guests
>> that benefit from this don't enable the pmtimer interrupt?  What
>> about the transition?  Will we have a time discontinuity when that
>> happens?
>>
>> What I'd really like to see is this stuff implemented in bytecode,
>> unfortunately that's a lot of work which will be very hard to upstream.
>
> Fortunately, we have a very good bytecode interpreter that's
> accelerated in the kernel called KVM ;-)

We have exactly the same bytecode interpreter under a different name,
it's called userspace.

If you can afford to make the transition back to the guest for
emulation, you might as well transition to userspace.

> Why not have the equivalent of a paravirtual SMM mode where we can
> reflect IO exits back to the guest in a well defined way?  It could
> then implement PM timer in terms of HPET or something like that.

More exits.

> We already have a virtual address space that works for most guests
> thanks to the TPR optimization.

It only works for Windows XP and Windows XP with the /3GB extension.

--
error compiling committee.c: too many arguments to function



Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Anthony Liguori

On 12/14/2010 07:49 AM, Avi Kivity wrote:

On 12/14/2010 03:40 PM, Glauber Costa wrote:

>
>  What is the motivation for this?  Are there any important guests that
>  use the pmtimer?
Avi,

All older RHEL and Windows, for example, would benefit from this.


They only benefit from it because we don't provide HPET.  If we did, 
the guests would use HPET in preference to pmtimer, since HPET is so 
much better than pmtimer (yet still sucks in an absolute sense).



>  If anything I'd expect hpet or the Microsoft synthetic timers to be a
>  lot more important.

True. But also a lot more work.
Implementing just the pm timer counter - not the whole of it - in
kernel, gives us a lot of gain with not very much effort. Patch is
pretty simple, as you can see, and most of it is even code to turn it
on/off, etc.



Partial emulation is not something I like since it causes a fuzzy 
kernel/user boundary.  In this case, transitioning to userspace when 
interrupts are enabled doesn't look so hot.  Are you sure all guests 
that benefit from this don't enable the pmtimer interrupt?  What about 
the transition?  Will we have a time discontinuity when that happens?


What I'd really like to see is this stuff implemented in bytecode, 
unfortunately that's a lot of work which will be very hard to upstream.


Fortunately, we have a very good bytecode interpreter that's accelerated 
in the kernel called KVM ;-)


Why not have the equivalent of a paravirtual SMM mode where we can 
reflect IO exits back to the guest in a well defined way?  It could then 
implement PM timer in terms of HPET or something like that.


We already have a virtual address space that works for most guests 
thanks to the TPR optimization.


Regards,

Anthony Liguori




Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Anthony Liguori

On 12/14/2010 06:09 AM, Ulrich Obergfell wrote:

Hi,

This is an RFC through which I would like to get feedback on how the
idea of in-kernel PM Timer would be received.

The current implementation of PM Timer emulation is 'heavy-weight'
because the code resides in qemu userspace. Guest operating systems
that use PM Timer as a clock source (for example, older versions of
Linux that do not have paravirtualized clock) would benefit from an
in-kernel PM Timer emulation.

Parts 1 thru 4 of this RFC contain experimental source code which
I recently used to investigate the performance benefit. In a Linux
guest, I was running a program that calls gettimeofday() 'n' times
in a loop (the PM Timer register is read during each call). With
in-kernel PM Timer, I observed a significant reduction of program
execution time.
   


I've played with this in the past.  Can you post real numbers, 
preferably, with a real work load?
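
A microbenchmark of the kind described in the RFC could look roughly like the sketch below. It is only an illustration, not the program that was actually run: the function name is hypothetical, and whether each gettimeofday() call really reads the PM Timer register depends on the clock source the guest kernel has selected.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Call gettimeofday() n times and return the elapsed wall time in
 * microseconds, measured with the same clock that is being exercised.
 */
static int64_t example_time_gettimeofday_loop(unsigned long n)
{
	struct timeval start, now;
	unsigned long i;

	gettimeofday(&start, NULL);
	now = start;
	for (i = 0; i < n; i++)
		gettimeofday(&now, NULL);

	return (int64_t)(now.tv_sec - start.tv_sec) * 1000000
	       + (now.tv_usec - start.tv_usec);
}
```

Dividing the result by n gives an approximate per-call cost, which is the figure that would be compared between the userspace and the in-kernel emulation.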


Regards,

Anthony Liguori


The experimental code emulates the PM Timer register in KVM kernel.
All other components of ACPI PM remain in qemu userspace. Also, the
'timer carry interrupt' feature is not implemented in-kernel. If a
guest operating system needs to enable the 'timer carry interrupt',
the code takes care that PM Timer emulation falls back to userspace.
However, I think the design of the code has sufficient flexibility,
so that anyone who would want to add the 'timer carry interrupt'
feature in-kernel could try to do so later on.

Please review and please comment.


Regards,

Uli Obergfell


Re: [PATCH v4 2/2] RAM API: Make use of it for x86 PC

2010-12-14 Thread Anthony Liguori

On 12/14/2010 09:16 AM, Alex Williamson wrote:

On Tue, 2010-12-14 at 11:18 +0200, Avi Kivity wrote:
   

On 12/13/2010 11:24 PM, Alex Williamson wrote:
 

Register the actual VM RAM using the new API


@@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
   /* allocate RAM */
   ram_addr = qemu_ram_alloc(NULL, "pc.ram",
 below_4g_mem_size + above_4g_mem_size);
-cpu_register_physical_memory(0, 0xa0000, ram_addr);
-cpu_register_physical_memory(0x100000,
- below_4g_mem_size - 0x100000,
- ram_addr + 0x100000);
+ram_register(0, below_4g_mem_size, ram_addr);

   

What's the impact of this?  Won't it conflict with BIOS memory
registration?  What about VGA?
 


There is no "conflict".  Memory registration can punch through previous 
registrations.


And the QEMU SMM code switches the VGA area back and forth between 
memory mapped and normal ram depending on the mode.


This presents no functional change; it just structures RAM allocation to
more closely reflect the way things actually work.
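
The "punch through" behaviour can be pictured with a toy page-granular map in which a later registration simply overwrites whatever an earlier one recorded for the same pages. This is a sketch under stated assumptions, not QEMU's implementation: every name below (ex_phys_map, ex_register, ex_lookup) is hypothetical, and the addresses used with it are only illustrative.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT	12
#define EX_PAGE_SIZE	(1ull << EX_PAGE_SHIFT)
#define EX_NPAGES	512	/* first 2 MiB, enough for the example */

enum ex_backing { EX_UNASSIGNED = 0, EX_RAM, EX_VGA, EX_ROM };

/* One backing type per guest-physical page. */
struct ex_phys_map {
	enum ex_backing page[EX_NPAGES];
};

/* Register [start, start + size); later calls override earlier ones. */
static void ex_register(struct ex_phys_map *m, uint64_t start,
			uint64_t size, enum ex_backing b)
{
	uint64_t first = start >> EX_PAGE_SHIFT;
	uint64_t end = (start + size + EX_PAGE_SIZE - 1) >> EX_PAGE_SHIFT;
	uint64_t pfn;

	for (pfn = first; pfn < end && pfn < EX_NPAGES; pfn++)
		m->page[pfn] = b;
}

/* Return what currently backs the page containing addr. */
static enum ex_backing ex_lookup(const struct ex_phys_map *m, uint64_t addr)
{
	uint64_t pfn = addr >> EX_PAGE_SHIFT;

	return pfn < EX_NPAGES ? m->page[pfn] : EX_UNASSIGNED;
}
```

Registering all of low memory as RAM first and then a VGA or ROM window on top leaves the window visible, and registering RAM over the window again switches it back, which is the mode switching described above.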


Regards,

Anthony Liguori


In terms of patch hygiene, it should be in a separate patch titled
"register 0xa0000-0x100000 as RAM" or something.  It's a much more
drastic change than making use of the new RAM API.
 

As we discussed in the v2 patch, the chipset can selectively switch
regions within this range to point at VGA, ROM, or RAM, but there's
always physical RAM backing the space, even when its mapping isn't
active.  VGA and ROM will overlay the RAM mapping.  I'm fine with
splitting this into two patches for debug-ability, but the change is
reflective of following the RAM API and registering all of "RAM".  Maybe
it would be sufficient to make such a note explicit in this commit log?
Thanks,

Alex

   




Re: [PATCH v4 2/2] RAM API: Make use of it for x86 PC

2010-12-14 Thread Alex Williamson
On Tue, 2010-12-14 at 11:18 +0200, Avi Kivity wrote:
> On 12/13/2010 11:24 PM, Alex Williamson wrote:
> > Register the actual VM RAM using the new API
> >
> >
> > @@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
> >   /* allocate RAM */
> >   ram_addr = qemu_ram_alloc(NULL, "pc.ram",
> > below_4g_mem_size + above_4g_mem_size);
> > -cpu_register_physical_memory(0, 0xa0000, ram_addr);
> > -cpu_register_physical_memory(0x100000,
> > - below_4g_mem_size - 0x100000,
> > - ram_addr + 0x100000);
> > +ram_register(0, below_4g_mem_size, ram_addr);
> >
> 
> What's the impact of this?  Won't it conflict with BIOS memory 
> registration?  What about VGA?
> 
> In terms of patch hygiene, it should be in a separate patch titled 
> "register 0xa0000-0x100000 as RAM" or something.  It's a much more 
> drastic change than making use of the new RAM API.

As we discussed in the v2 patch, the chipset can selectively switch
regions within this range to point at VGA, ROM, or RAM, but there's
always physical RAM backing the space, even when its mapping isn't
active.  VGA and ROM will overlay the RAM mapping.  I'm fine with
splitting this into two patches for debug-ability, but the change is
reflective of following the RAM API and registering all of "RAM".  Maybe
it would be sufficient to make such a note explicit in this commit log?
Thanks,

Alex



Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Avi Kivity

On 12/14/2010 04:44 PM, Ulrich Obergfell wrote:

>
>  Partial emulation is not something I like since it causes a fuzzy
>  kernel/user boundary.  In this case, transitioning to userspace when
>  interrupts are enabled doesn't look so hot.  Are you sure all guests
>  that benefit from this don't enable the pmtimer interrupt?  What about
>  the transition?  Will we have a time discontinuity when that happens?

Avi,

the idea is to use the '-kvm-pmtmr' option (in code part 4) only
with guests that do not enable the 'timer carry interrupt'. Guests
that need to enable the 'timer carry interrupt' should rather use
the PM Timer emulation in qemu userspace (i.e. they should not be
started with this option). If a guest is accidentally started with
this option, the in-kernel PM Timer (in code part 1) detects if
the guest attempts to enable the 'timer carry interrupt' and falls
back to PM Timer emulation in qemu userspace (in-kernel PM Timer
disables itself automatically). So, this is not a combination of
in-kernel PM Timer register emulation and qemu userspace PM Timer
interrupt emulation.



We really try to avoid guest specific parameters.  Having to decide if 
the guest has virtio is bad enough, but going into low level details 
like that is really bad.  The host admin might not even know what 
operating systems its guests run.


A guest might even dual boot two different operating systems.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] KVM call agenda for Dec 14

2010-12-14 Thread Jes Sorensen
On 12/14/10 01:12, Chris Wright wrote:
> Please send in any agenda items you are interested in covering.
> 
> thanks,
> -chris
> 

Chris,

Any chance you could fix your cronjob to send out the CFA a day earlier?
15 hrs before is a bit short notice.

Cheers,
Jes



Re: KVM call agenda for Dec 14

2010-12-14 Thread Chris Wright
* Chris Wright (chr...@redhat.com) wrote:
> Please send in any agenda items you are interested in covering.

No agenda, today's call is cancelled.


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Ulrich Obergfell

- "Avi Kivity"  wrote:

> On 12/14/2010 03:40 PM, Glauber Costa wrote:
> > >
> > >  What is the motivation for this?  Are there any important guests that
> > >  use the pmtimer?
> > Avi,
> >
> > All older RHEL and Windows, for example, would benefit from this.
> 
> They only benefit from it because we don't provide HPET.  If we did, the 
> guests would use HPET in preference to pmtimer, since HPET is so much
> better than pmtimer (yet still sucks in an absolute sense).
> 
> > >  If anything I'd expect hpet or the Microsoft synthetic timers to be a
> > >  lot more important.
> >
> > True. But also a lot more work.
> > Implementing just the pm timer counter - not the whole of it - in
> > kernel, gives us a lot of gain with not very much effort. Patch is
> > pretty simple, as you can see, and most of it is even code to turn it
> > on/off, etc.
> >
> 
> Partial emulation is not something I like since it causes a fuzzy 
> kernel/user boundary.  In this case, transitioning to userspace when 
> interrupts are enabled doesn't look so hot.  Are you sure all guests 
> that benefit from this don't enable the pmtimer interrupt?  What about
> the transition?  Will we have a time discontinuity when that happens?

Avi,

the idea is to use the '-kvm-pmtmr' option (in code part 4) only
with guests that do not enable the 'timer carry interrupt'. Guests
that need to enable the 'timer carry interrupt' should rather use
the PM Timer emulation in qemu userspace (i.e. they should not be
started with this option). If a guest is accidentally started with
this option, the in-kernel PM Timer (in code part 1) detects if
the guest attempts to enable the 'timer carry interrupt' and falls
back to PM Timer emulation in qemu userspace (in-kernel PM Timer
disables itself automatically). So, this is not a combination of
in-kernel PM Timer register emulation and qemu userspace PM Timer
interrupt emulation.
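
To make the fallback concrete, here is a minimal sketch of the decision described above. It is not the code from parts 1 thru 4: the structure and function names are hypothetical, and it assumes that the guest enables the 'timer carry interrupt' by setting TMR_EN (bit 0) in the ACPI PM1 enable register and that the hand-over to userspace is one-way.

```c
#include <stdbool.h>
#include <stdint.h>

/* ACPI PM1 enable register, bit 0: timer carry interrupt enable. */
#define EX_PM1_TMR_EN	0x0001u

/* Hypothetical per-VM state for the in-kernel PM Timer. */
struct ex_pmtmr {
	bool in_kernel;	/* true while the kernel answers counter reads */
};

/*
 * Called when the guest writes the PM1 enable register. If the guest
 * asks for the timer carry interrupt, the in-kernel emulation disables
 * itself and all further accesses go to userspace.
 */
static void ex_pm1_enable_write(struct ex_pmtmr *t, uint16_t val)
{
	if (val & EX_PM1_TMR_EN)
		t->in_kernel = false;
}

/* Decide whether a PM Timer register read is handled in the kernel. */
static bool ex_read_handled_in_kernel(const struct ex_pmtmr *t)
{
	return t->in_kernel;
}
```

With this shape the register and the interrupt are never emulated in different places at the same time: either the kernel owns the counter and no interrupt is enabled, or userspace owns both.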

Regards,

Uli


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Gleb Natapov
On Tue, Dec 14, 2010 at 03:49:37PM +0200, Avi Kivity wrote:
> On 12/14/2010 03:40 PM, Glauber Costa wrote:
> >>
> >>  What is the motivation for this?  Are there any important guests that
> >>  use the pmtimer?
> >Avi,
> >
> >All older RHEL and Windows, for example, would benefit from this.
> 
> They only benefit from it because we don't provide HPET.  If we did,
> the guests would use HPET in preference to pmtimer, since HPET is so
> much better than pmtimer (yet still sucks in an absolute sense).
> 
> >>  If anything I'd expect hpet or the Microsoft synthetic timers to be a
> >>  lot more important.
> >
> >True. But also a lot more work.
> >Implementing just the pm timer counter - not the whole of it - in
> >kernel, gives us a lot of gain with not very much effort. Patch is
> >pretty simple, as you can see, and most of it is even code to turn it
> >on/off, etc.
> >
> 
> Partial emulation is not something I like since it causes a fuzzy
> kernel/user boundary.  In this case, transitioning to userspace when
> interrupts are enabled doesn't look so hot.  Are you sure all guests
> that benefit from this don't enable the pmtimer interrupt?  What
> about the transition?  Will we have a time discontinuity when that
> happens?
> 
> What I'd really like to see is this stuff implemented in bytecode,
> unfortunately that's a lot of work which will be very hard to
> upstream.
>
 
Just use ACPI bytecode. It is upstream already.


--
Gleb.


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Avi Kivity

On 12/14/2010 03:40 PM, Glauber Costa wrote:

>
>  What is the motivation for this?  Are there any important guests that
>  use the pmtimer?
Avi,

All older RHEL and Windows, for example, would benefit from this.


They only benefit from it because we don't provide HPET.  If we did, the 
guests would use HPET in preference to pmtimer, since HPET is so much 
better than pmtimer (yet still sucks in an absolute sense).



>  If anything I'd expect hpet or the Microsoft synthetic timers to be a
>  lot more important.

True. But also a lot more work.
Implementing just the pm timer counter - not the whole of it - in
kernel, gives us a lot of gain with not very much effort. Patch is
pretty simple, as you can see, and most of it is even code to turn it
on/off, etc.



Partial emulation is not something I like since it causes a fuzzy 
kernel/user boundary.  In this case, transitioning to userspace when 
interrupts are enabled doesn't look so hot.  Are you sure all guests 
that benefit from this don't enable the pmtimer interrupt?  What about 
the transition?  Will we have a time discontinuity when that happens?


What I'd really like to see is this stuff implemented in bytecode, 
unfortunately that's a lot of work which will be very hard to upstream.


--
error compiling committee.c: too many arguments to function



Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Glauber Costa
On Tue, 2010-12-14 at 15:34 +0200, Avi Kivity wrote:
> On 12/14/2010 02:09 PM, Ulrich Obergfell wrote:
> > Hi,
> >
> > This is an RFC through which I would like to get feedback on how the
> > idea of in-kernel PM Timer would be received.
> >
> > The current implementation of PM Timer emulation is 'heavy-weight'
> > because the code resides in qemu userspace. Guest operating systems
> > that use PM Timer as a clock source (for example, older versions of
> > Linux that do not have paravirtualized clock) would benefit from an
> > in-kernel PM Timer emulation.
> >
> > Parts 1 thru 4 of this RFC contain experimental source code which
> > I recently used to investigate the performance benefit. In a Linux
> > guest, I was running a program that calls gettimeofday() 'n' times
> > in a loop (the PM Timer register is read during each call). With
> > in-kernel PM Timer, I observed a significant reduction of program
> > execution time.
> >
> > The experimental code emulates the PM Timer register in KVM kernel.
> > All other components of ACPI PM remain in qemu userspace. Also, the
> > 'timer carry interrupt' feature is not implemented in-kernel. If a
> > guest operating system needs to enable the 'timer carry interrupt',
> > the code takes care that PM Timer emulation falls back to userspace.
> > However, I think the design of the code has sufficient flexibility,
> > so that anyone who would want to add the 'timer carry interrupt'
> > feature in-kernel could try to do so later on.
> >
> 
> What is the motivation for this?  Are there any important guests that 
> use the pmtimer?
Avi,

All older RHEL and Windows, for example, would benefit from this.

> If anything I'd expect hpet or the Microsoft synthetic timers to be a 
> lot more important.

True. But also a lot more work.
Implementing just the pm timer counter - not the whole of it - in
kernel, gives us a lot of gain with not very much effort. Patch is
pretty simple, as you can see, and most of it is even code to turn it
on/off, etc.
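
To give an idea of how little the counter itself involves, a sketch of the computation is below. It is not taken from the patch. It assumes the ACPI PM Timer frequency of 3.579545 MHz and a 24-bit counter, and the function name and the "nanoseconds since reset" input are hypothetical.

```c
#include <stdint.h>

#define EX_PMTMR_HZ		3579545ull	/* ACPI PM Timer frequency */
#define EX_NSEC_PER_SEC		1000000000ull
#define EX_PMTMR_MASK24		0x00ffffffull	/* 24-bit counter assumed */

/*
 * Convert elapsed nanoseconds into the value a guest would read from
 * the PM Timer register. Seconds and the sub-second remainder are
 * scaled separately so the intermediate products stay within 64 bits.
 */
static uint32_t ex_pmtmr_read(uint64_t ns_since_reset)
{
	uint64_t secs = ns_since_reset / EX_NSEC_PER_SEC;
	uint64_t rem = ns_since_reset % EX_NSEC_PER_SEC;
	uint64_t ticks = secs * EX_PMTMR_HZ +
			 rem * EX_PMTMR_HZ / EX_NSEC_PER_SEC;

	return (uint32_t)(ticks & EX_PMTMR_MASK24);
}
```

The 24-bit counter wraps roughly every 4.7 seconds at this frequency, which is the event the 'timer carry interrupt' reports.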





Re: [PATCH][unit-tests] fix i386 arch compilation.

2010-12-14 Thread Avi Kivity

On 12/14/2010 02:26 PM, Gleb Natapov wrote:

Commit 750bbdb forgot to convert i386 arch to .elf.



Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Avi Kivity

On 12/14/2010 02:09 PM, Ulrich Obergfell wrote:

Hi,

This is an RFC through which I would like to get feedback on how the
idea of in-kernel PM Timer would be received.

The current implementation of PM Timer emulation is 'heavy-weight'
because the code resides in qemu userspace. Guest operating systems
that use PM Timer as a clock source (for example, older versions of
Linux that do not have paravirtualized clock) would benefit from an
in-kernel PM Timer emulation.

Parts 1 thru 4 of this RFC contain experimental source code which
I recently used to investigate the performance benefit. In a Linux
guest, I was running a program that calls gettimeofday() 'n' times
in a loop (the PM Timer register is read during each call). With
in-kernel PM Timer, I observed a significant reduction of program
execution time.

The experimental code emulates the PM Timer register in KVM kernel.
All other components of ACPI PM remain in qemu userspace. Also, the
'timer carry interrupt' feature is not implemented in-kernel. If a
guest operating system needs to enable the 'timer carry interrupt',
the code takes care that PM Timer emulation falls back to userspace.
However, I think the design of the code has sufficient flexibility,
so that anyone who would want to add the 'timer carry interrupt'
feature in-kernel could try to do so later on.



What is the motivation for this?  Are there any important guests that 
use the pmtimer?


If anything I'd expect hpet or the Microsoft synthetic timers to be a 
lot more important.


--
error compiling committee.c: too many arguments to function



Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

2010-12-14 Thread Mike Galbraith
On Tue, 2010-12-14 at 16:56 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Dec 14, 2010 at 12:03:58PM +0100, Mike Galbraith wrote:
> > On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote:
> > > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:
> > 
> > > > That part looks ok, except for the yield cross cpu bit.  Trying to yield
> > > > a resource you don't have doesn't make much sense to me.
> > > 
> > > So another (crazy) idea is to move the "yieldee" task on another cpu over
> > > to yielding task's cpu, let it run till the end of yielding task's slice
> > > and then let it go back to the original cpu at the same vruntime position!
> > 
> > Yeah, pulling the intended recipient makes fine sense.  If he doesn't
> > preempt you, you can try to swap vruntimes or whatever makes arithmetic
> > sense and will help.  Dunno how you tell him how long he can keep the
> > cpu though,
> 
> can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is 
> preempted at the end of yielding task's timeslice?

And dork up accounting.  Why?  Besides, it won't work because you have
no idea who may preempt whom, when, and for how long.

(Why do people keep talking about timeslice?  The only thing that exists
is lag that changes the instant anyone does anything of interest.)

> > and him somehow going back home needs to be a plain old
> > migration, no fancy restoration of ancient history vruntime.
> 
> What is the issue if it gets queued at the old vruntime (assuming fair stick
> is still behind that)? Without that it will hurt fairness for the yieldee
> (and perhaps of the overall VM in this case).

Who all are you placing this task in front of or behind based upon a
non-existent relationship?

Your recipient may well have been preempted, and is now further behind
than the completely irrelevant to the current situation stored vruntime
would indicate, so why would you want to move it rightward?  Certainly
not in the interest of fairness.

-Mike



[PATCH][unit-tests] fix i386 arch compilation.

2010-12-14 Thread Gleb Natapov
Commit 750bbdb forgot to convert i386 arch to .elf.

diff --git a/config-i386.mak b/config-i386.mak
index 6dbd19f..c1b6e08 100644
--- a/config-i386.mak
+++ b/config-i386.mak
@@ -9,4 +9,4 @@ tests = $(TEST_DIR)/taskswitch.flat
 
 include config-x86-common.mak
 
-$(TEST_DIR)/taskswitch.flat: $(cstart.o) $(TEST_DIR)/taskswitch.o
+$(TEST_DIR)/taskswitch.elf: $(cstart.o) $(TEST_DIR)/taskswitch.o
--
Gleb.


[GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations

2010-12-14 Thread Michael S. Tsirkin
On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
> Please merge the following tree for 2.6.38.
> Thanks!

Rusty Acked it as is, so please pull the below.
Thanks very much!

> The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:
> 
>   net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)
> 
> are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next
> 
> Jason Wang (1):
>   vhost: fix typos in comment
> 
> Julia Lawall (1):
>   drivers/vhost/vhost.c: delete double assignment
> 
> Michael S. Tsirkin (9):
>   vhost: put mm after thread stop
>   vhost-net: batch use/unuse mm
>   vhost: copy_to_user -> __copy_to_user
>   vhost: get/put_user -> __get/__put_user
>   vhost: remove unused include
>   vhost: correctly set bits of dirty pages
>   vhost: better variable name in logging
>   vhost test module
>   tools/virtio: virtio_test tool
> 
>  drivers/vhost/net.c  |9 +-
>  drivers/vhost/test.c |  320 ++
>  drivers/vhost/test.h |7 +
>  drivers/vhost/vhost.c|   44 +++---
>  drivers/vhost/vhost.h|2 +-
>  tools/virtio/Makefile|   12 ++
>  tools/virtio/linux/device.h  |2 +
>  tools/virtio/linux/slab.h|2 +
>  tools/virtio/linux/virtio.h  |  223 +++
>  tools/virtio/vhost_test/Makefile |2 +
>  tools/virtio/vhost_test/vhost_test.c |1 +
>  tools/virtio/virtio_test.c   |  248 ++
>  12 files changed, 842 insertions(+), 30 deletions(-)
>  create mode 100644 drivers/vhost/test.c
>  create mode 100644 drivers/vhost/test.h
>  create mode 100644 tools/virtio/Makefile
>  create mode 100644 tools/virtio/linux/device.h
>  create mode 100644 tools/virtio/linux/slab.h
>  create mode 100644 tools/virtio/linux/virtio.h
>  create mode 100644 tools/virtio/vhost_test/Makefile
>  create mode 100644 tools/virtio/vhost_test/vhost_test.c
>  create mode 100644 tools/virtio/virtio_test.c


Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

2010-12-14 Thread Peter Zijlstra
On Mon, 2010-12-13 at 22:46 -0500, Rik van Riel wrote:


> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2c79e92..408326f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1086,6 +1086,8 @@ struct sched_class {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> void (*task_move_group) (struct task_struct *p, int on_rq);
>  #endif
> +
> +   void (*yield_to) (struct rq *rq, struct task_struct *p);
>  };
>  
>  struct load_weight {
> @@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long nice);
>  extern int task_prio(const struct task_struct *p);
>  extern int task_nice(const struct task_struct *p);
>  extern int can_nice(const struct task_struct *p, const int nice);
> +extern void requeue_task(struct rq *rq, struct task_struct *p);

That definitely doesn't want to be a globally visible symbol.

>  extern int task_curr(const struct task_struct *p);
>  extern int idle_cpu(int cpu);
>  extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
> @@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
>  extern int wake_up_process(struct task_struct *tsk);
>  extern void wake_up_new_task(struct task_struct *tsk,
> unsigned long clone_flags);
> +
> +extern u64 slice_remain(struct task_struct *);

idem.


> +void yield_to(struct task_struct *p)
> +{
> +   unsigned long flags;
> +   struct rq *rq, *p_rq;
> +
> +   local_irq_save(flags);
> +   rq = this_rq();
> +again:
> +   p_rq = task_rq(p);
> +   double_rq_lock(rq, p_rq);
> +   if (p_rq != task_rq(p)) {
> +   double_rq_unlock(rq, p_rq);
> +   goto again;
> +   }
> +
> +   /* We can't yield to a process that doesn't want to run. */
> +   if (!p->se.on_rq)
> +   goto out;
> +
> +   /*
> +* We can only yield to a runnable task, in the same schedule class
> +* as the current task, if the schedule class implements yield_to_task.
> +*/
> +   if (!task_running(rq, p) && current->sched_class == p->sched_class &&
> +   current->sched_class->yield_to)
> +   current->sched_class->yield_to(rq, p);

rq and p don't match, see below.

> +
> +out:
> +   double_rq_unlock(rq, p_rq);
> +   local_irq_restore(flags);
> +   yield();

That wants to be plain: schedule(), possibly conditional on having
called sched_class::yield_to.

> +}
> +EXPORT_SYMBOL_GPL(yield_to);

> +u64 slice_remain(struct task_struct *p)
> +{
> +   unsigned long flags;
> +   struct sched_entity *se = &p->se;
> +   struct cfs_rq *cfs_rq;
> +   struct rq *rq;
> +   u64 slice, ran;
> +   s64 delta;
> +
> +   rq = task_rq_lock(p, &flags);

You're calling this from
yield_to()->sched_class::yield_to()->yield_to_fair()->slice_remain(),
yield_to() already holds p's rq lock.

> +   cfs_rq = cfs_rq_of(se);
> +   slice = sched_slice(cfs_rq, se);
> +   ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +   delta = slice - ran;
> +   task_rq_unlock(rq, &flags);
> +
> +   return max(delta, 0LL);
> +}

Like Mike said, the returned figure doesn't really mean anything, it's
definitely not the remaining time of a slice. It might qualify for a
weak random number generator though.. :-)

> +static void yield_to_fair(struct rq *rq, struct task_struct *p)
> +{
> +   struct sched_entity *se = &p->se;
> +   struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +   u64 remain = slice_remain(current);
> +
> +   dequeue_task(rq, p, 0);

Here you assume @p lives on @rq, but you passed:

+   current->sched_class->yield_to(rq, p);

and rq = this_rq(), so this will go splat.

> +   se->vruntime -= remain;

You cannot simply subtract wall-time from virtual time, see the usage of
calc_delta_fair() in the proposal below.

> +   if (se->vruntime < cfs_rq->min_vruntime)
> +           se->vruntime = cfs_rq->min_vruntime;

Then clipping it to min_vruntime doesn't make any sense at all.

> +   enqueue_task(rq, p, 0);
> +   check_preempt_curr(rq, p, 0);
> +} 

Also, modifying the vruntime of one task without also modifying the
vruntime of the other task breaks stuff. You're injecting time into p
without taking time out of current. 
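
To make the two objections above concrete (wall time has to be scaled by
load weight before it touches vruntime, and a transfer has to debit one
task as it credits the other), here is a minimal user-space sketch. It is
not kernel code: sketch_calc_delta_fair() is a simplified stand-in for
calc_delta_fair(), the struct and sketch_transfer() are hypothetical, and
the weights in the usage note are illustrative. Only the nice-0 weight of
1024 is taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NICE_0_LOAD 1024ULL	/* load weight of a nice-0 task */

/* Hypothetical, cut-down entity: just a weight and a virtual runtime. */
struct sketch_entity {
	uint64_t weight;
	uint64_t vruntime;	/* virtual ns, not wall-clock ns */
};

/*
 * Simplified stand-in for calc_delta_fair(): convert wall-clock ns into
 * this entity's virtual ns. Heavier entities accrue virtual time slower.
 */
uint64_t sketch_calc_delta_fair(uint64_t delta_ns,
				const struct sketch_entity *se)
{
	assert(se->weight != 0);
	if (se->weight == SKETCH_NICE_0_LOAD)
		return delta_ns;
	return delta_ns * SKETCH_NICE_0_LOAD / se->weight;
}

/*
 * Move gran_ns of wall time from 'from' to 'to': the giver's vruntime
 * advances and the receiver's drops, each scaled by its own weight.
 */
void sketch_transfer(struct sketch_entity *from, struct sketch_entity *to,
		     uint64_t gran_ns)
{
	from->vruntime += sketch_calc_delta_fair(gran_ns, from);
	to->vruntime -= sketch_calc_delta_fair(gran_ns, to);
}
```

With a nice-0 giver and a receiver of twice that weight, transferring 1 ms
advances the giver by 1 ms of virtual time but pulls the receiver back by
only 0.5 ms, which is why a raw wall-time subtraction gives the wrong
answer for anything but nice-0 tasks.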


Maybe something like:

static void yield_to_fair(struct rq *p_rq, struct task_struct *p)
{
        struct rq *rq = this_rq();
        struct sched_entity *se = &current->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);
        struct sched_entity *pse = &p->se;
        struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);

        /*
         * Transfer wakeup_gran worth of time from current to @p,
         * this should ensure current is no longer eligible to run.
         */
        unsigned long wakeup_gran = ACCESS_ONCE(sysctl_sched_wakeup_granularity);

        update_rq_clock(rq);
        update_curr(cfs_rq);

        if (pse != 

[RFC 4/4] KVM in-kernel PM Timer implementation (experimental code part 4)

2010-12-14 Thread Ulrich Obergfell

experimental code part 4 (qemu userspace)
-----------------------------------------


This code introduces the new qemu command line option '-kvm-pmtmr'.
qemu only creates and configures in-kernel PM Timer if this option
is specified on the command line.



diff -up ./qemu-kvm.c.orig4 ./qemu-kvm.c
--- ./qemu-kvm.c.orig4  2010-12-10 10:50:42.857811776 +0100
+++ ./qemu-kvm.c  2010-12-10 11:45:23.783748044 +0100
@@ -54,6 +54,9 @@ int kvm_irqchip = 1;
 int kvm_pit = 1;
 int kvm_pit_reinject = 1;
 int kvm_nested = 0;
+#ifdef KVM_CAP_PMTMR
+int kvm_pmtmr = 0;
+#endif
 
 
 KVMState *kvm_state;
@@ -186,7 +189,7 @@ int kvm_init(int smp_cpus)
 kvm_context->no_irqchip_creation = 0;
 kvm_context->no_pit_creation = 0;
 #ifdef KVM_CAP_PMTMR
-kvm_context->no_pmtmr_creation = 0;
+kvm_context->no_pmtmr_creation = 1;
 #endif
 
 #ifdef KVM_CAP_SET_GUEST_DEBUG
@@ -241,6 +244,11 @@ void kvm_disable_pit_creation(kvm_contex
 }
 
 #ifdef KVM_CAP_PMTMR
+void kvm_enable_pmtmr_creation(kvm_context_t kvm)
+{
+kvm->no_pmtmr_creation = 0;
+}
+
 void (*kvm_arch_pmtmr_handler)(kvm_context_t kvm);
 /*
  * This handler is called by
@@ -1654,6 +1662,11 @@ static int kvm_create_context(void)
 if (!kvm_pit) {
 kvm_disable_pit_creation(kvm_context);
 }
+#ifdef KVM_CAP_PMTMR
+if (kvm_pmtmr) {
+kvm_enable_pmtmr_creation(kvm_context);
+}
+#endif
 if (kvm_create(kvm_context, 0, NULL) < 0) {
 kvm_finalize(kvm_state);
 return -1;
diff -up ./qemu-kvm.h.orig4 ./qemu-kvm.h
--- ./qemu-kvm.h.orig4  2010-12-10 11:26:43.726790319 +0100
+++ ./qemu-kvm.h  2010-12-10 11:47:50.074805792 +0100
@@ -124,6 +124,18 @@ void kvm_disable_irqchip_creation(kvm_co
  */
 void kvm_disable_pit_creation(kvm_context_t kvm);
 
+#ifdef KVM_CAP_PMTMR
+/*!
+ * \brief Enable the in-kernel ACPI PM Timer register creation
+ *
+ * In-kernel ACPI PM Timer register is disabled by default.
+ * If in-kernel is to be used, this should be called prior to kvm_create().
+ *
+ *  \param kvm Pointer to the kvm_context
+ */
+void kvm_enable_pmtmr_creation(kvm_context_t kvm);
+#endif
+
 /*!
  * \brief Create new virtual machine
  *
@@ -706,6 +718,9 @@ extern int kvm_irqchip;
 extern int kvm_pit;
 extern int kvm_pit_reinject;
 extern int kvm_nested;
+#ifdef KVM_CAP_PMTMR
+extern int kvm_pmtmr;
+#endif
 extern kvm_context_t kvm_context;
 
 struct ioperm_data {
diff -up ./qemu-options.hx.orig4 ./qemu-options.hx
--- ./qemu-options.hx.orig4 2010-12-02 15:15:20.0 +0100
+++ ./qemu-options.hx   2010-12-06 11:27:57.273648509 +0100
@@ -2330,6 +2330,9 @@ DEF("no-kvm-pit-reinjection", 0, QEMU_OP
 QEMU_ARCH_I386)
 DEF("enable-nesting", 0, QEMU_OPTION_enable_nesting,
 "-enable-nesting enable support for running a VM inside the VM (AMD 
only)\n", QEMU_ARCH_I386)
+DEF("kvm-pmtmr", 0, QEMU_OPTION_kvm_pmtmr,
+"-kvm-pmtmr  enable KVM kernel mode ACPI PM Timer register 
emulation\n",
+QEMU_ARCH_I386)
 DEF("nvram", HAS_ARG, QEMU_OPTION_nvram,
 "-nvram FILE provide ia64 nvram contents\n", QEMU_ARCH_ALL)
 DEF("tdf", 0, QEMU_OPTION_tdf,
diff -up ./vl.c.orig4 ./vl.c
--- ./vl.c.orig4  2010-12-10 10:34:55.388997058 +0100
+++ ./vl.c  2010-12-10 11:50:20.566810444 +0100
@@ -2474,6 +2474,12 @@ int main(int argc, char **argv, char **e
kvm_nested = 1;
break;
}
+#ifdef KVM_CAP_PMTMR
+   case QEMU_OPTION_kvm_pmtmr: {
+   kvm_pmtmr = 1;
+   break;
+   }
+#endif
 #endif
 case QEMU_OPTION_usb:
 usb_enabled = 1;


[RFC 3/4] KVM in-kernel PM Timer implementation (experimental code part 3)

2010-12-14 Thread Ulrich Obergfell

experimental code part 3 (qemu userspace)
-----------------------------------------


This code utilizes the new ioctl commands introduced by code part 2.

The KVM_CREATE_PMTMR ioctl command is simply called once when a virtual
machine is being created. However, calling KVM_CONFIGURE_PMTMR is more
challenging because it involves ...

-  passing the base address of PM I/O port range to code part 1
-  passing the clock offset to code part 1

'timers_state.cpu_clock_offset' gets updated at each vm_start() call.
However, the PM I/O port base address is not available at the first
vm_start() call. So, configuring the in-kernel PM Timer needs to be
postponed until the PIIX4 PCI configuration is initialized. This is
facilitated by the new function kvm_pmtmr_handler() which is called
by vm_start() and by pm_io_space_update().

kvm_pmtmr_handler() calls architecture-specific code thru a function
pointer 'kvm_arch_pmtmr_handler'. kvm_pmtmr_handler() is a 'no-op' if
an architecture does not provide or clears this function pointer. The
architecture-specific code is responsible for configuring the in-kernel
PM Timer.

The experimental code provides kvm_arch_configure_pmtmr_wrapper() in
qemu-kvm-x86.c. kvm_arch_create_pmtmr() sets 'kvm_arch_pmtmr_handler'
to 'kvm_arch_configure_pmtmr_wrapper' after successful completion of
the KVM_CREATE_PMTMR ioctl command.

kvm_arch_configure_pmtmr_wrapper() requires ACPI PM code to provide a
function pointer 'kvm_arch_get_pm_io_base' thru which the PM I/O port
base address can be obtained. kvm_arch_configure_pmtmr_wrapper() is a
'no-op' too if ACPI PM code does not provide or clears this function
pointer. The experimental code provides piix4_get_pm_io_base() in
hw/acpi_piix4.c. pm_io_space_update() sets 'kvm_arch_get_pm_io_base'
to 'piix4_get_pm_io_base'.

Consider two scenarios ...

-  during virtual machine creation and startup

     kvm_arch_create
       kvm_arch_create_pmtmr
         ioctl(KVM_CREATE_PMTMR)
         kvm_arch_pmtmr_handler = kvm_arch_configure_pmtmr_wrapper
      :
     vm_start
       kvm_pmtmr_handler
         kvm_arch_configure_pmtmr_wrapper
           'no-op' because kvm_arch_get_pm_io_base not set yet
      :
     pm_io_space_update
       kvm_arch_get_pm_io_base = piix4_get_pm_io_base
       kvm_pmtmr_handler
         kvm_arch_configure_pmtmr_wrapper
           obtain PM I/O port base thru kvm_arch_get_pm_io_base
           kvm_arch_configure_pmtmr
             ioctl(KVM_CONFIGURE_PMTMR)

-  any other vm_start() call, for example after migration

     vm_start
       kvm_pmtmr_handler
         kvm_arch_configure_pmtmr_wrapper
           obtain PM I/O port base thru kvm_arch_get_pm_io_base
           kvm_arch_configure_pmtmr
             ioctl(KVM_CONFIGURE_PMTMR)
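
The two call sequences above boil down to a pair of function pointers that
each turn the call into a no-op while unset. The following stand-alone
sketch models only that control flow, under stated assumptions: every
sketch_* name is hypothetical, the handlers take no arguments (the real
ones take a kvm_context_t), the ioctl is replaced by a counter, and 0xb000
is just an illustrative base address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Both hooks start out unset (NULL), as on an architecture without support. */
void (*sketch_pmtmr_handler)(void);
uint64_t (*sketch_get_pm_io_base)(void);

int sketch_configure_calls;	/* stands in for ioctl(KVM_CONFIGURE_PMTMR) */
uint64_t sketch_configured_base;

/* Illustrative value only; the real base comes from PIIX4 PCI config space. */
uint64_t sketch_piix4_get_pm_io_base(void)
{
	return 0xb000;
}

/* Role of kvm_arch_configure_pmtmr_wrapper(): no-op until the base is known. */
void sketch_configure_wrapper(void)
{
	if (sketch_get_pm_io_base == NULL)
		return;
	sketch_configured_base = sketch_get_pm_io_base();
	/* The patch masks the base with 0xffc0, so no other bits may be set. */
	assert((sketch_configured_base & ~0xffc0ULL) == 0);
	sketch_configure_calls++;
}

/* Role of kvm_pmtmr_handler(): no-op if no arch handler is installed. */
void sketch_handler(void)
{
	if (sketch_pmtmr_handler != NULL)
		sketch_pmtmr_handler();
}

/* Role of kvm_arch_create_pmtmr() after a successful KVM_CREATE_PMTMR. */
void sketch_create(void)
{
	sketch_pmtmr_handler = sketch_configure_wrapper;
}

void sketch_vm_start(void)
{
	sketch_handler();
}

void sketch_pm_io_space_update(void)
{
	/* The getter must be set before the handler is called. */
	sketch_get_pm_io_base = sketch_piix4_get_pm_io_base;
	sketch_handler();
}
```

Calling sketch_create(), sketch_vm_start(), sketch_pm_io_space_update() and
then sketch_vm_start() again replays the first scenario followed by the
second: the first start configures nothing, the I/O space update configures
once, and every later start configures again.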



diff -up ./hw/acpi_piix4.c.orig3 ./hw/acpi_piix4.c
--- ./hw/acpi_piix4.c.orig3 2010-12-02 15:15:20.0 +0100
+++ ./hw/acpi_piix4.c   2010-12-10 11:26:53.943753235 +0100
@@ -23,6 +23,7 @@
 #include "acpi.h"
 #include "sysemu.h"
 #include "range.h"
+#include "qemu-kvm.h"
 
 //#define DEBUG
 
@@ -80,6 +81,9 @@ typedef struct PIIX4PMState {
 
 static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s);
 
+/* for cpu hotadd (and in-kernel PM Timer if KVM_CAP_PMTMR is defined) */
+static PIIX4PMState *global_piix4_pm_state;
+
 #define ACPI_ENABLE 0xf1
 #define ACPI_DISABLE 0xf0
 
@@ -250,6 +254,19 @@ static void acpi_dbg_writel(void *opaque
 PIIX4_DPRINTF("ACPI: DBG: 0x%08x\n", val);
 }
 
+#ifdef KVM_CAP_PMTMR
+static uint64_t piix4_get_pm_io_base(void)
+{
+PIIX4PMState *s = global_piix4_pm_state;
+uint32_t pm_io_base;
+
+pm_io_base = le32_to_cpu(*(uint32_t *)(s->dev.config + 0x40));
+pm_io_base &= 0xffc0;
+
+return (uint64_t)pm_io_base;
+}
+#endif
+
 static void pm_io_space_update(PIIX4PMState *s)
 {
 uint32_t pm_io_base;
@@ -262,6 +279,16 @@ static void pm_io_space_update(PIIX4PMSt
 PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base);
 iorange_init(&s->ioport, &pm_iorange_ops, pm_io_base, 64);
 ioport_register(&s->ioport);
+#ifdef  KVM_CAP_PMTMR
+kvm_arch_get_pm_io_base = piix4_get_pm_io_base;
+/*
+ * The base address of the PM I/O port address range is now known.
+ * The following call is needed to pass the base address to the
+ * in-kernel PM Timer emulation. Note that 'kvm_arch_get_pm_io_base'
+ * must be set _before_ this call.
+ */
+kvm_pmtmr_handler();
+#endif
 }
 }
 
@@ -354,14 +381,12 @@ static void piix4_powerdown(void *opaque
 }
 }
 
-static PIIX4PMState *global_piix4_pm_state; /* cpu hotadd */
-
 static int piix4_pm_initfn(PCIDevice *dev)
 {
 PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, dev);
 uint8_t *pci_conf;
 
-/* for cpu hotadd */
+/* for cpu hotadd and in-kernel PM Timer */
 global_piix4_pm_state = s;
 
 pci_conf = s->dev.config;
diff -up ./kvm/include/linux/kvm.h.orig3 ./kvm/include/linux/kvm.h
--- ./kvm/include/

[RFC 2/4] KVM in-kernel PM Timer implementation (experimental code part 2)

2010-12-14 Thread Ulrich Obergfell

experimental code part 2 (KVM kernel)
-------------------------------------


This code introduces two new ioctl commands KVM_CREATE_PMTMR and
KVM_CONFIGURE_PMTMR plus the new capability KVM_CAP_PMTMR to the
ioctl infrastructure of the KVM kernel. This code utilizes some
helper functions introduced by code part 1.



diff -up ./arch/x86/include/asm/kvm.h.orig2 ./arch/x86/include/asm/kvm.h
--- ./arch/x86/include/asm/kvm.h.orig2  2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/include/asm/kvm.h  2010-12-10 12:32:47.067686432 +0100
@@ -24,6 +24,7 @@
 #define __KVM_HAVE_DEBUGREGS
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
+#define __KVM_HAVE_PMTMR
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff -up ./arch/x86/kvm/x86.c.orig2 ./arch/x86/kvm/x86.c
--- ./arch/x86/kvm/x86.c.orig2  2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/kvm/x86.c  2010-12-10 12:24:58.083739549 +0100
@@ -26,6 +26,9 @@
 #include "tss.h"
 #include "kvm_cache_regs.h"
 #include "x86.h"
+#ifdef KVM_CAP_PMTMR
+#include "pmtmr.h"
+#endif
 
 #include 
 #include 
@@ -1965,6 +1968,9 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+#ifdef KVM_CAP_PMTMR
+   case KVM_CAP_PMTMR:
+#endif
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -3274,6 +3280,7 @@ long kvm_arch_vm_ioctl(struct file *filp
struct kvm_pit_state ps;
struct kvm_pit_state2 ps2;
struct kvm_pit_config pit_config;
+   struct kvm_pmtmr_config pmtmr_config;
} u;
 
switch (ioctl) {
@@ -3541,6 +3548,23 @@ long kvm_arch_vm_ioctl(struct file *filp
r = 0;
break;
}
+#ifdef KVM_CAP_PMTMR
+   case KVM_CREATE_PMTMR: {
+   mutex_lock(&kvm->slots_lock);
+   r = kvm_create_pmtmr(kvm);
+   mutex_unlock(&kvm->slots_lock);
+   break;
+   }
+   case KVM_CONFIGURE_PMTMR: {
+   r = -EFAULT;
+   if (copy_from_user(&u.pmtmr_config, argp,
+  sizeof(struct kvm_pmtmr_config)))
+   goto out;
+
+   r = kvm_configure_pmtmr(kvm, &u.pmtmr_config);
+   break;
+   }
+#endif
 
default:
;
diff -up ./include/linux/kvm.h.orig2 ./include/linux/kvm.h
--- ./include/linux/kvm.h.orig2 2010-12-05 09:35:17.0 +0100
+++ ./include/linux/kvm.h   2010-12-10 12:30:13.677745093 +0100
@@ -140,6 +140,12 @@ struct kvm_pit_config {
__u32 pad[15];
 };
 
+/* for KVM_CONFIGURE_PMTMR */
+struct kvm_pmtmr_config {
+   __u64 pm_io_base;
+   __s64 clock_offset;
+};
+
 #define KVM_PIT_SPEAKER_DUMMY 1
 
 #define KVM_EXIT_UNKNOWN  0
@@ -541,6 +547,9 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#ifdef __KVM_HAVE_PMTMR
+#define KVM_CAP_PMTMR 60
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -672,6 +681,8 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG_IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK _IOW(KVMIO,  0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK _IOR(KVMIO,  0x7c, struct kvm_clock_data)
+#define KVM_CREATE_PMTMR  _IO(KVMIO,   0x7d)
+#define KVM_CONFIGURE_PMTMR   _IOW(KVMIO,  0x7e, struct kvm_pmtmr_config)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2  _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)


[RFC 1/4] KVM in-kernel PM Timer implementation (experimental code part 1)

2010-12-14 Thread Ulrich Obergfell

experimental code part 1 (KVM kernel)
-------------------------------------


This code introduces the actual emulation of the PM Timer register
plus some helper functions to create and configure the in-kernel
PM Timer. The emulation utilizes the 'kvm_io_bus' infrastructure.
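
For readers who want the arithmetic behind the register emulation in
isolation, here is a small user-space sketch of the nanoseconds-to-ticks
conversion. The values of KVM_ACPI_PMTMR_FREQ and KVM_ACPI_PMTMR_MASK are
not visible in this posting, so the sketch assumes the ACPI PM Timer rate
of 3.579545 MHz and a 24-bit counter; all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PMTMR_FREQ	3579545ULL	/* ACPI PM Timer frequency in Hz */
#define SKETCH_PMTMR_MASK	0xffffffULL	/* assumption: 24-bit counter */
#define SKETCH_NSEC_PER_SEC	1000000000ULL

/*
 * Convert nanoseconds to PM Timer ticks. Splitting into whole seconds and
 * a sub-second remainder keeps every intermediate product within 64 bits;
 * the patch gets the same effect from muldiv64() and its 96-bit
 * intermediate result.
 */
uint32_t sketch_pmtmr_count(uint64_t ns)
{
	uint64_t secs = ns / SKETCH_NSEC_PER_SEC;
	uint64_t rem = ns % SKETCH_NSEC_PER_SEC;
	uint64_t ticks;

	ticks = secs * SKETCH_PMTMR_FREQ +
		rem * SKETCH_PMTMR_FREQ / SKETCH_NSEC_PER_SEC;
	return (uint32_t)(ticks & SKETCH_PMTMR_MASK);
}

/* One second of elapsed time is exactly one second's worth of ticks. */
void sketch_selfcheck(void)
{
	assert(sketch_pmtmr_count(SKETCH_NSEC_PER_SEC) == SKETCH_PMTMR_FREQ);
}
```

Under the 24-bit assumption the counter wraps a little under every 4.7
seconds, so a value taken at 5 s has already wrapped once.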



diff -up ./arch/x86/include/asm/kvm_host.h.orig1 ./arch/x86/include/asm/kvm_host.h
--- ./arch/x86/include/asm/kvm_host.h.orig1 2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/include/asm/kvm_host.h   2010-12-10 12:14:29.282686691 +0100
@@ -459,6 +459,10 @@ struct kvm_arch {
/* fields used by HYPER-V emulation */
u64 hv_guest_os_id;
u64 hv_hypercall;
+
+#ifdef KVM_CAP_PMTMR
+   struct kvm_pmtmr *vpmtmr;
+#endif
 };
 
 struct kvm_vm_stat {
diff -up ./arch/x86/kvm/i8254.c.orig1 ./arch/x86/kvm/i8254.c
--- ./arch/x86/kvm/i8254.c.orig1  2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/kvm/i8254.c  2010-12-10 12:09:36.877729064 +0100
@@ -51,7 +51,7 @@
 #define RW_STATE_WORD1 4
 
 /* Compute with 96 bit intermediate result: (a*b)/c */
-static u64 muldiv64(u64 a, u32 b, u32 c)
+u64 muldiv64(u64 a, u32 b, u32 c)
 {
union {
u64 ll;
diff -up ./arch/x86/kvm/Makefile.orig1 ./arch/x86/kvm/Makefile
--- ./arch/x86/kvm/Makefile.orig1   2010-12-05 09:35:17.0 +0100
+++ ./arch/x86/kvm/Makefile 2010-12-10 12:07:14.379811121 +0100
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)   += $(addprefix .
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-  i8254.o timer.o
+  i8254.o timer.o pmtmr.o
 kvm-intel-y+= vmx.o
 kvm-amd-y  += svm.o
 
diff -up ./arch/x86/kvm/pmtmr.c.orig1 ./arch/x86/kvm/pmtmr.c
--- ./arch/x86/kvm/pmtmr.c.orig1  2010-12-10 12:05:39.878691941 +0100
+++ ./arch/x86/kvm/pmtmr.c  2010-12-10 12:06:00.987738524 +0100
@@ -0,0 +1,151 @@
+/*
+ * in-kernel ACPI PM Timer emulation
+ *
+ * Note: 'timer carry interrupt' is not implemented
+ */
+
+#include 
+
+#ifdef KVM_CAP_PMTMR
+
+#include "pmtmr.h"
+
+static int emulate_acpi_reg_pmtmr(struct kvm_pmtmr *pmtmr, void *data, int len)
+{
+   s64 tmp;
+   u32 running_count;
+
+   if (len != 4)
+   return -EOPNOTSUPP;
+
+   tmp = ktime_to_ns(ktime_get()) + pmtmr->clock_offset;
+   running_count = (u32)muldiv64(tmp, KVM_ACPI_PMTMR_FREQ, NSEC_PER_SEC);
+   *(u32 *)data = running_count & KVM_ACPI_PMTMR_MASK;
+
+#ifdef KVM_ACPI_PMTMR_STATS
+   pmtmr->read_count++;
+#endif
+   return 0;
+}
+
+/*
+ * This function returns true for I/O ports in the range from 'PM base'
+ * to 'PM Timer' (this range contains the PM1 Status and the PM1 Enable
+ * registers).
+ */
+static inline int pmtmr_in_range(struct kvm_pmtmr *pmtmr, gpa_t ioport)
+{
+   return ((ioport >= pmtmr->pm_io_base) &&
+   (ioport <= pmtmr->pm_io_base + KVM_ACPI_REG_PMTMR));
+}
+
+static inline struct kvm_pmtmr *dev_to_pmtmr(struct kvm_io_device *dev)
+{
+return container_of(dev, struct kvm_pmtmr, dev);
+}
+
+static int pmtmr_ioport_read(struct kvm_io_device *this,
+gpa_t ioport, int len, void *data)
+{
+   struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this);
+
+   if (!pmtmr_in_range(pmtmr, ioport))
+   return -EOPNOTSUPP;
+
+   switch (ioport - pmtmr->pm_io_base) {
+   case KVM_ACPI_REG_PMTMR:
+   /* emulate PM Timer read if in-kernel emulation is enabled */
+   if (pmtmr->state == KVM_PMTMR_STATE_ENABLED)
+   return(emulate_acpi_reg_pmtmr(pmtmr, data, len));
+
+   /* fall thru */
+   default:
+   /* let qemu userspace handle everything else */
+   return -EOPNOTSUPP;
+   }
+}
+
+static int pmtmr_ioport_write(struct kvm_io_device *this,
+ gpa_t ioport, int len, const void *data)
+{
+   struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this);
+
+   if (!pmtmr_in_range(pmtmr, ioport))
+   return -EOPNOTSUPP;
+
+   switch (ioport - pmtmr->pm_io_base) {
+   case KVM_ACPI_REG_PMTMR:
+   /* ignore PM Timer write */
+   return 0;
+   case KVM_ACPI_REG_PMEN:
+   if (len == 2) {
+   u16 val = *(u16 *)data;
+   /*
+* Fall back to qemu userspace PM Timer emulation if
+* the VM sets the 'timer carry interrupt enable' bit
+* in the PM1 Enable register.
+*/
+   if (val & KVM_ACPI_PMTMR_TMR_EN)
+   /* disable in-kernel PM Timer emulation */
+   pmtmr->state = KVM_PMTMR_STATE_DISABLED;
+   }
+   /* fall thru */
+   default:
+   /* let qemu userspace handle everything 

[RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread Ulrich Obergfell

Hi,

This is an RFC through which I would like to get feedback on how the
idea of in-kernel PM Timer would be received.

The current implementation of PM Timer emulation is 'heavy-weight'
because the code resides in qemu userspace. Guest operating systems
that use PM Timer as a clock source (for example, older versions of
Linux that do not have paravirtualized clock) would benefit from an
in-kernel PM Timer emulation.

Parts 1 thru 4 of this RFC contain experimental source code which
I recently used to investigate the performance benefit. In a Linux
guest, I was running a program that calls gettimeofday() 'n' times
in a loop (the PM Timer register is read during each call). With
in-kernel PM Timer, I observed a significant reduction of program
execution time.
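
The test program itself is not part of this posting. A plausible shape for
it, offered only as a sketch, is the loop below; the function name and its
interface are hypothetical. Whether each call really reads the PM Timer
register depends on the guest using the PM Timer as its clock source, which
is the case described above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Call gettimeofday() n times and report the elapsed wall time in seconds
 * through *elapsed_sec. Returns the number of calls that succeeded.
 */
long sketch_gettimeofday_loop(long n, double *elapsed_sec)
{
	struct timeval start, now;
	long ok = 0;
	long i;

	assert(n >= 0 && elapsed_sec != NULL);
	gettimeofday(&start, NULL);
	now = start;
	for (i = 0; i < n; i++) {
		if (gettimeofday(&now, NULL) == 0)
			ok++;
	}
	*elapsed_sec = (double)(now.tv_sec - start.tv_sec) +
		       (double)(now.tv_usec - start.tv_usec) / 1e6;
	return ok;
}
```

Running the same n with and without in-kernel emulation and comparing the
elapsed time is the kind of measurement the paragraph above refers to.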

The experimental code emulates the PM Timer register in KVM kernel.
All other components of ACPI PM remain in qemu userspace. Also, the
'timer carry interrupt' feature is not implemented in-kernel. If a
guest operating system needs to enable the 'timer carry interrupt',
the code takes care that PM Timer emulation falls back to userspace.
However, I think the design of the code has sufficient flexibility,
so that anyone who would want to add the 'timer carry interrupt'
feature in-kernel could try to do so later on.

Please review and please comment.


Regards,

Uli Obergfell


Re: USB Passthrough 1.1 performance problem...

2010-12-14 Thread Daniel P. Berrange
On Tue, Dec 14, 2010 at 12:55:04PM +0100, Kenni Lund wrote:
> 2010/12/14 Erik Brakkee :
> >> From: Kenni Lund 
> >> 2010/12/14 Erik Brakkee :
> 
>  From: Kenni Lund 
> >
> > Does this mean I have a chance now that PCI passthrough of my WinTV
> > PVR-500
> > might work now?
> 
>  Passthrough of a PVR-500 has been working for a long time. I've been
>  running with passthrough of a PVR-500 in my HTPC, since
>  November/December 2009...so it should work with any recent kernel and
>  any recent version of qemu-kvm you can find today - No patching
>  needed. The only issue I had with the PVR-500 card, was when *I*
>  didn't free up the shared interrupts...once I fixed that, it "just
>  worked".
> >>>
> >>> How did you free up those shared interrupts then? I tried different slots
> >>> but always get conflicts with the USB irqs.
> >>
> >> I did an unbind of the conflicting device (eg. disabled it). I moved
> >> the PVR-500 card around in the different slots and once I got a
> >> conflict with the integrated sound card, I left the PVR-500 card in
> >> that slot (it's a headless machine, so no need for sound) and
> >> configured unbind of the sound card at boot time. On my old system I
> >> think it was conflicting with one of the USB controllers as well, but
> >> it didn't really matter, as I only lost a few of the ports on the back
> >> of the computer for that particular USB controller - I still had
> >> plenty of USB ports left and if I really needed more ports, I could
> >> just plug in an extra USB PCI card.
> >>
> >> My /etc/rc.local boot script looks like the following today:
> >> --
> >> #Remove HDA conflicting with ivtv1
> >> echo ":00:1b.0" > /sys/bus/pci/drivers/HDA\ Intel/unbind
> >>
> >> # ivtv0
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
> >> echo ":04:08.0" > /sys/bus/pci/drivers/ivtv/unbind
> >> echo ":04:08.0" > /sys/bus/pci/drivers/pci-stub/bind
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
> >>
> >> # ivtv1
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
> >> echo ":04:09.0" > /sys/bus/pci/drivers/ivtv/unbind
> >> echo ":04:09.0" > /sys/bus/pci/drivers/pci-stub/bind
> >> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
> >
> > I did not try unbinding the usb device so I can also try that.
> >
> > I don't understand what is happening with the  0016. I configured the
> > pci card in kvm and I believe kvm does the binding to pci-stub in recent
> > versions. Where is the  0016 coming from?
> 
> Okay, qemu-kvm might do it today, I don't know - I haven't changed
> that script for the past year. But are you sure that it's not
> libvirt/virsh/virt-manager which does that for you?

If you use the managed="yes" attribute on the  in libvirt
XML, then libvirt will automatically do the pcistub bind/unbind,
followed by a device reset at guest startup & the reverse at shutdown.
If you have conflicting devices on the bus though, libvirt won't
attempt to unbind them, unless you had also explicitly assigned all
those conflicting devices to the same guest.

Daniel


Re: USB Passthrough 1.1 performance problem...

2010-12-14 Thread Kenni Lund
2010/12/14 Erik Brakkee :
>> From: Kenni Lund 
>> 2010/12/14 Erik Brakkee :

 From: Kenni Lund 
>
> Does this mean I have a chance now that PCI passthrough of my WinTV
> PVR-500
> might work now?

 Passthrough of a PVR-500 has been working for a long time. I've been
 running with passthrough of a PVR-500 in my HTPC, since
 November/December 2009...so it should work with any recent kernel and
 any recent version of qemu-kvm you can find today - No patching
 needed. The only issue I had with the PVR-500 card, was when *I*
 didn't free up the shared interrupts...once I fixed that, it "just
 worked".
>>>
>>> How did you free up those shared interrupts then? I tried different slots
>>> but always get conflicts with the USB irqs.
>>
>> I did an unbind of the conflicting device (eg. disabled it). I moved
>> the PVR-500 card around in the different slots and once I got a
>> conflict with the integrated sound card, I left the PVR-500 card in
>> that slot (it's a headless machine, so no need for sound) and
>> configured unbind of the sound card at boot time. On my old system I
>> think it was conflicting with one of the USB controllers as well, but
>> it didn't really matter, as I only lost a few of the ports on the back
>> of the computer for that particular USB controller - I still had
>> plenty of USB ports left and if I really needed more ports, I could
>> just plug in an extra USB PCI card.
>>
>> My /etc/rc.local boot script looks like the following today:
>> --
>> #Remove HDA conflicting with ivtv1
>> echo ":00:1b.0" > /sys/bus/pci/drivers/HDA\ Intel/unbind
>>
>> # ivtv0
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
>> echo ":04:08.0" > /sys/bus/pci/drivers/ivtv/unbind
>> echo ":04:08.0" > /sys/bus/pci/drivers/pci-stub/bind
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
>>
>> # ivtv1
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/new_id
>> echo ":04:09.0" > /sys/bus/pci/drivers/ivtv/unbind
>> echo ":04:09.0" > /sys/bus/pci/drivers/pci-stub/bind
>> echo " 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
>
> I did not try unbinding the usb device so I can also try that.
>
> I don't understand what is happening with the  0016. I configured the
> pci card in kvm and I believe kvm does the binding to pci-stub in recent
> versions. Where is the  0016 coming from?

Okay, qemu-kvm might do it today, I don't know - I haven't changed
that script for the past year. But are you sure that it's not
libvirt/virsh/virt-manager which does that for you?

Anyway, it's coming from lspci -n. See the wiki page:
http://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM

I can't remember why I run remove_id in the end, it's probably
unneeded, but I can't remember (and it works, so I don't touch it).

Best regards
Kenni


Re: [PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-14 Thread Balbir Singh
* MinChan Kim  [2010-12-14 19:01:26]:

> Hi Balbir,
> 
> On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh
>  wrote:
> > Move reusable functionality outside of zone_reclaim.
> > Make zone_reclaim_unmapped_pages modular
> >
> > Signed-off-by: Balbir Singh 
> > ---
> >  mm/vmscan.c |   35 +++
> >  1 files changed, 23 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e841cae..4e2ad05 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
> >  }
> >
> >  /*
> > + * Helper function to reclaim unmapped pages, we might add something
> > + * similar to this for slab cache as well. Currently this function
> > + * is shared with __zone_reclaim()
> > + */
> > +static inline void
> > +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
> > +                               unsigned long nr_pages)
> > +{
> > +       int priority;
> > +       /*
> > +        * Free memory by calling shrink zone with increasing
> > +        * priorities until we have enough memory freed.
> > +        */
> > +       priority = ZONE_RECLAIM_PRIORITY;
> > +       do {
> > +               shrink_zone(priority, zone, sc);
> > +               priority--;
> > +       } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
> > +}
> 
> As I said previous version, zone_reclaim_unmapped_pages doesn't have
> any functions related to reclaim unmapped pages.

The scan control point has the right arguments for implementing
reclaim of unmapped pages.

> The function name is rather strange.
> It would be better to add scan_control setup in function inner to
> reclaim only unmapped pages.
> 
> -- 
> Kind regards,
> Minchan Kim
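
One way to read the suggestion quoted above is that the helper should
build its own scan_control, so that it cannot be asked to touch mapped
pages. The user-space sketch below models that idea only; it is not the
patch. All sketch_* types and functions are hypothetical, shrink_zone() is
replaced by a mock that reclaims (pages >> priority) per pass, and the
starting priority of 4 mirrors the kernel's ZONE_RECLAIM_PRIORITY.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_ZONE_RECLAIM_PRIORITY 4	/* same starting value as the kernel */

struct sketch_zone {
	unsigned long unmapped_pages;	/* reclaimable unmapped page cache */
};

struct sketch_scan_control {
	unsigned long nr_reclaimed;
	bool may_unmap;			/* false: leave mapped pages alone */
	bool may_swap;
};

/* Mock shrink_zone(): reclaims a priority-dependent share of the zone. */
static void sketch_shrink_zone(int priority, struct sketch_zone *zone,
			       struct sketch_scan_control *sc)
{
	unsigned long n;

	assert(priority >= 0 && !sc->may_unmap);
	n = zone->unmapped_pages >> priority;
	zone->unmapped_pages -= n;
	sc->nr_reclaimed += n;
}

/*
 * The scan_control is built inside the helper, so callers cannot ask it to
 * touch mapped pages and the function name matches what it does.
 */
unsigned long sketch_reclaim_unmapped_pages(struct sketch_zone *zone,
					    unsigned long nr_pages)
{
	struct sketch_scan_control sc = {
		.nr_reclaimed = 0,
		.may_unmap = false,
		.may_swap = false,
	};
	int priority = SKETCH_ZONE_RECLAIM_PRIORITY;

	/* Lower priority values scan a larger share of the zone. */
	do {
		sketch_shrink_zone(priority, zone, &sc);
		priority--;
	} while (priority >= 0 && sc.nr_reclaimed < nr_pages);

	return sc.nr_reclaimed;
}
```

With 1024 unmapped pages and a target of 100, the mock stops after two
passes (64 pages at priority 4, then 120 at priority 3).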

-- 
Three Cheers,
Balbir


Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

2010-12-14 Thread Srivatsa Vaddagiri
On Tue, Dec 14, 2010 at 12:03:58PM +0100, Mike Galbraith wrote:
> On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:
> 
> > > That part looks ok, except for the yield cross cpu bit.  Trying to yield
> > > a resource you don't have doesn't make much sense to me.
> > 
> > So another (crazy) idea is to move the "yieldee" task on another cpu over to
> > yielding task's cpu, let it run till the end of yielding tasks slice and then
> > let it go back to the original cpu at the same vruntime position!
> 
> Yeah, pulling the intended recipient makes fine sense.  If he doesn't
> preempt you, you can try to swap vruntimes or whatever makes arithmetic
> sense and will help.  Dunno how you tell him how long he can keep the
> cpu though,

can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is 
preempted at the end of yielding task's timeslice?
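
To spell out the arithmetic behind that question, here is a user-space
sketch of the [prev_]sum_exec_runtime adjustment. It assumes the tick-time
preemption check compares (sum_exec_runtime - prev_sum_exec_runtime)
against the task's slice, the same difference slice_remain() computes in
the patch under review. The struct and the sketch_* helpers are
hypothetical, and it ignores that the two tasks may have different slices.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_entity {
	uint64_t sum_exec_runtime;	/* total ns this task has run */
	uint64_t prev_sum_exec_runtime;	/* value when it was last picked */
};

/* ns consumed in the current slice. */
uint64_t sketch_ran(const struct sketch_entity *se)
{
	return se->sum_exec_runtime - se->prev_sum_exec_runtime;
}

/* ns left before a tick would see the slice as used up. */
uint64_t sketch_left(const struct sketch_entity *se, uint64_t slice_ns)
{
	uint64_t ran = sketch_ran(se);

	return ran < slice_ns ? slice_ns - ran : 0;
}

/*
 * Pretend the task has already used (slice_ns - remain_ns) of its slice,
 * so that only remain_ns is left. Unsigned wraparound keeps the
 * subtraction in sketch_ran() correct even if the new prev value wraps.
 */
void sketch_cap_slice(struct sketch_entity *se, uint64_t slice_ns,
		      uint64_t remain_ns)
{
	assert(remain_ns <= slice_ns);
	se->prev_sum_exec_runtime = se->sum_exec_runtime -
				    (slice_ns - remain_ns);
}
```

For a 6 ms slice and 2 ms remaining on the yielding task, the adjustment
makes the pulled task look as if it had already run 4 ms, so it would be
due for preemption 2 ms later.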

> and him somehow going back home needs to be a plain old
> migration, no fancy restoration of ancient history vruntime.

What is the issue if it gets queued at the old vruntime (assuming fair stick is
still behind that)? Without that it will hurt fairness for the yieldee (and
perhaps of the overall VM in this case).

- vatsa


Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

2010-12-14 Thread Mike Galbraith
On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:

> > That part looks ok, except for the yield cross cpu bit.  Trying to yield
> > a resource you don't have doesn't make much sense to me.
> 
> So another (crazy) idea is to move the "yieldee" task on another cpu over to 
> yielding task's cpu, let it run till the end of yielding tasks slice and then
> let it go back to the original cpu at the same vruntime position!

Yeah, pulling the intended recipient makes fine sense.  If he doesn't
preempt you, you can try to swap vruntimes or whatever makes arithmetic
sense and will help.  Dunno how you tell him how long he can keep the
cpu though, and him somehow going back home needs to be a plain old
migration, no fancy restoration of ancient history vruntime.

-Mike



Re: [PATCH 3/3] Provide control over unmapped pages (v2)

2010-12-14 Thread Minchan Kim
On Fri, Dec 10, 2010 at 11:32 PM, Balbir Singh
 wrote:
> Changelog v2
> 1. Use a config option to enable the code (Andrew Morton)
> 2. Explain the magic tunables in the code or at-least attempt
>   to explain them (General comment)
> 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
> 4. Use better names (balanced is not a good naming convention)
> 5. Updated Documentation/kernel-parameters.txt (Andrew Morton)
>
> Provide control using zone_reclaim() and a boot parameter. The
> code reuses functionality from zone_reclaim() to isolate unmapped
> pages and reclaim them as a priority, ahead of other mapped pages.
>
> Signed-off-by: Balbir Singh 
> ---
>  Documentation/kernel-parameters.txt |    8 +++
>  include/linux/swap.h                |   21 ++--
>  init/Kconfig                        |   12 
>  kernel/sysctl.c                     |    2 +
>  mm/page_alloc.c                     |    9 +++
>  mm/vmscan.c                         |   97 
> +++
>  6 files changed, 142 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index dd8fe2b..f52b0bd 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file
>                        [X86]
>                        Set unknown_nmi_panic=1 early on boot.
>
> +       unmapped_page_control
> +                       [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> +                       is enabled. It controls the amount of unmapped memory
> +                       that is present in the system. This boot option plus
> +                       vm.min_unmapped_ratio (sysctl) provide granular control
> +                       over how much unmapped page cache can exist in the system
> +                       before kswapd starts reclaiming unmapped page cache pages.
> +
>        usbcore.autosuspend=
>                        [USB] The autosuspend time delay (in seconds) used
>                        for newly-detected USB devices (default 2).  This
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ac5c06e..773d7e5 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -253,19 +253,32 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
>
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
>  extern int sysctl_min_unmapped_ratio;
>  extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> -#ifdef CONFIG_NUMA
> -extern int zone_reclaim_mode;
> -extern int sysctl_min_slab_ratio;
>  #else
> -#define zone_reclaim_mode 0
>  static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
>  {
>        return 0;
>  }
>  #endif
>
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +extern bool should_reclaim_unmapped_pages(struct zone *zone);
> +#else
> +static inline bool should_reclaim_unmapped_pages(struct zone *zone)
> +{
> +       return false;
> +}
> +#endif
> +
> +#ifdef CONFIG_NUMA
> +extern int zone_reclaim_mode;
> +extern int sysctl_min_slab_ratio;
> +#else
> +#define zone_reclaim_mode 0
> +#endif
> +
>  extern int page_evictable(struct page *page, struct vm_area_struct *vma);
>  extern void scan_mapping_unevictable_pages(struct address_space *);
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 3eb22ad..78c9169 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -782,6 +782,18 @@ endif # NAMESPACES
>  config MM_OWNER
>        bool
>
> +config UNMAPPED_PAGECACHE_CONTROL
> +       bool "Provide control over unmapped page cache"
> +       default n
> +       help
> +         This option adds support for controlling unmapped page cache
> +         via a boot parameter (unmapped_page_control). The boot parameter
> +         with sysctl (vm.min_unmapped_ratio) control the total number
> +         of unmapped pages in the system. This feature is useful if
> +         you want to limit the amount of unmapped page cache or want
> +         to reduce page cache duplication in a virtualized environment.
> +         If unsure say 'N'
> +
>  config SYSFS_DEPRECATED
>        bool "enable deprecated sysfs features to support old userspace tools"
>        depends on SYSFS
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e40040e..ab2c60a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
>                .extra1         = &zero,
>        },
>  #endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
>        {
>                .procname       = "min_unmapped_ratio",
>                .data           = &sysctl_min_unmapped_ratio,
> @@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
>                .extra1         = &zero,
>                .extra2         

Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

2010-12-14 Thread Srivatsa Vaddagiri
On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:
> > +/*
> > + * Yield the CPU, giving the remainder of our time slice to task p.
> > + * Typically used to hand CPU time to another thread inside the same
> > + * process, eg. when p holds a resource other threads are waiting for.
> > + * Giving priority to p may help get that resource released sooner.
> > + */
> > +void yield_to(struct task_struct *p)
> > +{
> > +       unsigned long flags;
> > +       struct rq *rq, *p_rq;
> > +
> > +       local_irq_save(flags);
> > +       rq = this_rq();
> > +again:
> > +       p_rq = task_rq(p);
> > +       double_rq_lock(rq, p_rq);
> > +       if (p_rq != task_rq(p)) {
> > +               double_rq_unlock(rq, p_rq);
> > +               goto again;
> > +       }
> > +
> > +       /* We can't yield to a process that doesn't want to run. */
> > +       if (!p->se.on_rq)
> > +               goto out;
> > +
> > +       /*
> > +        * We can only yield to a runnable task, in the same schedule class
> > +        * as the current task, if the schedule class implements yield_to_task.
> > +        */
> > +       if (!task_running(rq, p) && current->sched_class == p->sched_class &&
> > +                       current->sched_class->yield_to)
> > +               current->sched_class->yield_to(rq, p);
> > +
> > +out:
> > +       double_rq_unlock(rq, p_rq);
> > +       local_irq_restore(flags);
> > +       yield();
> > +}
> > +EXPORT_SYMBOL_GPL(yield_to);
> 
> That part looks ok, except for the yield cross cpu bit.  Trying to yield
> a resource you don't have doesn't make much sense to me.

So another (crazy) idea is to move the "yieldee" task on another cpu over to 
yielding task's cpu, let it run till the end of the yielding task's slice and then
let it go back to the original cpu at the same vruntime position!

- vatsa


Re: USB Passthrough 1.1 performance problem...

2010-12-14 Thread Alexander Graf

Am 14.12.2010 um 11:02 schrieb Avi Kivity :

> On 12/13/2010 10:25 AM, Alexander Graf wrote:
>> >>
>> >  Is your point in this case that USB in a VM based on PCI passthrough will 
>> > always have problems when it comes to more real-time issues or does this 
>> > only apply to USB passthrough? I can imagine that PCI passthrough is 
>> > better since it uses hardware support. By the way, I have seen issues in 
>> > the past whereby the tv card stopped working because of high load on the 
>> > server running natively so real-time issues also exist apart from 
>> > virtualization.
>> 
>> IIRC the reason that PCI passthrough with EHCI performs as badly as it does 
>> is that BARs < 4k get passed through using the slow path (trap to qemu, 
>> issue MMIO in user space). Unfortunately, EHCI seems to have a 256 byte BAR 
>> region usually that is used for some handshaking:
>> 
>> 00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller 
>> (prog-if 20 [EHCI])
>> Subsystem: ATI Technologies Inc SB700/SB800 USB EHCI Controller
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- 
>> Stepping- SERR- FastB2B- DisINTx-
>> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium>TAbort-SERR-
>> Latency: 64, Cache Line Size: 64 bytes
>> Interrupt: pin B routed to IRQ 17
>> Region 0: Memory at c8014400 (32-bit, non-prefetchable) [size=256]
>> 
> 
> That could certainly be optimized.  If the BAR is all alone in its page, both 
> on guest and host (if not, we can migrate it, at least on the host), we can 
> use the same offset within the page on the host as it appears on the guest, 
> and assign the entire page.
> 
> We should make sure SeaBIOS uses a minimum alignment of 4k for mmio BARs.

Yep, I agree :). Back when I tried that, it seemed rather hard to change BAR 
mappings after init from user space. But it's certainly a thing the vfio stuff 
could easily tackle!

Alex



Re: USB Passthrough 1.1 performance problem...

2010-12-14 Thread Avi Kivity

On 12/13/2010 10:25 AM, Alexander Graf wrote:
> >>
> >  Is your point in this case that USB in a VM based on PCI passthrough will
> > always have problems when it comes to more real-time issues or does this only
> > apply to USB passthrough? I can imagine that PCI passthrough is better since it
> > uses hardware support. By the way, I have seen issues in the past whereby the tv
> > card stopped working because of high load on the server running natively so
> > real-time issues also exist apart from virtualization.
>
> IIRC the reason that PCI passthrough with EHCI performs as badly as it does is
> that BARs < 4k get passed through using the slow path (trap to qemu, issue MMIO
> in user space). Unfortunately, EHCI seems to have a 256 byte BAR region usually
> that is used for some handshaking:
>
> 00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller (prog-if 20 [EHCI])
>  Subsystem: ATI Technologies Inc SB700/SB800 USB EHCI Controller
>  Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>  Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium>TAbort-SERR-
>  Latency: 64, Cache Line Size: 64 bytes
>  Interrupt: pin B routed to IRQ 17
>  Region 0: Memory at c8014400 (32-bit, non-prefetchable) [size=256]

That could certainly be optimized.  If the BAR is all alone in its page,
both on guest and host (if not, we can migrate it, at least on the
host), we can use the same offset within the page on the host as it
appears on the guest, and assign the entire page.

We should make sure SeaBIOS uses a minimum alignment of 4k for mmio BARs.
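
To make the condition concrete, here is a rough standalone sketch of the
check described above. It is not taken from qemu, kvm or SeaBIOS; the
helper names and the fixed 4k page size are assumptions made purely for
illustration, and the caller would still have to verify that nothing
else lives in the same page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000ULL	/* assumption: 4k pages on host and guest */

/* Offset of an address within its 4k page. */
static inline uint64_t page_offset(uint64_t addr)
{
	return addr & (PAGE_SIZE_4K - 1);
}

/* True if [addr, addr + size) stays inside a single 4k page. */
static inline bool bar_fits_in_one_page(uint64_t addr, uint64_t size)
{
	return size != 0 && size <= PAGE_SIZE_4K &&
	       page_offset(addr) + size <= PAGE_SIZE_4K;
}

/*
 * Hypothetical check: a sub-page BAR can only be handled by assigning
 * the whole page if it sits at the same in-page offset on guest and
 * host. Whether the BAR is alone in its page is NOT checked here.
 */
static inline bool bar_can_assign_whole_page(uint64_t host_addr,
					     uint64_t guest_addr,
					     uint64_t size)
{
	return bar_fits_in_one_page(host_addr, size) &&
	       bar_fits_in_one_page(guest_addr, size) &&
	       page_offset(host_addr) == page_offset(guest_addr);
}

/* Alignment the firmware would use with a 4k minimum for mmio BARs. */
static inline uint64_t bar_min_alignment(uint64_t size)
{
	return size < PAGE_SIZE_4K ? PAGE_SIZE_4K : size;
}
```

With the EHCI BAR from the lspci output (host address c8014400, size
256), the guest BAR would have to land at offset 0x400 of its page for
the whole-page assignment to work.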

--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-14 Thread Minchan Kim
Hi Balbir,

On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh
 wrote:
> Move reusable functionality outside of zone_reclaim.
> Make zone_reclaim_unmapped_pages modular
>
> Signed-off-by: Balbir Singh 
> ---
>  mm/vmscan.c |   35 +++
>  1 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e841cae..4e2ad05 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
>  }
>
>  /*
> + * Helper function to reclaim unmapped pages, we might add something
> + * similar to this for slab cache as well. Currently this function
> + * is shared with __zone_reclaim()
> + */
> +static inline void
> +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
> +                               unsigned long nr_pages)
> +{
> +       int priority;
> +       /*
> +        * Free memory by calling shrink zone with increasing
> +        * priorities until we have enough memory freed.
> +        */
> +       priority = ZONE_RECLAIM_PRIORITY;
> +       do {
> +               shrink_zone(priority, zone, sc);
> +               priority--;
> +       } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
> +}

As I said on the previous version, zone_reclaim_unmapped_pages doesn't
do anything that is specific to reclaiming unmapped pages, so the
function name is rather strange.
It would be better to set up scan_control inside the function so that
it reclaims only unmapped pages.
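
To illustrate the idea, here is a minimal userspace sketch, not kernel
code. struct zone, struct scan_control and shrink_zone() are simplified
mocks that only carry what the example needs, and the batch size per
priority is made up; only the loop shape comes from the patch. The value
4 for ZONE_RECLAIM_PRIORITY is what I believe mm/vmscan.c uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ZONE_RECLAIM_PRIORITY 4	/* believed to match mm/vmscan.c */

/* Mock: only the fields this sketch needs. */
struct scan_control {
	unsigned long nr_reclaimed;
	bool may_unmap;		/* allowed to reclaim mapped pages? */
	bool may_swap;
};

/* Mock zone: just two page counters. */
struct zone {
	unsigned long unmapped_pages;
	unsigned long mapped_pages;
};

/* Mock shrink_zone(): scans a larger (made-up) batch as priority drops. */
static void shrink_zone(int priority, struct zone *zone,
			struct scan_control *sc)
{
	unsigned long batch = 64UL >> priority;
	unsigned long take = batch < zone->unmapped_pages ?
			     batch : zone->unmapped_pages;

	zone->unmapped_pages -= take;
	sc->nr_reclaimed += take;
	batch -= take;

	/* Mapped pages are only touched when the caller allows it. */
	if (sc->may_unmap && batch) {
		take = batch < zone->mapped_pages ? batch : zone->mapped_pages;
		zone->mapped_pages -= take;
		sc->nr_reclaimed += take;
	}
}

/*
 * The scan_control setup lives inside the function, so the name now
 * matches the behaviour: mapped pages are never reclaimed from here.
 */
static unsigned long zone_reclaim_unmapped_pages(struct zone *zone,
						 unsigned long nr_pages)
{
	struct scan_control sc = {
		.nr_reclaimed = 0,
		.may_unmap = false,
		.may_swap = false,
	};
	int priority = ZONE_RECLAIM_PRIORITY;

	assert(zone != NULL);
	do {
		shrink_zone(priority, zone, &sc);
		priority--;
	} while (priority >= 0 && sc.nr_reclaimed < nr_pages);

	return sc.nr_reclaimed;
}
```

In the real code __zone_reclaim() would of course still need its own
scan_control for the other reclaim modes; this only shows the shape of
the helper.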

-- 
Kind regards,
Minchan Kim


[PATCH] kvm,x86: return true when user space query KVM_CAP_USER_NMI extension

2010-12-14 Thread Lai Jiangshan
Userspace may check for this extension at runtime.

Signed-off-by:  Lai Jiangshan 
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cdac9e5..3d6b9ec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1909,6 +1909,7 @@ int kvm_dev_ioctl_check_extension(long ext)
        case KVM_CAP_NOP_IO_DELAY:
        case KVM_CAP_MP_STATE:
        case KVM_CAP_SYNC_MMU:
+       case KVM_CAP_USER_NMI:
        case KVM_CAP_REINJECT_CONTROL:
        case KVM_CAP_IRQ_INJECT_STATUS:
        case KVM_CAP_ASSIGN_DEV_IRQ:
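
For reference, a runtime check from userspace could look roughly like
the sketch here. It uses the KVM_CHECK_EXTENSION ioctl on /dev/kvm;
the helper name kvm_has_user_nmi() is made up for this example and is
not part of the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: returns > 0 if KVM_CAP_USER_NMI is reported,
 * 0 if it is not, and -1 if /dev/kvm cannot be opened or the ioctl
 * itself fails.
 */
static int kvm_has_user_nmi(void)
{
	int fd, ret;

	fd = open("/dev/kvm", O_RDWR);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_USER_NMI);
	close(fd);

	return ret < 0 ? -1 : ret;
}
```

Without the case label added by this patch, such a check would be
expected to return 0 on x86 and userspace would conclude that the
capability is missing.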


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-14 Thread Balbir Singh
* Rik van Riel  [2010-12-13 12:02:51]:

> On 12/11/2010 08:57 AM, Balbir Singh wrote:
> 
> >If the vcpu holding the lock runs more and gets capped, the timeslice
> >transfer is a heuristic that will not help.
> 
> That indicates you really need the cap to be per guest, and
> not per VCPU.
>

Yes, I personally think so too, but I suspect there needs to be a
larger agreement on the semantics. The VCPU semantics in terms of
power apply to each VCPU as opposed to the entire system (per guest).
 
> Having one VCPU spin on a lock (and achieve nothing), because
> the other one cannot give up the lock due to hitting its CPU
> cap could lead to showstoppingly bad performance.

Yes, that seems right!

-- 
Three Cheers,
Balbir


Re: [PATCH v4 2/2] RAM API: Make use of it for x86 PC

2010-12-14 Thread Avi Kivity

On 12/13/2010 11:24 PM, Alex Williamson wrote:
> Register the actual VM RAM using the new API
>
> @@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
>       /* allocate RAM */
>       ram_addr = qemu_ram_alloc(NULL, "pc.ram",
>                                 below_4g_mem_size + above_4g_mem_size);
> -    cpu_register_physical_memory(0, 0xa0000, ram_addr);
> -    cpu_register_physical_memory(0x100000,
> -                 below_4g_mem_size - 0x100000,
> -                 ram_addr + 0x100000);
> +    ram_register(0, below_4g_mem_size, ram_addr);

What's the impact of this?  Won't it conflict with BIOS memory
registration?  What about VGA?

In terms of patch hygiene, it should be in a separate patch titled
"register 0xa0000-0x100000 as RAM" or something.  It's a much more
drastic change than making use of the new RAM API.
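
To spell out the difference in coverage, here is a small standalone
sketch. It does not use any QEMU API; struct ram_range and the helper
names are invented for illustration. It compares the two ranges the old
code registered with the single range the patch registers, using the
legacy window between 640k and 1M (0xa0000 to 0x100000) as the
interesting case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one registered RAM range; not a QEMU type. */
struct ram_range {
	uint64_t start;
	uint64_t size;
};

/* True if addr falls inside any of the n ranges. */
static bool ram_covers(const struct ram_range *r, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= r[i].start && addr - r[i].start < r[i].size)
			return true;
	return false;
}

/* Old code: two ranges, the 0xa0000-0x100000 window is skipped. */
static size_t old_layout(uint64_t below_4g, struct ram_range out[2])
{
	out[0] = (struct ram_range){ 0, 0xa0000 };
	out[1] = (struct ram_range){ 0x100000, below_4g - 0x100000 };
	return 2;
}

/* Patched code: one range, so the window is registered as RAM too. */
static size_t new_layout(uint64_t below_4g, struct ram_range out[1])
{
	out[0] = (struct ram_range){ 0, below_4g };
	return 1;
}
```

An address such as 0xb8000 (VGA text memory) is outside both old ranges
but inside the new one, which is why the question about VGA and BIOS
registration comes up.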


--
error compiling committee.c: too many arguments to function



Re: [PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable

2010-12-14 Thread Avi Kivity

On 12/14/2010 03:53 AM, Xiao Guangrong wrote:
> >  I just sent a patch to fix this in a different way, please review it.
> >
>
> Your patch is good for me, please ignore this one :-)
>
> Umm, do we need to move "access &= ~ACC_WRITE_MASK" into set_spte() so
> that we can remove the same code in the caller?

I guess set_spte() is the better place for this.

--
error compiling committee.c: too many arguments to function



Re: [PATCH v3 0/4] KVM & genirq: Enable adaptive IRQ sharing for passed-through devices

2010-12-14 Thread Avi Kivity

On 12/14/2010 12:59 AM, Jan Kiszka wrote:
> Final but critical question: Who will pick up which bits?

The procedure which has served us well in the past is that tip picks up
the irq stuff and sticks them in a fast-forward-only branch; kvm merges
the branch and applies the kvm bits on top.


--
error compiling committee.c: too many arguments to function
