Re: Questions about the real mode in kvm/qemu

2019-09-28 Thread Avi Kivity

On 9/26/19 12:18 PM, Paolo Bonzini wrote:

On 26/09/19 10:59, Maxim Levitsky wrote:

If you mean to ask whether there is a way to let the guest use no
paging at all, that is, access host physical addresses directly, then
indeed there is no way, since regular non-'unrestricted guest' mode
requires both protected mode and paging, and 'unrestricted guest' mode
requires EPT. Academically speaking it is of course possible to
create page tables that map 1:1...

Not so academically, it's exactly what KVM does.  However, indeed it
would also be possible to switch out of EPT mode when CR0.PG=0.  I'm not
sure why it was done this way, maybe when the code was written it was
simpler to use the identity map.

Let's see if Avi is listening... :)



I think it was just simpler for the people who implemented it at the 
time. Switching out of EPT would have been a better solution as it would 
no longer require allocating guest physical address space with the few 
warts that requires.
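The 1:1 page tables mentioned in the thread are simple to picture. Below is a minimal, purely illustrative sketch (not KVM's actual code, which sets up its identity map in the VMX initialization path) of a 32-bit page directory that identity-maps the 4 GB address space using 4 MB pages:

```c
#include <stdint.h>

#define PDE_PRESENT (1u << 0)
#define PDE_RW      (1u << 1)
#define PDE_PS      (1u << 7)   /* 4 MB "large" page (requires CR4.PSE) */

/* Fill a page directory so that entry i maps virtual address i << 22
 * to the identical physical address i << 22. */
void build_identity_pd(uint32_t pd[1024])
{
    for (uint32_t i = 0; i < 1024; i++)
        pd[i] = (i << 22) | PDE_PRESENT | PDE_RW | PDE_PS;
}
```

With such a table loaded, a real-mode guest's physical accesses go through paging but land on the same addresses, which is what makes the identity map a workable substitute for "no paging".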







Re: [Qemu-devel] Linux kernel polling for QEMU

2016-12-07 Thread Avi Kivity



On 12/02/2016 12:12 PM, Stefan Hajnoczi wrote:

On Thu, Dec 01, 2016 at 09:35:27AM -0500, Paolo Bonzini wrote:

Maybe we could do the same for sockets?  When data is available on a
socket (or when it becomes writable), write to a user memory location.

I, too, have an interest in polling; in my situation most of the polling
happens in userspace.

You are trying to improve on the latency of non-blocking
ppoll(2)/epoll_wait(2) call?

Yes, but the concern is for throughput, not latency.

My main loop looks like

 execute some tasks
 poll from many sources

Since epoll_wait(..., 0) has non-trivial costs, I have to limit the
polling rate, and even so it shows up in the profile.  If the cost were
lower, I could poll at a higher frequency, resulting in lower worst-case
latencies for high-priority tasks.

IMHO, the ideal model wouldn't enter the kernel at all unless you _want_
to go to sleep.

Something like mmapping an epoll file descriptor to get a ring of
epoll_events.  It has to play nice with blocking epoll_wait(2) and
ppoll(2) of an eventpoll file descriptor so that the userspace process
can go to sleep without a lot of work to switch epoll wait vs mmap
modes.


I think that's overkill; as soon as data is returned, you're going to be 
talking to the kernel in order to get it.  Besides, leave something for 
userspace tcp stack implementations.



I'm bothered by the fact that the kernel will not be able to poll NIC or
storage controller rings if userspace is doing the polling :(.



Yes, that's a concern.

We have three conflicting requirements:
 - fast inter-thread IPC (or guest/host IPC) -> shared memory + polling
 - kernel multiplexing of hardware -> kernel interfaces + kernel poll/sleep
 - composable interfaces that allow you to use the two methods

You could use something like Flow Director to move interesting flows 
directly to the application, let the application poll those queues (and 
maybe tell the kernel if something interesting happens), while the 
kernel manages the other queues.  It gets ugly very quickly.





Re: [Qemu-devel] Linux kernel polling for QEMU

2016-12-07 Thread Avi Kivity



On 12/01/2016 04:35 PM, Paolo Bonzini wrote:

Maybe we could do the same for sockets?  When data is available on a
socket (or when it becomes writable), write to a user memory location.

I, too, have an interest in polling; in my situation most of the polling
happens in userspace.

You are trying to improve on the latency of non-blocking
ppoll(2)/epoll_wait(2) call?

Yes, but the concern is for throughput, not latency.

My main loop looks like

 execute some tasks
 poll from many sources

Since epoll_wait(..., 0) has non-trivial costs, I have to limit the
polling rate, and even so it shows up in the profile.  If the cost were
lower, I could poll at a higher frequency, resulting in lower worst-case
latencies for high-priority tasks.

IMHO, the ideal model wouldn't enter the kernel at all unless you _want_
to go to sleep.


Right.  Note that if data is available, it doesn't really matter, 
because you'd have to read() it anyway.  It matters for the case where 
data is not available.




Re: [Qemu-devel] Linux kernel polling for QEMU

2016-12-01 Thread Avi Kivity

On 12/01/2016 01:45 PM, Stefan Hajnoczi wrote:

On Wed, Nov 30, 2016 at 07:41:11PM +0200, Avi Kivity wrote:

On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote:

On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote:

On 28/11/2016 16:29, Stefan Hajnoczi wrote:

Thanks for sharing the link.  I'll let you know before embarking on an
effort to make epoll support busy_loop.

At the moment I'm still evaluating whether the good results we've gotten
from polling in QEMU userspace are preserved when polling is shifted to
the kernel.

FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
chance to test it yet:
https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13

This would add a system call every time the main loop processes a vring,
wouldn't it?

Yes, this is a problem and is the reason I haven't finished implementing
a test using QEMU yet.

My proposed eventfd polling mechanism doesn't work well with descriptor
ring indices because the polling info needs to be updated each event
loop iteration with the last seen ring index.

This can be solved by making struct eventfd_poll_info.value take a
userspace memory address.  The value to compare against is fetched each
polling iteration, avoiding the need for ioctl calls.


Maybe we could do the same for sockets?  When data is available on a socket
(or when it becomes writable), write to a user memory location.

I, too, have an interest in polling; in my situation most of the polling
happens in userspace.

You are trying to improve on the latency of non-blocking
ppoll(2)/epoll_wait(2) call?



Yes, but the concern is for throughput, not latency.


My main loop looks like

   execute some tasks
   poll from many sources

Since epoll_wait(..., 0) has non-trivial costs, I have to limit the 
polling rate, and even so it shows up in the profile.  If the cost were 
lower, I could poll at a higher frequency, resulting in lower worst-case 
latencies for high-priority tasks.





Re: [Qemu-devel] Linux kernel polling for QEMU

2016-11-30 Thread Avi Kivity



On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote:

On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote:

On 28/11/2016 16:29, Stefan Hajnoczi wrote:

Thanks for sharing the link.  I'll let you know before embarking on an
effort to make epoll support busy_loop.

At the moment I'm still evaluating whether the good results we've gotten
from polling in QEMU userspace are preserved when polling is shifted to
the kernel.

FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a
chance to test it yet:
https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13

This would add a system call every time the main loop processes a vring,
wouldn't it?

Yes, this is a problem and is the reason I haven't finished implementing
a test using QEMU yet.

My proposed eventfd polling mechanism doesn't work well with descriptor
ring indices because the polling info needs to be updated each event
loop iteration with the last seen ring index.

This can be solved by making struct eventfd_poll_info.value take a
userspace memory address.  The value to compare against is fetched each
polling iteration, avoiding the need for ioctl calls.



Maybe we could do the same for sockets?  When data is available on a 
socket (or when it becomes writable), write to a user memory location.


I, too, have an interest in polling; in my situation most of the polling 
happens in userspace.
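The "write to a user memory location" idea boils down to comparing a shared word instead of making a syscall on every poll pass. A minimal, purely illustrative sketch (the names are not from any proposed kernel interface):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Producer side (kernel or peer thread): bump the word when data arrives. */
void signal_ready(atomic_uint *word)
{
    atomic_fetch_add_explicit(word, 1, memory_order_release);
}

/* Consumer side: a poll is just a load-and-compare; the kernel is entered
 * only when the consumer decides to block. */
bool poll_word(const atomic_uint *word, unsigned last_seen)
{
    return atomic_load_explicit(word, memory_order_acquire) != last_seen;
}
```

The acquire/release pairing ensures the consumer observes the data written before the word was bumped, which is the minimum needed for this scheme to be usable at all.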




Re: [Qemu-devel] [PATCH RFC v2 00/22] I/O prefetch cache

2016-09-01 Thread Avi Kivity

On 08/29/2016 08:09 PM, Pavel Butsykin wrote:

The prefetch cache aims to improve the performance of sequential reads.
Of most interest here are small sequential read requests; such requests
can be optimized by extending them and serving them from the prefetch
cache. However, there are two issues:
  - In aggregate, only a small portion of requests is sequential, so delays
caused by reading larger volumes of data can lead to an overall decrease
in performance.
  - With a large number of random requests, the cache fills with redundant
data.
This pcache implementation addresses these and other problems of data
prefetching. The pcache algorithm can be summarised by the following main steps.
The pcache algorithm can be summarised by the following main steps.

1. Monitor I/O requests to identify typical sequences.
This implementation of the prefetch cache works at the storage-system level and
has information only about the physical block addresses of I/O requests.
Statistics are collected only from read requests up to a maximum size of 32 KB
(by default); each request that matches the criteria falls into a pool of
requests. The request statistics are stored in an rb-tree (lreq.tree), a simple
but, for this task, quite efficient data structure.

2. Identify sequential I/O streams.
For each read request, pcache attempts to look up in lreq.tree a chain of
requests of which the current request would be the next sequential element.
The lookup key for consecutive requests is the area of sectors immediately
preceding the current request; this area should not be too small, to avoid
false readahead. Sequential streams can be identified even among a large
number of random requests. For example, given accesses to blocks 100, 1157,
27520, 4, 101, 312, 1337, 102, while processing request 102 the chain of
sequential requests 100, 101, 102 is identified, and a decision can then be
made to do readahead. A situation may also arise where multiple applications
A, B, C perform sequential reads simultaneously. Each application on its own
reads data sequentially, A(100, 101, 102), B(300, 301, 302), C(700, 701, 702),
but to the block device it may look like random reading:
100, 300, 700, 101, 301, 701, 102, 302, 702. In this case the sequential
streams are still recognised, because the placement of requests in the
rb-tree separates the sequential I/O streams from one another.

3. Do readahead into the cache for recognized sequential data streams.
Once a pcache case has been detected, larger requests are needed to bring
data into the cache. This implementation uses readahead rather than extending
the original request, so the request itself goes through unchanged. There is
no reason to put data into the cache that will never be picked up, which
would always happen with request extension. The areas of cached blocks are
likewise stored in an rb-tree (pcache.tree), a simple but, for this task,
quite efficient data structure.

4. Control the size of the prefetch cache pool and the request-statistics pool.
To bound the request-statistics pool, request data is added and replaced
according to the FIFO principle; this part is simple. The memory cache is
bounded by an LRU list, which limits the maximum amount of memory that pcache
may allocate. The LRU exists mainly to prevent eviction of cache blocks that
have only been partially read; in the common case, memory is released
immediately after use, as soon as a chunk of cached memory has been completely
read, since the probability of the request repeating is very low. Cases where
the same portion of cache memory is read several times are not optimized; they
are not among the cases pcache targets. Thus, with a small cache, by
optimizing readahead and memory reclaim, entire volumes of data can be read
with a 100% cache-hit rate, while the performance of random read requests
does not decrease.

PCache is implemented as a qemu block filter driver, has some configurable
parameters, such as: total cache size, readahead size, maximum size of block
that can be processed.
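As a rough illustration of step 2 (not the actual pcache code, which keys an rb-tree by block address), a toy sequence detector over a small history window behaves exactly as the block-number example above describes:

```c
#include <stdbool.h>
#include <stddef.h>

enum { HISTORY = 32 };
static unsigned long history[HISTORY];
static size_t hist_n;

/* A request is treated as part of a sequential stream if the block
 * immediately preceding it was recently read.  A real implementation
 * would use an ordered tree, not a linear scan, and match a whole
 * preceding area rather than a single block. */
bool is_sequential(unsigned long block)
{
    bool seq = false;
    for (size_t i = 0; i < hist_n; i++)
        if (history[i] + 1 == block)
            seq = true;
    if (hist_n < HISTORY)
        history[hist_n++] = block;
    return seq;
}
```

Fed the sequence 100, 1157, 27520, 4, 101, 312, 1337, 102 from the text, only 101 and 102 are flagged as sequential, matching the description.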

For performance evaluation, several test cases with different sequential and
random read patterns were run against an SSD disk. Here are the test results
and the qemu parameters:

qemu parameters:
-M pc-i440fx-2.4 --enable-kvm -smp 4 -m 1024
-drive file=centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,
aio=native,pcache-full-size=4MB,pcache-readahead-size=128KB,
pcache-max-aio-size=32KB
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,
 id=virtio-disk0
(-set device.virtio-disk0.x-data-plane=on)


* Testcase* Results in iops 

Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-27 Thread Avi Kivity



On 05/27/2015 12:30 PM, Paolo Bonzini wrote:


On 26/05/2015 23:25, Christopher Covington wrote:

On 05/25/2015 08:53 AM, Paolo Bonzini wrote:

On 22/05/2015 13:12, Daniel P. Berrange wrote:

In
particular I don't see why we need to have a SATA controller and ISA/LPC
bridge in every virt machine - root PCI bus only should be possible, as you
can provide disks via virtio-blk or virtio-scsi and serial, parallel, mouse,
floppy via PCI devices and/or by adding a USB bus in the cases where you
really need one.

I think removing the ISA/LPC bridge is hard.  It includes the real-time
clock and fw_cfg, for example.

Could VirtIO specified replacements make sense for these peripherals?

Not really.  virtio is too heavyweight and you'd be reinventing the
wheel unnecessarily.

For example, ARM's -M virt uses a pl031 block for the RTC, and also
uses fw_cfg.  Another commonly used ISA device is the UART, for which
again -M virt uses a pl011.



The RTC can be replaced by kvmclock, the keyboard by virtio-console.  
Maybe we can provide an msr- or pci- based interface to fw_cfg.




Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Avi Kivity

On 05/21/2015 07:21 PM, Paolo Bonzini wrote:


On 21/05/2015 17:48, Avi Kivity wrote:

Lovely!

Note you have memcpy.o instead of memcpy.c.

Doh, and it's not used anyway.  Check the repository, and let me know if
OSv boots with it (it probably needs ACPI; Linux doesn't boot virtio
without ACPI).



Yes, it requires ACPI.  We don't implement the pre-ACPI bootstrap methods.



Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Avi Kivity

On 05/21/2015 04:51 PM, Paolo Bonzini wrote:

Some of you may have heard about the Clear Containers initiative from
Intel, which couples KVM with various kernel tricks to create extremely
lightweight virtual machines.  The experimental Clear Containers setup
requires only 18-20 MB to launch a virtual machine, and needs about 60
ms to boot.

Now, as all of you probably know, QEMU is great for running Windows or
legacy Linux guests, but that flexibility comes at a hefty price. Not
only does all of the emulation consume memory, it also requires some
form of low-level firmware in the guest as well. All of this adds quite
a bit to virtual-machine startup times (500 to 700 milliseconds is not
unusual).

Right?  In fact, it's for this reason that Clear Containers uses kvmtool
instead of QEMU.

No, wrong!  In fact, reporting bad performance is pretty much the same
as throwing down the gauntlet.

Enter qboot, a minimal x86 firmware that runs on QEMU and, together with
a slimmed-down QEMU configuration, boots a virtual machine in 40
milliseconds[2] on an Ivy Bridge Core i7 processor.

qboot is available at git://github.com/bonzini/qboot.git.  In all the
glory of its 8KB of code, it brings together various existing open
source components:

* a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS

* a couple hardware initialization routines written mostly from scratch
but with good help from SeaBIOS source code

* a minimal 32-bit libc based on kvm-unit-tests

* the Linux loader from QEMU itself

The repository has more information on how to achieve fast boot times,
and examples of using qboot.  Right now there is a limit of 8 MB for
vmlinuz+initrd+cmdline, which however should be enough for initrd-less
containers.

The first commit to qboot is more or less 24 hours old, so there is
definitely more work to do, in particular to extract ACPI tables from
QEMU and present them to the guest.  This is probably another day of
work or so, and it will enable multiprocessor guests with little or no
impact on the boot times.  SMBIOS information is also available from QEMU.

On the QEMU side, there is no support yet for persistent memory and the
NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
will automatically start using it.

Happy hacking!



Lovely!

Note you have memcpy.o instead of memcpy.c.



Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote:

a single descriptor might use all of
the virtqueue. In this case we won't be able to pass the
descriptor directly to linux as a single iov, since



You could separate maximum request scatter/gather list size from the 
virtqueue size.  They are totally unrelated - even now you can have a 
larger request by using indirect descriptors.
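For reference, this is the split-virtqueue descriptor layout from the virtio specification; a slot flagged VRING_DESC_F_INDIRECT points at a separate table of descriptors, which is why a request's scatter/gather length is independent of the ring size. The `slot_segments` helper is illustrative, not part of any real driver:

```c
#include <stdint.h>
#include <stddef.h>

#define VRING_DESC_F_NEXT     1
#define VRING_DESC_F_WRITE    2
#define VRING_DESC_F_INDIRECT 4

struct vring_desc {
    uint64_t addr;   /* guest-physical buffer (or indirect table) address */
    uint32_t len;    /* buffer (or indirect table) length in bytes */
    uint16_t flags;
    uint16_t next;   /* index of the chained descriptor, if F_NEXT */
};

/* Number of scatter/gather segments one ring slot describes
 * (ignoring F_NEXT chains for brevity). */
size_t slot_segments(const struct vring_desc *slot)
{
    if (slot->flags & VRING_DESC_F_INDIRECT)
        return slot->len / sizeof(struct vring_desc);
    return 1;
}
```

One ring slot can thus describe an arbitrarily long request, so the ring size only bounds how many requests are in flight, not how big each one is.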





Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 10/08/2014 12:15 PM, Michael S. Tsirkin wrote:

On Wed, Oct 08, 2014 at 10:43:07AM +0300, Avi Kivity wrote:

On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote:

a single descriptor might use all of
the virtqueue. In this case we won't be able to pass the
descriptor directly to linux as a single iov, since


You could separate maximum request scatter/gather list size from the
virtqueue size.  They are totally unrelated - even now you can have a larger
request by using indirect descriptors.

We could add a feature to have a smaller or larger S/G length limit.
Is this something useful?



Having a larger ring size is useful, esp. with zero-copy transmit, and 
you would need the sglist length limit in order to not require 
linearization on linux hosts.  So the limit is not useful in itself, 
only indirectly.


Google cloud engine exposes virtio ring sizes of 16384.

Even more useful is getting rid of the desc array and instead passing 
descs inline in avail and used.





Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 10/08/2014 01:14 PM, Michael S. Tsirkin wrote:

On Wed, Oct 08, 2014 at 12:51:21PM +0300, Avi Kivity wrote:

On 10/08/2014 12:15 PM, Michael S. Tsirkin wrote:

On Wed, Oct 08, 2014 at 10:43:07AM +0300, Avi Kivity wrote:

On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote:

a single descriptor might use all of
the virtqueue. In this case we won't be able to pass the
descriptor directly to linux as a single iov, since


You could separate maximum request scatter/gather list size from the
virtqueue size.  They are totally unrelated - even now you can have a larger
request by using indirect descriptors.

We could add a feature to have a smaller or larger S/G length limit.
Is this something useful?


Having a larger ring size is useful, esp. with zero-copy transmit, and you
would need the sglist length limit in order to not require linearization on
linux hosts.  So the limit is not useful in itself, only indirectly.

Google cloud engine exposes virtio ring sizes of 16384.

OK this sounds useful, I'll queue this up for consideration.
Thanks!


Thanks.


Even more useful is getting rid of the desc array and instead passing descs
inline in avail and used.

You expect this to improve performance?
Quite possibly but this will have to be demonstrated.



The top vhost function in small packet workloads is vhost_get_vq_desc, 
and the top instruction within that (50%) is the one that reads the 
first 8 bytes of desc.  It's a guaranteed cache line miss (and again on 
the guest side when it's time to reuse).


Inline descriptors will amortize the cache miss over 4 descriptors, and 
will allow the hardware to prefetch, since the descriptors are linear in 
memory.






Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 10/08/2014 01:55 PM, Michael S. Tsirkin wrote:

Even more useful is getting rid of the desc array and instead passing descs
inline in avail and used.

You expect this to improve performance?
Quite possibly but this will have to be demonstrated.


The top vhost function in small packet workloads is vhost_get_vq_desc, and
the top instruction within that (50%) is the one that reads the first 8
bytes of desc.  It's a guaranteed cache line miss (and again on the guest
side when it's time to reuse).

OK so basically what you are pointing out is that we get 5 accesses:
read of available head, read of available ring, read of descriptor,
write of used ring, write of used ring head.


Right.  And only read of descriptor is not amortized.


If processing is in-order, we could build a much simpler design, with a
valid bit in the descriptor, cleared by host as descriptors are
consumed.

Basically get rid of both used and available ring.


That only works if you don't allow reordering, which is never the case 
for block, and not the case for zero-copy net.  It also has writers on 
both sides of the ring.


The right design is to keep avail and used, but instead of making them 
rings of pointers to descs, make them rings of descs.


The host reads descs from avail, processes them, then writes them back 
on used (possibly out-of-order).  The guest writes descs to avail and 
reads them back from used.


You'll probably have to add a 64-bit cookie to desc so you can complete 
without an additional lookup.




Sounds good in theory.


Inline descriptors will amortize the cache miss over 4 descriptors, and will
allow the hardware to prefetch, since the descriptors are linear in memory.

If descriptors are used in order (as they are with current qemu)
then aren't they amortized already?






Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 10/08/2014 01:55 PM, Michael S. Tsirkin wrote:



Inline descriptors will amortize the cache miss over 4 descriptors, and will
allow the hardware to prefetch, since the descriptors are linear in memory.

If descriptors are used in order (as they are with current qemu)
then aren't they amortized already?



The descriptors are only in-order for non-zero-copy net.  They are out 
of order for block and zero-copy net.


(also, the guest has to be careful in how it allocates descriptors).



Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 10/08/2014 03:22 PM, Michael S. Tsirkin wrote:

On Wed, Oct 08, 2014 at 01:59:13PM +0300, Avi Kivity wrote:

On 10/08/2014 01:55 PM, Michael S. Tsirkin wrote:

Even more useful is getting rid of the desc array and instead passing descs
inline in avail and used.

You expect this to improve performance?
Quite possibly but this will have to be demonstrated.


The top vhost function in small packet workloads is vhost_get_vq_desc, and
the top instruction within that (50%) is the one that reads the first 8
bytes of desc.  It's a guaranteed cache line miss (and again on the guest
side when it's time to reuse).

OK so basically what you are pointing out is that we get 5 accesses:
read of available head, read of available ring, read of descriptor,
write of used ring, write of used ring head.

Right.  And only read of descriptor is not amortized.


If processing is in-order, we could build a much simpler design, with a
valid bit in the descriptor, cleared by host as descriptors are
consumed.

Basically get rid of both used and available ring.

That only works if you don't allow reordering, which is never the case for
block, and not the case for zero-copy net.  It also has writers on both side
of the ring.

The right design is to keep avail and used, but instead of making them rings
of pointers to descs, make them rings of descs.

The host reads descs from avail, processes them, then writes them back on
used (possibly out-of-order).  The guest writes descs to avail and reads
them back from used.

You'll probably have to add a 64-bit cookie to desc so you can complete
without an additional lookup.

My old presentation from 2012 or so suggested something like this.
We don't need a 64 bit cookie I think - a small 16 bit one
should be enough.



A 16 bit cookie means you need an extra table to hold the real request 
pointers.


With a 64-bit cookie you can store a pointer to the skbuff or bio in the 
ring itself, and avoid the extra lookup.


The extra lookup isn't the end of the world, since it doesn't cross core 
boundaries, but it's worth avoiding.





Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?

2014-10-08 Thread Avi Kivity


On 10/08/2014 03:28 PM, Avi Kivity wrote:

My old presentation from 2012 or so suggested something like this.

We don't need a 64 bit cookie I think - a small 16 bit one
should be enough.



A 16 bit cookie means you need an extra table to hold the real request 
pointers.


With a 64-bit cookie you can store a pointer to the skbuff or bio in 
the ring itself, and avoid the extra lookup.


The extra lookup isn't the end of the world, since it doesn't cross core 
boundaries, but it's worth avoiding.




What you can do is have two types of descriptors: head and fragment

union desc {
    struct head {
        u16 nfrags;
        u16 flags;
        u64 cookie;
    } head;
    struct frag {
        u64 paddr;
        u16 flen;
        u16 flags;
    } frag;
};

so now a request length is 12*(nfrags+1) bytes (taking the structs as packed).

You can be evil and steal some bits from paddr/cookie, and have each 
descriptor 8 bytes long.


btw, I also recommend storing things like vnet_hdr in the ring itself, 
instead of out-of-line in memory.  Maybe the ring should just transport 
bytes and let the upper layer decide how it's formatted.






Re: [Qemu-devel] memory: memory_region_transaction_commit() slow

2014-06-29 Thread Avi Kivity


On 06/26/2014 05:31 PM, Etienne Martineau wrote:

On 14-06-26 04:18 AM, Avi Kivity wrote:

On 06/25/2014 08:53 PM, Etienne Martineau wrote:

Hi,

It seems to me that there is a scale issue O(n) in 
memory_region_transaction_commit().

It's actually O(n^3).  Flatview is kept sorted but is just a vector, so if you 
insert n regions, you have n^2 operations. In addition every PCI device has an 
address space, so we get n^3 (technically the third n is different from the 
first two, but they are related).

The first problem can be solved by implementing Flatview with an std::set or 
equivalent, the second by memoization - most pci address spaces are equal (they only 
differ based on whether bus mastering is enabled or not), so a clever cache can 
reduce the effort to generate them.

However I'm not at all sure that the problem is cpu time in qemu. It could be 
due to rcu_synchronize delays when the new memory maps are fed to kvm and vfio. 
 I recommend trying to isolate exactly where the time is spent.


It seems like the linear increase in CPU time comes from QEMU (at least from 
my measurements below).


In those code paths QEMU calls back into KVM (KVM_SET_MEMORY_REGION) and 
vfio.  So it would be good to understand exactly where the time is 
spent. I doubt it's computation (which is O(n^3), but very fast), 
instead it's likely waiting for something.




In QEMU kvm_cpu_exec() I've added a hook that measures the time spent 
outside 'kvm_vcpu_ioctl(cpu, KVM_RUN, 0)'.
 From the logs below this is 'QEMU long exit vCPU n x(msec) exit_reason'.

Similarly, in KVM vcpu_enter_guest() I've added a new ftrace that measures the 
time spent outside 'kvm_x86_ops->run(vcpu)'.
 From the logs below this is 'kvm_long_exit: x(msec)'. Please note that this is 
a trimmed-down view of the real ftrace output.

Also please note that the above hacks are useful (at least to me, since I 
haven't figured out a better way to do the same with existing ftrace) to 
measure the RTT at both the QEMU and KVM level.

The time spent outside KVM's 'kvm_x86_ops->run(vcpu)' will always be greater 
than the time spent outside QEMU's 'kvm_vcpu_ioctl(cpu, KVM_RUN, 0)' for a 
given vCPU. The difference between the two, for a given vCPU, tells us who is 
burning cycles (QEMU or KVM) and by how much (in msec).

In the experiment below I've put the QEMU and KVM RTT measurements side by 
side. We can see that the time to assign a device (same BAR size for all 
devices) increases linearly (as previously reported). Also, from the RTT 
measurements, QEMU and KVM are mostly within the same range, suggesting that 
the increase comes from QEMU and not KVM.

The one exception is that for every device assignment there is a KVM operation 
that seems to take ~100 msec each time. Since this is O(1) I'm not too 
concerned.


device assign #1:
device_add pci-assign,host=28:10.2,bus=pciehp.3.8

  kvm_long_exit: 100

QEMU long exit vCPU 0 25 2    kvm_long_exit: 26
QEMU long exit vCPU 0 20 2    kvm_long_exit: 20
QEMU long exit vCPU 0 20 2    kvm_long_exit: 20
QEMU long exit vCPU 0 20 2    kvm_long_exit: 20
QEMU long exit vCPU 0 19 2    kvm_long_exit: 19
QEMU long exit vCPU 0 19 2    kvm_long_exit: 19
QEMU long exit vCPU 0 19 2    kvm_long_exit: 20
QEMU long exit vCPU 0 19 2    kvm_long_exit: 19
QEMU long exit vCPU 0 19 2    kvm_long_exit: 19
QEMU long exit vCPU 0 19 2    kvm_long_exit: 19
QEMU long exit vCPU 0 19 2    kvm_long_exit: 20
QEMU long exit vCPU 0 42 2    kvm_long_exit: 42
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21

device assign #2:
device_add pci-assign,host=28:10.3,bus=pciehp.3.9

  kvm_long_exit: 101	

QEMU long exit vCPU 0 25 2    kvm_long_exit: 25
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 21 2    kvm_long_exit: 21
QEMU long exit vCPU 0 45 2    kvm_long_exit: 45
QEMU long exit vCPU 0 23 2    kvm_long_exit: 23

device assign #3:
device_add pci-assign,host=28:10.4,bus=pciehp.3.10

  kvm_long_exit: 100

QEMU long exit vCPU 0 25 2    kvm_long_exit: 25
QEMU long exit vCPU 0 23 2    kvm_long_exit: 23
QEMU long exit vCPU 0 23 2    kvm_long_exit: 23
QEMU long exit vCPU 0 23 2    kvm_long_exit: 23
QEMU long exit vCPU 0 23 2    kvm_long_exit: 23
QEMU long exit vCPU 0 23 2    kvm_long_exit: 23
QEMU long exit vCPU 0 23 2

Re: [Qemu-devel] memory: memory_region_transaction_commit() slow

2014-06-26 Thread Avi Kivity


On 06/25/2014 08:53 PM, Etienne Martineau wrote:

Hi,

It seems to me that there is a scale issue O(n) in 
memory_region_transaction_commit().


It's actually O(n^3).  Flatview is kept sorted but is just a vector, so 
if you insert n regions, you have n^2 operations. In addition every PCI 
device has an address space, so we get n^3 (technically the third n is 
different from the first two, but they are related).


The first problem can be solved by implementing Flatview with an 
std::set or equivalent, the second by memoization - most pci address 
spaces are equal (they only differ based on whether bus mastering is 
enabled or not), so a clever cache can reduce the effort to generate them.


However I'm not at all sure that the problem is cpu time in qemu. It 
could be due to rcu_synchronize delays when the new memory maps are fed 
to kvm and vfio.  I recommend trying to isolate exactly where the time 
is spent.
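The memoization suggested above might look roughly like this; the names and the single bus-master bit used as the cache key are purely illustrative, not QEMU's actual API:

```c
#include <stdlib.h>

struct flat_view {
    int refcount;
    /* ... flattened memory ranges would live here ... */
};

/* One cached view per variant.  In this sketch the only variant bit is
 * whether bus mastering is enabled, since most per-device PCI address
 * spaces differ only in that. */
static struct flat_view *cache[2];

struct flat_view *get_pci_view(int bus_master_enabled)
{
    struct flat_view **slot = &cache[!!bus_master_enabled];
    if (!*slot)
        *slot = calloc(1, sizeof(**slot));   /* build the view only once */
    (*slot)->refcount++;                     /* shared, not rebuilt per device */
    return *slot;
}
```

With n devices sharing two cached views instead of flattening n address spaces, the per-device cost of a commit drops from rebuilding a view to bumping a refcount.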



Basically the time it takes to rebuild the memory view during device assignment
pci_bridge_update_mappings() increase linearly with respect to the number of
device already assigned to the guest.

I'm running on a recent qemu.git and I merged from 
git://github.com/bonzini/qemu.git memory
the following patches that seems to be related to the scale issue I'm facing:
   Fam Zheng (1):
   memory: Don't call memory_region_update_coalesced_range if nothing 
changed
   Gonglei (1):
   memory: Don't update all memory region when ioeventfd changed

Those patches help but don't fix the issue. The problem becomes more noticeable
when many devices are assigned to the guest.

I'm running my test on a QEMU q35 machine with the following topology:
  ioh3420 ( root port )
   x3130-upstream
xio3130-downstream
xio3130-downstream
xio3130-downstream
...

I have added instrumentation in kvm_cpu_exec() to track the amount of time spent
in emulation ( patch at the end but not relevant for this discussion )

Here is what I see when I assign devices one after the other. NOTE the time-stamp is
in msec. The linear increase in the time comes from
memory_region_transaction_commit().

(qemu) device_add pci-assign,host=28:10.1,bus=pciehp.3.7
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 22 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 22 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 21 2
QEMU long exit vCPU 0 45 2  
QEMU long exit vCPU 0 23 2

(qemu) device_add pci-assign,host=28:10.2,bus=pciehp.3.8
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 23 2
QEMU long exit vCPU 0 49 2 
QEMU long exit vCPU 0 25 2


(qemu) device_add pci-assign,host=28:10.3,bus=pciehp.3.9
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 25 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 24 2
QEMU long exit vCPU 0 52 2 
QEMU long exit vCPU 0 26 2

(qemu) device_add pci-assign,host=28:10.4,bus=pciehp.3.10
QEMU long exit vCPU 0 35 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 27 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 26 2
QEMU long exit vCPU 0 56 2 
QEMU long exit vCPU 0 28 2

(qemu) device_add pci-assign,host=28:10.5,bus=pciehp.3.11
QEMU long exit vCPU 0 33 2
QEMU long exit vCPU 0 30 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 29 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 28 2
QEMU long exit vCPU 0 59 2 
QEMU long exit vCPU 0 30 2



diff --git a/kvm-all.c b/kvm-all.c
index 3ae30ee..e3a1964 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1685,6 +1685,8 @@ int kvm_cpu_exec(CPUState *cpu)
  {
  struct kvm_run *run = cpu->kvm_run;
  int ret, run_ret;
+int64_t clock_ns, delta_ms;
+__u32 last_exit_reason, last_vcpu;
  
  DPRINTF("kvm_cpu_exec()\n");
  
@@ -1711,6 +1713,12 @@ int kvm_cpu_exec(CPUState *cpu)

  }
  qemu_mutex_unlock_iothread();
  
+delta_ms = (get_clock() - clock_ns)/(1000*1000);

+if (delta_ms >= 10) {
+fprintf(stderr, "QEMU long exit vCPU %d %ld %d\n", last_vcpu,
+delta_ms, last_exit_reason);
+}
+
  run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
  
  qemu_mutex_lock_iothread();

@@ -1727,7 +1735,15 @@ int kvm_cpu_exec(CPUState 

Re: [Qemu-devel] [Qemu-trivial] [PATCH 2/3] int128.h: Avoid undefined behaviours involving signed arithmetic

2014-04-07 Thread Avi Kivity

On 04/06/2014 01:18 PM, Peter Maydell wrote:

On 6 April 2014 08:09, Michael Tokarev m...@tls.msk.ru wrote:

28.03.2014 19:12, Peter Maydell wrote:

Add casts when we're performing arithmetic on the .hi parts of an
Int128, to avoid undefined behaviour.

[]

  static inline Int128 int128_sub(Int128 a, Int128 b)
  {
-return (Int128){ a.lo - b.lo, a.hi - b.hi - (a.lo < b.lo) };
+return (Int128){ a.lo - b.lo, (uint64_t)a.hi - b.hi - (a.lo < b.lo) };

What was wrong with this one?  I don't think casting to unsigned here is
a good idea.

This patch is fixing these three clang sanitizer warnings:
/home/petmay01/linaro/qemu-from-laptop/qemu/include/qemu/int128.h:81:40:
runtime error: signed integer overflow: 0 - -9223372036854775808
cannot be represented in type 'long'
/home/petmay01/linaro/qemu-from-laptop/qemu/include/qemu/int128.h:81:47:
runtime error: signed integer overflow: -9223372036854775808 - 1
cannot be represented in type 'long'
/home/petmay01/linaro/qemu-from-laptop/qemu/include/qemu/int128.h:56:47:
runtime error: left shift of negative value -9223372036854775807

of which the first two are in this function.

Note that int128_add() already has a cast.

The alternative would be to say that Int128 should have
undefined behaviour on underflow/overflow and the test
code is wrong, but that doesn't seem very useful to me.




Isn't the test broken here?  It is trying to add (or shift) -2^127 and 
something else, and the result truly overflows.


A better behaviour would be to abort when this happens.  Int128 was 
designed to avoid silent overflows, not to silently cause breakage.


Not that I think it is necessary, there is no way for the guest to 
trigger an overflow.
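The two points above — do the .hi arithmetic in uint64_t so wraparound is well defined, and abort on genuine signed overflow rather than silently wrapping — can be combined in one sketch. This is an illustrative stand-alone version mirroring the Int128 layout under discussion, not QEMU's actual header; the final cast back to int64_t assumes two's-complement representation, as QEMU does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t lo; int64_t hi; } Int128;

Int128 int128_sub_checked(Int128 a, Int128 b)
{
    Int128 r;
    r.lo = a.lo - b.lo;
    /* unsigned arithmetic: wraparound is defined, no UB */
    r.hi = (int64_t)((uint64_t)a.hi - (uint64_t)b.hi - (a.lo < b.lo));
    /* signed overflow iff the operands have opposite signs and the
     * result's sign differs from the minuend's */
    if (((a.hi < 0) != (b.hi < 0)) && ((r.hi < 0) != (a.hi < 0))) {
        abort();    /* loud failure instead of silent breakage */
    }
    return r;
}
```

With this, the -2^127 test cases would abort rather than "pass" by wrapping, which matches the view that Int128 was designed to avoid silent overflows.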




Re: [Qemu-devel] [Qemu-trivial] [PATCH 2/3] int128.h: Avoid undefined behaviours involving signed arithmetic

2014-04-07 Thread Avi Kivity

On 04/07/2014 06:17 PM, Peter Maydell wrote:

On 7 April 2014 15:56, Avi Kivity a...@cloudius-systems.com wrote:

On 04/06/2014 01:18 PM, Peter Maydell wrote:

The alternative would be to say that Int128 should have
undefined behaviour on underflow/overflow and the test
code is wrong, but that doesn't seem very useful to me.

Isn't the test broken here?  It is trying to add (or shift) -2^127 and
something else, and the result truly overflows.

Well, the test code is assuming semantics as per 2s
complement arithmetic and checking various corner cases.
As I say, we could define that this is invalid and
rewrite the test cases.


It is invalid.  The test thinks that -2^127 * 2 == 0, but if a guest 
could trigger it, it would probably be a security issue.






Re: [Qemu-devel] KVM and variable-endianness guest CPUs

2014-01-28 Thread Avi Kivity

On 01/22/2014 12:22 PM, Peter Maydell wrote:

On 22 January 2014 05:39, Victor Kamensky victor.kamen...@linaro.org wrote:

Hi Guys,

Christoffer and I had a bit heated chat :) on this
subject last night. Christoffer, really appreciate
your time! We did not really reach agreement
during the chat and Christoffer asked me to follow
up on this thread.
Here it goes. Sorry, it is very long email.

I don't believe we can assign any endianity to
mmio.data[] byte array. I believe mmio.data[] and
mmio.len acts just memcpy and that is all. As
memcpy does not imply any endianity of underlying
data mmio.data[] should not either.

This email is about five times too long to be actually
useful, but the major issue here is that the data being
transferred is not just a bag of bytes. The data[]
array plus the size field are being (mis)used to indicate
that the memory transaction is one of:
  * an 8 bit access
  * a 16 bit access of some uint16_t value
  * a 32 bit access of some uint32_t value
  * a 64 bit access of some uint64_t value

exactly as a CPU hardware bus would do. It's
because the API is defined in this awkward way with
a uint8_t[] array that we need to specify how both
sides should go from the actual properties of the
memory transaction (value and size) to filling in the
array.


That is not how x86 hardware works.  Back when there was a bus, there 
were no address lines A0-A2; instead we had 8 byte enables BE0-BE7.  A 
memory transaction placed the qword address on the address lines and 
asserted the byte enables for the appropriate byte, word, dword, or 
qword, shifted for the low order bits of the address.


If you generated an unaligned access, the transaction was split into 
two, so an 8-byte write might appear as a 5-byte write followed by a 
3-byte write.  In fact, the two halves of the transaction might go to 
different devices, or one might go to a device and another to memory.


PCI works the same way.
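The byte-enable scheme described above can be sketched as follows — an illustrative model, not any real chipset's logic: a transaction puts the qword-aligned address on the bus and asserts one enable bit per byte lane, shifted by the low-order address bits, and an access that crosses a qword boundary must be split first.

```c
#include <assert.h>
#include <stdint.h>

/* One enable bit per byte lane BE0-BE7, shifted by the low 3 address
 * bits; the caller must split accesses that cross a qword boundary. */
uint8_t byte_enables(uint64_t addr, unsigned size)
{
    unsigned shift = (unsigned)(addr & 7);
    assert(shift + size <= 8);
    return (uint8_t)(((1u << size) - 1) << shift);
}

/* Split an access at the qword boundary, as the bus does for unaligned
 * transactions; returns the size of the first part. */
unsigned split_at_qword(uint64_t addr, unsigned size)
{
    unsigned first = 8 - (unsigned)(addr & 7);
    return first < size ? first : size;
}
```

For example, an 8-byte write at offset 3 becomes a 5-byte transaction (enables 0xF8) followed by a 3-byte one (enables 0x07), matching the split described above.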





Furthermore, device endianness is entirely irrelevant
for deciding the properties of mmio.data[], because the
thing we're modelling here is essentially the CPU-bus
interface. In real hardware, the properties of individual
devices on the bus are irrelevant to how the CPU's
interface to the bus behaves, and similarly here the
properties of emulated devices don't affect how KVM's
interface to QEMU userspace needs to work.

MemoryRegion's 'endianness' field, incidentally, is
a dreadful mess that we should get rid of. It is attempting
to model the property that some buses/bridges have of
doing byte-lane-swaps on data that passes through as
a property of the device itself. It would be better if we
modelled it properly, with container regions having possible
byte-swapping and devices just being devices.



No, that is not what it is modelling.

Suppose a little endian cpu writes a dword 0x12345678 to address 0 of a 
device, and read back a byte from address 0.  What value do you read back?


Some (most) devices will return 0x78, others will return 0x12. Other 
devices don't support mixed sizes at all, but many do.  PCI 
configuration space is an example; it is common to read both Device ID 
and Vendor ID with a single 32-bit transaction, but you can also read 
them separately with two 16-bit transaction.  Because PCI is 
little-endian, the Vendor ID at address 0 will be returned as the low 
word of the 32-bit read of a little-endian processor.


If you remove device endianness from memory regions, you have to pass 
the data as arrays of bytes (like the KVM interface) and let the device 
assemble words from those bytes itself, taking into consideration its 
own endianness.  What MemoryRegion's endianness does is let the device 
declare its endianness to the API and let it do all the work.
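The mixed-size example above — write dword 0x12345678 at offset 0, read back one byte at offset 0 — can be made concrete with a toy device model (illustrative only, not QEMU's MemoryRegion machinery): the device's declared endianness determines which byte lands in which register cell, and therefore what a narrower read returns.

```c
#include <assert.h>
#include <stdint.h>

enum { DEV_LE, DEV_BE };

typedef struct { uint8_t regs[8]; int endian; } Dev;

/* Scatter a value into byte cells according to device endianness. */
void dev_write(Dev *d, unsigned off, uint64_t val, unsigned size)
{
    for (unsigned i = 0; i < size; i++) {
        unsigned byte = (d->endian == DEV_LE) ? i : size - 1 - i;
        d->regs[off + i] = (uint8_t)(val >> (8 * byte));
    }
}

/* Reassemble a value from byte cells the same way. */
uint64_t dev_read(Dev *d, unsigned off, unsigned size)
{
    uint64_t val = 0;
    for (unsigned i = 0; i < size; i++) {
        unsigned byte = (d->endian == DEV_LE) ? i : size - 1 - i;
        val |= (uint64_t)d->regs[off + i] << (8 * byte);
    }
    return val;
}
```

A little-endian device reads back 0x78 for the byte at offset 0, a big-endian one 0x12 — exactly the distinction MemoryRegion's endianness field lets a device declare once instead of hand-assembling bytes.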




Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-28 Thread Avi Kivity

On 01/28/2014 01:27 AM, Benjamin Herrenschmidt wrote:

On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote:

Basically if it would be on real bus, get byte value
that corresponds to phys_addr + 0 address place
it into data[0], get byte value that corresponds to
phys_addr + 1 address place it into data[1], etc.

This just isn't how real buses work.

Actually it can be :-)


  There is no
address + 1, address + 2. There is a single address
for the memory transaction and a set of data on
data lines and some separate size information.
How the device at the far end of the bus chooses
to respond to 32 bit accesses to address X versus
8 bit accesses to addresses X through X+3 is entirely
its own business and unrelated to the CPU.

However the bus has a definition of what byte lane is the lowest in
address order. Byte order invariance is an important function of
all busses.

I think that trying to treat it any differently than an address
ordered series of bytes is going to turn into a complete and
inextricable mess.


I agree.

The two options are:

 (address, byte array, length)

and

 (address, value, word size, endianness)

the first is the KVM ABI, the second is how MemoryRegions work. Both are 
valid, but the first is more general (supports the 3-byte accesses 
sometimes generated on x86).






  (It would
be perfectly possible to have a device which when
you read from address X as 32 bits returned 0x12345678,
when you read from address X as 16 bits returned
0x9abc, returned 0x42 for an 8 bit read from X+1,
and so on. Having byte reads from X..X+3 return
values corresponding to parts of the 32 bit access
is purely a convention.)

Right, it's possible. It's also stupid and not how most modern devices
and busses work. Besides there is no reason why that can't be
implemented with Victor proposal anyway.


Right.



Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide

2014-01-14 Thread Avi Kivity

On 01/14/2014 12:48 AM, Alex Williamson wrote:

On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:

Am 13.01.2014 um 22:39 schrieb Alex Williamson alex.william...@redhat.com:


On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:

On 12.01.2014, at 08:54, Michael S. Tsirkin m...@redhat.com wrote:


On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:

On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:

On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:

On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:

On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:

On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:

On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:

On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:

On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
From: Paolo Bonzini pbonz...@redhat.com

As an alternative to commit 818f86b (exec: limit system memory
size, 2013-11-04) let's just make all address spaces 64-bit wide.
This eliminates problems with phys_page_find ignoring bits above
TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
consequently messing up the computations.

In Luiz's reported crash, at startup gdb attempts to read from address
0xffe6 to 0x inclusive.  The region it gets
is the newly introduced master abort region, which is as big as the PCI
address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
not 2^64.  But we get it anyway because phys_page_find ignores the upper
bits of the physical address.  In address_space_translate_internal then

   diff = int128_sub(section->mr->size, int128_make64(addr));
   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));

diff becomes negative, and int128_get64 booms.

The size of the PCI address space region should be fixed anyway.

Reported-by: Luiz Capitulino lcapitul...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
exec.c | 8 ++--
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 7e5ce93..f907f5f 100644
--- a/exec.c
+++ b/exec.c
@@ -94,7 +94,7 @@ struct PhysPageEntry {
#define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)

/* Size of the L2 (and L3, etc) page tables.  */
-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
+#define ADDR_SPACE_BITS 64

#define P_L2_BITS 10
#define P_L2_SIZE (1 << P_L2_BITS)
@@ -1861,11 +1861,7 @@ static void memory_map_init(void)
{
system_memory = g_malloc(sizeof(*system_memory));

-assert(ADDR_SPACE_BITS <= 64);
-
-memory_region_init(system_memory, NULL, "system",
-   ADDR_SPACE_BITS == 64 ?
-   UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
+memory_region_init(system_memory, NULL, "system", UINT64_MAX);
address_space_init(&address_space_memory, system_memory, "memory");

system_io = g_malloc(sizeof(*system_io));

This seems to have some unexpected consequences around sizing 64bit PCI
BARs that I'm not sure how to handle.

BARs are often disabled during sizing. Maybe you
don't detect BAR being disabled?

See the trace below, the BARs are not disabled.  QEMU pci-core is doing
the sizing an memory region updates for the BARs, vfio is just a
pass-through here.

Sorry, not in the trace below, but yes the sizing seems to be happening
while I/O & memory are enabled in the command register.  Thanks,

Alex

OK then from QEMU POV this BAR value is not special at all.

Unfortunately


After this patch I get vfio
traces like this:

vfio: vfio_pci_read_config(:01:10.0, @0x10, len=0x4) febe0004
(save lower 32bits of BAR)
vfio: vfio_pci_write_config(:01:10.0, @0x10, 0x, len=0x4)
(write mask to BAR)
vfio: region_del febe - febe3fff
(memory region gets unmapped)
vfio: vfio_pci_read_config(:01:10.0, @0x10, len=0x4) c004
(read size mask)
vfio: vfio_pci_write_config(:01:10.0, @0x10, 0xfebe0004, len=0x4)
(restore BAR)
vfio: region_add febe - febe3fff [0x7fcf3654d000]
(memory region re-mapped)
vfio: vfio_pci_read_config(:01:10.0, @0x14, len=0x4) 0
(save upper 32bits of BAR)
vfio: vfio_pci_write_config(:01:10.0, @0x14, 0x, len=0x4)
(write mask to BAR)
vfio: region_del febe - febe3fff
(memory region gets unmapped)
vfio: region_add febe - febe3fff [0x7fcf3654d000]
(memory region gets re-mapped with new address)
qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfebe, 0x4000, 
0x7fcf3654d000) = -14 (Bad address)
(iommu barfs because it can only handle 48bit physical addresses)

Why are you trying to program BAR addresses for dma in the iommu?

Two reasons, first I can't tell the difference between RAM and MMIO.

Why can't you? Generally memory core let you find out easily.

My MemoryListener is setup for address_space_memory and I then filter
out anything that's not memory_region_is_ram().  This still gets
through, so how do I easily 

Re: [Qemu-devel] outlined TLB lookup on x86

2013-12-08 Thread Avi Kivity

On 11/28/2013 04:12 AM, Richard Henderson wrote:

2. why not use a TLB or bigger size?  currently the TLB has 18 entries. the
TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions, I
measured this using Intel PIN. So even if the miss rate is low (say 3%) the
overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that lack of TLB contexts mean that we
wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.




You could use a generation counter to flush the TLB in O(1) by 
incrementing the counter.  That slows down the fast path though. Maybe 
you can do that for the larger second level TLB only.
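The generation-counter idea can be sketched as follows — an illustrative stand-alone TLB, not QEMU's softmmu implementation: each entry is tagged with the generation it was filled in, so bumping the counter invalidates every entry in O(1), at the cost of one extra compare on the lookup fast path (which is why it may suit a second-level TLB better).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TLB_SIZE 256

typedef struct { uint64_t vaddr; uint64_t paddr; uint32_t gen; } TlbEntry;
typedef struct { TlbEntry entries[TLB_SIZE]; uint32_t gen; } Tlb;

/* gen starts at 1 so zero-initialized entries never match */
void tlb_init(Tlb *t) { memset(t, 0, sizeof *t); t->gen = 1; }

/* O(1) flush: bump the generation, stale entries fail the compare.
 * A real implementation must handle 32-bit wraparound, e.g. by doing
 * a full clear when gen overflows. */
void tlb_flush_all(Tlb *t) { t->gen++; }

void tlb_fill(Tlb *t, uint64_t vaddr, uint64_t paddr)
{
    TlbEntry *e = &t->entries[(vaddr >> 12) % TLB_SIZE];
    e->vaddr = vaddr; e->paddr = paddr; e->gen = t->gen;
}

/* fast path pays one extra compare (e->gen != t->gen) */
int tlb_lookup(Tlb *t, uint64_t vaddr, uint64_t *paddr)
{
    TlbEntry *e = &t->entries[(vaddr >> 12) % TLB_SIZE];
    if (e->gen != t->gen || e->vaddr != vaddr)
        return 0;
    *paddr = e->paddr;
    return 1;
}
```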




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).

  
Not changing current behaviour is certainly safer, but I am still not 100%

convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.



Linux is safe, it does interrupt migration from within the interrupt 
handler.  If you do that before the device-specific EOI, you won't get 
another interrupt until programming the MSI is complete.


Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes, 
consistently faster.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?




Shouldn't it work with old userspace too? Maybe I misunderstood your intent.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:53 AM, Paolo Bonzini wrote:

On 28/11/2013 10:49, Avi Kivity wrote:

Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't get
another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes,
consistently faster.

call_rcu() has the problem of rate limiting, too.  It wasn't such a
great idea, I think.

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.



I guess srcu would work.  Do you know what's the typical write-side 
latency there? (in terms of what it waits for, not nanoseconds).




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:40 PM, Paolo Bonzini wrote:

On 28/11/2013 11:16, Avi Kivity wrote:

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.

I guess srcu would work.  Do you know what's the typical write-side
latency there? (in terms of what it waits for, not nanoseconds).

If there's no concurrent reader, it's zero if I read the code right.
Otherwise it depends:

- if there are many callbacks, only 10 of them are processed per
millisecond.  But unless there are concurrent synchronize_srcu calls
there should not be any callback at all.  If all VCPUs were to furiously
change the MSIs, the latency could go up to #vcpu/10 milliseconds.

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

So, given the very restricted usage this SRCU would have, we probably
can expect synchronize_srcu_expedited to finish its job in the
busy-looping phase, and 120 us should be the expected maximum
latency---more likely to be an order of magnitude smaller, and in very
rare cases higher.



Looks like it's a good fit then.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:10 PM, Paolo Bonzini wrote:

On 28/11/2013 12:09, Gleb Natapov wrote:

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.



You need a grace period.  Or just sleep().



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.




Isn't the latency due to interrupt migration causing long 
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


The problem occurs with assigned devices too AFAICT.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:22 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu()(Not consider the problem that Gleb pointed out 
temporarily) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get interrupt after step 2 on real HW,
but I cannot tell if guest can rely on it to be delivered exactly after
read instruction or it can be delayed by couple of instructions. Seems to me
it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.



Isn't the latency due to interrupt migration causing long
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


If MSI is delivered using KVM_SIGNAL_MSI as opposite to via an entry in
irq routing table changing MSI configuration should not cause update to
irq routing table (not saying this is what happens with current QEMU, but
theoretically there is not reason to update routing table in this case).


I see.  That pushes the problem to userspace, which uses traditional 
locking, so the problem disappears until qemu starts using rcu too to 
manage this.


There is also irqfd, however.  We could also do a KVM_UPDATE_IRQFD to 
change the payload it delivers, but that has exactly the same problems.





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity
On Tue, Nov 26, 2013 at 2:47 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 26/11/2013 13:40, Zhanghaoyu (A) wrote:
 When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread
will IOCTL return to QEMU from hypervisor, then vcpu thread ask the
hypervisor to update the irq routing table,
 in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread
is blocked for so much time to wait RCU grace period, and during this
period, this vcpu cannot provide service to VM,
 so those interrupts delivered to this vcpu cannot be handled in time,
and the apps running on this vcpu cannot be serviced too.
 It's unacceptable in some real-time scenario, e.g. telecom.

 So, I want to create a single workqueue for each VM, to asynchronously
performing the RCU synchronization for irq routing table,
 and let the vcpu thread return and VMENTRY to service VM immediately, no
more need to blocked to wait RCU grace period.
 And, I have implemented a raw patch, took a test in our telecom
environment, above problem disappeared.

 I don't think a workqueue is even needed.  You just need to use call_rcu
 to free old after releasing kvm->irq_lock.

 What do you think?

Can this cause an interrupt to be delivered to the wrong (old) vcpu?

The way Linux sets interrupt affinity, it cannot, since changing the
affinity is (IIRC) done in the interrupt handler, so the next interrupt
cannot be in flight and thus pick up the old interrupt routing table.

However it may be vulnerable in other ways.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity
On Tue, Nov 26, 2013 at 3:47 PM, Paolo Bonzini pbonz...@redhat.com wrote:

 Il 26/11/2013 14:18, Avi Kivity ha scritto:
 
  I don't think a workqueue is even needed.  You just need to use call_rcu
  to free old after releasing kvm->irq_lock.
 
  What do you think?
 
  Can this cause an interrupt to be delivered to the wrong (old) vcpu?

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.



I understood the proposal was also to eliminate the synchronize_rcu(), so
while new interrupts would see the new routing table, interrupts already in
flight could pick up the old one.
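For concreteness, the call_rcu() variant under discussion would look roughly
like the following.  This is a sketch against the kernel RCU API, not an
actual patch from the thread; in particular, the rcu_head embedded in the
routing table struct (field name "rcu") is an assumption made here.

```c
/* Sketch only: assumes kvm_irq_routing_table carries an rcu_head. */
static void free_irq_routing_table(struct rcu_head *head)
{
    struct kvm_irq_routing_table *old =
        container_of(head, struct kvm_irq_routing_table, rcu);
    kfree(old);
}

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);

    /* No synchronize_rcu(): the vcpu thread returns to the guest
     * immediately, and the old table is freed by the RCU callback once
     * all readers have finished their read-side critical sections. */
    call_rcu(&old->rcu, free_irq_routing_table);
```

The trade-off, as the thread goes on to establish, is that an interrupt
already in flight may still be delivered through the old routing table after
the update has returned to the guest.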


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

Il 26/11/2013 15:36, Avi Kivity ha scritto:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).


With synchronize_rcu(), you have the additional guarantee that any 
parallel accesses to the old routing table have completed.  Since we 
also trigger the irq from rcu context, you know that after 
synchronize_rcu() you won't get any interrupts to the old destination 
(see kvm_set_irq_inatomic()).


It's another question whether the hardware provides the same guarantee.


If you eliminate the synchronize_rcu, new interrupts would see the new
routing table, while interrupts already in flight will get a dangling
pointer.


Sure, if you drop the synchronize_rcu(), you have to add call_rcu().



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:03 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

Il 26/11/2013 15:36, Avi Kivity ha scritto:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use outdated routing table while a vcpu
or iothread that performed table update sits in synchronize_rcu().



Consider this guest code:

  write msi entry, directing the interrupt away from this vcpu
  nop
  memset(idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault.  With the change 
to call_rcu(), it may.


Now it may be that the guest does not expect this to work (PCI writes 
are posted; and interrupts can be delayed indefinitely by the pci 
fabric), but we don't know if there's a path that guarantees the guest 
something that we're taking away with this change.







Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:20 PM, Paolo Bonzini wrote:

Il 26/11/2013 16:03, Gleb Natapov ha scritto:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use outdated routing table while a vcpu
or iothread that performed table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.


An alternative path is to convince ourselves that the hardware does not 
provide the guarantees that the current code provides, and so we can 
relax them.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:28 PM, Paolo Bonzini wrote:

Il 26/11/2013 16:25, Avi Kivity ha scritto:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.



Why?



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

Il 26/11/2013 16:35, Avi Kivity ha scritto:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.


It's indeed safe, but I think there's a nice win to be had if we drop 
the assumption.



(BTW, PCI memory writes are posted, but configuration writes are not).


MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush the 
write via a read.  But I think the fact that the interrupt messages 
themselves are posted proves that it is safe.  The fact that Linux does 
interrupt migration from within the interrupt handler also shows that 
someone else believes that it is the only safe place to do it.






Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote:

On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

Il 26/11/2013 16:35, Avi Kivity ha scritto:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.

It's indeed safe, but I think there's a nice win to be had if we
drop the assumption.

I'm not arguing with that, but a minor commoent below:


(BTW, PCI memory writes are posted, but configuration writes are not).

MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush
the write via a read.  But I think the fact that the interrupt
messages themselves are posted proves that it is safe.

FYI, PCI read flushes the interrupt itself in, too.



I guess that kills the optimization then.  Maybe you can do qrcu, 
whatever that is.





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:24 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:

Il 26/11/2013 16:03, Gleb Natapov ha scritto:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use outdated routing table while a vcpu
or iothread that performed table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
  e = entry from IRQ routing table
  kvm_set_msi_irq(e, irq);
  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq




Suppose the guest did not disable_irq() and enable_irq(), but instead 
had a pci read where you have the enable_irq().  After the read you 
cannot have a stale irq (assuming the read flushes the irq all the way 
to the APIC).





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:28 PM, Paolo Bonzini wrote:

Il 26/11/2013 17:24, Gleb Natapov ha scritto:

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
  e = entry from IRQ routing table
  kvm_set_msi_irq(e, irq);
  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

Adding a disable/enable IRQs looks like a relatively big change.  But
perhaps it's not for some reason I'm missing.



Those are guest operations, which may not be there at all.



Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level?compression

2013-11-17 Thread Avi Kivity

On 11/17/2013 05:37 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:

On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote:

Michael S. Tsirkin mst at redhat.com writes:


At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
      ...

This exploits the fact the lower portion of the address space is always
filled, at least in the cases that matter to us.






Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.

Not much value in combining them -- the variable level tree check
will be dominated by the level skip logic.

IMO however skipping intermediate levels will be too rare to justify
the complexity and the doubling of the page table size -- it can
only happen in iommu setups that place memory in very high
addresses.  These ought to be rare.


Well maybe not very high address, but you can have a device with
a 64 bit bar and this will add back levels (though it would not
be slower than it is currently).

I agree the simplicity is attractive.

However I really would like some logic that can handle > 1 leaf
somehow.

Specifically both tricks break if we add a full 64 bit io region
in order to return -1 for reads on master abort.

That can be achieved by directing a missed lookup to a catch-all region.

It would be best to do this completely internally to the memory
code, without API changes.

You mean e.g. add a catch-all region to the address space?



The API already specifies a catch-all region - the container region 
catches everything if an access doesn't hit a sub-region (though I 
wanted to discourage the practice of using the same region for both I/O 
and a container), or you can have an I/O region as a subregion with the 
lowest priority, that spans the entire space, so that any access which 
misses the other sub-regions will hit the default subregion.


So we don't need API changes, just to change the internals.

I think this can be done by looking at the address space, and finding 
out what region is mapped to the very last byte in the address space, if 
any, and using that as the default if an access misses the tree.  That 
is, this region will be the target of null pointers in the tree, or if 
(addr & addr_invalid_mask) is true.


This will ensure the tree is of the smallest possible size.

Another option is to simply give up on the tree and use 
std::binary_search() on the FlatView.  It will be more cpu intensive, 
but on the other hand more cache-friendly.  tcg already caches stuff in 
the iotlb, and maybe it makes sense to add a cache for kvm lookups too 
(the advantage of the cache is that it only needs to contain addresses 
that are actually touched, so it can be very small and fast, compared to 
the radix tree which needs to cover all addresses).





Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression

2013-11-14 Thread Avi Kivity
Michael S. Tsirkin mst at redhat.com writes:

 
 At the moment, memory radix tree is already variable width, but it can
 only skip the low bits of address.
 
 This is efficient if we have huge memory regions but inefficient if we
 are only using a tiny portion of the address space.
 
 After we have built up the map, detect
 configurations where a single L2 entry is valid.
 
 We then speed up the lookup by skipping one or more levels.
 In case any levels were skipped, we might end up in a valid section
 instead of erroring out. We handle this by checking that
 the address is in range of the resulting section.
 


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
      ...

This exploits the fact the lower portion of the address space is always 
filled, at least in the cases that matter to us.








Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper

2013-11-14 Thread Avi Kivity

On 11/14/2013 03:29 PM, Stefan Hajnoczi wrote:

On Tue, Nov 12, 2013 at 01:10:24PM +0200, Avi Kivity wrote:

Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since
it doesn't have the required capabilities.

Fix by adding them.

Up until now, downstreams had to make the bridge helper executable
setuid, add the cap_net_admin capability, or they did nothing and it was
broken ;-).  CCing downstream package maintainers in case they have any
comments on this patch.


And it was, indeed, broken.




Note: this may break installing as non-root.  This is actually the right
thing to do, since not setting up the capability would result in a broken
setup.  Perhaps we need a configure flag to disable helpers.

Users who have been successfully installing QEMU would be upset if it
suddenly starts failing after this patch.  The bridge helper is a niche
feature that shouldn't cause a regression for the majority of users who
don't care about it.

If we're installing non-root then the bridge helper simply shouldn't be
installed.


Or maybe installed without the capabilities?  This way if the user 
invokes qemu with sudo, it still works as before.




Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level?compression

2013-11-14 Thread Avi Kivity

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote:

Michael S. Tsirkin mst at redhat.com writes:


At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.



I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
      ...

This exploits the fact the lower portion of the address space is always
filled, at least in the cases that matter to us.






Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.


Not much value in combining them -- the variable level tree check will 
be dominated by the level skip logic.


IMO however skipping intermediate levels will be too rare to justify the 
complexity and the doubling of the page table size -- it can only happen 
in iommu setups that place memory in very high addresses.  These ought 
to be rare.






Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level?compression

2013-11-14 Thread Avi Kivity

On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote:

Michael S. Tsirkin mst at redhat.com writes:


At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
      ...

This exploits the fact the lower portion of the address space is always
filled, at least in the cases that matter to us.






Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.

Not much value in combining them -- the variable level tree check
will be dominated by the level skip logic.

IMO however skipping intermediate levels will be too rare to justify
the complexity and the doubling of the page table size -- it can
only happen in iommu setups that place memory in very high
addresses.  These ought to be rare.


Well maybe not very high address, but you can have a device with
a 64 bit bar and this will add back levels (though it would not
be slower than it is currently).

I agree the simplicity is attractive.

However I really would like some logic that can handle > 1 leaf
somehow.

Specifically both tricks break if we add a full 64 bit io region
in order to return -1 for reads on master abort.


That can be achieved by directing a missed lookup to a catch-all region.

It would be best to do this completely internally to the memory code, 
without API changes.




[Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper

2013-11-12 Thread Avi Kivity
Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since
it doesn't have the required capabilities.

Fix by adding them.

Note: this may break installing as non-root.  This is actually the right
thing to do, since not setting up the capability would result in a broken
setup.  Perhaps we need a configure flag to disable helpers.

Signed-off-by: Avi Kivity a...@cloudius-systems.com
---
 Makefile  | 3 +++
 rules.mak | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Makefile b/Makefile
index b15003f..af7748c 100644
--- a/Makefile
+++ b/Makefile
@@ -188,6 +188,7 @@ qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a 
libqemustub.a
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a
 
+qemu-bridge-helper$(EXESUF).capabilities = cap_net_admin
 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
 
 fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o 
fsdev/virtio-9p-marshal.o libqemuutil.a libqemustub.a
@@ -345,6 +346,8 @@ endif
 ifneq ($(HELPERS-y),)
$(INSTALL_DIR) $(DESTDIR)$(libexecdir)
$(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) $(DESTDIR)$(libexecdir)
+   $(SETCAP) $(foreach helper, $(HELPERS-y), \
+   $($(helper).capabilities)+pe $(DESTDIR)$(libexecdir)/$(helper))
 endif
 ifneq ($(BLOBS),)
set -e; for x in $(BLOBS); do \
diff --git a/rules.mak b/rules.mak
index 49edb9b..9194c79 100644
--- a/rules.mak
+++ b/rules.mak
@@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach var,$(nested-vars),$(dir 
$($(var))
 $(foreach var,$(nested-vars), $(eval \
   -include $(addsuffix *.d, $(sort $(dir $($(var)))
 endef
+
+SETCAP = setcap
\ No newline at end of file
-- 
1.8.3.1
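For reference, the $(SETCAP) install rule above amounts to running the
following by hand (illustrative commands only: the path depends on the
configured libexecdir, and setcap requires root):

```
# what the Makefile's $(SETCAP) line does at install time
sudo setcap cap_net_admin+pe /usr/local/libexec/qemu-bridge-helper
# verify the capability was applied
getcap /usr/local/libexec/qemu-bridge-helper
```

With cap_net_admin in the permitted and effective sets, the helper can
create tap devices and attach them to a bridge without being setuid root.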




Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper

2013-11-12 Thread Avi Kivity
On Nov 12, 2013 9:23 PM, Eric Blake ebl...@redhat.com wrote:

 On 11/12/2013 04:10 AM, Avi Kivity wrote:
  Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper
since
  it doesn't have the required capabilities.
 
  Fix by adding them.
 
  Note: this may break installing as non-root.  This is actually the right
  thing to do, since not setting up the capability would result in a
broken
  setup.  Perhaps we need a configure flag to disable helpers.
 

  +++ b/rules.mak
  @@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach
var,$(nested-vars),$(dir $($(var))
   $(foreach var,$(nested-vars), $(eval \
 -include $(addsuffix *.d, $(sort $(dir $($(var)))
   endef
  +
  +SETCAP = setcap
  \ No newline at end of file

 Is this intentional?

No.


 --
 Eric Blake   eblake redhat com+1-919-301-3266
 Libvirt virtualization library http://libvirt.org



Re: [Qemu-devel] [PATCH] softfloat: rebase to version 2a

2013-04-30 Thread Avi Kivity
On Apr 29, 2013 9:06 PM, Anthony Liguori aligu...@us.ibm.com wrote:

 In order to make this change, we need to relicense all contributions
 from initial import of the SoftFloat code to match the license of
 SoftFloat-2a (instead of the implied SoftFloat-2b license).


Acked-by: Avi Kivity avi.kiv...@gmail.com


Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-19 Thread Avi Kivity
On Tue, Feb 19, 2013 at 4:41 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Feb 14, 2013 at 08:23:04PM +0200, Avi Kivity wrote:
 On Thu, Feb 14, 2013 at 8:12 PM, Michael S. Tsirkin m...@redhat.com wrote:
 
  Is there an actual real problem that needs fixing?
 
  Yes. Guests sometimes cause device BARs to temporary overlap
  the APIC range during BAR sizing. It works fine on a physical
  system but fails on KVM since pci has same priority.
 
  See the report:
  [BUG] Guest OS hangs on boot when 64bit BAR present
 

 Is PCI_COMMAND_MEMORY set while this is going on?

 I think Linux never clears PCI_COMMAND_MEMORY because
 it's buggy in some devices.

Ok.  Then I recommend defining the MSI message area as overlapped with
sufficient priority.  It should probably be a child of the PCI address
space.

The IOAPIC is actually closer to ISA, but again it's sufficient to
move it to the PCI address space.  I doubt its priority matters.



Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-19 Thread Avi Kivity
On Tue, Feb 19, 2013 at 6:08 PM, Michael S. Tsirkin m...@redhat.com wrote:

 The IOAPIC is actually closer to ISA, but again it's sufficient to
 move it to the PCI address space.  I doubt its priority matters.

 Well moving IOAPIC to PCI seems strange, it's not a PCI thing,
 and I think it can be moved outside PCI though guests don't do it.

Look at the 440fx/piix datasheets.  It's connected to the piix which
decodes its address.  So it's definitely part of the pci address
space.


 So I think ideally we really should have it look something like:

 sysbus -> ioapic
        -> pci -> msi




Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-14 Thread Avi Kivity
On Thu, Feb 14, 2013 at 3:09 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Feb 14, 2013 at 12:56:02PM +, Peter Maydell wrote:
 On 14 February 2013 12:45, Michael S. Tsirkin m...@redhat.com wrote:
  overlap flag in the region is currently unused, most devices have no
  idea whether their region overlaps with anything, so drop it,
  assume that all regions can overlap and always require priority.

 Devices themselves shouldn't care, for the most part -- they just
 provide a memory region and it's their parent that has to map it
 and know whether it overlaps or not. Similarly, parents should
 generally be in control of the container they're mapping the
 memory region into, and know whether it will be an overlapping
 map or not.

  It's also not clear how should devices allocate priorities.

 Up to the parent which controls the region being mapped into.

 We could just assume same priority as parent but what happens if it
 has to be different?

Priority is only considered relative to siblings.  The parent's
priority is only considered wrt the parent's siblings, not its
children.

 There are also aliases so a region
 can have multiple parents. Presumably it will have to have
 different priorities depending on what the parent does?

The alias region has its own priority.

 Here's a list of instances using priority != 0.

 hw/armv7m_nvic.c:MEMORY_PRIO_LOW);
 hw/cirrus_vga.c:MEMORY_PRIO_LOW);
 hw/cirrus_vga.c:MEMORY_PRIO_LOW);
 hw/cirrus_vga.c:s-low_mem_container, 
 MEMORY_PRIO_LOW);
 hw/kvm/pci-assign.c: r_dev-mmio, MEMORY_PRIO_LOW);
 hw/kvmvapic.c:memory_region_add_subregion(as, rom_paddr, s-rom, 
 MEMORY_PRIO_HIGH);
 hw/lpc_ich9.c:MEMORY_PRIO_LOW);
 hw/onenand.c:s-mapped_ram, MEMORY_PRIO_LOW);
 hw/pam.c:MEMORY_PRIO_LOW);
 hw/pc.c:MEMORY_PRIO_LOW);
 hw/pc_sysfw.c:isa_bios, MEMORY_PRIO_LOW);
 hw/pc_sysfw.c:isa_bios, MEMORY_PRIO_LOW);
 hw/pci/pci.c:MEMORY_PRIO_LOW);
 hw/pci/pci_bridge.c:memory_region_add_subregion(parent_space, base, 
 alias, MEMORY_PRIO_LOW);
 hw/piix_pci.c:MEMORY_PRIO_LOW);
 hw/piix_pci.c:d-rcr_mem, MEMORY_PRIO_LOW);
 hw/q35.c:mch-smram_region, MEMORY_PRIO_LOW);
 hw/vga-isa.c:MEMORY_PRIO_LOW);
 hw/vga.c:MEMORY_PRIO_MEDIUM);
 hw/vga.c:vga_io_memory, MEMORY_PRIO_LOW);
 hw/xen_pt_msi.c:MEMORY_PRIO_MEDIUM); /* 
 Priority: pci default + 1

 Making priority relative to parent but not the same just seems like a recipe 
 for disaster.

 I definitely don't like making the priority argument mandatory:
 this is just introducing pointless boilerplate for the common
 case where nothing overlaps and you know nothing overlaps.

 Non overlapping is not a common case at all.  E.g. with normal PCI
 devices you have no way to know nothing overlaps - addresses are guest
 programmable.

Non overlapping is mostly useful for embedded platforms.
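The "priority is only considered relative to siblings" point can be
sketched against QEMU's memory API.  This fragment only illustrates the
shape of the calls (region names, sizes, and the surrounding state are
invented; it is not compilable on its own):

```c
/* A container with a background region at the default priority and an
 * overlay on top of it, as in the NVIC example discussed below. */
memory_region_init(&container, owner, "container", 0x1000);
memory_region_init_io(&background, owner, &bg_ops, s, "background", 0x1000);
memory_region_init_alias(&overlay, owner, "overlay", &gic_mem, 0x100, 0x100);

/* priority 0 (implicit): the background region */
memory_region_add_subregion(&container, 0, &background);
/* priority 1: wins over the background where the two overlap;
 * this value is only ever compared against its siblings in
 * "container", never against anything outside it. */
memory_region_add_subregion_overlap(&container, 0x100, &overlay, 1);
```

Whatever priority "container" itself is mapped with in its own parent has
no effect on how these two subregions resolve against each other.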



Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-14 Thread Avi Kivity
On Thu, Feb 14, 2013 at 4:02 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Feb 14, 2013 at 01:22:18PM +, Peter Maydell wrote:
 On 14 February 2013 13:09, Michael S. Tsirkin m...@redhat.com wrote:
  On Thu, Feb 14, 2013 at 12:56:02PM +, Peter Maydell wrote:
  Up to the parent which controls the region being mapped into.
 
  We could just assume same priority as parent

 Er, no. I mean the code in control of the parent MR sets the
 priority, when it calls memory_region_add_subregion_overlap().

  but what happens if it
  has to be different? There are also aliases so a region
  can have multiple parents.

 The alias has its own priority.

 Well that's the status quo. One of the issues is, you have
 no idea what else uses each priority. With this change,
 at least you can grep for it.

The question what priorities do aliases of this region have is not
an interesting question.  Priority is a local attribute, not an
attribute of the region being prioritized.


  Presumably it will have to have
  different priorities depending on what the parent does?
  Here's a list of instances using priority != 0.
 
  hw/armv7m_nvic.c:MEMORY_PRIO_LOW);

 So this one I know about, and it's a good example of what
 I'm talking about. This function sets up a container memory
 region (nvic), and it is in total control of what is
 mapped into that container. Specifically, it puts in a
 nvic_sysregs background region which covers the whole
 0x1000 size of the container (at an implicit priority of
 zero). It then layers over that an alias of the GIC
 registers (nvic-gic) at a specific address and with
 a priority of 1 so it appears above the background region.
 Nobody else ever puts anything in this container, so
 the only thing we care about is that the priority of
 the nvic-gic region is higher than that of the nvic_sysregs
 region; and it's clear from the code that we do that.
 Priority is a local question whose meaning is only relevant
 within a particular container region, not system-wide,
 and
 having system-wide MEMORY_PRIO_ defines obscures that IMHO.

 Well that's not how it seems to work, and I don't see how it *could*
 work. Imagine the specific example: ioapic and pci devices. ioapic has
 an address within the pci hole but it is not a subregion.
 If priority has no meaning how would you decide which one
 to use?

Like PMM said.  You look at the semantics of the hardware, and map
that onto the API.  If the pci controller says that BARs hide the
ioapic, then you give them higher priority.  If it says that the
ioapic hides BARs, then that gets higher priority.  If it doesn't say
anything, take your pick (or give them the same priority).
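The resolution rule described here — within one container, the highest-priority sibling containing the address wins, and priorities mean nothing outside that container — can be sketched as a toy model (the struct and function names are ours, not QEMU's, and a real lookup also recurses into the winning subregion):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One sibling inside a single container.  Priority is only ever
 * compared against the other entries of the same array. */
struct subregion {
    uint64_t addr;   /* offset within the container */
    uint64_t size;
    int priority;
    int id;
};

/* Return the id of the sibling that claims 'addr', or -1 for a hole.
 * Highest priority wins; earlier entries win ties. */
static int resolve(const struct subregion *subs, size_t n, uint64_t addr)
{
    int best = -1, best_prio = 0;
    for (size_t i = 0; i < n; i++) {
        if (addr >= subs[i].addr && addr < subs[i].addr + subs[i].size &&
            (best == -1 || subs[i].priority > best_prio)) {
            best = subs[i].id;
            best_prio = subs[i].priority;
        }
    }
    return best;
}
```

Modeled on the NVIC example above: a background region at priority 0 plus an overlay at priority 1 resolves to the overlay only where the two actually overlap, and the numbers have no meaning anywhere else in the address space.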


 Also, on a PC many addresses are guest programmable. We need to behave
 in some defined way if guest programs addresses to something silly.

That's why _overlap exists.

 The only reason it works sometimes is because some systems
 use fixed addresses which never overlap.

That's why the no overlap API exists.



  I definitely don't like making the priority argument mandatory:
  this is just introducing pointless boilerplate for the common
  case where nothing overlaps and you know nothing overlaps.
 
  Non overlapping is not a common case at all.  E.g. with normal PCI
  devices you have no way to know nothing overlaps - addresses are guest
  programmable.

 That means PCI is a special case :-)
 If the guest can
 program overlap then presumably PCI specifies semantics
 for what happens then, and there need to be PCI specific
 wrappers that enforce those semantics and they can call
 the relevant _overlap functions when mapping things.
 In any case this isn't a concern for the PCI *device*,
 which can just provide its memory regions. It's a problem
 the PCI *host adaptor* has to deal with when it's figuring
 out how to map those regions into the container which
 corresponds to its area of the address space.

 Issue is, a PCI device overlapping something else suddenly
 becomes this something else's problem.

It is not a problem at all.


  Maybe we should take the printf() about subregion collisions
  in memory_region_add_subregion_common() out of the #if 0
  that it currently sits in?

  This is just a debugging tool, it won't fix anything.

 It might tell us what bits of code are currently erroneously
 mapping regions that overlap without using the _overlap()
 function. Then we could fix them.

 -- PMM

 When there is a single guest programmable device,
 any address can be overlapped by it.

No.  Only addresses within the same container.  Other containers work
fine without overlap.

 We could invent rules like 'non overlappable is higher
 priority' but it seems completely arbitrary, a single
 priority is clearer.

It's just noise for the xx% of cases which don't need it.



Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-14 Thread Avi Kivity
On Thu, Feb 14, 2013 at 4:40 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Feb 14, 2013 at 04:14:39PM +0200, Avi Kivity wrote:

 But some parents are system created and shared by many devices so children for
 such have no idea who their siblings are.

 Please take a look at the typical map in this mail:
 '[BUG] Guest OS hangs on boot when 64bit BAR present'

 system overlap 0 pri 0 [0x0 - 0x7fff]
  kvmvapic-rom overlap 1 pri 1000 [0xca000 - 0xcd000]
  pc.ram overlap 0 pri 0 [0xca000 - 0xcd000]
  ++ pc.ram [0xca000 - 0xcd000] is added to view
  
  smram-region overlap 1 pri 1 [0xa - 0xc]
  pci overlap 0 pri 0 [0xa - 0xc]
  cirrus-lowmem-container overlap 1 pri 1 [0xa - 0xc]
  cirrus-low-memory overlap 0 pri 0 [0xa - 0xc]
 ++cirrus-low-memory [0xa - 0xc] is added to view
  kvm-ioapic overlap 0 pri 0 [0xfec0 - 0xfec01000]
 ++kvm-ioapic [0xfec0 - 0xfec01000] is added to view
  pci-hole64 overlap 0 pri 0 [0x1 - 0x4001]
  pci overlap 0 pri 0 [0x1 - 0x4001]
  pci-hole overlap 0 pri 0 [0x7d00 - 0x1]
  pci overlap 0 pri 0 [0x7d00 - 0x1]
  ivshmem-bar2-container overlap 1 pri 1 [0xfe00 - 0x1]
  ivshmem.bar2 overlap 0 pri 0 [0xfe00 - 0x1]
 ++ivshmem.bar2 [0xfe00 - 0xfec0] is added to view
 ++ivshmem.bar2  [0xfec01000 - 0x1] is added to view

 As you see, ioapic at 0xfec0 overlaps pci-hole.
 ioapic is guest programmable in theory - should use _overlap?
 pci-hole is not but can overlap with ioapic.
 So also _overlap?

It's a bug.  The ioapic is in the pci address space, not the system
address space.  And yes it's overlappable.


 Let's imagine someone writes a guest programmable device for
 ARM. Now we should update all ARM devices from regular to _overlap?

It's sufficient to update the programmable device.

 
  Non overlapping is not a common case at all.  E.g. with normal PCI
  devices you have no way to know nothing overlaps - addresses are guest
  programmable.

 Non overlapping is mostly useful for embedded platforms.

 Maybe it should have a longer name like _nonoverlap then?
 Current API makes people assume _overlap is only for special
 cases and default should be non overlap.

The assumption is correct.



Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-14 Thread Avi Kivity
On Thu, Feb 14, 2013 at 6:50 PM, Michael S. Tsirkin m...@redhat.com wrote:
  As you see, ioapic at 0xfec0 overlaps pci-hole.
  ioapic is guest programmable in theory - should use _overlap?
  pci-hole is not but can overlap with ioapic.
  So also _overlap?

 It's a bug.  The ioapic is in the pci address space, not the system
 address space.  And yes it's overlappable.

 So you want to put it where? Under pci-hole?

No, under the pci address space.  Look at the 440fx block diagram.

 And we'll have to teach all machine types
 creating pci-hole about it?

No.


 
  Let's imagine someone writes a guest programmable device for
  ARM. Now we should update all ARM devices from regular to _overlap?

 It's sufficient to update the programmable device.

 Then the device can be higher priority (works for apic)
 but not lower priority. Make priority signed?

Is there an actual real problem that needs fixing?



Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant

2013-02-14 Thread Avi Kivity
On Thu, Feb 14, 2013 at 8:12 PM, Michael S. Tsirkin m...@redhat.com wrote:

 Is there an actual real problem that needs fixing?

 Yes. Guests sometimes cause device BARs to temporarily overlap
 the APIC range during BAR sizing. It works fine on a physical
 system but fails on KVM since pci has same priority.

 See the report:
 [BUG] Guest OS hangs on boot when 64bit BAR present


Is PCI_COMMAND_MEMORY set while this is going on?
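For context, the BAR sizing sequence under discussion writes all-ones to a BAR and reads back the size mask; a well-behaved guest first clears the Command register's memory-space enable bit so the transient all-ones address never decodes, which is why the question matters. A toy model of the sequence (names are ours, and real BARs keep type/prefetch bits in the low nibble, which we ignore):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2   /* memory-space enable bit */

/* Toy model of one 32-bit memory BAR. */
struct toy_dev {
    uint16_t command;
    uint32_t bar;    /* value the guest last wrote, size-aligned */
    uint32_t size;   /* power of two */
};

static void bar_write(struct toy_dev *d, uint32_t val)
{
    d->bar = val & ~(d->size - 1);   /* low bits read back as zero */
}

/* The canonical sizing sequence: stop decoding, probe with all-ones,
 * derive the size from the read-back mask, restore, re-enable. */
static uint32_t size_bar(struct toy_dev *d)
{
    uint32_t saved = d->bar;
    d->command &= (uint16_t)~PCI_COMMAND_MEMORY;
    bar_write(d, 0xffffffffu);
    uint32_t mask = d->bar;
    bar_write(d, saved);
    d->command |= PCI_COMMAND_MEMORY;
    return ~mask + 1;   /* lowest writable bit gives the size */
}
```

If a guest skips the disable step, the BAR transiently decodes at an all-ones address that can land on top of other regions — the overlap situation this thread is about.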



Re: [Qemu-devel] RFC migration of zero pages

2013-01-31 Thread Avi Kivity
On Thu, Jan 31, 2013 at 4:30 PM, Gleb Natapov g...@redhat.com wrote:
 
  Avi/Michael do you remember why mincore can't be used to check if a guest
 page is unmapped?

 A page may be not in core, but also nonzero (for example swap).
 Yes, mincore() should not be used as zero page check, but it can be used
 as an indicator that page can be dealt with in dead stage of migration
 since it is likely zero and will not take much time to send.

Or it's swapped, in which case we know nothing about it.

 It is
 possible to call madvise(MADV_WILLNEED) on them meanwhile to pre-load
 swap without faulting on each page individually.

During migration you're faulting like mad anyway.



Re: [Qemu-devel] RFC migration of zero pages

2013-01-31 Thread Avi Kivity
On Jan 31, 2013 4:03 PM, Peter Lieven p...@dlhnet.de wrote:


 On 31.01.2013 14:59, Avi Kivity avi.kiv...@gmail.com wrote:


 On Jan 31, 2013 12:29 PM, Orit Wasserman owass...@redhat.com wrote:
 
  On 01/31/2013 11:48 AM, Gleb Natapov wrote:
   On Thu, Jan 31, 2013 at 09:47:24AM +0200, Orit Wasserman wrote:
   On 01/31/2013 08:57 AM, Peter Lieven wrote:
   Hi,
  
   I just came across an idea and would like to have feedback if it
makes sense or not.
  
   If a VM is started without preallocated memory all memory that has
not been written to
   reads as zeros, right?
   Hi,
   No the memory will be unmapped (we allocate on demand).
   If a VM with a lot of unwritten memory is migrated or if the
memory contains a lot
   of zeroed out memory (e.g. Windows or Linux guest with page
sanitization) all this memory
   is allocated on the target during live migration. Especially with
KSM this leads
   to the problem that this memory is allocated and might be not
available completely as
   merging of the pages will happen async.
  
   Wouldn't it make sense to not send zero pages in the first round
where the complete
   ram is sent (if it is detectable that we are in this stage)?
   We send one byte per zero page at the moment (see is_dup_page) we
can further optimize it
   by not sending it.
   I have to point out that this is a very idle guest and we need to
work on a loaded guest
   which is the harder problem in migration.
  
   Also I notice that the bottleneck in migrating unmapped pages is
the detection of those pages
   because we map the pages in order to check them, for a large guest
this is very expensive as mapping a page
   results in a page fault in the host.
   So what will be very helpful is actually locating those pages
without mapping them
   which looks very complicated.
  
   What is wrong with mincore()?
 
  Avi/Michael do you remember why mincore can't be used to check if a
guest page is unmapped?

 A page may be not in core, but also nonzero (for example swap).

 What about huge pages? AFAIK, those are not swappable - at least they
were not at some point in the past.
 Maybe it would be possible to have a performance improvement if they are
used.


Nobody uses them, now that transparent huge pages are available.


Re: [Qemu-devel] RFC migration of zero pages

2013-01-31 Thread Avi Kivity
On Thu, Jan 31, 2013 at 4:42 PM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jan 31, 2013 at 04:36:25PM +0200, Avi Kivity wrote:
 On Thu, Jan 31, 2013 at 4:30 PM, Gleb Natapov g...@redhat.com wrote:
  
   Avi/Michael do you remember why mincore can't be used to check if a 
   guest
  page is unmapped?
 
  A page may be not in core, but also nonzero (for example swap).
  Yes, mincore() should not be used as zero page check, but it can be used
  as an indicator that page can be dealt with in dead stage of migration
  since it is likely zero and will not take much time to send.

 Or it's swapped, in which case we know nothing about it.

  It is
  Of course, that is why I said 'likely' :)

  possible to call madvise(MADV_WILLNEED) on them meanwhile to pre-load
  swap without faulting on each page individually.

 During migration you're faulting like mad anyway.
 That's guest faulting on dirty bit logging, as far as I understand Orit
 says that in addition to that she sees a lot of faults generated by
 migration reading unmapped guest memory. She wants to eliminate at least
 those.

For that, we once talked about extending mincore() to return info
whether a page is known zero (anonymous memory without a backing page,
or backing page == zero page).  But I doubt it's worth it, yes we're
faulting a lot, but it's still a lot cheaper than actually sending a
page, so we're still ahead of a non-idle guest.



Re: [Qemu-devel] RFC migration of zero pages

2013-01-31 Thread Avi Kivity
On Jan 31, 2013 12:29 PM, Orit Wasserman owass...@redhat.com wrote:

 On 01/31/2013 11:48 AM, Gleb Natapov wrote:
  On Thu, Jan 31, 2013 at 09:47:24AM +0200, Orit Wasserman wrote:
  On 01/31/2013 08:57 AM, Peter Lieven wrote:
  Hi,
 
  I just came across an idea and would like to have feedback if it
makes sense or not.
 
  If a VM is started without preallocated memory all memory that has
not been written to
  reads as zeros, right?
  Hi,
  No the memory will be unmapped (we allocate on demand).
  If a VM with a lot of unwritten memory is migrated or if the memory
contains a lot
  of zeroed out memory (e.g. Windows or Linux guest with page
sanitization) all this memory
  is allocated on the target during live migration. Especially with KSM
this leads
  to the problem that this memory is allocated and might be not
available completely as
  merging of the pages will happen async.
 
  Wouldn't it make sense to not send zero pages in the first round
where the complete
  ram is sent (if it is detectable that we are in this stage)?
  We send one byte per zero page at the moment (see is_dup_page) we can
further optimize it
  by not sending it.
  I have to point out that this is a very idle guest and we need to work
on a loaded guest
  which is the harder problem in migration.
 
  Also I notice that the bottleneck in migrating unmapped pages is the
detection of those pages
  because we map the pages in order to check them, for a large guest
this is very expensive as mapping a page
  results in a page fault in the host.
  So what will be very helpful is actually locating those pages without
mapping them
  which looks very complicated.
 
  What is wrong with mincore()?

 Avi/Michael do you remember why mincore can't be used to check if a guest
page is unmapped?

A page may be not in core, but also nonzero (for example swap).


Re: [Qemu-devel] RFC migration of zero pages

2013-01-31 Thread Avi Kivity
On Jan 31, 2013 11:14 PM, Michael S. Tsirkin m...@redhat.com wrote:

 On Thu, Jan 31, 2013 at 05:20:02PM +0200, Avi Kivity wrote:
  On Thu, Jan 31, 2013 at 4:42 PM, Gleb Natapov g...@redhat.com wrote:
   On Thu, Jan 31, 2013 at 04:36:25PM +0200, Avi Kivity wrote:
   On Thu, Jan 31, 2013 at 4:30 PM, Gleb Natapov g...@redhat.com
wrote:

 Avi/Michael do you remember why mincore can't be used to check
if a guest
page is unmapped?
   
A page may be not in core, but also nonzero (for example swap).
Yes, mincore() should not be used as zero page check, but it can
be used
as an indicator that page can be dealt with in dead stage of
migration
since it is likely zero and will not take much time to send.
  
   Or it's swapped, in which case we know nothing about it.
  
It is
    Of course, that is why I said 'likely' :)
  
possible to call madvise(MADV_WILLNEED) on them meanwhile to
pre-load
swap without faulting on each page individually.
  
   During migration you're faulting like mad anyway.
   That's guest faulting on dirty bit logging, as far as I understand
Orit
   says that in addition to that she sees a lot of faults generated by
   migration reading unmapped guest memory. She wants to eliminate at
least
   those.
 
  For that, we once talked about extending mincore() to return info
  whether a page is known zero (anonymous memory without a backing page,
  or backing page == zero page).

 Shouldn't be hard to expose in /proc, no?

Unfortunately.


  But I doubt it's worth it, yes we're
  faulting a lot, but it's still a lot cheaper than actually sending a
  page, so we're still ahead of a non-idle guest.

 It's not just the fault; reading in the page is bad for the cache.


There's just one zero page.


Re: [Qemu-devel] [PATCH 1.3] build: compile translate.o at -O1 optimization

2012-11-27 Thread Avi Kivity
On 11/27/2012 10:34 AM, Paolo Bonzini wrote:
 Some versions of GCC require insane (2GB) amounts of memory
 to compile translate.o.  As a countermeasure, compile it
 with -O1.  This should fix the buildbot failure for
 default_x86_64_fedora16.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  Makefile.target | 2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/Makefile.target b/Makefile.target
 index 8b658c0..3981931 100644
 --- a/Makefile.target
 +++ b/Makefile.target
 @@ -143,6 +143,8 @@ GENERATED_HEADERS += hmp-commands.h qmp-commands-old.h
  
  endif # CONFIG_SOFTMMU
  
 +%/translate.o: CFLAGS := $(patsubst -O2,-O1,$(CFLAGS))
 +

This may change some string argument in CFLAGS, for example an argument
to -I.

How about:

  CFLAGS_opt = -O2
  CFLAGS += $(CFLAGS_opt)
  ...
  %/translate.o: CFLAGS_opt = -O1


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration

2012-11-25 Thread Avi Kivity
On 09/19/2012 02:50 PM, Julien Grall wrote:
 This is the tenth version of patch series about ioport registration.
 
 Some part of QEMU still use register_ioport* functions to register ioport.
 These functions doesn't allow to use Memory Listener on it.

Acked-by: Avi Kivity a...@redhat.com

Unfortunately, I no longer have the time to maintain the memory patches.
 Anthony, please apply directly.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration

2012-11-25 Thread Avi Kivity
On 11/24/2012 03:09 AM, Julien Grall wrote:
 ping ?
 

Sorry, I must have missed this (twice).


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions

2012-11-25 Thread Avi Kivity
On 11/05/2012 02:37 PM, Jan Kiszka wrote:
 
 As I noted, init and destroy cannot cause a topology update.
 
 Ah, right. Why are we wrapping them in transaction_begin/commit at all then?
 

We aren't.


void memory_region_destroy(MemoryRegion *mr)
{
assert(QTAILQ_EMPTY(&mr->subregions));
assert(memory_region_transaction_depth == 0);


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions

2012-11-25 Thread Avi Kivity
On 11/25/2012 12:45 PM, Jan Kiszka wrote:
 On 2012-11-25 11:18, Avi Kivity wrote:
 On 11/05/2012 02:37 PM, Jan Kiszka wrote:

 As I noted, init and destroy cannot cause a topology update.

 Ah, right. Why are we wrapping them in transaction_begin/commit at all then?

 
 We aren't.
 
 
 void memory_region_destroy(MemoryRegion *mr)
 {
 assert(QTAILQ_EMPTY(&mr->subregions));
 assert(memory_region_transaction_depth == 0);
 
 
 We were talking about address_space_init/destroy.

This is to force a re-rendering of the address space, so that listeners
see the construction/destruction.  Simply assigning as->root wouldn't do
that.

This kind of reliance on side effects should be documented with a
comment (and forbidden to anything that is outside the implementation).
 My bad.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration

2012-11-25 Thread Avi Kivity
On 11/25/2012 02:42 PM, Andreas Färber wrote:
 On 25.11.2012 10:10, Avi Kivity wrote:
 On 09/19/2012 02:50 PM, Julien Grall wrote:
 This is the tenth version of patch series about ioport registration.

 Some part of QEMU still use register_ioport* functions to register ioport.
 These functions doesn't allow to use Memory Listener on it.
 
 Acked-by: Avi Kivity a...@redhat.com
 
 Unfortunately, I no longer have the time to maintain the memory patches.
  Anthony, please apply directly.
 
 Are they for 1.3? 

They are not.

 If not, I can offer to handle them since I have an
 upcoming qdev/QOM series that I would have to rebase on this.
 

Great, thanks.

 Good luck with your new project, Avi!

Thanks for that as well!

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration

2012-11-25 Thread Avi Kivity
On 11/25/2012 05:01 PM, Andreas Färber wrote:
 
 Note that there are still some register_ioport_*() usages left:
 $ git grep register_ioport -- hw/ | grep -v isa_register_ioport | wc --lines
 32
 

For the record, my plan for when that number drops to zero was to use
address_space_rw() (or friends) directly from the ioport helpers or kvm
code.  There should not be an address_space_io for non-x86; the I/O
address space would be just a memory region under a bridge somewhere.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [BUG] qemu crash when using lsilogic

2012-11-20 Thread Avi Kivity
On 11/20/2012 11:01 AM, Wanlong Gao wrote:

 $ git bisect good
 d22b096ef6e0b20810193b68a1d472f3fb8a4f9e is the first bad commit
 commit d22b096ef6e0b20810193b68a1d472f3fb8a4f9e
 Author: Avi Kivity a...@redhat.com
 Date:   Sun Sep 30 22:21:11 2012 +0200
 
 kvm: use separate MemoryListeners for memory and I/O
 
 The construct
 
if (address_space == get_system_memory()) {
// memory thing
} else {
// io thing
}
 
 fails if we have more than two address spaces.  Use a separate listener
 for memory and I/O, and utilize MemoryListener's address space filtering 
 to
 fix this.
 
 Signed-off-by: Avi Kivity a...@redhat.com
 
 :100644 100644 92a71374ed1e040cef5ad70a6cb00adabf671dd4 
 c69e01200461c7a87440f7a915bd171a9fc8f318 M   kvm-all.c
 
 
  Oops, I didn't find any error in the above patch; can you guys help to
  investigate this bug?
 
 I confirmed again and found that lsi can't work on the upstream qemu.
 Any thoughts?
 

It appears to be a bug in SeaBIOS:

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fffebfff700 (LWP 19032))]
#0  0x5586df2c in access_with_adjusted_size (addr=29563620,
value=0x7fffebffe5f0, size=4, access_size_min=0, access_size_max=0, access=
0x5586ddbb memory_region_read_accessor, opaque=0x5655dd00)
at /home/tlv/akivity/qemu/memory.c:349
349 {
(gdb) bt
#0  0x5586df2c in access_with_adjusted_size (addr=29563620,
value=0x7fffebffe5f0, size=4, access_size_min=0, access_size_max=0, access=
0x5586ddbb memory_region_read_accessor, opaque=0x5655dd00)
at /home/tlv/akivity/qemu/memory.c:349
#1  0x55870a4e in memory_region_dispatch_read1
(mr=0x5655dd00, addr=29563620, size=4) at
/home/tlv/akivity/qemu/memory.c:862
#2  0x55870b3e in memory_region_dispatch_read
(mr=0x5655dd00, addr=29563620, size=4) at
/home/tlv/akivity/qemu/memory.c:894
#3  0x55873c2d in io_mem_read (mr=0x5655dd00, addr=29563620,
size=4) at /home/tlv/akivity/qemu/memory.c:1575
#4  0x558054ed in address_space_rw (as=0x56629d78,
addr=29563620, buf=0x7fffebffe874 , len=4, is_write=false) at
/home/tlv/akivity/qemu/exec.c:3428
#5  0x556bf4c3 in dma_memory_rw_relaxed (dma=0x56603cf0,
addr=29563620, buf=0x7fffebffe874, len=4, dir=DMA_DIRECTION_TO_DEVICE)
at /home/tlv/akivity/qemu/dma.h:130
#6  0x556bf558 in dma_memory_rw (dma=0x56603cf0,
addr=29563620, buf=0x7fffebffe874, len=4, dir=DMA_DIRECTION_TO_DEVICE)
at /home/tlv/akivity/qemu/dma.h:156
#7  0x556bf5fb in pci_dma_rw (dev=0x56629b60, addr=29563620,
buf=0x7fffebffe874, len=4, dir=DMA_DIRECTION_TO_DEVICE) at
/home/tlv/akivity/qemu/hw/pci.h:607
#8  0x556bf65b in pci_dma_read (dev=0x56629b60,
addr=29563620, buf=0x7fffebffe874, len=4) at
/home/tlv/akivity/qemu/hw/pci.h:614
#9  0x556bfc08 in read_dword (s=0x56629b60, addr=29563620)
at /home/tlv/akivity/qemu/hw/lsi53c895a.c:385
#10 0x556c1937 in lsi_execute_script (s=0x56629b60) at
/home/tlv/akivity/qemu/hw/lsi53c895a.c:1040
#11 0x556c3c82 in lsi_reg_writeb (s=0x56629b60, offset=47,
val=0 '\000') at /home/tlv/akivity/qemu/hw/lsi53c895a.c:1781
#12 0x556c513c in lsi_io_write (opaque=0x56629b60, addr=47,
val=0, size=1) at /home/tlv/akivity/qemu/hw/lsi53c895a.c:1953
#13 0x5586def2 in memory_region_write_accessor
(opaque=0x5662a340, addr=47, value=0x7fffebffeab0, size=1, shift=0,
mask=255) at /home/tlv/akivity/qemu/memory.c:334
#14 0x5586dfd4 in access_with_adjusted_size (addr=47,
value=0x7fffebffeab0, size=1, access_size_min=1, access_size_max=1,
access=0x5586de6d memory_region_write_accessor,
opaque=0x5662a340) at /home/tlv/akivity/qemu/memory.c:364
#15 0x5586e43c in memory_region_iorange_write
(iorange=0x7fffe4000ed0, offset=47, width=1, data=0) at
/home/tlv/akivity/qemu/memory.c:439
#16 0x55866acc in ioport_writeb_thunk (opaque=0x7fffe4000ed0,
addr=49199, data=0) at /home/tlv/akivity/qemu/ioport.c:212
#17 0x558664a6 in ioport_write (index=0, address=49199, data=0)
at /home/tlv/akivity/qemu/ioport.c:83
#18 0x55867046 in cpu_outb (addr=49199, val=0 '\000') at
/home/tlv/akivity/qemu/ioport.c:289
#19 0x5586a958 in kvm_handle_io (port=49199,
data=0x77ff3000, direction=1, size=1, count=1) at
/home/tlv/akivity/qemu/kvm-all.c:1423
#20 0x5586af5e in kvm_cpu_exec (env=0x5659b0b0) at
/home/tlv/akivity/qemu/kvm-all.c:1571
#21 0x557f74e4 in qemu_kvm_cpu_thread_fn (arg=0x5659b0b0) at
/home/tlv/akivity/qemu/cpus.c:757
#22 0x76727d14 in start_thread () from /lib64/libpthread.so.0
#23 0x7533667d in clone () from /lib64/libc.so.6
(gdb) fr 4
#4  0x558054ed in address_space_rw (as=0x56629d78,
addr=29563620, buf=0x7fffebffe874 , len=4, is_write=false) at
/home/tlv/akivity/qemu/exec.c:3428
3428        val = io_mem_read(section->mr, addr1, 4);
(gdb) p

Re: [Qemu-devel] [PATCH v6 1/8] atomic: introduce atomic operations

2012-11-18 Thread Avi Kivity
On 11/15/2012 09:47 AM, liu ping fan wrote:
 On Wed, Nov 14, 2012 at 5:47 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 14/11/2012 10:38, liu ping fan wrote:
 On Tue, Nov 13, 2012 at 6:11 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 05/11/2012 06:38, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 If out of global lock, we will be challenged by SMP in low level,
 so need atomic ops.

 This file is a wrapper of GCC atomic builtin.

 I still object to this.

 I know it enforces type-safety, but it is incomplete.  It doesn't

 Although it is incomplete, the remaining cases are rarely used.  Linux
 faces the same issue, and the int version is enough there, so I think
 we can borrow experience from it.

 One of the two places that use __sync_* require 64-bit accesses.  My
 Yes, these two places are not easy to fix.

 Which shows that Linux's atomic_t is not suited for QEMU, in my opinion.

 RCU prototype required pointer-sized access, which you cannot make type-
 But I think that your RCU prototype should rely on atomic of CPU, not
 gcc's atomic.

 What's the difference?  gcc's atomic produces the same instructions as
 hand-written assembly (or should).

  Probably I made a mistake here: in vhost, log =
  __sync_fetch_and_and(from, 0) is used to fetch 64 bits atomically in
  the case of a 32-bit qemu running on a 64-bit linux.  Right?  But how
  can we read 32 bits twice atomically?  It seems there is no instruction
  like lock xchg for this op, so I guess __sync_fetch_and_and() is based
  on something like a spinlock.
  
  And I think the breakage is caused by the vhost thread updating the log
  while qemu reads it out non-atomically, right?

For the log, 32-bit sync_fetch_and_and() is sufficient.  We only need to
ensure no bits are lost, we don't need 64-bit atomicity.
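A minimal sketch of this point (function names are ours; vhost's actual log layout differs): if the producer only ORs bits in and the consumer atomically fetches-and-clears word by word, per-word 32-bit atomicity is enough — no set bit can be lost even when the two halves of a 64-bit word are cleared at different times.

```c
#include <assert.h>
#include <stdint.h>

/* Read-and-clear one 32-bit word of a dirty log.  Every bit set by
 * the producer is observed by exactly one fetch-and-clear; 64-bit
 * atomicity across adjacent words is never needed. */
static uint32_t log_fetch_and_clear(volatile uint32_t *word)
{
    return __sync_fetch_and_and(word, 0);
}

static void log_set_bit(volatile uint32_t *word, unsigned bit)
{
    __sync_fetch_and_or(word, UINT32_C(1) << bit);
}
```

The same GCC __sync builtins appear in the patch series under discussion; only the read-and-clear pattern here is ours.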



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH] dma: Define dma_context_memory and use in sysbus-ohci

2012-11-18 Thread Avi Kivity
On 11/13/2012 01:44 PM, Peter Maydell wrote:
 On 26 October 2012 17:00, Peter Maydell peter.mayd...@linaro.org wrote:
 On 26 October 2012 14:09, Paolo Bonzini pbonz...@redhat.com wrote:
  As soon as Avi's iommu patches go in, in fact, dma->as will just be as.
 Even if as == NULL were to be outlawed and you'd be forced to write
 get_address_space_memory(), taking the pain to create dummy DMAContexts
 now is just not worth it.

 Personally I think it's better not to permit NULL DMAContexts or
 AddressSpaces here, because they're kind of a hack (in the same way
 that the system address space is kind of a hack). In real hardware
 you probably aren't really doing dma to that address space but to
 some more local bus. dma_context_memory/address_space_memory are
 nice and easy to search for, whereas NULL isn't. [And in general
 I think NULL is too easy to use; if you have to go looking for the
 system dma context you've been prompted to think about whether
 that's the right one...]
 
 Ping! Can we have a ruling on what the right fix for this is so
 we can fix these segfaults before 1.3 release, please?

I agree with you here.  Callers should be fixed to supply the proper
AddressSpace.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions

2012-11-05 Thread Avi Kivity
On 11/05/2012 08:26 AM, Jan Kiszka wrote:
 On 2012-11-04 20:21, Avi Kivity wrote:
  On 11/04/2012 10:30 AM, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  Cirrus is triggering this, e.g. during Win2k boot: Changes only on
  disabled regions require no topology update when transaction depth drops
  to 0 again.
  
  817dcc5368988b0 (pci: give each device its own address space) made this
  much worse by multiplying the number of address spaces.  Each change is
  now evaluated N+2 times, where N is the number of PCI devices.  It also
  causes a corresponding expansion in memory usage.

 I know... But this regression predates your changes, is already visible
 right after 02e2b95fb4.

  
  I want to address this by caching AddressSpaceDispatch trees with the
  key being the contents of the FlatView for that address space.  This
  will drop the number of distinct trees to 2-4 (3 if some devices have
  PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
  from the cpu memory address space) but will fail if we make each address
  space different (for example filtering out the device's own BARs).
  
  If this change also improves cpu usage sufficiently, then it will be
  better than your patch, which doesn't recognize changes in an enabled
  region inside a disabled or hidden region.

 True, though the question is how common such scenarios are. This one
 (cirrus with win2k) is already special.

   In other words, your patch
  fits the problem at hand but isn't general.  On the other hand my
  approach doesn't eliminate render_memory_region(), just the exec.c stuff
  and listener updates.  So we need to understand where the slowness comes
  from.

 I would just like to have some even intermediate solution for 1.3. We
 can still make it more perfect later on if required.


I think we should apply a v2 then, the more general optimizations will
take some time.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions

2012-11-05 Thread Avi Kivity
On 11/05/2012 10:51 AM, Jan Kiszka wrote:
 On 2012-11-05 09:12, Avi Kivity wrote:
  On 11/05/2012 08:26 AM, Jan Kiszka wrote:
  On 2012-11-04 20:21, Avi Kivity wrote:
  On 11/04/2012 10:30 AM, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  Cirrus is triggering this, e.g. during Win2k boot: Changes only on
  disabled regions require no topology update when transaction depth drops
  to 0 again.
 
  817dcc5368988b0 (pci: give each device its own address space) made this
  much worse by multiplying the number of address spaces.  Each change is
  now evaluated N+2 times, where N is the number of PCI devices.  It also
  causes a corresponding expansion in memory usage.
 
  I know... But this regression predates your changes, is already visible
  right after 02e2b95fb4.
 
 
  I want to address this by caching AddressSpaceDispatch trees with the
  key being the contents of the FlatView for that address space.  This
  will drop the number of distinct trees to 2-4 (3 if some devices have
  PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
  from the cpu memory address space) but will fail if we make each address
  space different (for example filtering out the device's own BARs).
 
  If this change also improves cpu usage sufficiently, then it will be
  better than your patch, which doesn't recognize changes in an enabled
  region inside a disabled or hidden region.
 
  True, though the question is how common such scenarios are. This one
  (cirrus with win2k) is already special.
 
   In other words, your patch
  fits the problem at hand but isn't general.  On the other hand my
  approach doesn't eliminate render_memory_region(), just the exec.c stuff
  and listener updates.  So we need to understand where the slowness comes
  from.
 
  I would just like to have some even intermediate solution for 1.3. We
  can still make it more perfect later on if required.
 
  
  I think we should apply a v2 then, the more general optimizations will
  take some time.

 OK - what should v2 do differently?


As I noted, init and destroy cannot cause a topology update.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [patch v5 5/8] memory: introduce local lock for address space

2012-11-05 Thread Avi Kivity
On 11/02/2012 10:00 AM, Jan Kiszka wrote:
  
  As I understand the series, as-lock == NULL means that we will never
  take any lock during dispatch as the caller is not yet ready for
  fine-grained locking. This prevents the problem - for PIO at least. But
  this series should break TCG as it calls into MMIO dispatch from the
  VCPU while holding the BQL.
 
  What about add another condition dispatch_type == DISPATCH_MMIO to
  tell this situation.

 An alternative pattern that we will also use for core services is to
 provide an additional entry point, one that indicates that the caller
 doesn't hold the BQL. Then we will gradually move things over until the
 existing entry point is obsolete.


I like it (or rather, least dislike it).  So we'd have
address_space_rw() and address_space_rw_unlocked() (or maybe,
address_space_rw() and address_space_rw_locked(), to indicate the
preferred way of doing things).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions

2012-11-04 Thread Avi Kivity
On 11/04/2012 10:30 AM, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 Cirrus is triggering this, e.g. during Win2k boot: Changes only on
 disabled regions require no topology update when transaction depth drops
 to 0 again.

817dcc5368988b0 (pci: give each device its own address space) made this
much worse by multiplying the number of address spaces.  Each change is
now evaluated N+2 times, where N is the number of PCI devices.  It also
causes a corresponding expansion in memory usage.

I want to address this by caching AddressSpaceDispatch trees with the
key being the contents of the FlatView for that address space.  This
will drop the number of distinct trees to 2-4 (3 if some devices have
PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different
from the cpu memory address space) but will fail if we make each address
space different (for example filtering out the device's own BARs).

If this change also improves cpu usage sufficiently, then it will be
better than your patch, which doesn't recognize changes in an enabled
region inside a disabled or hidden region.  In other words, your patch
fits the problem at hand but isn't general.  On the other hand my
approach doesn't eliminate render_memory_region(), just the exec.c stuff
and listener updates.  So we need to understand where the slowness comes
from.

  
 @@ -1543,6 +1554,7 @@ void address_space_init(AddressSpace *as, MemoryRegion 
 *root)
   flatview_init(as->current_map);
   QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
   as->name = NULL;
 +memory_region_update_pending = true;
  memory_region_transaction_commit();
  address_space_init_dispatch(as);
  }
 @@ -1552,6 +1564,7 @@ void address_space_destroy(AddressSpace *as)
  /* Flush out anything from MemoryListeners listening in on this */
  memory_region_transaction_begin();
   as->root = NULL;
 +memory_region_update_pending = true;
  memory_region_transaction_commit();
   QTAILQ_REMOVE(&address_spaces, as, address_spaces_link);
  address_space_destroy_dispatch(as);

init and destroy cannot happen to a region that is mapped (and cannot
happen within a transaction), so these two are redundant.



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH v2 3/7] memory: iommu support

2012-11-01 Thread Avi Kivity
On 10/31/2012 08:59 PM, Benjamin Herrenschmidt wrote:
 On Wed, 2012-10-31 at 12:32 +0200, Avi Kivity wrote:
 This has nothing to do with device endianness; we're translating from a
 byte buffer (address_space_rw()) to a uint64_t
 (MemoryRegionOps::read/write()) so we need to take host endianess into
 account.
 
 This code cleverly makes use of memory core endian support to do the
 conversion for us.  But it's probably too clever and should be replaced
 by an explicit call to a helper.
 
 Well, the whole idea of providing sized-type accessors (u8/u16/...) is
 very fishy for DMA. I understand it comes from the MemoryRegion ops, and
 It made somewhat sense in CPU to device (though even then endianness had
 always gotten wrong) but for DMA it's going to be a can of worms.

Endianness here has no effect on the result.  An address_space_rw()
causes a lookup of a memory region, which happens to be an iommu memory
region.  Because of the way MemoryRegionOps are done, it is converted to
a 1/2/4/8 accessor, and then converted immediately back to a byte array.
 As long as we're consistent there's no change to the data path.

However we do have a problem with non-1/2/4/8 byte writes.  Right now
any mismatched access ends up as an 8 byte write, we need an extra
accessor for arbitrary writes, or rather better use of the .impl members
of MemoryRegionOps.

 
 Taking host endianness into account means nothing if you don't define
 what endianness those are expected to use to write to memory. If you add
 yet another case of guest endianness in the mix, then you make yet
 another mistake akin to the virtio trainwreck since things like powerpc
 or ARM can operate in different endianness.
 
 The best thing to do is to forbid use of those functions :-) And
 basically mandate the use of endian explicit accessor that specifically
 indicate what endianness the value should have once written in target
 memory. The most sensible expectation when using the raw Memory ops is
 to have the value go as is ie in host endianness.
-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v2 3/7] memory: iommu support

2012-11-01 Thread Avi Kivity
On 11/01/2012 03:44 PM, Avi Kivity wrote:
 
 However we do have a problem with non-1/2/4/8 byte writes.  Right now
 any mismatched access ends up as an 8 byte write, we need an extra
 accessor for arbitrary writes, or rather better use of the .impl members
 of MemoryRegionOps.

Sorry, it's converted into a series of 8-bit writes.  So the code is
correct now, if inefficient.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [patch v5 5/8] memory: introduce local lock for address space

2012-11-01 Thread Avi Kivity
On 10/29/2012 11:46 AM, liu ping fan wrote:
 On Mon, Oct 29, 2012 at 5:32 PM, Avi Kivity a...@redhat.com wrote:
 On 10/29/2012 01:48 AM, Liu Ping Fan wrote:
 For those address spaces which want to be able out of big lock, they
 will be protected by their own local.

 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  memory.c |   11 ++-
  memory.h |5 -
  2 files changed, 14 insertions(+), 2 deletions(-)

 diff --git a/memory.c b/memory.c
 index 2f68d67..ff34aed 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -1532,9 +1532,15 @@ void memory_listener_unregister(MemoryListener 
 *listener)
   QTAILQ_REMOVE(&memory_listeners, listener, link);
  }

 -void address_space_init(AddressSpace *as, MemoryRegion *root)
 +void address_space_init(AddressSpace *as, MemoryRegion *root, bool lock)


 Why not always use the lock?  Even if the big lock is taken, it doesn't
 hurt.  And eventually all address spaces will be fine-grained.

 I had thought only mmio is out of big lock's protection. While others
 address space will take extra expense. So leave them until they are
 ready to be out of big lock.

The other address spaces are pio (which also needs fine-grained locking)
and the dma address spaces (which are like address_space_memory, except
they are accessed via DMA instead of from the vcpu).



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [patch v5 6/8] memory: make mmio dispatch able to be out of biglock

2012-11-01 Thread Avi Kivity
On 10/30/2012 09:06 AM, liu ping fan wrote:
  while (len > 0) {
  page = addr & TARGET_PAGE_MASK;
  l = (page + TARGET_PAGE_SIZE) - addr;
  if (l > len)
  l = len;
 -section = phys_page_find(d, page >> TARGET_PAGE_BITS);
 +
 +if (as->lock) {
 +qemu_mutex_lock(as->lock);
 +safe_ref = memory_region_section_lookup_ref(d, page, obj_mrs);
 +qemu_mutex_unlock(as->lock);
 +if (!safe_ref) {
 +qemu_mutex_lock_iothread();
 +qemu_mutex_lock(as->lock);
 +/* when 2nd try, mem map can change, need to judge it again */
 +safe_ref = memory_region_section_lookup_ref(d, page, obj_mrs);
 +qemu_mutex_unlock(as->lock);
 +if (safe_ref) {
 +qemu_mutex_unlock_iothread();
 +}
 +}
 +} else {
 +/* Caller holds the big lock */
 +memory_region_section_lookup_ref(d, page, obj_mrs);

 It's not a property of the address space, it's a property of the caller.

 Sorry, what is your meaning?

Whether the caller holds the big lock or not is not known by the address
space.  It is only known by the caller itself.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v2 3/7] memory: iommu support

2012-10-31 Thread Avi Kivity
On 10/30/2012 09:11 PM, Blue Swirl wrote:
 On Tue, Oct 30, 2012 at 11:47 AM, Avi Kivity a...@redhat.com wrote:
 Add a new memory region type that translates addresses it is given,
 then forwards them to a target address space.  This is similar to
 an alias, except that the mapping is more flexible than a linear
  translation and truncation, and also less efficient since the
 translation happens at runtime.

 The implementation uses an AddressSpace mapping the target region to
 avoid hierarchical dispatch all the way to the resolved region; only
 iommu regions are looked up dynamically.

 +
 +static MemoryRegionOps memory_region_iommu_ops = {
 +.read = memory_region_iommu_read,
 +.write = memory_region_iommu_write,
 +#ifdef HOST_BIGENDIAN
 +.endianness = DEVICE_BIG_ENDIAN,
 
 Why couple this with host endianness? I'd expect IOMMU to operate at
 target bus endianness, for example LE for PCI on PPC guest.

This has nothing to do with device endianness; we're translating from a
byte buffer (address_space_rw()) to a uint64_t
(MemoryRegionOps::read/write()) so we need to take host endianness into
account.

This code cleverly makes use of memory core endian support to do the
conversion for us.  But it's probably too clever and should be replaced
by an explicit call to a helper.



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v2 4/7] memory: provide a MemoryRegion for IOMMUs to log faults

2012-10-31 Thread Avi Kivity
On 10/30/2012 09:14 PM, Blue Swirl wrote:
 On Tue, Oct 30, 2012 at 11:47 AM, Avi Kivity a...@redhat.com wrote:
 Accesses which do not translate will hit the fault region, which
 can then log the access.
 
 Maybe special casing the fault region should be avoided, the IOMMU
 could set up suitable fault regions by itself.

How can it?  Please elaborate.

I'm not too pleased with using a MemoryRegion for faults, perhaps a
MemoryRegionIOMMUOps::fault() callback would be better.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v2 7/7] i440fx: add an iommu

2012-10-31 Thread Avi Kivity
On 10/30/2012 09:18 PM, Blue Swirl wrote:
 On Tue, Oct 30, 2012 at 11:47 AM, Avi Kivity a...@redhat.com wrote:
  This iommu encrypts addresses on the device bus to avoid divulging information
 to hackers equipped with bus analyzers.  Following 3DES, addresses are 
 encrypted
 multiple times.  A XOR cypher is employed for efficiency.
 
 If this is not useful as a test device or other purposes, I'd drop it.

It's just for demonstrating the API.  I won't propose it for merging.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction

2012-10-31 Thread Avi Kivity
On 10/24/2012 10:06 AM, liu ping fan wrote:
 On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote:
 +static void dimm_populate(DimmDevice *s)
 +{
  +DeviceState *dev = (DeviceState *)s;
  +MemoryRegion *new = NULL;
  +
  +new = g_malloc(sizeof(MemoryRegion));
  +memory_region_init_ram(new, dev->id, s->size);
  +vmstate_register_ram_global(new);
  +memory_region_add_subregion(get_system_memory(), s->start, new);
  +s->mr = new;
  +}
  +
  +static void dimm_depopulate(DimmDevice *s)
  +{
  +assert(s);
  +vmstate_unregister_ram(s->mr, NULL);
  +memory_region_del_subregion(get_system_memory(), s->mr);
  +memory_region_destroy(s->mr);
  +s->mr = NULL;
  +}

 How is dimm hot unplug protected against callers who currently have RAM
 mapped (from cpu_physical_memory_map())?

 Emulated devices call cpu_physical_memory_map() directly or indirectly
 through DMA emulation code.  The RAM pointer may be held for arbitrary
 lengths of time, across main loop iterations, etc.

 It's not clear to me that it is safe to unplug a DIMM that has network
 or disk I/O buffers, for example.  We also need to be robust against
 malicious guests who abuse the hotplug lifecycle.  QEMU should never be
 left with dangling pointers.

 Not sure about the block layer. But I think those thread are already
 out of big lock, so there should be a MemoryListener to catch the
 RAM-unplug event, and if needed, bdrv_flush.


IMO we should use the same mechanism as proposed for other devices:
address_space_map() should grab a reference on the dimm device, and
address_space_unmap() can release it.  This way device destruction will
be deferred until all devices complete I/O.

We will have to be careful with network receive buffers though, since
they can be held indefinitely.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH v3 00/19] ACPI memory hotplug

2012-10-31 Thread Avi Kivity
On 10/31/2012 12:58 PM, Stefan Hajnoczi wrote:
 On Fri, Sep 21, 2012 at 1:17 PM, Vasilis Liaskovitis
 vasilis.liaskovi...@profitbricks.com wrote:
 This is v3 of the ACPI memory hotplug functionality. Only x86_64 target is 
 supported
 for now.
 
 Hi Vasilis,
 Regarding the hot unplug issue we've been discussing, it's possible to
 progress this patch series without fully solving that problem upfront.
 
 Karen Noel suggested that the series could be rolled without the hot
 unplug command, so that it's not possible to hit the unsafe case.
 This would allow users to hot plug additional memory.  They would have
 to use virtio-balloon to reduce the memory footprint again.  Later,
 when the memory region referencing issue has been solved the hot
 unplug command can be added.
 
 Just wanted to mention Karen's idea in case you feel stuck right now.

We could introduce hotunplug as an experimental feature so people can
test and play with it, and later graduate it to a fully supported feature.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH] Fix calculation of number of bits in the migration bitmap

2012-10-31 Thread Avi Kivity
On 10/31/2012 01:19 PM, Orit Wasserman wrote:
 The number of bits is off by one, for example if last_ram_offset
 is 0x1000 (the guest has one page) we get 0 bits instead of 1.
 
 Signed-off-by: Orit Wasserman owass...@redhat.com
 ---
  arch_init.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch_init.c b/arch_init.c
 index b75a4c5..a80c3c8 100644
 --- a/arch_init.c
 +++ b/arch_init.c
 @@ -565,7 +565,7 @@ static void reset_ram_globals(void)
  static int ram_save_setup(QEMUFile *f, void *opaque)
  {
  RAMBlock *block;
 -int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
 +int64_t ram_pages = (last_ram_offset() >> TARGET_PAGE_BITS) + 1;
  

The original code calculates 1 for your example case.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction

2012-10-31 Thread Avi Kivity
On 10/31/2012 02:18 PM, Stefan Hajnoczi wrote:

 IMO we should use the same mechanism as proposed for other devices:
 address_space_map() should grab a reference on the dimm device, and
 address_space_unmap() can release it.  This way device destruction will
  be deferred until all devices complete I/O.

 We will have to be careful with network receive buffers though, since
 they can be held indefinitely.
 
 Network receive buffers aren't mapped.  Net receive is not zero-copy.
 For example, virtio-net does virtqueue_pop() inside
 virtio_net_receive().
 
 I don't see a problem with networking.

What about vhost-net?  But that is managed separately with a MemoryListener.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v5 1/2] pl330: Initial version

2012-10-30 Thread Avi Kivity
On 10/29/2012 10:32 AM, Peter Maydell wrote:
 On 29 October 2012 06:35, Peter Crosthwaite
 peter.crosthwa...@xilinx.com wrote:
 Device model for Primecell PL330 dma controller.
 
 A general question -- this is a DMA controller so should it be using
 the DMAContext APIs now? Avi?

IOMMU functionality will be folded into the memory core, so
address_space_rw() and family.




-- 
error compiling committee.c: too many arguments to function



[Qemu-devel] [PATCH v2 0/7] IOMMU support

2012-10-30 Thread Avi Kivity
Changes from RFC:
 - change -translate to return read/write permissions in IOTLBEntry (was:
   -translate received is_write parameter)
 - add support for iommu fault reporting

Avi Kivity (7):
  memory: fix address space initialization/destruction
  memory: limit sections in the radix tree to the actual address space
size
  memory: iommu support
  memory: provide a MemoryRegion for IOMMUs to log faults
  pci: use memory core for iommu support
  vfio: abort if an emulated iommu is used
  i440fx: add an iommu

 exec.c |  43 +---
 hw/pci.c   |  59 +---
 hw/pci.h   |   7 +++-
 hw/pci_internals.h |   5 ++-
 hw/piix_pci.c  |  77 
 hw/spapr.h |   1 +
 hw/spapr_iommu.c   |  45 +
 hw/spapr_pci.c |  27 +++--
 hw/spapr_pci.h |   2 +
 hw/vfio_pci.c  |   2 +
 memory.c   | 112 +
 memory.h   |  49 +++
 12 files changed, 366 insertions(+), 63 deletions(-)

-- 
1.7.12




[Qemu-devel] [PATCH v2 1/7] memory: fix address space initialization/destruction

2012-10-30 Thread Avi Kivity
A couple of fields were left uninitialized.  This was not observed earlier
because all address spaces were statically allocated.  Also free allocation
for those fields.

Signed-off-by: Avi Kivity a...@redhat.com
---
 memory.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/memory.c b/memory.c
index 243cb23..ae3552b 100644
--- a/memory.c
+++ b/memory.c
@@ -1541,6 +1541,8 @@ void address_space_init(AddressSpace *as, MemoryRegion 
*root)
 as->root = root;
 as->current_map = g_new(FlatView, 1);
 flatview_init(as->current_map);
+as->ioeventfd_nb = 0;
+as->ioeventfds = NULL;
 QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
 as->name = NULL;
 memory_region_transaction_commit();
@@ -1557,6 +1559,7 @@ void address_space_destroy(AddressSpace *as)
 address_space_destroy_dispatch(as);
 flatview_destroy(as->current_map);
 g_free(as->current_map);
+g_free(as->ioeventfds);
 }
 
 uint64_t io_mem_read(MemoryRegion *mr, hwaddr addr, unsigned size)
-- 
1.7.12




[Qemu-devel] [PATCH v2 5/7] pci: use memory core for iommu support

2012-10-30 Thread Avi Kivity
Use the new iommu support in the memory core for iommu support.
The only user, spapr, is also converted.

Signed-off-by: Avi Kivity a...@redhat.com
---
 hw/pci.c   | 59 +-
 hw/pci.h   |  7 +--
 hw/pci_internals.h |  5 +++--
 hw/spapr.h |  1 +
 hw/spapr_iommu.c   | 45 +
 hw/spapr_pci.c | 27 +
 hw/spapr_pci.h |  2 ++
 7 files changed, 88 insertions(+), 58 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index dceda0b..5056ddb 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -274,6 +274,21 @@ int pci_find_domain(const PCIBus *bus)
 return -1;
 }
 
+static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
+{
+MemoryRegion *mr = g_new(MemoryRegion, 1);
+
+/* FIXME: inherit memory region from bus creator */
+memory_region_init_alias(mr, "iommu-nop", get_system_memory(), 0, INT64_MAX);
+return mr;
+}
+
+static void pci_default_iommu_dtor(MemoryRegion *mr)
+{
+memory_region_destroy(mr);
+g_free(mr);
+}
+
 void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
  const char *name,
  MemoryRegion *address_space_mem,
@@ -285,6 +300,7 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
 bus-devfn_min = devfn_min;
 bus-address_space_mem = address_space_mem;
 bus-address_space_io = address_space_io;
+pci_setup_iommu(bus, pci_default_iommu, pci_default_iommu_dtor, NULL);
 
 /* host bridge */
 QLIST_INIT(bus-child);
@@ -775,21 +791,16 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev, PCIBus *bus,
  PCI_SLOT(devfn), PCI_FUNC(devfn), name, 
bus-devices[devfn]-name);
 return NULL;
 }
+
 pci_dev->bus = bus;
-if (bus->dma_context_fn) {
-pci_dev->dma = bus->dma_context_fn(bus, bus->dma_context_opaque, devfn);
-} else {
-/* FIXME: Make dma_context_fn use MemoryRegions instead, so this path is
- * taken unconditionally */
-/* FIXME: inherit memory region from bus creator */
-memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
- get_system_memory(), 0,
- memory_region_size(get_system_memory()));
-memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
-address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region);
-pci_dev->dma = g_new(DMAContext, 1);
-dma_context_init(pci_dev->dma, &pci_dev->bus_master_as, NULL, NULL, NULL);
-}
+pci_dev->iommu = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
+memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
+ pci_dev->iommu, 0, memory_region_size(pci_dev->iommu));
+memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
+address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region);
+pci_dev->dma = g_new(DMAContext, 1);
+dma_context_init(pci_dev->dma, &pci_dev->bus_master_as, NULL, NULL, NULL);
+
 pci_dev-devfn = devfn;
 pstrcpy(pci_dev-name, sizeof(pci_dev-name), name);
 pci_dev-irq_state = 0;
@@ -843,12 +854,12 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
 pci_dev->bus->devices[pci_dev->devfn] = NULL;
 pci_config_free(pci_dev);
 
-if (!pci_dev->bus->dma_context_fn) {
-address_space_destroy(&pci_dev->bus_master_as);
-memory_region_destroy(&pci_dev->bus_master_enable_region);
-g_free(pci_dev->dma);
-pci_dev->dma = NULL;
-}
+address_space_destroy(&pci_dev->bus_master_as);
+memory_region_del_subregion(&pci_dev->bus_master_enable_region, pci_dev->iommu);
+pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
+memory_region_destroy(&pci_dev->bus_master_enable_region);
+g_free(pci_dev->dma);
+pci_dev->dma = NULL;
 }
 
 static void pci_unregister_io_regions(PCIDevice *pci_dev)
@@ -2140,10 +2151,12 @@ static void pci_device_class_init(ObjectClass *klass, 
void *data)
 k-props = pci_props;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque)
+void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
+ void *opaque)
 {
-bus->dma_context_fn = fn;
-bus->dma_context_opaque = opaque;
+bus->iommu_fn = fn;
+bus->iommu_dtor_fn = dtor;
+bus->iommu_opaque = opaque;
 }
 
 static TypeInfo pci_device_type_info = {
diff --git a/hw/pci.h b/hw/pci.h
index 241c1d8..c3dd282 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -213,6 +213,7 @@ struct PCIDevice {
 PCIIORegion io_regions[PCI_NUM_REGIONS];
 AddressSpace bus_master_as;
 MemoryRegion bus_master_enable_region;
+MemoryRegion *iommu;
 DMAContext *dma;
 
 /* do not access the following fields */
@@ -357,9 +358,11 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int 
*domp, int *busp,
 
 void

[Qemu-devel] [PATCH v2 7/7] i440fx: add an iommu

2012-10-30 Thread Avi Kivity
This iommu encrypts addresses on the device bus to avoid divulging information
to hackers equipped with bus analyzers.  Following 3DES, addresses are encrypted
multiple times.  A XOR cypher is employed for efficiency.

Signed-off-by: Avi Kivity a...@redhat.com
---
 hw/piix_pci.c | 77 +++
 1 file changed, 77 insertions(+)

diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 9af5847..99601f4 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -30,6 +30,7 @@
 #include sysbus.h
 #include range.h
 #include xen.h
+#include exec-memory.h
 
 /*
  * I440FX chipset data sheet.
@@ -248,6 +249,81 @@ static int i440fx_initfn(PCIDevice *dev)
 return 0;
 }
 
+typedef struct SillyIOMMU SillyIOMMU;
+
+struct SillyIOMMU {
+MemoryRegion fault;
+MemoryRegion l1;
+MemoryRegion l2;
+hwaddr mask;
+hwaddr secret;
+};
+
+static IOMMUTLBEntry silly_l1_translate(MemoryRegion *l1, hwaddr addr)
+{
+SillyIOMMU *s = container_of(l1, SillyIOMMU, l1);
+hwaddr xlat = addr ^ s->secret;
+
+printf("l1: %" HWADDR_PRIx " -> %" HWADDR_PRIx "\n", addr, xlat);
+
+return (IOMMUTLBEntry) {
+.device_addr = addr & ~s->mask,
+.translated_addr = xlat & ~s->mask,
+.addr_mask = s->mask,
+.perm = { true, true },
+};
+}
+
+static MemoryRegionIOMMUOps silly_l1_iommu_ops = {
+.translate = silly_l1_translate,
+};
+
+static IOMMUTLBEntry silly_l2_translate(MemoryRegion *l2, hwaddr addr)
+{
+SillyIOMMU *s = container_of(l2, SillyIOMMU, l2);
+hwaddr xlat = addr ^ s->secret;
+
+printf("l2: %" HWADDR_PRIx " -> %" HWADDR_PRIx "\n", addr, xlat);
+
+return (IOMMUTLBEntry) {
+.device_addr = addr & ~s->mask,
+.translated_addr = xlat & ~s->mask,
+.addr_mask = s->mask,
+.perm = { true, true },
+};
+}
+
+static MemoryRegionIOMMUOps silly_l2_iommu_ops = {
+.translate = silly_l2_translate,
+};
+
+static MemoryRegion *silly_iommu_new(PCIBus *bus, void *opaque, int devfn)
+{
+SillyIOMMU *s = g_new(SillyIOMMU, 1);
+MemoryRegion *sysmem = get_system_memory();
+
+s->mask = (0x1000 << (devfn >> 3)) - 1;
+s->secret = ((devfn << 24) | 0x00aabbccdd) & ~s->mask;
+memory_region_init(&s->fault, "silly-fault", INT64_MAX);
+memory_region_init_iommu(&s->l2, &silly_l2_iommu_ops, sysmem, &s->fault,
+ "silly-l2", INT64_MAX);
+memory_region_init_iommu(&s->l1, &silly_l1_iommu_ops, &s->l2, &s->fault,
+ "silly-l1", INT64_MAX);
+return &s->l1;
+}
+
+static void silly_iommu_del(MemoryRegion *l1)
+{
+SillyIOMMU *s = container_of(l1, SillyIOMMU, l1);
+
+memory_region_del_subregion(&s->l2, get_system_memory());
+memory_region_del_subregion(&s->l1, &s->l2);
+memory_region_destroy(&s->l2);
+memory_region_destroy(&s->l1);
+memory_region_destroy(&s->fault);
+g_free(s);
+}
+
 static PCIBus *i440fx_common_init(const char *device_name,
   PCII440FXState **pi440fx_state,
   int *piix3_devfn,
@@ -275,6 +351,7 @@ static PCIBus *i440fx_common_init(const char *device_name,
 s->address_space = address_space_mem;
 b = pci_bus_new(dev, NULL, pci_address_space,
 address_space_io, 0);
+pci_setup_iommu(b, silly_iommu_new, silly_iommu_del, NULL);
 s-bus = b;
 object_property_add_child(qdev_get_machine(), "i440fx", OBJECT(dev), NULL);
 qdev_init_nofail(dev);
-- 
1.7.12




[Qemu-devel] [PATCH v2 2/7] memory: limit sections in the radix tree to the actual address space size

2012-10-30 Thread Avi Kivity
The radix tree is statically sized to fit TARGET_PHYS_ADDR_SPACE_BITS.
If a larger memory region is registered, it will overflow.

Fix by limiting any section in the radix tree to the supported size.

This problem was not observed earlier since artificial regions (containers
and aliases) are eliminated by the memory core, leaving only device regions
which have reasonable sizes.  An IOMMU however cannot be eliminated by the
memory core, and may have an artificial size.

Signed-off-by: Avi Kivity a...@redhat.com
---
 exec.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/exec.c b/exec.c
index b0ed593..deee8ec 100644
--- a/exec.c
+++ b/exec.c
@@ -2280,10 +2280,23 @@ static void register_multipage(AddressSpaceDispatch *d, 
MemoryRegionSection *sec
   section_index);
 }
 
+static MemoryRegionSection limit(MemoryRegionSection section)
+{
+unsigned practical_as_bits = MIN(TARGET_PHYS_ADDR_SPACE_BITS, 62);
+hwaddr as_limit;
+
+as_limit = (hwaddr)1 << practical_as_bits;
+
+section.size = MIN(section.offset_within_address_space + section.size, as_limit)
+   - section.offset_within_address_space;
+
+return section;
+}
+
 static void mem_add(MemoryListener *listener, MemoryRegionSection *section)
 {
 AddressSpaceDispatch *d = container_of(listener, AddressSpaceDispatch, 
listener);
-MemoryRegionSection now = *section, remain = *section;
+MemoryRegionSection now = limit(*section), remain = limit(*section);
 
 if ((now.offset_within_address_space & ~TARGET_PAGE_MASK)
 || (now.size < TARGET_PAGE_SIZE)) {
-- 
1.7.12




[Qemu-devel] [PATCH v2 6/7] vfio: abort if an emulated iommu is used

2012-10-30 Thread Avi Kivity
vfio doesn't support guest iommus yet, indicate it to the user
by gently depositing a core on their disk.

Signed-off-by: Avi Kivity a...@redhat.com
---
 hw/vfio_pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 0473ae8..bd7a075 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -943,6 +943,8 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 void *vaddr;
 int ret;
 
+assert(!memory_region_is_iommu(section->mr));
+
 if (vfio_listener_skipped_section(section)) {
 DPRINTF("vfio: SKIPPING region_add %" HWADDR_PRIx " - %" PRIx64 "\n",
 section->offset_within_address_space,
-- 
1.7.12




[Qemu-devel] [PATCH v2 3/7] memory: iommu support

2012-10-30 Thread Avi Kivity
Add a new memory region type that translates addresses it is given,
then forwards them to a target address space.  This is similar to
an alias, except that the mapping is more flexible than a linear
translation and truncation, and also less efficient since the
translation happens at runtime.

The implementation uses an AddressSpace mapping the target region to
avoid hierarchical dispatch all the way to the resolved region; only
iommu regions are looked up dynamically.

Signed-off-by: Avi Kivity a...@redhat.com
---
 exec.c   |  28 ++---
 memory.c | 106 +++
 memory.h |  45 +++
 3 files changed, 175 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index deee8ec..c3914fc 100644
--- a/exec.c
+++ b/exec.c
@@ -3507,6 +3507,7 @@ void cpu_physical_memory_write_rom(hwaddr addr,
 
 typedef struct {
 void *buffer;
+AddressSpace *as;
 hwaddr addr;
 hwaddr len;
 } BounceBuffer;
@@ -3572,23 +3573,42 @@ void *address_space_map(AddressSpace *as,
 ram_addr_t raddr = RAM_ADDR_MAX;
 ram_addr_t rlen;
 void *ret;
+IOMMUTLBEntry iotlb;
+hwaddr xlat;
+AddressSpace *as_xlat;
 
 while (len > 0) {
+xlat = addr;
+as_xlat = as;
 page = addr & TARGET_PAGE_MASK;
 l = (page + TARGET_PAGE_SIZE) - addr;
 if (l > len)
 l = len;
 section = phys_page_find(d, page >> TARGET_PAGE_BITS);
 
+while (section->mr->iommu_ops) {
+iotlb = section->mr->iommu_ops->translate(section->mr, xlat);
+if (iotlb.perm[is_write]) {
+xlat = ((iotlb.translated_addr & ~iotlb.addr_mask)
+| (addr & iotlb.addr_mask));
+as_xlat = section->mr->iommu_target_as;
+l = (MIN(xlat + l - 1, xlat | iotlb.addr_mask) - xlat) + 1;
+section = phys_page_find(as_xlat->dispatch, xlat >> TARGET_PAGE_BITS);
+} else {
+section = &phys_sections[phys_section_unassigned];
+}
+}
+
 if (!(memory_region_is_ram(section->mr) && !section->readonly)) {
 if (todo || bounce.buffer) {
 break;
 }
 bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
-bounce.addr = addr;
+bounce.addr = xlat;
+bounce.as = as_xlat;
 bounce.len = l;
 if (!is_write) {
-address_space_read(as, addr, bounce.buffer, l);
+address_space_read(as_xlat, xlat, bounce.buffer, l);
 }
 
 *plen = l;
@@ -3596,7 +3616,7 @@ void *address_space_map(AddressSpace *as,
 }
 if (!todo) {
 raddr = memory_region_get_ram_addr(section->mr)
-+ memory_region_section_addr(section, addr);
++ memory_region_section_addr(section, xlat);
 }
 
 len -= l;
@@ -3635,7 +3655,7 @@ void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
 return;
 }
 if (is_write) {
-address_space_write(as, bounce.addr, bounce.buffer, access_len);
+address_space_write(bounce.as, bounce.addr, bounce.buffer, access_len);
 }
 qemu_vfree(bounce.buffer);
 bounce.buffer = NULL;
diff --git a/memory.c b/memory.c
index ae3552b..ba2d4a0 100644
--- a/memory.c
+++ b/memory.c
@@ -775,6 +775,12 @@ static void memory_region_destructor_rom_device(MemoryRegion *mr)
 qemu_ram_free(mr->ram_addr & TARGET_PAGE_MASK);
 }
 
+static void memory_region_destructor_iommu(MemoryRegion *mr)
+{
+address_space_destroy(mr->iommu_target_as);
+g_free(mr->iommu_target_as);
+}
+
 static bool memory_region_wrong_endianness(MemoryRegion *mr)
 {
 #ifdef TARGET_WORDS_BIGENDIAN
@@ -789,6 +795,7 @@ void memory_region_init(MemoryRegion *mr,
 uint64_t size)
 {
 mr->ops = NULL;
+mr->iommu_ops = NULL;
 mr->parent = NULL;
 mr->size = int128_make64(size);
 if (size == UINT64_MAX) {
@@ -980,6 +987,100 @@ void memory_region_init_rom_device(MemoryRegion *mr,
 mr->ram_addr = qemu_ram_alloc(size, mr);
 }
 
+static void memory_region_iommu_rw(MemoryRegion *iommu, hwaddr addr,
+   uint8_t *buf, unsigned len, bool is_write)
+{
+IOMMUTLBEntry tlb;
+unsigned clen;
+hwaddr xlat;
+
+while (len) {
+tlb = iommu->iommu_ops->translate(iommu, addr);
+clen = (MIN(addr | tlb.addr_mask, addr + len - 1) - addr) + 1;
+if (tlb.perm[is_write]) {
+xlat = (tlb.translated_addr & ~tlb.addr_mask) | (addr & tlb.addr_mask);
+address_space_rw(iommu->iommu_target_as, xlat, buf, clen, is_write);
+} else {
+if (!is_write) {
+memset(buf, 0xff, clen);
+}
+}
+buf += clen;
+addr += clen;
+len -= clen;
+}
+}
+
+static uint64_t memory_region_iommu_read(void *opaque, hwaddr addr

[Qemu-devel] [PATCH v2 4/7] memory: provide a MemoryRegion for IOMMUs to log faults

2012-10-30 Thread Avi Kivity
Accesses which do not translate will hit the fault region, which
can then log the access.

Signed-off-by: Avi Kivity a...@redhat.com
---
 memory.c | 9 ++---
 memory.h | 4 
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/memory.c b/memory.c
index ba2d4a0..d6b46fd 100644
--- a/memory.c
+++ b/memory.c
@@ -778,7 +778,9 @@ static void memory_region_destructor_rom_device(MemoryRegion *mr)
 static void memory_region_destructor_iommu(MemoryRegion *mr)
 {
 address_space_destroy(mr->iommu_target_as);
+address_space_destroy(mr->iommu_fault_as);
 g_free(mr->iommu_target_as);
+g_free(mr->iommu_fault_as);
 }
 
 static bool memory_region_wrong_endianness(MemoryRegion *mr)
@@ -1001,9 +1003,7 @@ static void memory_region_iommu_rw(MemoryRegion *iommu, hwaddr addr,
 xlat = (tlb.translated_addr & ~tlb.addr_mask) | (addr & tlb.addr_mask);
 address_space_rw(iommu->iommu_target_as, xlat, buf, clen, is_write);
 } else {
-if (!is_write) {
-memset(buf, 0xff, clen);
-}
+address_space_rw(iommu->iommu_fault_as, addr, buf, clen, is_write);
 }
 buf += clen;
 addr += clen;
@@ -1068,6 +1068,7 @@ static void memory_region_iommu_write(void *opaque, hwaddr addr,
 void memory_region_init_iommu(MemoryRegion *mr,
   MemoryRegionIOMMUOps *ops,
   MemoryRegion *target,
+  MemoryRegion *fault,
   const char *name,
   uint64_t size)
 {
@@ -1079,6 +1080,8 @@ void memory_region_init_iommu(MemoryRegion *mr,
 mr->destructor = memory_region_destructor_iommu;
 mr->iommu_target_as = g_new(AddressSpace, 1);
 address_space_init(mr->iommu_target_as, target);
+mr->iommu_fault_as = g_new(AddressSpace, 1);
+address_space_init(mr->iommu_fault_as, fault);
 }
 
 static uint64_t invalid_read(void *opaque, hwaddr addr,
diff --git a/memory.h b/memory.h
index 47362c9..6312197 100644
--- a/memory.h
+++ b/memory.h
@@ -162,6 +162,7 @@ struct MemoryRegion {
 unsigned ioeventfd_nb;
 MemoryRegionIoeventfd *ioeventfds;
 struct AddressSpace *iommu_target_as;
+struct AddressSpace *iommu_fault_as;
 };
 
 struct MemoryRegionPortio {
@@ -361,12 +362,15 @@ void memory_region_init_reservation(MemoryRegion *mr,
  * @ops: a function that translates addresses into the @target region
  * @target: a #MemoryRegion that will be used to satisfy accesses to translated
  *  addresses
+ * @fault: a #MemoryRegion for servicing failed translations; accessed with
+ * untranslated addresses
  * @name: used for debugging; not visible to the user or ABI
  * @size: size of the region.
  */
 void memory_region_init_iommu(MemoryRegion *mr,
   MemoryRegionIOMMUOps *ops,
   MemoryRegion *target,
+  MemoryRegion *fault,
   const char *name,
   uint64_t size);
 
-- 
1.7.12




Re: [Qemu-devel] [PATCH 3/3] apic: always update the in-kernel status after loading

2012-10-30 Thread Avi Kivity
On 10/30/2012 02:16 PM, Paolo Bonzini wrote:
 The LAPIC is loaded separately from the rest of the VCPU state.  Thus,
 when restoring the CPU the dummy post-reset state is passed to the
 in-kernel APIC.
 
 This can cause MSI injection to fail if attempted during the restore of
 another device, because the LAPIC believes it's not enabled.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  hw/apic_common.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/hw/apic_common.c b/hw/apic_common.c
 index f373ba8..1ef52b2 100644
 --- a/hw/apic_common.c
 +++ b/hw/apic_common.c
 @@ -362,6 +362,7 @@ static int apic_dispatch_post_load(void *opaque, int version_id)
  if (info->post_load) {
  info->post_load(s);
  }
 +cpu_put_apic_state(DEVICE(s));
  return 0;
  }

Aren't we still dependent on the order of processing?  If the APIC is
restored after the device, won't we get the same problem?


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v5 1/2] pl330: Initial version

2012-10-30 Thread Avi Kivity
On 10/30/2012 03:42 PM, Peter Crosthwaite wrote:
 On Tue, Oct 30, 2012 at 7:51 PM, Avi Kivity a...@redhat.com wrote:
 On 10/29/2012 10:32 AM, Peter Maydell wrote:
 On 29 October 2012 06:35, Peter Crosthwaite
 peter.crosthwa...@xilinx.com wrote:
 Device model for Primecell PL330 dma controller.

 A general question -- this is a DMA controller so should it be using
 the DMAContext APIs now? Avi?

 IOMMU functionality will be folded into the memory core, so
 address_space_rw() and family.


 
 Hi Avi,
 
 So does this mean no more dma_contexts and dma_memory_foo? 

Correct.

 I have
 remade the series (and others) using dma_memory_*. Can you point to a
 sysbus attached DMA controller currently in tree that does it properly
 so we can use it for a basis as there are several (four in total
 think) series I have pending all containing DMA which will be affected
 by this.

I don't think there are any.

If your dma controller does not have an IOMMU, all you need to do is
replace cpu_physical_memory_*(...) with
address_space_*(address_space_memory, ...).  If it does have an IOMMU,
then you need to wait until my IOMMU patchset hits mainline (or use
DMAContext).



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [patch v5 5/8] memory: introduce local lock for address space

2012-10-29 Thread Avi Kivity
On 10/29/2012 01:48 AM, Liu Ping Fan wrote:
 For those address spaces which want to run outside the big lock, they
 will be protected by their own local lock.
 
 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---
  memory.c |   11 ++-
  memory.h |5 -
  2 files changed, 14 insertions(+), 2 deletions(-)
 
 diff --git a/memory.c b/memory.c
 index 2f68d67..ff34aed 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -1532,9 +1532,15 @@ void memory_listener_unregister(MemoryListener *listener)
  QTAILQ_REMOVE(memory_listeners, listener, link);
  }
  
 -void address_space_init(AddressSpace *as, MemoryRegion *root)
 +void address_space_init(AddressSpace *as, MemoryRegion *root, bool lock)


Why not always use the lock?  Even if the big lock is taken, it doesn't
hurt.  And eventually all address spaces will be fine-grained.

  {
  memory_region_transaction_begin();
 +if (lock) {
 +as->lock = g_new(QemuMutex, 1);
 +qemu_mutex_init(as->lock);
 +} else {
 +as->lock = NULL;
 +}
  as->root = root;
  as->current_map = g_new(FlatView, 1);
  flatview_init(as-current_map);
 @@ -1553,6 +1559,9 @@ void address_space_destroy(AddressSpace *as)
  QTAILQ_REMOVE(address_spaces, as, address_spaces_link);
  address_space_destroy_dispatch(as);
  flatview_destroy(as->current_map);
 +if (as->lock) {
 +g_free(as->lock);
 +}
  g_free(as->current_map);
  }
  
 diff --git a/memory.h b/memory.h
 index 79393f1..12d1c56 100644
 --- a/memory.h
 +++ b/memory.h
 @@ -22,6 +22,7 @@
  #include "cpu-common.h"
  #include "targphys.h"
  #include "qemu-queue.h"
 +#include "qemu-thread.h"
  #include "iorange.h"
  #include "ioport.h"
  #include "int128.h"
 @@ -164,6 +165,7 @@ typedef struct AddressSpace AddressSpace;
   */
  struct AddressSpace {
  /* All fields are private. */
 +QemuMutex *lock;
  const char *name;
  MemoryRegion *root;
  struct FlatView *current_map;
 @@ -801,8 +803,9 @@ void mtree_info(fprintf_function mon_printf, void *f);
   *
   * @as: an uninitialized #AddressSpace
   * @root: a #MemoryRegion that routes addesses for the address space
 + * @lock: if true, the physmap protected by local lock, otherwise big lock
   */
 -void address_space_init(AddressSpace *as, MemoryRegion *root);
 +void address_space_init(AddressSpace *as, MemoryRegion *root, bool lock);
  
  
  /**
 


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [patch v5 6/8] memory: make mmio dispatch able to be out of biglock

2012-10-29 Thread Avi Kivity
On 10/29/2012 01:48 AM, Liu Ping Fan wrote:
 Without biglock, we try to protect the mr by increasing its refcnt.
 If we cannot inc the refcnt, go backward and resort to the biglock.
 
 Another point is memory radix-tree can be flushed by another
 thread, so we should get the copy of terminal mr to survive
 from such issue.
 

 +static bool memory_region_section_ref(MemoryRegionSection *mrs)
 +{
 +MemoryRegion *mr;
 +bool ret = false;
 +
 +mr = mrs->mr;
 +if (mr->ops && mr->ops->ref) {
 +ret = mr->ops->ref(mr);

I still don't see why ->ref() needs to return something.

 +}
 +return ret;
 +}
 +
  
   while (len > 0) {
   page = addr & TARGET_PAGE_MASK;
   l = (page + TARGET_PAGE_SIZE) - addr;
   if (l > len)
   l = len;
  -section = phys_page_find(d, page >> TARGET_PAGE_BITS);
  +
  +if (as->lock) {
  +qemu_mutex_lock(as->lock);
  +safe_ref = memory_region_section_lookup_ref(d, page, &obj_mrs);
  +qemu_mutex_unlock(as->lock);
  +if (!safe_ref) {
  +qemu_mutex_lock_iothread();
  +qemu_mutex_lock(as->lock);
  +/* when 2nd try, mem map can change, need to judge it again */
  +safe_ref = memory_region_section_lookup_ref(d, page, &obj_mrs);
  +qemu_mutex_unlock(as->lock);
  +if (safe_ref) {
  +qemu_mutex_unlock_iothread();
  +}
  +}
  +} else {
  +/* Caller hold the big lock */
  +memory_region_section_lookup_ref(d, page, &obj_mrs);

It's not a property of the address space, it's a property of the caller.

 +}
 +section = obj_mrs;
  
  if (is_write) {
   if (!memory_region_is_ram(section->mr)) {


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [memory] abort with head a8170e5

2012-10-29 Thread Avi Kivity
On 10/29/2012 09:54 AM, Aurelien Jarno wrote:
 On Thu, Oct 25, 2012 at 06:12:06PM +0200, Avi Kivity wrote:
 On 10/25/2012 04:39 PM, Aurelien Jarno wrote:
  On Thu, Oct 25, 2012 at 03:47:34PM +0200, Avi Kivity wrote:
  On 10/24/2012 04:00 PM, Aurelien Jarno wrote:
   
   mips is also broken but by commit 
   1c380f9460522f32c8dd2577b2a53d518ec91c6d:
   
   | [0.436000] PCI: Enabling device :00:0a.1 ( - 0001)
   | Segmentation fault (core dumped)
   
  
  How do you reproduce it?
  
  You can use the mips kernel version 2.6.32 from:
http://people.debian.org/~aurel32/qemu/mips/
  
  Then just run it with the following command:
qemu-system-mips -M malta -kernel vmlinux-2.6.32-5-4kc-malta -append 
  console=tty0
  
  (You can also get the README command line if you don't care about
  downloading the disk image).
 
 Doesn't reproduce here with this command line (upstream + the bridge patch).
 
 [0.568000] PCI: Enabling device :00:12.0 ( - 0002)
 [0.572000] cirrusfb :00:12.0: Cirrus Logic chipset on PCI bus,
 RAM (4096 kB) at 0x1000
 
 ...
 
 [1.172000] PCI: Enabling device :00:0a.1 ( - 0001)
 [1.188000] scsi0 : ata_piix
 
 (with console=ttyS0)
 
 Ok, looks like I didn't provide the right command line. I am only able
 to reproduce it when using -nographic, and only with -vga cirrus (yes it
 starts to be quite strange). In that case it's better to pass 
 console=ttyS0, even if you can reproduce it with console=tty0.
 
 In short it seems heavily related to the cirrus VGA card.

I was able to reproduce it, and in fact it's unrelated to VGA, it's deep in the 
memory core.

Please try this patch:

From: Avi Kivity a...@redhat.com
Date: Mon, 29 Oct 2012 17:07:09 +0200
Subject: [PATCH] memory: fix rendering of a region obscured by another

The memory core drops regions that are hidden by another region (for example,
during BAR sizing), but it doesn't do so correctly if the lower address of the
existing range is below the lower address of the new range.

Example (qemu-system-mips -M malta -kernel vmlinux-2.6.32-5-4kc-malta
 -append console=ttyS0  -nographic -vga cirrus):

Existing range: 1000-107f
New range:  100a-100b

Correct behaviour: drop new range
Incorrect behaviour: add new range

Fix by taking this case into account (previously we only considered
equal lower boundaries).

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/memory.c b/memory.c
index 36bb9a5..243cb23 100644
--- a/memory.c
+++ b/memory.c
@@ -539,12 +539,12 @@ static void render_memory_region(FlatView *view,
 offset_in_region += int128_get64(now);
 int128_subfrom(remain, now);
 }
-if (int128_eq(base, view->ranges[i].addr.start)) {
-now = int128_min(remain, view->ranges[i].addr.size);
-int128_addto(base, now);
-offset_in_region += int128_get64(now);
-int128_subfrom(remain, now);
-}
+now = int128_sub(int128_min(int128_add(base, remain),
+addrrange_end(view->ranges[i].addr)),
+ base);
+int128_addto(base, now);
+offset_in_region += int128_get64(now);
+int128_subfrom(remain, now);
 }
 if (int128_nz(remain)) {
 fr.mr = mr;



-- 
error compiling committee.c: too many arguments to function


