Re: Questions about the real mode in kvm/qemu
On 9/26/19 12:18 PM, Paolo Bonzini wrote: On 26/09/19 10:59, Maxim Levitsky wrote: If you mean to ask if there is a way to let guest access use no paging at all, that is access host physical addresses directly, then indeed there is no way, since regular non 'unrestricted guest' mode required both protected mode and paging, and 'unrestricted guest' requires EPT. Academically speaking it is of course possible to create paging tables that are 1:1... Not so academically, it's exactly what KVM does. However, indeed it would also be possible to switch out of EPT mode when CR0.PG=0. I'm not sure why it was done this way, maybe when the code was written it was simpler to use the identity map. Let's see if Avi is listening... :) I think it was just simpler for the people who implemented it at the time. Switching out of EPT would have been a better solution as it would no longer require allocating guest physical address space with the few warts that requires.
Re: [Qemu-devel] Linux kernel polling for QEMU
On 12/02/2016 12:12 PM, Stefan Hajnoczi wrote: On Thu, Dec 01, 2016 at 09:35:27AM -0500, Paolo Bonzini wrote: Maybe we could do the same for sockets? When data is available on a socket (or when it becomes writable), write to a user memory location. I, too, have an interest in polling; in my situation most of the polling happens in userspace. You are trying to improve on the latency of non-blocking ppoll(2)/epoll_wait(2) call? Yes, but the concern is for throughput, not latency. My main loop looks like execute some tasks poll from many sources Since epoll_wait(..., 0) has non-trivial costs, I have to limit the polling rate, and even so it shows up in the profile. If the cost were lower, I could poll at a higher frequency, resulting in lower worst-case latencies for high-priority tasks. IMHO, the ideal model wouldn't enter the kernel at all unless you _want_ to go to sleep. Something like mmapping an epoll file descriptor to get a ring of epoll_events. It has to play nice with blocking epoll_wait(2) and ppoll(2) of an eventpoll file descriptor so that the userspace process can go to sleep without a lot of work to switch epoll wait vs mmap modes. I think that's overkill; as soon as data is returned, you're going to be talking to the kernel in order to get it. Besides, leave something for userspace tcp stack implementations. I'm bothered by the fact that the kernel will not be able to poll NIC or storage controller rings if userspace is doing the polling :(. Yes, that's a concern. We have three conflicting requirements: - fast inter-thread IPC (or guest/host IPC) -> shared memory+polling - kernel multiplexing of hardware -> kernel interfaces + kernel poll / sleep - composable interfaces that allow you to use the two methods You could use something like Flow Director to move interesting flows directly to the application, let the application poll those queues (and maybe tell the kernel if something interesting happens), while the kernel manages the other queues. It gets ugly very quickly.
Re: [Qemu-devel] Linux kernel polling for QEMU
On 12/01/2016 04:35 PM, Paolo Bonzini wrote: Maybe we could do the same for sockets? When data is available on a socket (or when it becomes writable), write to a user memory location. I, too, have an interest in polling; in my situation most of the polling happens in userspace. You are trying to improve on the latency of non-blocking ppoll(2)/epoll_wait(2) call? Yes, but the concern is for throughput, not latency. My main loop looks like execute some tasks poll from many sources Since epoll_wait(..., 0) has non-trivial costs, I have to limit the polling rate, and even so it shows up in the profile. If the cost were lower, I could poll at a higher frequency, resulting in lower worst-case latencies for high-priority tasks. IMHO, the ideal model wouldn't enter the kernel at all unless you _want_ to go to sleep. Right. Note that if data is available, it doesn't really matter, because you'd have to read() it anyway. It matters for the case where data is not available.
Re: [Qemu-devel] Linux kernel polling for QEMU
On 12/01/2016 01:45 PM, Stefan Hajnoczi wrote: On Wed, Nov 30, 2016 at 07:41:11PM +0200, Avi Kivity wrote: On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote: On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote: On 28/11/2016 16:29, Stefan Hajnoczi wrote: Thanks for sharing the link. I'll let you know before embarking on an effort to make epoll support busy_loop. At the moment I'm still evaluating whether the good results we've gotten from polling in QEMU userspace are preserved when polling is shifted to the kernel. FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a chance to test it yet: https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13 This would add a system call every time the main loop processes a vring, wouldn't it? Yes, this is a problem and is the reason I haven't finished implementing a test using QEMU yet. My proposed eventfd polling mechanism doesn't work well with descriptor ring indices because the polling info needs to be updated each event loop iteration with the last seen ring index. This can be solved by making struct eventfd_poll_info.value take a userspace memory address. The value to compare against is fetched each polling iteration, avoiding the need for ioctl calls. Maybe we could do the same for sockets? When data is available on a socket (or when it becomes writable), write to a user memory location. I, too, have an interest in polling; in my situation most of the polling happens in userspace. You are trying to improve on the latency of non-blocking ppoll(2)/epoll_wait(2) call? Yes, but the concern is for throughput, not latency. My main loop looks like execute some tasks poll from many sources Since epoll_wait(..., 0) has non-trivial costs, I have to limit the polling rate, and even so it shows up in the profile. If the cost were lower, I could poll at a higher frequency, resulting in lower worst-case latencies for high-priority tasks.
Re: [Qemu-devel] Linux kernel polling for QEMU
On 11/29/2016 12:45 PM, Stefan Hajnoczi wrote: On Mon, Nov 28, 2016 at 04:41:13PM +0100, Paolo Bonzini wrote: On 28/11/2016 16:29, Stefan Hajnoczi wrote: Thanks for sharing the link. I'll let you know before embarking on an effort to make epoll support busy_loop. At the moment I'm still evaluating whether the good results we've gotten from polling in QEMU userspace are preserved when polling is shifted to the kernel. FWIW I've prototyped ioctl(EVENTFD_SET_POLL_INFO) but haven't had a chance to test it yet: https://github.com/stefanha/linux/commit/133e8f1da8eb5364cd5c5f7162decbc79175cd13 This would add a system call every time the main loop processes a vring, wouldn't it? Yes, this is a problem and is the reason I haven't finished implementing a test using QEMU yet. My proposed eventfd polling mechanism doesn't work well with descriptor ring indices because the polling info needs to be updated each event loop iteration with the last seen ring index. This can be solved by making struct eventfd_poll_info.value take a userspace memory address. The value to compare against is fetched each polling iteration, avoiding the need for ioctl calls. Maybe we could do the same for sockets? When data is available on a socket (or when it becomes writable), write to a user memory location. I, too, have an interest in polling; in my situation most of the polling happens in userspace.
Re: [Qemu-devel] [PATCH RFC v2 00/22] I/O prefetch cache
On 08/29/2016 08:09 PM, Pavel Butsykin wrote: The prefetch cache aims to improve the performance of sequential read data. Of most interest here are the requests of a small size of data for sequential read, such requests can be optimized by extending them and moving into the prefetch cache. However, there are 2 issues: - In aggregate only a small portion of requests is sequential, so delays caused by the need to read more volumes of data will lead to an overall decrease in performance. - The presence of redundant data in the cache memory with a large number of random requests. This pcache implementation solves the above and other problems prefetching data. The pcache algorithm can be summarised by the following main steps. 1. Monitor I/O requests to identify typical sequences. This implementation of prefetch cache works at the storage system level and has information only about the physical block addresses of I/O requests. Statistics are collected only from read requests to a maximum size of 32kb(by default), each request that matches the criteria falls into a pool of requests. In order to store requests statistic used by the rb-tree(lreq.tree), it's simple but for this issue a quite efficient data structure. 2. Identifying sequential I/O streams. For each read request to be carried out attempting to lift the chain sequence from lreq.tree, where this request will be element of a sequential chain of requests. The key to search for consecutive requests is the area of sectors preceding the current request. The size of this area should not be too small to avoid false readahead. The sequential stream data requests can be identified even when a large number of random requests. For example, if there is access to the blocks 100, 1157, 27520, 4, 101, 312, 1337, 102, in the context of request processing 102 will be identified the chain of sequential requests 100, 101. 102 and then should a decision be made to do readahead. Also a situation may arise when multiple applications A, B, C simultaneously perform sequential read of data. For each separate application that will be sequential read data A(100, 101, 102), B(300, 301, 302), C(700, 701, 702), but for block devices it may look like a random data reading: 100,300,700,101,301,701,102,302,702. In this case, the sequential streams will also be recognised because location requests in the rb-tree will allow to separate the sequential I/O streams. 3. Do the readahead into the cache for recognized sequential data streams. After the issue of the detection of pcache case was resolved, need using larger requests to bring data into the cache. In this implementation the pcache used readahead instead of the extension request, therefore the request goes as is. There is not any reason to put data in the cache that will never be picked up, but this will always happen in the case of extension requests. In order to store areas of cached blocks is also used by the rb-tree(pcache.tree), it's simple but for this issue a quite efficient data structure. 4. Control size of the prefetch cache pool and the requests statistic pool For control the border of the pool statistic of requests, the data of requests are placed and replaced according to the FIFO principle, everything is simple. For control the boundaries of the memory cache used LRU list, it allows to limit the max amount memory that we can allocate for pcache. But the LRU is there mainly to prevent displacement of the cache blocks that was read partially. The main way the memory is pushed out immediately after use, as soon as a chunk of memory from the cache has been completely read, since the probability of repetition of the request is very low. Cases when one and the same portion of the cache memory has been read several times are not optimized and do not apply to the cases that can optimize the pcache. Thus, using a cache memory of small volume, by the optimization of the operations read-ahead and clear memory, we can read entire volumes of data, providing a 100% cache hit. Also does not decrease the effectiveness of random read requests. PCache is implemented as a qemu block filter driver, has some configurable parameters, such as: total cache size, readahead size, maximum size of block that can be processed. For performance evaluation has been used several test cases with different sequential and random read data on SSD disk. Here are the results of tests and qemu parameters: qemu parameters: -M pc-i440fx-2.4 --enable-kvm -smp 4 -m 1024 -drive file=centos7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none, aio=native,pcache-full-size=4MB,pcache-readahead-size=128KB, pcache-max-aio-size=32KB -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0, id=virtio-disk0 (-set device.virtio-disk0.x-data-plane=on) * Testcase* Results in iops
Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU
On 05/27/2015 12:30 PM, Paolo Bonzini wrote: On 26/05/2015 23:25, Christopher Covington wrote: On 05/25/2015 08:53 AM, Paolo Bonzini wrote: On 22/05/2015 13:12, Daniel P. Berrange wrote: In particular I don't see why we need to have a SATA controller and ISA/LPC bridge in every virt machine - root PCI bus only should be possible, as you can provide disks via virtio-blk or virtio-scsi and serial, parallel, mouse, floppy via PCI devices and/or by adding a USB bus in the cases where you really need one. I think removing the ISA/LPC bridge is hard. It includes the real-time clock and fw_cfg, for example. Could VirtIO specified replacements make sense for these peripherals? Not really. virtio is too heavyweight and you'd be reinventing the wheel unnecessarily. For example, ARM's -M virt uses a pl011 block for the RTC, and also uses fw_cfg. Another commonly used ISA device is the UART, for which again -M virt uses a pl031. The RTC can be replaced by kvmclock, the keyboard by virtio-console. Maybe we can provide an msr- or pci- based interface to fw_cfg.
Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU
On 05/21/2015 07:21 PM, Paolo Bonzini wrote: On 21/05/2015 17:48, Avi Kivity wrote: Lovely! Note you have memcpy.o instead of memcpy.c. Doh, and it's not used anyway. Check the repository, and let me know if OSv boots with it (it probably needs ACPI; Linux doesn't boot virtio without ACPI). Yes, it requires ACPI. We don't implement the pre-ACPI bootstrap methods.
Re: [Qemu-devel] Announcing qboot, a minimal x86 firmware for QEMU
On 05/21/2015 04:51 PM, Paolo Bonzini wrote: Some of you may have heard about the Clear Containers initiative from Intel, which couple KVM with various kernel tricks to create extremely lightweight virtual machines. The experimental Clear Containers setup requires only 18-20 MB to launch a virtual machine, and needs about 60 ms to boot. Now, as all of you probably know, QEMU is great for running Windows or legacy Linux guests, but that flexibility comes at a hefty price. Not only does all of the emulation consume memory, it also requires some form of low-level firmware in the guest as well. All of this adds quite a bit to virtual-machine startup times (500 to 700 milliseconds is not unusual). Right? In fact, it's for this reason that Clear Containers uses kvmtool instead of QEMU. No, wrong! In fact, reporting bad performance is pretty much the same as throwing down the gauntlet. Enter qboot, a minimal x86 firmware that runs on QEMU and, together with a slimmed-down QEMU configuration, boots a virtual machine in 40 milliseconds[2] on an Ivy Bridge Core i7 processor. qboot is available at git://github.com/bonzini/qboot.git. In all the glory of its 8KB of code, it brings together various existing open source components: * a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS * a couple hardware initialization routines written mostly from scratch but with good help from SeaBIOS source code * a minimal 32-bit libc based on kvm-unit-tests * the Linux loader from QEMU itself The repository has more information on how to achieve fast boot times, and examples of using qboot. Right now there is a limit of 8 MB for vmlinuz+initrd+cmdline, which however should be enough for initrd-less containers. The first commit to qboot is more or less 24 hours old, so there is definitely more work to do, in particular to extract ACPI tables from QEMU and present them to the guest. This is probably another day of work or so, and it will enable multiprocessor guests with little or no impact on the boot times. SMBIOS information is also available from QEMU. On the QEMU side, there is no support yet for persistent memory and the NFIT tables from ACPI 6.0. Once that (and ACPI support) is added, qboot will automatically start using it. Happy hacking! Lovely! Note you have memcpy.o instead of memcpy.c.
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote: a single descriptor might use all of the virtqueue. In this case we wont to be able to pass the descriptor directly to linux as a single iov, since You could separate maximum request scatter/gather list size from the virtqueue size. They are totally unrelated - even now you can have a larger request by using indirect descriptors.
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 10/08/2014 12:15 PM, Michael S. Tsirkin wrote: On Wed, Oct 08, 2014 at 10:43:07AM +0300, Avi Kivity wrote: On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote: a single descriptor might use all of the virtqueue. In this case we wont to be able to pass the descriptor directly to linux as a single iov, since You could separate maximum request scatter/gather list size from the virtqueue size. They are totally unrelated - even now you can have a larger request by using indirect descriptors. We could add a feature to have a smaller or larger S/G length limit. Is this something useful? Having a larger ring size is useful, esp. with zero-copy transmit, and you would need the sglist length limit in order to not require linearization on linux hosts. So the limit is not useful in itself, only indirectly. Google cloud engine exposes virtio ring sizes of 16384. Even more useful is getting rid of the desc array and instead passing descs inline in avail and used.
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 10/08/2014 01:14 PM, Michael S. Tsirkin wrote: On Wed, Oct 08, 2014 at 12:51:21PM +0300, Avi Kivity wrote: On 10/08/2014 12:15 PM, Michael S. Tsirkin wrote: On Wed, Oct 08, 2014 at 10:43:07AM +0300, Avi Kivity wrote: On 09/30/2014 12:33 PM, Michael S. Tsirkin wrote: a single descriptor might use all of the virtqueue. In this case we wont to be able to pass the descriptor directly to linux as a single iov, since You could separate maximum request scatter/gather list size from the virtqueue size. They are totally unrelated - even now you can have a larger request by using indirect descriptors. We could add a feature to have a smaller or larger S/G length limit. Is this something useful? Having a larger ring size is useful, esp. with zero-copy transmit, and you would need the sglist length limit in order to not require linearization on linux hosts. So the limit is not useful in itself, only indirectly. Google cloud engine exposes virtio ring sizes of 16384. OK this sounds useful, I'll queue this up for consideration. Thanks! Thanks. Even more useful is getting rid of the desc array and instead passing descs inline in avail and used. You expect this to improve performance? Quite possibly but this will have to be demonstrated. The top vhost function in small packet workloads is vhost_get_vq_desc, and the top instruction within that (50%) is the one that reads the first 8 bytes of desc. It's a guaranteed cache line miss (and again on the guest side when it's time to reuse). Inline descriptors will amortize the cache miss over 4 descriptors, and will allow the hardware to prefetch, since the descriptors are linear in memory.
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 10/08/2014 01:55 PM, Michael S. Tsirkin wrote: Even more useful is getting rid of the desc array and instead passing descs inline in avail and used. You expect this to improve performance? Quite possibly but this will have to be demonstrated. The top vhost function in small packet workloads is vhost_get_vq_desc, and the top instruction within that (50%) is the one that reads the first 8 bytes of desc. It's a guaranteed cache line miss (and again on the guest side when it's time to reuse). OK so basically what you are pointing out is that we get 5 accesses: read of available head, read of available ring, read of descriptor, write of used ring, write of used ring head. Right. And only read of descriptor is not amortized. If processing is in-order, we could build a much simpler design, with a valid bit in the descriptor, cleared by host as descriptors are consumed. Basically get rid of both used and available ring. That only works if you don't allow reordering, which is never the case for block, and not the case for zero-copy net. It also has writers on both side of the ring. The right design is to keep avail and used, but instead of making them rings of pointers to descs, make them rings of descs. The host reads descs from avail, processes them, then writes them back on used (possibly out-of-order). The guest writes descs to avail and reads them back from used. You'll probably have to add a 64-bit cookie to desc so you can complete without an additional lookup. Sounds good in theory. Inline descriptors will amortize the cache miss over 4 descriptors, and will allow the hardware to prefetch, since the descriptors are linear in memory. If descriptors are used in order (as they are with current qemu) then aren't they amortized already?
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 10/08/2014 01:55 PM, Michael S. Tsirkin wrote: Inline descriptors will amortize the cache miss over 4 descriptors, and will allow the hardware to prefetch, since the descriptors are linear in memory. If descriptors are used in order (as they are with current qemu) then aren't they amortized already? The descriptors are only in-order for non-zero-copy net. They are out of order for block and zero-copy net. (also, the guest has to be careful in how it allocates descriptors).
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 10/08/2014 03:22 PM, Michael S. Tsirkin wrote: On Wed, Oct 08, 2014 at 01:59:13PM +0300, Avi Kivity wrote: On 10/08/2014 01:55 PM, Michael S. Tsirkin wrote: Even more useful is getting rid of the desc array and instead passing descs inline in avail and used. You expect this to improve performance? Quite possibly but this will have to be demonstrated. The top vhost function in small packet workloads is vhost_get_vq_desc, and the top instruction within that (50%) is the one that reads the first 8 bytes of desc. It's a guaranteed cache line miss (and again on the guest side when it's time to reuse). OK so basically what you are pointing out is that we get 5 accesses: read of available head, read of available ring, read of descriptor, write of used ring, write of used ring head. Right. And only read of descriptor is not amortized. If processing is in-order, we could build a much simpler design, with a valid bit in the descriptor, cleared by host as descriptors are consumed. Basically get rid of both used and available ring. That only works if you don't allow reordering, which is never the case for block, and not the case for zero-copy net. It also has writers on both side of the ring. The right design is to keep avail and used, but instead of making them rings of pointers to descs, make them rings of descs. The host reads descs from avail, processes them, then writes them back on used (possibly out-of-order). The guest writes descs to avail and reads them back from used. You'll probably have to add a 64-bit cookie to desc so you can complete without an additional lookup. My old presentation from 2012 or so suggested something like this. We don't need a 64 bit cookie I think - a small 16 bit one should be enough. A 16 bit cookie means you need an extra table to hold the real request pointers. With a 64-bit cookie you can store a pointer to the skbuff or bio in the ring itself, and avoid the extra lookup. The extra lookup isn't the end of the world, since doesn't cross core boundaries, but it's worth avoiding.
Re: [Qemu-devel] [QA-virtio]:Why vring size is limited to 1024?
On 10/08/2014 03:28 PM, Avi Kivity wrote: My old presentation from 2012 or so suggested something like this. We don't need a 64 bit cookie I think - a small 16 bit one should be enough. A 16 bit cookie means you need an extra table to hold the real request pointers. With a 64-bit cookie you can store a pointer to the skbuff or bio in the ring itself, and avoid the extra lookup. The extra lookup isn't the end of the world, since doesn't cross core boundaries, but it's worth avoiding. What you can do is have two types of descriptors: head and fragment union desc { struct head { u16 nfrags u16 flags u64 cookie } struct frag { u64 paddr u16 flen u16 flags } } so now a request length is 12*(nfrags+1). You can be evil and steal some bits from paddr/cookie, and have each descriptor 8 bytes long. btw, I also recommend storing things like vnet_hdr in the ring itself, instead of out-of-line in memory. Maybe the ring should just transport bytes and let the upper layer decide how it's formatted.
Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
On 06/26/2014 05:31 PM, Etienne Martineau wrote: On 14-06-26 04:18 AM, Avi Kivity wrote: On 06/25/2014 08:53 PM, Etienne Martineau wrote: Hi, It seems to me that there is a scale issue O(n) in memory_region_transaction_commit(). It's actually O(n^3). Flatview is kept sorted but is just a vector, so if you insert n regions, you have n^2 operations. In addition every PCI device has an address space, so we get n^3 (technically the third n is different from the first two, but they are related). The first problem can be solved by implementing Flatview with an std::set or equivalent, the second by memoization - most pci address spaces are equal (they only differ based on whether bus mastering is enabled or not), so a clever cache can reduce the effort to generate them. However I'm not at all sure that the problem is cpu time in qemu. It could be due to rcu_synchronize delays when the new memory maps are fed to kvm and vfio. I recommend trying to isolate exactly where the time is spent. It's seem like the linear increase in CPU time comes from QEMU ( at least from my measurements below) In those code paths QEMU calls back into KVM (KVM_SET_MEMORY_REGION) and vfio. So it would be good to understand exactly where the time is spent. I doubt it's computation (which is O(n^3), but very fast), instead it's likely waiting for something. In QEMU kvm_cpu_exec() I've added a hook that measure the time that is spent outside 'kvm_vcpu_ioctl(cpu, KVM_RUN, 0)' From the logs below this is QEMU long exit vCPU n x(msec) exit_reason' Similarly in KVM vcpu_enter_guest() I've added a new ftrace that measure the time spent outside 'kvm_x86_ops-run(vcpu)' From the logs below this is kvm_long_exit: x(msec)'. Please note that this is a trimmed down view of the real ftrace output. Also please note that the above hacks are useful ( at least to me since I haven't figured out a better way to do the same with existing ftrace ) to measure the RTT at both QEMU and KVM level. The time spent outside KVM 'kvm_x86_ops-run(vcpu)' will always be greater than the time spent outside QEMU 'kvm_vcpu_ioctl(cpu, KVM_RUN, 0)' for a given vCPU. Now the difference between the time spent outside KVM to the time spend outside QEMU ( for a given vCPU ) tells us who is burning cycle ( QEMU or KVM ) and how much ( in msec ) In the below experiment I've put side by side the QEMU and the KVM RTT time. We can see that the time to assign device ( same BAR size for all devices ) increase linearly ( like previously reported ). Also from the RTT measurement both QEMU and KVM are mostly within the same range suggesting that the increase comes from QEMU and not KVM. The one exception is that for every device assign there is a KVM operation that seems to be taking ~100msec each time. Since this is O(1) I'm not too concerned. device assign #1: device_add pci-assign,host=28:10.2,bus=pciehp.3.8 kvm_long_exit: 100 QEMU long exit vCPU 0 25 2kvm_long_exit: 26 QEMU long exit vCPU 0 20 2kvm_long_exit: 20 QEMU long exit vCPU 0 20 2kvm_long_exit: 20 QEMU long exit vCPU 0 20 2kvm_long_exit: 20 QEMU long exit vCPU 0 19 2kvm_long_exit: 19 QEMU long exit vCPU 0 19 2kvm_long_exit: 19 QEMU long exit vCPU 0 19 2kvm_long_exit: 20 QEMU long exit vCPU 0 19 2kvm_long_exit: 19 QEMU long exit vCPU 0 19 2kvm_long_exit: 19 QEMU long exit vCPU 0 19 2kvm_long_exit: 19 QEMU long exit vCPU 0 19 2kvm_long_exit: 20 QEMU long exit vCPU 0 42 2kvm_long_exit: 42 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 device assign #2: device_add pci-assign,host=28:10.3,bus=pciehp.3.9 kvm_long_exit: 101 QEMU long exit vCPU 0 25 2kvm_long_exit: 25 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 21 2kvm_long_exit: 21 QEMU long exit vCPU 0 45 2kvm_long_exit: 45 QEMU long exit vCPU 0 23 2kvm_long_exit: 23 device assign #3: device_add pci-assign,host=28:10.4,bus=pciehp.3.10 kvm_long_exit: 100 QEMU long exit vCPU 0 25 2kvm_long_exit: 25 QEMU long exit vCPU 0 23 2kvm_long_exit: 23 QEMU long exit vCPU 0 23 2kvm_long_exit: 23 QEMU long exit vCPU 0 23 2kvm_long_exit: 23 QEMU long exit vCPU 0 23 2kvm_long_exit: 23 QEMU long exit vCPU 0 23 2kvm_long_exit: 23 QEMU long exit vCPU 0 23 2
Re: [Qemu-devel] memory: memory_region_transaction_commit() slow
On 06/25/2014 08:53 PM, Etienne Martineau wrote: Hi, It seems to me that there is a scale issue O(n) in memory_region_transaction_commit(). It's actually O(n^3). Flatview is kept sorted but is just a vector, so if you insert n regions, you have n^2 operations. In addition every PCI device has an address space, so we get n^3 (technically the third n is different from the first two, but they are related). The first problem can be solved by implementing Flatview with an std::set or equivalent, the second by memoization - most pci address spaces are equal (they only differ based on whether bus mastering is enabled or not), so a clever cache can reduce the effort to generate them. However I'm not at all sure that the problem is cpu time in qemu. It could be due to rcu_synchronize delays when the new memory maps are fed to kvm and vfio. I recommend trying to isolate exactly where the time is spent. Basically the time it takes to rebuild the memory view during device assignment pci_bridge_update_mappings() increase linearly with respect to the number of device already assigned to the guest. I'm running on a recent qemu.git and I merged from git://github.com/bonzini/qemu.git memory the following patches that seems to be related to the scale issue I'm facing: Fam Zheng (1): memory: Don't call memory_region_update_coalesced_range if nothing changed Gonglei (1): memory: Don't update all memory region when ioeventfd changed Those patches help but don't fix the issue. The problem become more noticeable when lots of device are being assigned to the guest. I'm running my test on a QEMU q35 machine with the following topology: ioh3420 ( root port ) x3130-upstream xio3130-downstream xio3130-downstream xio3130-downstream ... I have added instrumentation in kvm_cpu_exec() to track to amount of time spend in the emulation ( patch at the end but not relevant for this discussion ) Here what I see when I assign device one after to other. NOTE the time-stamp is in msec. The linear increase in the time comes from memory_region_transaction_commit(). (qemu) device_add pci-assign,host=28:10.1,bus=pciehp.3.7 QEMU long exit vCPU 0 25 2 QEMU long exit vCPU 0 22 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 22 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 21 2 QEMU long exit vCPU 0 45 2 QEMU long exit vCPU 0 23 2 (qemu) device_add pci-assign,host=28:10.2,bus=pciehp.3.8 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 24 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 23 2 QEMU long exit vCPU 0 49 2 QEMU long exit vCPU 0 25 2 (qemu) device_add pci-assign,host=28:10.3,bus=pciehp.3.9 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 25 2 QEMU long exit vCPU 0 25 2 QEMU long exit vCPU 0 25 2 QEMU long exit vCPU 0 25 2 QEMU long exit vCPU 0 24 2 QEMU long exit vCPU 0 24 2 QEMU long exit vCPU 0 24 2 QEMU long exit vCPU 0 24 2 QEMU long exit vCPU 0 24 2 QEMU long exit vCPU 0 52 2 QEMU long exit vCPU 0 26 2 (qemu) device_add pci-assign,host=28:10.4,bus=pciehp.3.10 QEMU long exit vCPU 0 35 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 27 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 26 2 QEMU long exit vCPU 0 56 2 QEMU long exit vCPU 0 28 2 (qemu) device_add pci-assign,host=28:10.5,bus=pciehp.3.11 QEMU long exit vCPU 0 33 2 QEMU long exit vCPU 0 30 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 29 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 28 2 QEMU long exit vCPU 0 59 2 QEMU long exit vCPU 0 30 2 diff --git a/kvm-all.c b/kvm-all.c index 3ae30ee..e3a1964 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1685,6 +1685,8 @@ int kvm_cpu_exec(CPUState *cpu) { struct kvm_run *run = cpu-kvm_run; int ret, run_ret; +int64_t clock_ns, delta_ms; +__u32 last_exit_reason, last_vcpu; DPRINTF(kvm_cpu_exec()\n); @@ -1711,6 +1713,12 @@ int kvm_cpu_exec(CPUState *cpu) } qemu_mutex_unlock_iothread(); +delta_ms = (get_clock() - clock_ns)/(1000*1000); +if( delta_ms = 10){ +fprintf(stderr, QEMU long exit vCPU %d %ld %d\n,last_vcpu, +delta_ms, last_exit_reason); +} + run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0); qemu_mutex_lock_iothread(); @@ -1727,7 +1735,15 @@ int kvm_cpu_exec(CPUState
Re: [Qemu-devel] [Qemu-trivial] [PATCH 2/3] int128.h: Avoid undefined behaviours involving signed arithmetic
On 04/06/2014 01:18 PM, Peter Maydell wrote: On 6 April 2014 08:09, Michael Tokarev m...@tls.msk.ru wrote: 28.03.2014 19:12, Peter Maydell wrote: Add casts when we're performing arithmetic on the .hi parts of an Int128, to avoid undefined behaviour. [] static inline Int128 int128_sub(Int128 a, Int128 b) { -return (Int128){ a.lo - b.lo, a.hi - b.hi - (a.lo b.lo) }; +return (Int128){ a.lo - b.lo, (uint64_t)a.hi - b.hi - (a.lo b.lo) }; What was wrong with this one? I don't think casting to unsigned here is a good idea. This patch is fixing these three clang sanitizer warnings: /home/petmay01/linaro/qemu-from-laptop/qemu/include/qemu/int128.h:81:40: runtime error: signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long' /home/petmay01/linaro/qemu-from-laptop/qemu/include/qemu/int128.h:81:47: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long' /home/petmay01/linaro/qemu-from-laptop/qemu/include/qemu/int128.h:56:47: runtime error: left shift of negative value -9223372036854775807 of which the first two are in this function. Note that int128_add() already has a cast. The alternative would be to say that Int128 should have undefined behaviour on underflow/overflow and the test code is wrong, but that doesn't seem very useful to me. Isn't the test broken here? It is trying to add (or shift) -2^127 and something else, and the result truly overflows. A better behaviour would be to abort when this happens. Int128 was designed to avoid silent overflows, not to silently cause breakage. Not that I think it is necessary, there is no way for the guest to trigger an overflow.
Re: [Qemu-devel] [Qemu-trivial] [PATCH 2/3] int128.h: Avoid undefined behaviours involving signed arithmetic
On 04/07/2014 06:17 PM, Peter Maydell wrote: On 7 April 2014 15:56, Avi Kivity a...@cloudius-systems.com wrote: On 04/06/2014 01:18 PM, Peter Maydell wrote: The alternative would be to say that Int128 should have undefined behaviour on underflow/overflow and the test code is wrong, but that doesn't seem very useful to me. Isn't the test broken here? It is trying to add (or shift) -2^127 and something else, and the result truly overflows. Well, the test code is assuming semantics as per 2s complement arithmetic and checking various corner cases. As I say, we could define that this is invalid and rewrite the test cases. It is invalid. The test thinks that -2^127 * 2 == 0, but if a guest could trigger it, it would probably be a security issue.
Re: [Qemu-devel] KVM and variable-endianness guest CPUs
On 01/22/2014 12:22 PM, Peter Maydell wrote: On 22 January 2014 05:39, Victor Kamensky victor.kamen...@linaro.org wrote: Hi Guys, Christoffer and I had a bit heated chat :) on this subject last night. Christoffer, really appreciate your time! We did not really reach agreement during the chat and Christoffer asked me to follow up on this thread. Here it goes. Sorry, it is very long email. I don't believe we can assign any endianity to mmio.data[] byte array. I believe mmio.data[] and mmio.len acts just memcpy and that is all. As memcpy does not imply any endianity of underlying data mmio.data[] should not either. This email is about five times too long to be actually useful, but the major issue here is that the data being transferred is not just a bag of bytes. The data[] array plus the size field are being (mis)used to indicate that the memory transaction is one of: * an 8 bit access * a 16 bit access of some uint16_t value * a 32 bit access of some uint32_t value * a 64 bit access of some uint64_t value exactly as a CPU hardware bus would do. It's because the API is defined in this awkward way with a uint8_t[] array that we need to specify how both sides should go from the actual properties of the memory transaction (value and size) to filling in the array. That is not how x86 hardware works. Back when there was a bus, there were no address lines A0-A2; instead we had 8 byte enables BE0-BE7. A memory transaction placed the qword address on the address lines and asserted the byte enables for the appropriate byte, word, dword, or qword, shifted for the low order bits of the address. If you generated an unaligned access, the transaction was split into two, so an 8-byte write might appear as a 5-byte write followed by a 3-byte write. In fact, the two halves of the transaction might go to different devices, or one might go to a device and another to memory. PCI works the same way. Furthermore, device endianness is entirely irrelevant for deciding the properties of mmio.data[], because the thing we're modelling here is essentially the CPU-bus interface. In real hardware, the properties of individual devices on the bus are irrelevant to how the CPU's interface to the bus behaves, and similarly here the properties of emulated devices don't affect how KVM's interface to QEMU userspace needs to work. MemoryRegion's 'endianness' field, incidentally, is a dreadful mess that we should get rid of. It is attempting to model the property that some buses/bridges have of doing byte-lane-swaps on data that passes through as a property of the device itself. It would be better if we modelled it properly, with container regions having possible byte-swapping and devices just being devices. No, that is not what it is modelling. Suppose a little endian cpu writes a dword 0x12345678 to address 0 of a device, and read back a byte from address 0. What value do you read back? Some (most) devices will return 0x78, others will return 0x12. Other devices don't support mixed sizes at all, but many do. PCI configuration space is an example; it is common to read both Device ID and Vendor ID with a single 32-bit transaction, but you can also read them separately with two 16-bit transaction. Because PCI is little-endian, the Vendor ID at address 0 will be returned as the low word of the 32-bit read of a little-endian processor. If you remove device endianness from memory regions, you have to pass the data as arrays of bytes (like the KVM interface) and let the device assemble words from those bytes itself, taking into consideration its own endianness. What MemoryRegion's endianness does is let the device declare its endianness to the API and let it do all the work.
Re: [Qemu-devel] [Qemu-ppc] KVM and variable-endianness guest CPUs
On 01/28/2014 01:27 AM, Benjamin Herrenschmidt wrote: On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote: Basically if it would be on real bus, get byte value that corresponds to phys_addr + 0 address place it into data[0], get byte value that corresponds to phys_addr + 1 address place it into data[1], etc. This just isn't how real buses work. Actually it can be :-) There is no address + 1, address + 2. There is a single address for the memory transaction and a set of data on data lines and some separate size information. How the device at the far end of the bus chooses to respond to 32 bit accesses to address X versus 8 bit accesses to addresses X through X+3 is entirely its own business and unrelated to the CPU. However the bus has a definition of what byte lane is the lowest in address order. Byte order invariance is an important function of all busses. I think that trying to treat it any differently than an address ordered series of bytes is going to turn into a complete and inextricable mess. I agree. The two options are: (address, byte array, length) and (address, value, word size, endianness) the first is the KVM ABI, the second is how MemoryRegions work. Both are valid, but the first is more general (supports the 3-byte accesses sometimes generated on x86). (It would be perfectly possible to have a device which when you read from address X as 32 bits returned 0x12345678, when you read from address X as 16 bits returned 0x9abc, returned 0x42 for an 8 bit read from X+1, and so on. Having byte reads from X..X+3 return values corresponding to parts of the 32 bit access is purely a convention.) Right, it's possible. It's also stupid and not how most modern devices and busses work. Besides there is no reason why that can't be implemented with Victor proposal anyway. Right.
Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
On 01/14/2014 12:48 AM, Alex Williamson wrote: On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote: Am 13.01.2014 um 22:39 schrieb Alex Williamson alex.william...@redhat.com: On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote: On 12.01.2014, at 08:54, Michael S. Tsirkin m...@redhat.com wrote: On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote: On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote: On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote: On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote: On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote: On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote: On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote: On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote: On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote: From: Paolo Bonzini pbonz...@redhat.com As an alternative to commit 818f86b (exec: limit system memory size, 2013-11-04) let's just make all address spaces 64-bit wide. This eliminates problems with phys_page_find ignoring bits above TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal consequently messing up the computations. In Luiz's reported crash, at startup gdb attempts to read from address 0xffe6 to 0x inclusive. The region it gets is the newly introduced master abort region, which is as big as the PCI address space (see pci_bus_init). Due to a typo that's only 2^63-1, not 2^64. But we get it anyway because phys_page_find ignores the upper bits of the physical address. In address_space_translate_internal then diff = int128_sub(section-mr-size, int128_make64(addr)); *plen = int128_get64(int128_min(diff, int128_make64(*plen))); diff becomes negative, and int128_get64 booms. The size of the PCI address space region should be fixed anyway. Reported-by: Luiz Capitulino lcapitul...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- exec.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/exec.c b/exec.c index 7e5ce93..f907f5f 100644 --- a/exec.c +++ b/exec.c @@ -94,7 +94,7 @@ struct PhysPageEntry { #define PHYS_MAP_NODE_NIL (((uint32_t)~0) 6) /* Size of the L2 (and L3, etc) page tables. */ -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS +#define ADDR_SPACE_BITS 64 #define P_L2_BITS 10 #define P_L2_SIZE (1 P_L2_BITS) @@ -1861,11 +1861,7 @@ static void memory_map_init(void) { system_memory = g_malloc(sizeof(*system_memory)); -assert(ADDR_SPACE_BITS = 64); - -memory_region_init(system_memory, NULL, system, - ADDR_SPACE_BITS == 64 ? - UINT64_MAX : (0x1ULL ADDR_SPACE_BITS)); +memory_region_init(system_memory, NULL, system, UINT64_MAX); address_space_init(address_space_memory, system_memory, memory); system_io = g_malloc(sizeof(*system_io)); This seems to have some unexpected consequences around sizing 64bit PCI BARs that I'm not sure how to handle. BARs are often disabled during sizing. Maybe you don't detect BAR being disabled? See the trace below, the BARs are not disabled. QEMU pci-core is doing the sizing an memory region updates for the BARs, vfio is just a pass-through here. Sorry, not in the trace below, but yes the sizing seems to be happening while I/O memory are enabled int he command register. Thanks, Alex OK then from QEMU POV this BAR value is not special at all. Unfortunately After this patch I get vfio traces like this: vfio: vfio_pci_read_config(:01:10.0, @0x10, len=0x4) febe0004 (save lower 32bits of BAR) vfio: vfio_pci_write_config(:01:10.0, @0x10, 0x, len=0x4) (write mask to BAR) vfio: region_del febe - febe3fff (memory region gets unmapped) vfio: vfio_pci_read_config(:01:10.0, @0x10, len=0x4) c004 (read size mask) vfio: vfio_pci_write_config(:01:10.0, @0x10, 0xfebe0004, len=0x4) (restore BAR) vfio: region_add febe - febe3fff [0x7fcf3654d000] (memory region re-mapped) vfio: vfio_pci_read_config(:01:10.0, @0x14, len=0x4) 0 (save upper 32bits of BAR) vfio: vfio_pci_write_config(:01:10.0, @0x14, 0x, len=0x4) (write mask to BAR) vfio: region_del febe - febe3fff (memory region gets unmapped) vfio: region_add febe - febe3fff [0x7fcf3654d000] (memory region gets re-mapped with new address) qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfebe, 0x4000, 0x7fcf3654d000) = -14 (Bad address) (iommu barfs because it can only handle 48bit physical addresses) Why are you trying to program BAR addresses for dma in the iommu? Two reasons, first I can't tell the difference between RAM and MMIO. Why can't you? Generally memory core let you find out easily. My MemoryListener is setup for address_space_memory and I then filter out anything that's not memory_region_is_ram(). This still gets through, so how do I easily
Re: [Qemu-devel] outlined TLB lookup on x86
On 11/28/2013 04:12 AM, Richard Henderson wrote: 2. why not use a TLB or bigger size? currently the TLB has 18 entries. the TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i measured this using Intel PIN. so even the miss rate is low (say 3%) the overall time spent in the cpu_x86_handle_mmu_fault is still signifcant. I'd be interested to experiment with different TLB sizes, to see what effect that has on performance. But I suspect that lack of TLB contexts mean that we wind up flushing the TLB more often than real hardware does, and therefore a larger TLB merely takes longer to flush. You could use a generation counter to flush the TLB in O(1) by incrementing the counter. That slows down the fast path though. Maybe you can do that for the larger second level TLB only.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 11:53 AM, Paolo Bonzini wrote: Il 28/11/2013 10:49, Avi Kivity ha scritto: Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. call_rcu() has the problem of rate limiting, too. It wasn't such a great idea, I think. The QRCU I linked would work great latency-wise (it has roughly the same latency of an rwsem but readers are lock-free). However, the locked operations in the read path would hurt because of cache misses, so it's not good either. I guess srcu would work. Do you know what's the typical write-side latency there? (in terms of what it waits for, not nanoseconds).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 12:40 PM, Paolo Bonzini wrote: Il 28/11/2013 11:16, Avi Kivity ha scritto: The QRCU I linked would work great latency-wise (it has roughly the same latency of an rwsem but readers are lock-free). However, the locked operations in the read path would hurt because of cache misses, so it's not good either. I guess srcu would work. Do you know what's the typical write-side latency there? (in terms of what it waits for, not nanoseconds). If there's no concurrent reader, it's zero if I read the code right. Otherwise it depends: - if there are many callbacks, only 10 of them are processed per millisecond. But unless there are concurrent synchronize_srcu calls there should not be any callback at all. If all VCPUs were to furiously change the MSIs, the latency could go up to #vcpu/10 milliseconds. - if there are no callbacks, but there are readers, synchronize_srcu busy-loops for some time checking if the readers complete. After a while (20 us for synchronize_srcu, 120 us for synchronize_srcu_expedited) it gives up and starts using a workqueue to poll every millisecond. This should never happen unless So, given the very restricted usage this SRCU would have, we probably can expect synchronize_srcu_expedited to finish its job in the busy-looping phase, and 120 us should be the expected maximum latency---more likely to be an order of magnitude smaller, and in very rare cases higher. Looks like it's a good fit then.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:10 PM, Paolo Bonzini wrote: Il 28/11/2013 12:09, Gleb Natapov ha scritto: - if there are no callbacks, but there are readers, synchronize_srcu busy-loops for some time checking if the readers complete. After a while (20 us for synchronize_srcu, 120 us for synchronize_srcu_expedited) it gives up and starts using a workqueue to poll every millisecond. This should never happen unless Unless what ? :) Unless reader is scheduled out? Yes. Or unless my brain is scheduled out in the middle of a sentence. You need a grace period. Or just sleep().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:02 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote: On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? Shouldn't it work with old userspace too? Maybe I misunderstood your intent. Zhanghaoyu said that the problem mostly hurts in real-time telecom environment, so I propose how he can fix the problem in his specific environment. It will not fix older userspace obviously, but kernel fix will also require kernel update and usually updating userspace is easier. Isn't the latency due to interrupt migration causing long synchronize_rcu()s? How does KVM_SIGNAL_MSI help? The problem occurs with assigned devices too AFAICT.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/28/2013 01:22 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote: On 11/28/2013 01:02 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote: On 11/28/2013 12:11 PM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote: On 11/28/2013 11:19 AM, Gleb Natapov wrote: On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote: Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto: Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we use call_rcu()(Not consider the problem that Gleb pointed out temporarily) instead of synchronize_rcu(), should we still ensure this? The problem is that we should ensure this, so using call_rcu is not possible (even not considering the memory allocation problem). Not changing current behaviour is certainly safer, but I am still not 100% convinced we have to ensure this. Suppose guest does: 1: change msi interrupt by writing to pci register 2: read the pci register to flush the write 3: zero idt I am pretty certain that this code can get interrupt after step 2 on real HW, but I cannot tell if guest can rely on it to be delivered exactly after read instruction or it can be delayed by couple of instructions. Seems to me it would be fragile for an OS to depend on this behaviour. AFAIK Linux does not. Linux is safe, it does interrupt migration from within the interrupt handler. If you do that before the device-specific EOI, you won't get another interrupt until programming the MSI is complete. Is virtio safe? IIRC it can post multiple interrupts without guest acks. Using call_rcu() is a better solution than srcu IMO. Less code changes, consistently faster. Why not fix userspace to use KVM_SIGNAL_MSI instead? Shouldn't it work with old userspace too? Maybe I misunderstood your intent. Zhanghaoyu said that the problem mostly hurts in real-time telecom environment, so I propose how he can fix the problem in his specific environment. It will not fix older userspace obviously, but kernel fix will also require kernel update and usually updating userspace is easier. Isn't the latency due to interrupt migration causing long synchronize_rcu()s? How does KVM_SIGNAL_MSI help? If MSI is delivered using KVM_SIGNAL_MSI as opposite to via an entry in irq routing table changing MSI configuration should not cause update to irq routing table (not saying this is what happens with current QEMU, but theoretically there is not reason to update routing table in this case). I see. That pushes the problem to userspace, which uses traditional locking, so the problem disappears until qemu starts using rcu too to manage this. There is also irqfd, however. We could also do a KVM_UPDATE_IRQFD to change the payload it delivers, but that has exactly the same problems.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 2:47 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto: When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor to update the irq routing table, in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is blocked for so much time to wait RCU grace period, and during this period, this vcpu cannot provide service to VM, so those interrupts delivered to this vcpu cannot be handled in time, and the apps running on this vcpu cannot be serviced too. It's unacceptable in some real-time scenario, e.g. telecom. So, I want to create a single workqueue for each VM, to asynchronously performing the RCU synchronization for irq routing table, and let the vcpu thread return and VMENTRY to service VM immediately, no more need to blocked to wait RCU grace period. And, I have implemented a raw patch, took a test in our telecom environment, above problem disappeared. I don't think a workqueue is even needed. You just need to use call_rcu to free old after releasing kvm-irq_lock. What do you think? Can this cause an interrupt to be delivered to the wrong (old) vcpu? The way Linux sets interrupt affinity, it cannot, since changing the affinity is (IIRC) done in the interrupt handler, so the next interrupt cannot be in flight and thus pick up the old interrupt routing table. However it may be vulnerable in other ways.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 3:47 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 26/11/2013 14:18, Avi Kivity ha scritto: I don't think a workqueue is even needed. You just need to use call_rcu to free old after releasing kvm-irq_lock. What do you think? Can this cause an interrupt to be delivered to the wrong (old) vcpu? No, this would be exactly the same code that is running now: mutex_lock(kvm-irq_lock); old = kvm-irq_routing; kvm_irq_routing_update(kvm, new); mutex_unlock(kvm-irq_lock); synchronize_rcu(); kfree(old); return 0; Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 04:46 PM, Paolo Bonzini wrote: Il 26/11/2013 15:36, Avi Kivity ha scritto: No, this would be exactly the same code that is running now: mutex_lock(kvm-irq_lock); old = kvm-irq_routing; kvm_irq_routing_update(kvm, new); mutex_unlock(kvm-irq_lock); synchronize_rcu(); kfree(old); return 0; Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). It's another question whether the hardware provides the same guarantee. If you eliminate the synchronize_rcu, new interrupts would see the new routing table, while interrupts already in flight will get a dangling pointer. Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:03 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote: On 11/26/2013 04:46 PM, Paolo Bonzini wrote: Il 26/11/2013 15:36, Avi Kivity ha scritto: No, this would be exactly the same code that is running now: mutex_lock(kvm-irq_lock); old = kvm-irq_routing; kvm_irq_routing_update(kvm, new); mutex_unlock(kvm-irq_lock); synchronize_rcu(); kfree(old); return 0; Except that the kfree would run in the call_rcu kernel thread instead of the vcpu thread. But the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update. I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guaranty for other vcpus that do not call synchronize_rcu(). They may still use outdated routing table while a vcpu or iothread that performed table update sits in synchronize_rcu(). Consider this guest code: write msi entry, directing the interrupt away from this vcpu nop memset(idt, 0, sizeof(idt)); Currently, this code will never trigger a triple fault. With the change to call_rcu(), it may. Now it may be that the guest does not expect this to work (PCI writes are posted; and interrupts can be delayed indefinitely by the pci fabric), but we don't know if there's a path that guarantees the guest something that we're taking away with this change.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:20 PM, Paolo Bonzini wrote: Il 26/11/2013 16:03, Gleb Natapov ha scritto: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guaranty for other vcpus that do not call synchronize_rcu(). They may still use outdated routing table while a vcpu or iothread that performed table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. If we want to ensure, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:28 PM, Paolo Bonzini wrote: Il 26/11/2013 16:25, Avi Kivity ha scritto: If we want to ensure, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why?
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 05:58 PM, Paolo Bonzini wrote: Il 26/11/2013 16:35, Avi Kivity ha scritto: If we want to ensure, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. (BTW, PCI memory writes are posted, but configuration writes are not). MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. The fact that Linux does interrupt migration from within the interrupt handler also shows that someone else believes that it is the only safe place to do it.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote: On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote: On 11/26/2013 05:58 PM, Paolo Bonzini wrote: Il 26/11/2013 16:35, Avi Kivity ha scritto: If we want to ensure, we need to use a different mechanism for synchronization than the global RCU. QRCU would work; readers are not wait-free but only if there is a concurrent synchronize_qrcu, which should be rare. An alternative path is to convince ourselves that the hardware does not provide the guarantees that the current code provides, and so we can relax them. No, I think it's a reasonable guarantee to provide. Why? Because IIUC the semantics may depend not just on the interrupt controller, but also on the specific PCI device. It seems safer to assume that at least one device/driver pair wants this to work. It's indeed safe, but I think there's a nice win to be had if we drop the assumption. I'm not arguing with that, but a minor commoent below: (BTW, PCI memory writes are posted, but configuration writes are not). MSIs are configured via PCI memory writes. By itself, that doesn't buy us anything, since the guest could flush the write via a read. But I think the fact that the interrupt messages themselves are posted proves that it is safe. FYI, PCI read flushes the interrupt itself in, too. I guess that kills the optimization then. Maybe you can do qrcu, whatever that is.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:24 PM, Gleb Natapov wrote: On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote: Il 26/11/2013 16:03, Gleb Natapov ha scritto: I understood the proposal was also to eliminate the synchronize_rcu(), so while new interrupts would see the new routing table, interrupts already in flight could pick up the old one. Isn't that always the case with RCU? (See my answer above: the vcpus already see the new routing table after the rcu_assign_pointer that is in kvm_irq_routing_update). With synchronize_rcu(), you have the additional guarantee that any parallel accesses to the old routing table have completed. Since we also trigger the irq from rcu context, you know that after synchronize_rcu() you won't get any interrupts to the old destination (see kvm_set_irq_inatomic()). We do not have this guaranty for other vcpus that do not call synchronize_rcu(). They may still use outdated routing table while a vcpu or iothread that performed table update sits in synchronize_rcu(). Avi's point is that, after the VCPU resumes execution, you know that no interrupt will be sent to the old destination because kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU read-side critical section. Without synchronize_rcu you could have VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now: disable_irq() VCPU writes to routing table e = entry from IRQ routing table kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); kvm_irq_routing_update(kvm, new); synchronize_rcu() VCPU resumes execution enable_irq() receive stale irq Suppose the guest did not disable_irq() and enable_irq(), but instead had a pci read where you have the enable_irq(). After the read you cannot have a stale irq (assuming the read flushes the irq all the way to the APIC).
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On 11/26/2013 06:28 PM, Paolo Bonzini wrote: Il 26/11/2013 17:24, Gleb Natapov ha scritto: VCPU writes to routing table e = entry from IRQ routing table kvm_irq_routing_update(kvm, new); VCPU resumes execution kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); where the entry is stale but the VCPU has already resumed execution. So how is it different from what we have now: disable_irq() VCPU writes to routing table e = entry from IRQ routing table kvm_set_msi_irq(e, irq); kvm_irq_delivery_to_apic_fast(); kvm_irq_routing_update(kvm, new); synchronize_rcu() VCPU resumes execution enable_irq() receive stale irq Adding a disable/enable IRQs looks like a relatively big change. But perhaps it's not for some reason I'm missing. Those are guest operations, which may not be there at all.
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level?compression
On 11/17/2013 05:37 PM, Michael S. Tsirkin wrote: On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote: On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote: On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote: On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote: On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote: Michael S. Tsirkin mst at redhat.com writes: At the moment, memory radix tree is already variable width, but it can only skip the low bits of address. This is efficient if we have huge memory regions but inefficient if we are only using a tiny portion of the address space. After we have built up the map, detect configurations where a single L2 entry is valid. We then speed up the lookup by skipping one or more levels. In case any levels were skipped, we might end up in a valid section instead of erroring out. We handle this by checking that the address is in range of the resulting section. I think this is overkill. It can be done in a simpler way as follows: phys_page_find(RadixTree* tr, hwaddr index, ...) { if (index rt-invalid_index_mask) { // not found } lp = rt-root; for (i = rt-nb_levels - 1; i = 0 !lp.is_leaf; --i) { ... This exploits the fact the lower portion of the address space is always filled, at least in the cases that matter to us. Basically skip unused high bits? Sure. In fact I think both optimizations can be combined. Not much value in combining them -- the variable level tree check will be dominated by the level skip logic. IMO however skipping intermediate levels will be too rare to justify the complexity and the doubling of the page table size -- it can only happen in iommu setups that place memory in very high addresses. These ought to be rare. Well maybe not very high address, but you can have a device with a 64 bit bar and this will add back levels (though it would not be slower than it is currently). I agree the simplicity is attractive. However I really would like some logic that can handle 1 leaf somehow. Specifically both tricks break if we add a full 64 bit io region in order to return -1 for reads on master abort. That can be achieved by directing a missed lookup to a catch-all region. It would be best to do this completely internally to the memory code, without API changes. You mean e.g. add a catch-all region to the address space? The API already specifies a catch-all region - the container region catches everything if an access doesn't hit a sub-region (though I wanted to discourage the practice of using the same region for both I/O and a container), or you can have an I/O region as a subregion with the lowest priority, that spans the entire space, so that any access which misses the other sub-regions will hit the default subregion. So we don't need API changes, just to change the internals. I think this can be done by looking at the address space, and finding out what region is mapped to the very last byte in the address space, if any, and using that as the default if an access misses the tree. That is, this region will be the target of null pointers in the tree, or if (addr addr_invalid_mask) is true. This will ensure the tree is of the smallest possible size. Another option is to simply give up on the tree and use std::binary_search() on the FlatView. It will be more cpu intensive, but on the other hand more cache-friendly. tcg already caches stuff in the iotlb, and maybe it makes sense to add a cache for kvm lookups too (the advantage of the cache is that it only needs to contain addresses that are actually touched, so it can be very small and fast, compared to the radix tree which needs to cover all addresses).
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression
Michael S. Tsirkin mst at redhat.com writes: At the moment, memory radix tree is already variable width, but it can only skip the low bits of address. This is efficient if we have huge memory regions but inefficient if we are only using a tiny portion of the address space. After we have built up the map, detect configurations where a single L2 entry is valid. We then speed up the lookup by skipping one or more levels. In case any levels were skipped, we might end up in a valid section instead of erroring out. We handle this by checking that the address is in range of the resulting section. I think this is overkill. It can be done in a simpler way as follows: phys_page_find(RadixTree* tr, hwaddr index, ...) { if (index rt-invalid_index_mask) { // not found } lp = rt-root; for (i = rt-nb_levels - 1; i = 0 !lp.is_leaf; --i) { ... This exploits the fact the lower portion of the address space is always filled, at least in the cases that matter to us.
Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper
On 11/14/2013 03:29 PM, Stefan Hajnoczi wrote: On Tue, Nov 12, 2013 at 01:10:24PM +0200, Avi Kivity wrote: Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since it doesn't have the required capabilities. Fix by adding them. Up until now, downstreams had to make the bridge helper executable setuid, add the cap_net_admin capability, or they did nothing and it was broken ;-). CCing downstream package maintainers in case they have any comments on this patch. And it was, indeed, broken. Note: this may break installing as non-root. This is actually the right thing to do, since not setting up the capability would result in a broken setup. Perhaps we need a configure flag to disable helpers. Users who have been successfully installing QEMU would be upset if it suddenly starts failing after this patch. The bridge helper is a niche feature that shouldn't cause a regression for the majority of users who don't care about it. If we're installing non-root then the bridge helper simply shouldn't be installed. Or maybe installed without the capabilities? This way if the user invokes qemu with sudo, it still works as before.
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level?compression
On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote: On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote: Michael S. Tsirkin mst at redhat.com writes: At the moment, memory radix tree is already variable width, but it can only skip the low bits of address. This is efficient if we have huge memory regions but inefficient if we are only using a tiny portion of the address space. After we have built up the map, detect configurations where a single L2 entry is valid. We then speed up the lookup by skipping one or more levels. In case any levels were skipped, we might end up in a valid section instead of erroring out. We handle this by checking that the address is in range of the resulting section. I think this is overkill. It can be done in a simpler way as follows: phys_page_find(RadixTree* tr, hwaddr index, ...) { if (index rt-invalid_index_mask) { // not found } lp = rt-root; for (i = rt-nb_levels - 1; i = 0 !lp.is_leaf; --i) { ... This exploits the fact the lower portion of the address space is always filled, at least in the cases that matter to us. Basically skip unused high bits? Sure. In fact I think both optimizations can be combined. Not much value in combining them -- the variable level tree check will be dominated by the level skip logic. IMO however skipping intermediate levels will be too rare to justify the complexity and the doubling of the page table size -- it can only happen in iommu setups that place memory in very high addresses. These ought to be rare.
Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level?compression
On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote: On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote: On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote: On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote: Michael S. Tsirkin mst at redhat.com writes: At the moment, memory radix tree is already variable width, but it can only skip the low bits of address. This is efficient if we have huge memory regions but inefficient if we are only using a tiny portion of the address space. After we have built up the map, detect configurations where a single L2 entry is valid. We then speed up the lookup by skipping one or more levels. In case any levels were skipped, we might end up in a valid section instead of erroring out. We handle this by checking that the address is in range of the resulting section. I think this is overkill. It can be done in a simpler way as follows: phys_page_find(RadixTree* tr, hwaddr index, ...) { if (index rt-invalid_index_mask) { // not found } lp = rt-root; for (i = rt-nb_levels - 1; i = 0 !lp.is_leaf; --i) { ... This exploits the fact the lower portion of the address space is always filled, at least in the cases that matter to us. Basically skip unused high bits? Sure. In fact I think both optimizations can be combined. Not much value in combining them -- the variable level tree check will be dominated by the level skip logic. IMO however skipping intermediate levels will be too rare to justify the complexity and the doubling of the page table size -- it can only happen in iommu setups that place memory in very high addresses. These ought to be rare. Well maybe not very high address, but you can have a device with a 64 bit bar and this will add back levels (though it would not be slower than it is currently). I agree the simplicity is attractive. However I really would like some logic that can handle 1 leaf somehow. Specifically both tricks break if we add a full 64 bit io region in order to return -1 for reads on master abort. That can be achieved by directing a missed lookup to a catch-all region. It would be best to do this completely internally to the memory code, without API changes.
[Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper
Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since it doesn't have the required capabilities. Fix by adding them. Note: this may break installing as non-root. This is actually the right thing to do, since not setting up the capability would result in a broken setup. Perhaps we need a configure flag to disable helpers. Signed-off-by: Avi Kivity a...@cloudius-systems.com --- Makefile | 3 +++ rules.mak | 2 ++ 2 files changed, 5 insertions(+) diff --git a/Makefile b/Makefile index b15003f..af7748c 100644 --- a/Makefile +++ b/Makefile @@ -188,6 +188,7 @@ qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a +qemu-bridge-helper$(EXESUF).capabilities = cap_net_admin qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o fsdev/virtio-9p-marshal.o libqemuutil.a libqemustub.a @@ -345,6 +346,8 @@ endif ifneq ($(HELPERS-y),) $(INSTALL_DIR) $(DESTDIR)$(libexecdir) $(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) $(DESTDIR)$(libexecdir) + $(SETCAP) $(foreach helper, $(HELPERS-y), \ + $($(helper).capabilities)+pe $(DESTDIR)$(libexecdir)/$(helper)) endif ifneq ($(BLOBS),) set -e; for x in $(BLOBS); do \ diff --git a/rules.mak b/rules.mak index 49edb9b..9194c79 100644 --- a/rules.mak +++ b/rules.mak @@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach var,$(nested-vars),$(dir $($(var)) $(foreach var,$(nested-vars), $(eval \ -include $(addsuffix *.d, $(sort $(dir $($(var))) endef + +SETCAP = setcap \ No newline at end of file -- 1.8.3.1
Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper
On Nov 12, 2013 9:23 PM, Eric Blake ebl...@redhat.com wrote: On 11/12/2013 04:10 AM, Avi Kivity wrote: Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since it doesn't have the required capabilities. Fix by adding them. Note: this may break installing as non-root. This is actually the right thing to do, since not setting up the capability would result in a broken setup. Perhaps we need a configure flag to disable helpers. +++ b/rules.mak @@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach var,$(nested-vars),$(dir $($(var)) $(foreach var,$(nested-vars), $(eval \ -include $(addsuffix *.d, $(sort $(dir $($(var))) endef + +SETCAP = setcap \ No newline at end of file Is this intentional? No. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH] softfloat: rebase to version 2a
On Apr 29, 2013 9:06 PM, Anthony Liguori aligu...@us.ibm.com wrote: In order to make this change, we need to relicense all contributions from initial import of the SoftFloat code to match the license of SoftFloat-2a (instead of the implied SoftFloat-2b license). Acked-by: Avi Kivity avi.kiv...@gmail.com
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Tue, Feb 19, 2013 at 4:41 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Feb 14, 2013 at 08:23:04PM +0200, Avi Kivity wrote: On Thu, Feb 14, 2013 at 8:12 PM, Michael S. Tsirkin m...@redhat.com wrote: Is there an actual real problem that needs fixing? Yes. Guests sometimes cause device BARs to temporary overlap the APIC range during BAR sizing. It works fine on a physical system but fails on KVM since pci has same priority. See the report: [BUG] Guest OS hangs on boot when 64bit BAR present Is PCI_COMMAND_MEMORY set while this is going on? I think Linux never clears PCI_COMMAND_MEMORY because it's buggy in some devices. Ok. Then I recommend defining the MSI message area as overlapped with sufficient priority. It should probably be a child of the PCI address space. The IOAPIC is actually closer to ISA, but again it's sufficient to move it to the PCI address space. I doubt its priority matters.
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Tue, Feb 19, 2013 at 6:08 PM, Michael S. Tsirkin m...@redhat.com wrote: The IOAPIC is actually closer to ISA, but again it's sufficient to move it to the PCI address space. I doubt its priority matters. Well moving IOAPIC to PCI seems strange, it's not a PCI thing, and I think it can be moved outside PCI though guests don't do it. Look at the 440fx/piix datasheets. It's connected to the piix which decodes its address. So it's definitely part of the pci address space. So I think ideally we really should have it look something like: sysbus - ioapic - pci - msi
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Thu, Feb 14, 2013 at 3:09 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Feb 14, 2013 at 12:56:02PM +, Peter Maydell wrote: On 14 February 2013 12:45, Michael S. Tsirkin m...@redhat.com wrote: overlap flag in the region is currently unused, most devices have no idea whether their region overlaps with anything, so drop it, assume that all regions can overlap and always require priority. Devices themselves shouldn't care, for the most part -- they just provide a memory region and it's their parent that has to map it and know whether it overlaps or not. Similarly, parents should generally be in control of the container they're mapping the memory region into, and know whether it will be an overlapping map or not. It's also not clear how should devices allocate priorities. Up to the parent which controls the region being mapped into. We could just assume same priority as parent but what happens if it has to be different? Priority is only considered relative to siblings. The parent's priority is only considered wrt the parent's siblings, not its children. There are also aliases so a region can have multiple parents. Presumably it will have to have different priorities depending on what the parent does? The alias region has its own priority Here's a list of instances using priority != 0. hw/armv7m_nvic.c:MEMORY_PRIO_LOW); hw/cirrus_vga.c:MEMORY_PRIO_LOW); hw/cirrus_vga.c:MEMORY_PRIO_LOW); hw/cirrus_vga.c:s-low_mem_container, MEMORY_PRIO_LOW); hw/kvm/pci-assign.c: r_dev-mmio, MEMORY_PRIO_LOW); hw/kvmvapic.c:memory_region_add_subregion(as, rom_paddr, s-rom, MEMORY_PRIO_HIGH); hw/lpc_ich9.c:MEMORY_PRIO_LOW); hw/onenand.c:s-mapped_ram, MEMORY_PRIO_LOW); hw/pam.c:MEMORY_PRIO_LOW); hw/pc.c:MEMORY_PRIO_LOW); hw/pc_sysfw.c:isa_bios, MEMORY_PRIO_LOW); hw/pc_sysfw.c:isa_bios, MEMORY_PRIO_LOW); hw/pci/pci.c:MEMORY_PRIO_LOW); hw/pci/pci_bridge.c:memory_region_add_subregion(parent_space, base, alias, MEMORY_PRIO_LOW); hw/piix_pci.c:MEMORY_PRIO_LOW); hw/piix_pci.c:d-rcr_mem, MEMORY_PRIO_LOW); hw/q35.c:mch-smram_region, MEMORY_PRIO_LOW); hw/vga-isa.c:MEMORY_PRIO_LOW); hw/vga.c:MEMORY_PRIO_MEDIUM); hw/vga.c:vga_io_memory, MEMORY_PRIO_LOW); hw/xen_pt_msi.c:MEMORY_PRIO_MEDIUM); /* Priority: pci default + 1 Making priority relative to parent but not the same just seems like a recipe for disaster. I definitely don't like making the priority argument mandatory: this is just introducing pointless boilerplate for the common case where nothing overlaps and you know nothing overlaps. Non overlapping is not a common case at all. E.g. with normal PCI devices you have no way to know nothing overlaps - addresses are guest programmable. Non overlapping is mostly useful for embedded platforms.
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Thu, Feb 14, 2013 at 4:02 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Feb 14, 2013 at 01:22:18PM +, Peter Maydell wrote: On 14 February 2013 13:09, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Feb 14, 2013 at 12:56:02PM +, Peter Maydell wrote: Up to the parent which controls the region being mapped into. We could just assume same priority as parent Er, no. I mean the code in control of the parent MR sets the priority, when it calls memory_region_add_subregion_overlap(). but what happens if it has to be different? There are also aliases so a region can have multiple parents. The alias has its own priority. Well that's the status quo. One of the issues is, you have no idea what else uses each priority. With this change, at least you can grep for it. The question what priorities do aliases of this region have is not an interesting question. Priority is a local attribute, not an attribute of the region being prioritized. Presumably it will have to have different priorities depending on what the parent does? Here's a list of instances using priority != 0. hw/armv7m_nvic.c:MEMORY_PRIO_LOW); So this one I know about, and it's a good example of what I'm talking about. This function sets up a container memory region (nvic), and it is in total control of what is mapped into that container. Specifically, it puts in a nvic_sysregs background region which covers the whole 0x1000 size of the container (at an implicit priority of zero). It then layers over that an alias of the GIC registers (nvic-gic) at a specific address and with a priority of 1 so it appears above the background region. Nobody else ever puts anything in this container, so the only thing we care about is that the priority of the nvic-gic region is higher than that of the nvic_sysregs region; and it's clear from the code that we do that. Priority is a local question whose meaning is only relevant within a particular container region, not system-wide, and having system-wide MEMORY_PRIO_ defines obscures that IMHO. Well that's not how it seems to work, and I don't see how it *could* work. Imagine the specific example: ioapic and pci devices. ioapic has an address within the pci hole but it is not a subregion. If priority has no meaning how would you decide which one to use? Like PMM said. You look at the semantics of the hardware, and map that onto the API. If the pci controller says that BARs hide the ioapic, then you give them higher priority. If it says that the ioapic hides BARs, then that gets higher priority. If it doesn't say anything, take your pick (or give them the same priority). Also, on a PC many addresses are guest programmable. We need to behave in some defined way if guest programs addresses to something silly. That's why _overlap exists. The only reason it works sometimes is because some systems use fixes addresses which never overlap. That's why the no overlap API exists. I definitely don't like making the priority argument mandatory: this is just introducing pointless boilerplate for the common case where nothing overlaps and you know nothing overlaps. Non overlapping is not a common case at all. E.g. with normal PCI devices you have no way to know nothing overlaps - addresses are guest programmable. That means PCI is a special case :-) If the guest can program overlap then presumably PCI specifies semantics for what happens then, and there need to be PCI specific wrappers that enforce those semantics and they can call the relevant _overlap functions when mapping things. In any case this isn't a concern for the PCI *device*, which can just provide its memory regions. It's a problem the PCI *host adaptor* has to deal with when it's figuring out how to map those regions into the container which corresponds to its area of the address space. Issue is, a PCI device overlapping something else suddenly becomes this something else's problem. It is not a problem at all. Maybe we should take the printf() about subregion collisions in memory_region_add_subregion_common() out of the #if 0 that it currently sits in? This is just a debugging tool, it won't fix anything. It might tell us what bits of code are currently erroneously mapping regions that overlap without using the _overlap() function. Then we could fix them. -- PMM When there is a single guest programmable device, any address can be overlapped by it. No. Only addresses within the same container. Other containers work fine without overlap. We could invent rules like 'non overlappable is higher priority' but it seems completely arbitrary, a single priority is clearer. It's just noise for the xx% of cases which don't need it.
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Thu, Feb 14, 2013 at 4:40 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Feb 14, 2013 at 04:14:39PM +0200, Avi Kivity wrote: But some parents are system created and shared by many devices so children for such have no idea who their siblings are. Please take a look at the typical map in this mail: '[BUG] Guest OS hangs on boot when 64bit BAR present' system overlap 0 pri 0 [0x0 - 0x7fff] kvmvapic-rom overlap 1 pri 1000 [0xca000 - 0xcd000] pc.ram overlap 0 pri 0 [0xca000 - 0xcd000] ++ pc.ram [0xca000 - 0xcd000] is added to view smram-region overlap 1 pri 1 [0xa - 0xc] pci overlap 0 pri 0 [0xa - 0xc] cirrus-lowmem-container overlap 1 pri 1 [0xa - 0xc] cirrus-low-memory overlap 0 pri 0 [0xa - 0xc] ++cirrus-low-memory [0xa - 0xc] is added to view kvm-ioapic overlap 0 pri 0 [0xfec0 - 0xfec01000] ++kvm-ioapic [0xfec0 - 0xfec01000] is added to view pci-hole64 overlap 0 pri 0 [0x1 - 0x4001] pci overlap 0 pri 0 [0x1 - 0x4001] pci-hole overlap 0 pri 0 [0x7d00 - 0x1] pci overlap 0 pri 0 [0x7d00 - 0x1] ivshmem-bar2-container overlap 1 pri 1 [0xfe00 - 0x1] ivshmem.bar2 overlap 0 pri 0 [0xfe00 - 0x1] ++ivshmem.bar2 [0xfe00 - 0xfec0] is added to view ++ivshmem.bar2 [0xfec01000 - 0x1] is added to view As you see, ioapic at 0xfec0 overlaps pci-hole. ioapic is guest programmable in theory - should use _overlap? pci-hole is not but can overlap with ioapic. So also _overlap? It's a bug. The ioapic is in the pci address space, not the system address space. And yes it's overlappable. Let's imagine someone writes a guest programmable device for ARM. Now we should update all ARM devices from regular to _overlap? It's sufficient to update the programmable device. Non overlapping is not a common case at all. E.g. with normal PCI devices you have no way to know nothing overlaps - addresses are guest programmable. Non overlapping is mostly useful for embedded platforms. Maybe it should have a longer name like _nonoverlap then? Current API makes people assume _overlap is only for special cases and default should be non overlap. The assumption is correct.
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Thu, Feb 14, 2013 at 6:50 PM, Michael S. Tsirkin m...@redhat.com wrote: As you see, ioapic at 0xfec0 overlaps pci-hole. ioapic is guest programmable in theory - should use _overlap? pci-hole is not but can overlap with ioapic. So also _overlap? It's a bug. The ioapic is in the pci address space, not the system address space. And yes it's overlappable. So you want to put it where? Under pci-hole? No, under the pci address space. Look at the 440fx block diagram. And we'll have to teach all machine types creating pci-hole about it? No. Let's imagine someone writes a guest programmable device for ARM. Now we should update all ARM devices from regular to _overlap? It's sufficient to update the programmable device. Then the device can be higher priority (works for apic) but not lower priority. Make priority signed? Is there an actual real problem that needs fixing?
Re: [Qemu-devel] [PATCH RFC] memory: drop _overlap variant
On Thu, Feb 14, 2013 at 8:12 PM, Michael S. Tsirkin m...@redhat.com wrote: Is there an actual real problem that needs fixing? Yes. Guests sometimes cause device BARs to temporary overlap the APIC range during BAR sizing. It works fine on a physical system but fails on KVM since pci has same priority. See the report: [BUG] Guest OS hangs on boot when 64bit BAR present Is PCI_COMMAND_MEMORY set while this is going on?
Re: [Qemu-devel] RFC migration of zero pages
On Thu, Jan 31, 2013 at 4:30 PM, Gleb Natapov g...@redhat.com wrote: Avi/Michael do you remember why mincore can't be used to check if a guest page is unmapped? A page may be not in core, but also nonzero (for example swap). Yes, mincore() should not be used as zero page check, but it can be used as an indicator that page can be dealt with in dead stage of migration since it is likely zero and will not take much time to send. Or it's swapped, in which case we know nothing about it. It is possible to call madvise(MADV_WILLNEED) on them meanwhile to pre-load swap without faulting on each page individually. During migration you're faulting like mad anyway.
Re: [Qemu-devel] RFC migration of zero pages
On Jan 31, 2013 4:03 PM, Peter Lieven p...@dlhnet.de wrote: Am 31.01.2013 um 14:59 schrieb Avi Kivity avi.kiv...@gmail.com: On Jan 31, 2013 12:29 PM, Orit Wasserman owass...@redhat.com wrote: On 01/31/2013 11:48 AM, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 09:47:24AM +0200, Orit Wasserman wrote: On 01/31/2013 08:57 AM, Peter Lieven wrote: Hi, I just came across an idea and would like to have feedback if it makes sence or not. If a VM is started without preallocated memory all memory that has not been written to reads as zeros, right? Hi, No the memory will be unmapped (we allocate on demand). If a VM with a lot of unwritten memory is migrated or if the memory contains a lot of zeroed out memory (e.g. Windows or Linux guest with page sanitization) all this memory is allocated on the target during live migration. Especially with KSM this leads to the problem that this memory is allocated and might be not available completely as merging of the pages will happen async. Wouldn't it make sense to not send zero pages in the first round where the complete ram is sent (if it is detectable that we are in this stage)? We send one byte per zero page at the moment (see is_dup_page) we can further optimizing it by not sending it. I have to point out that this is a very idle guest and we need to work on a loaded guest which is the more hard problem in migration. Also I notice that the bottle neck in migrating unmapped pages is the detection of those pages because we map the pages in order to check them, for a large guest this is very expensive as mapping a page results in a page fault in the host. So what will be very helpful is actually locating those pages without mapping them which looks very complicated. What is wrong with mincore()? Avi/Michael do you remember why mincore can't be used to check if a guest page is unmapped? A page may be not in core, but also nonzero (for example swap). What about hugetables? Afaik, there are not swappable - at least they where somewhen in the past. Maybe it would be possible to have a performance improvement if they are used. Nobody uses them, now that transparent huge pages are available.
Re: [Qemu-devel] RFC migration of zero pages
On Thu, Jan 31, 2013 at 4:42 PM, Gleb Natapov g...@redhat.com wrote: On Thu, Jan 31, 2013 at 04:36:25PM +0200, Avi Kivity wrote: On Thu, Jan 31, 2013 at 4:30 PM, Gleb Natapov g...@redhat.com wrote: Avi/Michael do you remember why mincore can't be used to check if a guest page is unmapped? A page may be not in core, but also nonzero (for example swap). Yes, mincore() should not be used as zero page check, but it can be used as an indicator that page can be dealt with in dead stage of migration since it is likely zero and will not take much time to send. Or it's swapped, in which case we know nothing about it. It is Of course, that is why I said likely' :) possible to call madvise(MADV_WILLNEED) on them meanwhile to pre-load swap without faulting on each page individually. During migration you're faulting like mad anyway. That's guest faulting on dirty bit logging, as far as I understand Orit says that in addition to that she sees a lot of faults generated by migration reading unmapped guest memory. She wants to eliminate at least those. For that, we once talked about extending mincore() to return info whether a page is known zero (anonymous memory without a backing page, or backing page == zero page). But I doubt it's worth it, yes we're faulting a lot, but it's still a lot cheaper than actually sending a page, so we're still ahead of a non-idle guest.
Re: [Qemu-devel] RFC migration of zero pages
On Jan 31, 2013 12:29 PM, Orit Wasserman owass...@redhat.com wrote: On 01/31/2013 11:48 AM, Gleb Natapov wrote: On Thu, Jan 31, 2013 at 09:47:24AM +0200, Orit Wasserman wrote: On 01/31/2013 08:57 AM, Peter Lieven wrote: Hi, I just came across an idea and would like to have feedback if it makes sence or not. If a VM is started without preallocated memory all memory that has not been written to reads as zeros, right? Hi, No the memory will be unmapped (we allocate on demand). If a VM with a lot of unwritten memory is migrated or if the memory contains a lot of zeroed out memory (e.g. Windows or Linux guest with page sanitization) all this memory is allocated on the target during live migration. Especially with KSM this leads to the problem that this memory is allocated and might be not available completely as merging of the pages will happen async. Wouldn't it make sense to not send zero pages in the first round where the complete ram is sent (if it is detectable that we are in this stage)? We send one byte per zero page at the moment (see is_dup_page) we can further optimizing it by not sending it. I have to point out that this is a very idle guest and we need to work on a loaded guest which is the more hard problem in migration. Also I notice that the bottle neck in migrating unmapped pages is the detection of those pages because we map the pages in order to check them, for a large guest this is very expensive as mapping a page results in a page fault in the host. So what will be very helpful is actually locating those pages without mapping them which looks very complicated. What is wrong with mincore()? Avi/Michael do you remember why mincore can't be used to check if a guest page is unmapped? A page may be not in core, but also nonzero (for example swap).
Re: [Qemu-devel] RFC migration of zero pages
On Jan 31, 2013 11:14 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Jan 31, 2013 at 05:20:02PM +0200, Avi Kivity wrote: On Thu, Jan 31, 2013 at 4:42 PM, Gleb Natapov g...@redhat.com wrote: On Thu, Jan 31, 2013 at 04:36:25PM +0200, Avi Kivity wrote: On Thu, Jan 31, 2013 at 4:30 PM, Gleb Natapov g...@redhat.com wrote: Avi/Michael do you remember why mincore can't be used to check if a guest page is unmapped? A page may be not in core, but also nonzero (for example swap). Yes, mincore() should not be used as zero page check, but it can be used as an indicator that page can be dealt with in dead stage of migration since it is likely zero and will not take much time to send. Or it's swapped, in which case we know nothing about it. It is Of course, that is why I said likely' :) possible to call madvise(MADV_WILLNEED) on them meanwhile to pre-load swap without faulting on each page individually. During migration you're faulting like mad anyway. That's guest faulting on dirty bit logging, as far as I understand Orit says that in addition to that she sees a lot of faults generated by migration reading unmapped guest memory. She wants to eliminate at least those. For that, we once talked about extending mincore() to return info whether a page is known zero (anonymous memory without a backing page, or backing page == zero page). Shoudn't be hard to expose in /proc, no? Unfortunately. But I doubt it's worth it, yes we're faulting a lot, but it's still a lot cheaper than actually sending a page, so we're still ahead of a non-idle guest. It's not just fault, reading in the page is bad for the cache. There's just one zero page.
Re: [Qemu-devel] [PATCH 1.3] build: compile translate.o at -O1 optimization
On 11/27/2012 10:34 AM, Paolo Bonzini wrote: Some versions of GCC require insane (2GB) amounts of memory to compile translate.o. As a countermeasure, compile it with -O1. This should fix the buildbot failure for default_x86_64_fedora16. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- Makefile.target | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Makefile.target b/Makefile.target index 8b658c0..3981931 100644 --- a/Makefile.target +++ b/Makefile.target @@ -143,6 +143,8 @@ GENERATED_HEADERS += hmp-commands.h qmp-commands-old.h endif # CONFIG_SOFTMMU +%/translate.o: CFLAGS := $(patsubst -O2,-O1,$(CFLAGS)) + This may change some string argument in CFLAGS, for example an argument to -I. How about: CFLAGS_opt = -O2 CFLAGS += $(CFLAGS_opt) ... %/translate.o: CFLAGS_opt = -O1 -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration
On 09/19/2012 02:50 PM, Julien Grall wrote: This is the tenth version of patch series about ioport registration. Some part of QEMU still use register_ioport* functions to register ioport. These functions doesn't allow to use Memory Listener on it. Acked-by: Avi Kivity a...@redhat.com Unfortunately, I no longer have the time to maintain the memory patches. Anthony, please apply directly. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration
On 11/24/2012 03:09 AM, Julien Grall wrote: ping ? Sorry, I must have missed this (twice). -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions
On 11/05/2012 02:37 PM, Jan Kiszka wrote: As I noted, init and destroy cannot cause a topology update. Ah, right. Why are we wrapping them in transaction_begin/commit at all then? We aren't. void memory_region_destroy(MemoryRegion *mr) { assert(QTAILQ_EMPTY(mr-subregions)); assert(memory_region_transaction_depth == 0); -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions
On 11/25/2012 12:45 PM, Jan Kiszka wrote: On 2012-11-25 11:18, Avi Kivity wrote: On 11/05/2012 02:37 PM, Jan Kiszka wrote: As I noted, init and destroy cannot cause a topology update. Ah, right. Why are we wrapping them in transaction_begin/commit at all then? We aren't. void memory_region_destroy(MemoryRegion *mr) { assert(QTAILQ_EMPTY(mr-subregions)); assert(memory_region_transaction_depth == 0); We were talking about address_space_init/destroy. This is to force a re-rendering of the address space, so that listeners see the construction/destruction. Simply assigning as-root wouldn't do that. This kind of reliance on side effects should be documented with a comment (and forbidden to anything that is outside the implementation). My bad. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration
On 11/25/2012 02:42 PM, Andreas Färber wrote: Am 25.11.2012 10:10, schrieb Avi Kivity: On 09/19/2012 02:50 PM, Julien Grall wrote: This is the tenth version of patch series about ioport registration. Some part of QEMU still use register_ioport* functions to register ioport. These functions doesn't allow to use Memory Listener on it. Acked-by: Avi Kivity a...@redhat.com Unfortunately, I no longer have the time to maintain the memory patches. Anthony, please apply directly. Are they for 1.3? They are not. If not, I can offer to handle them since I have an upcoming qdev/QOM series that I would have to rebase on this. Great, thanks. Good luck with your new project, Avi! Thanks for that as well! -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH V10 0/8] memory: unify ioport registration
On 11/25/2012 05:01 PM, Andreas Färber wrote: Note that there are still some register_ioport_*() usages left: $ git grep register_ioport -- hw/ | grep -v isa_register_ioport | wc --lines 32 For the record, my plan for when that number drops to zero was to use address_space_rw() (or friends) directly from the ioport helpers or kvm code. There should not be an address_space_io for non-x86; the I/O address space would be just a memory region under a bridge somewhere. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [BUG] qemu crash when using lsilogic
On 11/20/2012 11:01 AM, Wanlong Gao wrote: $ git bisect good d22b096ef6e0b20810193b68a1d472f3fb8a4f9e is the first bad commit commit d22b096ef6e0b20810193b68a1d472f3fb8a4f9e Author: Avi Kivity a...@redhat.com Date: Sun Sep 30 22:21:11 2012 +0200 kvm: use separate MemoryListeners for memory and I/O The construct if (address_space == get_system_memory()) { // memory thing } else { // io thing } fails if we have more than two address spaces. Use a separate listener for memory and I/O, and utilize MemoryListener's address space filtering to fix this. Signed-off-by: Avi Kivity a...@redhat.com :100644 100644 92a71374ed1e040cef5ad70a6cb00adabf671dd4 c69e01200461c7a87440f7a915bd171a9fc8f318 M kvm-all.c Ops, I didn't find any error in above patch, can you guys help to investigate this bug? I confirmed again and found that lsi can't work on the upstream qemu. Any thoughts? I appears to be a bug in Seabios: (gdb) thread 3 [Switching to thread 3 (Thread 0x7fffebfff700 (LWP 19032))] #0 0x5586df2c in access_with_adjusted_size (addr=29563620, value=0x7fffebffe5f0, size=4, access_size_min=0, access_size_max=0, access= 0x5586ddbb memory_region_read_accessor, opaque=0x5655dd00) at /home/tlv/akivity/qemu/memory.c:349 349 { (gdb) bt #0 0x5586df2c in access_with_adjusted_size (addr=29563620, value=0x7fffebffe5f0, size=4, access_size_min=0, access_size_max=0, access= 0x5586ddbb memory_region_read_accessor, opaque=0x5655dd00) at /home/tlv/akivity/qemu/memory.c:349 #1 0x55870a4e in memory_region_dispatch_read1 (mr=0x5655dd00, addr=29563620, size=4) at /home/tlv/akivity/qemu/memory.c:862 #2 0x55870b3e in memory_region_dispatch_read (mr=0x5655dd00, addr=29563620, size=4) at /home/tlv/akivity/qemu/memory.c:894 #3 0x55873c2d in io_mem_read (mr=0x5655dd00, addr=29563620, size=4) at /home/tlv/akivity/qemu/memory.c:1575 #4 0x558054ed in address_space_rw (as=0x56629d78, addr=29563620, buf=0x7fffebffe874 , len=4, is_write=false) at /home/tlv/akivity/qemu/exec.c:3428 #5 0x556bf4c3 in dma_memory_rw_relaxed (dma=0x56603cf0, addr=29563620, buf=0x7fffebffe874, len=4, dir=DMA_DIRECTION_TO_DEVICE) at /home/tlv/akivity/qemu/dma.h:130 #6 0x556bf558 in dma_memory_rw (dma=0x56603cf0, addr=29563620, buf=0x7fffebffe874, len=4, dir=DMA_DIRECTION_TO_DEVICE) at /home/tlv/akivity/qemu/dma.h:156 #7 0x556bf5fb in pci_dma_rw (dev=0x56629b60, addr=29563620, buf=0x7fffebffe874, len=4, dir=DMA_DIRECTION_TO_DEVICE) at /home/tlv/akivity/qemu/hw/pci.h:607 #8 0x556bf65b in pci_dma_read (dev=0x56629b60, addr=29563620, buf=0x7fffebffe874, len=4) at /home/tlv/akivity/qemu/hw/pci.h:614 #9 0x556bfc08 in read_dword (s=0x56629b60, addr=29563620) at /home/tlv/akivity/qemu/hw/lsi53c895a.c:385 #10 0x556c1937 in lsi_execute_script (s=0x56629b60) at /home/tlv/akivity/qemu/hw/lsi53c895a.c:1040 #11 0x556c3c82 in lsi_reg_writeb (s=0x56629b60, offset=47, val=0 '\000') at /home/tlv/akivity/qemu/hw/lsi53c895a.c:1781 #12 0x556c513c in lsi_io_write (opaque=0x56629b60, addr=47, val=0, size=1) at /home/tlv/akivity/qemu/hw/lsi53c895a.c:1953 #13 0x5586def2 in memory_region_write_accessor (opaque=0x5662a340, addr=47, value=0x7fffebffeab0, size=1, shift=0, mask=255) at /home/tlv/akivity/qemu/memory.c:334 #14 0x5586dfd4 in access_with_adjusted_size (addr=47, value=0x7fffebffeab0, size=1, access_size_min=1, access_size_max=1, access=0x5586de6d memory_region_write_accessor, opaque=0x5662a340) at /home/tlv/akivity/qemu/memory.c:364 #15 0x5586e43c in memory_region_iorange_write (iorange=0x7fffe4000ed0, offset=47, width=1, data=0) at /home/tlv/akivity/qemu/memory.c:439 #16 0x55866acc in ioport_writeb_thunk (opaque=0x7fffe4000ed0, addr=49199, data=0) at /home/tlv/akivity/qemu/ioport.c:212 #17 0x558664a6 in ioport_write (index=0, address=49199, data=0) at /home/tlv/akivity/qemu/ioport.c:83 #18 0x55867046 in cpu_outb (addr=49199, val=0 '\000') at /home/tlv/akivity/qemu/ioport.c:289 #19 0x5586a958 in kvm_handle_io (port=49199, data=0x77ff3000, direction=1, size=1, count=1) at /home/tlv/akivity/qemu/kvm-all.c:1423 #20 0x5586af5e in kvm_cpu_exec (env=0x5659b0b0) at /home/tlv/akivity/qemu/kvm-all.c:1571 #21 0x557f74e4 in qemu_kvm_cpu_thread_fn (arg=0x5659b0b0) at /home/tlv/akivity/qemu/cpus.c:757 #22 0x76727d14 in start_thread () from /lib64/libpthread.so.0 #23 0x7533667d in clone () from /lib64/libc.so.6 (gdb) fr 4 #4 0x558054ed in address_space_rw (as=0x56629d78, addr=29563620, buf=0x7fffebffe874 , len=4, is_write=false) at /home/tlv/akivity/qemu/exec.c:3428 3428val = io_mem_read(section-mr, addr1, 4); (gdb) p
Re: [Qemu-devel] [PATCH v6 1/8] atomic: introduce atomic operations
On 11/15/2012 09:47 AM, liu ping fan wrote: On Wed, Nov 14, 2012 at 5:47 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 14/11/2012 10:38, liu ping fan ha scritto: On Tue, Nov 13, 2012 at 6:11 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 05/11/2012 06:38, Liu Ping Fan ha scritto: From: Liu Ping Fan pingf...@linux.vnet.ibm.com If out of global lock, we will be challenged by SMP in low level, so need atomic ops. This file is a wrapper of GCC atomic builtin. I still object to this. I know it enforces type-safety, but it is incomplete. It doesn't Although it is incomplete, but the rest cases are rarely used. Linux faces such issue, and the int version is enough, so I think we can borrow experience from there. One of the two places that use __sync_* require 64-bit accesses. My Yes, these two places are not easy to fix. Which shows that Linux's atomic_t is not suited for QEMU, in my opinion. RCU prototype required pointer-sized access, which you cannot make type- But I think that your RCU prototype should rely on atomic of CPU, not gcc‘s atomic. What's the difference? gcc's atomic produces the same instructions as hand-written assembly (or should). Probably I made a mistake here, in vhost, log = __sync_fetch_and_and(from, 0) is used to fetch 64bits atomically in the case 32bits qemu running on 64bits linux. Right? But how can we read 32bits twice in atomic? Seem that no instruction like _lock xchg for this ops. So I guess _sync_fetch_and_and() based on something like spinlock. And I think the broken issue is caused by vhost thread updates log, while qemu read out it not atomicly, Right? For the log, 32-bit sync_fetch_and_and() is sufficient. We only need to ensure no bits are lost, we don't need 64-bit atomicity. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH] dma: Define dma_context_memory and use in sysbus-ohci
On 11/13/2012 01:44 PM, Peter Maydell wrote: On 26 October 2012 17:00, Peter Maydell peter.mayd...@linaro.org wrote: On 26 October 2012 14:09, Paolo Bonzini pbonz...@redhat.com wrote: As soon as Avi's iommu patches go in, in fact, dma-as will just be as. Even if as == NULL were to be outlawed and you'd be forced to write get_address_space_memory(), taking the pain to create dummy DMAContexts now is just not worth it. Personally I think it's better not to permit NULL DMAContexts or AddressSpaces here, because they're kind of a hack (in the same way that the system address space is kind of a hack). In real hardware you probably aren't really doing dma to that address space but to some more local bus. dma_context_memory/address_space_memory are nice and easy to search for, whereas NULL isn't. [And in general I think NULL is too easy to use; if you have to go looking for the system dma context you've been prompted to think about whether that's the right one...] Ping! Can we have a ruling on what the right fix for this is so we can fix these segfaults before 1.3 release, please? I agree with you here. Callers should be fixed to supply the proper AddressSpace. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions
On 11/05/2012 08:26 AM, Jan Kiszka wrote: On 2012-11-04 20:21, Avi Kivity wrote: On 11/04/2012 10:30 AM, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com Cirrus is triggering this, e.g. during Win2k boot: Changes only on disabled regions require no topology update when transaction depth drops to 0 again. 817dcc5368988b0 (pci: give each device its own address space) mad this much worse by multiplying the number of address spaces. Each change is now evaluated N+2 times, where N is the number of PCI devices. It also causes a corresponding expansion in memory usage. I know... But this regression predates your changes, is already visible right after 02e2b95fb4. I want to address this by caching AddressSpaceDispatch trees with the key being the contents of the FlatView for that address space. This will drop the number of distinct trees to 2-4 (3 if some devices have PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different from the cpu memory address space) but will fail if we make each address space different (for example filtering out the device's own BARs). If this change also improves cpu usage sufficiently, then it will be better than your patch, which doesn't recognize changes in an enabled region inside a disabled or hidden region. True, though the question is how common such scenarios are. This one (cirrus with win2k) is already special. In other words, your patch fits the problem at hand but isn't general. On the other hand my approach doesn't eliminate render_memory_region(), just the exec.c stuff and listener updates. So we need to understand where the slowness comes from. I would just like to have some even intermediate solution for 1.3. We can still make it more perfect later on if required. I think we should apply a v2 then, the more general optimizations will take some time. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions
On 11/05/2012 10:51 AM, Jan Kiszka wrote: On 2012-11-05 09:12, Avi Kivity wrote: On 11/05/2012 08:26 AM, Jan Kiszka wrote: On 2012-11-04 20:21, Avi Kivity wrote: On 11/04/2012 10:30 AM, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com Cirrus is triggering this, e.g. during Win2k boot: Changes only on disabled regions require no topology update when transaction depth drops to 0 again. 817dcc5368988b0 (pci: give each device its own address space) mad this much worse by multiplying the number of address spaces. Each change is now evaluated N+2 times, where N is the number of PCI devices. It also causes a corresponding expansion in memory usage. I know... But this regression predates your changes, is already visible right after 02e2b95fb4. I want to address this by caching AddressSpaceDispatch trees with the key being the contents of the FlatView for that address space. This will drop the number of distinct trees to 2-4 (3 if some devices have PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different from the cpu memory address space) but will fail if we make each address space different (for example filtering out the device's own BARs). If this change also improves cpu usage sufficiently, then it will be better than your patch, which doesn't recognize changes in an enabled region inside a disabled or hidden region. True, though the question is how common such scenarios are. This one (cirrus with win2k) is already special. In other words, your patch fits the problem at hand but isn't general. On the other hand my approach doesn't eliminate render_memory_region(), just the exec.c stuff and listener updates. So we need to understand where the slowness comes from. I would just like to have some even intermediate solution for 1.3. We can still make it more perfect later on if required. I think we should apply a v2 then, the more general optimizations will take some time. OK - what should v2 do differently? As I noted, init and destroy cannot cause a topology update. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [patch v5 5/8] memory: introduce local lock for address space
On 11/02/2012 10:00 AM, Jan Kiszka wrote: As I understand the series, as-lock == NULL means that we will never take any lock during dispatch as the caller is not yet ready for fine-grained locking. This prevents the problem - for PIO at least. But this series should break TCG as it calls into MMIO dispatch from the VCPU while holding the BQL. What about add another condition dispatch_type == DISPATCH_MMIO to tell this situation. An alternative pattern that we will also use for core services is to provide an additional entry point, one that indicates that the caller doesn't hold the BQL. Then we will gradually move things over until the existing entry point is obsolete. I like it (or rather, least dislike it). So we'd have address_space_rw() and address_space_rw_unlocked() (or maybe, address_space_rw() and address_space_rw_locked(), to indicate the preferred way of doing things). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH] memory: Reintroduce dirty flag to optimize changes on disabled regions
On 11/04/2012 10:30 AM, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com Cirrus is triggering this, e.g. during Win2k boot: Changes only on disabled regions require no topology update when transaction depth drops to 0 again. 817dcc5368988b0 (pci: give each device its own address space) mad this much worse by multiplying the number of address spaces. Each change is now evaluated N+2 times, where N is the number of PCI devices. It also causes a corresponding expansion in memory usage. I want to address this by caching AddressSpaceDispatch trees with the key being the contents of the FlatView for that address space. This will drop the number of distinct trees to 2-4 (3 if some devices have PCI_COMMAND_MASTER disabled, 4 if the PCI address space is different from the cpu memory address space) but will fail if we make each address space different (for example filtering out the device's own BARs). If this change also improves cpu usage sufficiently, then it will be better than your patch, which doesn't recognize changes in an enabled region inside a disabled or hidden region. In other words, your patch fits the problem at hand but isn't general. On the other hand my approach doesn't eliminate render_memory_region(), just the exec.c stuff and listener updates. So we need to understand where the slowness comes from. @@ -1543,6 +1554,7 @@ void address_space_init(AddressSpace *as, MemoryRegion *root) flatview_init(as-current_map); QTAILQ_INSERT_TAIL(address_spaces, as, address_spaces_link); as-name = NULL; +memory_region_update_pending = true; memory_region_transaction_commit(); address_space_init_dispatch(as); } @@ -1552,6 +1564,7 @@ void address_space_destroy(AddressSpace *as) /* Flush out anything from MemoryListeners listening in on this */ memory_region_transaction_begin(); as-root = NULL; +memory_region_update_pending = true; memory_region_transaction_commit(); QTAILQ_REMOVE(address_spaces, as, address_spaces_link); address_space_destroy_dispatch(as); init and destroy cannot happen to a region that is mapped (and cannot happen within a transaction), so these two are redundant. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] [PATCH v2 3/7] memory: iommu support
On 10/31/2012 08:59 PM, Benjamin Herrenschmidt wrote: On Wed, 2012-10-31 at 12:32 +0200, Avi Kivity wrote: This has nothing to do with device endianness; we're translating from a byte buffer (address_space_rw()) to a uint64_t (MemoryRegionOps::read/write()) so we need to take host endianess into account. This code cleverly makes use of memory core endian support to do the conversion for us. But it's probably too clever and should be replaced by an explicit call to a helper. Well, the whole idea of providing sized-type accessors (u8/u16/...) is very fishy for DMA. I understand it comes from the MemoryRegion ops, and It made somewhat sense in CPU to device (though even then endianness had always gotten wrong) but for DMA it's going to be a can of worms. Endianness here has no effect on the result. An address_space_rw() causes a lookup of a memory region, which happens to be an iommu memory region. Because of the way MemoryRegionOps are done, it is converted to a 1/2/4/8 accessor, and then converted immediately back to a byte array. As long as we're consistent there's no change to the data path. However we do have a problem with non-1/2/4/8 byte writes. Right now any mismatched access ends up as an 8 byte write, we need an extra accessor for arbitrary writes, or rather better use of the .impl members of MemoryRegionOps. Taking host endianness into account means nothing if you don't define what endianness those are expected to use to write to memory. If you add yet another case of guest endianness in the mix, then you make yet another mistake akin to the virtio trainwreck since things like powerpc or ARM can operate in different endianness. The best thing to do is to forbid use of those functions :-) And basically mandate the use of endian explicit accessor that specifically indicate what endianness the value should have once written in target memory. The most sensible expectation when using the raw Memory ops is to have the value go as is ie in host endianness. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v2 3/7] memory: iommu support
On 11/01/2012 03:44 PM, Avi Kivity wrote: However we do have a problem with non-1/2/4/8 byte writes. Right now any mismatched access ends up as an 8 byte write, we need an extra accessor for arbitrary writes, or rather better use of the .impl members of MemoryRegionOps. Sorry, it's converted into a series of 8-bit writes. So the code is correct now, if inefficient. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [patch v5 5/8] memory: introduce local lock for address space
On 10/29/2012 11:46 AM, liu ping fan wrote: On Mon, Oct 29, 2012 at 5:32 PM, Avi Kivity a...@redhat.com wrote: On 10/29/2012 01:48 AM, Liu Ping Fan wrote: For those address spaces which want to be able out of big lock, they will be protected by their own local. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- memory.c | 11 ++- memory.h |5 - 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/memory.c b/memory.c index 2f68d67..ff34aed 100644 --- a/memory.c +++ b/memory.c @@ -1532,9 +1532,15 @@ void memory_listener_unregister(MemoryListener *listener) QTAILQ_REMOVE(memory_listeners, listener, link); } -void address_space_init(AddressSpace *as, MemoryRegion *root) +void address_space_init(AddressSpace *as, MemoryRegion *root, bool lock) Why not always use the lock? Even if the big lock is taken, it doesn't hurt. And eventually all address spaces will be fine-grained. I had thought only mmio is out of big lock's protection. While others address space will take extra expense. So leave them until they are ready to be out of big lock. The other address spaces are pio (which also needs fine-grained locking) and the dma address spaces (which are like address_space_memory, except they are accessed via DMA instead of from the vcpu). -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [patch v5 6/8] memory: make mmio dispatch able to be out of biglock
On 10/30/2012 09:06 AM, liu ping fan wrote: while (len 0) { page = addr TARGET_PAGE_MASK; l = (page + TARGET_PAGE_SIZE) - addr; if (l len) l = len; -section = phys_page_find(d, page TARGET_PAGE_BITS); + +if (as-lock) { +qemu_mutex_lock(as-lock); +safe_ref = memory_region_section_lookup_ref(d, page, obj_mrs); +qemu_mutex_unlock(as-lock); +if (!safe_ref) { +qemu_mutex_lock_iothread(); +qemu_mutex_lock(as-lock); +/* when 2nd try, mem map can change, need to judge it again */ +safe_ref = memory_region_section_lookup_ref(d, page, obj_mrs); +qemu_mutex_unlock(as-lock); +if (safe_ref) { +qemu_mutex_unlock_iothread(); +} +} +} else { +/* Caller hold the big lock */ +memory_region_section_lookup_ref(d, page, obj_mrs); It's not a property of the address space, it's a property of the caller. Sorry, what is your meaning? Whether the caller holds the big lock or not is not known by the address space. It is only known by the caller itself. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v2 3/7] memory: iommu support
On 10/30/2012 09:11 PM, Blue Swirl wrote: On Tue, Oct 30, 2012 at 11:47 AM, Avi Kivity a...@redhat.com wrote: Add a new memory region type that translates addresses it is given, then forwards them to a target address space. This is similar to an alias, except that the mapping is more flexible than a linear translation and trucation, and also less efficient since the translation happens at runtime. The implementation uses an AddressSpace mapping the target region to avoid hierarchical dispatch all the way to the resolved region; only iommu regions are looked up dynamically. + +static MemoryRegionOps memory_region_iommu_ops = { +.read = memory_region_iommu_read, +.write = memory_region_iommu_write, +#ifdef HOST_BIGENDIAN +.endianness = DEVICE_BIG_ENDIAN, Why couple this with host endianness? I'd expect IOMMU to operate at target bus endianness, for example LE for PCI on PPC guest. This has nothing to do with device endianness; we're translating from a byte buffer (address_space_rw()) to a uint64_t (MemoryRegionOps::read/write()) so we need to take host endianess into account. This code cleverly makes use of memory core endian support to do the conversion for us. But it's probably too clever and should be replaced by an explicit call to a helper. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v2 4/7] memory: provide a MemoryRegion for IOMMUs to log faults
On 10/30/2012 09:14 PM, Blue Swirl wrote: On Tue, Oct 30, 2012 at 11:47 AM, Avi Kivity a...@redhat.com wrote: Accesses which do not translate will hit the fault region, which can then log the access. Maybe special casing the fault region should be avoided, the IOMMU could set up suitable fault regions by itself. How can it? Please elaborate. I'm not too pleased with using a MemoryRegion for faults, perhaps a MemoryRegionIOMMUOps::fault() callback would be better. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v2 7/7] i440fx: add an iommu
On 10/30/2012 09:18 PM, Blue Swirl wrote: On Tue, Oct 30, 2012 at 11:47 AM, Avi Kivity a...@redhat.com wrote: This iommu encrypts addresses on the device bus to avoid divuling information to hackers equipped with bus analyzers. Following 3DES, addresses are encrypted multiple times. A XOR cypher is employed for efficiency. If this is not useful as a test device or other purposes, I'd drop it. It's just for demonstrating the API. I won't propose it for merging. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction
On 10/24/2012 10:06 AM, liu ping fan wrote: On Tue, Oct 23, 2012 at 8:25 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Sep 21, 2012 at 01:17:21PM +0200, Vasilis Liaskovitis wrote: +static void dimm_populate(DimmDevice *s) +{ +DeviceState *dev= (DeviceState*)s; +MemoryRegion *new = NULL; + +new = g_malloc(sizeof(MemoryRegion)); +memory_region_init_ram(new, dev-id, s-size); +vmstate_register_ram_global(new); +memory_region_add_subregion(get_system_memory(), s-start, new); +s-mr = new; +} + +static void dimm_depopulate(DimmDevice *s) +{ +assert(s); +vmstate_unregister_ram(s-mr, NULL); +memory_region_del_subregion(get_system_memory(), s-mr); +memory_region_destroy(s-mr); +s-mr = NULL; +} How is dimm hot unplug protected against callers who currently have RAM mapped (from cpu_physical_memory_map())? Emulated devices call cpu_physical_memory_map() directly or indirectly through DMA emulation code. The RAM pointer may be held for arbitrary lengths of time, across main loop iterations, etc. It's not clear to me that it is safe to unplug a DIMM that has network or disk I/O buffers, for example. We also need to be robust against malicious guests who abuse the hotplug lifecycle. QEMU should never be left with dangling pointers. Not sure about the block layer. But I think those thread are already out of big lock, so there should be a MemoryListener to catch the RAM-unplug event, and if needed, bdrv_flush. IMO we should use the same mechanism as proposed for other devices: address_space_map() should grab a reference on the dimm device, and address_space_unmap() can release it. This way device destruction will be deferred as soon as all devices complete I/O. We will have to be careful with network receive buffers though, since they can be held indefinitely. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH v3 00/19] ACPI memory hotplug
On 10/31/2012 12:58 PM, Stefan Hajnoczi wrote: On Fri, Sep 21, 2012 at 1:17 PM, Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com wrote: This is v3 of the ACPI memory hotplug functionality. Only x86_64 target is supported for now. Hi Vasilis, Regarding the hot unplug issue we've been discussing, it's possible to progress this patch series without fully solving that problem upfront. Karen Noel suggested that the series could be rolled without the hot unplug command, so that it's not possible to hit the unsafe case. This would allow users to hot plug additional memory. They would have to use virtio-balloon to reduce the memory footprint again. Later, when the memory region referencing issue has been solved the hot unplug command can be added. Just wanted to mention Karen's idea in case you feel stuck right now. We could introduce hotunplug as an experimental feature so people can test and play with it, and later graduate it to a fully supported feature. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH] Fix calculation of number of bits in the migration bitmap
On 10/31/2012 01:19 PM, Orit Wasserman wrote: The number of bits is off by one, for example if last_ram_offset is 0x1000 (the guest has one page) we get 0 bits instead of 1. Signed-off-by: Orit Wasserman owass...@redhat.com --- arch_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch_init.c b/arch_init.c index b75a4c5..a80c3c8 100644 --- a/arch_init.c +++ b/arch_init.c @@ -565,7 +565,7 @@ static void reset_ram_globals(void) static int ram_save_setup(QEMUFile *f, void *opaque) { RAMBlock *block; -int64_t ram_pages = last_ram_offset() TARGET_PAGE_BITS; +int64_t ram_pages = (last_ram_offset() TARGET_PAGE_BITS) + 1; The original code calculates 1 for your example case. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC PATCH v3 05/19] Implement dimm device abstraction
On 10/31/2012 02:18 PM, Stefan Hajnoczi wrote: IMO we should use the same mechanism as proposed for other devices: address_space_map() should grab a reference on the dimm device, and address_space_unmap() can release it. This way device destruction will be deferred as soon as all devices complete I/O. We will have to be careful with network receive buffers though, since they can be held indefinitely. Network receive buffers aren't mapped. Net receive is not zero-copy. For example, virtio-net does virtqueue_pop() inside virtio_net_receive(). I don't see a problem with networking. What about vhost-net? But that is managed separately with a MemoryListener. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v5 1/2] pl330: Initial version
On 10/29/2012 10:32 AM, Peter Maydell wrote: On 29 October 2012 06:35, Peter Crosthwaite peter.crosthwa...@xilinx.com wrote: Device model for Primecell PL330 dma controller. A general question -- this is a DMA controller so should it be using the DMAContext APIs now? Avi? IOMMU functionality will be folded into the memory core, so address_space_rw() and family. -- error compiling committee.c: too many arguments to function
[Qemu-devel] [PATCH v2 0/7] IOMMU support
Changes from RFC: - change -translate to return read/write permissions in IOTLBEntry (was: -translate received is_write parameter) - add support for iommu fault reporting Avi Kivity (7): memory: fix address space initialization/destruction memory: limit sections in the radix tree to the actual address space size memory: iommu support memory: provide a MemoryRegion for IOMMUs to log faults pci: use memory core for iommu support vfio: abort if an emulated iommu is used i440fx: add an iommu exec.c | 43 +--- hw/pci.c | 59 +--- hw/pci.h | 7 +++- hw/pci_internals.h | 5 ++- hw/piix_pci.c | 77 hw/spapr.h | 1 + hw/spapr_iommu.c | 45 + hw/spapr_pci.c | 27 +++-- hw/spapr_pci.h | 2 + hw/vfio_pci.c | 2 + memory.c | 112 + memory.h | 49 +++ 12 files changed, 366 insertions(+), 63 deletions(-) -- 1.7.12
[Qemu-devel] [PATCH v2 1/7] memory: fix address space initialization/destruction
A couple of fields were left uninitialized. This was not observed earlier because all address spaces were statically allocated. Also free allocation for those fields. Signed-off-by: Avi Kivity a...@redhat.com --- memory.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/memory.c b/memory.c index 243cb23..ae3552b 100644 --- a/memory.c +++ b/memory.c @@ -1541,6 +1541,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root) as-root = root; as-current_map = g_new(FlatView, 1); flatview_init(as-current_map); +as-ioeventfd_nb = 0; +as-ioeventfds = NULL; QTAILQ_INSERT_TAIL(address_spaces, as, address_spaces_link); as-name = NULL; memory_region_transaction_commit(); @@ -1557,6 +1559,7 @@ void address_space_destroy(AddressSpace *as) address_space_destroy_dispatch(as); flatview_destroy(as-current_map); g_free(as-current_map); +g_free(as-ioeventfds); } uint64_t io_mem_read(MemoryRegion *mr, hwaddr addr, unsigned size) -- 1.7.12
[Qemu-devel] [PATCH v2 5/7] pci: use memory core for iommu support
Use the new iommu support in the memory core for iommu support. The only user, spapr, is also converted. Signed-off-by: Avi Kivity a...@redhat.com --- hw/pci.c | 59 +- hw/pci.h | 7 +-- hw/pci_internals.h | 5 +++-- hw/spapr.h | 1 + hw/spapr_iommu.c | 45 + hw/spapr_pci.c | 27 + hw/spapr_pci.h | 2 ++ 7 files changed, 88 insertions(+), 58 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index dceda0b..5056ddb 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -274,6 +274,21 @@ int pci_find_domain(const PCIBus *bus) return -1; } +static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn) +{ +MemoryRegion *mr = g_new(MemoryRegion, 1); + +/* FIXME: inherit memory region from bus creator */ +memory_region_init_alias(mr, iommu-nop, get_system_memory(), 0, INT64_MAX); +return mr; +} + +static void pci_default_iommu_dtor(MemoryRegion *mr) +{ +memory_region_destroy(mr); +g_free(mr); +} + void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent, const char *name, MemoryRegion *address_space_mem, @@ -285,6 +300,7 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent, bus-devfn_min = devfn_min; bus-address_space_mem = address_space_mem; bus-address_space_io = address_space_io; +pci_setup_iommu(bus, pci_default_iommu, pci_default_iommu_dtor, NULL); /* host bridge */ QLIST_INIT(bus-child); @@ -775,21 +791,16 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, PCI_SLOT(devfn), PCI_FUNC(devfn), name, bus-devices[devfn]-name); return NULL; } + pci_dev-bus = bus; -if (bus-dma_context_fn) { -pci_dev-dma = bus-dma_context_fn(bus, bus-dma_context_opaque, devfn); -} else { -/* FIXME: Make dma_context_fn use MemoryRegions instead, so this path is - * taken unconditionally */ -/* FIXME: inherit memory region from bus creator */ -memory_region_init_alias(pci_dev-bus_master_enable_region, bus master, - get_system_memory(), 0, - memory_region_size(get_system_memory())); -memory_region_set_enabled(pci_dev-bus_master_enable_region, false); -address_space_init(pci_dev-bus_master_as, pci_dev-bus_master_enable_region); -pci_dev-dma = g_new(DMAContext, 1); -dma_context_init(pci_dev-dma, pci_dev-bus_master_as, NULL, NULL, NULL); -} +pci_dev-iommu = bus-iommu_fn(bus, bus-iommu_opaque, devfn); +memory_region_init_alias(pci_dev-bus_master_enable_region, bus master, + pci_dev-iommu, 0, memory_region_size(pci_dev-iommu)); +memory_region_set_enabled(pci_dev-bus_master_enable_region, false); +address_space_init(pci_dev-bus_master_as, pci_dev-bus_master_enable_region); +pci_dev-dma = g_new(DMAContext, 1); +dma_context_init(pci_dev-dma, pci_dev-bus_master_as, NULL, NULL, NULL); + pci_dev-devfn = devfn; pstrcpy(pci_dev-name, sizeof(pci_dev-name), name); pci_dev-irq_state = 0; @@ -843,12 +854,12 @@ static void do_pci_unregister_device(PCIDevice *pci_dev) pci_dev-bus-devices[pci_dev-devfn] = NULL; pci_config_free(pci_dev); -if (!pci_dev-bus-dma_context_fn) { -address_space_destroy(pci_dev-bus_master_as); -memory_region_destroy(pci_dev-bus_master_enable_region); -g_free(pci_dev-dma); -pci_dev-dma = NULL; -} +address_space_destroy(pci_dev-bus_master_as); +memory_region_del_subregion(pci_dev-bus_master_enable_region, pci_dev-iommu); +pci_dev-bus-iommu_dtor_fn(pci_dev-iommu); +memory_region_destroy(pci_dev-bus_master_enable_region); +g_free(pci_dev-dma); +pci_dev-dma = NULL; } static void pci_unregister_io_regions(PCIDevice *pci_dev) @@ -2140,10 +2151,12 @@ static void pci_device_class_init(ObjectClass *klass, void *data) k-props = pci_props; } -void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque) +void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor, + void *opaque) { -bus-dma_context_fn = fn; -bus-dma_context_opaque = opaque; +bus-iommu_fn = fn; +bus-iommu_dtor_fn = dtor; +bus-iommu_opaque = opaque; } static TypeInfo pci_device_type_info = { diff --git a/hw/pci.h b/hw/pci.h index 241c1d8..c3dd282 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -213,6 +213,7 @@ struct PCIDevice { PCIIORegion io_regions[PCI_NUM_REGIONS]; AddressSpace bus_master_as; MemoryRegion bus_master_enable_region; +MemoryRegion *iommu; DMAContext *dma; /* do not access the following fields */ @@ -357,9 +358,11 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp, void
[Qemu-devel] [PATCH v2 7/7] i440fx: add an iommu
This iommu encrypts addresses on the device bus to avoid divuling information to hackers equipped with bus analyzers. Following 3DES, addresses are encrypted multiple times. A XOR cypher is employed for efficiency. Signed-off-by: Avi Kivity a...@redhat.com --- hw/piix_pci.c | 77 +++ 1 file changed, 77 insertions(+) diff --git a/hw/piix_pci.c b/hw/piix_pci.c index 9af5847..99601f4 100644 --- a/hw/piix_pci.c +++ b/hw/piix_pci.c @@ -30,6 +30,7 @@ #include sysbus.h #include range.h #include xen.h +#include exec-memory.h /* * I440FX chipset data sheet. @@ -248,6 +249,81 @@ static int i440fx_initfn(PCIDevice *dev) return 0; } +typedef struct SillyIOMMU SillyIOMMU; + +struct SillyIOMMU { +MemoryRegion fault; +MemoryRegion l1; +MemoryRegion l2; +hwaddr mask; +hwaddr secret; +}; + +static IOMMUTLBEntry silly_l1_translate(MemoryRegion *l1, hwaddr addr) +{ +SillyIOMMU *s = container_of(l1, SillyIOMMU, l1); +hwaddr xlat = addr ^ s-secret; + +printf(l1: % HWADDR_PRIx - % HWADDR_PRIx \n, addr, xlat); + +return (IOMMUTLBEntry) { +.device_addr = addr ~s-mask, +.translated_addr = xlat ~s-mask, +.addr_mask = s-mask, +.perm = { true, true }, +}; +} + +static MemoryRegionIOMMUOps silly_l1_iommu_ops = { +.translate = silly_l1_translate, +}; + +static IOMMUTLBEntry silly_l2_translate(MemoryRegion *l2, hwaddr addr) +{ +SillyIOMMU *s = container_of(l2, SillyIOMMU, l2); +hwaddr xlat = addr ^ s-secret; + +printf(l2: % HWADDR_PRIx - % HWADDR_PRIx \n, addr, xlat); + +return (IOMMUTLBEntry) { +.device_addr = addr ~s-mask, +.translated_addr = xlat ~s-mask, +.addr_mask = s-mask, +.perm = { true, true }, +}; +} + +static MemoryRegionIOMMUOps silly_l2_iommu_ops = { +.translate = silly_l2_translate, +}; + +static MemoryRegion *silly_iommu_new(PCIBus *bus, void *opaque, int devfn) +{ +SillyIOMMU *s = g_new(SillyIOMMU, 1); +MemoryRegion *sysmem = get_system_memory(); + +s-mask = (0x1000 (devfn 3)) - 1; +s-secret = ((devfn 24) | 0x00aabbccdd) ~s-mask; +memory_region_init(s-fault, silly-fault, INT64_MAX); +memory_region_init_iommu(s-l2, silly_l2_iommu_ops, sysmem, s-fault, + silly-l2, INT64_MAX); +memory_region_init_iommu(s-l1, silly_l1_iommu_ops, s-l2, s-fault, + silly-l1, INT64_MAX); +return s-l1; +} + +static void silly_iommu_del(MemoryRegion *l1) +{ +SillyIOMMU *s = container_of(l1, SillyIOMMU, l1); + +memory_region_del_subregion(s-l2, get_system_memory()); +memory_region_del_subregion(s-l1, s-l2); +memory_region_destroy(s-l2); +memory_region_destroy(s-l1); +memory_region_destroy(s-fault); +g_free(s); +} + static PCIBus *i440fx_common_init(const char *device_name, PCII440FXState **pi440fx_state, int *piix3_devfn, @@ -275,6 +351,7 @@ static PCIBus *i440fx_common_init(const char *device_name, s-address_space = address_space_mem; b = pci_bus_new(dev, NULL, pci_address_space, address_space_io, 0); +pci_setup_iommu(b, silly_iommu_new, silly_iommu_del, NULL); s-bus = b; object_property_add_child(qdev_get_machine(), i440fx, OBJECT(dev), NULL); qdev_init_nofail(dev); -- 1.7.12
[Qemu-devel] [PATCH v2 2/7] memory: limit sections in the radix tree to the actual address space size
The radix tree is statically sized to fit TARGET_PHYS_ADDR_SPACE_BITS. If a larger memory region is registered, it will overflow. Fix by limiting any section in the radix tree to the supported size. This problem was not observed earlier since artificial regions (containers and aliases) are eliminated by the memory core, leaving only device regions which have reasonable sizes. An IOMMU however cannot be eliminated by the memory core, and may have an artificial size. Signed-off-by: Avi Kivity a...@redhat.com --- exec.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/exec.c b/exec.c index b0ed593..deee8ec 100644 --- a/exec.c +++ b/exec.c @@ -2280,10 +2280,23 @@ static void register_multipage(AddressSpaceDispatch *d, MemoryRegionSection *sec section_index); } +static MemoryRegionSection limit(MemoryRegionSection section) +{ +unsigned practical_as_bits = MIN(TARGET_PHYS_ADDR_SPACE_BITS, 62); +hwaddr as_limit; + +as_limit = (hwaddr)1 practical_as_bits; + +section.size = MIN(section.offset_within_address_space + section.size, as_limit) + - section.offset_within_address_space; + +return section; +} + static void mem_add(MemoryListener *listener, MemoryRegionSection *section) { AddressSpaceDispatch *d = container_of(listener, AddressSpaceDispatch, listener); -MemoryRegionSection now = *section, remain = *section; +MemoryRegionSection now = limit(*section), remain = limit(*section); if ((now.offset_within_address_space ~TARGET_PAGE_MASK) || (now.size TARGET_PAGE_SIZE)) { -- 1.7.12
[Qemu-devel] [PATCH v2 6/7] vfio: abort if an emulated iommu is used
vfio doesn't support guest iommus yet, indicate it to the user by gently depositing a core on their disk. Signed-off-by: Avi Kivity a...@redhat.com --- hw/vfio_pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 0473ae8..bd7a075 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -943,6 +943,8 @@ static void vfio_listener_region_add(MemoryListener *listener, void *vaddr; int ret; +assert(!memory_region_is_iommu(section-mr)); + if (vfio_listener_skipped_section(section)) { DPRINTF(vfio: SKIPPING region_add %HWADDR_PRIx - %PRIx64\n, section-offset_within_address_space, -- 1.7.12
[Qemu-devel] [PATCH v2 3/7] memory: iommu support
Add a new memory region type that translates addresses it is given, then forwards them to a target address space. This is similar to an alias, except that the mapping is more flexible than a linear translation and trucation, and also less efficient since the translation happens at runtime. The implementation uses an AddressSpace mapping the target region to avoid hierarchical dispatch all the way to the resolved region; only iommu regions are looked up dynamically. Signed-off-by: Avi Kivity a...@redhat.com --- exec.c | 28 ++--- memory.c | 106 +++ memory.h | 45 +++ 3 files changed, 175 insertions(+), 4 deletions(-) diff --git a/exec.c b/exec.c index deee8ec..c3914fc 100644 --- a/exec.c +++ b/exec.c @@ -3507,6 +3507,7 @@ void cpu_physical_memory_write_rom(hwaddr addr, typedef struct { void *buffer; +AddressSpace *as; hwaddr addr; hwaddr len; } BounceBuffer; @@ -3572,23 +3573,42 @@ void *address_space_map(AddressSpace *as, ram_addr_t raddr = RAM_ADDR_MAX; ram_addr_t rlen; void *ret; +IOMMUTLBEntry iotlb; +hwaddr xlat; +AddressSpace *as_xlat; while (len 0) { +xlat = addr; +as_xlat = as; page = addr TARGET_PAGE_MASK; l = (page + TARGET_PAGE_SIZE) - addr; if (l len) l = len; section = phys_page_find(d, page TARGET_PAGE_BITS); +while (section-mr-iommu_ops) { +iotlb = section-mr-iommu_ops-translate(section-mr, xlat); +if (iotlb.perm[is_write]) { +xlat = ((iotlb.translated_addr ~iotlb.addr_mask) +| (addr iotlb.addr_mask)); +as_xlat = section-mr-iommu_target_as; +l = (MIN(xlat + l - 1, xlat | iotlb.addr_mask) - xlat) + 1; +section = phys_page_find(as_xlat-dispatch, xlat TARGET_PAGE_BITS); +} else { +section = phys_sections[phys_section_unassigned]; +} +} + if (!(memory_region_is_ram(section-mr) !section-readonly)) { if (todo || bounce.buffer) { break; } bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE); -bounce.addr = addr; +bounce.addr = xlat; +bounce.as = as_xlat; bounce.len = l; if (!is_write) { -address_space_read(as, addr, bounce.buffer, l); +address_space_read(as_xlat, xlat, bounce.buffer, l); } *plen = l; @@ -3596,7 +3616,7 @@ void *address_space_map(AddressSpace *as, } if (!todo) { raddr = memory_region_get_ram_addr(section-mr) -+ memory_region_section_addr(section, addr); ++ memory_region_section_addr(section, xlat); } len -= l; @@ -3635,7 +3655,7 @@ void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len, return; } if (is_write) { -address_space_write(as, bounce.addr, bounce.buffer, access_len); +address_space_write(bounce.as, bounce.addr, bounce.buffer, access_len); } qemu_vfree(bounce.buffer); bounce.buffer = NULL; diff --git a/memory.c b/memory.c index ae3552b..ba2d4a0 100644 --- a/memory.c +++ b/memory.c @@ -775,6 +775,12 @@ static void memory_region_destructor_rom_device(MemoryRegion *mr) qemu_ram_free(mr-ram_addr TARGET_PAGE_MASK); } +static void memory_region_destructor_iommu(MemoryRegion *mr) +{ +address_space_destroy(mr-iommu_target_as); +g_free(mr-iommu_target_as); +} + static bool memory_region_wrong_endianness(MemoryRegion *mr) { #ifdef TARGET_WORDS_BIGENDIAN @@ -789,6 +795,7 @@ void memory_region_init(MemoryRegion *mr, uint64_t size) { mr-ops = NULL; +mr-iommu_ops = NULL; mr-parent = NULL; mr-size = int128_make64(size); if (size == UINT64_MAX) { @@ -980,6 +987,100 @@ void memory_region_init_rom_device(MemoryRegion *mr, mr-ram_addr = qemu_ram_alloc(size, mr); } +static void memory_region_iommu_rw(MemoryRegion *iommu, hwaddr addr, + uint8_t *buf, unsigned len, bool is_write) +{ +IOMMUTLBEntry tlb; +unsigned clen; +hwaddr xlat; + +while (len) { +tlb = iommu-iommu_ops-translate(iommu, addr); +clen = (MIN(addr | tlb.addr_mask, addr + len - 1) - addr) + 1; +if (tlb.perm[is_write]) { +xlat = (tlb.translated_addr ~tlb.addr_mask) | (addr tlb.addr_mask); +address_space_rw(iommu-iommu_target_as, xlat, buf, clen, is_write); +} else { +if (!is_write) { +memset(buf, 0xff, clen); +} +} +buf += clen; +addr += clen; +len -= clen; +} +} + +static uint64_t memory_region_iommu_read(void *opaque, hwaddr addr
[Qemu-devel] [PATCH v2 4/7] memory: provide a MemoryRegion for IOMMUs to log faults
Accesses which do not translate will hit the fault region, which can then log the access. Signed-off-by: Avi Kivity a...@redhat.com --- memory.c | 9 ++--- memory.h | 4 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/memory.c b/memory.c index ba2d4a0..d6b46fd 100644 --- a/memory.c +++ b/memory.c @@ -778,7 +778,9 @@ static void memory_region_destructor_rom_device(MemoryRegion *mr) static void memory_region_destructor_iommu(MemoryRegion *mr) { address_space_destroy(mr-iommu_target_as); +address_space_destroy(mr-iommu_fault_as); g_free(mr-iommu_target_as); +g_free(mr-iommu_fault_as); } static bool memory_region_wrong_endianness(MemoryRegion *mr) @@ -1001,9 +1003,7 @@ static void memory_region_iommu_rw(MemoryRegion *iommu, hwaddr addr, xlat = (tlb.translated_addr ~tlb.addr_mask) | (addr tlb.addr_mask); address_space_rw(iommu-iommu_target_as, xlat, buf, clen, is_write); } else { -if (!is_write) { -memset(buf, 0xff, clen); -} +address_space_rw(iommu-iommu_fault_as, addr, buf, clen, is_write); } buf += clen; addr += clen; @@ -1068,6 +1068,7 @@ static void memory_region_iommu_write(void *opaque, hwaddr addr, void memory_region_init_iommu(MemoryRegion *mr, MemoryRegionIOMMUOps *ops, MemoryRegion *target, + MemoryRegion *fault, const char *name, uint64_t size) { @@ -1079,6 +1080,8 @@ void memory_region_init_iommu(MemoryRegion *mr, mr-destructor = memory_region_destructor_iommu; mr-iommu_target_as = g_new(AddressSpace, 1); address_space_init(mr-iommu_target_as, target); +mr-iommu_fault_as = g_new(AddressSpace, 1); +address_space_init(mr-iommu_fault_as, fault); } static uint64_t invalid_read(void *opaque, hwaddr addr, diff --git a/memory.h b/memory.h index 47362c9..6312197 100644 --- a/memory.h +++ b/memory.h @@ -162,6 +162,7 @@ struct MemoryRegion { unsigned ioeventfd_nb; MemoryRegionIoeventfd *ioeventfds; struct AddressSpace *iommu_target_as; +struct AddressSpace *iommu_fault_as; }; struct MemoryRegionPortio { @@ -361,12 +362,15 @@ void memory_region_init_reservation(MemoryRegion *mr, * @ops: a function that translates addresses into the @target region * @target: a #MemoryRegion that will be used to satisfy accesses to translated * addresses + * @fault: a #MemoryRegion for servicing failed translations; accessed with + * untranslated addresses * @name: used for debugging; not visible to the user or ABI * @size: size of the region. */ void memory_region_init_iommu(MemoryRegion *mr, MemoryRegionIOMMUOps *ops, MemoryRegion *target, + MemoryRegion *fault, const char *name, uint64_t size); -- 1.7.12
Re: [Qemu-devel] [PATCH 3/3] apic: always update the in-kernel status after loading
On 10/30/2012 02:16 PM, Paolo Bonzini wrote: The LAPIC is loaded separately from the rest of the VCPU state. Thus, when restoring the CPU the dummy post-reset state is passed to the in-kernel APIC. This can cause MSI injection to fail if attempted during the restore of another device, because the LAPIC believes it's not enabled. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- hw/apic_common.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/hw/apic_common.c b/hw/apic_common.c index f373ba8..1ef52b2 100644 --- a/hw/apic_common.c +++ b/hw/apic_common.c @@ -362,6 +362,7 @@ static int apic_dispatch_post_load(void *opaque, int version_id) if (info-post_load) { info-post_load(s); } +cpu_put_apic_state(DEVICE(s)); return 0; } Aren't we still dependent on the order of processing? If the APIC is restored after the device, won't we get the same problem? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH v5 1/2] pl330: Initial version
On 10/30/2012 03:42 PM, Peter Crosthwaite wrote: On Tue, Oct 30, 2012 at 7:51 PM, Avi Kivity a...@redhat.com wrote: On 10/29/2012 10:32 AM, Peter Maydell wrote: On 29 October 2012 06:35, Peter Crosthwaite peter.crosthwa...@xilinx.com wrote: Device model for Primecell PL330 dma controller. A general question -- this is a DMA controller so should it be using the DMAContext APIs now? Avi? IOMMU functionality will be folded into the memory core, so address_space_rw() and family. Hi Avi, So does this mean no more dma_contexts and dma_memory_foo? Correct. I have remade the series (and others) using dma_memory_*. Can you point to a sysbus attached DMA controller currently in tree that does it properly so we can use it for a basis as there are several (four in total think) series I have pending all containing DMA which will be effected by this. I don't think there are any. If your dma controller does not have an IOMMU, all you need to do is replace cpu_physical_memory_*(...) with address_space_*(address_space_memory, ...). If it does have an IOMMU, then you need to wait until my IOMMU patchset hits mainline (or use DMAContext). -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [patch v5 5/8] memory: introduce local lock for address space
On 10/29/2012 01:48 AM, Liu Ping Fan wrote: For those address spaces which want to be able out of big lock, they will be protected by their own local. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- memory.c | 11 ++- memory.h |5 - 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/memory.c b/memory.c index 2f68d67..ff34aed 100644 --- a/memory.c +++ b/memory.c @@ -1532,9 +1532,15 @@ void memory_listener_unregister(MemoryListener *listener) QTAILQ_REMOVE(memory_listeners, listener, link); } -void address_space_init(AddressSpace *as, MemoryRegion *root) +void address_space_init(AddressSpace *as, MemoryRegion *root, bool lock) Why not always use the lock? Even if the big lock is taken, it doesn't hurt. And eventually all address spaces will be fine-grained. { memory_region_transaction_begin(); +if (lock) { +as-lock = g_new(QemuMutex, 1); +qemu_mutex_init(as-lock); +} else { +as-lock = NULL; +} as-root = root; as-current_map = g_new(FlatView, 1); flatview_init(as-current_map); @@ -1553,6 +1559,9 @@ void address_space_destroy(AddressSpace *as) QTAILQ_REMOVE(address_spaces, as, address_spaces_link); address_space_destroy_dispatch(as); flatview_destroy(as-current_map); +if (as-lock) { +g_free(as-lock); +} g_free(as-current_map); } diff --git a/memory.h b/memory.h index 79393f1..12d1c56 100644 --- a/memory.h +++ b/memory.h @@ -22,6 +22,7 @@ #include cpu-common.h #include targphys.h #include qemu-queue.h +#include qemu-thread.h #include iorange.h #include ioport.h #include int128.h @@ -164,6 +165,7 @@ typedef struct AddressSpace AddressSpace; */ struct AddressSpace { /* All fields are private. */ +QemuMutex *lock; const char *name; MemoryRegion *root; struct FlatView *current_map; @@ -801,8 +803,9 @@ void mtree_info(fprintf_function mon_printf, void *f); * * @as: an uninitialized #AddressSpace * @root: a #MemoryRegion that routes addesses for the address space + * @lock: if true, the physmap protected by local lock, otherwise big lock */ -void address_space_init(AddressSpace *as, MemoryRegion *root); +void address_space_init(AddressSpace *as, MemoryRegion *root, bool lock); /** -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [patch v5 6/8] memory: make mmio dispatch able to be out of biglock
On 10/29/2012 01:48 AM, Liu Ping Fan wrote: Without biglock, we try to protect the mr by increase refcnt. If we can inc refcnt, go backward and resort to biglock. Another point is memory radix-tree can be flushed by another thread, so we should get the copy of terminal mr to survive from such issue. +static bool memory_region_section_ref(MemoryRegionSection *mrs) +{ +MemoryRegion *mr; +bool ret = false; + +mr = mrs-mr; +if (mr-ops mr-ops-ref) { +ret = mr-ops-ref(mr); I still don't see why -ref() needs to return something. +} +return ret; +} + while (len 0) { page = addr TARGET_PAGE_MASK; l = (page + TARGET_PAGE_SIZE) - addr; if (l len) l = len; -section = phys_page_find(d, page TARGET_PAGE_BITS); + +if (as-lock) { +qemu_mutex_lock(as-lock); +safe_ref = memory_region_section_lookup_ref(d, page, obj_mrs); +qemu_mutex_unlock(as-lock); +if (!safe_ref) { +qemu_mutex_lock_iothread(); +qemu_mutex_lock(as-lock); +/* when 2nd try, mem map can change, need to judge it again */ +safe_ref = memory_region_section_lookup_ref(d, page, obj_mrs); +qemu_mutex_unlock(as-lock); +if (safe_ref) { +qemu_mutex_unlock_iothread(); +} +} +} else { +/* Caller hold the big lock */ +memory_region_section_lookup_ref(d, page, obj_mrs); It's not a property of the address space, it's a property of the caller. +} +section = obj_mrs; if (is_write) { if (!memory_region_is_ram(section-mr)) { -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [memory] abort with head a8170e5
On 10/29/2012 09:54 AM, Aurelien Jarno wrote: On Thu, Oct 25, 2012 at 06:12:06PM +0200, Avi Kivity wrote: On 10/25/2012 04:39 PM, Aurelien Jarno wrote: On Thu, Oct 25, 2012 at 03:47:34PM +0200, Avi Kivity wrote: On 10/24/2012 04:00 PM, Aurelien Jarno wrote: mips is also broken but by commit 1c380f9460522f32c8dd2577b2a53d518ec91c6d: | [0.436000] PCI: Enabling device :00:0a.1 ( - 0001) | Segmentation fault (core dumped) How do you reproduce it? You can use the mips kernel version 2.6.32 from: http://people.debian.org/~aurel32/qemu/mips/ Then just run it with the following command: qemu-system-mips -M malta -kernel vmlinux-2.6.32-5-4kc-malta -append console=tty0 (You can also get the README command line if you don't care about downloading the disk image). Doesn't reproduce here with this command line (upstream + the bridge patch). [0.568000] PCI: Enabling device :00:12.0 ( - 0002) [0.572000] cirrusfb :00:12.0: Cirrus Logic chipset on PCI bus, RAM (4096 kB) at 0x1000 ... [1.172000] PCI: Enabling device :00:0a.1 ( - 0001) [1.188000] scsi0 : ata_piix (with console=ttyS0) Ok, looks like I didn't provide the right command line. I am only able to reproduce it when using -nographic, and only with -vga cirrus (yes it starts to be quite strange). In that case it's better to pass console=ttyS0, even if you can reproduce it with console=tty0. In short it seems heavily related to the cirrus VGA card. I was able to reproduce it, and in fact it's unrelated to VGA, it's deep in the memory core. Please try this patch: From: Avi Kivity a...@redhat.com Date: Mon, 29 Oct 2012 17:07:09 +0200 Subject: [PATCH] memory: fix rendering of a region obscured by another The memory core drops regions that are hidden by another region (for example, during BAR sizing), but it doesn't do so correctly if the lower address of the existing range is below the lower address of the new range. Example (qemu-system-mips -M malta -kernel vmlinux-2.6.32-5-4kc-malta -append console=ttyS0 -nographic -vga cirrus): Existing range: 1000-107f New range: 100a-100b Correct behaviour: drop new range Incorrect behaviour: add new range Fix by taking this case into account (previously we only considered equal lower boundaries). Signed-off-by: Avi Kivity a...@redhat.com diff --git a/memory.c b/memory.c index 36bb9a5..243cb23 100644 --- a/memory.c +++ b/memory.c @@ -539,12 +539,12 @@ static void render_memory_region(FlatView *view, offset_in_region += int128_get64(now); int128_subfrom(remain, now); } -if (int128_eq(base, view-ranges[i].addr.start)) { -now = int128_min(remain, view-ranges[i].addr.size); -int128_addto(base, now); -offset_in_region += int128_get64(now); -int128_subfrom(remain, now); -} +now = int128_sub(int128_min(int128_add(base, remain), +addrrange_end(view-ranges[i].addr)), + base); +int128_addto(base, now); +offset_in_region += int128_get64(now); +int128_subfrom(remain, now); } if (int128_nz(remain)) { fr.mr = mr; -- error compiling committee.c: too many arguments to function