Re: [RFC v5 86/86] 440fx: fix PAM, PCI holes

2011-07-26 Thread Avi Kivity

On 07/26/2011 12:34 AM, Eric Northup wrote:

On Wed, Jul 20, 2011 at 9:50 AM, Avi Kivity  wrote:
[...]
>  @@ -130,7 +137,13 @@ static void pc_init1(MemoryRegion *system_memory,
>
>   if (pci_enabled) {
>   pci_bus = i440fx_init(&i440fx_state,&piix3_devfn, isa_irq,
>  -  system_memory, system_io, ram_size);
>  +  system_memory, system_io, ram_size,
>  +  0xe000, 0x1fe0,
>  +  0x1 + above_4g_mem_size,
>  +  (sizeof(target_phys_addr_t) == 32

sizeof(target_phys_addr_t) == 8 ?


Good catch.  Fixed

(== 4).

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtagent for qemu-kvm ?

2011-07-26 Thread Stefan Hajnoczi
On Tue, Jul 26, 2011 at 6:58 AM, Prateek Sharma  wrote:
> On Tue, 26 Jul 2011, Stefan Hajnoczi wrote:
>
>> On Tue, Jul 26, 2011 at 6:05 AM, Prateek Sharma 
>> wrote:
>>>
>>> Hi all,
>>>    Is there any equivalent of qemu's virtagent in qemu-kvm?
>>> [http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg02149.html]  .
>>> In particular , i want to share pages between KVM guests and the host .
>>> Is
>>> there an appropriate mechanism for this in existence which i could use ?
>>> Another nice feature virtagent provides is the ability to see guest dmesg
>>> output in the host...
>>
>> virtagent doesn't share pages between guest and host AFAIK.
>>
>> Have you looked at hw/ivshmem.c?
>>
>> Most of the time you don't need to actually share pages between guest
>> and host.  Use existing mechanisms like networking or serial to
>> communicate instead.
>>
>> Stefan
>>
>
> Thanks for the quick response!
>
> I was tempted by virtagent because it provides dmesg output directly,
> something which is exactly what i need.
>
> Anyway, looks like i'll need to fallback to serial/networking, as you
> suggested.
>
> On a related note, is porting virtagent to qemu-kvm
> possible/useful/in-the-pipeline ?

I expect qemu-ga will be merged into qemu-kvm.git when they merge from
qemu.git again soon.

Mike: I don't see a dmesg command in the guest schema in qemu.git.  Is
there a recommended way of getting dmesg from the host (maybe read
file /var/log/messages)?

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] The codes V2 for QEMU disk I/O limits.

2011-07-26 Thread Zhi Yong Wu
Welcome to give me your comments, thanks.

Signed-off-by: Zhi Yong Wu 
---
 Makefile.objs |2 +-
 block.c   |  288 +++--
 block.h   |1 -
 block/blk-queue.c |  116 +
 block/blk-queue.h |   70 +
 block_int.h   |   28 +
 blockdev.c|   21 
 qemu-config.c |   24 +
 qemu-option.c |   17 +++
 qemu-option.h |1 +
 qemu-options.hx   |1 +
 11 files changed, 559 insertions(+), 10 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

diff --git a/Makefile.objs b/Makefile.objs
index 9f99ed4..06f2033 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o 
dmg.o bochs.o vpc.o vv
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
-block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o 
blk-queue.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block.c b/block.c
index 24a25d5..e54e59c 100644
--- a/block.c
+++ b/block.c
@@ -29,6 +29,9 @@
 #include "module.h"
 #include "qemu-objects.h"
 
+#include "qemu-timer.h"
+#include "block/blk-queue.h"
+
 #ifdef CONFIG_BSD
 #include 
 #include 
@@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t 
sector_num,
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
  const uint8_t *buf, int nb_sectors);
 
+static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
+bool is_write, double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
+double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
+bool is_write, uint64_t *wait);
+
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -90,6 +100,20 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+static int bdrv_io_limits_enable(BlockIOLimit *io_limits)
+{
+if ((io_limits->bps[0] == 0)
+ && (io_limits->bps[1] == 0)
+ && (io_limits->bps[2] == 0)
+ && (io_limits->iops[0] == 0)
+ && (io_limits->iops[1] == 0)
+ && (io_limits->iops[2] == 0)) {
+return 0;
+}
+
+return 1;
+}
+
 /* check if the path starts with ":" */
 static int path_has_protocol(const char *path)
 {
@@ -167,6 +191,28 @@ void path_combine(char *dest, int dest_size,
 }
 }
 
+static void bdrv_block_timer(void *opaque)
+{
+BlockDriverState *bs = opaque;
+BlockQueue *queue = bs->block_queue;
+
+while (!QTAILQ_EMPTY(&queue->requests)) {
+BlockIORequest *request;
+int ret;
+
+request = QTAILQ_FIRST(&queue->requests);
+QTAILQ_REMOVE(&queue->requests, request, entry);
+
+ret = qemu_block_queue_handler(request);
+if (ret == 0) {
+QTAILQ_INSERT_HEAD(&queue->requests, request, entry);
+break;
+}
+
+qemu_free(request);
+}
+}
+
 void bdrv_register(BlockDriver *bdrv)
 {
 if (!bdrv->bdrv_aio_readv) {
@@ -642,6 +688,15 @@ int bdrv_open(BlockDriverState *bs, const char *filename, 
int flags,
 bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
 }
 
+/* throttling disk I/O limits */
+if (bdrv_io_limits_enable(&bs->io_limits)) {
+bs->block_queue = qemu_new_block_queue();
+bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
+
+bs->slice_start[0] = qemu_get_clock_ns(vm_clock);
+bs->slice_start[1] = qemu_get_clock_ns(vm_clock);
+}
+
 return 0;
 
 unlink_and_fail:
@@ -680,6 +735,16 @@ void bdrv_close(BlockDriverState *bs)
 if (bs->change_cb)
 bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
 }
+
+/* throttling disk I/O limits */
+if (bs->block_queue) {
+qemu_del_block_queue(bs->block_queue);
+}
+
+if (bs->block_timer) {
+qemu_del_timer(bs->block_timer);
+qemu_free_timer(bs->block_timer);
+}
 }
 
 void bdrv_close_all(void)
@@ -1312,6 +1377,14 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
 *psecs = bs->secs;
 }
 
+/* throttling disk io limits */
+void bdrv_set_io_limits(BlockDriverState *bs,
+   BlockIOLimit *io_limits)
+{
+memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
+bs->io_limits = *io_limits;
+}
+
 /* Recognize floppy formats */
 typedef struct FDFormat {
 FDriveType drive;
@@ -2111,6 +2184,155 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, 
QEMUSnapshotInfo *sn)
 return buf;
 }
 
+static bool bdrv_exc

[PATCH 0/1] The intro for QEMU disk I/O limits

2011-07-26 Thread Zhi Yong Wu
The main goal of the patch is to effectively cap the disk I/O speed or counts 
of one single VM.It is only one draft, so it unavoidably has some drawbacks, if 
you catch them, please let me know.

The patch will mainly introduce one block I/O throttling algorithm, one global 
timer and one block queue for each I/O limits enabled drive.

When a block request is coming in, the throttling algorithm will check if its 
I/O rate or counts exceed the limits; if yes, then it will enqueue to the block 
queue; The timer will periodically handle the I/O requests in it.

Some available features follow as below:
(1) global bps limit.
-drive bps=xxxin bytes/s
(2) only read bps limit
-drive bps_rd=xxx in bytes/s
(3) only write bps limit
-drive bps_wr=xxx in bytes/s
(4) global iops limit
-drive iops=xxx   in ios/s
(5) only read iops limit
-drive iops_rd=xxxin ios/s
(6) only write iops limit
-drive iops_wr=xxxin ios/s
(7) the combination of some limits.
-drive bps=xxx,iops=xxx

Known Limitations:
(1) #1 can not coexist with #2, #3
(2) #4 can not coexist with #5, #6


Zhi Yong Wu (1):
  v2: The codes V2 for QEMU disk I/O limits.
  Modified the codes mainly based on stefan's comments.

  v1: Submit the codes for QEMU disk I/O limits.
  Only a code draft.

 Makefile.objs |2 +-
 block.c   |  288 +++--
 block.h   |1 -
 block/blk-queue.c |  116 +
 block/blk-queue.h |   70 +
 block_int.h   |   28 +
 blockdev.c|   21 
 qemu-config.c |   24 +
 qemu-option.c |   17 +++
 qemu-option.h |1 +
 qemu-options.hx   |1 +
 11 files changed, 559 insertions(+), 10 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

-- 
1.7.2.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/1] The intro for QEMU disk I/O limits

2011-07-26 Thread Zhi Yong Wu
The main goal of the patch is to effectively cap the disk I/O speed or counts 
of one single VM.It is only one draft, so it unavoidably has some drawbacks, if 
you catch them, please let me know.

The patch will mainly introduce one block I/O throttling algorithm, one global 
timer and one block queue for each I/O limits enabled drive.

When a block request is coming in, the throttling algorithm will check if its 
I/O rate or counts exceed the limits; if yes, then it will enqueue to the block 
queue; The timer will periodically handle the I/O requests in it.

Some available features follow as below:
(1) global bps limit.
-drive bps=xxxin bytes/s
(2) only read bps limit
-drive bps_rd=xxx in bytes/s
(3) only write bps limit
-drive bps_wr=xxx in bytes/s
(4) global iops limit
-drive iops=xxx   in ios/s
(5) only read iops limit
-drive iops_rd=xxxin ios/s
(6) only write iops limit
-drive iops_wr=xxxin ios/s
(7) the combination of some limits.
-drive bps=xxx,iops=xxx

Known Limitations:
(1) #1 can not coexist with #2, #3
(2) #4 can not coexist with #5, #6


Zhi Yong Wu (1):
  v2: The codes V2 for QEMU disk I/O limits.
  Modified the codes mainly based on stefan's comments.

  v1: Submit the codes for QEMU disk I/O limits.
  Only a code draft.

 Makefile.objs |2 +-
 block.c   |  288 +++--
 block.h   |1 -
 block/blk-queue.c |  116 +
 block/blk-queue.h |   70 +
 block_int.h   |   28 +
 blockdev.c|   21 
 qemu-config.c |   24 +
 qemu-option.c |   17 +++
 qemu-option.h |1 +
 qemu-options.hx   |1 +
 11 files changed, 559 insertions(+), 10 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

-- 
1.7.2.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/1] The codes V2 for QEMU disk I/O limits.

2011-07-26 Thread Zhi Yong Wu
Welcome to give me your comments, thanks.

Signed-off-by: Zhi Yong Wu 
---
 Makefile.objs |2 +-
 block.c   |  288 +++--
 block.h   |1 -
 block/blk-queue.c |  116 +
 block/blk-queue.h |   70 +
 block_int.h   |   28 +
 blockdev.c|   21 
 qemu-config.c |   24 +
 qemu-option.c |   17 +++
 qemu-option.h |1 +
 qemu-options.hx   |1 +
 11 files changed, 559 insertions(+), 10 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

diff --git a/Makefile.objs b/Makefile.objs
index 9f99ed4..06f2033 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o 
dmg.o bochs.o vpc.o vv
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
-block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o 
blk-queue.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block.c b/block.c
index 24a25d5..e54e59c 100644
--- a/block.c
+++ b/block.c
@@ -29,6 +29,9 @@
 #include "module.h"
 #include "qemu-objects.h"
 
+#include "qemu-timer.h"
+#include "block/blk-queue.h"
+
 #ifdef CONFIG_BSD
 #include 
 #include 
@@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t 
sector_num,
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
  const uint8_t *buf, int nb_sectors);
 
+static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
+bool is_write, double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
+double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
+bool is_write, uint64_t *wait);
+
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -90,6 +100,20 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+static int bdrv_io_limits_enable(BlockIOLimit *io_limits)
+{
+if ((io_limits->bps[0] == 0)
+ && (io_limits->bps[1] == 0)
+ && (io_limits->bps[2] == 0)
+ && (io_limits->iops[0] == 0)
+ && (io_limits->iops[1] == 0)
+ && (io_limits->iops[2] == 0)) {
+return 0;
+}
+
+return 1;
+}
+
 /* check if the path starts with ":" */
 static int path_has_protocol(const char *path)
 {
@@ -167,6 +191,28 @@ void path_combine(char *dest, int dest_size,
 }
 }
 
+static void bdrv_block_timer(void *opaque)
+{
+BlockDriverState *bs = opaque;
+BlockQueue *queue = bs->block_queue;
+
+while (!QTAILQ_EMPTY(&queue->requests)) {
+BlockIORequest *request;
+int ret;
+
+request = QTAILQ_FIRST(&queue->requests);
+QTAILQ_REMOVE(&queue->requests, request, entry);
+
+ret = qemu_block_queue_handler(request);
+if (ret == 0) {
+QTAILQ_INSERT_HEAD(&queue->requests, request, entry);
+break;
+}
+
+qemu_free(request);
+}
+}
+
 void bdrv_register(BlockDriver *bdrv)
 {
 if (!bdrv->bdrv_aio_readv) {
@@ -642,6 +688,15 @@ int bdrv_open(BlockDriverState *bs, const char *filename, 
int flags,
 bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
 }
 
+/* throttling disk I/O limits */
+if (bdrv_io_limits_enable(&bs->io_limits)) {
+bs->block_queue = qemu_new_block_queue();
+bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
+
+bs->slice_start[0] = qemu_get_clock_ns(vm_clock);
+bs->slice_start[1] = qemu_get_clock_ns(vm_clock);
+}
+
 return 0;
 
 unlink_and_fail:
@@ -680,6 +735,16 @@ void bdrv_close(BlockDriverState *bs)
 if (bs->change_cb)
 bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
 }
+
+/* throttling disk I/O limits */
+if (bs->block_queue) {
+qemu_del_block_queue(bs->block_queue);
+}
+
+if (bs->block_timer) {
+qemu_del_timer(bs->block_timer);
+qemu_free_timer(bs->block_timer);
+}
 }
 
 void bdrv_close_all(void)
@@ -1312,6 +1377,14 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
 *psecs = bs->secs;
 }
 
+/* throttling disk io limits */
+void bdrv_set_io_limits(BlockDriverState *bs,
+   BlockIOLimit *io_limits)
+{
+memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
+bs->io_limits = *io_limits;
+}
+
 /* Recognize floppy formats */
 typedef struct FDFormat {
 FDriveType drive;
@@ -2111,6 +2184,155 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, 
QEMUSnapshotInfo *sn)
 return buf;
 }
 
+static bool bdrv_exc

Re: [Qemu-devel] [PATCH 01/23] Hierarchical memory region API

2011-07-26 Thread Avi Kivity

On 07/25/2011 09:41 PM, Anthony Liguori wrote:

+/* Initialize a memory region
+ *
+ * The region typically acts as a container for other memory regions.
+ */



I'm also writing a fair bit of documentation right now.  We should use 
a common format and declare that the doc format in CODING_STYLE.


I think the two choices are gtk-doc (which I have been using):




I think gtk-doc is more concise and easier to write.  But Doxygen 
clearly has better tooling.


Updated to use gtk-doc.  Ugh, a large API.



Otherwise, Reviewed-by: Anthony Liguori  but not 
tested yet.


I assume that all of the FIXMEs don't actually introduce regressions, 
right?


Not intentionally.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 03/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Avi Kivity

On 07/25/2011 09:48 PM, Anthony Liguori wrote:

+/* Attempt to simplify a view by merging ajacent ranges */
+static void flatview_simplify(FlatView *view)
+{
+unsigned i;
+FlatRange *r1, *r2;
+
+for (i = 0; i + 1<  view->nr; ++i) {
+r1 =&view->ranges[i];
+r2 =&view->ranges[i+1];
+if (addrrange_end(r1->addr) == r2->addr.start
+&&  r1->mr == r2->mr
+&&  r1->offset_in_region + r1->addr.size == r2->offset_in_region
+&&  r1->dirty_log_mask == r2->dirty_log_mask) {
+r1->addr.size += r2->addr.size;
+memmove(r2, r2 + 1, (view->nr - (i + 2)) * sizeof(*r2));
+--view->nr;
+--i;
+}



The --i is pretty subtle.  Moving the index variable backwards in a 
conditional in a for loop is pretty evil :-)  I started writing up why 
this was wrong until I noticed that.


I think the following would be more straight forward:

i = 0;
while (i + 1 < view->nr) {
   int begin = i, end = i + 1;

   while (matches(&view->ranges[begin], &view->ranges[end])) {
  end++;
   }

   memmove(...)
}



Right; updated to something along these lines.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 06/23] memory: rename MemoryRegion::has_ram_addr to ::terminates

2011-07-26 Thread Avi Kivity

On 07/25/2011 09:56 PM, Anthony Liguori wrote:

On 07/25/2011 09:02 AM, Avi Kivity wrote:

I/O regions will not have ram_addrs, so this is a better name.

Signed-off-by: Avi Kivity


Reviewed-by: Anthony Liguori 

Although it seems squashable.


Yeah, I'm loath to squash across a conflicting commit though.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Windows7 crashes inside the VM when starting a certain program

2011-07-26 Thread Gleb Natapov
On Tue, Jul 26, 2011 at 07:29:04AM +0200, André Weidemann wrote:
> On 07.07.2011 07:26, André Weidemann wrote:
> >Hi,
> >I am running Windows7 x64 in a VM which crashes after starting a certain
> >game. Actually there are two games both from the same company, that make
> >the VM crash after starting them.
> >Windows crashes right after starting the game. With the 1st game the
> >screen goes black as usual and the cursor keeps spinning for 3-5 seconds
> >until Windows crashes. With the second game I get to 3D the login
> >screen. The game then crashes after logging in.
> >Windows displays this error message on the first crash:
> >http://pastebin.com/kMzk9Jif
> >Windows then finishes writing the crash dump and restarts.
> >I can reproduce Windows crashing every time I start the game while the
> >VM keeps running without any problems.
> >When Windows reboots after the first crash and the game is started
> >again, the message on the following blue screen changes slightly and
> >stays the same(except for the addresses) for every following crash:
> >http://pastebin.com/jVtBc4ZH
> >
> >I first thought that this might be related to a certain feature in 3D
> >acceleration being used, but Futuremark 3DMark Vantage or 3DMark 11 run
> >without any problems. They run a bit choppy on some occasions, but do
> >that without crashing Windows7 or the VM.
> >
> >How can I proceed to investigate what is going wrong?
> 
> I did some testing and found out that Windows7 does not crash
> anymore when changing "-cpu host" to "-cpu Nehalem". After doing so,
What is your host cpu (cat /proc/cpuinfo)?

> the "only" thing crashing, is the application itself.
> Why is that? What is different between the "real" CPU and the one
> provided by qemu-kvm? How can "-cpu host" cause Windows7 to crash,
> while "-cpu Nehalem" "only" crashes the application.
> 
> I then had WinDbg attach to the process in question. When the game
> crashes the debugger reports an Assertion Failure. This seems so
> happen as soon as the game accesses the network through a certain
> DLL.
> To exclude the emulated e1000 hardware as the cause, I  removed it
> from the VM and passed an Intel network card 82574L to it using
> these lines:
> 
> -device pci-assign,host=04:00.0,id=82574L,addr=0x10 \
> -net none \
> 
> The network card works under Windows7, but the problem of the
> crashing game remains.
> 
> Any ideas on how to track the problem are greatly appreciated.
> 
> 
> Regards
>  André
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 11/23] memory: add ioeventfd support

2011-07-26 Thread Avi Kivity

On 07/25/2011 10:08 PM, Anthony Liguori wrote:


+static void as_memory_ioeventfd_add(AddressSpace *as, 
MemoryRegionIoeventfd *fd)

+{
+int r;
+
+if (!fd->match_data || fd->addr.size != 4) {
+abort();
+}
+
+r = kvm_set_ioeventfd_mmio_long(fd->fd, fd->addr.start, 
fd->data, true);

+if (r<  0) {
+abort();
+}



asserts would be friendlier.


I thought asserts were disabled by default.  But I see it isn't so; will 
update.




I really dislike baking ioeventfd into this API.  There is only one 
user of ioeventfd in the tree.


Two: virtio-pci and ivshmem.



I worry that by having things like ioeventfd the API, we're making it 
too difficult to side-step the API which prevents future optimizations.


I'd prefer virtio-pci to have ugliness in it where it circumvented the 
layering vs. having such a device specific thing in generic code.


It's impossible (or at least, impossible without further information 
from the API) to do this and retain correctness.  Currently virtio-pci 
is broken wrt bridges and overlapping BARs; it's probably also broken on 
targets that bridge the I/O address space to MMIO.


With the memory API, this is fixed in a natural way by making the I/O 
address space a subregion of the bridge which does the conversion; the 
code will automatically add the needed offset and use MMIO ioeventfd 
instead of portio.


On a more general note, I don't want this to be a lean and mean API that 
throws any complexity to the users; instead I want to make writing 
devices as simple as possible and own all the smarts.


(an example - undecoded address bits can be specified to the API which 
then takes care of replication or shifting).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 12/23] memory: separate building the final memory map into two steps

2011-07-26 Thread Avi Kivity

On 07/25/2011 10:12 PM, Anthony Liguori wrote:

On 07/25/2011 09:02 AM, Avi Kivity wrote:

Instead of adding and deleting regions in one pass, do a delete
pass followed by an add pass.  This fixes the following case:

from:
   0x-0x0fff ram  (a1)
   0x1000-0x1fff mmio (a2)
   0x2000-0x2fff ram  (a3)

to:
   0x-0x2fff ram  (b1)

The single pass algorithm removed a1, added b2, then removed a2 and a2,


You mean a2 and a3 I suppose?


Yes; fixed.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 13/23] memory: document the memory API

2011-07-26 Thread Avi Kivity

On 07/25/2011 10:15 PM, Anthony Liguori wrote:

+Region names
+
+
+Regions are assigned names by the constructor.  For most regions 
these are
+only used for debugging purposes, but RAM regions also use the name 
to identify
+live migration sections.  This means that RAM region names need to 
have ABI

+stability.

+

I don't think this needs to change for this series, but long term, 
this is something that we need to better think through.


Device ROM is part of the device.  It shouldn't be migrated as RAM, it 
should be included in the device state.


Agreed; however I tries to limit the amount of functional change, 
especially with anything resembling an ABI.




Would be nice to make this patch #1 in the series.


Done.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] pci: correct pci config size default for cap version 2 endpoints

2011-07-26 Thread Michael S. Tsirkin
On Mon, Jul 25, 2011 at 04:42:50PM -0400, Don Dutile wrote:
> On 07/25/2011 04:20 PM, Alex Williamson wrote:
> >On Mon, 2011-07-25 at 15:37 -0400, Don Dutile wrote:
> >>On 07/24/2011 06:58 AM, Michael S. Tsirkin wrote:
> >>>On Sun, Jul 24, 2011 at 11:41:10AM +0300, Michael S. Tsirkin wrote:
> On Sun, Jul 24, 2011 at 11:12:44AM +0300, Michael S. Tsirkin wrote:
> >On Fri, Jul 22, 2011 at 02:35:47PM -0700, Chris Wright wrote:
> >>* Alex Williamson (alex.william...@redhat.com) wrote:
> >>>On Fri, 2011-07-22 at 14:24 -0700, Chris Wright wrote:
> * Donald Dutile (ddut...@redhat.com) wrote:
> >diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> >index 36ad6b0..34db52e 100644
> >--- a/hw/device-assignment.c
> >+++ b/hw/device-assignment.c
> >@@ -1419,16 +1419,18 @@ static int 
> >assigned_device_pci_cap_init(PCIDevice *pci_dev)
> >   }
> >
> >   if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
> >-uint8_t version;
> >+uint8_t version, size;
> >   uint16_t type, devctl, lnkcap, lnksta;
> >   uint32_t devcap;
> >-int size = 0x3c; /* version 2 size */
> >
> >   version = pci_get_byte(pci_dev->config + pos + 
> > PCI_EXP_FLAGS);
> >   version&= PCI_EXP_FLAGS_VERS;
> >   if (version == 1) {
> >   size = 0x14;
> >-} else if (version>   2) {
> >+} else if (version == 2) {
> >+/* don't include slot cap/stat/ctrl 2 regs; only 
> >support endpoints */
> >+size = 0x34;
> 
> That doesn't look correct to me.  The size is fixed, just that some
> registers are Reserved Zero when they do not apply (e.g. endpoint 
> only).
> >>>
> >>>Apparently it can be interpreted differently.  In this case, we've seen
> >>>a tg3 device expose a v2 PCI express capability at offset 0xcc.  Using
> >>>0x3c bytes, we extend 8 bytes past the legacy config space area :(
> >>
> >>Wow, that device sounds broken to me.  The spec is pretty clear.
> >
> >Yes, I agree it's broken. Looks like something that
> >happens when a device is designed in parallel with the spec.
> >
> >What bothers me is this patch seems to make devices that do behave
> >correctly out of spec (registers will be writeable by default) -
> >correct?
> >
> >How about we check for overflow and only do the hacks
> >if it happens?
> >
> >Also, the code to initialize slot and root control registers is still
> >there: it would seem that running it will corrupt memmory beyond the
> >config array?
> 
> I take this last bit back: registers we touch are at offset<   0x34.
> Sorry about the noise. But the question about read-only registers
> still stands.
> >>>
> >>>Also, where does the magic 0x34 come from? I'm guessing this is
> >>>simply what's left till the end of the config space.
> >>>So let's be conservative specific as possible with
> >>>this hack:
> >>>
> >>
> >>I believe the spec leaves room for interpretation, and thus the
> >>resulting 'broken' device.  As I read the spec, the size of the struct can 
> >>be:
> >>
> >>-- 0x2c for all devices, min., that are cap version 2 or higher.
> >>-- 0x34 for devices with links, i.e., not a root-port-based device,
> >>, a device not integrated into the root port
> >>, or if it is, it uses a serial link anyhow
> >>  (doesn't strip out 8b/10b serial link 
> >> btwn device
> >>   &  internal root port)
> >>-- 0x3c for devices with slots, i.e., a bridge with downstream slots,
> >>  i.e., not an endpoint, i.e., a PCI(e) 
> >> bridge.
> >>
> >>Thus, 0x34 was chosen, since we don't support device assigning PCI bridges,
> >>(not until MRIOV shows up, at least), and 0x34 fits the bug at hand, and
> >>device cap/stat/control may be used/modified.
> >>
> >>So, a 'hack' is not needed.  In fact, the 0x34 size is a bit of a hack,
> >>since the case to use 0x2c could be ascertained by checking if the device
> >>is a root port device, _and_ it's not using a serial link, but a perusal of
> >>root port devices on a number of systems I looked at always had this 
> >>structure
> >>greater than 0x2c, so I figured the simple heuristic of 0x34 was sufficient.
> >>
> >>>/* A version 2 device was observed to only have a partial
> >>>   * implementation for the capability structure. Apparently, it doesn't
> >>>   * implement the registers from slot capability 2 and on (offset 0x34),
> >>>   * with the capability at offset 0xCC = 256 - 0x34. This is out of spec,
> >>>   * but let's try to support this. */
> >>>if (version == 2

Re: [Qemu-devel] [PATCH 14/23] memory: transaction API

2011-07-26 Thread Avi Kivity

On 07/25/2011 10:16 PM, Anthony Liguori wrote:

On 07/25/2011 09:02 AM, Avi Kivity wrote:

Allow changes to the memory hierarchy to be accumulated and
made visible all at once.  This reduces computational effort,
especially when an accelerator (e.g. kvm) is involved.

Useful when a single register update causes multiple changes
to an address space.

Signed-off-by: Avi Kivity


What's the motivation for this?  Was this just because it seemed neat 
to do or did you run into a performance issue you were trying to solve?


Both cirrus and 440fx need this; look at i440fx_update_memory_mappings() 
for example, it blindly updates mappings even for PAMs which haven't 
changed.


These issues can be also solved by more care on the caller's side, or by 
making the API richer, but it's good to have a no-thought-required 
solution, particularly as it's so easy to do.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 15/23] exec.c: initialize memory map

2011-07-26 Thread Avi Kivity

On 07/25/2011 10:17 PM, Anthony Liguori wrote:

+static void memory_map_init(void)
+{
+system_memory = qemu_malloc(sizeof(*system_memory));
+memory_region_init(system_memory, "system", UINT64_MAX);



Would be nice to #define MEM_REG_SIZE_ALL UINT64_MAX

Without reading the docs, it seems like an odd bit of code otherwise.


It's really temporary.  It should come from the machine description and 
specify the processor's system bus width.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Windows7 crashes inside the VM when starting a certain program

2011-07-26 Thread André Weidemann

Hi,

On 26.07.2011 12:08, Gleb Natapov wrote:

On Tue, Jul 26, 2011 at 07:29:04AM +0200, André Weidemann wrote:

On 07.07.2011 07:26, André Weidemann wrote:

Hi,
I am running Windows7 x64 in a VM which crashes after starting a certain
game. Actually there are two games both from the same company, that make
the VM crash after starting them.
Windows crashes right after starting the game. With the 1st game the
screen goes black as usual and the cursor keeps spinning for 3-5 seconds
until Windows crashes. With the second game I get to 3D the login
screen. The game then crashes after logging in.
Windows displays this error message on the first crash:
http://pastebin.com/kMzk9Jif
Windows then finishes writing the crash dump and restarts.
I can reproduce Windows crashing every time I start the game while the
VM keeps running without any problems.
When Windows reboots after the first crash and the game is started
again, the message on the following blue screen changes slightly and
stays the same(except for the addresses) for every following crash:
http://pastebin.com/jVtBc4ZH

I first thought that this might be related to a certain feature in 3D
acceleration being used, but Futuremark 3DMark Vantage or 3DMark 11 run
without any problems. They run a bit choppy on some occasions, but do
that without crashing Windows7 or the VM.

How can I proceed to investigate what is going wrong?


I did some testing and found out that Windows7 does not crash
anymore when changing "-cpu host" to "-cpu Nehalem". After doing so,

What is your host cpu (cat /proc/cpuinfo)?


The server is currently running on 2 out of 8 cores with kernel boot 
parameter "maxcpus=2".


processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
stepping: 5
cpu MHz : 1596.000
cache size  : 8192 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good 
xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 
ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi 
flexpriority ept vpid

bogomips: 5405.59
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 26
model name  : Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz
stepping: 5
cpu MHz : 1596.000
cache size  : 8192 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
apicid  : 2
initial apicid  : 2
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good 
xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 
ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi 
flexpriority ept vpid

bogomips: 5404.84
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:


the "only" thing crashing, is the application itself.
Why is that? What is different between the "real" CPU and the one
provided by qemu-kvm? How can "-cpu host" cause Windows7 to crash,
while "-cpu Nehalem" "only" crashes the application.

I then had WinDbg attach to the process in question. When the game
crashes the debugger reports an Assertion Failure. This seems so
happen as soon as the game accesses the network through a certain
DLL.
To exclude the emulated e1000 hardware as the cause, I  removed it
from the VM and passed an Intel network card 82574L to it using
these lines:

-device pci-assign,host=04:00.0,id=82574L,addr=0x10 \
-net none \

The network card works under Windows7, but the problem of the
crashing game remains.

Any ideas on how to track the problem are greatly appreciated.


Regards
  André
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Gleb.


 André
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 16/23] ioport: register ranges by byte aligned addresses always

2011-07-26 Thread Avi Kivity

On 07/25/2011 10:20 PM, Anthony Liguori wrote:

On 07/25/2011 09:02 AM, Avi Kivity wrote:

The I/O port space is byte addressable, even for word and long accesses.

An example is the VMware svga card, which has long ports on offsets 0,
1, and 2.

Signed-off-by: Avi Kivity


I've always thought this was odd but didn't know of a specific 
circumstance where it broke a device.


This was a big problem with the old API.  Devices don't register their 
interest in specific sizes.  They may ignore certain size accesses but 
that's a device specific behavior.


In fact there's on counterexample - 440FX consumes dword accesses on 
port cf8 but forwards byte and word accesses to the PCI bus.  This 
cannot be implemented with the current API (it requires the opaque to be 
equal for all word sizes).  The new API doesn't support it either; if we 
need to, we can re-issue the access but using the PCI address space 
instead of the root I/O address space.


(of course this is purely theoretical)

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 19/23] pc: move global memory map out of pc_init1() and into its callers

2011-07-26 Thread Avi Kivity

On 07/25/2011 11:02 PM, Anthony Liguori wrote:

On 07/25/2011 09:03 AM, Avi Kivity wrote:

Signed-off-by: Avi Kivity


What's the rationale here?


Removing globals and making dependencies explicit.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 21/23] pci: add MemoryRegion based BAR management API

2011-07-26 Thread Avi Kivity

On 07/25/2011 11:20 PM, Anthony Liguori wrote:

On 07/25/2011 09:03 AM, Avi Kivity wrote:
Allow registering a BAR using a MemoryRegion.  Once all users are 
converted,

pci_register_bar() and pci_register_bar_simple() will be removed.

Signed-off-by: Avi Kivity



diff --git a/hw/pci.h b/hw/pci.h
index cfeb042..c51156d 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -94,6 +94,7 @@ typedef struct PCIIORegion {
  uint8_t type;
  PCIMapIORegionFunc *map_func;
  ram_addr_t ram_addr;
+MemoryRegion *memory;
  } PCIIORegion;

  #define PCI_ROM_SLOT 6
@@ -204,6 +205,8 @@ void pci_register_bar(PCIDevice *pci_dev, int 
region_num,

  PCIMapIORegionFunc *map_func);
  void pci_register_bar_simple(PCIDevice *pci_dev, int region_num,
   pcibus_t size, uint8_t attr, 
ram_addr_t ram_addr);

+void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
+ uint8_t attr, MemoryRegion *memory);


This ends up being a very nice API.  I had always thought this should 
be a PCI specific set of callbacks but I do see the benefits of having 
the callbacks be generic.


Reviewed-by: Anthony Liguori 

One thing I'm curious about, what's the symmetric view of this API?

Would you see a device doing something like:

memory_region_read(&dev->pci_bus->memory, addr, &data, sizeof(data))



I haven't really considered this, but it makes sense to consider the 
initiating point for DMA.  This allows us to do IOMMU transformations 
naturally.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call agenda for July 26

2011-07-26 Thread Stefan Hajnoczi
On Tue, Jul 26, 2011 at 12:30 AM, Juan Quintela  wrote:
> Please send in any agenda items you are interested in covering.

0.15.0 release candidate testing
 * http://wiki.qemu.org/Planning/0.15/Testing
 * Please test hosts, targets, subsystems, or features you care about!
 * May just include running the test suite (KVM-Autotest, qemu-iotests, etc)

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] separate thread for VM migration

2011-07-26 Thread Paolo Bonzini

On 07/22/2011 09:58 PM, Umesh Deshapnde wrote:

-qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
+qemu_mod_timer(s->timer, qemu_get_clock_ms(migration_clock) + 100);

  if (s->freeze_output)
  return;
@@ -246,8 +246,10 @@ static void buffered_rate_tick(void *opaque)

  buffered_flush(s);

-/* Add some checks around this */
  s->put_ready(s->opaque);
+qemu_mutex_unlock_iothread();
+usleep(qemu_timer_difference(s->timer, migration_clock) * 1000);
+qemu_mutex_lock_iothread();
  }


Do you really need a timer?  The timer will only run in the migration 
thread, where you should have full control on when to wait.


The BufferedFile code is more general, but you can create one thread per 
BufferedFile (you will only have one).  Then, since you only have one 
timer, calling buffered_rate_tick repeatedly is really all that is done 
in the body of the thread:


while (s->state == MIG_STATE_ACTIVE)
start_time = qemu_get_clock_ms(rt_clock);
buffered_rate_tick (s->file);

qemu_mutex_unlock_iothread();
usleep ((qemu_get_clock_ms(rt_clock) + 100 - start_time) * 1000);
qemu_mutex_lock_iothread();
}


Perhaps the whole QEMUFile abstraction should be abandoned as the basis 
of QEMUBufferedFile.  It is simply too heavy when you're working in a 
separate thread and you can afford blocking operation.


I also am not sure about the correctness.  Here you removed the delayed 
call to migrate_fd_put_notify, which is what resets freeze_output to 0:



@@ -327,9 +320,7 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void 
*data, size_t size)
  if (ret == -1)
  ret = -(s->get_error(s));

-if (ret == -EAGAIN) {
-qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s);
-} else if (ret<  0) {
+if (ret<  0&&  ret != -EAGAIN) {
  if (s->mon) {
  monitor_resume(s->mon);
  }


The call to migrate_fd_put_notify is here:


@@ -396,6 +401,9 @@ void migrate_fd_put_ready(void *opaque)
 }
 s->state = state;
 notifier_list_notify(&migration_state_notifiers);
+} else {
+migrate_fd_wait_for_unfreeze(s);
+qemu_file_put_notify(s->file);
 }
 }



But it is not clear to me what calls migrate_fd_put_ready when the 
output file is frozen.


Finally, here you are busy waiting:


@@ -340,10 +331,34 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void 
*data, size_t size)
  return ret;
  }

-void migrate_fd_connect(FdMigrationState *s)
+void *migrate_run_timers(void *arg)
  {
+FdMigrationState *s = arg;
  int ret;

+qemu_mutex_lock_iothread();
+ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
+s->mig_state.shared);
+if (ret<  0) {
+DPRINTF("failed, %d\n", ret);
+migrate_fd_error(s);
+return NULL;
+}
+
+migrate_fd_put_ready(s);
+
+while (s->state == MIG_STATE_ACTIVE) {
+qemu_run_timers(migration_clock);
+}


which also convinces me that if you make rate limiting the main purpose 
of the migration thread's main loop, most problems will go away.  You 
can also use select() instead of usleep, so that you fix the problem 
with qemu_file_put_notify.


Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/11] KVM: x86: optimize for guest page written

2011-07-26 Thread Xiao Guangrong
Too keep shadow page consistency, we should write-protect the guest page if
if it is a page structure. Unfortunately, even if the guest page structure is
tear-down and is used for other usage, we still write-protect it and cause page
fault if it is written, in this case, we need to zap the corresponding shadow
page and let the guest page became normal as possible, that is just what
kvm_mmu_pte_write does, however, sometimes, it does not work well:
- kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
  function when spte is prefetched, unfortunately, we can not know how many
  spte need to be prefetched on this path, that means we can use out of the
  free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
  path does not fill the cache, such as INS instruction emulated that does not
  trigger page fault.


- we usually use repeat string instructions to clear the page, for example,
  we call memset to clear a page table, 'stosb' is used in this function, and 
  repeated for 1024 times, that means we should occupy mmu lock for 1024 times
  and walking shadow page cache for 1024 times, it is terrible.

- Sometimes, we only modify the last one byte of a pte to update status bit,
  for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
  instruction is used in this function, in this case, kvm_mmu_pte_write will
  treat it as misaligned access, and the shadow page table is zapped.

- detecting write-flooding does not work well, when we handle page written, if
  the last speculative spte is not accessed, we treat the page is
  write-flooding, however, we can speculative spte on many path, such as pte
  prefetch, page synced, that means the last speculative spte may be not point
  to the written page and the written page can be accessed via other sptes, so
  depends on the Accessed bit of the last speculative spte is not enough.


In this patchset, we fixed/avoided these issues:
- instead of filling the cache in page fault path, we do it in
  kvm_mmu_pte_write, and do not prefetch the spte if it dose not have free
  pte_list_desc object in the cache.

- if it is the repeat string instructions emulated and it is not a IO/MMIO
  access, we can zap all the corresponding shadow pages and return to the guest
  then, the mapping can became writable and directly write the page

- do not zap the shadow page if it only modify the last byte of pte.

- Instead of detected page accessed, we can detect whether the spte is accessed
  or not, if the spte is not accessed but it is written frequently, we treat is
  not a page table or it not used for a long time.


Performance test result:
the performance is obvious improved tested by kernebench:

Before patchset  After patchset
3m0.094s   2m50.177s
3m1.813s   2m52.774s
3m6.2392m51.512




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/11] KVM: MMU: avoid pte_list_desc run out in kvm_mmu_pte_write

2011-07-26 Thread Xiao Guangrong
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
function when spte is prefetched, unfortunately, we can not know how many
spte need to be prefetched on this path, that means we can use out of the
free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
path does not fill the cache, such as INS instruction emulated that does not
trigger page fault

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9335e1b..587695b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -593,6 +593,11 @@ static int mmu_topup_memory_cache(struct 
kvm_mmu_memory_cache *cache,
return 0;
 }
 
+static int mmu_memory_cache_free_object(struct kvm_mmu_memory_cache *cache)
+{
+   return cache->nobjs;
+}
+
 static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc,
  struct kmem_cache *cache)
 {
@@ -3526,6 +3531,14 @@ static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, 
gfn_t gfn)
set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
 }
 
+static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
+{
+   struct kvm_mmu_memory_cache *cache;
+
+   cache = &vcpu->arch.mmu_pte_list_desc_cache;
+   return mmu_memory_cache_free_object(cache);
+}
+
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
   const u8 *new, int bytes,
   bool guest_initiated)
@@ -3583,6 +3596,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
break;
}
 
+   mmu_topup_memory_caches(vcpu);
spin_lock(&vcpu->kvm->mmu_lock);
if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
gentry = 0;
@@ -3653,7 +3667,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
mmu_page_zap_pte(vcpu->kvm, sp, spte);
if (gentry &&
  !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
- & mask.word))
+ & mask.word) && get_free_pte_list_desc_nr(vcpu))
mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
if (!remote_flush && need_remote_flush(entry, *spte))
remote_flush = true;
@@ -3714,10 +3728,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u32 error_code,
goto out;
}
 
-   r = mmu_topup_memory_caches(vcpu);
-   if (r)
-   goto out;
-
er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
 
switch (er) {
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/11] KVM: x86: cleanup pio/pout emulated

2011-07-26 Thread Xiao Guangrong
Remove the same code between emulator_pio_in_emulated and
emulator_pio_out_emulated

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/x86.c |   59 ++-
 1 files changed, 26 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e80f0d7..5543f99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4320,32 +4320,24 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd)
return r;
 }
 
-
-static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
-   int size, unsigned short port, void *val,
-   unsigned int count)
+static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
+  unsigned short port, void *val,
+  unsigned int count, bool in)
 {
-   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
-   if (vcpu->arch.pio.count)
-   goto data_avail;
-
-   trace_kvm_pio(0, port, size, count);
+   trace_kvm_pio(!in, port, size, count);
 
vcpu->arch.pio.port = port;
-   vcpu->arch.pio.in = 1;
+   vcpu->arch.pio.in = in;
vcpu->arch.pio.count  = count;
vcpu->arch.pio.size = size;
 
if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
-   data_avail:
-   memcpy(val, vcpu->arch.pio_data, size * count);
vcpu->arch.pio.count = 0;
return 1;
}
 
vcpu->run->exit_reason = KVM_EXIT_IO;
-   vcpu->run->io.direction = KVM_EXIT_IO_IN;
+   vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
vcpu->run->io.size = size;
vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
vcpu->run->io.count = count;
@@ -4354,36 +4346,37 @@ static int emulator_pio_in_emulated(struct 
x86_emulate_ctxt *ctxt,
return 0;
 }
 
-static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
-int size, unsigned short port,
-const void *val, unsigned int count)
+static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
+   int size, unsigned short port, void *val,
+   unsigned int count)
 {
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+   int ret;
 
-   trace_kvm_pio(1, port, size, count);
-
-   vcpu->arch.pio.port = port;
-   vcpu->arch.pio.in = 0;
-   vcpu->arch.pio.count = count;
-   vcpu->arch.pio.size = size;
-
-   memcpy(vcpu->arch.pio_data, val, size * count);
+   if (vcpu->arch.pio.count)
+   goto data_avail;
 
-   if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
+   ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
+   if (ret) {
+data_avail:
+   memcpy(val, vcpu->arch.pio_data, size * count);
vcpu->arch.pio.count = 0;
return 1;
}
 
-   vcpu->run->exit_reason = KVM_EXIT_IO;
-   vcpu->run->io.direction = KVM_EXIT_IO_OUT;
-   vcpu->run->io.size = size;
-   vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
-   vcpu->run->io.count = count;
-   vcpu->run->io.port = port;
-
return 0;
 }
 
+static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
+int size, unsigned short port,
+const void *val, unsigned int count)
+{
+   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+   memcpy(vcpu->arch.pio_data, val, size * count);
+   return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
+}
+
 static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
return kvm_x86_ops->get_segment_base(vcpu, seg);
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/11] KVM: x86: fast emulate repeat string write instructions

2011-07-26 Thread Xiao Guangrong
We usually use repeat string instructions to clear the page, for example,
we call memset to clear a page table, stosb is used in this function, and 
repeated for 1024 times, that means we should occupy mmu lock for 1024 times
and walking shadow page cache for 1024 times, it is terrible

In fact, if it is the repeat string instructions emulated and it is not a
IO/MMIO access, we can zap all the corresponding shadow pages and return to the
guest, then the mapping can became writable and directly write the page

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_emulate.h |1 +
 arch/x86/include/asm/kvm_host.h|2 +-
 arch/x86/kvm/emulate.c |5 +++--
 arch/x86/kvm/mmu.c |4 ++--
 arch/x86/kvm/paging_tmpl.h |3 ++-
 arch/x86/kvm/x86.c |9 +++--
 6 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 6040d11..7e5c4d5 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -244,6 +244,7 @@ struct x86_emulate_ctxt {
bool guest_mode; /* guest running a nested guest */
bool perm_ok; /* do not check permissions if true */
bool only_vendor_specific_insn;
+   bool pio_mmio_emulate; /* it is the pio/mmio access if true */
 
bool have_exception;
struct x86_exception exception;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c00ec28..95f55a9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -752,7 +752,7 @@ int fx_init(struct kvm_vcpu *vcpu);
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
   const u8 *new, int bytes,
-  bool guest_initiated);
+  bool guest_initiated, bool repeat_write);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6f08bc9..36fbac6 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4009,8 +4009,9 @@ writeback:
 * Re-enter guest when pio read ahead buffer is empty
 * or, if it is not used, after each 1024 iteration.
 */
-   if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff) 
&&
-   (r->end == 0 || r->end != r->pos)) {
+   if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff)
+ && (r->end == 0 || r->end != r->pos) &&
+ ctxt->pio_mmio_emulate) {
/*
 * Reset read cache. Usually happens before
 * decode, but since instruction is restarted
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 587695b..5193de8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3541,7 +3541,7 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu 
*vcpu)
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
   const u8 *new, int bytes,
-  bool guest_initiated)
+  bool guest_initiated, bool repeat_write)
 {
gfn_t gfn = gpa >> PAGE_SHIFT;
union kvm_mmu_page_role mask = { .word = 0 };
@@ -3622,7 +3622,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
pte_size = sp->role.cr4_pae ? 8 : 4;
misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
misaligned |= bytes < 4;
-   if (misaligned || flooded) {
+   if (misaligned || flooded || repeat_write) {
/*
 * Misaligned accesses are too much trouble to fix
 * up; also, they usually indicate a page is not used
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..02daac5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -710,7 +710,8 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 
if (mmu_topup_memory_caches(vcpu))
return;
-   kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t), 0);
+   kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t),
+ false, false);
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5543f99..6fb9001 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4058,7 +4058,8 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
ret = kvm_write_guest(vcpu->kvm, gpa, val, bytes);
if (ret < 0)
return 0;
-   kvm_

[PATCH 04/11] KVM: MMU: do not mark access bit on pte write path

2011-07-26 Thread Xiao Guangrong
In current code, the accessed bit is always set when page fault occurred,
do not need to set it on pte write path

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |1 -
 arch/x86/kvm/mmu.c  |   22 +-
 2 files changed, 1 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 95f55a9..ad88bfb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -355,7 +355,6 @@ struct kvm_vcpu_arch {
gfn_t last_pt_write_gfn;
int   last_pt_write_count;
u64  *last_pte_updated;
-   gfn_t last_pte_gfn;
 
struct fpu guest_fpu;
u64 xcr0;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5193de8..992a8a5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2196,11 +2196,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (set_mmio_spte(sptep, gfn, pfn, pte_access))
return 0;
 
-   /*
-* We don't set the accessed bit, since we sometimes want to see
-* whether the guest actually used the pte (in order to detect
-* demand paging).
-*/
spte = PT_PRESENT_MASK;
if (!speculative)
spte |= shadow_accessed_mask;
@@ -2351,10 +2346,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
}
kvm_release_pfn_clean(pfn);
-   if (speculative) {
+   if (speculative)
vcpu->arch.last_pte_updated = sptep;
-   vcpu->arch.last_pte_gfn = gfn;
-   }
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3519,18 +3512,6 @@ static bool last_updated_pte_accessed(struct kvm_vcpu 
*vcpu)
return !!(spte && (*spte & shadow_accessed_mask));
 }
 
-static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
-   u64 *spte = vcpu->arch.last_pte_updated;
-
-   if (spte
-   && vcpu->arch.last_pte_gfn == gfn
-   && shadow_accessed_mask
-   && !(*spte & shadow_accessed_mask)
-   && is_shadow_present_pte(*spte))
-   set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
-}
-
 static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 {
struct kvm_mmu_memory_cache *cache;
@@ -3604,7 +3585,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
++vcpu->kvm->stat.mmu_pte_write;
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
if (guest_initiated) {
-   kvm_mmu_access_page(vcpu, gfn);
if (gfn == vcpu->arch.last_pt_write_gfn
&& !last_updated_pte_accessed(vcpu)) {
++vcpu->arch.last_pt_write_count;
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 00/23] Memory API, batch 1

2011-07-26 Thread Avi Kivity
This patchset contains the core of the memory API, with one device
(usb-ohci) coverted for reference.  The API is currently implemented on
top of the old ram_addr_t/cpu_register_physical_memory() API, but the plan
is to make it standalone later.

The goals of the API are:
 - correctness: by modelling the memory hierarchy, things like the 440FX PAM
   registers and movable, overlapping PCI BARs can be modelled accurately.
 - efficiency: by maintaining an object tree describing guest memory, we
   can eventually get rid of the page descriptor array
 - security: by having more information available declaratively, we reduce
   coding errors that may be exploited by malicious guests

Also available from

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
 refs/tags/memory-region-batch-1-v2

Changes from v1:
 - switched to gtk-doc
 - more copyright blurbs
 - simplified flatview_simplify()
 - use assert() instead of abort() for invariant checks
   (but keep abort() for runtime errors)
 - commit log fixups

Avi Kivity (23):
  Add memory API documentation
  Hierarchical memory region API
  memory: implement dirty tracking
  memory: merge adjacent segments of a single memory region
  Internal interfaces for memory API
  memory: abstract address space operations
  memory: rename MemoryRegion::has_ram_addr to ::terminates
  memory: late initialization of ram_addr
  memory: I/O address space support
  memory: add backward compatibility for old portio registration
  memory: add backward compatibility for old mmio registration
  memory: add ioeventfd support
  memory: separate building the final memory map into two steps
  memory: transaction API
  exec.c: initialize memory map
  ioport: register ranges by byte aligned addresses always
  pc: grab system_memory
  pc: convert pc_memory_init() to memory API
  pc: move global memory map out of pc_init1() and into its callers
  pci: pass address space to pci bus when created
  pci: add MemoryRegion based BAR management API
  sysbus: add MemoryRegion based memory management API
  usb-ohci: convert to MemoryRegion

 Makefile.target|1 +
 docs/memory.txt|  172 
 exec-memory.h  |   39 ++
 exec.c |   19 +
 hw/apb_pci.c   |2 +
 hw/bonito.c|4 +-
 hw/grackle_pci.c   |5 +-
 hw/gt64xxx.c   |4 +-
 hw/pc.c|   62 ++-
 hw/pc.h|9 +-
 hw/pc_piix.c   |   20 +-
 hw/pci.c   |   63 +++-
 hw/pci.h   |   15 +-
 hw/pci_host.h  |1 +
 hw/pci_internals.h |1 +
 hw/piix_pci.c  |   13 +-
 hw/ppc4xx_pci.c|5 +-
 hw/ppc_mac.h   |9 +-
 hw/ppc_newworld.c  |5 +-
 hw/ppc_oldworld.c  |3 +-
 hw/ppc_prep.c  |3 +-
 hw/ppce500_pci.c   |6 +-
 hw/prep_pci.c  |5 +-
 hw/prep_pci.h  |3 +-
 hw/sh_pci.c|4 +-
 hw/sysbus.c|   27 ++-
 hw/sysbus.h|3 +
 hw/unin_pci.c  |   10 +-
 hw/usb-ohci.c  |   42 +--
 hw/versatile_pci.c |2 +
 ioport.c   |4 +-
 memory.c   | 1141 
 memory.h   |  469 +
 33 files changed, 2072 insertions(+), 99 deletions(-)
 create mode 100644 docs/memory.txt
 create mode 100644 exec-memory.h
 create mode 100644 memory.c
 create mode 100644 memory.h

-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Avi Kivity
Simple implementations of memory routers, for example the Cirrus VGA memory 
banks
or the 440FX PAM registers can generate adjacent memory regions which are 
contiguous.
Detect these and merge them; this saves kvm memory slots and shortens lookup 
times.

Signed-off-by: Avi Kivity 
---
 memory.c |   29 +
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index d858b47..121f9e1 100644
--- a/memory.c
+++ b/memory.c
@@ -122,6 +122,34 @@ static void flatview_destroy(FlatView *view)
 qemu_free(view->ranges);
 }
 
+static bool can_merge(FlatRange *r1, FlatRange *r2)
+{
+return addrrange_end(r1->addr) == r2->addr.start
+&& r1->mr == r2->mr
+&& r1->offset_in_region + r1->addr.size == r2->offset_in_region
+&& r1->dirty_log_mask == r2->dirty_log_mask;
+}
+
+/* Attempt to simplify a view by merging ajacent ranges */
+static void flatview_simplify(FlatView *view)
+{
+unsigned i, j;
+
+i = 0;
+while (i < view->nr) {
+j = i + 1;
+while (j < view->nr
+   && can_merge(&view->ranges[j-1], &view->ranges[j])) {
+view->ranges[i].addr.size += view->ranges[j].addr.size;
+++j;
+}
+++i;
+memmove(&view->ranges[i], &view->ranges[j],
+(view->nr - j) * sizeof(view->ranges[j]));
+view->nr -= j - i;
+}
+}
+
 /* Render a memory region into the global view.  Ranges in @view obscure
  * ranges in @mr.
  */
@@ -209,6 +237,7 @@ static FlatView generate_memory_topology(MemoryRegion *mr)
 flatview_init(&view);
 
 render_memory_region(&view, mr, 0, addrrange_make(0, UINT64_MAX));
+flatview_simplify(&view);
 
 return view;
 }
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 03/23] memory: implement dirty tracking

2011-07-26 Thread Avi Kivity
Currently dirty tracking is implemented by passing through
all calls to the underlying cpu_physical_memory_*() calls.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 memory.c |   39 +++
 memory.h |1 +
 2 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/memory.c b/memory.c
index a9cf317..d858b47 100644
--- a/memory.c
+++ b/memory.c
@@ -69,6 +69,7 @@ struct FlatRange {
 MemoryRegion *mr;
 target_phys_addr_t offset_in_region;
 AddrRange addr;
+uint8_t dirty_log_mask;
 };
 
 /* Flattened global view of current active memory hierarchy.  Kept in sorted
@@ -177,6 +178,7 @@ static void render_memory_region(FlatView *view,
 fr.mr = mr;
 fr.offset_in_region = offset_in_region;
 fr.addr = addrrange_make(base, now);
+fr.dirty_log_mask = mr->dirty_log_mask;
 flatview_insert(view, i, &fr);
 ++i;
 base += now;
@@ -194,6 +196,7 @@ static void render_memory_region(FlatView *view,
 fr.mr = mr;
 fr.offset_in_region = offset_in_region;
 fr.addr = addrrange_make(base, remain);
+fr.dirty_log_mask = mr->dirty_log_mask;
 flatview_insert(view, i, &fr);
 }
 }
@@ -247,9 +250,14 @@ static void memory_region_update_topology(void)
 } else if (frold && frnew && flatrange_equal(frold, frnew)) {
 /* In both (logging may have changed) */
 
+if (frold->dirty_log_mask && !frnew->dirty_log_mask) {
+cpu_physical_log_stop(frnew->addr.start, frnew->addr.size);
+} else if (frnew->dirty_log_mask && !frold->dirty_log_mask) {
+cpu_physical_log_start(frnew->addr.start, frnew->addr.size);
+}
+
 ++iold;
 ++inew;
-/* FIXME: dirty logging */
 } else {
 /* In new */
 
@@ -267,7 +275,7 @@ static void memory_region_update_topology(void)
  frnew->addr.size,
  phys_offset,
  region_offset,
- 0);
+ frnew->dirty_log_mask);
 ++inew;
 }
 }
@@ -292,6 +300,7 @@ void memory_region_init(MemoryRegion *mr,
 memset(&mr->subregions_link, 0, sizeof mr->subregions_link);
 QTAILQ_INIT(&mr->coalesced);
 mr->name = qemu_strdup(name);
+mr->dirty_log_mask = 0;
 }
 
 static bool memory_region_access_valid(MemoryRegion *mr,
@@ -496,24 +505,35 @@ void memory_region_set_offset(MemoryRegion *mr, 
target_phys_addr_t offset)
 
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
-/* FIXME */
+uint8_t mask = 1 << client;
+
+mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask);
+memory_region_update_topology();
 }
 
 bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
  unsigned client)
 {
-/* FIXME */
-return true;
+assert(mr->has_ram_addr);
+return cpu_physical_memory_get_dirty(mr->ram_addr + addr, 1 << client);
 }
 
 void memory_region_set_dirty(MemoryRegion *mr, target_phys_addr_t addr)
 {
-/* FIXME */
+assert(mr->has_ram_addr);
+return cpu_physical_memory_set_dirty(mr->ram_addr + addr);
 }
 
 void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
 {
-/* FIXME */
+FlatRange *fr;
+
+FOR_EACH_FLAT_RANGE(fr, ¤t_memory_map) {
+if (fr->mr == mr) {
+cpu_physical_sync_dirty_bitmap(fr->addr.start,
+   fr->addr.start + fr->addr.size);
+}
+}
 }
 
 void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
@@ -524,7 +544,10 @@ void memory_region_set_readonly(MemoryRegion *mr, bool 
readonly)
 void memory_region_reset_dirty(MemoryRegion *mr, target_phys_addr_t addr,
target_phys_addr_t size, unsigned client)
 {
-/* FIXME */
+assert(mr->has_ram_addr);
+cpu_physical_memory_reset_dirty(mr->ram_addr + addr,
+mr->ram_addr + addr + size,
+1 << client);
 }
 
 void *memory_region_get_ram_ptr(MemoryRegion *mr)
diff --git a/memory.h b/memory.h
index a4c94bd..d441bd8 100644
--- a/memory.h
+++ b/memory.h
@@ -99,6 +99,7 @@ struct MemoryRegion {
 QTAILQ_ENTRY(MemoryRegion) subregions_link;
 QTAILQ_HEAD(coalesced_ranges, CoalescedMemoryRange) coalesced;
 const char *name;
+uint8_t dirty_log_mask;
 };
 
 /**
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 01/23] Add memory API documentation

2011-07-26 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 docs/memory.txt |  172 +++
 1 files changed, 172 insertions(+), 0 deletions(-)
 create mode 100644 docs/memory.txt

diff --git a/docs/memory.txt b/docs/memory.txt
new file mode 100644
index 000..4460c06
--- /dev/null
+++ b/docs/memory.txt
@@ -0,0 +1,172 @@
+The memory API
+==
+
+The memory API models the memory and I/O buses and controllers of a QEMU
+machine.  It attempts to allow modelling of:
+
+ - ordinary RAM
+ - memory-mapped I/O (MMIO)
+ - memory controllers that can dynamically reroute physical memory regions
+  to different destinations
+
+The memory model provides support for
+
+ - tracking RAM changes by the guest
+ - setting up coalesced memory for kvm
+ - setting up ioeventfd regions for kvm
+
+Memory is modelled as an tree (really acyclic graph) of MemoryRegion objects.
+The root of the tree is memory as seen from the CPU's viewpoint (the system
+bus).  Nodes in the tree represent other buses, memory controllers, and
+memory regions that have been rerouted.  Leaves are RAM and MMIO regions.
+
+Types of regions
+
+
+There are four types of memory regions (all represented by a single C type
+MemoryRegion):
+
+- RAM: a RAM region is simply a range of host memory that can be made available
+  to the guest.
+
+- MMIO: a range of guest memory that is implemented by host callbacks;
+  each read or write causes a callback to be called on the host.
+
+- container: a container simply includes other memory regions, each at
+  a different offset.  Containers are useful for grouping several regions
+  into one unit.  For example, a PCI BAR may be composed of a RAM region
+  and an MMIO region.
+
+  A container's subregions are usually non-overlapping.  In some cases it is
+  useful to have overlapping regions; for example a memory controller that
+  can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
+  that does not prevent card from claiming overlapping BARs.
+
+- alias: a subsection of another region.  Aliases allow a region to be
+  split apart into discontiguous regions.  Examples of uses are memory banks
+  used when the guest address space is smaller than the amount of RAM
+  addressed, or a memory controller that splits main memory to expose a "PCI
+  hole".  Aliases may point to any type of region, including other aliases,
+  but an alias may not point back to itself, directly or indirectly.
+
+
+Region names
+
+
+Regions are assigned names by the constructor.  For most regions these are
+only used for debugging purposes, but RAM regions also use the name to identify
+live migration sections.  This means that RAM region names need to have ABI
+stability.
+
+Region lifecycle
+
+
+A region is created by one of the constructor functions (memory_region_init*())
+and destroyed by the destructor (memory_region_destroy()).  In between,
+a region can be added to an address space by using 
memory_region_add_subregion()
+and removed using memory_region_del_subregion().  Region attributes may be
+changed at any point; they take effect once the region becomes exposed to the
+guest.
+
+Overlapping regions and priority
+
+Usually, regions may not overlap each other; a memory address decodes into
+exactly one target.  In some cases it is useful to allow regions to overlap,
+and sometimes to control which of an overlapping regions is visible to the
+guest.  This is done with memory_region_add_subregion_overlap(), which
+allows the region to overlap any other region in the same container, and
+specifies a priority that allows the core to decide which of two regions at
+the same address are visible (highest wins).
+
+Visibility
+--
+The memory core uses the following rules to select a memory region when the
+guest accesses an address:
+
+- all direct subregions of the root region are matched against the address, in
+  descending priority order
+  - if the address lies outside the region offset/size, the subregion is
+discarded
+  - if the subregion is a leaf (RAM or MMIO), the seach terminates
+  - if the subregion is a container, the same algorithm is used within the
+subregion (after the address is adjusted by the subregion offset)
+  - if the subregion is an alias, the search is continues at the alias target
+(after the address is adjusted by the subregion offset and alias offset)
+
+Example memory map
+--
+
+system_memory: container@0-2^48-1
+ |
+ + lomem: alias@0-0xdfff ---> #ram (0-0xdfff)
+ |
+ + himem: alias@0x1-0x11fff ---> #ram (0xe000-0x)
+ |
+ + vga-window: alias@0xa-0xbf ---> #pci (0xa-0xb)
+ |  (prio 1)
+ |
+ + pci-hole: alias@0xe000-0x ---> #pci (0xe000-0x)
+
+pci (0-2^32-1)
+ |
+ +--- vga-area: container@0xa-0xb
+ |  |
+ |  +--- alias@0x0-0x7fff  ---> #vram (0x010

[PATCH v2 02/23] Hierarchical memory region API

2011-07-26 Thread Avi Kivity
The memory API separates the attributes of a memory region (its size, how
reads or writes are handled, dirty logging, and coalescing) from where it
is mapped and whether it is enabled.  This allows a device to configure
a memory region once, then hand it off to its parent bus to map it according
to the bus configuration.

Hierarchical registration also allows a device to compose a region out of
a number of sub-regions with different properties; for example some may be
RAM while others may be MMIO.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 Makefile.target |1 +
 memory.c|  653 +++
 memory.h|  385 
 3 files changed, 1039 insertions(+), 0 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h

diff --git a/Makefile.target b/Makefile.target
index cde509b..8884a56 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -198,6 +198,7 @@ obj-$(CONFIG_REALLY_VIRTFS) += 9pfs/virtio-9p-device.o
 obj-y += rwhandler.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 obj-$(CONFIG_NO_KVM) += kvm-stub.o
+obj-y += memory.o
 LIBS+=-lz
 
 QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
diff --git a/memory.c b/memory.c
new file mode 100644
index 000..a9cf317
--- /dev/null
+++ b/memory.c
@@ -0,0 +1,653 @@
+/*
+ * Physical memory management
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *  Avi Kivity 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "memory.h"
+#include 
+
+typedef struct AddrRange AddrRange;
+
+struct AddrRange {
+uint64_t start;
+uint64_t size;
+};
+
+static AddrRange addrrange_make(uint64_t start, uint64_t size)
+{
+return (AddrRange) { start, size };
+}
+
+static bool addrrange_equal(AddrRange r1, AddrRange r2)
+{
+return r1.start == r2.start && r1.size == r2.size;
+}
+
+static uint64_t addrrange_end(AddrRange r)
+{
+return r.start + r.size;
+}
+
+static AddrRange addrrange_shift(AddrRange range, int64_t delta)
+{
+range.start += delta;
+return range;
+}
+
+static bool addrrange_intersects(AddrRange r1, AddrRange r2)
+{
+return (r1.start >= r2.start && r1.start < r2.start + r2.size)
+|| (r2.start >= r1.start && r2.start < r1.start + r1.size);
+}
+
+static AddrRange addrrange_intersection(AddrRange r1, AddrRange r2)
+{
+uint64_t start = MAX(r1.start, r2.start);
+/* off-by-one arithmetic to prevent overflow */
+uint64_t end = MIN(addrrange_end(r1) - 1, addrrange_end(r2) - 1);
+return addrrange_make(start, end - start + 1);
+}
+
+struct CoalescedMemoryRange {
+AddrRange addr;
+QTAILQ_ENTRY(CoalescedMemoryRange) link;
+};
+
+typedef struct FlatRange FlatRange;
+typedef struct FlatView FlatView;
+
+/* Range of memory in the global map.  Addresses are absolute. */
+struct FlatRange {
+MemoryRegion *mr;
+target_phys_addr_t offset_in_region;
+AddrRange addr;
+};
+
+/* Flattened global view of current active memory hierarchy.  Kept in sorted
+ * order.
+ */
+struct FlatView {
+FlatRange *ranges;
+unsigned nr;
+unsigned nr_allocated;
+};
+
+#define FOR_EACH_FLAT_RANGE(var, view)  \
+for (var = (view)->ranges; var < (view)->ranges + (view)->nr; ++var)
+
+static FlatView current_memory_map;
+static MemoryRegion *root_memory_region;
+
+static bool flatrange_equal(FlatRange *a, FlatRange *b)
+{
+return a->mr == b->mr
+&& addrrange_equal(a->addr, b->addr)
+&& a->offset_in_region == b->offset_in_region;
+}
+
+static void flatview_init(FlatView *view)
+{
+view->ranges = NULL;
+view->nr = 0;
+view->nr_allocated = 0;
+}
+
+/* Insert a range into a given position.  Caller is responsible for maintaining
+ * sorting order.
+ */
+static void flatview_insert(FlatView *view, unsigned pos, FlatRange *range)
+{
+if (view->nr == view->nr_allocated) {
+view->nr_allocated = MAX(2 * view->nr, 10);
+view->ranges = qemu_realloc(view->ranges,
+view->nr_allocated * 
sizeof(*view->ranges));
+}
+memmove(view->ranges + pos + 1, view->ranges + pos,
+(view->nr - pos) * sizeof(FlatRange));
+view->ranges[pos] = *range;
+++view->nr;
+}
+
+static void flatview_destroy(FlatView *view)
+{
+qemu_free(view->ranges);
+}
+
+/* Render a memory region into the global view.  Ranges in @view obscure
+ * ranges in @mr.
+ */
+static void render_memory_region(FlatView *view,
+ MemoryRegion *mr,
+ target_phys_addr_t base,
+ AddrRange clip)
+{
+MemoryRegion *subregion;
+unsigned i;
+target_phys_addr_t offset_in_region;
+uint64_t remain;
+uint64_t now;
+FlatRange fr;
+AddrRange tmp;
+
+base += mr->addr;
+
+tmp = addrrange_make(base, mr->size);
+
+if (!addrrange_intersects(t

[PATCH v2 08/23] memory: late initialization of ram_addr

2011-07-26 Thread Avi Kivity
For non-RAM memory regions, we cannot tell whether this is an I/O region
or an MMIO region.  Since the qemu backing registration is different for
the two, we have to defer initialization until we know which address
space we are in.

These shenanigans will be removed once the backing registration is unified
with the memory API.

Signed-off-by: Avi Kivity 
---
 memory.c |   24 
 memory.h |1 +
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/memory.c b/memory.c
index 9e1a838..e839c9e 100644
--- a/memory.c
+++ b/memory.c
@@ -165,10 +165,14 @@ static void flatview_simplify(FlatView *view)
 }
 }
 
+static void memory_region_prepare_ram_addr(MemoryRegion *mr);
+
 static void as_memory_range_add(AddressSpace *as, FlatRange *fr)
 {
 ram_addr_t phys_offset, region_offset;
 
+memory_region_prepare_ram_addr(fr->mr);
+
 phys_offset = fr->mr->ram_addr;
 region_offset = fr->offset_in_region;
 /* cpu_register_physical_memory_log() wants region_offset for
@@ -519,6 +523,19 @@ static CPUWriteMemoryFunc * const 
memory_region_write_thunk[] = {
 memory_region_write_thunk_l,
 };
 
+static void memory_region_prepare_ram_addr(MemoryRegion *mr)
+{
+if (mr->backend_registered) {
+return;
+}
+
+mr->ram_addr = cpu_register_io_memory(memory_region_read_thunk,
+  memory_region_write_thunk,
+  mr,
+  mr->ops->endianness);
+mr->backend_registered = true;
+}
+
 void memory_region_init_io(MemoryRegion *mr,
const MemoryRegionOps *ops,
void *opaque,
@@ -529,10 +546,7 @@ void memory_region_init_io(MemoryRegion *mr,
 mr->ops = ops;
 mr->opaque = opaque;
 mr->terminates = true;
-mr->ram_addr = cpu_register_io_memory(memory_region_read_thunk,
-  memory_region_write_thunk,
-  mr,
-  mr->ops->endianness);
+mr->backend_registered = false;
 }
 
 void memory_region_init_ram(MemoryRegion *mr,
@@ -543,6 +557,7 @@ void memory_region_init_ram(MemoryRegion *mr,
 memory_region_init(mr, name, size);
 mr->terminates = true;
 mr->ram_addr = qemu_ram_alloc(dev, name, size);
+mr->backend_registered = true;
 }
 
 void memory_region_init_ram_ptr(MemoryRegion *mr,
@@ -554,6 +569,7 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 memory_region_init(mr, name, size);
 mr->terminates = true;
 mr->ram_addr = qemu_ram_alloc_from_ptr(dev, name, size, ptr);
+mr->backend_registered = true;
 }
 
 void memory_region_init_alias(MemoryRegion *mr,
diff --git a/memory.h b/memory.h
index 47d6b9d..c481038 100644
--- a/memory.h
+++ b/memory.h
@@ -89,6 +89,7 @@ struct MemoryRegion {
 uint64_t size;
 target_phys_addr_t addr;
 target_phys_addr_t offset;
+bool backend_registered;
 ram_addr_t ram_addr;
 bool terminates;
 MemoryRegion *alias;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 05/23] Internal interfaces for memory API

2011-07-26 Thread Avi Kivity
get_system_memory() provides the root of the memory hierarchy.

This interface is intended to be private between memory.c and exec.c.
If this file is included elsewhere, it should be regarded as a bug (or
TODO item).  However, it will be temporarily needed for the conversion
to hierarchical memory routing.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 exec-memory.h |   36 
 memory.c  |7 +++
 2 files changed, 43 insertions(+), 0 deletions(-)
 create mode 100644 exec-memory.h

diff --git a/exec-memory.h b/exec-memory.h
new file mode 100644
index 000..aea1b45
--- /dev/null
+++ b/exec-memory.h
@@ -0,0 +1,36 @@
+/*
+ * Internal memory managment interfaces
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates
+ *
+ * Authors:
+ *  Avi Kivity 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef EXEC_MEMORY_H
+#define EXEC_MEMORY_H
+
+/*
+ * Internal interfaces between memory.c/exec.c/vl.c.  Do not #include unless
+ * you're one of them.
+ */
+
+#include "memory.h"
+
+#ifndef CONFIG_USER_ONLY
+
+/* Get the root memory region.  This interface should only be used temporarily
+ * until a proper bus interface is available.
+ */
+MemoryRegion *get_system_memory(void);
+
+/* Set the root memory region.  This region is the system memory map. */
+void set_system_memory_map(MemoryRegion *mr);
+
+#endif
+
+#endif
diff --git a/memory.c b/memory.c
index 121f9e1..fcb612e 100644
--- a/memory.c
+++ b/memory.c
@@ -12,6 +12,7 @@
  */
 
 #include "memory.h"
+#include "exec-memory.h"
 #include 
 
 typedef struct AddrRange AddrRange;
@@ -703,3 +704,9 @@ void memory_region_del_subregion(MemoryRegion *mr,
 QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
 memory_region_update_topology();
 }
+
+void set_system_memory_map(MemoryRegion *mr)
+{
+root_memory_region = mr;
+memory_region_update_topology();
+}
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 06/23] memory: abstract address space operations

2011-07-26 Thread Avi Kivity
Prepare for multiple address space support by abstracting away the details
of registering a memory range with qemu's flat representation into an
AddressSpace object.

Note operations which are memory specific are not abstracted, since they will
never be called on I/O address spaces anyway.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 memory.c |  111 +-
 1 files changed, 81 insertions(+), 30 deletions(-)

diff --git a/memory.c b/memory.c
index fcb612e..bae7765 100644
--- a/memory.c
+++ b/memory.c
@@ -82,12 +82,26 @@ struct FlatView {
 unsigned nr_allocated;
 };
 
+typedef struct AddressSpace AddressSpace;
+typedef struct AddressSpaceOps AddressSpaceOps;
+
+/* A system address space - I/O, memory, etc. */
+struct AddressSpace {
+const AddressSpaceOps *ops;
+MemoryRegion *root;
+FlatView current_map;
+};
+
+struct AddressSpaceOps {
+void (*range_add)(AddressSpace *as, FlatRange *fr);
+void (*range_del)(AddressSpace *as, FlatRange *fr);
+void (*log_start)(AddressSpace *as, FlatRange *fr);
+void (*log_stop)(AddressSpace *as, FlatRange *fr);
+};
+
 #define FOR_EACH_FLAT_RANGE(var, view)  \
 for (var = (view)->ranges; var < (view)->ranges + (view)->nr; ++var)
 
-static FlatView current_memory_map;
-static MemoryRegion *root_memory_region;
-
 static bool flatrange_equal(FlatRange *a, FlatRange *b)
 {
 return a->mr == b->mr
@@ -151,6 +165,54 @@ static void flatview_simplify(FlatView *view)
 }
 }
 
+static void as_memory_range_add(AddressSpace *as, FlatRange *fr)
+{
+ram_addr_t phys_offset, region_offset;
+
+phys_offset = fr->mr->ram_addr;
+region_offset = fr->offset_in_region;
+/* cpu_register_physical_memory_log() wants region_offset for
+ * mmio, but prefers offseting phys_offset for RAM.  Humour it.
+ */
+if ((phys_offset & ~TARGET_PAGE_MASK) <= IO_MEM_ROM) {
+phys_offset += region_offset;
+region_offset = 0;
+}
+
+cpu_register_physical_memory_log(fr->addr.start,
+ fr->addr.size,
+ phys_offset,
+ region_offset,
+ fr->dirty_log_mask);
+}
+
+static void as_memory_range_del(AddressSpace *as, FlatRange *fr)
+{
+cpu_register_physical_memory(fr->addr.start, fr->addr.size,
+ IO_MEM_UNASSIGNED);
+}
+
+static void as_memory_log_start(AddressSpace *as, FlatRange *fr)
+{
+cpu_physical_log_start(fr->addr.start, fr->addr.size);
+}
+
+static void as_memory_log_stop(AddressSpace *as, FlatRange *fr)
+{
+cpu_physical_log_stop(fr->addr.start, fr->addr.size);
+}
+
+static const AddressSpaceOps address_space_ops_memory = {
+.range_add = as_memory_range_add,
+.range_del = as_memory_range_del,
+.log_start = as_memory_log_start,
+.log_stop = as_memory_log_stop,
+};
+
+static AddressSpace address_space_memory = {
+.ops = &address_space_ops_memory,
+};
+
 /* Render a memory region into the global view.  Ranges in @view obscure
  * ranges in @mr.
  */
@@ -243,13 +305,12 @@ static FlatView generate_memory_topology(MemoryRegion *mr)
 return view;
 }
 
-static void memory_region_update_topology(void)
+static void address_space_update_topology(AddressSpace *as)
 {
-FlatView old_view = current_memory_map;
-FlatView new_view = generate_memory_topology(root_memory_region);
+FlatView old_view = as->current_map;
+FlatView new_view = generate_memory_topology(as->root);
 unsigned iold, inew;
 FlatRange *frold, *frnew;
-ram_addr_t phys_offset, region_offset;
 
 /* Generate a symmetric difference of the old and new memory maps.
  * Kill ranges in the old map, and instantiate ranges in the new map.
@@ -274,16 +335,15 @@ static void memory_region_update_topology(void)
 && !flatrange_equal(frold, frnew {
 /* In old, but (not in new, or in new but attributes changed). */
 
-cpu_register_physical_memory(frold->addr.start, frold->addr.size,
- IO_MEM_UNASSIGNED);
+as->ops->range_del(as, frold);
 ++iold;
 } else if (frold && frnew && flatrange_equal(frold, frnew)) {
 /* In both (logging may have changed) */
 
 if (frold->dirty_log_mask && !frnew->dirty_log_mask) {
-cpu_physical_log_stop(frnew->addr.start, frnew->addr.size);
+as->ops->log_stop(as, frnew);
 } else if (frnew->dirty_log_mask && !frold->dirty_log_mask) {
-cpu_physical_log_start(frnew->addr.start, frnew->addr.size);
+as->ops->log_start(as, frnew);
 }
 
 ++iold;
@@ -291,28 +351,19 @@ static void memory_region_update_topology(void)
 } else {
 /* In new */
 
-phys_offset = frnew->mr->ram_addr;
-r

[PATCH v2 09/23] memory: I/O address space support

2011-07-26 Thread Avi Kivity
Allow registering I/O ports via the same mechanism as mmio ranges.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 exec-memory.h |3 ++
 memory.c  |   60 -
 memory.h  |2 +
 3 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/exec-memory.h b/exec-memory.h
index aea1b45..c439aba 100644
--- a/exec-memory.h
+++ b/exec-memory.h
@@ -31,6 +31,9 @@ MemoryRegion *get_system_memory(void);
 /* Set the root memory region.  This region is the system memory map. */
 void set_system_memory_map(MemoryRegion *mr);
 
+/* Set the I/O memory region.  This region is the I/O memory map. */
+void set_system_io_map(MemoryRegion *mr);
+
 #endif
 
 #endif
diff --git a/memory.c b/memory.c
index e839c9e..df0ed0e 100644
--- a/memory.c
+++ b/memory.c
@@ -13,6 +13,7 @@
 
 #include "memory.h"
 #include "exec-memory.h"
+#include "ioport.h"
 #include 
 
 typedef struct AddrRange AddrRange;
@@ -217,6 +218,52 @@ static AddressSpace address_space_memory = {
 .ops = &address_space_ops_memory,
 };
 
+static void memory_region_iorange_read(IORange *iorange,
+   uint64_t offset,
+   unsigned width,
+   uint64_t *data)
+{
+MemoryRegion *mr = container_of(iorange, MemoryRegion, iorange);
+
+*data = mr->ops->read(mr->opaque, offset, width);
+}
+
+static void memory_region_iorange_write(IORange *iorange,
+uint64_t offset,
+unsigned width,
+uint64_t data)
+{
+MemoryRegion *mr = container_of(iorange, MemoryRegion, iorange);
+
+mr->ops->write(mr->opaque, offset, data, width);
+}
+
+static const IORangeOps memory_region_iorange_ops = {
+.read = memory_region_iorange_read,
+.write = memory_region_iorange_write,
+};
+
+static void as_io_range_add(AddressSpace *as, FlatRange *fr)
+{
+iorange_init(&fr->mr->iorange, &memory_region_iorange_ops,
+ fr->addr.start,fr->addr.size);
+ioport_register(&fr->mr->iorange);
+}
+
+static void as_io_range_del(AddressSpace *as, FlatRange *fr)
+{
+isa_unassign_ioport(fr->addr.start, fr->addr.size);
+}
+
+static const AddressSpaceOps address_space_ops_io = {
+.range_add = as_io_range_add,
+.range_del = as_io_range_del,
+};
+
+static AddressSpace address_space_io = {
+.ops = &address_space_ops_io,
+};
+
 /* Render a memory region into the global view.  Ranges in @view obscure
  * ranges in @mr.
  */
@@ -365,7 +412,12 @@ static void address_space_update_topology(AddressSpace *as)
 
 static void memory_region_update_topology(void)
 {
-address_space_update_topology(&address_space_memory);
+if (address_space_memory.root) {
+address_space_update_topology(&address_space_memory);
+}
+if (address_space_io.root) {
+address_space_update_topology(&address_space_io);
+}
 }
 
 void memory_region_init(MemoryRegion *mr,
@@ -777,3 +829,9 @@ void set_system_memory_map(MemoryRegion *mr)
 address_space_memory.root = mr;
 memory_region_update_topology();
 }
+
+void set_system_io_map(MemoryRegion *mr)
+{
+address_space_io.root = mr;
+memory_region_update_topology();
+}
diff --git a/memory.h b/memory.h
index c481038..88ba428 100644
--- a/memory.h
+++ b/memory.h
@@ -22,6 +22,7 @@
 #include "cpu-common.h"
 #include "targphys.h"
 #include "qemu-queue.h"
+#include "iorange.h"
 
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegion MemoryRegion;
@@ -91,6 +92,7 @@ struct MemoryRegion {
 target_phys_addr_t offset;
 bool backend_registered;
 ram_addr_t ram_addr;
+IORange iorange;
 bool terminates;
 MemoryRegion *alias;
 target_phys_addr_t alias_offset;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 07/23] memory: rename MemoryRegion::has_ram_addr to ::terminates

2011-07-26 Thread Avi Kivity
I/O regions will not have ram_addrs, so this is a better name.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 memory.c |   18 +-
 memory.h |2 +-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/memory.c b/memory.c
index bae7765..9e1a838 100644
--- a/memory.c
+++ b/memory.c
@@ -251,7 +251,7 @@ static void render_memory_region(FlatView *view,
 render_memory_region(view, subregion, base, clip);
 }
 
-if (!mr->has_ram_addr) {
+if (!mr->terminates) {
 return;
 }
 
@@ -373,7 +373,7 @@ void memory_region_init(MemoryRegion *mr,
 mr->size = size;
 mr->addr = 0;
 mr->offset = 0;
-mr->has_ram_addr = false;
+mr->terminates = false;
 mr->priority = 0;
 mr->may_overlap = false;
 mr->alias = NULL;
@@ -528,7 +528,7 @@ void memory_region_init_io(MemoryRegion *mr,
 memory_region_init(mr, name, size);
 mr->ops = ops;
 mr->opaque = opaque;
-mr->has_ram_addr = true;
+mr->terminates = true;
 mr->ram_addr = cpu_register_io_memory(memory_region_read_thunk,
   memory_region_write_thunk,
   mr,
@@ -541,7 +541,7 @@ void memory_region_init_ram(MemoryRegion *mr,
 uint64_t size)
 {
 memory_region_init(mr, name, size);
-mr->has_ram_addr = true;
+mr->terminates = true;
 mr->ram_addr = qemu_ram_alloc(dev, name, size);
 }
 
@@ -552,7 +552,7 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
 void *ptr)
 {
 memory_region_init(mr, name, size);
-mr->has_ram_addr = true;
+mr->terminates = true;
 mr->ram_addr = qemu_ram_alloc_from_ptr(dev, name, size, ptr);
 }
 
@@ -595,13 +595,13 @@ void memory_region_set_log(MemoryRegion *mr, bool log, 
unsigned client)
 bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
  unsigned client)
 {
-assert(mr->has_ram_addr);
+assert(mr->terminates);
 return cpu_physical_memory_get_dirty(mr->ram_addr + addr, 1 << client);
 }
 
 void memory_region_set_dirty(MemoryRegion *mr, target_phys_addr_t addr)
 {
-assert(mr->has_ram_addr);
+assert(mr->terminates);
 return cpu_physical_memory_set_dirty(mr->ram_addr + addr);
 }
 
@@ -625,7 +625,7 @@ void memory_region_set_readonly(MemoryRegion *mr, bool 
readonly)
 void memory_region_reset_dirty(MemoryRegion *mr, target_phys_addr_t addr,
target_phys_addr_t size, unsigned client)
 {
-assert(mr->has_ram_addr);
+assert(mr->terminates);
 cpu_physical_memory_reset_dirty(mr->ram_addr + addr,
 mr->ram_addr + addr + size,
 1 << client);
@@ -637,7 +637,7 @@ void *memory_region_get_ram_ptr(MemoryRegion *mr)
 return memory_region_get_ram_ptr(mr->alias) + mr->alias_offset;
 }
 
-assert(mr->has_ram_addr);
+assert(mr->terminates);
 
 return qemu_get_ram_ptr(mr->ram_addr);
 }
diff --git a/memory.h b/memory.h
index d441bd8..47d6b9d 100644
--- a/memory.h
+++ b/memory.h
@@ -90,7 +90,7 @@ struct MemoryRegion {
 target_phys_addr_t addr;
 target_phys_addr_t offset;
 ram_addr_t ram_addr;
-bool has_ram_addr;
+bool terminates;
 MemoryRegion *alias;
 target_phys_addr_t alias_offset;
 unsigned priority;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 15/23] exec.c: initialize memory map

2011-07-26 Thread Avi Kivity
Allocate the root memory region and initialize it.

Signed-off-by: Avi Kivity 
---
 exec.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/exec.c b/exec.c
index 2160ded..d51502f 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,8 @@
 #include "kvm.h"
 #include "hw/xen.h"
 #include "qemu-timer.h"
+#include "memory.h"
+#include "exec-memory.h"
 #if defined(CONFIG_USER_ONLY)
 #include 
 #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -109,6 +111,9 @@ int phys_ram_fd;
 static int in_migration;
 
 RAMList ram_list = { .blocks = QLIST_HEAD_INITIALIZER(ram_list) };
+
+static MemoryRegion *system_memory;
+
 #endif
 
 CPUState *first_cpu;
@@ -197,6 +202,7 @@ typedef struct PhysPageDesc {
 static void *l1_phys_map[P_L1_SIZE];
 
 static void io_mem_init(void);
+static void memory_map_init(void);
 
 /* io memory support */
 CPUWriteMemoryFunc *io_mem_write[IO_MEM_NB_ENTRIES][4];
@@ -571,6 +577,7 @@ void cpu_exec_init_all(unsigned long tb_size)
 code_gen_ptr = code_gen_buffer;
 page_init();
 #if !defined(CONFIG_USER_ONLY)
+memory_map_init();
 io_mem_init();
 #endif
 #if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_USE_GUEST_BASE)
@@ -3807,6 +3814,18 @@ static void io_mem_init(void)
   DEVICE_NATIVE_ENDIAN);
 }
 
+static void memory_map_init(void)
+{
+system_memory = qemu_malloc(sizeof(*system_memory));
+memory_region_init(system_memory, "system", UINT64_MAX);
+set_system_memory_map(system_memory);
+}
+
+MemoryRegion *get_system_memory(void)
+{
+return system_memory;
+}
+
 #endif /* !defined(CONFIG_USER_ONLY) */
 
 /* physical memory access (slow version, mainly for debug) */
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 12/23] memory: add ioeventfd support

2011-07-26 Thread Avi Kivity
As with the rest of the memory API, the caller associates an eventfd
with an address, and the memory API takes care of registering or
unregistering when the address is made visible or invisible to the
guest.

Signed-off-by: Avi Kivity 
---
 memory.c |  224 ++
 memory.h |   45 +
 2 files changed, 269 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index 7dd7cac..686bbf2 100644
--- a/memory.c
+++ b/memory.c
@@ -15,6 +15,7 @@
 #include "exec-memory.h"
 #include "ioport.h"
 #include "bitops.h"
+#include "kvm.h"
 #include 
 
 typedef struct AddrRange AddrRange;
@@ -64,6 +65,50 @@ struct CoalescedMemoryRange {
 QTAILQ_ENTRY(CoalescedMemoryRange) link;
 };
 
+struct MemoryRegionIoeventfd {
+AddrRange addr;
+bool match_data;
+uint64_t data;
+int fd;
+};
+
+static bool memory_region_ioeventfd_before(MemoryRegionIoeventfd a,
+   MemoryRegionIoeventfd b)
+{
+if (a.addr.start < b.addr.start) {
+return true;
+} else if (a.addr.start > b.addr.start) {
+return false;
+} else if (a.addr.size < b.addr.size) {
+return true;
+} else if (a.addr.size > b.addr.size) {
+return false;
+} else if (a.match_data < b.match_data) {
+return true;
+} else  if (a.match_data > b.match_data) {
+return false;
+} else if (a.match_data) {
+if (a.data < b.data) {
+return true;
+} else if (a.data > b.data) {
+return false;
+}
+}
+if (a.fd < b.fd) {
+return true;
+} else if (a.fd > b.fd) {
+return false;
+}
+return false;
+}
+
+static bool memory_region_ioeventfd_equal(MemoryRegionIoeventfd a,
+  MemoryRegionIoeventfd b)
+{
+return !memory_region_ioeventfd_before(a, b)
+&& !memory_region_ioeventfd_before(b, a);
+}
+
 typedef struct FlatRange FlatRange;
 typedef struct FlatView FlatView;
 
@@ -92,6 +137,8 @@ struct AddressSpace {
 const AddressSpaceOps *ops;
 MemoryRegion *root;
 FlatView current_map;
+int ioeventfd_nb;
+MemoryRegionIoeventfd *ioeventfds;
 };
 
 struct AddressSpaceOps {
@@ -99,6 +146,8 @@ struct AddressSpaceOps {
 void (*range_del)(AddressSpace *as, FlatRange *fr);
 void (*log_start)(AddressSpace *as, FlatRange *fr);
 void (*log_stop)(AddressSpace *as, FlatRange *fr);
+void (*ioeventfd_add)(AddressSpace *as, MemoryRegionIoeventfd *fd);
+void (*ioeventfd_del)(AddressSpace *as, MemoryRegionIoeventfd *fd);
 };
 
 #define FOR_EACH_FLAT_RANGE(var, view)  \
@@ -208,11 +257,35 @@ static void as_memory_log_stop(AddressSpace *as, 
FlatRange *fr)
 cpu_physical_log_stop(fr->addr.start, fr->addr.size);
 }
 
+static void as_memory_ioeventfd_add(AddressSpace *as, MemoryRegionIoeventfd 
*fd)
+{
+int r;
+
+assert(fd->match_data && fd->addr.size == 4);
+
+r = kvm_set_ioeventfd_mmio_long(fd->fd, fd->addr.start, fd->data, true);
+if (r < 0) {
+abort();
+}
+}
+
+static void as_memory_ioeventfd_del(AddressSpace *as, MemoryRegionIoeventfd 
*fd)
+{
+int r;
+
+r = kvm_set_ioeventfd_mmio_long(fd->fd, fd->addr.start, fd->data, false);
+if (r < 0) {
+abort();
+}
+}
+
 static const AddressSpaceOps address_space_ops_memory = {
 .range_add = as_memory_range_add,
 .range_del = as_memory_range_del,
 .log_start = as_memory_log_start,
 .log_stop = as_memory_log_stop,
+.ioeventfd_add = as_memory_ioeventfd_add,
+.ioeventfd_del = as_memory_ioeventfd_del,
 };
 
 static AddressSpace address_space_memory = {
@@ -288,9 +361,33 @@ static void as_io_range_del(AddressSpace *as, FlatRange 
*fr)
 isa_unassign_ioport(fr->addr.start, fr->addr.size);
 }
 
+static void as_io_ioeventfd_add(AddressSpace *as, MemoryRegionIoeventfd *fd)
+{
+int r;
+
+assert(fd->match_data && fd->addr.size == 2);
+
+r = kvm_set_ioeventfd_pio_word(fd->fd, fd->addr.start, fd->data, true);
+if (r < 0) {
+abort();
+}
+}
+
+static void as_io_ioeventfd_del(AddressSpace *as, MemoryRegionIoeventfd *fd)
+{
+int r;
+
+r = kvm_set_ioeventfd_pio_word(fd->fd, fd->addr.start, fd->data, false);
+if (r < 0) {
+abort();
+}
+}
+
 static const AddressSpaceOps address_space_ops_io = {
 .range_add = as_io_range_add,
 .range_del = as_io_range_del,
+.ioeventfd_add = as_io_ioeventfd_add,
+.ioeventfd_del = as_io_ioeventfd_del,
 };
 
 static AddressSpace address_space_io = {
@@ -389,6 +486,69 @@ static FlatView generate_memory_topology(MemoryRegion *mr)
 return view;
 }
 
+static void address_space_add_del_ioeventfds(AddressSpace *as,
+ MemoryRegionIoeventfd *fds_new,
+ unsigned fds_new_nb,
+ MemoryRegionIoeventfd *fds_old,
+  

[PATCH v2 14/23] memory: transaction API

2011-07-26 Thread Avi Kivity
Allow changes to the memory hierarchy to be accumulated and
made visible all at once.  This reduces computational effort,
especially when an accelerator (e.g. kvm) is involved.

Useful when a single register update causes multiple changes
to an address space.

Signed-off-by: Avi Kivity 
---
 memory.c |   18 ++
 memory.h |8 
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index 7a5670e..5c6e63d 100644
--- a/memory.c
+++ b/memory.c
@@ -18,6 +18,8 @@
 #include "kvm.h"
 #include 
 
+unsigned memory_region_transaction_depth = 0;
+
 typedef struct AddrRange AddrRange;
 
 struct AddrRange {
@@ -626,6 +628,10 @@ static void address_space_update_topology(AddressSpace *as)
 
 static void memory_region_update_topology(void)
 {
+if (memory_region_transaction_depth) {
+return;
+}
+
 if (address_space_memory.root) {
 address_space_update_topology(&address_space_memory);
 }
@@ -634,6 +640,18 @@ static void memory_region_update_topology(void)
 }
 }
 
+void memory_region_transaction_begin(void)
+{
+++memory_region_transaction_depth;
+}
+
+void memory_region_transaction_commit(void)
+{
+assert(memory_region_transaction_depth);
+--memory_region_transaction_depth;
+memory_region_update_topology();
+}
+
 void memory_region_init(MemoryRegion *mr,
 const char *name,
 uint64_t size)
diff --git a/memory.h b/memory.h
index c280a39..4e518b2 100644
--- a/memory.h
+++ b/memory.h
@@ -456,6 +456,14 @@ void memory_region_add_subregion_overlap(MemoryRegion *mr,
 void memory_region_del_subregion(MemoryRegion *mr,
  MemoryRegion *subregion);
 
+/* Start a transaction; changes will be accumulated and made visible only
+ * when the transaction ends.
+ */
+void memory_region_transaction_begin(void);
+/* Commit a transaction and make changes visible to the guest.
+ */
+void memory_region_transaction_commit(void);
+
 #endif
 
 #endif
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 11/23] memory: add backward compatibility for old mmio registration

2011-07-26 Thread Avi Kivity
This eases the transition to the new API.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 memory.c |   10 ++
 memory.h |   10 ++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index 5f2c9ef..7dd7cac 100644
--- a/memory.c
+++ b/memory.c
@@ -14,6 +14,7 @@
 #include "memory.h"
 #include "exec-memory.h"
 #include "ioport.h"
+#include "bitops.h"
 #include 
 
 typedef struct AddrRange AddrRange;
@@ -506,6 +507,10 @@ static uint32_t memory_region_read_thunk_n(void *_mr,
 return -1U; /* FIXME: better signalling */
 }
 
+if (!mr->ops->read) {
+return mr->ops->old_mmio.read[bitops_ffsl(size)](mr->opaque, addr);
+}
+
 /* FIXME: support unaligned access */
 
 access_size_min = mr->ops->impl.min_access_size;
@@ -542,6 +547,11 @@ static void memory_region_write_thunk_n(void *_mr,
 return; /* FIXME: better signalling */
 }
 
+if (!mr->ops->write) {
+mr->ops->old_mmio.write[bitops_ffsl(size)](mr->opaque, addr, data);
+return;
+}
+
 /* FIXME: support unaligned access */
 
 access_size_min = mr->ops->impl.min_access_size;
diff --git a/memory.h b/memory.h
index 40ab95a..003c999 100644
--- a/memory.h
+++ b/memory.h
@@ -28,6 +28,7 @@
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegion MemoryRegion;
 typedef struct MemoryRegionPortio MemoryRegionPortio;
+typedef struct MemoryRegionMmio MemoryRegionMmio;
 
 /* Must match *_DIRTY_FLAGS in cpu-all.h.  To be replaced with dynamic
  * registration.
@@ -36,6 +37,11 @@ typedef struct MemoryRegionPortio MemoryRegionPortio;
 #define DIRTY_MEMORY_CODE  1
 #define DIRTY_MEMORY_MIGRATION 3
 
+struct MemoryRegionMmio {
+CPUReadMemoryFunc *read[3];
+CPUWriteMemoryFunc *write[3];
+};
+
 /*
  * Memory region callbacks
  */
@@ -85,6 +91,10 @@ struct MemoryRegionOps {
  * backwards compatibility with old portio registration
  */
 const MemoryRegionPortio *old_portio;
+/* If .read and .write are not present, old_mmio may be used for
+ * backwards compatibility with old mmio registration
+ */
+const MemoryRegionMmio old_mmio;
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 10/23] memory: add backward compatibility for old portio registration

2011-07-26 Thread Avi Kivity
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 memory.c |   32 
 memory.h |   17 +
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index df0ed0e..5f2c9ef 100644
--- a/memory.c
+++ b/memory.c
@@ -218,6 +218,21 @@ static AddressSpace address_space_memory = {
 .ops = &address_space_ops_memory,
 };
 
+static const MemoryRegionPortio *find_portio(MemoryRegion *mr, uint64_t offset,
+ unsigned width, bool write)
+{
+const MemoryRegionPortio *mrp;
+
+for (mrp = mr->ops->old_portio; mrp->size; ++mrp) {
+if (offset >= mrp->offset && offset < mrp->offset + mrp->len
+&& width == mrp->size
+&& (write ? (bool)mrp->write : (bool)mrp->read)) {
+return mrp;
+}
+}
+return NULL;
+}
+
 static void memory_region_iorange_read(IORange *iorange,
uint64_t offset,
unsigned width,
@@ -225,6 +240,15 @@ static void memory_region_iorange_read(IORange *iorange,
 {
 MemoryRegion *mr = container_of(iorange, MemoryRegion, iorange);
 
+if (mr->ops->old_portio) {
+const MemoryRegionPortio *mrp = find_portio(mr, offset, width, false);
+
+*data = ((uint64_t)1 << (width * 8)) - 1;
+if (mrp) {
+*data = mrp->read(mr->opaque, offset - mrp->offset);
+}
+return;
+}
 *data = mr->ops->read(mr->opaque, offset, width);
 }
 
@@ -235,6 +259,14 @@ static void memory_region_iorange_write(IORange *iorange,
 {
 MemoryRegion *mr = container_of(iorange, MemoryRegion, iorange);
 
+if (mr->ops->old_portio) {
+const MemoryRegionPortio *mrp = find_portio(mr, offset, width, true);
+
+if (mrp) {
+mrp->write(mr->opaque, offset - mrp->offset, data);
+}
+return;
+}
 mr->ops->write(mr->opaque, offset, data, width);
 }
 
diff --git a/memory.h b/memory.h
index 88ba428..40ab95a 100644
--- a/memory.h
+++ b/memory.h
@@ -23,9 +23,11 @@
 #include "targphys.h"
 #include "qemu-queue.h"
 #include "iorange.h"
+#include "ioport.h"
 
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegion MemoryRegion;
+typedef struct MemoryRegionPortio MemoryRegionPortio;
 
 /* Must match *_DIRTY_FLAGS in cpu-all.h.  To be replaced with dynamic
  * registration.
@@ -78,6 +80,11 @@ struct MemoryRegionOps {
  */
  bool unaligned;
 } impl;
+
+/* If .read and .write are not present, old_portio may be used for
+ * backwards compatibility with old portio registration
+ */
+const MemoryRegionPortio *old_portio;
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -105,6 +112,16 @@ struct MemoryRegion {
 uint8_t dirty_log_mask;
 };
 
+struct MemoryRegionPortio {
+uint32_t offset;
+uint32_t len;
+unsigned size;
+IOPortReadFunc *read;
+IOPortWriteFunc *write;
+};
+
+#define PORTIO_END { }
+
 /**
  * memory_region_init: Initialize a memory region
  *
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 17/23] pc: grab system_memory

2011-07-26 Thread Avi Kivity
While eventually this should come from the machine initialization function,
take a short cut to avoid converting all machines now.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/pc.c  |3 ++-
 hw/pc.h  |4 +++-
 hw/pc_piix.c |8 +++-
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index a3e8539..369566a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -957,7 +957,8 @@ void pc_cpus_init(const char *cpu_model)
 }
 }
 
-void pc_memory_init(const char *kernel_filename,
+void pc_memory_init(MemoryRegion *system_memory,
+const char *kernel_filename,
 const char *kernel_cmdline,
 const char *initrd_filename,
 ram_addr_t below_4g_mem_size,
diff --git a/hw/pc.h b/hw/pc.h
index 6d5730b..fa57583 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -6,6 +6,7 @@
 #include "isa.h"
 #include "fdc.h"
 #include "net.h"
+#include "memory.h"
 
 /* PC-style peripherals (also used by other machines).  */
 
@@ -129,7 +130,8 @@ void pc_cmos_set_s3_resume(void *opaque, int irq, int 
level);
 void pc_acpi_smi_interrupt(void *opaque, int irq, int level);
 
 void pc_cpus_init(const char *cpu_model);
-void pc_memory_init(const char *kernel_filename,
+void pc_memory_init(MemoryRegion *system_memory,
+const char *kernel_filename,
 const char *kernel_cmdline,
 const char *initrd_filename,
 ram_addr_t below_4g_mem_size,
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index c5c16b4..d83854c 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -39,6 +39,8 @@
 #include "blockdev.h"
 #include "smbus.h"
 #include "xen.h"
+#include "memory.h"
+#include "exec-memory.h"
 #ifdef CONFIG_XEN
 #  include 
 #endif
@@ -89,6 +91,9 @@ static void pc_init1(ram_addr_t ram_size,
 DriveInfo *hd[MAX_IDE_BUS * MAX_IDE_DEVS];
 BusState *idebus[MAX_IDE_BUS];
 ISADevice *rtc_state;
+MemoryRegion *system_memory;
+
+system_memory = get_system_memory();
 
 pc_cpus_init(cpu_model);
 
@@ -106,7 +111,8 @@ static void pc_init1(ram_addr_t ram_size,
 
 /* allocate ram and load rom/bios */
 if (!xen_enabled()) {
-pc_memory_init(kernel_filename, kernel_cmdline, initrd_filename,
+pc_memory_init(system_memory,
+   kernel_filename, kernel_cmdline, initrd_filename,
below_4g_mem_size, above_4g_mem_size);
 }
 
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 22/23] sysbus: add MemoryRegion based memory management API

2011-07-26 Thread Avi Kivity
Allow registering sysbus device memory using a MemoryRegion.  Once all users
are converted, sysbus_init_mmio() and sysbus_init_mmio_cb() will be removed.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/sysbus.c |   27 ---
 hw/sysbus.h |3 +++
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/hw/sysbus.c b/hw/sysbus.c
index 2e22be7..ea442ac 100644
--- a/hw/sysbus.c
+++ b/hw/sysbus.c
@@ -19,6 +19,7 @@
 
 #include "sysbus.h"
 #include "monitor.h"
+#include "exec-memory.h"
 
 static void sysbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
 static char *sysbus_get_fw_dev_path(DeviceState *dev);
@@ -49,11 +50,20 @@ void sysbus_mmio_map(SysBusDevice *dev, int n, 
target_phys_addr_t addr)
 }
 if (dev->mmio[n].addr != (target_phys_addr_t)-1) {
 /* Unregister previous mapping.  */
-cpu_register_physical_memory(dev->mmio[n].addr, dev->mmio[n].size,
- IO_MEM_UNASSIGNED);
+if (dev->mmio[n].memory) {
+memory_region_del_subregion(get_system_memory(),
+dev->mmio[n].memory);
+} else {
+cpu_register_physical_memory(dev->mmio[n].addr, dev->mmio[n].size,
+ IO_MEM_UNASSIGNED);
+}
 }
 dev->mmio[n].addr = addr;
-if (dev->mmio[n].cb) {
+if (dev->mmio[n].memory) {
+memory_region_add_subregion(get_system_memory(),
+addr,
+dev->mmio[n].memory);
+} else if (dev->mmio[n].cb) {
 dev->mmio[n].cb(dev, addr);
 } else {
 cpu_register_physical_memory(addr, dev->mmio[n].size,
@@ -107,6 +117,17 @@ void sysbus_init_mmio_cb(SysBusDevice *dev, 
target_phys_addr_t size,
 dev->mmio[n].cb = cb;
 }
 
+void sysbus_init_mmio_region(SysBusDevice *dev, MemoryRegion *memory)
+{
+int n;
+
+assert(dev->num_mmio < QDEV_MAX_MMIO);
+n = dev->num_mmio++;
+dev->mmio[n].addr = -1;
+dev->mmio[n].size = memory_region_size(memory);
+dev->mmio[n].memory = memory;
+}
+
 void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size)
 {
 pio_addr_t i;
diff --git a/hw/sysbus.h b/hw/sysbus.h
index 4e8cb16..5f62e2d 100644
--- a/hw/sysbus.h
+++ b/hw/sysbus.h
@@ -4,6 +4,7 @@
 /* Devices attached directly to the main system bus.  */
 
 #include "qdev.h"
+#include "memory.h"
 
 #define QDEV_MAX_MMIO 32
 #define QDEV_MAX_PIO 32
@@ -23,6 +24,7 @@ struct SysBusDevice {
 target_phys_addr_t size;
 mmio_mapfunc cb;
 ram_addr_t iofunc;
+MemoryRegion *memory;
 } mmio[QDEV_MAX_MMIO];
 int num_pio;
 pio_addr_t pio[QDEV_MAX_PIO];
@@ -46,6 +48,7 @@ void sysbus_init_mmio(SysBusDevice *dev, target_phys_addr_t 
size,
   ram_addr_t iofunc);
 void sysbus_init_mmio_cb(SysBusDevice *dev, target_phys_addr_t size,
 mmio_mapfunc cb);
+void sysbus_init_mmio_region(SysBusDevice *dev, MemoryRegion *memory);
 void sysbus_init_irq(SysBusDevice *dev, qemu_irq *p);
 void sysbus_pass_irq(SysBusDevice *dev, SysBusDevice *target);
 void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t 
size);
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 16/23] ioport: register ranges by byte aligned addresses always

2011-07-26 Thread Avi Kivity
The I/O port space is byte addressable, even for word and long accesses.

An example is the VMware svga card, which has long ports on offsets 0,
1, and 2.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 ioport.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ioport.c b/ioport.c
index 0d2611d..a32483b 100644
--- a/ioport.c
+++ b/ioport.c
@@ -146,7 +146,7 @@ int register_ioport_read(pio_addr_t start, int length, int 
size,
 hw_error("register_ioport_read: invalid size");
 return -1;
 }
-for(i = start; i < start + length; i += size) {
+for(i = start; i < start + length; ++i) {
 ioport_read_table[bsize][i] = func;
 if (ioport_opaque[i] != NULL && ioport_opaque[i] != opaque)
 hw_error("register_ioport_read: invalid opaque for address 0x%x",
@@ -166,7 +166,7 @@ int register_ioport_write(pio_addr_t start, int length, int 
size,
 hw_error("register_ioport_write: invalid size");
 return -1;
 }
-for(i = start; i < start + length; i += size) {
+for(i = start; i < start + length; ++i) {
 ioport_write_table[bsize][i] = func;
 if (ioport_opaque[i] != NULL && ioport_opaque[i] != opaque)
 hw_error("register_ioport_write: invalid opaque for address 0x%x",
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 19/23] pc: move global memory map out of pc_init1() and into its callers

2011-07-26 Thread Avi Kivity
Signed-off-by: Avi Kivity 
---
 hw/pc_piix.c |   15 ---
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index d83854c..f2d0476 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -68,7 +68,8 @@ static void ioapic_init(IsaIrqState *isa_irq_state)
 }
 
 /* PC hardware initialisation */
-static void pc_init1(ram_addr_t ram_size,
+static void pc_init1(MemoryRegion *system_memory,
+ ram_addr_t ram_size,
  const char *boot_device,
  const char *kernel_filename,
  const char *kernel_cmdline,
@@ -91,9 +92,6 @@ static void pc_init1(ram_addr_t ram_size,
 DriveInfo *hd[MAX_IDE_BUS * MAX_IDE_DEVS];
 BusState *idebus[MAX_IDE_BUS];
 ISADevice *rtc_state;
-MemoryRegion *system_memory;
-
-system_memory = get_system_memory();
 
 pc_cpus_init(cpu_model);
 
@@ -214,7 +212,8 @@ static void pc_init_pci(ram_addr_t ram_size,
 const char *initrd_filename,
 const char *cpu_model)
 {
-pc_init1(ram_size, boot_device,
+pc_init1(get_system_memory(),
+ ram_size, boot_device,
  kernel_filename, kernel_cmdline,
  initrd_filename, cpu_model, 1, 1);
 }
@@ -226,7 +225,8 @@ static void pc_init_pci_no_kvmclock(ram_addr_t ram_size,
 const char *initrd_filename,
 const char *cpu_model)
 {
-pc_init1(ram_size, boot_device,
+pc_init1(get_system_memory(),
+ ram_size, boot_device,
  kernel_filename, kernel_cmdline,
  initrd_filename, cpu_model, 1, 0);
 }
@@ -240,7 +240,8 @@ static void pc_init_isa(ram_addr_t ram_size,
 {
 if (cpu_model == NULL)
 cpu_model = "486";
-pc_init1(ram_size, boot_device,
+pc_init1(get_system_memory(),
+ ram_size, boot_device,
  kernel_filename, kernel_cmdline,
  initrd_filename, cpu_model, 0, 1);
 }
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 13/23] memory: separate building the final memory map into two steps

2011-07-26 Thread Avi Kivity
Instead of adding and deleting regions in one pass, do a delete
pass followed by an add pass.  This fixes the following case:

from:
  0x-0x0fff ram  (a1)
  0x1000-0x1fff mmio (a2)
  0x2000-0x2fff ram  (a3)

to:
  0x-0x2fff ram  (b1)

The single pass algorithm removed a1, added b2, then removed a2 and a3,
which caused the wrong memory map to be built.  The two pass algorithm
removes a1, a2, and a3, then adds b1.

Signed-off-by: Avi Kivity 
---
 memory.c |   38 +-
 1 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/memory.c b/memory.c
index 686bbf2..7a5670e 100644
--- a/memory.c
+++ b/memory.c
@@ -549,10 +549,11 @@ static void address_space_update_ioeventfds(AddressSpace 
*as)
 as->ioeventfd_nb = ioeventfd_nb;
 }
 
-static void address_space_update_topology(AddressSpace *as)
+static void address_space_update_topology_pass(AddressSpace *as,
+   FlatView old_view,
+   FlatView new_view,
+   bool adding)
 {
-FlatView old_view = as->current_map;
-FlatView new_view = generate_memory_topology(as->root);
 unsigned iold, inew;
 FlatRange *frold, *frnew;
 
@@ -579,15 +580,20 @@ static void address_space_update_topology(AddressSpace 
*as)
 && !flatrange_equal(frold, frnew {
 /* In old, but (not in new, or in new but attributes changed). */
 
-as->ops->range_del(as, frold);
+if (!adding) {
+as->ops->range_del(as, frold);
+}
+
 ++iold;
 } else if (frold && frnew && flatrange_equal(frold, frnew)) {
 /* In both (logging may have changed) */
 
-if (frold->dirty_log_mask && !frnew->dirty_log_mask) {
-as->ops->log_stop(as, frnew);
-} else if (frnew->dirty_log_mask && !frold->dirty_log_mask) {
-as->ops->log_start(as, frnew);
+if (adding) {
+if (frold->dirty_log_mask && !frnew->dirty_log_mask) {
+as->ops->log_stop(as, frnew);
+} else if (frnew->dirty_log_mask && !frold->dirty_log_mask) {
+as->ops->log_start(as, frnew);
+}
 }
 
 ++iold;
@@ -595,10 +601,24 @@ static void address_space_update_topology(AddressSpace 
*as)
 } else {
 /* In new */
 
-as->ops->range_add(as, frnew);
+if (adding) {
+as->ops->range_add(as, frnew);
+}
+
 ++inew;
 }
 }
+}
+
+
+static void address_space_update_topology(AddressSpace *as)
+{
+FlatView old_view = as->current_map;
+FlatView new_view = generate_memory_topology(as->root);
+
+address_space_update_topology_pass(as, old_view, new_view, false);
+address_space_update_topology_pass(as, old_view, new_view, true);
+
 as->current_map = new_view;
 flatview_destroy(&old_view);
 address_space_update_ioeventfds(as);
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 23/23] usb-ohci: convert to MemoryRegion

2011-07-26 Thread Avi Kivity
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/usb-ohci.c |   42 +-
 1 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/hw/usb-ohci.c b/hw/usb-ohci.c
index 8491d59..337b250 100644
--- a/hw/usb-ohci.c
+++ b/hw/usb-ohci.c
@@ -62,7 +62,7 @@ typedef struct OHCIPort {
 typedef struct {
 USBBus bus;
 qemu_irq irq;
-int mem;
+MemoryRegion mem;
 int num_ports;
 const char *name;
 
@@ -1440,13 +1440,13 @@ static void ohci_port_set_status(OHCIState *ohci, int 
portnum, uint32_t val)
 return;
 }
 
-static uint32_t ohci_mem_read(void *ptr, target_phys_addr_t addr)
+static uint64_t ohci_mem_read(void *opaque,
+  target_phys_addr_t addr,
+  unsigned size)
 {
-OHCIState *ohci = ptr;
+OHCIState *ohci = opaque;
 uint32_t retval;
 
-addr &= 0xff;
-
 /* Only aligned reads are allowed on OHCI */
 if (addr & 3) {
 fprintf(stderr, "usb-ohci: Mis-aligned read\n");
@@ -1563,11 +1563,12 @@ static uint32_t ohci_mem_read(void *ptr, 
target_phys_addr_t addr)
 return retval;
 }
 
-static void ohci_mem_write(void *ptr, target_phys_addr_t addr, uint32_t val)
+static void ohci_mem_write(void *opaque,
+   target_phys_addr_t addr,
+   uint64_t val,
+   unsigned size)
 {
-OHCIState *ohci = ptr;
-
-addr &= 0xff;
+OHCIState *ohci = opaque;
 
 /* Only aligned reads are allowed on OHCI */
 if (addr & 3) {
@@ -1697,18 +1698,10 @@ static void ohci_async_cancel_device(OHCIState *ohci, 
USBDevice *dev)
 }
 }
 
-/* Only dword reads are defined on OHCI register space */
-static CPUReadMemoryFunc * const ohci_readfn[3]={
-ohci_mem_read,
-ohci_mem_read,
-ohci_mem_read
-};
-
-/* Only dword writes are defined on OHCI register space */
-static CPUWriteMemoryFunc * const ohci_writefn[3]={
-ohci_mem_write,
-ohci_mem_write,
-ohci_mem_write
+static const MemoryRegionOps ohci_mem_ops = {
+.read = ohci_mem_read,
+.write = ohci_mem_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
 };
 
 static USBPortOps ohci_port_ops = {
@@ -1764,8 +1757,7 @@ static int usb_ohci_init(OHCIState *ohci, DeviceState 
*dev,
 }
 }
 
-ohci->mem = cpu_register_io_memory(ohci_readfn, ohci_writefn, ohci,
-   DEVICE_LITTLE_ENDIAN);
+memory_region_init_io(&ohci->mem, &ohci_mem_ops, ohci, "ohci", 256);
 ohci->localmem_base = localmem_base;
 
 ohci->name = dev->info->name;
@@ -1799,7 +1791,7 @@ static int usb_ohci_initfn_pci(struct PCIDevice *dev)
 ohci->state.irq = ohci->pci_dev.irq[0];
 
 /* TODO: avoid cast below by using dev */
-pci_register_bar_simple(&ohci->pci_dev, 0, 256, 0, ohci->state.mem);
+pci_register_bar_region(&ohci->pci_dev, 0, 0, &ohci->state.mem);
 return 0;
 }
 
@@ -1822,7 +1814,7 @@ static int ohci_init_pxa(SysBusDevice *dev)
 /* Cannot fail as we pass NULL for masterbus */
 usb_ohci_init(&s->ohci, &dev->qdev, s->num_ports, s->dma_offset, NULL, 0);
 sysbus_init_irq(dev, &s->ohci.irq);
-sysbus_init_mmio(dev, 0x1000, s->ohci.mem);
+sysbus_init_mmio_region(dev, &s->ohci.mem);
 
 return 0;
 }
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 18/23] pc: convert pc_memory_init() to memory API

2011-07-26 Thread Avi Kivity
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/pc.c |   59 ---
 hw/pc.h |1 +
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 369566a..1c9d89a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -41,6 +41,7 @@
 #include "sysemu.h"
 #include "blockdev.h"
 #include "ui/qemu-spice.h"
+#include "memory.h"
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -966,22 +967,30 @@ void pc_memory_init(MemoryRegion *system_memory,
 {
 char *filename;
 int ret, linux_boot, i;
-ram_addr_t ram_addr, bios_offset, option_rom_offset;
+MemoryRegion *ram, *bios, *isa_bios, *option_rom_mr;
+MemoryRegion *ram_below_4g, *ram_above_4g;
 int bios_size, isa_bios_size;
 void *fw_cfg;
 
 linux_boot = (kernel_filename != NULL);
 
-/* allocate RAM */
-ram_addr = qemu_ram_alloc(NULL, "pc.ram",
-  below_4g_mem_size + above_4g_mem_size);
-cpu_register_physical_memory(0, 0xa, ram_addr);
-cpu_register_physical_memory(0x10,
- below_4g_mem_size - 0x10,
- ram_addr + 0x10);
+/* Allocate RAM.  We allocate it as a single memory region and use
+ * aliases to address portions of it, mostly for backwards compatiblity
+ * with older qemus that used qemu_ram_alloc().
+ */
+ram = qemu_malloc(sizeof(*ram));
+memory_region_init_ram(ram, NULL, "pc.ram",
+   below_4g_mem_size + above_4g_mem_size);
+ram_below_4g = qemu_malloc(sizeof(*ram_below_4g));
+memory_region_init_alias(ram_below_4g, "ram-below-4g", ram,
+ 0, below_4g_mem_size);
+memory_region_add_subregion(system_memory, 0, ram_below_4g);
 if (above_4g_mem_size > 0) {
-cpu_register_physical_memory(0x1ULL, above_4g_mem_size,
- ram_addr + below_4g_mem_size);
+ram_above_4g = qemu_malloc(sizeof(*ram_above_4g));
+memory_region_init_alias(ram_above_4g, "ram-above-4g", ram,
+ below_4g_mem_size, above_4g_mem_size);
+memory_region_add_subregion(system_memory, 0x1ULL,
+ram_above_4g);
 }
 
 /* BIOS load */
@@ -997,7 +1006,9 @@ void pc_memory_init(MemoryRegion *system_memory,
 (bios_size % 65536) != 0) {
 goto bios_error;
 }
-bios_offset = qemu_ram_alloc(NULL, "pc.bios", bios_size);
+bios = qemu_malloc(sizeof(*bios));
+memory_region_init_ram(bios, NULL, "pc.bios", bios_size);
+memory_region_set_readonly(bios, true);
 ret = rom_add_file_fixed(bios_name, (uint32_t)(-bios_size), -1);
 if (ret != 0) {
 bios_error:
@@ -1011,16 +1022,26 @@ void pc_memory_init(MemoryRegion *system_memory,
 isa_bios_size = bios_size;
 if (isa_bios_size > (128 * 1024))
 isa_bios_size = 128 * 1024;
-cpu_register_physical_memory(0x10 - isa_bios_size,
- isa_bios_size,
- (bios_offset + bios_size - isa_bios_size) | 
IO_MEM_ROM);
-
-option_rom_offset = qemu_ram_alloc(NULL, "pc.rom", PC_ROM_SIZE);
-cpu_register_physical_memory(PC_ROM_MIN_VGA, PC_ROM_SIZE, 
option_rom_offset);
+isa_bios = qemu_malloc(sizeof(*isa_bios));
+memory_region_init_alias(isa_bios, "isa-bios", bios,
+ bios_size - isa_bios_size, isa_bios_size);
+memory_region_add_subregion_overlap(system_memory,
+0x10 - isa_bios_size,
+isa_bios,
+1);
+memory_region_set_readonly(isa_bios, true);
+
+option_rom_mr = qemu_malloc(sizeof(*option_rom_mr));
+memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE);
+memory_region_add_subregion_overlap(system_memory,
+PC_ROM_MIN_VGA,
+option_rom_mr,
+1);
 
 /* map all the bios at the top of memory */
-cpu_register_physical_memory((uint32_t)(-bios_size),
- bios_size, bios_offset | IO_MEM_ROM);
+memory_region_add_subregion(system_memory,
+(uint32_t)(-bios_size),
+bios);
 
 fw_cfg = bochs_bios_init();
 rom_set_fw(fw_cfg);
diff --git a/hw/pc.h b/hw/pc.h
index fa57583..40684f4 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -2,6 +2,7 @@
 #define HW_PC_H
 
 #include "qemu-common.h"
+#include "memory.h"
 #include "ioport.h"
 #include "isa.h"
 #include "fdc.h"
-- 
1.7.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 21/23] pci: add MemoryRegion based BAR management API

2011-07-26 Thread Avi Kivity
Allow registering a BAR using a MemoryRegion.  Once all users are converted,
pci_register_bar() and pci_register_bar_simple() will be removed.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/pci.c |   47 +++
 hw/pci.h |3 +++
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index cf16f3b..36db58b 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -844,10 +844,15 @@ static void pci_unregister_io_regions(PCIDevice *pci_dev)
 if (r->type == PCI_BASE_ADDRESS_SPACE_IO) {
 isa_unassign_ioport(r->addr, r->filtered_size);
 } else {
-cpu_register_physical_memory(pci_to_cpu_addr(pci_dev->bus,
- r->addr),
- r->filtered_size,
- IO_MEM_UNASSIGNED);
+if (r->memory) {
+memory_region_del_subregion(pci_dev->bus->address_space,
+r->memory);
+} else {
+cpu_register_physical_memory(pci_to_cpu_addr(pci_dev->bus,
+ r->addr),
+ r->filtered_size,
+ IO_MEM_UNASSIGNED);
+}
 }
 }
 }
@@ -893,6 +898,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 r->type = type;
 r->map_func = map_func;
 r->ram_addr = IO_MEM_UNASSIGNED;
+r->memory = NULL;
 
 wmask = ~(size - 1);
 addr = pci_bar(pci_dev, region_num);
@@ -918,6 +924,16 @@ static void pci_simple_bar_mapfunc(PCIDevice *pci_dev, int 
region_num,
  pci_dev->io_regions[region_num].ram_addr);
 }
 
+static void pci_simple_bar_mapfunc_region(PCIDevice *pci_dev, int region_num,
+  pcibus_t addr, pcibus_t size,
+  int type)
+{
+memory_region_add_subregion_overlap(pci_dev->bus->address_space,
+addr,
+pci_dev->io_regions[region_num].memory,
+1);
+}
+
 void pci_register_bar_simple(PCIDevice *pci_dev, int region_num,
  pcibus_t size,  uint8_t attr, ram_addr_t ram_addr)
 {
@@ -927,6 +943,15 @@ void pci_register_bar_simple(PCIDevice *pci_dev, int 
region_num,
 pci_dev->io_regions[region_num].ram_addr = ram_addr;
 }
 
+void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
+ uint8_t attr, MemoryRegion *memory)
+{
+pci_register_bar(pci_dev, region_num, memory_region_size(memory),
+ PCI_BASE_ADDRESS_SPACE_MEMORY | attr,
+ pci_simple_bar_mapfunc_region);
+pci_dev->io_regions[region_num].memory = memory;
+}
+
 static void pci_bridge_filter(PCIDevice *d, pcibus_t *addr, pcibus_t *size,
   uint8_t type)
 {
@@ -1065,10 +1090,16 @@ static void pci_update_mappings(PCIDevice *d)
 isa_unassign_ioport(r->addr, r->filtered_size);
 }
 } else {
-cpu_register_physical_memory(pci_to_cpu_addr(d->bus, r->addr),
- r->filtered_size,
- IO_MEM_UNASSIGNED);
-qemu_unregister_coalesced_mmio(r->addr, r->filtered_size);
+if (r->memory) {
+memory_region_del_subregion(d->bus->address_space,
+r->memory);
+} else {
+cpu_register_physical_memory(pci_to_cpu_addr(d->bus,
+ r->addr),
+ r->filtered_size,
+ IO_MEM_UNASSIGNED);
+qemu_unregister_coalesced_mmio(r->addr, r->filtered_size);
+}
 }
 }
 r->addr = new_addr;
diff --git a/hw/pci.h b/hw/pci.h
index cfeb042..c51156d 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -94,6 +94,7 @@ typedef struct PCIIORegion {
 uint8_t type;
 PCIMapIORegionFunc *map_func;
 ram_addr_t ram_addr;
+MemoryRegion *memory;
 } PCIIORegion;
 
 #define PCI_ROM_SLOT 6
@@ -204,6 +205,8 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 PCIMapIORegionFunc *map_func);
 void pci_register_bar_simple(PCIDevice *pci_dev, int region_num,
  pcibus_t size, uint8_t attr, ram_addr_t ram_addr);
+void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
+ uint8_t attr, MemoryRegion *memory);
 
 int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
uint8_t

[PATCH v2 20/23] pci: pass address space to pci bus when created

2011-07-26 Thread Avi Kivity
This is now done sloppily, via get_system_memory().  Eventually callers
will be converted to stop using that.

Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/apb_pci.c   |2 ++
 hw/bonito.c|4 +++-
 hw/grackle_pci.c   |5 +++--
 hw/gt64xxx.c   |4 +++-
 hw/pc.h|4 +++-
 hw/pc_piix.c   |3 ++-
 hw/pci.c   |   16 +++-
 hw/pci.h   |   12 +---
 hw/pci_host.h  |1 +
 hw/pci_internals.h |1 +
 hw/piix_pci.c  |   13 +
 hw/ppc4xx_pci.c|5 -
 hw/ppc_mac.h   |9 ++---
 hw/ppc_newworld.c  |5 +++--
 hw/ppc_oldworld.c  |3 ++-
 hw/ppc_prep.c  |3 ++-
 hw/ppce500_pci.c   |6 +-
 hw/prep_pci.c  |5 +++--
 hw/prep_pci.h  |3 ++-
 hw/sh_pci.c|4 +++-
 hw/unin_pci.c  |   10 ++
 hw/versatile_pci.c |2 ++
 22 files changed, 85 insertions(+), 35 deletions(-)

diff --git a/hw/apb_pci.c b/hw/apb_pci.c
index 974c87a..8b9939c 100644
--- a/hw/apb_pci.c
+++ b/hw/apb_pci.c
@@ -34,6 +34,7 @@
 #include "rwhandler.h"
 #include "apb_pci.h"
 #include "sysemu.h"
+#include "exec-memory.h"
 
 /* debug APB */
 //#define DEBUG_APB
@@ -346,6 +347,7 @@ PCIBus *pci_apb_init(target_phys_addr_t special_base,
 
 d->bus = pci_register_bus(&d->busdev.qdev, "pci",
  pci_apb_set_irq, pci_pbm_map_irq, d,
+ get_system_memory(),
  0, 32);
 pci_bus_set_mem_base(d->bus, mem_base);
 
diff --git a/hw/bonito.c b/hw/bonito.c
index e8c57a3..5f62dda 100644
--- a/hw/bonito.c
+++ b/hw/bonito.c
@@ -42,6 +42,7 @@
 #include "mips.h"
 #include "pci_host.h"
 #include "sysemu.h"
+#include "exec-memory.h"
 
 //#define DEBUG_BONITO
 
@@ -773,7 +774,8 @@ PCIBus *bonito_init(qemu_irq *pic)
 dev = qdev_create(NULL, "Bonito-pcihost");
 pcihost = FROM_SYSBUS(BonitoState, sysbus_from_qdev(dev));
 b = pci_register_bus(&pcihost->busdev.qdev, "pci", pci_bonito_set_irq,
- pci_bonito_map_irq, pic, 0x28, 32);
+ pci_bonito_map_irq, pic, get_system_memory(),
+ 0x28, 32);
 pcihost->bus = b;
 qdev_init_nofail(dev);
 
diff --git a/hw/grackle_pci.c b/hw/grackle_pci.c
index cee07e0..da67cf9 100644
--- a/hw/grackle_pci.c
+++ b/hw/grackle_pci.c
@@ -61,7 +61,8 @@ static void pci_grackle_reset(void *opaque)
 {
 }
 
-PCIBus *pci_grackle_init(uint32_t base, qemu_irq *pic)
+PCIBus *pci_grackle_init(uint32_t base, qemu_irq *pic,
+ MemoryRegion *address_space)
 {
 DeviceState *dev;
 SysBusDevice *s;
@@ -74,7 +75,7 @@ PCIBus *pci_grackle_init(uint32_t base, qemu_irq *pic)
 d->host_state.bus = pci_register_bus(&d->busdev.qdev, "pci",
  pci_grackle_set_irq,
  pci_grackle_map_irq,
- pic, 0, 4);
+ pic, address_space, 0, 4);
 
 pci_create_simple(d->host_state.bus, 0, "grackle");
 
diff --git a/hw/gt64xxx.c b/hw/gt64xxx.c
index 8e1f6a0..65e63dd 100644
--- a/hw/gt64xxx.c
+++ b/hw/gt64xxx.c
@@ -27,6 +27,7 @@
 #include "pci.h"
 #include "pci_host.h"
 #include "pc.h"
+#include "exec-memory.h"
 
 //#define DEBUG
 
@@ -1092,7 +1093,8 @@ PCIBus *gt64120_register(qemu_irq *pic)
 d = FROM_SYSBUS(GT64120State, s);
 d->pci.bus = pci_register_bus(&d->busdev.qdev, "pci",
   gt64120_pci_set_irq, gt64120_pci_map_irq,
-  pic, PCI_DEVFN(18, 0), 4);
+  pic, get_system_memory(),
+  PCI_DEVFN(18, 0), 4);
 d->ISD_handle = cpu_register_io_memory(gt64120_read, gt64120_write, d,
DEVICE_NATIVE_ENDIAN);
 
diff --git a/hw/pc.h b/hw/pc.h
index 40684f4..a2de0fe 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -178,7 +178,9 @@ int pcspk_audio_init(qemu_irq *pic);
 struct PCII440FXState;
 typedef struct PCII440FXState PCII440FXState;
 
-PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int *piix_devfn, qemu_irq 
*pic, ram_addr_t ram_size);
+PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int *piix_devfn,
+qemu_irq *pic, MemoryRegion *address_space,
+ram_addr_t ram_size);
 void i440fx_init_memory_mappings(PCII440FXState *d);
 
 /* piix4.c */
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index f2d0476..2b9c2b1 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -128,7 +128,8 @@ static void pc_init1(MemoryRegion *system_memory,
 isa_irq = qemu_allocate_irqs(isa_irq_handler, isa_irq_state, 24);
 
 if (pci_enabled) {
-pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, isa_irq, ram_size);
+pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, isa_irq,
+  system_memory, 

[PATCH 05/11] KVM: MMU: cleanup FNAME(invlpg)

2011-07-26 Thread Xiao Guangrong
Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
same code between FNAME(invlpg) and FNAME(sync_page)

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |   16 ++--
 arch/x86/kvm/paging_tmpl.h |   42 +++---
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 992a8a5..52ef6d7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1801,7 +1801,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
}
 }
 
-static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 u64 *spte)
 {
u64 pte;
@@ -1809,17 +1809,21 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct 
kvm_mmu_page *sp,
 
pte = *spte;
if (is_shadow_present_pte(pte)) {
-   if (is_last_spte(pte, sp->role.level))
+   if (is_last_spte(pte, sp->role.level)) {
drop_spte(kvm, spte);
-   else {
+   if (is_large_pte(pte))
+   --kvm->stat.lpages;
+   } else {
child = page_header(pte & PT64_BASE_ADDR_MASK);
drop_parent_pte(child, spte);
}
-   } else if (is_mmio_spte(pte))
+   return true;
+   }
+
+   if (is_mmio_spte(pte))
mmu_spte_clear_no_track(spte);
 
-   if (is_large_pte(pte))
-   --kvm->stat.lpages;
+   return false;
 }
 
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 02daac5..64210a0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -652,6 +652,16 @@ out_unlock:
return 0;
 }
 
+static gpa_t FNAME(get_first_pte_gpa)(struct kvm_mmu_page *sp)
+{
+   int offset = 0;
+
+   if (PTTYPE == 32)
+   offset = sp->role.quadrant << PT64_LEVEL_BITS;
+
+   return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+}
+
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
struct kvm_shadow_walk_iterator iterator;
@@ -659,7 +669,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t pte_gpa = -1;
int level;
u64 *sptep;
-   int need_flush = 0;
 
vcpu_clear_mmio_info(vcpu, gva);
 
@@ -671,36 +680,20 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t 
gva)
 
sp = page_header(__pa(sptep));
if (is_last_spte(*sptep, level)) {
-   int offset, shift;
-
if (!sp->unsync)
break;
 
-   shift = PAGE_SHIFT -
- (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;
-   offset = sp->role.quadrant << shift;
-
-   pte_gpa = (sp->gfn << PAGE_SHIFT) + offset;
+   pte_gpa = FNAME(get_first_pte_gpa)(sp);
pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
-   if (is_shadow_present_pte(*sptep)) {
-   if (is_large_pte(*sptep))
-   --vcpu->kvm->stat.lpages;
-   drop_spte(vcpu->kvm, sptep);
-   need_flush = 1;
-   } else if (is_mmio_spte(*sptep))
-   mmu_spte_clear_no_track(sptep);
-
-   break;
+   if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
+   kvm_flush_remote_tlbs(vcpu->kvm);
}
 
if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
break;
}
 
-   if (need_flush)
-   kvm_flush_remote_tlbs(vcpu->kvm);
-
atomic_inc(&vcpu->kvm->arch.invlpg_counter);
 
spin_unlock(&vcpu->kvm->mmu_lock);
@@ -766,19 +759,14 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu 
*vcpu, gva_t vaddr,
  */
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
-   int i, offset, nr_present;
+   int i, nr_present = 0;
bool host_writable;
gpa_t first_pte_gpa;
 
-   offset = nr_present = 0;
-
/* direct kvm_mmu_page can not be unsync. */
BUG_ON(sp->role.direct);
 
-   if (PTTYPE == 32)
-   offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-   first_pte_gpa = gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+   first_pte_gpa = FNAME(get_first_pte_gpa)(sp);
 
for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
unsigned pte_access;
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org

[PATCH 06/11] KVM: MMU: fast prefetch spte on invlpg path

2011-07-26 Thread Xiao Guangrong
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |4 +---
 arch/x86/kvm/mmu.c  |   38 +++---
 arch/x86/kvm/paging_tmpl.h  |   23 ++-
 arch/x86/kvm/x86.c  |4 ++--
 4 files changed, 28 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ad88bfb..ce17642 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -456,7 +456,6 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
-   atomic_t invlpg_counter;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
@@ -750,8 +749,7 @@ int fx_init(struct kvm_vcpu *vcpu);
 
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes,
-  bool guest_initiated, bool repeat_write);
+  const u8 *new, int bytes, bool repeat_write);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 52ef6d7..dfe8e6b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3525,8 +3525,7 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu 
*vcpu)
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes,
-  bool guest_initiated, bool repeat_write)
+  const u8 *new, int bytes, bool repeat_write)
 {
gfn_t gfn = gpa >> PAGE_SHIFT;
union kvm_mmu_page_role mask = { .word = 0 };
@@ -3535,7 +3534,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
LIST_HEAD(invalid_list);
u64 entry, gentry, *spte;
unsigned pte_size, page_offset, misaligned, quadrant, offset;
-   int level, npte, invlpg_counter, r, flooded = 0;
+   int level, npte, r, flooded = 0;
bool remote_flush, local_flush, zap_page;
 
/*
@@ -3550,19 +3549,16 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 
-   invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
-
/*
 * Assume that the pte write on a page table of the same type
 * as the current vcpu paging mode since we update the sptes only
 * when they have the same mode.
 */
-   if ((is_pae(vcpu) && bytes == 4) || !new) {
+   if (is_pae(vcpu) && bytes == 4) {
/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
-   if (is_pae(vcpu)) {
-   gpa &= ~(gpa_t)7;
-   bytes = 8;
-   }
+   gpa &= ~(gpa_t)7;
+   bytes = 8;
+
r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
if (r)
gentry = 0;
@@ -3583,22 +3579,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
mmu_topup_memory_caches(vcpu);
spin_lock(&vcpu->kvm->mmu_lock);
-   if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
-   gentry = 0;
kvm_mmu_free_some_pages(vcpu);
++vcpu->kvm->stat.mmu_pte_write;
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
-   if (guest_initiated) {
-   if (gfn == vcpu->arch.last_pt_write_gfn
-   && !last_updated_pte_accessed(vcpu)) {
-   ++vcpu->arch.last_pt_write_count;
-   if (vcpu->arch.last_pt_write_count >= 3)
-   flooded = 1;
-   } else {
-   vcpu->arch.last_pt_write_gfn = gfn;
-   vcpu->arch.last_pt_write_count = 1;
-   vcpu->arch.last_pte_updated = NULL;
-   }
+   if (gfn == vcpu->arch.last_pt_write_gfn
+   && !last_updated_pte_accessed(vcpu)) {
+   ++vcpu->arch.last_pt_write_count;
+   if (vcpu->arch.last_pt_write_count >= 3)
+   flooded = 1;
+   } else {
+   vcpu->arch.last_pt_write_gfn = gfn;
+   vcpu->arch.last_pt_write_count = 1;
+   vcpu->arch.last_pte_updated = NULL;
}
 
mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 64210a0..3466229 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -666,20 +666,22 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t 
gva)
 {
struct kvm_shadow_walk_iterator iterator;
struct 

[PATCH 07/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages

2011-07-26 Thread Xiao Guangrong
In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling
kvm_mmu_free_some_pages is really unnecessary

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dfe8e6b..7629802 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3579,7 +3579,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
mmu_topup_memory_caches(vcpu);
spin_lock(&vcpu->kvm->mmu_lock);
-   kvm_mmu_free_some_pages(vcpu);
++vcpu->kvm->stat.mmu_pte_write;
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
if (gfn == vcpu->arch.last_pt_write_gfn
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/11] KVM: MMU: split kvm_mmu_pte_write function

2011-07-26 Thread Xiao Guangrong
kvm_mmu_pte_write is too long, we split it for better readable

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |  180 
 1 files changed, 111 insertions(+), 69 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7629802..931c23a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3524,48 +3524,29 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu 
*vcpu)
return mmu_memory_cache_free_object(cache);
 }
 
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes, bool repeat_write)
+static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
+   const u8 *new, int *bytes)
 {
-   gfn_t gfn = gpa >> PAGE_SHIFT;
-   union kvm_mmu_page_role mask = { .word = 0 };
-   struct kvm_mmu_page *sp;
-   struct hlist_node *node;
-   LIST_HEAD(invalid_list);
-   u64 entry, gentry, *spte;
-   unsigned pte_size, page_offset, misaligned, quadrant, offset;
-   int level, npte, r, flooded = 0;
-   bool remote_flush, local_flush, zap_page;
-
-   /*
-* If we don't have indirect shadow pages, it means no page is
-* write-protected, so we can exit simply.
-*/
-   if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
-   return;
-
-   zap_page = remote_flush = local_flush = false;
-   offset = offset_in_page(gpa);
-
-   pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
+   u64 gentry;
+   int r;
 
/*
 * Assume that the pte write on a page table of the same type
 * as the current vcpu paging mode since we update the sptes only
 * when they have the same mode.
 */
-   if (is_pae(vcpu) && bytes == 4) {
+   if (is_pae(vcpu) && *bytes == 4) {
/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
-   gpa &= ~(gpa_t)7;
-   bytes = 8;
+   *gpa &= ~(gpa_t)7;
+   *bytes = 8;
 
-   r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
+   r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, min(*bytes, 8));
if (r)
gentry = 0;
new = (const u8 *)&gentry;
}
 
-   switch (bytes) {
+   switch (*bytes) {
case 4:
gentry = *(const u32 *)new;
break;
@@ -3577,66 +3558,127 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t 
gpa,
break;
}
 
-   mmu_topup_memory_caches(vcpu);
-   spin_lock(&vcpu->kvm->mmu_lock);
-   ++vcpu->kvm->stat.mmu_pte_write;
-   trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
+   return gentry;
+}
+
+/*
+ * If we're seeing too many writes to a page, it may no longer be a page table,
+ * or we may be forking, in which case it is better to unmap the page.
+ */
+static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+   bool flooded = false;
+
if (gfn == vcpu->arch.last_pt_write_gfn
&& !last_updated_pte_accessed(vcpu)) {
++vcpu->arch.last_pt_write_count;
if (vcpu->arch.last_pt_write_count >= 3)
-   flooded = 1;
+   flooded = true;
} else {
vcpu->arch.last_pt_write_gfn = gfn;
vcpu->arch.last_pt_write_count = 1;
vcpu->arch.last_pte_updated = NULL;
}
 
+   return flooded;
+}
+
+/*
+ * Misaligned accesses are too much trouble to fix up; also, they usually
+ * indicate a page is not used as a page table.
+ */
+static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
+   int bytes)
+{
+   unsigned offset, pte_size, misaligned;
+
+   pgprintk("misaligned: gpa %llx bytes %d role %x\n",
+gpa, bytes, sp->role.word);
+
+   offset = offset_in_page(gpa);
+   pte_size = sp->role.cr4_pae ? 8 : 4;
+   misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
+   misaligned |= bytes < 4;
+
+   return misaligned;
+}
+
+static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
+{
+   unsigned page_offset, quadrant;
+   u64 *spte;
+   int level;
+
+   page_offset = offset_in_page(gpa);
+   level = sp->role.level;
+   *nspte = 1;
+   if (!sp->role.cr4_pae) {
+   page_offset <<= 1;  /* 32->64 */
+   /*
+* A 32-bit pde maps 4MB while the shadow pdes map
+* only 2MB.  So we need to double the offset again
+* and zap two pdes instead of one.
+*/
+   if (level == PT32_ROOT_LEVEL) {
+   page_offset &= ~7; /* kill rounding error */
+   page_offset <<= 1;
+   *nspte = 2;
+   }

[PATCH 09/11] KVM: MMU: remove the mismatch shadow page

2011-07-26 Thread Xiao Guangrong
If the shadow page has different cpu mode with current vcpu, we do better
remove them since the OS does not changes cpu mode frequently

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |   26 +++---
 1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 931c23a..2328ee6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3603,6 +3603,18 @@ static bool detect_write_misaligned(struct kvm_mmu_page 
*sp, gpa_t gpa,
return misaligned;
 }
 
+/*
+ * The OS hardly changes cpu mode after boot, we can zap the shadow page if
+ * it is mismatched with the current vcpu.
+ */
+static bool detect_mismatch_sp(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+   union kvm_mmu_page_role mask;
+
+   mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
+   return (sp->role.word ^ vcpu->arch.mmu.base_role.word) & mask.word;
+}
+
 static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
 {
unsigned page_offset, quadrant;
@@ -3638,13 +3650,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
   const u8 *new, int bytes, bool repeat_write)
 {
gfn_t gfn = gpa >> PAGE_SHIFT;
-   union kvm_mmu_page_role mask = { .word = 0 };
struct kvm_mmu_page *sp;
struct hlist_node *node;
LIST_HEAD(invalid_list);
u64 entry, gentry, *spte;
int npte;
-   bool remote_flush, local_flush, zap_page, flooded, misaligned;
+   bool remote_flush, local_flush, zap_page, flooded;
 
/*
 * If we don't have indirect shadow pages, it means no page is
@@ -3664,10 +3675,13 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
flooded = detect_write_flooding(vcpu, gfn);
-   mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
+   bool mismatch, misaligned;
+
misaligned = detect_write_misaligned(sp, gpa, bytes);
-   if (misaligned || flooded || repeat_write) {
+   mismatch = detect_mismatch_sp(vcpu, sp);
+
+   if (misaligned || mismatch || flooded || repeat_write) {
zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 &invalid_list);
++vcpu->kvm->stat.mmu_flooded;
@@ -3682,9 +3696,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
while (npte--) {
entry = *spte;
mmu_page_zap_pte(vcpu->kvm, sp, spte);
-   if (gentry &&
- !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
- & mask.word) && get_free_pte_list_desc_nr(vcpu))
+   if (gentry && get_free_pte_list_desc_nr(vcpu))
mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
if (!remote_flush && need_remote_flush(entry, *spte))
remote_flush = true;
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/11] KVM: MMU: fix detecting misaligned accessed

2011-07-26 Thread Xiao Guangrong
Sometimes, we only modify the last one byte of a pte to update status bit,
for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
instruction is used in this function, in this case, kvm_mmu_pte_write will
treat it as misaligned access, and the shadow page table is zapped

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2328ee6..bb55b15 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3597,6 +3597,14 @@ static bool detect_write_misaligned(struct kvm_mmu_page 
*sp, gpa_t gpa,
 
offset = offset_in_page(gpa);
pte_size = sp->role.cr4_pae ? 8 : 4;
+
+   /*
+* Sometimes, the OS only writes the last one bytes to update status
+* bits, for example, in linux, andb instruction is used in clear_bit().
+*/
+   if (sp->role.level == 1 && !(offset & (pte_size - 1)) && bytes == 1)
+   return false;
+
misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
misaligned |= bytes < 4;
 
-- 
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 00/23] Memory API, batch 1

2011-07-26 Thread Avi Kivity

On 07/25/2011 11:23 PM, Anthony Liguori wrote:


Very nice series.

Despite my lack of exciting over ioeventfd and coalescing, if you fix 
the few minor comments, this should be a no brainer to merge.


Thanks; v2 posted.

Note that this patchset has a logical dependency on my "[PATCH] 
CODING_STYLE: explicitly allow braceless 'else if'" patch.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Paolo Bonzini

On 07/26/2011 01:26 PM, Avi Kivity wrote:

+while (i < view->nr) {
+j = i + 1;
+while (j < view->nr
+   && can_merge(&view->ranges[j-1], &view->ranges[j])) {
+view->ranges[i].addr.size += view->ranges[j].addr.size;
+++j;
+}
+++i;


if (j != i) {


+memmove(&view->ranges[i], &view->ranges[j],
+(view->nr - j) * sizeof(view->ranges[j]));
+view->nr -= j - i;
+}


}


Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Avi Kivity

On 07/26/2011 02:36 PM, Paolo Bonzini wrote:

On 07/26/2011 01:26 PM, Avi Kivity wrote:

+while (i < view->nr) {
+j = i + 1;
+while (j < view->nr
+ && can_merge(&view->ranges[j-1], &view->ranges[j])) {
+view->ranges[i].addr.size += view->ranges[j].addr.size;
+++j;
+}
+++i;


if (j != i) {


+memmove(&view->ranges[i], &view->ranges[j],
+(view->nr - j) * sizeof(view->ranges[j]));
+view->nr -= j - i;
+}


}


Seems to work both ways?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 14/23] memory: transaction API

2011-07-26 Thread Avi Kivity

On 07/26/2011 01:48 PM, Avi Kivity wrote:

On 07/25/2011 10:16 PM, Anthony Liguori wrote:

On 07/25/2011 09:02 AM, Avi Kivity wrote:

Allow changes to the memory hierarchy to be accumulated and
made visible all at once.  This reduces computational effort,
especially when an accelerator (e.g. kvm) is involved.

Useful when a single register update causes multiple changes
to an address space.

Signed-off-by: Avi Kivity


What's the motivation for this?  Was this just because it seemed neat 
to do or did you run into a performance issue you were trying to solve?


Both cirrus and 440fx need this; look at 
i440fx_update_memory_mappings() for example, it blindly updates 
mappings even for PAMs which haven't changed.


These issues can be also solved by more care on the caller's side, or 
by making the API richer, but it's good to have a no-thought-required 
solution, particularly as it's so easy to do.




I should note that updating the memory map is relatively slow with kvm 
due to the need to wait for an rcu grace period; though recent kernels 
are faster.  So any saving here has a large effect.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] device-assignment pci: correct pci config size default for cap version 2 endpoints

2011-07-26 Thread Michael S. Tsirkin
On Mon, Jul 25, 2011 at 02:14:28PM -0600, Alex Williamson wrote:
> On Sun, 2011-07-24 at 11:36 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 21, 2011 at 11:02:53AM -0600, Alex Williamson wrote:
> > > This is crazy, why would we only test this for PCI_CAP_ID_EXP?  If the
> > > test is going to go in device-assignment, we need to wrap
> > > pci_add_capability and do it for all callers.  However, maybe we can get
> > > MST to take it in pci_add_capability() if we make the test more
> > > complete...  something like:
> > > 
> > > if ((pos < 256 && size > 256 - pos) ||
> > > (pci_config_size() > 256 && pos > 256 &&
> > >  size > pci_config_size() - pos)) {
> > >... badness
> > > 
> > > Thanks,
> > > 
> > > Alex
> > 
> > We expect device assignment to be able to corrupt
> > guest memory but not qemu memory. So we must validate
> 
> Gee, thanks.

Ugh, sorry, not device assignment per se.
I really meant an assigned device controlled by
a malicious guest.

> > whatever we get from the device, and I think this
> > validation belongs in device-assignment.c:
> > 
> > IOW I think input should be validated where it's
> > input, while we still know it's untrusted,
> > instead of relying on core to validate parameters.
> > 
> > Makes sense?
> 
> No, not really.  Why should the core "trust" anything?

We generally don't validate most function parameters.
We trust callers to a level because they can corrupt our memory anyway.

> This is a basic, simply sanity test, why push it out to the leaf
> driver when pushing it
> into the core provides the sanity test for everyone?  Is it beneath an
> emulated driver to pass such a test?  Thanks,
> 
> Alex

It's easy to add this to the core but please don't
rely on this: functions validating parameters is
a debugging aid. Validating untrusted input from a guest-controlled
device is a security feature.


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Paolo Bonzini

On 07/26/2011 01:38 PM, Avi Kivity wrote:


if (j != i) {


+memmove(&view->ranges[i], &view->ranges[j],
+(view->nr - j) * sizeof(view->ranges[j]));
+view->nr -= j - i;
+}


}


Seems to work both ways?


Sure, but you're pointlessly memmove-ing memory over itself.  I'm not 
sure how many segments a single memory region will usually have, but 
it's better to be safe.


Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Avi Kivity

On 07/26/2011 02:51 PM, Paolo Bonzini wrote:

On 07/26/2011 01:38 PM, Avi Kivity wrote:


if (j != i) {


+memmove(&view->ranges[i], &view->ranges[j],
+(view->nr - j) * sizeof(view->ranges[j]));
+view->nr -= j - i;
+}


}


Seems to work both ways?


Sure, but you're pointlessly memmove-ing memory over itself.  I'm not 
sure how many segments a single memory region will usually have, but 
it's better to be safe.


This will never be an issue in practice.  We expect to have few 
FlatRanges (O(100)), simplify() will be called very rarely (never during 
normal operations; a few dozen times during boot; a few during hotplug); 
everything is in L1 cache; and the cost will be dwarfed by any calls to 
kvm (if enabled).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions

2011-07-26 Thread Gleb Natapov
On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
> We usually use repeat string instructions to clear the page, for example,
By "we" do you mean Linux guest?

> we call memset to clear a page table, stosb is used in this function, and 
> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
> and walking shadow page cache for 1024 times, it is terrible
> 
> In fact, if it is the repeat string instructions emulated and it is not a
> IO/MMIO access, we can zap all the corresponding shadow pages and return to 
> the
> guest, then the mapping can became writable and directly write the page
> 
So this patch does two independent things as far as I can see. First it
stops reentering guest if rep instruction is done on memory and second
it drops shadow pages if access to shadowed page table is rep. Why not
separate those in different patches?  BTW not entering guest periodically
increases interrupt latencies. Why not zap shadow, make page writable
and reenter the guest instead of emulation, it should be much faster (do we
care to optimize for old cpus by complicating the code anyway?).

> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_emulate.h |1 +
>  arch/x86/include/asm/kvm_host.h|2 +-
>  arch/x86/kvm/emulate.c |5 +++--
>  arch/x86/kvm/mmu.c |4 ++--
>  arch/x86/kvm/paging_tmpl.h |3 ++-
>  arch/x86/kvm/x86.c |9 +++--
>  6 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_emulate.h 
> b/arch/x86/include/asm/kvm_emulate.h
> index 6040d11..7e5c4d5 100644
> --- a/arch/x86/include/asm/kvm_emulate.h
> +++ b/arch/x86/include/asm/kvm_emulate.h
> @@ -244,6 +244,7 @@ struct x86_emulate_ctxt {
>   bool guest_mode; /* guest running a nested guest */
>   bool perm_ok; /* do not check permissions if true */
>   bool only_vendor_specific_insn;
> + bool pio_mmio_emulate; /* it is the pio/mmio access if true */
>  
>   bool have_exception;
>   struct x86_exception exception;
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c00ec28..95f55a9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -752,7 +752,7 @@ int fx_init(struct kvm_vcpu *vcpu);
>  void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  const u8 *new, int bytes,
> -bool guest_initiated);
> +bool guest_initiated, bool repeat_write);
>  int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
>  void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
>  int kvm_mmu_load(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 6f08bc9..36fbac6 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4009,8 +4009,9 @@ writeback:
>* Re-enter guest when pio read ahead buffer is empty
>* or, if it is not used, after each 1024 iteration.
>*/
> - if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff) 
> &&
> - (r->end == 0 || r->end != r->pos)) {
> + if ((r->end != 0 || ctxt->regs[VCPU_REGS_RCX] & 0x3ff)
> +   && (r->end == 0 || r->end != r->pos) &&
> +   ctxt->pio_mmio_emulate) {
>   /*
>* Reset read cache. Usually happens before
>* decode, but since instruction is restarted
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 587695b..5193de8 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3541,7 +3541,7 @@ static int get_free_pte_list_desc_nr(struct kvm_vcpu 
> *vcpu)
>  
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  const u8 *new, int bytes,
> -bool guest_initiated)
> +bool guest_initiated, bool repeat_write)
>  {
>   gfn_t gfn = gpa >> PAGE_SHIFT;
>   union kvm_mmu_page_role mask = { .word = 0 };
> @@ -3622,7 +3622,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>   pte_size = sp->role.cr4_pae ? 8 : 4;
>   misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
>   misaligned |= bytes < 4;
> - if (misaligned || flooded) {
> + if (misaligned || flooded || repeat_write) {
>   /*
>* Misaligned accesses are too much trouble to fix
>* up; also, they usually indicate a page is not used
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 507e2b8..02daac5 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -710,7 +710,8 @@ static void F

[PATCH 11/11] KVM: MMU: improve write flooding detected

2011-07-26 Thread Xiao Guangrong
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough

Instead of detected page accessed, we can detect whether the spte is accessed
or not, if the spte is not accessed but it is written frequently, we treat is
not a page table or it not used for a long time

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |6 +---
 arch/x86/kvm/mmu.c  |   53 +--
 arch/x86/kvm/paging_tmpl.h  |9 +-
 3 files changed, 16 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ce17642..8c938db 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@ struct kvm_mmu_page {
int clear_spte_count;
 #endif
 
+   int write_flooding_count;
+
struct rcu_head rcu;
 };
 
@@ -352,10 +354,6 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-   gfn_t last_pt_write_gfn;
-   int   last_pt_write_count;
-   u64  *last_pte_updated;
-
struct fpu guest_fpu;
u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bb55b15..8c2885c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1688,6 +1688,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
} else if (sp->unsync)
kvm_mmu_mark_parents_unsync(sp);
 
+   sp->write_flooding_count = 0;
trace_kvm_mmu_get_page(sp, false);
return sp;
}
@@ -1840,15 +1841,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, 
u64 *parent_pte)
mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-   int i;
-   struct kvm_vcpu *vcpu;
-
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
u64 *parent_pte;
@@ -1908,7 +1900,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
struct kvm_mmu_page *sp,
}
 
sp->role.invalid = 1;
-   kvm_mmu_reset_last_pte_updated(kvm);
return ret;
 }
 
@@ -2350,8 +2341,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
}
kvm_release_pfn_clean(pfn);
-   if (speculative)
-   vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3509,13 +3498,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu 
*vcpu, bool zap_page,
kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-   u64 *spte = vcpu->arch.last_pte_updated;
-
-   return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static int get_free_pte_list_desc_nr(struct kvm_vcpu *vcpu)
 {
struct kvm_mmu_memory_cache *cache;
@@ -3565,22 +3547,14 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu 
*vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-   bool flooded = false;
-
-   if (gfn == vcpu->arch.last_pt_write_gfn
-   && !last_updated_pte_accessed(vcpu)) {
-   ++vcpu->arch.last_pt_write_count;
-   if (vcpu->arch.last_pt_write_count >= 3)
-   flooded = true;
-   } else {
-   vcpu->arch.last_pt_write_gfn = gfn;
-   vcpu->arch.last_pt_write_count = 1;
-   vcpu->arch.last_pte_updated = NULL;
-   }
+   if (spte && !(*spte & shadow_accessed_mask))
+   sp->write_flooding_count++;
+   else
+   sp->write_flooding_count = 0;
 
-   return flooded;
+   return sp->write_flooding_count >= 3;
 }
 
 /*
@@ -3663,7 +3637,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
LIST_HEAD(invalid_list);
u64 entry, gentry, *spte;
int npte;
-   bool remote_flush, local_flush, zap_page, flooded;
+   bool remote_flush, local_flush, zap_page;
 
/*
 * If we don't have indirect shadow pages, it means no page is
@@ -3682,21 +3656,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
++vcpu->kvm->stat.mmu_pte_write;
trace_kvm_m

Re: [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Kevin Wolf
Am 22.07.2011 16:51, schrieb Hannes Reinecke:
> Not every command is support for any device type. This patch adds
> a check for rejecting unsupported commands.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  hw/scsi-disk.c |  104 
> +++-
>  1 files changed, 103 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
> index ae2c157..8ad90c0 100644
> --- a/hw/scsi-disk.c
> +++ b/hw/scsi-disk.c
> @@ -361,13 +361,107 @@ static int scsi_get_sense(SCSIRequest *req, uint8_t 
> *outbuf, int len)
>  return scsi_build_sense(s->sense, outbuf, len, len > 14);
>  }
>  
> +#define GENERIC_CMD (uint32_t)0x
> +#define DISK_CMD (1u << TYPE_DISK)
> +#define TAPE_CMD (1u << TYPE_TAPE)
> +#define PRINTER_CMD (1u << TYPE_PRINTER)
> +#define PROCESSOR_CMD (1u << TYPE_PROCESSOR)
> +#define WORM_CMD (1u << TYPE_WORM)
> +#define ROM_CMD (1u << TYPE_ROM)
> +#define SCANNER_CMD (1u << TYPE_SCANNER)
> +#define MOD_CMD (1u << TYPE_MOD)
> +#define MEDIUM_CHANGER_CMD (1u << TYPE_MEDIUM_CHANGER)
> +#define ARRAY_CMD (1u << TYPE_STORAGE_ARRAY)
> +#define ENCLOSURE_CMD (1u << TYPE_ENCLOSURE)
> +#define RBC_CMD (1u << TYPE_RBC)
> +#define OSD_CMD (1u << TYPE_OSD)
> +
> +#define NO_ROM_CMD (GENERIC_CMD | ~ROM_CMD)
> +
> +uint32_t scsi_cmd_table[0x100] = {
> +[TEST_UNIT_READY]   = GENERIC_CMD,
> +[REWIND]= TAPE_CMD,
> +[REQUEST_SENSE] = GENERIC_CMD,
> +[FORMAT_UNIT]   = DISK_CMD|ROM_CMD,
> +[READ_BLOCK_LIMITS] = TAPE_CMD,
> +[REASSIGN_BLOCKS]   = DISK_CMD|WORM_CMD|MOD_CMD,
> +[READ_6]= DISK_CMD|TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[WRITE_6]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD,
> +[READ_REVERSE]  = TAPE_CMD,
> +[WRITE_FILEMARKS]   = TAPE_CMD,
> +[SPACE] = TAPE_CMD,
> +[INQUIRY]   = GENERIC_CMD,
> +[MODE_SELECT]   = GENERIC_CMD,
> +[RESERVE]   = TAPE_CMD|PRINTER_CMD,
> +[RELEASE]   = TAPE_CMD|PRINTER_CMD,
> +[ERASE] = TAPE_CMD,
> +[MODE_SENSE]= GENERIC_CMD,
> +[START_STOP]= GENERIC_CMD,
> +[RECEIVE_DIAGNOSTIC]= GENERIC_CMD,
> +[SEND_DIAGNOSTIC]   = GENERIC_CMD,
> +[ALLOW_MEDIUM_REMOVAL]  = GENERIC_CMD,
> +[READ_CAPACITY_10]  = DISK_CMD|WORM_CMD|MOD_CMD,
> +[READ_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[WRITE_10]  = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[SEEK_10]   = TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[WRITE_VERIFY_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[VERIFY_10] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[READ_POSITION] = TAPE_CMD,
> +[SYNCHRONIZE_CACHE] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD|RBC_CMD,
> +[WRITE_BUFFER]  = GENERIC_CMD,
> +[READ_BUFFER]   = GENERIC_CMD,
> +[READ_LONG_10]  = DISK_CMD|WORM_CMD|MOD_CMD,
> +[WRITE_LONG_10] = DISK_CMD|WORM_CMD|MOD_CMD,
> +[WRITE_SAME_10] = DISK_CMD,
> +[UNMAP] = DISK_CMD,
> +[READ_TOC]  = ROM_CMD,
> +[REPORT_DENSITY_SUPPORT]= TAPE_CMD,
> +[GET_CONFIGURATION] = ROM_CMD,
> +[LOG_SELECT]= GENERIC_CMD,
> +[LOG_SENSE] = GENERIC_CMD,
> +[MODE_SELECT_10]= GENERIC_CMD,
> +[RESERVE_10]= PRINTER_CMD,
> +[RELEASE_10]= PRINTER_CMD,
> +[MODE_SENSE_10] = GENERIC_CMD,
> +[PERSISTENT_RESERVE_IN] = GENERIC_CMD,
> +[PERSISTENT_RESERVE_OUT]= GENERIC_CMD,
> +[VARLENGTH_CDB] = OSD_CMD,
> +[WRITE_FILEMARKS_16]= TAPE_CMD,
> +[ATA_PASSTHROUGH]   = DISK_CMD|ROM_CMD|RBC_CMD,
> +[READ_16]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[WRITE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[WRITE_VERIFY_16]   = DISK_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[SYNCHRONIZE_CACHE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[LOCATE_16] = TAPE_CMD,
> +[WRITE_SAME_16] = DISK_CMD|TAPE_CMD,
> +[SERVICE_ACTION_IN] = GENERIC_CMD,
> +[REPORT_LUNS]   = NO_ROM_CMD,
> +[BLANK] = ROM_CMD,
> +[MAINTENANCE_IN]= NO_ROM_CMD,
> +[MAINTENANCE_OUT]   = NO_ROM_CMD,
> +[MOVE_MEDIUM]   = MEDIUM_CHANGER_CMD,
> +[LOAD_UNLOAD]   = ROM_CMD|MEDIUM_CHANGER_CMD,
> +[READ_12]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[WRITE_12]  = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[WRITE_VERIFY_12]   = DISK_CMD|WORM_CMD

Re: [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Hannes Reinecke

On 07/26/2011 03:07 PM, Kevin Wolf wrote:

Am 22.07.2011 16:51, schrieb Hannes Reinecke:

Not every command is support for any device type. This patch adds
a check for rejecting unsupported commands.

Signed-off-by: Hannes Reinecke
---
  hw/scsi-disk.c |  104 +++-
  1 files changed, 103 insertions(+), 1 deletions(-)

diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index ae2c157..8ad90c0 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -361,13 +361,107 @@ static int scsi_get_sense(SCSIRequest *req, uint8_t 
*outbuf, int len)
  return scsi_build_sense(s->sense, outbuf, len, len>  14);
  }

+#define GENERIC_CMD (uint32_t)0x
+#define DISK_CMD (1u<<  TYPE_DISK)
+#define TAPE_CMD (1u<<  TYPE_TAPE)
+#define PRINTER_CMD (1u<<  TYPE_PRINTER)
+#define PROCESSOR_CMD (1u<<  TYPE_PROCESSOR)
+#define WORM_CMD (1u<<  TYPE_WORM)
+#define ROM_CMD (1u<<  TYPE_ROM)
+#define SCANNER_CMD (1u<<  TYPE_SCANNER)
+#define MOD_CMD (1u<<  TYPE_MOD)
+#define MEDIUM_CHANGER_CMD (1u<<  TYPE_MEDIUM_CHANGER)
+#define ARRAY_CMD (1u<<  TYPE_STORAGE_ARRAY)
+#define ENCLOSURE_CMD (1u<<  TYPE_ENCLOSURE)
+#define RBC_CMD (1u<<  TYPE_RBC)
+#define OSD_CMD (1u<<  TYPE_OSD)
+
+#define NO_ROM_CMD (GENERIC_CMD | ~ROM_CMD)
+
+uint32_t scsi_cmd_table[0x100] = {
+[TEST_UNIT_READY]   = GENERIC_CMD,
+[REWIND]= TAPE_CMD,
+[REQUEST_SENSE] = GENERIC_CMD,
+[FORMAT_UNIT]   = DISK_CMD|ROM_CMD,
+[READ_BLOCK_LIMITS] = TAPE_CMD,
+[REASSIGN_BLOCKS]   = DISK_CMD|WORM_CMD|MOD_CMD,
+[READ_6]= DISK_CMD|TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[WRITE_6]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD,
+[READ_REVERSE]  = TAPE_CMD,
+[WRITE_FILEMARKS]   = TAPE_CMD,
+[SPACE] = TAPE_CMD,
+[INQUIRY]   = GENERIC_CMD,
+[MODE_SELECT]   = GENERIC_CMD,
+[RESERVE]   = TAPE_CMD|PRINTER_CMD,
+[RELEASE]   = TAPE_CMD|PRINTER_CMD,
+[ERASE] = TAPE_CMD,
+[MODE_SENSE]= GENERIC_CMD,
+[START_STOP]= GENERIC_CMD,
+[RECEIVE_DIAGNOSTIC]= GENERIC_CMD,
+[SEND_DIAGNOSTIC]   = GENERIC_CMD,
+[ALLOW_MEDIUM_REMOVAL]  = GENERIC_CMD,
+[READ_CAPACITY_10]  = DISK_CMD|WORM_CMD|MOD_CMD,
+[READ_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[WRITE_10]  = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[SEEK_10]   = TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[WRITE_VERIFY_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[VERIFY_10] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[READ_POSITION] = TAPE_CMD,
+[SYNCHRONIZE_CACHE] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD|RBC_CMD,
+[WRITE_BUFFER]  = GENERIC_CMD,
+[READ_BUFFER]   = GENERIC_CMD,
+[READ_LONG_10]  = DISK_CMD|WORM_CMD|MOD_CMD,
+[WRITE_LONG_10] = DISK_CMD|WORM_CMD|MOD_CMD,
+[WRITE_SAME_10] = DISK_CMD,
+[UNMAP] = DISK_CMD,
+[READ_TOC]  = ROM_CMD,
+[REPORT_DENSITY_SUPPORT]= TAPE_CMD,
+[GET_CONFIGURATION] = ROM_CMD,
+[LOG_SELECT]= GENERIC_CMD,
+[LOG_SENSE] = GENERIC_CMD,
+[MODE_SELECT_10]= GENERIC_CMD,
+[RESERVE_10]= PRINTER_CMD,
+[RELEASE_10]= PRINTER_CMD,
+[MODE_SENSE_10] = GENERIC_CMD,
+[PERSISTENT_RESERVE_IN] = GENERIC_CMD,
+[PERSISTENT_RESERVE_OUT]= GENERIC_CMD,
+[VARLENGTH_CDB] = OSD_CMD,
+[WRITE_FILEMARKS_16]= TAPE_CMD,
+[ATA_PASSTHROUGH]   = DISK_CMD|ROM_CMD|RBC_CMD,
+[READ_16]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[WRITE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[WRITE_VERIFY_16]   = DISK_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[SYNCHRONIZE_CACHE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[LOCATE_16] = TAPE_CMD,
+[WRITE_SAME_16] = DISK_CMD|TAPE_CMD,
+[SERVICE_ACTION_IN] = GENERIC_CMD,
+[REPORT_LUNS]   = NO_ROM_CMD,
+[BLANK] = ROM_CMD,
+[MAINTENANCE_IN]= NO_ROM_CMD,
+[MAINTENANCE_OUT]   = NO_ROM_CMD,
+[MOVE_MEDIUM]   = MEDIUM_CHANGER_CMD,
+[LOAD_UNLOAD]   = ROM_CMD|MEDIUM_CHANGER_CMD,
+[READ_12]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[WRITE_12]  = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[WRITE_VERIFY_12]   = DISK_CMD|WORM_CMD|MOD_CMD,
+[VERIFY_12] = DISK_CMD|WORM_CMD|MOD_CMD,
+[READ_ELEMENT_STATUS]   = WORM_CMD|MOD_CMD,
+[SET_CD_SPEED]

Re: [Qemu-devel] [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Christoph Hellwig
On Fri, Jul 22, 2011 at 04:51:17PM +0200, Hannes Reinecke wrote:
> Not every command is support for any device type. This patch adds
> a check for rejecting unsupported commands.
> 
> Signed-off-by: Hannes Reinecke 

This seems to conflic with Markus' series.  But if we want to invest
any major effort into it, we really need to different dispatch tables
for different device types.  There's two sane ways to do it:

one top-level handler with a switch per device type, or tables
with a handler pointer with a device type.  I'm fine with either one.

What I really don't get with this patch is the listing of all the different
SCSI device types.  It's already a mistake that we tried to handle disks
and CDROMs with the same driver, but adding even more just makes it worth.

IMHO we should simply have one file per SCSI spec, e.g. an spc.c for the
common bits, then an sbc.c for disks, and mmc.c for cdroms.  Maybe more
the day we add more emulated device types.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Kevin Wolf
Am 22.07.2011 16:51, schrieb Hannes Reinecke:
> Not every command is support for any device type. This patch adds
> a check for rejecting unsupported commands.
> 
> Signed-off-by: Hannes Reinecke 

We do emulate SEEK (6), but it's not in your scsi_cmd_table at all.

> ---
>  hw/scsi-disk.c |  104 
> +++-
>  1 files changed, 103 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
> index ae2c157..8ad90c0 100644
> --- a/hw/scsi-disk.c
> +++ b/hw/scsi-disk.c
> @@ -361,13 +361,107 @@ static int scsi_get_sense(SCSIRequest *req, uint8_t 
> *outbuf, int len)
>  return scsi_build_sense(s->sense, outbuf, len, len > 14);
>  }
>  
> +#define GENERIC_CMD (uint32_t)0x
> +#define DISK_CMD (1u << TYPE_DISK)
> +#define TAPE_CMD (1u << TYPE_TAPE)
> +#define PRINTER_CMD (1u << TYPE_PRINTER)
> +#define PROCESSOR_CMD (1u << TYPE_PROCESSOR)
> +#define WORM_CMD (1u << TYPE_WORM)
> +#define ROM_CMD (1u << TYPE_ROM)
> +#define SCANNER_CMD (1u << TYPE_SCANNER)
> +#define MOD_CMD (1u << TYPE_MOD)
> +#define MEDIUM_CHANGER_CMD (1u << TYPE_MEDIUM_CHANGER)
> +#define ARRAY_CMD (1u << TYPE_STORAGE_ARRAY)
> +#define ENCLOSURE_CMD (1u << TYPE_ENCLOSURE)
> +#define RBC_CMD (1u << TYPE_RBC)
> +#define OSD_CMD (1u << TYPE_OSD)
> +
> +#define NO_ROM_CMD (GENERIC_CMD | ~ROM_CMD)
> +
> +uint32_t scsi_cmd_table[0x100] = {
> +[TEST_UNIT_READY]   = GENERIC_CMD,
> +[REWIND]= TAPE_CMD,
> +[REQUEST_SENSE] = GENERIC_CMD,
> +[FORMAT_UNIT]   = DISK_CMD|ROM_CMD,
> +[READ_BLOCK_LIMITS] = TAPE_CMD,
> +[REASSIGN_BLOCKS]   = DISK_CMD|WORM_CMD|MOD_CMD,
> +[READ_6]= DISK_CMD|TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,

My spec says that MMC doesn't support READ(6)

> +[WRITE_6]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD,
> +[READ_REVERSE]  = TAPE_CMD,
> +[WRITE_FILEMARKS]   = TAPE_CMD,
> +[SPACE] = TAPE_CMD,
> +[INQUIRY]   = GENERIC_CMD,
> +[MODE_SELECT]   = GENERIC_CMD,
> +[RESERVE]   = TAPE_CMD|PRINTER_CMD,
> +[RELEASE]   = TAPE_CMD|PRINTER_CMD,

What's the reason for allowing this for tape, but not e.g. for disks?
It's marked as obsolete for both (and optional for quite a few other types)

> +[ERASE] = TAPE_CMD,
> +[MODE_SENSE]= GENERIC_CMD,
> +[START_STOP]= GENERIC_CMD,
> +[RECEIVE_DIAGNOSTIC]= GENERIC_CMD,
> +[SEND_DIAGNOSTIC]   = GENERIC_CMD,
> +[ALLOW_MEDIUM_REMOVAL]  = GENERIC_CMD,
> +[READ_CAPACITY_10]  = DISK_CMD|WORM_CMD|MOD_CMD,

ROM_CMD, too

> +[READ_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[WRITE_10]  = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[SEEK_10]   = TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,

Tape doesn't have SEEK(10) in the spec.

> +[WRITE_VERIFY_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[VERIFY_10] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
> +[READ_POSITION] = TAPE_CMD,
> +[SYNCHRONIZE_CACHE] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD|RBC_CMD,
> +[WRITE_BUFFER]  = GENERIC_CMD,
> +[READ_BUFFER]   = GENERIC_CMD,
> +[READ_LONG_10]  = DISK_CMD|WORM_CMD|MOD_CMD,
> +[WRITE_LONG_10] = DISK_CMD|WORM_CMD|MOD_CMD,
> +[WRITE_SAME_10] = DISK_CMD,
> +[UNMAP] = DISK_CMD,
> +[READ_TOC]  = ROM_CMD,
> +[REPORT_DENSITY_SUPPORT]= TAPE_CMD,
> +[GET_CONFIGURATION] = ROM_CMD,
> +[LOG_SELECT]= GENERIC_CMD,
> +[LOG_SENSE] = GENERIC_CMD,
> +[MODE_SELECT_10]= GENERIC_CMD,
> +[RESERVE_10]= PRINTER_CMD,
> +[RELEASE_10]= PRINTER_CMD,
> +[MODE_SENSE_10] = GENERIC_CMD,
> +[PERSISTENT_RESERVE_IN] = GENERIC_CMD,
> +[PERSISTENT_RESERVE_OUT]= GENERIC_CMD,
> +[VARLENGTH_CDB] = OSD_CMD,
> +[WRITE_FILEMARKS_16]= TAPE_CMD,
> +[ATA_PASSTHROUGH]   = DISK_CMD|ROM_CMD|RBC_CMD,
> +[READ_16]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[WRITE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[WRITE_VERIFY_16]   = DISK_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
> +[SYNCHRONIZE_CACHE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,

Tape doesn't have this.

> +[LOCATE_16] = TAPE_CMD,
> +[WRITE_SAME_16] = DISK_CMD|TAPE_CMD,

Again not for tape in my spec.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordom

Re: [PATCH 0/6][v2] Check for supported SCSI commands

2011-07-26 Thread Kevin Wolf
Am 22.07.2011 16:51, schrieb Hannes Reinecke:
> Markus Armbruster pointed out that not every SCSI command is supported
> for a given device type. Based on his patch and suggestiongs this series
> cleans up the SCSI device type and adds a check for supported commands.
> 
> Hannes Reinecke (6):
>   scsi-disk: Codingstyle fixes
>   scsi: Remove references to SET_WINDOW
>   scsi: Remove REZERO_UNIT emulation
>   scsi: Sanitize command definitions
>   scsi-disk: Remove 'drive_kind'
>   scsi-disk: Check for supported commands
> 
>  hw/scsi-bus.c |   74 -
>  hw/scsi-defs.h|   62 +++---
>  hw/scsi-disk.c|  185 
> -
>  hw/scsi-generic.c |2 +-
>  4 files changed, 221 insertions(+), 102 deletions(-)

Thanks, applied patches 1-5 to the block branch.

Please send v3 of patch 6 on top.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions

2011-07-26 Thread Avi Kivity

On 07/26/2011 03:27 PM, Gleb Natapov wrote:

On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
>  We usually use repeat string instructions to clear the page, for example,
By "we" do you mean Linux guest?

>  we call memset to clear a page table, stosb is used in this function, and
>  repeated for 1024 times, that means we should occupy mmu lock for 1024 times
>  and walking shadow page cache for 1024 times, it is terrible
>
>  In fact, if it is the repeat string instructions emulated and it is not a
>  IO/MMIO access, we can zap all the corresponding shadow pages and return to 
the
>  guest, then the mapping can became writable and directly write the page
>
So this patch does two independent things as far as I can see. First it
stops reentering guest if rep instruction is done on memory and second
it drops shadow pages if access to shadowed page table is rep. Why not
separate those in different patches?  BTW not entering guest periodically
increases interrupt latencies. Why not zap shadow, make page writable
and reenter the guest instead of emulation, it should be much faster (do we
care to optimize for old cpus by complicating the code anyway?).



The second thing is mentioned on the TODO list in a more general way: 
tag instructions that are typically used to modify the page tables, and 
drop shadow if any other instruction is used.  Since MOVS is typically 
not used to update pagetables, it would not be tagged.


The list would include, I'd guess, and, or, bts, btc, mov, xchg, 
cmpxchg, and cmpxchg8b.  Anything else?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtagent for qemu-kvm ?

2011-07-26 Thread Michael Roth

On 07/26/2011 03:24 AM, Stefan Hajnoczi wrote:

On Tue, Jul 26, 2011 at 6:58 AM, Prateek Sharma  wrote:

On Tue, 26 Jul 2011, Stefan Hajnoczi wrote:


On Tue, Jul 26, 2011 at 6:05 AM, Prateek Sharma
wrote:


Hi all,
Is there any equivalent of qemu's virtagent in qemu-kvm?
[http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg02149.html]  .
In particular , i want to share pages between KVM guests and the host .
Is
there an appropriate mechanism for this in existence which i could use ?
Another nice feature virtagent provides is the ability to see guest dmesg
output in the host...


virtagent doesn't share pages between guest and host AFAIK.

Have you looked at hw/ivshmem.c?

Most of the time you don't need to actually share pages between guest
and host.  Use existing mechanisms like networking or serial to
communicate instead.

Stefan



Thanks for the quick response!

I was tempted by virtagent because it provides dmesg output directly,
something which is exactly what i need.

Anyway, looks like i'll need to fallback to serial/networking, as you
suggested.

On a related note, is porting virtagent to qemu-kvm
possible/useful/in-the-pipeline ?


I expect qemu-ga will be merged into qemu-kvm.git when they merge from
qemu.git again soon.

Mike: I don't see a dmesg command in the guest schema in qemu.git.  Is
there a recommended way of getting dmesg from the host (maybe read
file /var/log/messages)?



Yup, the guest-file-read RPC to access /var/log/messages is the only 
means in the current version. There was a dmesg RPC in earlier 
prototypes but it was removed in favor of lower-level mechanisms.



Stefan


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Hannes Reinecke

On 07/26/2011 03:46 PM, Kevin Wolf wrote:

Am 22.07.2011 16:51, schrieb Hannes Reinecke:

Not every command is support for any device type. This patch adds
a check for rejecting unsupported commands.

Signed-off-by: Hannes Reinecke


We do emulate SEEK (6), but it's not in your scsi_cmd_table at all.


Hmm.

---
  hw/scsi-disk.c |  104 +++-
  1 files changed, 103 insertions(+), 1 deletions(-)

diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index ae2c157..8ad90c0 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -361,13 +361,107 @@ static int scsi_get_sense(SCSIRequest *req, uint8_t 
*outbuf, int len)
  return scsi_build_sense(s->sense, outbuf, len, len>  14);
  }

+#define GENERIC_CMD (uint32_t)0x
+#define DISK_CMD (1u<<  TYPE_DISK)
+#define TAPE_CMD (1u<<  TYPE_TAPE)
+#define PRINTER_CMD (1u<<  TYPE_PRINTER)
+#define PROCESSOR_CMD (1u<<  TYPE_PROCESSOR)
+#define WORM_CMD (1u<<  TYPE_WORM)
+#define ROM_CMD (1u<<  TYPE_ROM)
+#define SCANNER_CMD (1u<<  TYPE_SCANNER)
+#define MOD_CMD (1u<<  TYPE_MOD)
+#define MEDIUM_CHANGER_CMD (1u<<  TYPE_MEDIUM_CHANGER)
+#define ARRAY_CMD (1u<<  TYPE_STORAGE_ARRAY)
+#define ENCLOSURE_CMD (1u<<  TYPE_ENCLOSURE)
+#define RBC_CMD (1u<<  TYPE_RBC)
+#define OSD_CMD (1u<<  TYPE_OSD)
+
+#define NO_ROM_CMD (GENERIC_CMD | ~ROM_CMD)
+
+uint32_t scsi_cmd_table[0x100] = {
+[TEST_UNIT_READY]   = GENERIC_CMD,
+[REWIND]= TAPE_CMD,
+[REQUEST_SENSE] = GENERIC_CMD,
+[FORMAT_UNIT]   = DISK_CMD|ROM_CMD,
+[READ_BLOCK_LIMITS] = TAPE_CMD,
+[REASSIGN_BLOCKS]   = DISK_CMD|WORM_CMD|MOD_CMD,
+[READ_6]= DISK_CMD|TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,


My spec says that MMC doesn't support READ(6)


But it does support 'RECEIVE(6)', with the same opcode.


+[WRITE_6]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD,
+[READ_REVERSE]  = TAPE_CMD,
+[WRITE_FILEMARKS]   = TAPE_CMD,
+[SPACE] = TAPE_CMD,
+[INQUIRY]   = GENERIC_CMD,
+[MODE_SELECT]   = GENERIC_CMD,
+[RESERVE]   = TAPE_CMD|PRINTER_CMD,
+[RELEASE]   = TAPE_CMD|PRINTER_CMD,


What's the reason for allowing this for tape, but not e.g. for disks?
It's marked as obsolete for both (and optional for quite a few other types)

Because it's mandatory for TAPE and PRINTER. But the implementation 
details are horrible and we're not doing anything here (which in 
itself is a violation of the spec), so I think it's better to

not support it if we don't have to.


+[ERASE] = TAPE_CMD,
+[MODE_SENSE]= GENERIC_CMD,
+[START_STOP]= GENERIC_CMD,
+[RECEIVE_DIAGNOSTIC]= GENERIC_CMD,
+[SEND_DIAGNOSTIC]   = GENERIC_CMD,
+[ALLOW_MEDIUM_REMOVAL]  = GENERIC_CMD,
+[READ_CAPACITY_10]  = DISK_CMD|WORM_CMD|MOD_CMD,


ROM_CMD, too


Ok.


+[READ_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[WRITE_10]  = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[SEEK_10]   = TAPE_CMD|WORM_CMD|ROM_CMD|MOD_CMD,


Tape doesn't have SEEK(10) in the spec.


But it does have 'LOCATE_10', which shares the same opcode.


+[WRITE_VERIFY_10]   = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[VERIFY_10] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD,
+[READ_POSITION] = TAPE_CMD,
+[SYNCHRONIZE_CACHE] = DISK_CMD|WORM_CMD|ROM_CMD|MOD_CMD|RBC_CMD,
+[WRITE_BUFFER]  = GENERIC_CMD,
+[READ_BUFFER]   = GENERIC_CMD,
+[READ_LONG_10]  = DISK_CMD|WORM_CMD|MOD_CMD,
+[WRITE_LONG_10] = DISK_CMD|WORM_CMD|MOD_CMD,
+[WRITE_SAME_10] = DISK_CMD,
+[UNMAP] = DISK_CMD,
+[READ_TOC]  = ROM_CMD,
+[REPORT_DENSITY_SUPPORT]= TAPE_CMD,
+[GET_CONFIGURATION] = ROM_CMD,
+[LOG_SELECT]= GENERIC_CMD,
+[LOG_SENSE] = GENERIC_CMD,
+[MODE_SELECT_10]= GENERIC_CMD,
+[RESERVE_10]= PRINTER_CMD,
+[RELEASE_10]= PRINTER_CMD,
+[MODE_SENSE_10] = GENERIC_CMD,
+[PERSISTENT_RESERVE_IN] = GENERIC_CMD,
+[PERSISTENT_RESERVE_OUT]= GENERIC_CMD,
+[VARLENGTH_CDB] = OSD_CMD,
+[WRITE_FILEMARKS_16]= TAPE_CMD,
+[ATA_PASSTHROUGH]   = DISK_CMD|ROM_CMD|RBC_CMD,
+[READ_16]   = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[WRITE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[WRITE_VERIFY_16]   = DISK_CMD|WORM_CMD|MOD_CMD|RBC_CMD,
+[SYNCHRONIZE_CACHE_16]  = DISK_CMD|TAPE_CMD|WORM_CMD|MOD_CMD|RBC_CMD,


Tape doesn't have this.


It's called 'SPACE(16)' for tapes.


+[LOCATE_16] 

Re: [Qemu-devel] [PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Richard Henderson
On 07/26/2011 04:36 AM, Paolo Bonzini wrote:
> On 07/26/2011 01:26 PM, Avi Kivity wrote:
>> +while (i < view->nr) {
>> +j = i + 1;
>> +while (j < view->nr
>> +   && can_merge(&view->ranges[j-1], &view->ranges[j])) {
>> +view->ranges[i].addr.size += view->ranges[j].addr.size;
>> +++j;
>> +}
>> +++i;
> 
> if (j != i) {
> 
>> +memmove(&view->ranges[i], &view->ranges[j],
>> +(view->nr - j) * sizeof(view->ranges[j]));
>> +view->nr -= j - i;
>> +}
> 
> }

Err, j = i + 1, and increments.  Your conditional will always be true.


r~
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] device-assignment pci: correct pci config size default for cap version 2 endpoints

2011-07-26 Thread Alex Williamson
On Tue, 2011-07-26 at 14:42 +0300, Michael S. Tsirkin wrote:
> On Mon, Jul 25, 2011 at 02:14:28PM -0600, Alex Williamson wrote:
> > On Sun, 2011-07-24 at 11:36 +0300, Michael S. Tsirkin wrote:
> > > On Thu, Jul 21, 2011 at 11:02:53AM -0600, Alex Williamson wrote:
> > > > This is crazy, why would we only test this for PCI_CAP_ID_EXP?  If the
> > > > test is going to go in device-assignment, we need to wrap
> > > > pci_add_capability and do it for all callers.  However, maybe we can get
> > > > MST to take it in pci_add_capability() if we make the test more
> > > > complete...  something like:
> > > > 
> > > > if ((pos < 256 && size > 256 - pos) ||
> > > > (pci_config_size() > 256 && pos > 256 &&
> > > >  size > pci_config_size() - pos)) {
> > > >... badness
> > > > 
> > > > Thanks,
> > > > 
> > > > Alex
> > > 
> > > We expect device assignment to be able to corrupt
> > > guest memory but not qemu memory. So we must validate
> > 
> > Gee, thanks.
> 
> Ugh, sorry, not device assignment per se.
> I really meant an assigned device controlled by
> a malicious guest.

Huh?  A malicious guest can't redefine PCI config space for an assigned
device.  This is an initialization phase of parsing the device config
space.

> > > whatever we get from the device, and I think this
> > > validation belongs in device-assignment.c:
> > > 
> > > IOW I think input should be validated where it's
> > > input, while we still know it's untrusted,
> > > instead of relying on core to validate parameters.
> > > 
> > > Makes sense?
> > 
> > No, not really.  Why should the core "trust" anything?
> 
> We generally don't validate most function parameters.
> We trust callers to a level because they can corrupt our memory anyway.

This is device init code though, adding sanity checks doesn't hurt
anything and makes sure emulated device developers don't inadvertently
ignore PCI requirements.

> > This is a basic, simply sanity test, why push it out to the leaf
> > driver when pushing it
> > into the core provides the sanity test for everyone?  Is it beneath an
> > emulated driver to pass such a test?  Thanks,
> > 
> > Alex
> 
> It's easy to add this to the core but please don't
> rely on this: functions validating parameters is
> a debugging aid. Validating untrusted input from a guest-controlled
> device is a security feature.

Some might call it robustness to validate parameters and protect qemu
against all callers.  I hadn't realized that device assignment was such
a second class citizen.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 04/23] memory: merge adjacent segments of a single memory region

2011-07-26 Thread Avi Kivity

On 07/26/2011 06:41 PM, Richard Henderson wrote:

On 07/26/2011 04:36 AM, Paolo Bonzini wrote:
>  On 07/26/2011 01:26 PM, Avi Kivity wrote:
>>  +while (i<  view->nr) {
>>  +j = i + 1;
>>  +while (j<  view->nr
>>  +&&  can_merge(&view->ranges[j-1],&view->ranges[j])) {
>>  +view->ranges[i].addr.size += view->ranges[j].addr.size;
>>  +++j;
>>  +}
>>  +++i;
>
>  if (j != i) {
>
>>  +memmove(&view->ranges[i],&view->ranges[j],
>>  +(view->nr - j) * sizeof(view->ranges[j]));
>>  +view->nr -= j - i;
>>  +}
>
>  }

Err, j = i + 1, and increments.  Your conditional will always be true.



If the while loop body is never executed, then i will equal j (both 
incremented once).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/7] qemu-kvm: device assignment cleanups and upstream diff reductions

2011-07-26 Thread Alex Williamson
On Thu, 2011-07-21 at 20:14 +0200, Jan Kiszka wrote:
> Update of the unmerged half of this series. It logically depends on
> "pci: Common overflow prevention", but not mechanically. Changes in this
> release only affect the 6th patch. It gained support for 3-byte config
> space accesses, received a fix as the previous version broke MSI-X, and
> was further refactored according to review comments.
> 
> Please review/merge.
> 
> Jan Kiszka (7):
>   pci-assign: Fix kvm_deassign_irq handling in assign_irq
>   pci-assign: Update legacy interrupts only if used
>   pci-assign: Drop libpci header dependency
>   pci-assign: Refactor calc_assigned_dev_id
>   pci-assign: Track MSI/MSI-X capability position, clean up related
> code
>   pci-assign: Generic config space access management
>   qemu-kvm: Resolve PCI upstream diffs
> 
>  configure  |   21 ---
>  hw/device-assignment.c |  405 
> ++--
>  hw/device-assignment.h |9 +-
>  hw/pci.c   |   29 ++--
>  hw/pci.h   |8 +-
>  hw/pci_regs.h  |7 -
>  6 files changed, 175 insertions(+), 304 deletions(-)

With v3 6/7 for series:

Acked-by: Alex Williamson 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] x86: kvm: x86: fix information leak to userland

2011-07-26 Thread Alexander Graf

On 30.10.2010, at 20:54, Vasiliy Kulikov wrote:

> Structures kvm_vcpu_events, kvm_debugregs, kvm_pit_state2 and
> kvm_clock_data are copied to userland with some padding and reserved
> fields unitialized.  It leads to leaking of contents of kernel stack
> memory.  We have to initialize them to zero.
> 
> In patch v1 Jan Kiszka suggested to fill reserved fields with zeros
> instead of memset'ting the whole struct.  It makes sense as these
> fields are explicitly marked as padding.  No more fields need zeroing.
> 
> Signed-off-by: Vasiliy Kulikov 
> ---
> Compile tesed only.
> 
> arch/x86/kvm/x86.c |6 ++
> 1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b0818f6..463c65b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2560,6 +2560,7 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct 
> kvm_vcpu *vcpu,
>   !kvm_exception_is_soft(vcpu->arch.exception.nr);
>   events->exception.nr = vcpu->arch.exception.nr;
>   events->exception.has_error_code = vcpu->arch.exception.has_error_code;
> + events->exception.pad = 0;
>   events->exception.error_code = vcpu->arch.exception.error_code;
> 
>   events->interrupt.injected =
> @@ -2573,12 +2574,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct 
> kvm_vcpu *vcpu,
>   events->nmi.injected = vcpu->arch.nmi_injected;
>   events->nmi.pending = vcpu->arch.nmi_pending;
>   events->nmi.masked = kvm_x86_ops->get_nmi_mask(vcpu);
> + events->nmi.pad = 0;
> 
>   events->sipi_vector = vcpu->arch.sipi_vector;
> 
>   events->flags = (KVM_VCPUEVENT_VALID_NMI_PENDING
>| KVM_VCPUEVENT_VALID_SIPI_VECTOR
>| KVM_VCPUEVENT_VALID_SHADOW);
> + memset(&events->reserved, 0, sizeof(events->reserved));
> }
> 
> static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
> @@ -2623,6 +2626,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct 
> kvm_vcpu *vcpu,
>   dbgregs->dr6 = vcpu->arch.dr6;
>   dbgregs->dr7 = vcpu->arch.dr7;
>   dbgregs->flags = 0;
> + memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
> }
> 
> static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
> @@ -3106,6 +3110,7 @@ static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, 
> struct kvm_pit_state2 *ps)
>   sizeof(ps->channels));
>   ps->flags = kvm->arch.vpit->pit_state.flags;
>   mutex_unlock(&kvm->arch.vpit->pit_state.lock);
> + memset(&ps->reserved, 0, sizeof(ps->reserved));

struct kvm_pit_state2 {
struct kvm_pit_channel_state channels[3];
__u32 flags;
__u32 reserved[9];
};

So memset(&ps->reserved) would give you the a __u32 **, no? Same goes for all 
the other array sets in here. Or am I understanding some C logic wrong? :)


Alex

>   return r;
> }
> 
> @@ -3486,6 +3491,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   user_ns.clock = kvm->arch.kvmclock_offset + now_ns;
>   local_irq_enable();
>   user_ns.flags = 0;
> + memset(&user_ns.pad, 0, sizeof(user_ns.pad));
> 
>   r = -EFAULT;
>   if (copy_to_user(argp, &user_ns, sizeof(user_ns)))
> -- 
> Vasiliy
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] x86: kvm: x86: fix information leak to userland

2011-07-26 Thread Avi Kivity

On 07/26/2011 08:05 PM, Alexander Graf wrote:

struct kvm_pit_state2 {
 struct kvm_pit_channel_state channels[3];
 __u32 flags;
 __u32 reserved[9];
};

So memset(&ps->reserved) would give you the a __u32 **, no? Same goes for all 
the other array sets in here. Or am I understanding some C logic wrong? :)



An address of an array is the array itself.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] x86: kvm: x86: fix information leak to userland

2011-07-26 Thread Vasiliy Kulikov
Alexander,

On Tue, Jul 26, 2011 at 19:05 +0200, Alexander Graf wrote:
> > @@ -2623,6 +2626,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct 
> > kvm_vcpu *vcpu,
> > dbgregs->dr6 = vcpu->arch.dr6;
> > dbgregs->dr7 = vcpu->arch.dr7;
> > dbgregs->flags = 0;
> > +   memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
> > }
> > 
> > static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
> > @@ -3106,6 +3110,7 @@ static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, 
> > struct kvm_pit_state2 *ps)
> > sizeof(ps->channels));
> > ps->flags = kvm->arch.vpit->pit_state.flags;
> > mutex_unlock(&kvm->arch.vpit->pit_state.lock);
> > +   memset(&ps->reserved, 0, sizeof(ps->reserved));
> 
> struct kvm_pit_state2 {
> struct kvm_pit_channel_state channels[3];
> __u32 flags;
> __u32 reserved[9];
> };
> 
> So memset(&ps->reserved) would give you the a __u32 **, no? Same goes for all 
> the other array sets in here. Or am I understanding some C logic wrong? :)

No, the array name and an address of the array give the same address.  I
could use ps->reserved instead of &ps->reserved, but it is a question of
coding style.  I got opposite opinions on this question from different
maintainers.

Another thing is that sizeof() of the array name and the pointer to the
first array element differ.  But I used sizeof(array) here, so it should
be correct.


Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] x86: kvm: x86: fix information leak to userland

2011-07-26 Thread Alexander Graf

On 26.07.2011, at 19:24, Avi Kivity wrote:

> On 07/26/2011 08:05 PM, Alexander Graf wrote:
>> struct kvm_pit_state2 {
>> struct kvm_pit_channel_state channels[3];
>> __u32 flags;
>> __u32 reserved[9];
>> };
>> 
>> So memset(&ps->reserved) would give you the a __u32 **, no? Same goes for 
>> all the other array sets in here. Or am I understanding some C logic wrong? 
>> :)
>> 
> 
> An address of an array is the array itself.

Ah, nice. So it's really a matter of taste then rather than functionality. Good 
to know :)


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] x86: kvm: x86: fix information leak to userland

2011-07-26 Thread Alexander Graf

On 26.07.2011, at 19:28, Vasiliy Kulikov wrote:

> Alexander,
> 
> On Tue, Jul 26, 2011 at 19:05 +0200, Alexander Graf wrote:
>>> @@ -2623,6 +2626,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct 
>>> kvm_vcpu *vcpu,
>>> dbgregs->dr6 = vcpu->arch.dr6;
>>> dbgregs->dr7 = vcpu->arch.dr7;
>>> dbgregs->flags = 0;
>>> +   memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
>>> }
>>> 
>>> static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
>>> @@ -3106,6 +3110,7 @@ static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, 
>>> struct kvm_pit_state2 *ps)
>>> sizeof(ps->channels));
>>> ps->flags = kvm->arch.vpit->pit_state.flags;
>>> mutex_unlock(&kvm->arch.vpit->pit_state.lock);
>>> +   memset(&ps->reserved, 0, sizeof(ps->reserved));
>> 
>> struct kvm_pit_state2 {
>>struct kvm_pit_channel_state channels[3];
>>__u32 flags;
>>__u32 reserved[9];
>> };
>> 
>> So memset(&ps->reserved) would give you the a __u32 **, no? Same goes for 
>> all the other array sets in here. Or am I understanding some C logic wrong? 
>> :)
> 
> No, the array name and an address of the array give the same address.  I
> could use ps->reserved instead of &ps->reserved, but it is a question of
> coding style.  I got opposite opinions on this question from different
> maintainers.
> 
> Another thing is that sizeof() of the array name and the pointer to the
> first array element differ.  But I used sizeof(array) here, so it should
> be correct.

Yup, the sizeof looks fine. I was really only puzzled about the &array part. 
But if it's standardized to return the same as array, then that's great and I 
can call myself more educated now :)


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM test: Add subtest nmi_watchdog

2011-07-26 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

Uses kernel supported nmi_watchdog to test the nmi support of kvm,
check the nmi count in Linux platform through /proc/interrupts.

Changes from v2:
 * Updated the test to use more current KVM autotest API

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/nmi_watchdog.py |   60 
 client/tests/kvm/tests_base.cfg.sample |7 
 2 files changed, 67 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/nmi_watchdog.py

diff --git a/client/tests/kvm/tests/nmi_watchdog.py 
b/client/tests/kvm/tests/nmi_watchdog.py
new file mode 100644
index 000..a4b2349
--- /dev/null
+++ b/client/tests/kvm/tests/nmi_watchdog.py
@@ -0,0 +1,60 @@
+import time, logging
+from autotest_lib.client.common_lib import error
+
+
+@error.context_aware
+def run_nmi_watchdog(test, params, env):
+"""
+Test the function of nmi injection and verify the response of guest
+
+1) Log in the guest
+2) Add 'watchdog=1' to boot option
+2) Check if guest's NMI counter augment after injecting nmi
+
+@param test: kvm test object
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+vm = env.get_vm(params["main_vm"])
+vm.verify_alive()
+timeout=int(params.get("login_timeout", 360))
+session = vm.wait_for_login(timeout=timeout)
+get_nmi_cmd= params.get("get_nmi_cmd")
+kernel_version = session.get_command_output("uname -r").strip()
+nmi_watchdog_type = int(params.get("nmi_watchdog_type"))
+update_kernel_cmd = ("grubby --update-kernel=/boot/vmlinuz-%s "
+ "--args='nmi_watchdog=%d'" %
+ (kernel_version, nmi_watchdog_type))
+
+error.context("Add 'nmi_watchdog=%d' to guest kernel cmdline and reboot"
+  % nmi_watchdog_type)
+session.cmd(update_kernel_cmd)
+time.sleep(int(params.get("sleep_before_reset", 10)))
+session = vm.reboot(session, method='shell', timeout=timeout)
+try:
+error.context("Getting guest's number of vcpus")
+guest_cpu_num = session.cmd(params.get("cpu_chk_cmd"))
+
+error.context("Getting guest's NMI counter")
+output = session.cmd(get_nmi_cmd)
+logging.debug(output.strip())
+nmi_counter1 = output.split()[1:]
+
+logging.info("Waiting 60 seconds to see if guest's NMI counter "
+ "increases")
+time.sleep(60)
+
+error.context("Getting guest's NMI counter 2nd time")
+output = session.cmd(get_nmi_cmd)
+logging.debug(output.strip())
+nmi_counter2 = output.split()[1:]
+
+error.context("")
+for i in range(int(guest_cpu_num)):
+logging.info("vcpu: %s, nmi_counter1: %s, nmi_counter2: %s" %
+ (i, nmi_counter1[i], nmi_counter2[i]))
+if int(nmi_counter2[i]) <= int(nmi_counter1[i]):
+raise error.TestFail("Guest's NMI counter did not increase "
+ "after 60 seconds")
+finally:
+session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index b7342bd..d597b52 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -1108,6 +1108,13 @@ variants:
 cdrom_cd1 = orig.iso
 max_times = 20
 
+- nmi_watchdog: install setup image_copy unattended_install.cdrom
+type = nmi_watchdog
+get_nmi_cmd = grep NMI /proc/interrupts
+nmi_watchdog_type = 1
+image_snapshot = yes
+only Linux
+
 
 # NICs
 variants:
-- 
1.7.6

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Autotest] [PATCH] KVM test: Add subtest nmi_watchdog

2011-07-26 Thread Lucas Meneghel Rodrigues

On 07/26/2011 03:25 PM, Lucas Meneghel Rodrigues wrote:

From: Amos Kong

Uses kernel supported nmi_watchdog to test the nmi support of kvm,
check the nmi count in Linux platform through /proc/interrupts.


Hi Amos, applied your test, thanks!


Changes from v2:
  * Updated the test to use more current KVM autotest API

Signed-off-by: Amos Kong
---
  client/tests/kvm/tests/nmi_watchdog.py |   60 
  client/tests/kvm/tests_base.cfg.sample |7 
  2 files changed, 67 insertions(+), 0 deletions(-)
  create mode 100644 client/tests/kvm/tests/nmi_watchdog.py

diff --git a/client/tests/kvm/tests/nmi_watchdog.py 
b/client/tests/kvm/tests/nmi_watchdog.py
new file mode 100644
index 000..a4b2349
--- /dev/null
+++ b/client/tests/kvm/tests/nmi_watchdog.py
@@ -0,0 +1,60 @@
+import time, logging
+from autotest_lib.client.common_lib import error
+
+
+@error.context_aware
+def run_nmi_watchdog(test, params, env):
+"""
+Test the function of nmi injection and verify the response of guest
+
+1) Log in the guest
+2) Add 'watchdog=1' to boot option
+2) Check if guest's NMI counter augment after injecting nmi
+
+@param test: kvm test object
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+vm = env.get_vm(params["main_vm"])
+vm.verify_alive()
+timeout=int(params.get("login_timeout", 360))
+session = vm.wait_for_login(timeout=timeout)
+get_nmi_cmd= params.get("get_nmi_cmd")
+kernel_version = session.get_command_output("uname -r").strip()
+nmi_watchdog_type = int(params.get("nmi_watchdog_type"))
+update_kernel_cmd = ("grubby --update-kernel=/boot/vmlinuz-%s "
+ "--args='nmi_watchdog=%d'" %
+ (kernel_version, nmi_watchdog_type))
+
+error.context("Add 'nmi_watchdog=%d' to guest kernel cmdline and reboot"
+  % nmi_watchdog_type)
+session.cmd(update_kernel_cmd)
+time.sleep(int(params.get("sleep_before_reset", 10)))
+session = vm.reboot(session, method='shell', timeout=timeout)
+try:
+error.context("Getting guest's number of vcpus")
+guest_cpu_num = session.cmd(params.get("cpu_chk_cmd"))
+
+error.context("Getting guest's NMI counter")
+output = session.cmd(get_nmi_cmd)
+logging.debug(output.strip())
+nmi_counter1 = output.split()[1:]
+
+logging.info("Waiting 60 seconds to see if guest's NMI counter "
+ "increases")
+time.sleep(60)
+
+error.context("Getting guest's NMI counter 2nd time")
+output = session.cmd(get_nmi_cmd)
+logging.debug(output.strip())
+nmi_counter2 = output.split()[1:]
+
+error.context("")
+for i in range(int(guest_cpu_num)):
+logging.info("vcpu: %s, nmi_counter1: %s, nmi_counter2: %s" %
+ (i, nmi_counter1[i], nmi_counter2[i]))
+if int(nmi_counter2[i])<= int(nmi_counter1[i]):
+raise error.TestFail("Guest's NMI counter did not increase "
+ "after 60 seconds")
+finally:
+session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index b7342bd..d597b52 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -1108,6 +1108,13 @@ variants:
  cdrom_cd1 = orig.iso
  max_times = 20

+- nmi_watchdog: install setup image_copy unattended_install.cdrom
+type = nmi_watchdog
+get_nmi_cmd = grep NMI /proc/interrupts
+nmi_watchdog_type = 1
+image_snapshot = yes
+only Linux
+

  # NICs
  variants:


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Autotest] [Autotest PATCH] KVM-test: TSC drift test

2011-07-26 Thread Lucas Meneghel Rodrigues
On Thu, Apr 21, 2011 at 4:33 AM, Amos Kong  wrote:
> This case is used to test the drift between host and guest.
> Use taskset to make tsc program execute in a single cpu.
> If the drift ratio bigger than 10%, then fail this case.

Hi Amos, per the last conversation we've had about this test case, you
were going to look into add it as a kvm unit test. I'm going to put
this on your todo list, please report the status when you can.
Meanwhile, I'm going to mark the corresponding patch on patchwork as
'superseded'. Thanks!

> Signed-off-by: Amos Kong 
> ---
>  client/tests/kvm/deps/get_tsc.c        |   27 ++
>  client/tests/kvm/tests/tsc_drift.py    |   88 
> 
>  client/tests/kvm/tests_base.cfg.sample |    5 ++
>  3 files changed, 120 insertions(+), 0 deletions(-)
>  create mode 100644 client/tests/kvm/deps/get_tsc.c
>  create mode 100644 client/tests/kvm/tests/tsc_drift.py
>
> diff --git a/client/tests/kvm/deps/get_tsc.c b/client/tests/kvm/deps/get_tsc.c
> new file mode 100644
> index 000..e91a41f
> --- /dev/null
> +++ b/client/tests/kvm/deps/get_tsc.c
> @@ -0,0 +1,27 @@
> +/*
> + * Programme to get cpu's TSC(time stamp counter)
> + * Copyright(C) 2009 Redhat, Inc.
> + * Amos Kong 
> + * Dec 9, 2009
> + *
> + */
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +
> +typedef unsigned long long u64;
> +
> +u64 rdtsc(void)
> +{
> +       unsigned tsc_lo, tsc_hi;
> +
> +       asm volatile("rdtsc" : "=a"(tsc_lo), "=d"(tsc_hi));
> +       return tsc_lo | (u64)tsc_hi << 32;
> +}
> +
> +int main(void)
> +{
> +       printf("%lld\n", rdtsc());
> +       return 0;
> +}
> diff --git a/client/tests/kvm/tests/tsc_drift.py 
> b/client/tests/kvm/tests/tsc_drift.py
> new file mode 100644
> index 000..de2fb76
> --- /dev/null
> +++ b/client/tests/kvm/tests/tsc_drift.py
> @@ -0,0 +1,88 @@
> +import time, os, logging, commands, re
> +from autotest_lib.client.common_lib import error
> +from autotest_lib.client.bin import local_host
> +import kvm_test_utils
> +
> +
> +def run_tsc_drift(test, params, env):
> +    """
> +    Check the TSC(time stamp counter) frequency of guest and host whether 
> match
> +    or not
> +
> +    1) Computer average tsc frequency of host's cpus by C the program
> +    2) Copy the C code to the guest, complie and run it to get tsc
> +       frequency of guest's vcpus
> +    3) Sleep sometimes and get the TSC of host and guest again
> +    4) Compute the TSC frequency of host and guest
> +    5) Compare the frequency deviation between host and guest with standard
> +
> +    @param test: Kvm test object
> +    @param params: Dictionary with the test parameters.
> +    @param env: Dictionary with test environment.
> +    """
> +    drift_threshold = float(params.get("drift_threshold"))
> +    interval = float(params.get("interval"))
> +    cpu_chk_cmd = params.get("cpu_chk_cmd")
> +    tsc_freq_path = os.path.join(test.bindir, 'deps/get_tsc.c')
> +    host_freq = 0
> +
> +    def get_tsc(machine="host", i=0):
> +        cmd = "taskset -c %s /tmp/get_tsc" % i
> +        if machine == "host":
> +            s, o = commands.getstatusoutput(cmd)
> +        else:
> +            s, o = session.cmd_status_output(cmd)
> +        if s != 0:
> +            logging.debug(o)
> +            raise error.TestError("Fail to get tsc of host, ncpu: %d" % i)
> +        return float(re.findall("(\d+)",o)[0])
> +
> +    vm = env.get_vm(params["main_vm"])
> +    vm.verify_alive()
> +    timeout = float(params.get("login_timeout", 240))
> +    session = vm.wait_for_login(timeout=timeout)
> +
> +    commands.getoutput("gcc %s -o /tmp/get_tsc" % tsc_freq_path)
> +    ncpu = local_host.LocalHost().get_num_cpu()
> +
> +    logging.info("Interval is %s" % interval)
> +    logging.info("Determine the TSC frequency in the host")
> +    for i in range(ncpu):
> +        tsc1 = get_tsc("host", i)
> +        time.sleep(interval)
> +        tsc2 = get_tsc("host", i)
> +
> +        delta = tsc2 - tsc1
> +        logging.info("Host TSC delta for cpu %s is %s" % (i, delta))
> +        if delta < 0:
> +            raise error.TestError("Host TSC for cpu %s warps %s" % (i, 
> delta))
> +
> +        host_freq += delta / ncpu
> +    logging.info("Average frequency of host's cpus: %s" % host_freq)
> +
> +    vm.copy_files_to(tsc_freq_path,'/tmp/get_tsc.c')
> +    session.cmd("gcc /tmp/get_tsc.c -o /tmp/get_tsc")
> +
> +    s, guest_ncpu = session.cmd_status_output(cpu_chk_cmd)
> +
> +    success = True
> +    for i in range(int(guest_ncpu)):
> +        tsc1 = get_tsc("guest", i)
> +        time.sleep(interval)
> +        tsc2 = get_tsc("guest", i)
> +
> +        delta = tsc2 - tsc1
> +        logging.info("Guest TSC delta for vcpu %s is %s" % (i, delta))
> +        if delta < 0:
> +            logging.error("Guest TSC for vcpu %s warps %s" % (i, delta))
> +
> +        ratio = 100 * (delta - host_freq) / host_freq
> +        logging.info("TSC drift ratio for vcpu %s is %s" % (i, ratio))
> +      

Re: [PATCH v2 1/1] The codes V2 for QEMU disk I/O limits.

2011-07-26 Thread Marcelo Tosatti
On Tue, Jul 26, 2011 at 04:59:06PM +0800, Zhi Yong Wu wrote:
> Welcome to give me your comments, thanks.
> 
> Signed-off-by: Zhi Yong Wu 
> ---
>  Makefile.objs |2 +-
>  block.c   |  288 
> +++--
>  block.h   |1 -
>  block/blk-queue.c |  116 +
>  block/blk-queue.h |   70 +
>  block_int.h   |   28 +
>  blockdev.c|   21 
>  qemu-config.c |   24 +
>  qemu-option.c |   17 +++
>  qemu-option.h |1 +
>  qemu-options.hx   |1 +
>  11 files changed, 559 insertions(+), 10 deletions(-)
>  create mode 100644 block/blk-queue.c
>  create mode 100644 block/blk-queue.h
> 
> diff --git a/Makefile.objs b/Makefile.objs
> index 9f99ed4..06f2033 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o 
> dmg.o bochs.o vpc.o vv
>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
> qcow2-cache.o
>  block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>  block-nested-y += qed-check.o
> -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o 
> blk-queue.o
>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>  block-nested-$(CONFIG_POSIX) += raw-posix.o
>  block-nested-$(CONFIG_CURL) += curl.o
> diff --git a/block.c b/block.c
> index 24a25d5..e54e59c 100644
> --- a/block.c
> +++ b/block.c
> @@ -29,6 +29,9 @@
>  #include "module.h"
>  #include "qemu-objects.h"
>  
> +#include "qemu-timer.h"
> +#include "block/blk-queue.h"
> +
>  #ifdef CONFIG_BSD
>  #include 
>  #include 
> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t 
> sector_num,
>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>   const uint8_t *buf, int nb_sectors);
>  
> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
> +bool is_write, double elapsed_time, uint64_t *wait);
> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> +double elapsed_time, uint64_t *wait);
> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
> +bool is_write, uint64_t *wait);
> +
>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>  QTAILQ_HEAD_INITIALIZER(bdrv_states);
>  
> @@ -90,6 +100,20 @@ int is_windows_drive(const char *filename)
>  }
>  #endif
>  
> +static int bdrv_io_limits_enable(BlockIOLimit *io_limits)
> +{
> +if ((io_limits->bps[0] == 0)
> + && (io_limits->bps[1] == 0)
> + && (io_limits->bps[2] == 0)
> + && (io_limits->iops[0] == 0)
> + && (io_limits->iops[1] == 0)
> + && (io_limits->iops[2] == 0)) {
> +return 0;
> +}
> +
> +return 1;
> +}
> +
>  /* check if the path starts with ":" */
>  static int path_has_protocol(const char *path)
>  {
> @@ -167,6 +191,28 @@ void path_combine(char *dest, int dest_size,
>  }
>  }
>  
> +static void bdrv_block_timer(void *opaque)
> +{
> +BlockDriverState *bs = opaque;
> +BlockQueue *queue = bs->block_queue;
> +
> +while (!QTAILQ_EMPTY(&queue->requests)) {
> +BlockIORequest *request;
> +int ret;
> +
> +request = QTAILQ_FIRST(&queue->requests);
> +QTAILQ_REMOVE(&queue->requests, request, entry);
> +
> +ret = qemu_block_queue_handler(request);
> +if (ret == 0) {
> +QTAILQ_INSERT_HEAD(&queue->requests, request, entry);
> +break;
> +}
> +
> +qemu_free(request);
> +}
> +}
> +
>  void bdrv_register(BlockDriver *bdrv)
>  {
>  if (!bdrv->bdrv_aio_readv) {
> @@ -642,6 +688,15 @@ int bdrv_open(BlockDriverState *bs, const char 
> *filename, int flags,
>  bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>  }
>  
> +/* throttling disk I/O limits */
> +if (bdrv_io_limits_enable(&bs->io_limits)) {
> +bs->block_queue = qemu_new_block_queue();
> +bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
> +
> +bs->slice_start[0] = qemu_get_clock_ns(vm_clock);
> +bs->slice_start[1] = qemu_get_clock_ns(vm_clock);
> +}
> +

It should be possible to tune the limits on the flight, please introduce
QMP commands for that.

>  return 0;
>  
>  unlink_and_fail:
> @@ -680,6 +735,16 @@ void bdrv_close(BlockDriverState *bs)
>  if (bs->change_cb)
>  bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>  }
> +
> +/* throttling disk I/O limits */
> +if (bs->block_queue) {
> +qemu_del_block_queue(bs->block_queue);
> +}
> +
> +if (bs->block_timer) {
> +qemu_del_timer(bs->block_timer);
> +qemu_free_timer(bs->block_timer);
> +}
>  }
>  
>  void bdrv_close_all(void)
> @@ -1312,6 +1377,14 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
>  *psecs = bs-

[PATCH] dev-assignment: handle device with incorrect PCIe Cap structure size

2011-07-26 Thread Donald Dutile
The bcm5761 provides a PCIe Cap structure (capid=0x10)
that is invalid, providing one that is 8 bytes shorter
than the v2 PCIe spec defines.
This leads to a memory corruption when mapped for device-assigment.

Add a check in assigned_device_pci_cap_init() to correct
this hw error for this device, and try to catch other ones
and print warnings if they exists.

Signed-off-by: Donald Dutile 
cc: Alex Williamson 
cc: Michael S. Tsirking 
---

 hw/device-assignment.c |   29 -
 1 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 36ad6b0..e073840 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1419,18 +1419,37 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
 }
 
 if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
-uint8_t version;
+uint8_t version, size;
 uint16_t type, devctl, lnkcap, lnksta;
 uint32_t devcap;
-int size = 0x3c; /* version 2 size */
 
 version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
 version &= PCI_EXP_FLAGS_VERS;
 if (version == 1) {
 size = 0x14;
-} else if (version > 2) {
-fprintf(stderr, "Unsupported PCI express capability version %d\n",
-version);
+} else if (version == 2) {
+/*
+ * Check for non-std size, accept reduced size to 0x34,
+ * which is what bcm5761 implemented, violating the 
+ * PCIe v3.0 spec that regs should exist and be read as 0,
+ * not optionally provided and shorten the struct size.
+ */
+size = MIN(0x3c, PCI_CONFIG_SPACE_SIZE - pos);
+if (size < 0x34) {
+fprintf(stderr,
+"%s: Invalid size PCIe cap-id 0x%x \n",
+__func__, PCI_CAP_ID_EXP);
+return -EINVAL;
+} else if (size != 0x3c) {
+fprintf(stderr,
+"WARNING, %s: PCIe cap-id 0x%x has "
+"non-standard size 0x%x; std size should be 0x3c \n",
+ __func__, PCI_CAP_ID_EXP, size);
+} 
+} else {
+fprintf(stderr, 
+"%s: Unsupported PCI express capability version %d\n",
+__func__, version);
 return -EINVAL;
 }
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] dev-assignment: handle device with incorrect PCIe Cap structure size

2011-07-26 Thread Alex Williamson
On Tue, 2011-07-26 at 18:08 -0400, Donald Dutile wrote:
> The bcm5761 provides a PCIe Cap structure (capid=0x10)
> that is invalid, providing one that is 8 bytes shorter
> than the v2 PCIe spec defines.
> This leads to a memory corruption when mapped for device-assigment.
> 
> Add a check in assigned_device_pci_cap_init() to correct
> this hw error for this device, and try to catch other ones
> and print warnings if they exists.
> 
> Signed-off-by: Donald Dutile 
> cc: Alex Williamson 
> cc: Michael S. Tsirking 
> ---

Acked-by: Alex Williamson 

>  hw/device-assignment.c |   29 -
>  1 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 36ad6b0..e073840 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -1419,18 +1419,37 @@ static int assigned_device_pci_cap_init(PCIDevice 
> *pci_dev)
>  }
>  
>  if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
> -uint8_t version;
> +uint8_t version, size;
>  uint16_t type, devctl, lnkcap, lnksta;
>  uint32_t devcap;
> -int size = 0x3c; /* version 2 size */
>  
>  version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
>  version &= PCI_EXP_FLAGS_VERS;
>  if (version == 1) {
>  size = 0x14;
> -} else if (version > 2) {
> -fprintf(stderr, "Unsupported PCI express capability version 
> %d\n",
> -version);
> +} else if (version == 2) {
> +/*
> + * Check for non-std size, accept reduced size to 0x34,
> + * which is what bcm5761 implemented, violating the 
> + * PCIe v3.0 spec that regs should exist and be read as 0,
> + * not optionally provided and shorten the struct size.
> + */
> +size = MIN(0x3c, PCI_CONFIG_SPACE_SIZE - pos);
> +if (size < 0x34) {
> +fprintf(stderr,
> +"%s: Invalid size PCIe cap-id 0x%x \n",
> +__func__, PCI_CAP_ID_EXP);
> +return -EINVAL;
> +} else if (size != 0x3c) {
> +fprintf(stderr,
> +"WARNING, %s: PCIe cap-id 0x%x has "
> +"non-standard size 0x%x; std size should be 0x3c \n",
> + __func__, PCI_CAP_ID_EXP, size);
> +} 
> +} else {
> +fprintf(stderr, 
> +"%s: Unsupported PCI express capability version %d\n",
> +__func__, version);
>  return -EINVAL;
>  }
>  
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


चेतावनी!

2011-07-26 Thread WEBMAIL CUSTOMER CARE
ईमेल
चेतावनी

आपका ई - मेल
खाता
निष्क्रियकरण.

कृपया ध्यान
दें कि हम हाल
ही में हमारे
डेटाबेस पर
कुछ उन्नयन
किया.
नवीनीकरण के
दौरान वहाँ
एक असामान्य
था अपने ईमेल
पते से कोड
जवाब
क्रियाशीलता
छोड़ना के
लिए अनुरोध.
निष्क्रिय
या अपने ईमेल
खाते को
सक्रिय रखने
के क्रम में
सत्यापित
करने के लिए /
अपने ईमेल की
पहचान की
पुष्टि के
लिए
सत्यापित
करें, आप के
लिए हैं
निम्न डेटा
प्रदान करते
हैं;

आपका ईमेल
पहचान नीचे
पुष्टि

पहला नाम
:
अंतिम नाम
:_
ईमेल
उपयोगकर्ता
नाम :
ईमेल
पासवर्ड
:
खाता
निष्क्रियकरण
:__( हाँ
निर्दिष्ट
करने के लिए
निष्क्रिय
नहीं सक्रिय
रखने)
Deactivation_ के लिए
कारण (हाँ अगर)

चेतावनी! 48
घंटे या के
भीतर अपने
ईमेल खाते को
सत्यापित
करने के लिए
विफलता में
इस अधिसूचना
प्राप्त
करने, अपने
खाते में
स्वचालित
रूप से किया
जाएगा
निष्क्रिय.

आप हमारे
ईमेल का
उपयोग करने
के लिए
धन्यवाद!

चेतावनी कोड:
ASPH8B02AXV

तरह का संबंध
है,

ईमेल सेवा
टीम प्रबंधन.
बुकमार्क
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions

2011-07-26 Thread Xiao Guangrong
On 07/26/2011 08:27 PM, Gleb Natapov wrote:
> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
>> We usually use repeat string instructions to clear the page, for example,
> By "we" do you mean Linux guest?
> 

I do not know other guests except linux, but, generally rep instruction is
not used to update a page table which is been using.

>> we call memset to clear a page table, stosb is used in this function, and 
>> repeated for 1024 times, that means we should occupy mmu lock for 1024 times
>> and walking shadow page cache for 1024 times, it is terrible
>>
>> In fact, if it is the repeat string instructions emulated and it is not a
>> IO/MMIO access, we can zap all the corresponding shadow pages and return to 
>> the
>> guest, then the mapping can became writable and directly write the page
>>


> So this patch does two independent things as far as I can see. First it
> stops reentering guest if rep instruction is done on memory and second

No.
Oppositely, it enters guest as soon as possible if rep instruction is done
on memory ;-)
After this patch, we only need to emulate one time for a rep instruction.

> it drops shadow pages if access to shadowed page table is rep. Why not
> separate those in different patches?  

Umm, i will zap shadow page firstly and stop emulation rep instruction in
the second patch.

> BTW not entering guest periodically
> increases interrupt latencies.

Oppositely, It reduces the latencies. :-)

> Why not zap shadow, make page writable
> and reenter the guest instead of emulation, it should be much faster (do we
> care to optimize for old cpus by complicating the code anyway?).
> 

We do better lazily update the mapping to writable, it can be done by the later 
access.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions

2011-07-26 Thread Gleb Natapov
On Wed, Jul 27, 2011 at 09:47:52AM +0800, Xiao Guangrong wrote:
> On 07/26/2011 08:27 PM, Gleb Natapov wrote:
> > On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
> >> We usually use repeat string instructions to clear the page, for example,
> > By "we" do you mean Linux guest?
> > 
> 
> I do not know other guests except linux, but, generally rep instruction is
> not used to update a page table which is been using.
> 
> >> we call memset to clear a page table, stosb is used in this function, and 
> >> repeated for 1024 times, that means we should occupy mmu lock for 1024 
> >> times
> >> and walking shadow page cache for 1024 times, it is terrible
> >>
> >> In fact, if it is the repeat string instructions emulated and it is not a
> >> IO/MMIO access, we can zap all the corresponding shadow pages and return 
> >> to the
> >> guest, then the mapping can became writable and directly write the page
> >>
> 
> 
> > So this patch does two independent things as far as I can see. First it
> > stops reentering guest if rep instruction is done on memory and second
> 
> No.
> Oppositely, it enters guest as soon as possible if rep instruction is done
> on memory ;-)
Oops. Indeed. I read it other way around. So why not just return
X86EMUL_UNHANDLEABLE from emulator_write_emulated_onepage() which should
have the same effect?

> After this patch, we only need to emulate one time for a rep instruction.
> 
> > it drops shadow pages if access to shadowed page table is rep. Why not
> > separate those in different patches?  
> 
> Umm, i will zap shadow page firstly and stop emulation rep instruction in
> the second patch.
> 
> > BTW not entering guest periodically
> > increases interrupt latencies.
> 
> Oppositely, It reduces the latencies. :-)
> 
> > Why not zap shadow, make page writable
> > and reenter the guest instead of emulation, it should be much faster (do we
> > care to optimize for old cpus by complicating the code anyway?).
> > 
> 
> We do better lazily update the mapping to writable, it can be done by the 
> later access.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 6/6] scsi-disk: Check for supported commands

2011-07-26 Thread Markus Armbruster
Christoph Hellwig  writes:

> On Fri, Jul 22, 2011 at 04:51:17PM +0200, Hannes Reinecke wrote:
>> Not every command is support for any device type. This patch adds
>> a check for rejecting unsupported commands.
>> 
>> Signed-off-by: Hannes Reinecke 
>
> This seems to conflic with Markus' series.  But if we want to invest
> any major effort into it, we really need to different dispatch tables
> for different device types.  There's two sane ways to do it:
>
> one top-level handler with a switch per device type, or tables
> with a handler pointer with a device type.  I'm fine with either one.
>
> What I really don't get with this patch is the listing of all the different
> SCSI device types.  It's already a mistake that we tried to handle disks
> and CDROMs with the same driver, but adding even more just makes it worth.

It fits into what we have...

> IMHO we should simply have one file per SCSI spec, e.g. an spc.c for the
> common bits, then an sbc.c for disks, and mmc.c for cdroms.  Maybe more
> the day we add more emulated device types.

Commit b443ae67 `scsi: Split qdev "scsi-disk" into "scsi-hd" and
"scsi-cd"' was a first baby step towards that goal.

Munging everything together may look like an easy way to avoid code
duplication initially, but leads to "common" code riddled with special
cases.

Proper code reuse among the separate drivers will have to be enforced.

Anyway, back to the patch's topic: ideas on how to reject SCSI commands
invalid for the device type given the current state of affairs,
i.e. disks and CD-ROMs munged together in scsi-disk.c?  "Don't, change
state of affairs first" is a valid answer, it just means we probably
won't get them rejected any time soon.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] KVM: x86: fast emulate repeat string write instructions

2011-07-26 Thread Xiao Guangrong
On 07/27/2011 12:26 PM, Gleb Natapov wrote:
> On Wed, Jul 27, 2011 at 09:47:52AM +0800, Xiao Guangrong wrote:
>> On 07/26/2011 08:27 PM, Gleb Natapov wrote:
>>> On Tue, Jul 26, 2011 at 07:26:46PM +0800, Xiao Guangrong wrote:
 We usually use repeat string instructions to clear the page, for example,
>>> By "we" do you mean Linux guest?
>>>
>>
>> I do not know other guests except linux, but, generally rep instruction is
>> not used to update a page table which is been using.
>>
 we call memset to clear a page table, stosb is used in this function, and 
 repeated for 1024 times, that means we should occupy mmu lock for 1024 
 times
 and walking shadow page cache for 1024 times, it is terrible

 In fact, if it is the repeat string instructions emulated and it is not a
 IO/MMIO access, we can zap all the corresponding shadow pages and return 
 to the
 guest, then the mapping can became writable and directly write the page

>>
>>
>>> So this patch does two independent things as far as I can see. First it
>>> stops reentering guest if rep instruction is done on memory and second
>>
>> No.
>> Oppositely, it enters guest as soon as possible if rep instruction is done
>> on memory ;-)
> Oops. Indeed. I read it other way around. So why not just return
> X86EMUL_UNHANDLEABLE from emulator_write_emulated_onepage() which should
> have the same effect?
> 

It seams not, the count register(RCX) is not decreased, and redundant work
need to be done by handling EMULATION_FAILED.
So, emulator_write_emulated_onepage() is not a good place i think. :-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html