Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2017-09-25 Thread Zhang Haoyu
If memory is hot-plugged during migration, the calculation of migration_dirty_pages
may be incorrect; it should be fixed as below:

-void migration_bitmap_extend(ram_addr_t old, ram_addr_t new)
+void migration_bitmap_extend(RAMBlock *block, ram_addr_t old, ram_addr_t new)
 {
     /* called in qemu main thread, so there is
      * no writing race against this migration_bitmap
      */
-    if (migration_bitmap_rcu) {
+    if (migration_bitmap_rcu &&
+        (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block))) {
         struct BitmapRcu *old_bitmap = migration_bitmap_rcu, *bitmap;
         bitmap = g_new(struct BitmapRcu, 1);

On 2016/8/10 8:54, Lai Jiangshan wrote:
> When the migration capability 'bypass-shared-memory'
> is set, the shared memory will be bypassed when migration.
> 
> It is the key feature to enable several excellent features for
> the qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc..
> 
> The philosophy behind this key feature and the advanced
> key features is that a part of the memory management is
> separated out from the qemu, and let the other toolkits
> such as libvirt, runv(https://github.com/hyperhq/runv/)
> or the next qemu-cmd directly access to it, manage it,
> provide features to it.
> 
> The hyperhq(http://hyper.sh  http://hypercontainer.io/)
> introduced the feature vm-template(vm-fast-live-clone)
> to the hyper container for several months, it works perfect.
> (see https://github.com/hyperhq/runv/pull/297)
> 
> The feature vm-template makes the containers(VMs) can
> be started in 130ms and save 80M memory for every
> container(VM). So that the hyper containers are fast
> and high-density as normal containers.
> 
> In current qemu command line, shared memory has
> to be configured via memory-object. Anyone can add a
> -mem-path-share to the qemu command line for combining
> with -mem-path for this feature. This patch doesn’t include
> this change of -mem-path-share.
> 
> Advanced features:
> 1) qemu-local-migration, qemu-live-update
> Set the mem-path on the tmpfs and set share=on for it when
> start the vm. example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
> 
> when you want to migrate the vm locally (after fixed a security bug
> of the qemu-binary, or other reason), you can start a new qemu with
> the same command line and -incoming, then you can migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device-state
> *ONLY*, the memory is the origin memory backed by tmpfs file.
> 
> 2) extremely-fast-save-restore
> the same above, but the mem-path is on the persistent file system.
> 
> 3)  vm-template, vm-fast-live-clone
> the template vm is started as 1), and paused when the guest reaches
> the template point(example: the guest app is ready), then the template
> vm is saved. (the qemu process of the template can be killed now, because
> we need only the memory and the device state files (in tmpfs)).
> 
> Then we can launch one or multiple VMs base on the template vm states,
> the new VMs are started without the “share=on”, all the new VMs share
> the initial memory from the memory file, they save a lot of memory.
> all the new VMs start from the template point, the guest app can go to
> work quickly.
> 
> The new VM booted from template vm can’t become template again,
> if you need this unusual chained-template feature, you can write
> a cloneable-tmpfs kernel module for it.
> 
> The libvirt toolkit can’t manage vm-template currently, in the
> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
> “libvrit managed template” feature to libvirt.
> 
> 4) yet-another-post-copy-migration
> It is a possible feature, no toolkit can do it well now.
> Using nbd server/client on the memory file is reluctantly Ok but
> inconvenient. A special feature for tmpfs might be needed to
> fully complete this feature.
> No one need yet another post copy migration method,
> but it is possible when some crazy man need it.
> 
> Changed from v1:
>fix style
> 
> Signed-off-by: Lai Jiangshan 
> ---
>  exec.c|  5 +
>  include/exec/cpu-common.h |  1 +
>  include/migration/migration.h |  1 +
>  migration/migration.c |  9 +
>  migration/ram.c   | 37 -
>  qapi-schema.json  |  6 +-
>  qmp-commands.hx   |  3 +++
>  7 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 8ffde75..888919a 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1402,6 +1402,11 @@ static void qemu_ram_setup_dump(void *addr, ram_addr_t 
> size)
>  }
>  }
>  
> +bool qemu_ram_is_shared(RAMBlock *rb)
> +{
> +return rb->flags & RAM_SHARED;
> +}
> +
>  const char *qemu_ram_get_idstr(RAMBlock *rb

Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2017-09-25 Thread Zhang Haoyu
If memory is hot-plugged during migration, the calculation of migration_dirty_pages
may not be correct:
void migration_bitmap_extend(ram_addr_t old, ram_addr_t new)
{
    ...
    migration_dirty_pages += new - old;
    call_rcu(old_bitmap, migration_bitmap_free, rcu);
    ...
}
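
A sketch of how the whole function might then look, combining this observation
with the guard proposed in the message above (an illustration only; the complete
hunk is not shown in either mail):

    void migration_bitmap_extend(RAMBlock *block, ram_addr_t old, ram_addr_t new)
    {
        /* called in qemu main thread, so there is
         * no writing race against this migration_bitmap
         */
        if (migration_bitmap_rcu &&
            (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block))) {
            struct BitmapRcu *old_bitmap = migration_bitmap_rcu, *bitmap;
            bitmap = g_new(struct BitmapRcu, 1);
            ...
            /* shared blocks are skipped entirely, so hot-plugged shared
             * memory no longer inflates the dirty-page count */
            migration_dirty_pages += new - old;
            call_rcu(old_bitmap, migration_bitmap_free, rcu);
            ...
        }
    }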

Thanks,
Zhanghaoyu


On 2016/8/10 8:54, Lai Jiangshan wrote:
> When the migration capability 'bypass-shared-memory'
> is set, the shared memory will be bypassed when migration.
> 
> It is the key feature to enable several excellent features for
> the qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc..
> 
> The philosophy behind this key feature and the advanced
> key features is that a part of the memory management is
> separated out from the qemu, and let the other toolkits
> such as libvirt, runv(https://github.com/hyperhq/runv/)
> or the next qemu-cmd directly access to it, manage it,
> provide features to it.
> 
> The hyperhq(http://hyper.sh  http://hypercontainer.io/)
> introduced the feature vm-template(vm-fast-live-clone)
> to the hyper container for several months, it works perfect.
> (see https://github.com/hyperhq/runv/pull/297)
> 
> The feature vm-template makes the containers(VMs) can
> be started in 130ms and save 80M memory for every
> container(VM). So that the hyper containers are fast
> and high-density as normal containers.
> 
> In current qemu command line, shared memory has
> to be configured via memory-object. Anyone can add a
> -mem-path-share to the qemu command line for combining
> with -mem-path for this feature. This patch doesn’t include
> this change of -mem-path-share.
> 
> Advanced features:
> 1) qemu-local-migration, qemu-live-update
> Set the mem-path on the tmpfs and set share=on for it when
> start the vm. example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
> 
> when you want to migrate the vm locally (after fixed a security bug
> of the qemu-binary, or other reason), you can start a new qemu with
> the same command line and -incoming, then you can migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device-state
> *ONLY*, the memory is the origin memory backed by tmpfs file.
> 
> 2) extremely-fast-save-restore
> the same above, but the mem-path is on the persistent file system.
> 
> 3)  vm-template, vm-fast-live-clone
> the template vm is started as 1), and paused when the guest reaches
> the template point(example: the guest app is ready), then the template
> vm is saved. (the qemu process of the template can be killed now, because
> we need only the memory and the device state files (in tmpfs)).
> 
> Then we can launch one or multiple VMs base on the template vm states,
> the new VMs are started without the “share=on”, all the new VMs share
> the initial memory from the memory file, they save a lot of memory.
> all the new VMs start from the template point, the guest app can go to
> work quickly.
> 
> The new VM booted from template vm can’t become template again,
> if you need this unusual chained-template feature, you can write
> a cloneable-tmpfs kernel module for it.
> 
> The libvirt toolkit can’t manage vm-template currently, in the
> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
> “libvrit managed template” feature to libvirt.
> 
> 4) yet-another-post-copy-migration
> It is a possible feature, no toolkit can do it well now.
> Using nbd server/client on the memory file is reluctantly Ok but
> inconvenient. A special feature for tmpfs might be needed to
> fully complete this feature.
> No one need yet another post copy migration method,
> but it is possible when some crazy man need it.
> 
> Changed from v1:
>fix style
> 
> Signed-off-by: Lai Jiangshan 
> ---
>  exec.c|  5 +
>  include/exec/cpu-common.h |  1 +
>  include/migration/migration.h |  1 +
>  migration/migration.c |  9 +
>  migration/ram.c   | 37 -
>  qapi-schema.json  |  6 +-
>  qmp-commands.hx   |  3 +++
>  7 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 8ffde75..888919a 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1402,6 +1402,11 @@ static void qemu_ram_setup_dump(void *addr, ram_addr_t 
> size)
>  }
>  }
>  
> +bool qemu_ram_is_shared(RAMBlock *rb)
> +{
> +return rb->flags & RAM_SHARED;
> +}
> +
>  const char *qemu_ram_get_idstr(RAMBlock *rb)
>  {
>  return rb->idstr;
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 952bcfe..7c18db9 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -58,6 +58,7 @@ RAMBlock *qemu_ram_block_from_host(void *ptr, bool 
> round_offset,
>  void qemu_ram_set_

Re: [Qemu-devel] [PATCH] add migration capability to bypass the shared memory

2017-09-20 Thread Zhang Haoyu
Hi,

Any update?

Thanks,
Zhang Haoyu

On 2016/8/30 12:11, Lai Jiangshan wrote:
> On Wed, Aug 10, 2016 at 5:03 PM, Juan Quintela  wrote:
>> Lai Jiangshan  wrote:
>>
>> Hi
>>
>> First of all, I like a lot the patchset, but I would preffer to split it
>> to find "possible" bugs along the lines, especially in postcopy, but not 
>> only.
> 
> Hello, thanks for review and comments
> 
> I tried to make the patch be sane and tight.
> I don't see any strong reason to split it without complicating the patch.
> 
>>
>> [very nice description of the patch]
>>
>> Nothing to say about the QMP and shared memory detection, looks correct
>> to me.
>>
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 815bc0e..880972d 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -605,6 +605,28 @@ static void migration_bitmap_sync_init(void)
>>>  num_dirty_pages_period = 0;
>>>  xbzrle_cache_miss_prev = 0;
>>>  iterations_prev = 0;
>>> +migration_dirty_pages = 0;
>>> +}
>>> +
>>> +static void migration_bitmap_init(unsigned long *bitmap)
>>> +{
>>> +RAMBlock *block;
>>> +
>>> +bitmap_clear(bitmap, 0, last_ram_offset() >> TARGET_PAGE_BITS);
>>> +rcu_read_lock();
>>> +QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>>> +if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) 
>>> {
>>> +bitmap_set(bitmap, block->offset >> TARGET_PAGE_BITS,
>>> +   block->used_length >> TARGET_PAGE_BITS);
>>> +
>>> +/*
>>> + * Count the total number of pages used by ram blocks not 
>>> including
>>> + * any gaps due to alignment or unplugs.
>>> + */
>>> + migration_dirty_pages += block->used_length >> TARGET_PAGE_BITS;
>>> + }
>>> +}
>>> +rcu_read_unlock();
>>>  }
>>
>> We can split this function in a different patch.
> 
> it calls the new function migrate_bypass_shared_memory().
> it is no a good idea to split it out.
> 
>> I haven't fully search
>> if we care about taking the rcu lock here.  The thing that I am more
>> interested is in knowing what happens when we don't set
>> migration_dirty_pages as the full "possible" memory pages.
> 
> I hadn't tested it with postcopy, I don't know how to use postcopy.
> From my review I can't find obvious bugs about it.
> 
> I don't think there is any good reason to use migrate_bypass
> and postcopy together,  I can disable the migrate_bypass
> when postcopy==true if you want.
> 
>>
>> Once here, should we check for ROM regions?
>>
>> BTW, could'nt we use:
>>
>> int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
>> {
>> RAMBlock *block;
>> int ret = 0;
>>
>> rcu_read_lock();
>> QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>> ret = func(block->idstr, block->host, block->offset,
>>block->used_length, opaque);
>> if (ret) {
>> break;
>> }
>> }
>> rcu_read_unlock();
>> return ret;
>> }
>>
> 
> the patch only introduces only one "QLIST_FOREACH_RCU(ram_list.blocks)"
> but
> # git grep 'QLIST_FOREACH_RCU.*ram_list'  | wc -l
> #   16
> 
> I don't want to introduce qemu_ram_foreach_block()
> and touch another 15 places.
> I hope someone do it after merged.
> 
> 
>>
>>
>>>
>>>  static void migration_bitmap_sync(void)
>>> @@ -631,7 +653,9 @@ static void migration_bitmap_sync(void)
>>>  qemu_mutex_lock(&migration_bitmap_mutex);
>>>  rcu_read_lock();
>>>  QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>>> -migration_bitmap_sync_range(block->offset, block->used_length);
>>> +if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) 
>>> {
>>> +migration_bitmap_sync_range(block->offset, block->used_length);
>>> +}
>>>  }
>>>  rcu_read_unlock();
>>>  qemu_mutex_unlock(&migration_bitmap_mutex);
>>
>> Oops, another place where we were not using qemu_ram_foreach_block :p
>>
>>
>>> @@ -1926,19 +1950,14 @@ static int ram_save_setup

Re: [Qemu-devel] [PATCH V2] add migration capability to bypass the shared memory

2017-09-20 Thread Zhang Haoyu
Hi Jiangshan,

Any update from this patch?

Thanks,
Zhang Haoyu

On 2016/8/11 22:45, Lai Jiangshan wrote:
> Note, the old local migration patchset:
> https://lists.gnu.org/archive/html/qemu-devel/2013-12/msg00073.html
> 
> this patch can be considered as a new local migration implementation,
> but with more restrictions (the memory must set shared when boot the qemu)
> 
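
For reference, a minimal sketch of how such a local migration could be driven
once the capability is available (the backend line follows the example in the
patch description above; the socket path and the exact QMP invocation are
placeholders):

    # start the target QEMU with the same command line plus -incoming
    qemu-system-x86_64 ... \
        -object memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
        -numa node,nodeid=0,cpus=0-7,memdev=mem \
        -incoming unix:/tmp/local-mig.sock

    # on the source, via QMP: enable the capability, then migrate; only the
    # device state is sent, the RAM stays in the shared tmpfs file
    { "execute": "migrate-set-capabilities",
      "arguments": { "capabilities": [
          { "capability": "bypass-shared-memory", "state": true } ] } }
    { "execute": "migrate", "arguments": { "uri": "unix:/tmp/local-mig.sock" } }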



Re: [Qemu-devel] [RFC] introduce bitmap to bdrv_commit to track dirty sector

2015-03-09 Thread Zhang Haoyu

On 2015-03-10 10:58:44, Fam Zheng wrote:
> > Is it worthy to implement it?
> 
> Maybe, I think once we start to assign node-name to each BDS in the tree, this
> will be natural to implement.
> 
Once that is implemented, to be honest, I really don't see many advantages of
incremental backup over the aforementioned mechanism (external snapshot + active
block commit);
especially in the case of a non-normal shutdown of the source VM, a full copy is
needed, and a corresponding check mechanism is needed too.
I think too many exceptional scenarios have to be considered for an
incremental backup mechanism.
Maybe I missed something.

Thanks,
Zhang Haoyu
> > And does qemu support commit any external snapshot to its backing file?
> 
> Yes.




Re: [Qemu-devel] [RFC] introduce bitmap to bdrv_commit to track dirty sector

2015-03-09 Thread Zhang Haoyu

On 2015-03-10 09:54:47, Fam Zheng wrote:
> On Tue, 03/10 09:30, Zhang Haoyu wrote:
> > 
> > On 2015-03-10 08:29:19, Fam Zheng wrote:
> > > On Mon, 03/09 16:14, Zhang Haoyu wrote:
> > > > Hi John, Vladimir 
> > > > We can using active block commit to implement incremental backup 
> > > > without guest disruption,
> > > > e.g., 
> > > > origin <= A <= B <= C <= current BDS,
> > > > a new external snapshot will be produced before every time backup,
> > > > we can migrate A, B, C, ... to destination, 
> > > > then commit_active_start() the unneeded snapshot in source or 
> > > > destination end.
> > > > 
> > > > So, comparing with above mechanism,
> > > > what's the advantages of the incremental backup implemented by John and 
> > > > Vladimir?
> > > 
> > > We can't migrate A, B, C because they are buried in the backing chain 
> > > under
> > > "current BDS". 
> > I think we can backup the incremental image(e.g., A, B, C) to destination 
> > in top mode,
> > although I haven't read the code in detail, it can work theoretically, I 
> > think.
> 
> No, we don't have any command do that.
> 
Is it worth implementing?
And does qemu support committing any external snapshot to its backing file?

> > 
> > > Even if we do that, there will be severe IO amplification
> > > compared to the dirty bitmap way.
> > > 
> > Yes, block-commit will produce extra IO.
> > But regarding incremental backup, when guest IO is performed, 
> > will the corresponding dirty bit be synchronized to qcow2 image 
> > simultaneously?
> 
> No, that would have a huge performance penalty. It will only be synced
> at shutdown and or periodically, therefore it has the same implications with
> other cache, such as page cache or block driver metadata cache.
> 
> > if no, if source VM is shut-down in non-normal way, like killed by force or 
> > by mistake or server poweroff suddenly,
> > some dirty bits maybe lost, then full copy is needed.
> 
> Yes, it is a reasonable rule.
> 
> > > 
> > drive-backup is not incremental backup, full copy is needed in every time 
> > backup,
> > so it dosen't meet our requirements.
> 
> I didn't mean drive-backup already provides incremental backup, but we do need
> it to implement it (see the patch series posted by John Snow).
> 
> Thanks,
> Fam




Re: [Qemu-devel] question about live migration with storage

2015-03-09 Thread Zhang Haoyu

On 2015-01-15 18:08:39, Paolo Bonzini wrote:
> 
> On 15/01/2015 10:56, Zhang Haoyu wrote:
> > I see, when waiting the completion of drive_mirror IO, the coroutine will be
> > switched back to main-thread to poll and process other events, like qmp 
> > request,
> > then after the IO completed, coroutine will be switched back in main-loop 
> > process, right?
> > 
> > Another question:
> > while starting to migrate storage, will the unused sector be allocated and 
> > transferred?
> > regarding the thin-provisioning qcow2 disk, will extra large IO requests be
> > performed, and more data be transferred for the unallocated clusters?
> 
> No.
> 
My test results show that
when using a thin-provisioned qcow2 image (created by qemu-img create -f qcow2
-o preallocation=metadata),
even the unallocated sectors are transferred to the destination, so much more
data is transferred,
and the qcow2 image on the destination ends up fully allocated.

Thanks,
Zhang Haoyu
> Paolo
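
One way to verify the observation above is to compare the allocation on both
ends after the migration, e.g. (image paths are placeholders):

    # 'disk size' shows the actual allocation of each file
    qemu-img info /vms/src.qcow2
    qemu-img info /vms/dst.qcow2

    # per-range allocation map; unallocated ranges appear with "data": false
    qemu-img map --output=json /vms/dst.qcow2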




Re: [Qemu-devel] [RFC] introduce bitmap to bdrv_commit to track dirty sector

2015-03-09 Thread Zhang Haoyu

On 2015-03-10 08:29:19, Fam Zheng wrote:
> On Mon, 03/09 16:14, Zhang Haoyu wrote:
> > Hi John, Vladimir 
> > We can using active block commit to implement incremental backup without 
> > guest disruption,
> > e.g., 
> > origin <= A <= B <= C <= current BDS,
> > a new external snapshot will be produced before every time backup,
> > we can migrate A, B, C, ... to destination, 
> > then commit_active_start() the unneeded snapshot in source or destination 
> > end.
> > 
> > So, comparing with above mechanism,
> > what's the advantages of the incremental backup implemented by John and 
> > Vladimir?
> 
> We can't migrate A, B, C because they are buried in the backing chain under
> "current BDS". 
I think we can back up the incremental images (e.g., A, B, C) to the destination
in top mode;
although I haven't read the code in detail, I think it can work theoretically.

> Even if we do that, there will be severe IO amplification
> compared to the dirty bitmap way.
> 
Yes, block-commit will produce extra IO.
But regarding incremental backup, when guest IO is performed,
will the corresponding dirty bit be synchronized to the qcow2 image simultaneously?
If yes, it will impact IO throughput in two ways, I think:
1) extra IO, 2) randomized IO;
if no, then if the source VM is shut down in a non-normal way, e.g., killed by
force or by mistake, or the server powers off suddenly,
some dirty bits may be lost and a full copy is needed.

> For the same reason we have drive-backup (write intercept) instead of a
> "snapshot + drive-mirror" analogue.
> 
drive-backup is not incremental backup; a full copy is needed for every backup,
so it doesn't meet our requirements.

Zhang Haoyu
> Fam




Re: [Qemu-devel] [RFC] introduce bitmap to bdrv_commit to track dirty sector

2015-03-09 Thread Zhang Haoyu
Hi John, Vladimir,
We can use active block commit to implement incremental backup without guest
disruption, e.g.,
origin <= A <= B <= C <= current BDS;
a new external snapshot is produced before each backup, we can migrate A, B, C,
... to the destination,
and then commit_active_start() the now-unneeded snapshot on the source or
destination end.
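
A rough QMP sketch of that snapshot-and-commit flow (device name and file names
are placeholders):

    # freeze the current top image by putting a new external snapshot on top
    { "execute": "blockdev-snapshot-sync",
      "arguments": { "device": "drive-virtio-disk0",
                     "snapshot-file": "/vms/overlay-C.qcow2",
                     "format": "qcow2" } }

    # the previous top (now only read as a backing file) can be copied to the
    # backup destination while the guest keeps running

    # later, merge an unneeded intermediate overlay back into its backing file
    { "execute": "block-commit",
      "arguments": { "device": "drive-virtio-disk0",
                     "top": "/vms/overlay-B.qcow2" } }
    # (when 'top' is the active layer this is the commit_active_start() case
    #  mentioned above, and the job is finished with block-job-complete)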

So, comparing with the above mechanism,
what are the advantages of the incremental backup implemented by John and
Vladimir?

Thanks,
Zhang Haoyu

On 2015-03-09 15:38:40, Paolo Bonzini wrote:
>
>On 09/03/2015 08:03, Zhang Haoyu wrote:
>> 
>> On 2015-03-03 18:00:09, Paolo Bonzini wrote:
>>>
>>> On 03/03/2015 07:52, Zhang Haoyu wrote:
>>>> Hi,
>>>> If introducing bitmap to bdrv_commit to track dirty sector,
>>>> could we implement guest non-disruption while performing commit?
>>>
>>> That is already implemented.  It uses the same code that implements
>>> storage migration (block/mirror.c).
>>>
>> Hi Paolo,
>> do you mean commit_active_start()?
>
>Yes.
>
>Paolo




Re: [Qemu-devel] [RFC] introduce bitmap to bdrv_commit to track dirty sector

2015-03-09 Thread Zhang Haoyu

On 2015-03-03 18:00:09, Paolo Bonzini wrote:
> 
> On 03/03/2015 07:52, Zhang Haoyu wrote:
> > Hi,
> > If introducing bitmap to bdrv_commit to track dirty sector,
> > could we implement guest non-disruption while performing commit?
> 
> That is already implemented.  It uses the same code that implements
> storage migration (block/mirror.c).
> 
Hi Paolo,
do you mean commit_active_start()?

> Paolo




[Qemu-devel] [RFC] introduce bitmap to bdrv_commit to track dirty sector

2015-03-02 Thread Zhang Haoyu
Hi,
If we introduce a bitmap to bdrv_commit to track dirty sectors,
could commit be performed without guest disruption?

Thanks,
Zhang Haoyu




[Qemu-devel] [PATCH] fix mc146818rtc wrong subsection name to avoid vmstate_subsection_load() fail

2015-02-05 Thread Zhang Haoyu
fix mc146818rtc wrong subsection name to avoid vmstate_subsection_load() fail
during incoming migration or loadvm.

Signed-off-by: Zhang Haoyu 
---
 hw/timer/mc146818rtc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
index 5a107fa..0600c9a 100644
--- a/hw/timer/mc146818rtc.c
+++ b/hw/timer/mc146818rtc.c
@@ -734,7 +734,7 @@ static int rtc_post_load(void *opaque, int version_id)
 }
 
 static const VMStateDescription vmstate_rtc_irq_reinject_on_ack_count = {
-.name = "irq_reinject_on_ack_count",
+.name = "mc146818rtc/irq_reinject_on_ack_count",
 .version_id = 1,
 .minimum_version_id = 1,
 .fields = (VMStateField[]) {
-- 
1.8.3.1




Re: [Qemu-devel] [RFC] optimization for qcow2 cache get/put

2015-01-26 Thread Zhang Haoyu

On 2015-01-27 09:24:13, Zhang Haoyu wrote:
> 
> On 2015-01-26 22:11:59, Max Reitz wrote:
> >On 2015-01-26 at 08:20, Zhang Haoyu wrote:
>> > Hi, all
> > >
> > > Regarding too large qcow2 image, e.g., 2TB,
> > > so long disruption happened when performing snapshot,
> >> which was caused by cache update and IO wait.
> > > perf top data shown as below,
> > > PerfTop:2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz 
> > > cycles],  (target_pid: 34294)
> > > 
> >>
> > >  33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
> > >  27.59%  qemu-system-x86_64  [.] qcow2_cache_put
> > >  15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty
> > >   5.49%  qemu-system-x86_64  [.] update_refcount
> >>   3.02%  libpthread-2.13.so  [.] pthread_getspecific
> > >   2.26%  qemu-system-x86_64  [.] get_refcount
> > >   1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
>> >   1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
> >>   1.20%  qemu-system-x86_64  [.] qemu_coroutine_self
> > >   1.16%  libz.so.1.2.7   [.] 0x3018
> > >   0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount
> > >   0.91%  qemu-system-x86_64  [.] qcow2_cache_get
> > >   0.76%  libc-2.13.so[.] 0x00134e49
> >>   0.73%  qemu-system-x86_64  [.] bdrv_debug_event
> > >   0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt
> > >   0.12%  [kernel][k] _raw_spin_unlock_irqrestore
> > >   0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
> >>   0.09%  [vdso]  [.] 0x060c
> > >   0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap
> > >   0.08%  [kernel][k] do_blockdev_direct_IO
> > >
> > > If expand the cache table size, the IO will be decreased,
> >> but the calculation time will be grown.
>> > so it's worthy to optimize qcow2 cache get and put algorithm.
> > >
> > > My proposal:
> >> get:
> > > using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> > > raw implementation,
> > > index = (use offset >> cluster_bits) % c->size;
> > > if (c->entries[index].offset == offset) {
> >>  goto found;
> > > }
> > >
> > > replace:
> >> c->entries[use offset >> cluster_bits) % c->size].offset = offset;
> > 
> > Well, direct-mapped caches do have their benefits, but remember that 
> > they do have disadvantages, too. Regarding CPU caches, set associative 
>> caches seem to be largely favored, so that may be a better idea.
> >
> Thanks, Max,
> I think if direct-mapped caches were used, we can expand the cache table size
> to decrease IOs, and cache location is not time-expensive even cpu cache miss
> happened.
> Of course set associative caches is preferred regarding cpu caches,
> but sequential traverse algorithm only provides more probability
> for association, but after running some time, the probability
> of association maybe reduced, I guess.
> I will test the direct-mapped cache, and test result will be posted soon.
> 
I've tested the direct-mapped cache; the conflicts in cache location caused
about 4000 IOs while performing a snapshot of a 2TB thin-provisioned qcow2 image.
But the overhead of qcow2_cache_do_get() decreased significantly, from
33.80% to 10.43%.
I'll try a two-dimensional cache to reduce most of that IO, possibly to zero,
with 4 as the default size of the second dimension (see the sketch below).

Any ideas?
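
A minimal sketch of such a two-dimensional (set-associative) lookup with 4
entries per set; the names are made up for illustration and do not match the
qcow2 code:

    #include <stdint.h>

    #define CACHE_WAYS 4     /* the "second dimension" size suggested above */

    typedef struct CachedTable {
        int64_t offset;      /* table offset in the image, -1 if unused */
    } CachedTable;

    /* the set index comes from the direct-mapped hash; only the 4 ways of
     * that set are searched, never the whole table */
    static int cache_lookup(CachedTable *entries, int num_sets,
                            int cluster_bits, int64_t offset)
    {
        int set = (int)((offset >> cluster_bits) % num_sets);
        int way;

        for (way = 0; way < CACHE_WAYS; way++) {
            int idx = set * CACHE_WAYS + way;
            if (entries[idx].offset == offset) {
                return idx;  /* hit */
            }
        }
        return -1;           /* miss: the caller evicts one way of this set */
    }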

> > CC'ing Kevin, because it's his code.
> > 
> > Max
> >
>> > ...
> > >
> > > put:
> > > using 64-entries cache table to cache
> >> the recently got c->entries, i.e., cache for cache,
> > > then during put process, firstly search the 64-entries cache,
> > > if not found, then the c->entries.
I've tried a c->last_used_cache pointer for the most recently fetched
cache entry; the overhead of qcow2_cache_put() decreased significantly,
from 27.59% to 5.38%.
I've also traced the c->last_used_cache miss rate: absolutely zero.
I'll test again.
> > >
> >> Any idea?
> > >
> > > Thanks,
> > > Zhang Haoyu




Re: [Qemu-devel] [RFC] optimization for qcow2 cache get/put

2015-01-26 Thread Zhang Haoyu

On 2015-01-26 22:11:59, Max Reitz wrote:
>On 2015-01-26 at 08:20, Zhang Haoyu wrote:
> > Hi, all
> >
> > Regarding too large qcow2 image, e.g., 2TB,
> > so long disruption happened when performing snapshot,
>> which was caused by cache update and IO wait.
> > perf top data shown as below,
> > PerfTop:2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  
> > (target_pid: 34294)
> > 
>>
> >  33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
> >  27.59%  qemu-system-x86_64  [.] qcow2_cache_put
> >  15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty
> >   5.49%  qemu-system-x86_64  [.] update_refcount
>>   3.02%  libpthread-2.13.so  [.] pthread_getspecific
> >   2.26%  qemu-system-x86_64  [.] get_refcount
> >   1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
> >   1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
>>   1.20%  qemu-system-x86_64  [.] qemu_coroutine_self
> >   1.16%  libz.so.1.2.7   [.] 0x3018
> >   0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount
> >   0.91%  qemu-system-x86_64  [.] qcow2_cache_get
> >   0.76%  libc-2.13.so[.] 0x00134e49
>>   0.73%  qemu-system-x86_64  [.] bdrv_debug_event
> >   0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt
> >   0.12%  [kernel][k] _raw_spin_unlock_irqrestore
> >   0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
>>   0.09%  [vdso]  [.] 0x060c
> >   0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap
> >   0.08%  [kernel][k] do_blockdev_direct_IO
> >
> > If expand the cache table size, the IO will be decreased,
>> but the calculation time will be grown.
> > so it's worthy to optimize qcow2 cache get and put algorithm.
> >
> > My proposal:
>> get:
> > using ((use offset >> cluster_bits) % c->size) to locate the cache entry,
> > raw implementation,
> > index = (use offset >> cluster_bits) % c->size;
> > if (c->entries[index].offset == offset) {
>>  goto found;
> > }
> >
> > replace:
>> c->entries[use offset >> cluster_bits) % c->size].offset = offset;
> 
> Well, direct-mapped caches do have their benefits, but remember that 
> they do have disadvantages, too. Regarding CPU caches, set associative 
> caches seem to be largely favored, so that may be a better idea.
>
Thanks, Max.
I think if a direct-mapped cache were used, we could expand the cache table size
to decrease IOs, and locating a cache entry would not be time-expensive even if
a CPU cache miss happened.
Of course a set-associative cache is preferred for CPU caches, but the
sequential traversal algorithm only provides more opportunity for associativity,
and after running for some time that benefit may be reduced, I guess.
I will test the direct-mapped cache, and the test results will be posted soon.

> CC'ing Kevin, because it's his code.
> 
> Max
>
> > ...
> >
> > put:
> > using 64-entries cache table to cache
>> the recently got c->entries, i.e., cache for cache,
> > then during put process, firstly search the 64-entries cache,
> > if not found, then the c->entries.
> >
>> Any idea?
> >
> > Thanks,
> > Zhang Haoyu




[Qemu-devel] [RFC] optimization for qcow2 cache get/put

2015-01-26 Thread Zhang Haoyu
Hi, all

Regarding very large qcow2 images, e.g., 2TB, a long disruption happens when
performing a snapshot, which is caused by cache updates and IO waits.
perf top data shown as below,
   PerfTop:2554 irqs/sec  kernel: 0.4%  exact:  0.0% [4000Hz cycles],  
(target_pid: 34294)


33.80%  qemu-system-x86_64  [.] qcow2_cache_do_get
27.59%  qemu-system-x86_64  [.] qcow2_cache_put   
15.19%  qemu-system-x86_64  [.] qcow2_cache_entry_mark_dirty  
 5.49%  qemu-system-x86_64  [.] update_refcount   
 3.02%  libpthread-2.13.so  [.] pthread_getspecific   
 2.26%  qemu-system-x86_64  [.] get_refcount  
 1.95%  qemu-system-x86_64  [.] coroutine_get_thread_state
 1.32%  qemu-system-x86_64  [.] qcow2_update_snapshot_refcount
 1.20%  qemu-system-x86_64  [.] qemu_coroutine_self   
 1.16%  libz.so.1.2.7   [.] 0x3018
 0.95%  qemu-system-x86_64  [.] qcow2_update_cluster_refcount 
 0.91%  qemu-system-x86_64  [.] qcow2_cache_get   
 0.76%  libc-2.13.so[.] 0x00134e49
 0.73%  qemu-system-x86_64  [.] bdrv_debug_event  
 0.16%  qemu-system-x86_64  [.] pthread_getspecific@plt   
 0.12%  [kernel][k] _raw_spin_unlock_irqrestore   
 0.10%  qemu-system-x86_64  [.] vga_draw_line24_32
 0.09%  [vdso]  [.] 0x060c
 0.09%  qemu-system-x86_64  [.] qcow2_check_metadata_overlap  
 0.08%  [kernel][k] do_blockdev_direct_IO  

If we expand the cache table size, the IO will decrease,
but the lookup time will grow,
so it's worthwhile to optimize the qcow2 cache get and put algorithms.

My proposal:
get:
use ((offset >> cluster_bits) % c->size) to locate the cache entry; raw implementation:
    index = (offset >> cluster_bits) % c->size;
    if (c->entries[index].offset == offset) {
        goto found;
    }

replace:
    c->entries[(offset >> cluster_bits) % c->size].offset = offset;
...

put:
use a 64-entry table to cache the recently fetched c->entries, i.e., a cache for
the cache; then during the put process, first search the 64-entry table and,
only if the entry is not found there, fall back to c->entries.
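
A small self-contained sketch of the proposed direct-mapped get, plus a
most-recently-used shortcut for put (a single index stands in for the 64-entry
cache-for-cache idea; all names are illustrative and do not match the real
qcow2 structures):

    #include <stdint.h>

    typedef struct Entry {
        int64_t offset;      /* table offset in the image, -1 if unused */
        void   *table;       /* cached table data */
        int     ref;
    } Entry;

    typedef struct Cache {
        Entry *entries;
        int    size;
        int    cluster_bits;
        int    last_used;    /* index handed out by the last get */
    } Cache;

    /* get: direct-mapped, the slot is a pure function of the offset, so no
     * sequential scan of the whole table array is needed */
    static void *cache_get(Cache *c, int64_t offset)
    {
        int index = (int)((offset >> c->cluster_bits) % c->size);

        if (c->entries[index].offset != offset) {
            /* miss: the real code would flush the old occupant and read the
             * table at 'offset' from the image into this slot */
            c->entries[index].offset = offset;
        }
        c->entries[index].ref++;
        c->last_used = index;
        return c->entries[index].table;
    }

    /* put: check the most recently used slot first, else scan */
    static void cache_put(Cache *c, void *table)
    {
        int i = c->last_used;

        if (c->entries[i].table != table) {
            for (i = 0; i < c->size && c->entries[i].table != table; i++) {
                ;
            }
        }
        if (i < c->size) {
            c->entries[i].ref--;
        }
    }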

Any idea?

Thanks,
Zhang Haoyu




Re: [Qemu-devel] [question] incremental backup a running vm

2015-01-26 Thread Zhang Haoyu

On 2015-01-26 19:29:03, Paolo Bonzini wrote:
>
> On 26/01/2015 12:13, Zhang Haoyu wrote:
> > Thanks, Paolo,
> > but too many internal snapshots were saved by customers,
>> switching to external snapshot mechanism has significant impaction
> > on subsequent upgrade.
> 
> In that case, patches are welcome. :)
>
> > Another problem:
> > drive_backup just implement one time backup,
> > but I want VMWare's VDP-like backup mechanism.
> > The initial backup of a virtual machine takes comparatively more time,
>> because all of the data for that virtual machine is being backed up. 
> > Subsequent backups of the same virtual machine take less time, because
> > changed block tracking (log dirty) mechanism is used to only backup the 
> > dirty data.
> > After inittial backup done, even the VM shutdown, but subsequent backup 
> > also only 
>> copy the changed data.
> 
> As mentioned before, patches for this are on the list.
> 
I see, thanks, Paolo.
> Paolo




Re: [Qemu-devel] [question] incremental backup a running vm

2015-01-26 Thread Zhang Haoyu

On 2015-01-26 17:29:43, Paolo Bonzini wrote:
>
> On 26/01/2015 02:07, Zhang Haoyu wrote:
> > Hi, Kashyap
> > I've tried ‘drive_backup’ via QMP,
>> but the snapshots were missed to backup to destination,
> > I think the reason is that backup_run() only copy the
> > guest data regarding qcow2 image.
> 
>Yes, that's the case.
> 
> QEMU cannot still access internal snapshots while the file is open.
> External snapshots are opened read-only, and can be copied with "cp"
> while QEMU is running.
Thanks, Paolo,
but too many internal snapshots have been saved by customers, and
switching to the external snapshot mechanism has a significant impact
on subsequent upgrades.

Another problem:
drive_backup implements only a one-time backup,
but I want a VMware VDP-like backup mechanism.
The initial backup of a virtual machine takes comparatively more time,
because all of the data for that virtual machine is being backed up.
Subsequent backups of the same virtual machine take less time, because a
changed block tracking (dirty log) mechanism is used to back up only the dirty
data.
After the initial backup is done, even if the VM is shut down, subsequent
backups still copy only the changed data.

Thanks,
Zhang Haoyu
>
> Paolo


Re: [Qemu-devel] [question] incremental backup a running vm

2015-01-25 Thread Zhang Haoyu

On 2015-01-23 07:30:19, Kashyap Chamarthy wrote:
>On Wed, Jan 21, 2015 at 11:39:44AM +0100, Paolo Bonzini wrote:
> > 
> > 
> > On 21/01/2015 11:32, Zhang Haoyu wrote:
> > > Hi,
>> > 
> > > Does drive_mirror support incremental backup a running vm?
> > > Or other mechanism does?
> > > 
>> > incremental backup a running vm requirements:
> > > First time backup, all of the allocated data will be mirrored to 
> > > destination,
> > > then a copied bitmap will be saved to a file, then the bitmap file will 
> > > log dirty for
> > > the changed data.
> > > Next time backup, only the dirty data will be mirrored to destination.
>> > Even the VM shutdown and start after several days,
> > > the bitmap will be loaded while starting vm.
> > > Any ideas?
> > 
>> Drive-mirror is for storage migration.  For backup there is another job,
> > drive-backup.  drive-backup copies a point-in-time snapshot of one or
> > more disks corresponding to when the backup was started.
> 
> Zhang, I've been testing the `drive-backup` command via QMP for a little
>while, and it works reasonably well. If you'd like to test it you can
> quickly try as below, once you have a QEMU instance runing with QMP
> (I invoke my QEMU instances with '-qmp tcp:localhost:,server').
> 
Hi, Kashyap,
I've tried ‘drive_backup’ via QMP,
but the snapshots were not backed up to the destination;
I think the reason is that backup_run() only copies the
guest data of the qcow2 image.

Thanks,
Zhang Haoyu
>The below script invokes the 'drive-backup' QMP command (Ensure you're
> using the correct disk, your disk ID might be 'drive-virtio-disk1' :-) )
> -
> #!/bin/bash
> set -x
>exec 3<>/dev/tcp/localhost/
> echo -e "{ 'execute': 'qmp_capabilities' }" >&3
> read response <&3
> echo $response
>echo -e "{ 'execute': 'drive-backup', 'arguments':
>   { 'device': 'drive-virtio-disk0', 'sync': 'full', 'target':
>   '/export/backup-of-vm1.qcow2', 'mode': 'absolute-paths', 'format': 
> 'qcow2' } }" >&3
> read response <&3
> echo $response
>-
> 
> Once the above is invoked succesfully, you can see the success of the
> command on the shell where your QMP server is running:
>-
> [. . .]
> char device redirected to /dev/pts/8 (label charserial0)
> QEMU waiting for connection on: disconnected:tcp:localhost:,server
> Formatting '/export/backup-of-vm1.qcow2', fmt=qcow2 size=53687091200 
> encryption=off cluster_size=65536 lazy_refcounts=off 
>-
> 
> 
> > Incremental backup is being worked on.  You can see patches on the list.
> 
> 
> -- 
> /kashyap


[Qemu-devel] [question] incremental backup a running vm

2015-01-21 Thread Zhang Haoyu
Hi,

Does drive_mirror support incrementally backing up a running VM?
Or does some other mechanism?

Requirements for incrementally backing up a running VM:
On the first backup, all of the allocated data will be mirrored to the destination,
then a copy of the bitmap will be saved to a file, and the bitmap file will then
log dirtying of the changed data.
On the next backup, only the dirty data will be mirrored to the destination.
Even if the VM is shut down and started again after several days,
the bitmap will be loaded while starting the VM.
Any ideas?

Thanks,
Zhang Haoyu
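
For reference, the dirty-bitmap interface that later landed in QEMU looks
roughly like this (a sketch only; device and file names are placeholders, and a
QEMU version with block-dirty-bitmap-add and drive-backup 'sync': 'incremental'
support is assumed):

    # create the tracking bitmap and take the first, full backup (in practice
    # the two commands are grouped into one 'transaction' so no write is missed)
    { "execute": "block-dirty-bitmap-add",
      "arguments": { "node": "drive-virtio-disk0", "name": "bitmap0" } }
    { "execute": "drive-backup",
      "arguments": { "device": "drive-virtio-disk0", "sync": "full",
                     "target": "/backup/full.qcow2", "format": "qcow2" } }

    # later backups copy only the clusters dirtied since the previous run
    { "execute": "drive-backup",
      "arguments": { "device": "drive-virtio-disk0", "sync": "incremental",
                     "bitmap": "bitmap0",
                     "target": "/backup/inc1.qcow2", "format": "qcow2" } }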




Re: [Qemu-devel] [PATCH] spice-char: fix wrong assert condition

2015-01-18 Thread Zhang Haoyu

On 2015-01-17 19:55:16, Peter Maydell wrote:
>On 17 January 2015 at 11:52, Peter Maydell  wrote:
> > On 17 January 2015 at 06:48, Zhang Haoyu  wrote:
> >> G_IO_OUT|G_IO_HUP are passed from all of the callers
> >> of chr_add_watch hooker, the assert condition MUST be
> >> changed.
>>
> > "All the callers do this currently" isn't a reason to
> > change an assert. Possible reasons to change it include:
> >  * "X is required because if we don't then things break because Y"
>>  * "it is meaningless (a programming error) to do X, because Y"
> >
> > Can you give the reasoning for changing this assert?
> > Does it apply also to other add_watch hook functions?
> > Would the assert be better at a higher level in the callstack?
>
> ...that said, I've just figured out what your patch is trying
> to fix. I think that kind of makes the point about the commit
> message not really being very clear about why the change is
>being made, though. (Should we be checking
> '(cond & G_IO_OUT) != 0'? 
I think '(cond & G_IO_OUT) != 0' is better; v2 will be posted.
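
(Presumably the v2 check would then read roughly as below; this is a guess at
the follow-up, which is not shown in this thread.)

    assert(cond & G_IO_OUT);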

> Is this a bandaid over missing
> functionality in the spice chardriver?)
I really don't know about this.
> 
> -- PMM




[Qemu-devel] [PATCH] spice-char: fix wrong assert condition

2015-01-16 Thread Zhang Haoyu
G_IO_OUT|G_IO_HUP is passed by all of the callers
of the chr_add_watch hook, so the assert condition MUST be
changed.

Signed-off-by: Zhang Haoyu 
---
 spice-qemu-char.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spice-qemu-char.c b/spice-qemu-char.c
index 7e0d300..b81a9db 100644
--- a/spice-qemu-char.c
+++ b/spice-qemu-char.c
@@ -169,7 +169,7 @@ static GSource *spice_chr_add_watch(CharDriverState *chr, 
GIOCondition cond)
 SpiceCharDriver *scd = chr->opaque;
 SpiceCharSource *src;
 
-assert(cond == G_IO_OUT);
+assert(cond == (G_IO_OUT|G_IO_HUP));
 
 src = (SpiceCharSource *)g_source_new(&SpiceCharSourceFuncs,
   sizeof(SpiceCharSource));
-- 
1.7.12.4




Re: [Qemu-devel] question about live migration with storage

2015-01-15 Thread Zhang Haoyu

On 2015-01-15 17:11:49, Paolo Bonzini wrote:
> 
> On 15/01/2015 04:54, Zhang Haoyu wrote:
> >>>> > >> 2) Finer-grain control the parameters of block migration (dirty 
> >>>> > >> bitmap
> >>>> > >> granularity).
> >>>> > >>
> >>>> > >> 3) Block and RAM migration do not share the same socket and thus can
> >>>> > >> more easily be parallelized.
> >>>> > >>
> > drive_mirror job is done in main-thread,
> > then how to accept the qmp_migrate request while drive_mirror is performing?
> > If need to wait for the completion of drive_mirror,
> > how to implement the parallelization between block and ram migration?
> 
> drive_mirror runs in the background, using coroutines.
> 
I see: while waiting for the completion of drive_mirror IO, the coroutine is
switched back to the main thread to poll and process other events, like QMP requests,
and then, after the IO completes, the coroutine is switched back in main-loop
processing, right?

Another question:
when starting to migrate storage, will the unused sectors be allocated and
transferred?
Regarding a thin-provisioned qcow2 disk, will extra large IO requests be
performed, and more data be transferred, for the unallocated clusters?

Thanks,
Zhang Haoyu
> Paolo




Re: [Qemu-devel] question about live migration with storage

2015-01-14 Thread Zhang Haoyu

On 2015-01-14 17:07:08, Paolo Bonzini wrote:
> 
> 
> On 14/01/2015 08:58, Zhang Haoyu wrote:
> >> 2) Finer-grain control the parameters of block migration (dirty bitmap
> >> granularity).
> >>
> >> 3) Block and RAM migration do not share the same socket and thus can
> >> more easily be parallelized.
> >>
The drive_mirror job is done in the main thread,
so how can the qmp_migrate request be accepted while drive_mirror is running?
If we need to wait for the completion of drive_mirror,
how is the parallelization between block and RAM migration implemented?

Thanks,
Zhang Haoyu
> > But, because of the parallelization, how to calculate the progress?
> 
> You can use "query-block-jobs".
> 
> > BTW, the traditional iterative mechanism is buggy-implemented?
> > e.g., some bugs which are very difficult to fixed, or something?
> 
> There are no bugs that corrupt data. It's simply less flexible.
> 
> Regarding parallelization, the problems happen especially if you disable
> migration bandwidth limits.  Then you'll see large chunks of RAM and
> large chunks of block device data being sent.  This is a problem for
> convergence in some workloads. Instead, with NBD-based storage migration
> everything happens in parallel, and TCP makes sure that bandwidth is
> split between the sockets.
> 
> If you have a migration bandwidth limit, NBD-based storage migration
> will use a separate bandwidth limit for each disk and for RAM. Thus
> network usage is less predictable than with block-migration.c.  On the
> other hand, storage migration uses a lot of network so it's usually more
> likely that you do not have migration bandwidth limits.
> 
> Paolo
> 
> > Thanks,
> > Zhang Haoyu
> >> Note that 1-2 are not yet supported by libvirt as far as I remember.
> >>
> >> Paolo
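
A rough sketch of how an overall progress figure can be derived from the two
queries mentioned above (the field names follow the query-migrate and
query-block-jobs replies; the arithmetic is only an example):

    { "execute": "query-migrate" }
    # -> "ram": { "transferred": ..., "total": ... }

    { "execute": "query-block-jobs" }
    # -> [ { "device": "drive-virtio-disk0", "offset": ..., "len": ... }, ... ]

    # combined progress, e.g.:
    #   (ram.transferred + sum(job.offset)) / (ram.total + sum(job.len))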




Re: [Qemu-devel] question about live migration with storage

2015-01-14 Thread Zhang Haoyu

On 2015-01-14 15:42:41, Paolo Bonzini wrote:
> 
> On 14/01/2015 03:41, Zhang Haoyu wrote:
> > Hi, Paolo,
> > what's advantages of drive_mirror over traditional mechanism implemented in 
> > block-migration.c ?
> > Why libvirt use drive_mirror instead of traditional iterative mechanism as 
> > the default way
> > of live migration with non-shared storage?
> 
> 1) Being able to choose which block devices are migrated, and whether
> they are migrated incrementally or not.
> 
Yes.
> 2) Finer-grain control the parameters of block migration (dirty bitmap
> granularity).
> 
> 3) Block and RAM migration do not share the same socket and thus can
> more easily be parallelized.
> 
But, because of the parallelization, how do we calculate the progress?

BTW, is the traditional iterative mechanism buggy in its implementation,
e.g., with some bugs that are very difficult to fix, or something?

Thanks,
Zhang Haoyu
> Note that 1-2 are not yet supported by libvirt as far as I remember.
> 
> Paolo




Re: [Qemu-devel] question about live migration with storage

2015-01-13 Thread Zhang Haoyu

On 2015-01-13 17:45:45, Paolo Bonzini wrote:
> 
> On 13/01/2015 03:03, Zhang Haoyu wrote:
> > >I want to live migrate a vm with storage, with regard to the migration of 
> > >storage,
> > >should I use drive_mirror or traditional mechanism implemented in 
> > >block-migration.c ?
> > 
> > Because I don't use libvirtd to manage vm,
> > if I want to use drive_mirror to perform live migration with storage,
> > how to organize the flow via script?
> 
> The same as libvirt does.
> 
Hi, Paolo,
what are the advantages of drive_mirror over the traditional mechanism implemented in
block-migration.c?
Why does libvirt use drive_mirror instead of the traditional iterative mechanism as the
default way
of doing live migration with non-shared storage?

Thanks,
Zhang Haoyu
> Paolo




Re: [Qemu-devel] question about live migration with storage

2015-01-12 Thread Zhang Haoyu

On 2015-01-13 09:49:00, Zhang Haoyu wrote:
>Hi,
>
>I want to live migrate a vm with storage, with regard to the migration of 
>storage,
>should I use drive_mirror or traditional mechanism implemented in 
>block-migration.c ?
Because I don't use libvirtd to manage VMs,
if I want to use drive_mirror to perform live migration with storage,
how do I organize the flow via a script? (See the sketch at the end of this message.)

>Any advices?
>
>Thanks,
>Zhang Haoyu
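
A rough sketch of the libvirt-style flow over QMP (host, port, device name and
image path are placeholders, and event/error handling is omitted):

    # destination: start QEMU with the same devices, an empty copy of the
    # image and e.g. '-incoming tcp:0:4444', then export the disk over NBD
    { "execute": "nbd-server-start",
      "arguments": { "addr": { "type": "inet",
                               "data": { "host": "0.0.0.0", "port": "10809" } } } }
    { "execute": "nbd-server-add",
      "arguments": { "device": "drive-virtio-disk0", "writable": true } }

    # source: mirror the disk into the destination's NBD export
    { "execute": "drive-mirror",
      "arguments": { "device": "drive-virtio-disk0",
                     "target": "nbd:dst-host:10809:exportname=drive-virtio-disk0",
                     "sync": "full", "format": "raw", "mode": "existing" } }

    # when the BLOCK_JOB_READY event arrives, start the RAM/device migration;
    # afterwards cancel the mirror job on the source and stop the NBD server
    { "execute": "migrate", "arguments": { "uri": "tcp:dst-host:4444" } }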




[Qemu-devel] question about live migration with storage

2015-01-12 Thread Zhang Haoyu
Hi,

I want to live-migrate a VM with storage; with regard to the migration of the
storage,
should I use drive_mirror or the traditional mechanism implemented in
block-migration.c?
Any advice?

Thanks,
Zhang Haoyu 




Re: [Qemu-devel] How to clone a running vm?

2015-01-12 Thread Zhang Haoyu

On 2015-01-12 15:50:13, Zhang Haoyu wrote:
>Hi,
>
>I want to clone a running vm without shutoff,
>can below method work?
>1) create a snapshot for the vm
>2) create a new qcow2 image from the snapshot, but how?
>3) use the new qcow2 image as backing image to clone vms
>
Can drive_mirror clone a running vm without guest service disruption?

>Any ideas?
>
>Thanks,
>Zhang Haoyu




[Qemu-devel] How to clone a running vm?

2015-01-11 Thread Zhang Haoyu
Hi,

I want to clone a running VM without shutting it off;
can the method below work?
1) create a snapshot for the VM
2) create a new qcow2 image from the snapshot, but how?
3) use the new qcow2 image as a backing image to clone VMs

Any ideas?

Thanks,
Zhang Haoyu
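
One possible way to do step 2 without shutting the VM off is the
external-snapshot approach mentioned elsewhere in these threads; a sketch
(device names and paths are placeholders):

    # switch the running VM onto a new overlay; the old image becomes a
    # read-only backing file that is no longer written to
    { "execute": "blockdev-snapshot-sync",
      "arguments": { "device": "drive-virtio-disk0",
                     "snapshot-file": "/vms/overlay.qcow2",
                     "format": "qcow2" } }

    # the frozen image can now be copied, or used directly as a backing file
    qemu-img create -f qcow2 -b /vms/original.qcow2 -F qcow2 /vms/clone1.qcow2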
 




Re: [Qemu-devel] Does kvm friendly support GPT?

2015-01-11 Thread Zhang Haoyu

On 2014-12-22 09:28:52, Paolo Bonzini wrote:
>
>On 22/12/2014 07:39, Zhang Haoyu wrote:
>> Hi,
>> 
>> When I perform P2V from native servers with win2008 to kvm vm,
>> some cases failed due to the physical disk was using GPT for partition,
>> and QEMU doesn't support GPT by default.
>> 
>> And, I see in below site that OVMF can be used to enable UEFI to support GPT,
>> http://www.linux-kvm.org/page/OVMF
>> 
>> But, it seems that OVMF is not stable enough for kvm.
>
>OVMF is stable.  The main issues are:
>
I downloaded OVMF from https://github.com/tianocore/edk2.git,
then built and ran OVMF to enable UEFI to start Windows 7, Windows 8, Windows 2008,
and Windows 2012;
the test results are:
1) Windows 7 and Windows 2008 got stuck at the boot stage.
   Both the '-vga std' and '-vga qxl' QEMU options were tried, but neither
works.

2) Windows 8 and Windows 2012 started successfully and seem okay for now.
   But an error dialog of "not enough resources ..." is reported for the rtl8139 and
e1000 NICs in the Windows device manager;
   the virtio NIC is okay.

Any advice?

Thanks,
Zhang Haoyu

>1) tools support (libvirt and above), which is being worked on
>
>2) the FAT driver is not free, which prevents distribution in Fedora and
>several other distributions
>
>Paolo
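
For reference, the usual way to run an edk2/OVMF build is via two pflash drives
rather than -bios (a sketch; the file paths depend on the build):

    qemu-system-x86_64 \
        -drive if=pflash,format=raw,readonly=on,file=/path/to/OVMF_CODE.fd \
        -drive if=pflash,format=raw,file=/path/to/OVMF_VARS.fd \
        -m 2048 -vga std ...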




Re: [Qemu-devel] vhost-user: migration?

2015-01-09 Thread Zhang Haoyu
Hi,
what's the status of migration support for vhost-user?

Thanks,
Zhang Haoyu
   
On 2014-06-18 22:07:49, Michael S. Tsirkin wrote:
> On Wed, Jun 18, 2014 at 04:37:57PM +0300, Nikolay Nikolaev wrote:
> > 
> > 
> > 
> > On Wed, Jun 18, 2014 at 3:35 PM, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Jun 18, 2014 at 02:41:15PM +0300, Nikolay Nikolaev wrote:
> > >
> > >
> > >
> > > On Wed, Jun 18, 2014 at 2:29 PM, Michael S. Tsirkin 
> > wrote:
> > >
> > >     On Wed, Jun 18, 2014 at 02:24:32PM +0300, Nikolay Nikolaev wrote:
> > >     > Hello,
> > >     >
> > >     > On Wed, Jun 18, 2014 at 12:33 PM, Michael S. Tsirkin <
> > m...@redhat.com>
> > >     wrote:
> > >     >
> > >     >     Just a query whether migration works for you
> > >     >     guys with vhost-user and snabbswitch.
> > >     >
> > >     >
> > >     > I am not aware of anyone ever did such tests.
> > >     >
> > >     >
> > >     >     vhost user code seems to have enough hooks in
> > >     >     place to make it work, but was it tested
> > >     >     in practice?
> > >     >
> > >     >
> > >     >
> > >     > Can you give some pointers of a simple/quick test scenario. 
> > Maybe
> > we'll
> > >     be able
> > >     > to set it up.
> > >
> > >
> > >     Basically migrate while doing networking :)
> > >
> > >     Does your vhost server have code to set bits in the memory log
> > >     when logging is enabled?
> > >
> > >
> > > No - we didn't implement this feature.
> > 
> > Interesting. In that case vhost user should for now
> > block migration.
> > 
> > Any place in the source where we can see example of similar behaviour?
> >  
> > 
> > You must also check log feature and return failure.
> > 
> > Should this happen on the QEMU side? Is this the log feature of the device?
> > 
> > regards,
> > Nikolay Nikolaev
> >  
> 
> Well, QEMU is buggy in that it tries to set log
> flag without checking.
> I will fix that.
> 
> 
> However
> 1.  Your server must not publish the VHOST_F_LOG_ALL feature.
> 
> On vhost side it should validate features set
> match the supported mask.
> Also for flags in the virtqueue address structure.
> Please validate as much as possible, e.g.
> validate all unused fields are zero.
> If things work by mistake now you will have to maintain
> them indefinitely.
> 
> 
> 
> > 
> > 
> > >
> > >     The log address is supplied to VHOST_SET_LOG_BASE and 
> > log_guest_addr.
> > >
> > >
> > >     >
> > >     >
> > >     >     --
> > >     >     MST
> > >     >
> > >     >
> > >     > regards,
> > >     > Nikolay Nikolaev
> > >
> > >
> > > regards,
> > > Nikolay Nikolaev
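
For reference, the dirty-log support discussed above boils down to the back-end
setting one bit per dirtied guest page in the log region received via
VHOST_USER_SET_LOG_BASE; a minimal sketch (the 4 KiB log page size and 64-bit
word layout follow the vhost log format, everything else is made up):

    #include <stdatomic.h>
    #include <stdint.h>

    #define VHOST_LOG_PAGE 0x1000ULL   /* one bit per 4 KiB of guest memory */

    static void log_write(_Atomic uint64_t *log, uint64_t gpa, uint64_t len)
    {
        uint64_t page = gpa / VHOST_LOG_PAGE;
        uint64_t last = (gpa + len - 1) / VHOST_LOG_PAGE;

        for (; page <= last; page++) {
            /* set the bit for this page; QEMU scans and clears the log during
             * migration to find pages it has to re-send */
            atomic_fetch_or(&log[page / 64], UINT64_C(1) << (page % 64));
        }
    }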


Re: [Qemu-devel] [question] How to get the guest physical memory usage from host?

2014-12-22 Thread Zhang Haoyu

On 2014/12/22 17:33, Andrey Korolyov wrote:
>> Yes, it should depends on guest os implementation,
>> because physical memory is managed by OS, which have the full knowledge of 
>> memory usage,
>> so I'm afraid that windows dose not friendly support it.
>> Could you detail the peeking techniques mentioned above?
>>
>> Thanks,
>> Zhang Haoyu
> 
> Generally I meant virDomainMemoryPeek, but nothing prevents you to
> write code with same functionality, if libvirt usage is not preferred,
> it is only about asking monitor for chunks of memory and parse them in
> a proper way.
> 
Thanks, Andrey.
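
For reference, the "asking the monitor for chunks of memory" approach mentioned
above can be scripted with the pmemsave command, which dumps a range of guest
physical memory to a file for offline parsing (address, size and paths are
placeholders):

    # HMP: dump 1 MiB of guest-physical memory starting at 16 MiB
    (qemu) pmemsave 0x1000000 0x100000 /tmp/guest-chunk.bin

    # the same over QMP
    { "execute": "pmemsave",
      "arguments": { "val": 16777216, "size": 1048576,
                     "filename": "/tmp/guest-chunk.bin" } }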



Re: [Qemu-devel] cannot receive qemu-dev/kvm-dev mails sent by myself

2014-12-22 Thread Zhang Haoyu

On 2014/12/23 9:36, Fam Zheng wrote:
> On Mon, 12/22 20:21, Zhang Haoyu wrote:
>>
>> On 2014/12/22 20:05, Paolo Bonzini wrote:
>>>
>>>
>>> On 22/12/2014 12:40, Zhang Haoyu wrote:
>>>> On 2014/12/22 17:54, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 22/12/2014 10:48, Zhang Haoyu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I cannot receive qemu-dev/kvm-dev mails sent by myself,
>>>>>> but mails from others can be received,
>>>>>> any helps?
>>>>>
>>>>> For qemu-devel, you need to configure mailman to send messages even if
>>>>> they are yours.  For the kvm mailing list I'm not sure how it works (I
>>>>> read it through GMANE).
>>>>>
>>>> I didn't find the configuration, could you send out the configuration 
>>>> method or link site?
>>>
>>> https://lists.nongnu.org/mailman/options/qemu-devel
>>>
>> Thanks, Paolo,
>> I already have fellow the default settings, as it didn't work, so forget it.
>> "Receive your own posts to the list?
>> Ordinarily, you will get a copy of every message you post to the list. If 
>> you don't want to receive this copy, set this option to No."
>> => Yes
>>
>> Avoid duplicate copies of messages?
>> When you are listed explicitly in the To: or Cc: headers of a list message, 
>> you can opt to not receive another copy from the mailing list. Select Yes to 
>> avoid receiving copies from the mailing list; select No to receive copies.
>>
>> If the list has member personalized messages enabled, and you elect to 
>> receive copies, every copy will have a X-Mailman-Copy: yes header added to 
>> it.
>> => No
>>
> 
> So you're probably talking about the gmail "feature" that you don't get copies
> of the messages sent by you. I don't use gmail to send emails on lists, Stefan
> (Cc'ed) may know better.
> 
I think you're right; it was okay with my company mail before,
which has been out of use for a period of time.

Thanks,
Zhang Haoyu
> Fam
> 



Re: [Qemu-devel] cannot receive qemu-dev/kvm-dev mails sent by myself

2014-12-22 Thread Zhang Haoyu

On 2014/12/22 20:05, Paolo Bonzini wrote:
> 
> 
> On 22/12/2014 12:40, Zhang Haoyu wrote:
>> On 2014/12/22 17:54, Paolo Bonzini wrote:
>>>
>>>
>>> On 22/12/2014 10:48, Zhang Haoyu wrote:
>>>> Hi,
>>>>
>>>> I cannot receive qemu-dev/kvm-dev mails sent by myself,
>>>> but mails from others can be received,
>>>> any helps?
>>>
>>> For qemu-devel, you need to configure mailman to send messages even if
>>> they are yours.  For the kvm mailing list I'm not sure how it works (I
>>> read it through GMANE).
>>>
>> I didn't find the configuration, could you send out the configuration method 
>> or link site?
> 
> https://lists.nongnu.org/mailman/options/qemu-devel
> 
Thanks, Paolo,
I have already followed the default settings; as it didn't work, just forget it.
"Receive your own posts to the list?
Ordinarily, you will get a copy of every message you post to the list. If you 
don't want to receive this copy, set this option to No."
=> Yes

Avoid duplicate copies of messages?
When you are listed explicitly in the To: or Cc: headers of a list message, you 
can opt to not receive another copy from the mailing list. Select Yes to avoid 
receiving copies from the mailing list; select No to receive copies.

If the list has member personalized messages enabled, and you elect to receive 
copies, every copy will have a X-Mailman-Copy: yes header added to it.
=> No

> Paolo
> 



Re: [Qemu-devel] cannot receive qemu-dev/kvm-dev mails sent by myself

2014-12-22 Thread Zhang Haoyu
On 2014/12/22 17:54, Paolo Bonzini wrote:
> 
> 
> On 22/12/2014 10:48, Zhang Haoyu wrote:
>> Hi,
>>
>> I cannot receive qemu-dev/kvm-dev mails sent by myself,
>> but mails from others can be received,
>> any helps?
> 
> For qemu-devel, you need to configure mailman to send messages even if
> they are yours.  For the kvm mailing list I'm not sure how it works (I
> read it through GMANE).
> 
I didn't find the configuration; could you send out the configuration method or a
link to the site?

Thanks,
Zhang Haoyu

> Paolo
> 



Re: [Qemu-devel] Does kvm friendly support GPT?

2014-12-22 Thread Zhang Haoyu

On 2014/12/22 17:52, Paolo Bonzini wrote:
> 
> 
> On 22/12/2014 10:40, Zhang Haoyu wrote:
>>> 2) the FAT driver is not free, which prevents distribution in Fedora and
>>> several other distributions
>>>
>> Sorry, I cannot follow you,
>> the "FAT" mentioned above means FAT filesystem?
>> what's the relationship between OVMF and FAT?
>>
>> I want to use OVMF to enable UEFI to support GPT partition for P2V scenario.
> 
> Yes, I mean FAT filesystem.
> 
> The EFI system partition (the one with BOOTX64.EFI) is FAT.  The
> filesystem driver is not free, and without the driver you cannot boot a
> UEFI system.  So Fedora and other distributions that do not allow
> non-free or patent-encumbered software cannot distribute a useful OVMF
> build.
> 
Paolo, thanks.

> Paolo
> 



[Qemu-devel] cannot receive qemu-dev/kvm-dev mails sent by myself

2014-12-22 Thread Zhang Haoyu
Hi,

I cannot receive qemu-dev/kvm-dev mails sent by myself,
but mails from others can be received;
any help?

Thanks,
Zhang Haoyu



Re: [Qemu-devel] Does kvm friendly support GPT?

2014-12-22 Thread Zhang Haoyu

On 2014/12/22 17:28, Paolo Bonzini wrote:
> 
> 
> On 22/12/2014 07:39, Zhang Haoyu wrote:
>> Hi,
>>
>> When I perform P2V from native servers with win2008 to kvm vm,
>> some cases failed due to the physical disk was using GPT for partition,
>> and QEMU doesn't support GPT by default.
>>
>> And, I see in below site that OVMF can be used to enable UEFI to support GPT,
>> http://www.linux-kvm.org/page/OVMF
>>
>> But, it seems that OVMF is not stable enough for kvm.
> 
> OVMF is stable.  The main issues are:
> 
> 1) tools support (libvirt and above), which is being worked on
> 
> 2) the FAT driver is not free, which prevents distribution in Fedora and
> several other distributions
> 
Sorry, I cannot follow you.
Does the "FAT" mentioned above mean the FAT filesystem?
What is the relationship between OVMF and FAT?

I want to use OVMF to enable UEFI so that GPT partitions are supported in a P2V scenario.

Thanks,
Zhang Haoyu
> Paolo
> 



Re: [Qemu-devel] [question] How to get the guest physical memory usage from host?

2014-12-22 Thread Zhang Haoyu

On 2014/12/22 17:16, Andrey Korolyov wrote:
> On Mon, Dec 22, 2014 at 11:57 AM, Zhang Haoyu  
> wrote:
>>
>> On 2014/12/22 16:41, Andrey Korolyov wrote:
>>> On Mon, Dec 22, 2014 at 6:59 AM, Zhang Haoyu  
>>> wrote:
>>>> Hi,
>>>>
>>>> How to get the guest physical memory usage from host?
>>>> I don't want to introduce a guest-agent to get the info.
>>>>
>>>> Thanks,
>>>> Zhang Haoyu
>>>>
>>>
>>> There`s probably one approach, simular to (currently abandoned)
>>> virt-dmesg, via peeking to the guest memory directly. What are reasons
>>> for not leaving this task to agent, as it looks much simpler?
>>>
>> In some cases, we are not permitted to install anything in customer's guest 
>> environment.
>>
>> Thanks,
>> Zhang Haoyu
> 
> Ok, looks like memory peeking techniques are only a choice then. And,
> of course, this approach tends to be less flexible, because it depends
> on a relatively strict requirements for running kernel in a guest.
> 
Yes, it should depend on the guest OS implementation,
because physical memory is managed by the OS, which has full knowledge of 
memory usage,
so I'm afraid that Windows does not support this well.
Could you detail the peeking techniques mentioned above?

Thanks,
Zhang Haoyu



Re: [Qemu-devel] [question] How to get the guest physical memory usage from host?

2014-12-22 Thread Zhang Haoyu

On 2014/12/22 16:41, Andrey Korolyov wrote:
> On Mon, Dec 22, 2014 at 6:59 AM, Zhang Haoyu  
> wrote:
>> Hi,
>>
>> How to get the guest physical memory usage from host?
>> I don't want to introduce a guest-agent to get the info.
>>
>> Thanks,
>> Zhang Haoyu
>>
> 
> There`s probably one approach, simular to (currently abandoned)
> virt-dmesg, via peeking to the guest memory directly. What are reasons
> for not leaving this task to agent, as it looks much simpler?
> 
In some cases, we are not permitted to install anything in the customer's guest 
environment.

Thanks,
Zhang Haoyu



Re: [Qemu-devel] [PATCH] support vhost-user socket to reconnect

2014-12-21 Thread Zhang Haoyu
Hi, Kun,

Is this patch part of a patch series?
I don't see any place that references the "is_reconnect" field.

On 2014/12/22 15:06, zhangkun wrote:
> From: zhangkun 
> 
> Signed-off-by: zhangkun 
> ---
>  net/vhost-user.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/net/vhost-user.c b/net/vhost-user.c
> index 24e050c..957e78c 100644
> --- a/net/vhost-user.c
> +++ b/net/vhost-user.c
> @@ -26,6 +26,7 @@ typedef struct VhostUserChardevProps {
>  bool is_socket;
>  bool is_unix;
>  bool is_server;
> +bool is_reconnect;
>  } VhostUserChardevProps;
>  
>  VHostNetState *vhost_user_get_vhost_net(NetClientState *nc)
> @@ -132,6 +133,11 @@ static void net_vhost_user_event(void *opaque, int event)
>  }
>  }
>  
> +static bool net_vhost_user_can_read(void *opaque)
> +{
> +return true;
> +} 
> +
>  static int net_vhost_user_init(NetClientState *peer, const char *device,
> const char *name, CharDriverState *chr,
> bool vhostforce)
> @@ -151,7 +157,7 @@ static int net_vhost_user_init(NetClientState *peer, 
> const char *device,
>  s->chr = chr;
>  s->vhostforce = vhostforce;
>  
> -qemu_chr_add_handlers(s->chr, NULL, NULL, net_vhost_user_event, s);
> +qemu_chr_add_handlers(s->chr, net_vhost_user_can_read, NULL, 
> net_vhost_user_event, s);
>  
Why is there no read handler?

>  return 0;
>  }
> @@ -167,6 +173,8 @@ static int net_vhost_chardev_opts(const char *name, const 
> char *value,
>  props->is_unix = true;
>  } else if (strcmp(name, "server") == 0) {
>  props->is_server = true;
> +} else if (strcmp(name, "reconnect") == 0) {
> +props->is_reconnect = true;
>  } else {
>  error_report("vhost-user does not support a chardev"
>   " with the following option:\n %s = %s",
> 



[Qemu-devel] Does kvm friendly support GPT?

2014-12-21 Thread Zhang Haoyu
Hi,

When I perform P2V from physical servers running win2008 to KVM VMs,
some cases fail because the physical disk uses GPT partitioning,
and QEMU doesn't support GPT by default.

Also, I see on the site below that OVMF can be used to enable UEFI to support GPT:
http://www.linux-kvm.org/page/OVMF

But it seems that OVMF is not stable enough for KVM.

Any advice?

Thanks,
Zhang Haoyu



[Qemu-devel] [question] How to get the guest physical memory usage from host?

2014-12-21 Thread Zhang Haoyu
Hi,

How can I get the guest physical memory usage from the host?
I don't want to introduce a guest agent to get this information.

Thanks,
Zhang Haoyu



[Qemu-devel] [question] kvm fully support vga adapter pass-through ?

2014-11-18 Thread Zhang Haoyu
Hi all,

Does the combination of qemu-2.0.1 and linux-3.10 fully support direct assignment 
of VGA adapters to a VM?

Thanks,
Zhang Haoyu




Re: [Qemu-devel] Where is the VM live migration code?

2014-11-17 Thread Zhang Haoyu
> Hi,
> 
> I saw this page:
> 
> http://www.linux-kvm.org/page/Migration.
> 
> It looks like Migration is a feature provided by KVM? But when I look
> at the Linux kernel source code, i.e., virt/kvm, and arch/x86/kvm, I
> don't see the code for this migration feature.
> 
Most of the live migration code is in QEMU: migration.c, savevm.c, arch_init.c,
block-migration.c, and the individual devices' save/load handlers, etc.;
only dirty page logging/syncing is implemented in the kernel.
The most important functions to read are migration_thread() and
process_incoming_migration_co().
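
Since the question is about where to start reading, here is a deliberately
simplified sketch of the pre-copy control flow that migration_thread()
implements; everything below except the general idea is a placeholder made up
for illustration (the helper names and the MigrationState layout are not the
actual QEMU symbols).

/* Illustrative sketch of a pre-copy live migration loop, NOT the real QEMU
 * code. All helpers and the MigrationState type are invented placeholders
 * that only mirror the control flow of migration_thread(). */
#include <stdint.h>

typedef struct MigrationState MigrationState;

/* hypothetical helpers assumed for this sketch */
extern uint64_t pending_dirty_bytes(MigrationState *s);   /* RAM still dirty */
extern uint64_t downtime_budget_bytes(MigrationState *s); /* acceptable remainder */
extern void send_dirty_ram_pass(MigrationState *s);       /* one copy iteration */
extern void pause_guest(MigrationState *s);
extern void send_remaining_ram_and_device_state(MigrationState *s);

static void precopy_migration_sketch(MigrationState *s)
{
    /* Iterative phase: copy RAM while the guest keeps running and
     * re-dirtying pages (the kernel only provides dirty-page logging). */
    while (pending_dirty_bytes(s) > downtime_budget_bytes(s)) {
        send_dirty_ram_pass(s);
    }

    /* Completion phase: stop the guest briefly, transfer what is left
     * plus the device state saved via the devices' save/load handlers. */
    pause_guest(s);
    send_remaining_ram_and_device_state(s);
}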

> So I wonder where is the source code for the live migration? Is it
>purely implemented in user space? Because I see there are the
> following files in the qemu source code:
> 
> migration.c  migration-exec.c  migration-fd.c  migration-rdma.c
> migration-tcp.c  migration-unix.c
> 
> If I wish to understand the implementation of migration in Qemu/KVM,
> are these above files the ones I should read? Thanks.
> 
> -Jidong 




Re: [Qemu-devel] [question] updating the base image for all clones which have been running for months

2014-11-06 Thread Zhang Haoyu
>>>> Hi, all
>>>>
>>>> I used base image A to clone so many vm, 
>>>> after running for months, each vm has its own private applications and 
>>>> data,
>>>> which maybe different from each other.
>>>> Now, I want to install some applications for all of the clones,
>>>> what should I do?
>
>How would you do it for bare metal? Do the same for your guests.
>
For bare metal, I use a manager to push the applications to a host agent
running on each host; the host agent is responsible for installing the
applications.

Thanks,
Zhang Haoyu

>>>
>>> Install the applications on each clone separately, or use some other
>>> method to make it available (like installing on a shared network
>>> resource).
>>>
>> Could you detail "installing on a shared network resource"?
>
>Set up a network file system, like NFS or gluster, install your software
>on the shared file system, and then make each guest mount the shared
>file system in order to use the software.  The same as you would on bare
>metal.




[Qemu-devel] [PATCH] qcow2-cache: conditionally call bdrv_flush() in qcow2_cache_flush()

2014-11-06 Thread Zhang Haoyu
There is no need to call bdrv_flush() in qcow2_cache_flush()
if no cache entry is dirty.

Signed-off-by: Zhang Haoyu 
---
 block/qcow2-cache.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 904f6b1..09ee155 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -110,10 +110,6 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 BDRVQcowState *s = bs->opaque;
 int ret = 0;
 
-if (!c->entries[i].dirty || !c->entries[i].offset) {
-return 0;
-}
-
 trace_qcow2_cache_entry_flush(qemu_coroutine_self(),
   c == s->l2_table_cache, i);
 
@@ -166,19 +162,23 @@ int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
 {
 BDRVQcowState *s = bs->opaque;
 int result = 0;
+bool need_bdrv_flush = false;
 int ret;
 int i;
 
 trace_qcow2_cache_flush(qemu_coroutine_self(), c == s->l2_table_cache);
 
 for (i = 0; i < c->size; i++) {
-ret = qcow2_cache_entry_flush(bs, c, i);
-if (ret < 0 && result != -ENOSPC) {
-result = ret;
+if (c->entries[i].dirty && c->entries[i].offset) {
+ret = qcow2_cache_entry_flush(bs, c, i);
+if (ret < 0 && result != -ENOSPC) {
+result = ret;
+}
+need_bdrv_flush = true;
 }
 }
 
-if (result == 0) {
+if (need_bdrv_flush && result == 0) {
 ret = bdrv_flush(bs->file);
 if (ret < 0) {
 result = ret;
@@ -289,9 +289,11 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
Qcow2Cache *c,
 return i;
 }
 
-ret = qcow2_cache_entry_flush(bs, c, i);
-if (ret < 0) {
-return ret;
+if (c->entries[i].dirty && c->entries[i].offset) {
+ret = qcow2_cache_entry_flush(bs, c, i);
+if (ret < 0) {
+return ret;
+}
 }
 
 trace_qcow2_cache_get_read(qemu_coroutine_self(),
-- 
1.7.12.4
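
The same gating idea as the patch above, shown in isolation as a small
self-contained sketch; flush_entry() and sync_backing_file() are invented
stand-ins for qcow2_cache_entry_flush() and bdrv_flush(bs->file), not real
APIs.

/* Sketch of the optimization in the patch: only pay for the backing-file
 * sync when at least one cache entry was actually written out. The helpers
 * are placeholders, not QEMU functions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct CacheEntry {
    bool dirty;
    uint64_t offset;
} CacheEntry;

/* hypothetical stand-ins for qcow2_cache_entry_flush() / bdrv_flush() */
extern int flush_entry(CacheEntry *e);
extern int sync_backing_file(void);

static int flush_cache(CacheEntry *entries, size_t n)
{
    bool need_sync = false;
    int result = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].dirty && entries[i].offset) {
            int ret = flush_entry(&entries[i]);
            if (ret < 0) {
                result = ret;
            }
            need_sync = true;   /* at least one write was attempted */
        }
    }

    /* skip the expensive sync entirely when no entry was dirty */
    if (need_sync && result == 0) {
        result = sync_backing_file();
    }
    return result;
}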




Re: [Qemu-devel] [question] updating the base image for all clones which have been running for months

2014-11-03 Thread Zhang Haoyu
>> Hi, all
>> 
>> I used base image A to clone so many vm, 
>> after running for months, each vm has its own private applications and data,
>> which maybe different from each other.
>> Now, I want to install some applications for all of the clones,
>> what should I do?
>
>Install the applications on each clone separately, or use some other
>method to make it available (like installing on a shared network
>resource).
>
Could you detail "installing on a shared network resource"?

Thanks,
Zhang Haoyu
>> Can I rebase image A to B which have the applications to be installed,
>> then change the base image to B for all clones?
>
>The problem is that rebase works on the block level, not on the file
>system level. 
Yes, this is the point.

>Changing the backing file won't produce a correctly
>working guest, it causes file system corruption.
>
>Kevin




[Qemu-devel] [question] updating the base image for all clones which have been running for months

2014-11-03 Thread Zhang Haoyu
Hi, all

I used base image A to clone many VMs;
after running for months, each VM has its own private applications and data,
which may differ from one VM to another.
Now I want to install some applications on all of the clones;
what should I do?

Can I rebase image A to image B, which has the applications to be installed,
and then change the base image to B for all clones?
I don't think that can work; the clone images may be corrupted later.
If the clone image's space is not enough, what will happen?
Any ideas?

My crude idea is to push the applications from a central manager to each clone's
guest agent, which is responsible for installing the applications.

Thanks,
Zhang Haoyu




[Qemu-devel] [question] How is the progress of optimizing qcow2_check_metadata_overlap() with regard to cpu overhead?

2014-10-25 Thread Zhang Haoyu
Hi, Max

What is the progress on optimizing qcow2_check_metadata_overlap()?
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/127037/focus=127364

Thanks,
Zhang Haoyu




[Qemu-devel] [PATCH v5] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-22 Thread Zhang Haoyu
Use a local variable to bdrv_pwrite_sync the L1 table;
there is no need to convert the cached L1 table between
big-endian and host byte order.

Signed-off-by: Zhang Haoyu 
Reviewed-by: Max Reitz 
---
v4 -> v5:
- delete superfluous check of "l1_size2 != 0"
  after qemu_try_blockalign(l1_size2)

v3 -> v4:
 - convert the local L1 table to host byte order before copying it
   back to s->l1_table

v2 -> v3:
 - replace g_try_malloc0 with qemu_try_blockalign
 - copy the latest local L1 table back to s->l1_table
   after successfully bdrv_pwrite_sync L1 table

v1 -> v2:
 - remove the superfluous assignment, l1_table = NULL;
 - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
 - remove needless check of if (l1_table) before g_free(l1_table)

 block/qcow2-refcount.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..4cf6639 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 {
 BDRVQcowState *s = bs->opaque;
 uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
-bool l1_allocated = false;
 int64_t old_offset, old_l2_offset;
 int i, j, l1_modified = 0, nb_csectors, refcount;
 int ret;
 
 l2_table = NULL;
-l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
+l1_table = qemu_try_blockalign(bs->file, l1_size2);
+if (l1_table == NULL) {
+ret = -ENOMEM;
+goto fail;
+}
 
 s->cache_discards = true;
 
@@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
  * l1_table_offset when it is the current s->l1_table_offset! Be careful
  * when changing this! */
 if (l1_table_offset != s->l1_table_offset) {
-l1_table = g_try_malloc0(align_offset(l1_size2, 512));
-if (l1_size2 && l1_table == NULL) {
-ret = -ENOMEM;
-goto fail;
-}
-l1_allocated = true;
-
 ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
 if (ret < 0) {
 goto fail;
@@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 be64_to_cpus(&l1_table[i]);
 } else {
 assert(l1_size == s->l1_size);
-l1_table = s->l1_table;
-l1_allocated = false;
+memcpy(l1_table, s->l1_table, l1_size2);
 }
 
 for(i = 0; i < l1_size; i++) {
@@ -1055,13 +1050,14 @@ fail:
 }
 
 ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
-
-for (i = 0; i < l1_size; i++) {
-be64_to_cpus(&l1_table[i]);
+if (ret == 0) {
+for (i = 0; i < l1_size; i++) {
+be64_to_cpus(&l1_table[i]);
+}
+memcpy(s->l1_table, l1_table, l1_size2);
 }
 }
-if (l1_allocated)
-g_free(l1_table);
+g_free(l1_table);
 return ret;
 }
 
-- 
1.7.12.4




Re: [Qemu-devel] [PATCH v4] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-22 Thread Zhang Haoyu
>> Use local variable to bdrv_pwrite_sync L1 table,
>> needless to make conversion of cached L1 table between
>> big-endian and host style.
>> 
>> Signed-off-by: Zhang Haoyu 
>> Reviewed-by: Max Reitz 
>> ---
>> v3 -> v4:
>>  - convert local L1 table to host-style before copy it
>>back to s->l1_table
>> 
>> v2 -> v3:
>>  - replace g_try_malloc0 with qemu_try_blockalign
>>  - copy the latest local L1 table back to s->l1_table
>>after successfully bdrv_pwrite_sync L1 table
>> 
>> v1 -> v2:
>>  - remove the superflous assignment, l1_table = NULL;
>>  - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
>>  - remove needless check of if (l1_table) before g_free(l1_table)
>> 
>>  block/qcow2-refcount.c | 28 
>>  1 file changed, 12 insertions(+), 16 deletions(-)
>> 
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 2bcaaf9..3e4050a 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState 
>> *bs,
>>  {
>>  BDRVQcowState *s = bs->opaque;
>>  uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
>> -bool l1_allocated = false;
>>  int64_t old_offset, old_l2_offset;
>>  int i, j, l1_modified = 0, nb_csectors, refcount;
>>  int ret;
>>  
>>  l2_table = NULL;
>> -l1_table = NULL;
>>  l1_size2 = l1_size * sizeof(uint64_t);
>> +l1_table = qemu_try_blockalign(bs->file, l1_size2);
>> +if (l1_size2 && l1_table == NULL) {
>
>I think this check has a logic problem.  If l1_size2 != 0 and l1_table == NULL,
>What will happen?
>
If the condition "if (l1_size2 && l1_table == NULL)" is true,
the following code will be executed:
s->cache_discards = false;
qcow2_process_discards(bs, ret);
g_free(l1_table);

What is your question?

Thanks,
Zhang Haoyu
>Best regards,
>-Gonglei
>
>> +ret = -ENOMEM;
>> +goto fail;
>> +}
>>  
>>  s->cache_discards = true;
>>  
>> @@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>   * l1_table_offset when it is the current s->l1_table_offset! Be careful
>>   * when changing this! */
>>  if (l1_table_offset != s->l1_table_offset) {
>> -l1_table = g_try_malloc0(align_offset(l1_size2, 512));
>> -if (l1_size2 && l1_table == NULL) {
>> -ret = -ENOMEM;
>> -goto fail;
>> -}
>> -l1_allocated = true;
>> -
>>  ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
>>  if (ret < 0) {
>>  goto fail;
>> @@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>  be64_to_cpus(&l1_table[i]);
>>  } else {
>>  assert(l1_size == s->l1_size);
>> -l1_table = s->l1_table;
>> -l1_allocated = false;
>> +memcpy(l1_table, s->l1_table, l1_size2);
>>  }
>>  
>>  for(i = 0; i < l1_size; i++) {
>> @@ -1055,13 +1050,14 @@ fail:
>>  }
>>  
>>  ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, 
>> l1_size2);
>> -
>> -for (i = 0; i < l1_size; i++) {
>> -be64_to_cpus(&l1_table[i]);
>> +if (ret == 0) {
>> +for (i = 0; i < l1_size; i++) {
>> +be64_to_cpus(&l1_table[i]);
>> +}
>> +memcpy(s->l1_table, l1_table, l1_size2);
>>  }
>>  }
>> -if (l1_allocated)
>> -g_free(l1_table);
>> +g_free(l1_table);
>>  return ret;
>>  }
>>  




[Qemu-devel] [PATCH v4] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-22 Thread Zhang Haoyu
Use a local variable to bdrv_pwrite_sync the L1 table;
there is no need to convert the cached L1 table between
big-endian and host byte order.

Signed-off-by: Zhang Haoyu 
Reviewed-by: Max Reitz 
---
v3 -> v4:
 - convert the local L1 table to host byte order before copying it
   back to s->l1_table

v2 -> v3:
 - replace g_try_malloc0 with qemu_try_blockalign
 - copy the latest local L1 table back to s->l1_table
   after successfully bdrv_pwrite_sync L1 table

v1 -> v2:
 - remove the superfluous assignment, l1_table = NULL;
 - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
 - remove needless check of if (l1_table) before g_free(l1_table)

 block/qcow2-refcount.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..3e4050a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 {
 BDRVQcowState *s = bs->opaque;
 uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
-bool l1_allocated = false;
 int64_t old_offset, old_l2_offset;
 int i, j, l1_modified = 0, nb_csectors, refcount;
 int ret;
 
 l2_table = NULL;
-l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
+l1_table = qemu_try_blockalign(bs->file, l1_size2);
+if (l1_size2 && l1_table == NULL) {
+ret = -ENOMEM;
+goto fail;
+}
 
 s->cache_discards = true;
 
@@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
  * l1_table_offset when it is the current s->l1_table_offset! Be careful
  * when changing this! */
 if (l1_table_offset != s->l1_table_offset) {
-l1_table = g_try_malloc0(align_offset(l1_size2, 512));
-if (l1_size2 && l1_table == NULL) {
-ret = -ENOMEM;
-goto fail;
-}
-l1_allocated = true;
-
 ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
 if (ret < 0) {
 goto fail;
@@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 be64_to_cpus(&l1_table[i]);
 } else {
 assert(l1_size == s->l1_size);
-l1_table = s->l1_table;
-l1_allocated = false;
+memcpy(l1_table, s->l1_table, l1_size2);
 }
 
 for(i = 0; i < l1_size; i++) {
@@ -1055,13 +1050,14 @@ fail:
 }
 
 ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
-
-for (i = 0; i < l1_size; i++) {
-be64_to_cpus(&l1_table[i]);
+if (ret == 0) {
+for (i = 0; i < l1_size; i++) {
+be64_to_cpus(&l1_table[i]);
+}
+memcpy(s->l1_table, l1_table, l1_size2);
 }
 }
-if (l1_allocated)
-g_free(l1_table);
+g_free(l1_table);
 return ret;
 }
 
-- 
1.7.12.4




Re: [Qemu-devel] [Qemu-trivial] [PATCH v3] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-21 Thread Zhang Haoyu
>Use local variable to bdrv_pwrite_sync L1 table,
>needless to make conversion of cached L1 table between
>big-endian and host style.
>
>Signed-off-by: Zhang Haoyu 
>Reviewed-by: Max Reitz 
>---
>v2 -> v3:
> - replace g_try_malloc0 with qemu_try_blockalign
> - copy the latest local L1 table back to s->l1_table
>   after successfully bdrv_pwrite_sync L1 table
>
>v1 -> v2:
> - remove the superflous assignment, l1_table = NULL;
> - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
> - remove needless check of if (l1_table) before g_free(l1_table)
>
> block/qcow2-refcount.c | 25 +
> 1 file changed, 9 insertions(+), 16 deletions(-)
>
>diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>index 2bcaaf9..bceadce 100644
>--- a/block/qcow2-refcount.c
>+++ b/block/qcow2-refcount.c
>@@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
> {
> BDRVQcowState *s = bs->opaque;
> uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
>-bool l1_allocated = false;
> int64_t old_offset, old_l2_offset;
> int i, j, l1_modified = 0, nb_csectors, refcount;
> int ret;
> 
> l2_table = NULL;
>-l1_table = NULL;
> l1_size2 = l1_size * sizeof(uint64_t);
>+l1_table = qemu_try_blockalign(bs->file, l1_size2);
>+if (l1_size2 && l1_table == NULL) {
>+ret = -ENOMEM;
>+goto fail;
>+}
> 
> s->cache_discards = true;
> 
>@@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>  * l1_table_offset when it is the current s->l1_table_offset! Be careful
>  * when changing this! */
> if (l1_table_offset != s->l1_table_offset) {
>-l1_table = g_try_malloc0(align_offset(l1_size2, 512));
>-if (l1_size2 && l1_table == NULL) {
>-ret = -ENOMEM;
>-goto fail;
>-}
>-l1_allocated = true;
>-
> ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
> if (ret < 0) {
> goto fail;
>@@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
> be64_to_cpus(&l1_table[i]);
> } else {
> assert(l1_size == s->l1_size);
>-l1_table = s->l1_table;
>-l1_allocated = false;
>+memcpy(l1_table, s->l1_table, l1_size2);
> }
> 
> for(i = 0; i < l1_size; i++) {
>@@ -1055,13 +1050,11 @@ fail:
> }
> 
> ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
>-
>-for (i = 0; i < l1_size; i++) {
>-be64_to_cpus(&l1_table[i]);
>+if (ret == 0) {
>+memcpy(s->l1_table, l1_table, l1_size2);
I made a mistake: the big-endian L1 table was copied back to s->l1_table.
> }
> }
>-if (l1_allocated)
>-g_free(l1_table);
>+g_free(l1_table);
> return ret;
> }
> 
>-- 
>1.7.12.4




[Qemu-devel] [PATCH v3] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-21 Thread Zhang Haoyu
Use a local variable to bdrv_pwrite_sync the L1 table;
there is no need to convert the cached L1 table between
big-endian and host byte order.

Signed-off-by: Zhang Haoyu 
Reviewed-by: Max Reitz 
---
v2 -> v3:
 - replace g_try_malloc0 with qemu_try_blockalign
 - copy the latest local L1 table back to s->l1_table
   after successfully bdrv_pwrite_sync L1 table

v1 -> v2:
 - remove the superfluous assignment, l1_table = NULL;
 - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
 - remove needless check of if (l1_table) before g_free(l1_table)

 block/qcow2-refcount.c | 25 +
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..bceadce 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 {
 BDRVQcowState *s = bs->opaque;
 uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
-bool l1_allocated = false;
 int64_t old_offset, old_l2_offset;
 int i, j, l1_modified = 0, nb_csectors, refcount;
 int ret;
 
 l2_table = NULL;
-l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
+l1_table = qemu_try_blockalign(bs->file, l1_size2);
+if (l1_size2 && l1_table == NULL) {
+ret = -ENOMEM;
+goto fail;
+}
 
 s->cache_discards = true;
 
@@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
  * l1_table_offset when it is the current s->l1_table_offset! Be careful
  * when changing this! */
 if (l1_table_offset != s->l1_table_offset) {
-l1_table = g_try_malloc0(align_offset(l1_size2, 512));
-if (l1_size2 && l1_table == NULL) {
-ret = -ENOMEM;
-goto fail;
-}
-l1_allocated = true;
-
 ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
 if (ret < 0) {
 goto fail;
@@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 be64_to_cpus(&l1_table[i]);
 } else {
 assert(l1_size == s->l1_size);
-l1_table = s->l1_table;
-l1_allocated = false;
+memcpy(l1_table, s->l1_table, l1_size2);
 }
 
 for(i = 0; i < l1_size; i++) {
@@ -1055,13 +1050,11 @@ fail:
 }
 
 ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
-
-for (i = 0; i < l1_size; i++) {
-be64_to_cpus(&l1_table[i]);
+if (ret == 0) {
+memcpy(s->l1_table, l1_table, l1_size2);
 }
 }
-if (l1_allocated)
-g_free(l1_table);
+g_free(l1_table);
 return ret;
 }
 
-- 
1.7.12.4




Re: [Qemu-devel] [Qemu-stable] [PATCH v2] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-21 Thread Zhang Haoyu
>> Use local variable to bdrv_pwrite_sync L1 table,
>> needless to make conversion of cached L1 table between
>> big-endian and host style.
>>
>> Signed-off-by: Zhang Haoyu 
>> ---
>> v1 -> v2:
>>   - remove the superflous assignment, l1_table = NULL;
>>   - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
>>   - remove needless check of if (l1_table) before g_free(l1_table)
>>
>>   block/qcow2-refcount.c | 24 +++-
>>   1 file changed, 7 insertions(+), 17 deletions(-)
>>
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 2bcaaf9..29a916a 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState 
>> *bs,
>>   {
>>   BDRVQcowState *s = bs->opaque;
>>   uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
>> -bool l1_allocated = false;
>>   int64_t old_offset, old_l2_offset;
>>   int i, j, l1_modified = 0, nb_csectors, refcount;
>>   int ret;
>>   
>>   l2_table = NULL;
>> -l1_table = NULL;
>>   l1_size2 = l1_size * sizeof(uint64_t);
>> +l1_table = g_try_malloc0(ROUND_UP(l1_size2, BDRV_SECTOR_SIZE));
>
>I'm sorry, but I just now realized that we don't even need the 0 variant 
>here. We are never accessing any element beyond l1_table[l1_size - 1], 
>and all elements from 0 until l1_size - 1 are overwritten by 
>bdrv_pread() or memcpy(). Therefore we can simply use 
>qemu_try_blockalign(bs->file, l1_size2) here.
>
OK, I will replace g_try_malloc0(ROUND_UP(l1_size2, BDRV_SECTOR_SIZE)) with
qemu_try_blockalign(bs, ROUND_UP(l1_size2, BDRV_SECTOR_SIZE)).

>(and we should be using qemu_{try_,}blockalign() instead of 
>g_{try_,}malloc{0,}() whenever possible for performance reasons)
>
>> +if (l1_size2 && l1_table == NULL) {
>> +ret = -ENOMEM;
>> +goto fail;
>> +}
>>   
>>   s->cache_discards = true;
>>   
>> @@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>* l1_table_offset when it is the current s->l1_table_offset! Be 
>> careful
>>* when changing this! */
>>   if (l1_table_offset != s->l1_table_offset) {
>> -l1_table = g_try_malloc0(align_offset(l1_size2, 512));
>> -if (l1_size2 && l1_table == NULL) {
>> -ret = -ENOMEM;
>> -goto fail;
>> -}
>> -l1_allocated = true;
>> -
>>   ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
>>   if (ret < 0) {
>>   goto fail;
>> @@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>   be64_to_cpus(&l1_table[i]);
>>   } else {
>>   assert(l1_size == s->l1_size);
>> -l1_table = s->l1_table;
>> -l1_allocated = false;
>> +memcpy(l1_table, s->l1_table, l1_size2);
>>   }
>>   
>>   for(i = 0; i < l1_size; i++) {
>> @@ -1055,13 +1050,8 @@ fail:
>>   }
>>   
>>   ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, 
>> l1_size2);
>> -
>> -for (i = 0; i < l1_size; i++) {
>> -be64_to_cpus(&l1_table[i]);
>> -}
>
>But I realized something even more important as well: If you are 
>modifying the current L1 table, you need to copy the result back to 
>s->l1_table.
>
My fault.
>Sorry for having missed these in my previous review.
>
>Max
>
>>   }
>> -if (l1_allocated)
>> -g_free(l1_table);
>> +g_free(l1_table);
>>   return ret;
>>   }
>>   




[Qemu-devel] [PATCH v2] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-21 Thread Zhang Haoyu
Use a local variable to bdrv_pwrite_sync the L1 table;
there is no need to convert the cached L1 table between
big-endian and host byte order.

Signed-off-by: Zhang Haoyu 
---
v1 -> v2:
 - remove the superfluous assignment, l1_table = NULL;
 - replace 512 with BDRV_SECTOR_SIZE, and align_offset with ROUND_UP
 - remove needless check of if (l1_table) before g_free(l1_table)

 block/qcow2-refcount.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..29a916a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -881,14 +881,17 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 {
 BDRVQcowState *s = bs->opaque;
 uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
-bool l1_allocated = false;
 int64_t old_offset, old_l2_offset;
 int i, j, l1_modified = 0, nb_csectors, refcount;
 int ret;
 
 l2_table = NULL;
-l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
+l1_table = g_try_malloc0(ROUND_UP(l1_size2, BDRV_SECTOR_SIZE));
+if (l1_size2 && l1_table == NULL) {
+ret = -ENOMEM;
+goto fail;
+}
 
 s->cache_discards = true;
 
@@ -896,13 +899,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
  * l1_table_offset when it is the current s->l1_table_offset! Be careful
  * when changing this! */
 if (l1_table_offset != s->l1_table_offset) {
-l1_table = g_try_malloc0(align_offset(l1_size2, 512));
-if (l1_size2 && l1_table == NULL) {
-ret = -ENOMEM;
-goto fail;
-}
-l1_allocated = true;
-
 ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
 if (ret < 0) {
 goto fail;
@@ -912,8 +908,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 be64_to_cpus(&l1_table[i]);
 } else {
 assert(l1_size == s->l1_size);
-l1_table = s->l1_table;
-l1_allocated = false;
+memcpy(l1_table, s->l1_table, l1_size2);
 }
 
 for(i = 0; i < l1_size; i++) {
@@ -1055,13 +1050,8 @@ fail:
 }
 
 ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
-
-for (i = 0; i < l1_size; i++) {
-be64_to_cpus(&l1_table[i]);
-}
 }
-if (l1_allocated)
-g_free(l1_table);
+g_free(l1_table);
 return ret;
 }
 
-- 
1.7.12.4




Re: [Qemu-devel] [PATCH] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-21 Thread Zhang Haoyu
>> Use local variable to bdrv_pwrite_sync L1 table,
>> needless to make conversion of cached L1 table between
>> big-endian and host style.
>>
>> Signed-off-by: Zhang Haoyu 
>> ---
>>   block/qcow2-refcount.c | 22 +++---
>>   1 file changed, 7 insertions(+), 15 deletions(-)
>>
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 2bcaaf9..8b318e8 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -881,7 +881,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>   {
>>   BDRVQcowState *s = bs->opaque;
>>   uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
>> -bool l1_allocated = false;
>>   int64_t old_offset, old_l2_offset;
>>   int i, j, l1_modified = 0, nb_csectors, refcount;
>>   int ret;
>> @@ -889,6 +888,11 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>   l2_table = NULL;
>>   l1_table = NULL;
>
>Please remove this assignment; thanks to this hunk we don't need it anymore.
OK.
>
>>   l1_size2 = l1_size * sizeof(uint64_t);
>> +l1_table = g_try_malloc0(align_offset(l1_size2, 512));
>
>I wanted to propose using qemu_try_blockalign(), but since it'd require 
>a memset() afterwards, it gets rather ugly.
>
>Could you at least replace 512 by BDRV_SECTOR_SIZE, and maybe even 
>align_offset() by ROUND_UP()? We should probably do the latter in all of 
>the qcow2 code, though, I think it's just there because it has been 
>around since before there was a ROUND_UP()...
>
Good, I will replace 512 with BDRV_SECTOR_SIZE, and replace align_offset with 
ROUND_UP.
>> +if (l1_size2 && l1_table == NULL) {
>> +ret = -ENOMEM;
>> +goto fail;
>> +}
>>   
>>   s->cache_discards = true;
>>   
>> @@ -896,13 +900,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>* l1_table_offset when it is the current s->l1_table_offset! Be 
>> careful
>>* when changing this! */
>>   if (l1_table_offset != s->l1_table_offset) {
>> -l1_table = g_try_malloc0(align_offset(l1_size2, 512));
>> -if (l1_size2 && l1_table == NULL) {
>> -ret = -ENOMEM;
>> -goto fail;
>> -}
>> -l1_allocated = true;
>> -
>>   ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
>>   if (ret < 0) {
>>   goto fail;
>> @@ -912,8 +909,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
>>   be64_to_cpus(&l1_table[i]);
>>   } else {
>>   assert(l1_size == s->l1_size);
>> -l1_table = s->l1_table;
>> -l1_allocated = false;
>> +memcpy(l1_table, s->l1_table, l1_size2);
>>   }
>>   
>>   for(i = 0; i < l1_size; i++) {
>> @@ -1055,12 +1051,8 @@ fail:
>
>I don't think it will change a lot, but could you wrap the 
>"s->cache_discards = false; qcow2_process_discards(bs, ret);" in an "if 
>(s->cache_discards)"? You have introduced a case where s->cache_discards 
>was still false, so we don't need to call qcow2_process_discards() then 
>(which will hopefully return immediately, but well...).
s->cache_discards's initial value is true in qcow2_update_snapshot_refcount();
where is s->cache_discards set to false?
Or do you mean that s->cache_discards should be set to false
after g_try_malloc0(align_offset(l1_size2, 512)) fails?

>
>>   }
>>   
>>   ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, 
>> l1_size2);
>> -
>> -for (i = 0; i < l1_size; i++) {
>> -be64_to_cpus(&l1_table[i]);
>> -}
>>   }
>> -if (l1_allocated)
>> +if (l1_table)
>>   g_free(l1_table);
>
>Just drop the condition. g_free(l1_table); is enough.
>
OK.
>>   return ret;
>>   }
>
>The change itself is good, it just needs some polishing.
>
>Max




Re: [Qemu-devel] [question] savevm/delvm: Is it necessary to perform bdrv_drain_all before savevm and delvm?

2014-10-21 Thread Zhang Haoyu
> >> >> Hi,
> >> >>
> >> >> I noticed that bdrv_drain_all is performed in load_vmstate before
> >> bdrv_snapshot_goto,
>> >> >> and bdrv_drain_all is performed in qmp_transaction before
>> >> internal_snapshot_prepare,
>> >> >> so is it also neccesary to perform bdrv_drain_all in savevm and delvm?
>> >> >
>> >> >Definitely yes for savevm. do_savevm() calls it indirectly via
>> >> >vm_stop(), so that part looks okay.
>> >> >
>> >> Yes, you are right.
>> >> 
>> >> >delvm doesn't affect the currently running VM, and therefore doesn't
>> >> >interfere with guest requests that are in flight. So I think that a
>> >> >bdrv_drain_all() isn't needed there.
>> >> >
>> >> I'm worried about that there are still pending IOs while deleting 
>> >> snapshot,
>> >> then is it possible that there is concurrency problem between the
>> >> process of deleting snapshot
>> >> and the coroutine of io read/write(bdrv_co_do_rw) invoked by the
>> >> pending IOs?
>> >> This coroutine is also in main thread.
>> >> Am I missing something?
>> >
>> >What kind of concurrency problem are you thinking of?
>> >
>> I have encountered two problem,
>> 1) double-free of Qcow2DiscardRegion in qcow2_process_discards
>>please see the discussing mail: [PATCH] qcow2: fix double-free of 
>> Qcow2DiscardRegion in qcow2_process_discards
>> 2) qcow2 image is truncated to very huge size, but the size shown by 
>> qemu-img info is normal
>>please see the discussing mail:  
>>[question] is it possible that big-endian l1 table offset referenced by 
>> other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?
>
>Did you verify that the invalid value actually makes sense if
>byteswapped? For example, that there is no reserved bit set then?
>
Yes, exactly; I have verified that those L2 table offsets are invalid values 
when byte-swapped.

>> I suspect that both of the two problems are concurrency problem mentioned 
>> above, 
>> please see the discussing mail.
>> 
>> 
>> >I do see that there might be a chance of concurrency, but that doesn't
>> >automatically mean the requests are conflicting.
>> >
>> >Would you feel better with taking s->lock in qcow2_snapshot_delete()?
>> Both deleting snapshot and the coroutine of pending io 
>> read/write(bdrv_co_do_rw)
>> are performed in main thread, could BDRVQcowState.lock work?
>
>Yes. s->lock is not a mutex for threads, but a coroutine based one.
>
Yes, you are right.

>The problem, however, is that qcow2_snapshot_delete() isn't execute in a
>coroutine, so we can't take s->lock here. We would either need to move
>it into a coroutine or add a bdrv_drain_all() indeed.
>
I'm inclined to add bdrv_drain_all(), to keep it consistent with the other 
snapshot-related operations, like savevm, loadvm, internal_snapshot_prepare, 
etc.

Thanks,
Zhang Haoyu

>This also means that we probably need to review all other cases where
>non-coroutine callbacks from BlockDriver might interfere with running
>requests. The original assumption that they are safe as long as they are
>not running in a coroutine seems to be wrong.
Agreed.
>
>Kevin




[Qemu-devel] [PATCH bugfix] snapshot: add bdrv_drain_all() to bdrv_snapshot_delete() to avoid concurrency problem

2014-10-21 Thread Zhang Haoyu
If there is still pending I/O while a snapshot is being deleted, a concurrency
problem is possible, because snapshot deletion is done in non-coroutine context
while the pending I/O read/write (bdrv_co_do_rw) is done in coroutine context.
Add bdrv_drain_all() to bdrv_snapshot_delete() to avoid this problem.

Signed-off-by: Zhang Haoyu 
---
 block/snapshot.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/snapshot.c b/block/snapshot.c
index 85c52ff..ebc386a 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -236,6 +236,10 @@ int bdrv_snapshot_delete(BlockDriverState *bs,
 error_setg(errp, "snapshot_id and name are both NULL");
 return -EINVAL;
 }
+
+/* drain all pending i/o before deleting snapshot */
+bdrv_drain_all();
+
 if (drv->bdrv_snapshot_delete) {
 return drv->bdrv_snapshot_delete(bs, snapshot_id, name, errp);
 }
-- 
1.7.12.4




[Qemu-devel] [PATCH] snapshot: use local variable to bdrv_pwrite_sync L1 table

2014-10-21 Thread Zhang Haoyu
Use local variable to bdrv_pwrite_sync L1 table,
needless to make conversion of cached L1 table between
big-endian and host style.

Signed-off-by: Zhang Haoyu 
---
 block/qcow2-refcount.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..8b318e8 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -881,7 +881,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 {
 BDRVQcowState *s = bs->opaque;
 uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2;
-bool l1_allocated = false;
 int64_t old_offset, old_l2_offset;
 int i, j, l1_modified = 0, nb_csectors, refcount;
 int ret;
@@ -889,6 +888,11 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 l2_table = NULL;
 l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
+l1_table = g_try_malloc0(align_offset(l1_size2, 512));
+if (l1_size2 && l1_table == NULL) {
+ret = -ENOMEM;
+goto fail;
+}
 
 s->cache_discards = true;
 
@@ -896,13 +900,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
  * l1_table_offset when it is the current s->l1_table_offset! Be careful
  * when changing this! */
 if (l1_table_offset != s->l1_table_offset) {
-l1_table = g_try_malloc0(align_offset(l1_size2, 512));
-if (l1_size2 && l1_table == NULL) {
-ret = -ENOMEM;
-goto fail;
-}
-l1_allocated = true;
-
 ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
 if (ret < 0) {
 goto fail;
@@ -912,8 +909,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 be64_to_cpus(&l1_table[i]);
 } else {
 assert(l1_size == s->l1_size);
-l1_table = s->l1_table;
-l1_allocated = false;
+memcpy(l1_table, s->l1_table, l1_size2);
 }
 
 for(i = 0; i < l1_size; i++) {
@@ -1055,12 +1051,8 @@ fail:
 }
 
 ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);
-
-for (i = 0; i < l1_size; i++) {
-be64_to_cpus(&l1_table[i]);
-}
 }
-if (l1_allocated)
+if (l1_table)
 g_free(l1_table);
 return ret;
 }
-- 
1.7.12.4




Re: [Qemu-devel] [question] savevm/delvm: Is it necessary to perform bdrv_drain_all before savevm and delvm?

2014-10-20 Thread Zhang Haoyu
>> >> Hi,
>> >>
>> >> I noticed that bdrv_drain_all is performed in load_vmstate before
>> bdrv_snapshot_goto,
>> >> and bdrv_drain_all is performed in qmp_transaction before
>> internal_snapshot_prepare,
>> >> so is it also neccesary to perform bdrv_drain_all in savevm and delvm?
>> >
>> >Definitely yes for savevm. do_savevm() calls it indirectly via
>> >vm_stop(), so that part looks okay.
>> >
>> Yes, you are right.
>> 
>> >delvm doesn't affect the currently running VM, and therefore doesn't
>> >interfere with guest requests that are in flight. So I think that a
>> >bdrv_drain_all() isn't needed there.
>> >
>> I'm worried about that there are still pending IOs while deleting snapshot,
>> then is it possible that there is concurrency problem between the
>> process of deleting snapshot
>> and the coroutine of io read/write(bdrv_co_do_rw) invoked by the
>> pending IOs?
>> This coroutine is also in main thread.
>> Am I missing something?
>
>What kind of concurrency problem are you thinking of?
>
I have encountered two problems:
1) a double-free of Qcow2DiscardRegion in qcow2_process_discards
   please see the discussion thread: [PATCH] qcow2: fix double-free of 
Qcow2DiscardRegion in qcow2_process_discards
2) the qcow2 image is truncated to a very huge size, but the size shown by qemu-img 
info is normal
   please see the discussion thread:
   [question] is it possible that big-endian l1 table offset referenced by 
other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

I suspect that both of these problems are the concurrency problem mentioned 
above;
please see those discussion threads.


>I do see that there might be a chance of concurrency, but that doesn't
>automatically mean the requests are conflicting.
>
>Would you feel better with taking s->lock in qcow2_snapshot_delete()?
Both snapshot deletion and the coroutine of pending I/O read/write (bdrv_co_do_rw)
are performed in the main thread; could BDRVQcowState.lock work here?

Thanks,
Zhang Haoyu
>This might actually be a valid concern.
>
>Kevin




Re: [Qemu-devel] [question] savevm/delvm: Is it necessary to perform bdrv_drain_all before savevm and delvm?

2014-10-20 Thread Zhang Haoyu

>> Hi,
>>
>> I noticed that bdrv_drain_all is performed in load_vmstate before 
bdrv_snapshot_goto,
>> and bdrv_drain_all is performed in qmp_transaction before 
internal_snapshot_prepare,

>> so is it also neccesary to perform bdrv_drain_all in savevm and delvm?
>
>Definitely yes for savevm. do_savevm() calls it indirectly via
>vm_stop(), so that part looks okay.
>
Yes, you are right.

>delvm doesn't affect the currently running VM, and therefore doesn't
>interfere with guest requests that are in flight. So I think that a
>bdrv_drain_all() isn't needed there.
>
I'm worried that there may still be pending I/Os while a snapshot is being deleted;
is it then possible that there is a concurrency problem between the
process of deleting the snapshot
and the coroutine of I/O read/write (bdrv_co_do_rw) invoked by the pending
I/Os?

This coroutine also runs in the main thread.
Am I missing something?

Thanks,
Zhang Haoyu

>Kevin



[Qemu-devel] [question] savevm/delvm: Is it necessary to perform bdrv_drain_all before savevm and delvm?

2014-10-20 Thread Zhang Haoyu
Hi,

I noticed that bdrv_drain_all is performed in load_vmstate before 
bdrv_snapshot_goto,
and bdrv_drain_all is performed in qmp_transaction before 
internal_snapshot_prepare,
so is it also necessary to perform bdrv_drain_all in savevm and delvm?

Thanks,
Zhang Haoyu





Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

2014-10-13 Thread Zhang Haoyu
>>>>>> l1_table is not necessarily a local variable to 
>>>>>> qcow2_update_snapshot_refcount,
>>>>>> which depends on condition of "if (l1_table_offset != 
>>>>>> s->l1_table_offset)",
>>>>>> if the condition not true, l1_table = s->l1_table.
>>>>> Oh, yes, you're right. Okay, so in theory nothing should happen anyway,
>>>>> because qcow2 does not have to be reentrant (so s->l1_table will not be
>>>>> accessed while it's big endian and therefore possibly not in CPU order).
>>>> Could you detail how qcow2 does not have to be reentrant?
>>>> In below stack,
>>>> qcow2_update_snapshot_refcount
>>>> |- cpu_to_be64s(&l1_table[i])
>>>> |- bdrv_pwrite_sync
>>> This is executed on bs->file, not the qcow2 BDS.
>>>
>> Yes, bs->file is passed to bdrv_pwrite_sync here,
>> but aio_poll(aio_context) will poll all BDS's aio, not only that of 
>> bs->file, doesn't it?
>> Is it possible that there are pending aio which belong to this qcow2 BDS 
>> still exist?
>
>qcow2 is generally not reentrant, this is secured by locking 
>(BDRVQcowState.lock). As long as one request for a BDS is still running, 
>it will not be interrupted.
>
This problem can be reproduced with a loop of
savevm -> delvm -> savevm -> delvm ... in about half an hour,
but after applying the above change of using a local variable to sync l1_table,
the problem has not occurred for more than 48 hours with the same loop of
savevm -> delvm -> savevm -> delvm ...
Could you help analyse this problem, please?

And, because bdrv_co_do_rw runs in a coroutine context rather than in another
thread,
both bdrv_co_do_rw and qcow2_update_snapshot_refcount are performed in the same 
thread (the main thread),
so how does BDRVQcowState.lock avoid the reentrancy?

Thanks,
Zhang Haoyu
>Max
>
>> Thanks,
>> Zhang Haoyu
>>> Max
>>>
>>>> |-- bdrv_pwrite
>>>> |--- bdrv_pwritev
>>>> | bdrv_prwv_co
>>>> |- aio_poll(aio_context) <== this aio_context is qemu_aio_context
>>>> |-- aio_dispatch
>>>> |--- bdrv_co_io_em_complete
>>>> | qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry 
>>>> is bdrv_co_do_rw
>>>> bdrv_co_do_rw will access l1_table to perform I/O operation.
>>>>
>>>> Thanks,
>>>> Zhang Haoyu
>>>>> But I find it rather ugly to convert the cached L1 table to big endian,
>>>>> so I'd be fine with the patch you proposed.
>>>>>
>>>>> Max




Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

2014-10-13 Thread Zhang Haoyu
>>>>>>>> Hi,
>>>>>>>> I encounter a problem that after deleting snapshot, the qcow2 image 
>>>>>>>> size is very larger than that it should be displayed by ls command,
>>>>>>>> but the virtual disk size is okay via qemu-img info.
>>>>>>>> I suspect that during updating l1 table offset, other I/O job 
>>>>>>>> reference the big-endian l1 table offset (very large value),
>>>>>>>> so the file is truncated to very large.
>>>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>>>> possible (punching holes if your kernel and file system is new enough to
>>>>>>> support that), so that it is not consuming the amount of disk space that
>>>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>>>> Sorry, I didn't clarify the problem clearly.
>>>>>> In qcow2_update_snapshot_refcount(), below code,
>>>>>>/* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>>>if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>for (i = 0; i < l1_size; i++) {
>>>>>>cpu_to_be64s(&l1_table[i]);
>>>>>>}
>>>>>>
>>>>>>ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, 
>>>>>> l1_size2);
>>>>>>
>>>>>>for (i = 0; i < l1_size; i++) {
>>>>>>be64_to_cpus(&l1_table[i]);
>>>>>>}
>>>>>>}
>>>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>>>> is it possible that there is other I/O reference this interim l1 table 
>>>>>> whose entries contain the be64 l2 table offset?
>>>>>> The be64 l2 table offset maybe a very large value, hundreds of TB is 
>>>>>> possible,
>>>>>> then the qcow2 file will be truncated to far larger than normal size.
>>>>>> So we'll see the huge size of the qcow2 file by ls -hl, but the size is 
>>>>>> still normal displayed by qemu-img info.
>>>>>>
>>>>>> If the possibility mentioned above exists, below raw code may fix it,
>>>>>> if (ret == 0 && addend >= 0 && l1_modified) {
>>>>>>tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t))
>>>>>>memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>>>for (i = 0; i < l1_size; i++) {
>>>>>>cpu_to_be64s(&tmp_l1_table[i]);
>>>>>>}
>>>>>>ret = bdrv_pwrite_sync(bs->file, l1_table_offset, 
>>>>>> tmp_l1_table, l1_size2);
>>>>>>
>>>>>>free(tmp_l1_table);
>>>>>>}
>>>>> l1_table is already a local variable (local to
>>>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>>>> introducing another local buffer should mitigate the problem, if there
>>>>> is any.
>>>>>
>>>> l1_table is not necessarily a local variable to 
>>>> qcow2_update_snapshot_refcount,
>>>> which depends on condition of "if (l1_table_offset != s->l1_table_offset)",
>>>> if the condition not true, l1_table = s->l1_table.
>>> Oh, yes, you're right. Okay, so in theory nothing should happen anyway,
>>> because qcow2 does not have to be reentrant (so s->l1_table will not be
>>> accessed while it's big endian and therefore possibly not in CPU order).
>> Could you detail how qcow2 does not have to be reentrant?
>> In below stack,
>> qcow2_update_snapshot_refcount
>> |- cpu_to_be64s(&l1_table[i])
>> |- bdrv_pwrite_sync
>
>This is executed on bs->file, not the qcow2 BDS.
>
Yes, bs->file is passed to bdrv_pwrite_sync here, 
but aio_poll(aio_context) will poll the AIO of all BDSes, not only that of 
bs->file, won't it?
Is it possible that pending AIO belonging to this qcow2 BDS still 
exists?

Thanks,
Zhang Haoyu
>Max
>
>> |-- bdrv_pwrite
>> |--- bdrv_pwritev
>> | bdrv_prwv_co
>> |- aio_poll(aio_context) <== this aio_context is qemu_aio_context
>> |-- aio_dispatch
>> |--- bdrv_co_io_em_complete
>> | qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is 
>> bdrv_co_do_rw
>> bdrv_co_do_rw will access l1_table to perform I/O operation.
>>
>> Thanks,
>> Zhang Haoyu
>>> But I find it rather ugly to convert the cached L1 table to big endian,
>>> so I'd be fine with the patch you proposed.
>>>
>>> Max




Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

2014-10-13 Thread Zhang Haoyu
>>>>>> Hi,
>>>>>> I encounter a problem that after deleting snapshot, the qcow2 image size 
>>>>>> is very larger than that it should be displayed by ls command,
>>>>>> but the virtual disk size is okay via qemu-img info.
>>>>>> I suspect that during updating l1 table offset, other I/O job reference 
>>>>>> the big-endian l1 table offset (very large value),
>>>>>> so the file is truncated to very large.
>>>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>>>> still consuming holes in the file; the maximum offset of the file is
>>>>> still unchanged, even if the file is no longer using as many referenced
>>>>> clusters.  Recent changes have gone in to sparsify the file when
>>>>> possible (punching holes if your kernel and file system is new enough to
>>>>> support that), so that it is not consuming the amount of disk space that
>>>>> a mere ls reports.  But if what you are asking for is a way to compact
>>>>> the file back down, then you'll need to submit a patch.  The idea of
>>>>> having an online defragmenter for qcow2 files has been kicked around
>>>>> before, but it is complex enough that no one has attempted a patch yet.
>>>> Sorry, I didn't clarify the problem clearly.
>>>> In qcow2_update_snapshot_refcount(), below code,
>>>>   /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>>>   if (ret == 0 && addend >= 0 && l1_modified) {
>>>>   for (i = 0; i < l1_size; i++) {
>>>>   cpu_to_be64s(&l1_table[i]);
>>>>   }
>>>>
>>>>   ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, 
>>>> l1_size2);
>>>>
>>>>   for (i = 0; i < l1_size; i++) {
>>>>   be64_to_cpus(&l1_table[i]);
>>>>   }
>>>>   }
>>>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>>>> is it possible that there is other I/O reference this interim l1 table 
>>>> whose entries contain the be64 l2 table offset?
>>>> The be64 l2 table offset maybe a very large value, hundreds of TB is 
>>>> possible,
>>>> then the qcow2 file will be truncated to far larger than normal size.
>>>> So we'll see the huge size of the qcow2 file by ls -hl, but the size is 
>>>> still normal displayed by qemu-img info.
>>>>
>>>> If the possibility mentioned above exists, below raw code may fix it,
>>>>if (ret == 0 && addend >= 0 && l1_modified) {
>>>>   tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t))
>>>>   memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>>>   for (i = 0; i < l1_size; i++) {
>>>>   cpu_to_be64s(&tmp_l1_table[i]);
>>>>   }
>>>>   ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table, 
>>>> l1_size2);
>>>>
>>>>   free(tmp_l1_table);
>>>>   }
>>> l1_table is already a local variable (local to
>>> qcow2_update_snapshot_refcount()), so I can't really imagine how
>>> introducing another local buffer should mitigate the problem, if there
>>> is any.
>>>
>> l1_table is not necessarily a local variable to 
>> qcow2_update_snapshot_refcount,
>> which depends on condition of "if (l1_table_offset != s->l1_table_offset)",
>> if the condition not true, l1_table = s->l1_table.
>
>Oh, yes, you're right. Okay, so in theory nothing should happen anyway, 
>because qcow2 does not have to be reentrant (so s->l1_table will not be 
>accessed while it's big endian and therefore possibly not in CPU order). 
Could you explain in what sense qcow2 does not have to be reentrant?
In the stack below,
qcow2_update_snapshot_refcount
|- cpu_to_be64s(&l1_table[i])
|- bdrv_pwrite_sync
|-- bdrv_pwrite
|--- bdrv_pwritev
| bdrv_prwv_co
|- aio_poll(aio_context) <== this aio_context is qemu_aio_context
|-- aio_dispatch
|--- bdrv_co_io_em_complete
| qemu_coroutine_enter(co->coroutine, NULL); <== coroutine entry is 
bdrv_co_do_rw
bdrv_co_do_rw will access l1_table to perform the I/O operation.

Thanks,
Zhang Haoyu
>But I find it rather ugly to convert the cached L1 table to big endian, 
>so I'd be fine with the patch you proposed.
>
>Max




Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

2014-10-12 Thread Zhang Haoyu
>>>> Hi,
>>>> I encounter a problem that after deleting snapshot, the qcow2 image size 
>>>> is very larger than that it should be displayed by ls command,
>>>> but the virtual disk size is okay via qemu-img info.
>>>> I suspect that during updating l1 table offset, other I/O job reference 
>>>> the big-endian l1 table offset (very large value),
>>>> so the file is truncated to very large.
>>> Not quite.  Rather, all the data that the snapshot used to occupy is
>>> still consuming holes in the file; the maximum offset of the file is
>>> still unchanged, even if the file is no longer using as many referenced
>>> clusters.  Recent changes have gone in to sparsify the file when
>>> possible (punching holes if your kernel and file system is new enough to
>>> support that), so that it is not consuming the amount of disk space that
>>> a mere ls reports.  But if what you are asking for is a way to compact
>>> the file back down, then you'll need to submit a patch.  The idea of
>>> having an online defragmenter for qcow2 files has been kicked around
>>> before, but it is complex enough that no one has attempted a patch yet.
>> Sorry, I didn't clarify the problem clearly.
>> In qcow2_update_snapshot_refcount(), below code,
>>  /* Update L1 only if it isn't deleted anyway (addend = -1) */
>>  if (ret == 0 && addend >= 0 && l1_modified) {
>>  for (i = 0; i < l1_size; i++) {
>>  cpu_to_be64s(&l1_table[i]);
>>  }
>>
>>  ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, 
>> l1_size2);
>>
>>  for (i = 0; i < l1_size; i++) {
>>  be64_to_cpus(&l1_table[i]);
>>  }
>>  }
>> between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
>> is it possible that some other I/O references this interim l1 table whose
>> entries contain the big-endian l2 table offsets?
>> A big-endian l2 table offset may be a very large value (hundreds of TB is
>> possible), so the qcow2 file would be truncated to far larger than its normal
>> size. We would then see a huge qcow2 file size with ls -hl, while the size
>> reported by qemu-img info is still normal.
>>
>> If the possibility mentioned above exists, the rough code below may fix it:
>>   if (ret == 0 && addend >= 0 && l1_modified) {
>>       uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
>>       memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
>>       for (i = 0; i < l1_size; i++) {
>>           cpu_to_be64s(&tmp_l1_table[i]);
>>       }
>>       ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table,
>>                              l1_size2);
>>
>>       g_free(tmp_l1_table);
>>   }
>
>l1_table is already a local variable (local to 
>qcow2_update_snapshot_refcount()), so I can't really imagine how 
>introducing another local buffer should mitigate the problem, if there 
>is any.
>
l1_table is not necessarily local to qcow2_update_snapshot_refcount();
that depends on the condition "if (l1_table_offset != s->l1_table_offset)".
If the condition is false, l1_table = s->l1_table.

Thanks,
Zhang Haoyu
>Max




Re: [Qemu-devel] [PATCH] qcow2: fix double-free of Qcow2DiscardRegion in qcow2_process_discards

2014-10-12 Thread Zhang Haoyu


On 2014-10-12 15:34, Kevin Wolf wrote:

On 11.10.2014 at 09:14, Zhang Haoyu wrote:

In qcow2_update_snapshot_refcount() -> qcow2_process_discards() -> bdrv_discard(),
bdrv_discard() may free the Qcow2DiscardRegion that is currently referenced by
the "next" pointer in qcow2_process_discards(); in the next iteration d = next,
so g_free(d) will double-free this Qcow2DiscardRegion.

qcow2_snapshot_delete
|- qcow2_update_snapshot_refcount
|-- qcow2_process_discards
|--- bdrv_discard
| aio_poll
|- aio_dispatch
|-- bdrv_co_io_em_complete
|--- qemu_coroutine_enter(co->coroutine, NULL); <=== coroutine entry is bdrv_co_do_rw
|--- g_free(d) <== freeing the first Qcow2DiscardRegion is okay
|--- d = next;  <== this assignment is done by the QTAILQ_FOREACH_SAFE() macro
|--- g_free(d);  <== a double-free happens if bdrv_discard already freed this object during the previous iteration
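
For readers unfamiliar with the macro: QTAILQ_FOREACH_SAFE() only protects
against the loop body freeing the *current* element, because it caches the next
pointer before the body runs, roughly like this (a simplified sketch, not the
literal qemu/queue.h definition):

    #define QTAILQ_FOREACH_SAFE(var, head, field, next_var)              \
        for ((var) = QTAILQ_FIRST(head);                                 \
             (var) && ((next_var) = QTAILQ_NEXT(var, field), 1);         \
             (var) = (next_var))

So if a nested call frees the element that the cached next pointer refers to,
the following iteration operates on already-freed memory.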

Do you have a reproducer for this or did code review lead you to this?
This problem can be reproduced with a loop of savevm -> delvm -> savevm ->
delvm ..., after about 4 hours.

When I deleted the vm snapshot, qemu crashed and produced a core file;
I debugged the core file and found the double-free and the stack above.

So I added a breakpoint at g_free(d); and found that a double-free indeed
happened: the same address was freed twice.

Only the first discard region was not affected by the double-free.


At the moment I can't see how bdrv_discard(bs->file) could ever free a
Qcow2DiscardRegion of bs, as it's working on a completely different
BlockDriverState (which usually won't even be a qcow2 one).
I think the "aio_context" in bdrv_discard -> aio_poll(aio_context, true) 
is the qemu_aio_context,
no matter the bs or bs->file passed to bdrv_discard, so 
aio_poll(aio_context) will poll all of the aio.



bdrv_co_do_rw
|- bdrv_co_do_writev
|-- bdrv_co_do_pwritev
|--- bdrv_aligned_pwritev
| qcow2_co_writev
|- qcow2_alloc_cluster_link_l2
|-- qcow2_free_any_clusters
|--- qcow2_free_clusters
| update_refcount
|- qcow2_process_discards
|-- g_free(d)  <== In the next iteration, this Qcow2DiscardRegion will be double-freed.

This shouldn't happen in a nested call either, as s->lock can't be taken
recursively.
Could you explain how s->lock prevents that? The stack above is from gdb,
taken when I added a breakpoint at g_free(d).


Thanks,
Zhang Haoyu


Kevin







[Qemu-devel] [PATCH] qcow2: fix leak of Qcow2DiscardRegion in update_refcount_discard

2014-10-11 Thread Zhang Haoyu
When a Qcow2DiscardRegion is adjacent to another one referenced by "d",
free the Qcow2DiscardRegion metadata referenced by "p" after it has been
removed from the s->discards queue.

Signed-off-by: Zhang Haoyu 
---
 block/qcow2-refcount.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..c31d85a 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -524,6 +524,7 @@ found:
 QTAILQ_REMOVE(&s->discards, p, next);
 d->offset = MIN(d->offset, p->offset);
 d->bytes += p->bytes;
+g_free(p);
 }
 }
 
-- 
1.7.12.4
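
To illustrate the leak this fixes (a simplified sketch, not the actual
update_refcount_discard() code): once the adjacent region "p" has been merged
into "d" and unlinked from s->discards, nothing else holds a reference to it,
so it has to be freed on the spot:

    /* merge an adjacent region "p" into "d" (sketch) */
    QTAILQ_REMOVE(&s->discards, p, next);   /* p is no longer reachable ... */
    d->offset = MIN(d->offset, p->offset);
    d->bytes += p->bytes;
    g_free(p);                              /* ... so free it here */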




[Qemu-devel] [PATCH] qcow2: fix double-free of Qcow2DiscardRegion in qcow2_process_discards

2014-10-11 Thread Zhang Haoyu
In qcow2_update_snapshot_refcount() -> qcow2_process_discards() -> bdrv_discard(),
bdrv_discard() may free the Qcow2DiscardRegion that is currently referenced by
the "next" pointer in qcow2_process_discards(); in the next iteration d = next,
so g_free(d) will double-free this Qcow2DiscardRegion.

qcow2_snapshot_delete
|- qcow2_update_snapshot_refcount
|-- qcow2_process_discards
|--- bdrv_discard
| aio_poll
|- aio_dispatch
|-- bdrv_co_io_em_complete
|--- qemu_coroutine_enter(co->coroutine, NULL); <=== coroutine entry is bdrv_co_do_rw
|--- g_free(d) <== freeing the first Qcow2DiscardRegion is okay
|--- d = next;  <== this assignment is done by the QTAILQ_FOREACH_SAFE() macro
|--- g_free(d);  <== a double-free happens if bdrv_discard already freed this object during the previous iteration

bdrv_co_do_rw
|- bdrv_co_do_writev
|-- bdrv_co_do_pwritev
|--- bdrv_aligned_pwritev
| qcow2_co_writev
|- qcow2_alloc_cluster_link_l2
|-- qcow2_free_any_clusters
|--- qcow2_free_clusters
| update_refcount
|- qcow2_process_discards
|-- g_free(d)  <== In the next iteration, this Qcow2DiscardRegion will be double-freed.

Signed-off-by: Zhang Haoyu 
Signed-off-by: Fu Xuewei 
---
 block/qcow2-refcount.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..3b759a3 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -462,9 +462,9 @@ fail_block:
 void qcow2_process_discards(BlockDriverState *bs, int ret)
 {
 BDRVQcowState *s = bs->opaque;
-Qcow2DiscardRegion *d, *next;
+Qcow2DiscardRegion *d;
 
-QTAILQ_FOREACH_SAFE(d, &s->discards, next, next) {
+while ((d = QTAILQ_FIRST(&s->discards)) != NULL) {
 QTAILQ_REMOVE(&s->discards, d, next);
 
 /* Discard is optional, ignore the return value */
-- 
1.7.12.4
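
After this change the loop always pops the current head before calling into
code that may recurse into the queue; the resulting function looks roughly
like this (a sketch assuming the qcow2_process_discards() of that era):

    void qcow2_process_discards(BlockDriverState *bs, int ret)
    {
        BDRVQcowState *s = bs->opaque;
        Qcow2DiscardRegion *d;

        while ((d = QTAILQ_FIRST(&s->discards)) != NULL) {
            QTAILQ_REMOVE(&s->discards, d, next);

            /* Discard is optional, ignore the return value */
            if (ret >= 0) {
                bdrv_discard(bs->file, d->offset >> BDRV_SECTOR_BITS,
                             d->bytes >> BDRV_SECTOR_BITS);
            }

            g_free(d);
        }
    }

Since "d" is removed from s->discards before bdrv_discard() runs, a nested
qcow2_process_discards() can neither see it nor free it a second time.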




Re: [Qemu-devel] [question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?

2014-10-10 Thread Zhang Haoyu

>> Hi,
>>
>> Is it worthwhile to introduce a unified co-scheduling mechanism to CFS?
>> Because multiple cooperating threads or tasks frequently synchronize with
>> each other, not executing them concurrently only increases the latency of
>> synchronization. For example, a thread spinning on a spinlock while waiting
>> for another thread to release it might reduce its waiting time by being
>> executed concurrently with the thread that holds the lock.
>> In a virtualization scenario, co-scheduling the vcpus that belong to the
>> same vm is even more desirable when several cooperating threads/tasks are
>> running in the guest.
>>
>> Is there a plan for this work?
>
> Please refer to gang scheduler.
>
Is there a mechanism to dynamically detect which vcpus belong to the same gang?
Maybe a measure of cooperation could be used as the threshold for deciding
which vcpus belong to the same gang; just a wild thought.


> Regards,
> Wanpeng Li
>>
>> Thanks,
>> Zhang Haoyu




[Qemu-devel] [question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?

2014-10-10 Thread Zhang Haoyu

Hi,

Is it worthwhile to introduce a unified co-scheduling mechanism to CFS?
Because multiple cooperating threads or tasks frequently synchronize with each
other, not executing them concurrently only increases the latency of
synchronization. For example, a thread spinning on a spinlock while waiting for
another thread to release it might reduce its waiting time by being executed
concurrently with the thread that holds the lock.
In a virtualization scenario, co-scheduling the vcpus that belong to the same
vm is even more desirable when several cooperating threads/tasks are running
in the guest.

Is there a plan for this work?

Thanks,
Zhang Haoyu



Re: [Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

2014-10-09 Thread Zhang Haoyu
>> Hi,
>> I encountered a problem: after deleting a snapshot, the qcow2 image size
>> displayed by the ls command is much larger than it should be,
>> but the virtual disk size reported by qemu-img info is okay.
>> I suspect that while the l1 table offset is being updated, some other I/O
>> job references the big-endian l1 table offset (a very large value),
>> so the file is truncated to a very large size.
>
>Not quite.  Rather, all the data that the snapshot used to occupy is
>still consuming holes in the file; the maximum offset of the file is
>still unchanged, even if the file is no longer using as many referenced
>clusters.  Recent changes have gone in to sparsify the file when
>possible (punching holes if your kernel and file system is new enough to
>support that), so that it is not consuming the amount of disk space that
>a mere ls reports.  But if what you are asking for is a way to compact
>the file back down, then you'll need to submit a patch.  The idea of
>having an online defragmenter for qcow2 files has been kicked around
>before, but it is complex enough that no one has attempted a patch yet.

Sorry, I didn't explain the problem clearly.
In qcow2_update_snapshot_refcount(), in the code below,
    /* Update L1 only if it isn't deleted anyway (addend = -1) */
    if (ret == 0 && addend >= 0 && l1_modified) {
        for (i = 0; i < l1_size; i++) {
            cpu_to_be64s(&l1_table[i]);
        }

        ret = bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table, l1_size2);

        for (i = 0; i < l1_size; i++) {
            be64_to_cpus(&l1_table[i]);
        }
    }
between cpu_to_be64s(&l1_table[i]); and be64_to_cpus(&l1_table[i]);,
is it possible that some other I/O references this interim l1 table whose
entries contain the big-endian l2 table offsets?
A big-endian l2 table offset may be a very large value (hundreds of TB is
possible), so the qcow2 file would be truncated to far larger than its normal
size. We would then see a huge qcow2 file size with ls -hl, while the size
reported by qemu-img info is still normal.

If the possibility mentioned above exists, the rough code below may fix it:
    if (ret == 0 && addend >= 0 && l1_modified) {
        uint64_t *tmp_l1_table = g_malloc0(l1_size * sizeof(uint64_t));
        memcpy(tmp_l1_table, l1_table, l1_size * sizeof(uint64_t));
        for (i = 0; i < l1_size; i++) {
            cpu_to_be64s(&tmp_l1_table[i]);
        }
        ret = bdrv_pwrite_sync(bs->file, l1_table_offset, tmp_l1_table,
                               l1_size2);

        g_free(tmp_l1_table);
    }

Thanks,
Zhang Haoyu





[Qemu-devel] [question] is it possible that big-endian l1 table offset referenced by other I/O while updating l1 table offset in qcow2_update_snapshot_refcount?

2014-10-09 Thread Zhang Haoyu
Hi,
I encountered a problem: after deleting a snapshot, the qcow2 image size
displayed by the ls command is much larger than it should be,
but the virtual disk size reported by qemu-img info is okay.
I suspect that while the l1 table offset is being updated, some other I/O job
references the big-endian l1 table offset (a very large value), so the file is
truncated to a very large size.
Any ideas?
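
As a side note, it can help to compare the apparent size, the allocated size
and the virtual size when investigating this kind of report; for example
(hypothetical image path):

    ls -lh /path/to/image.qcow2        # apparent size (grows with the truncation)
    du -h /path/to/image.qcow2         # blocks actually allocated on disk
    qemu-img info /path/to/image.qcow2 # virtual size and disk size as qemu sees them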

Thanks,
Zhang Haoyu




[Qemu-devel] [PATCH bugfix v2] snapshot: fix referencing wrong variable in while loop in do_delvm

2014-09-29 Thread Zhang Haoyu
The while loop variable is "bs1",
but "bs" is always passed to bdrv_snapshot_delete_by_id_or_name.
Broken in commit a89d89d, v1.7.0.

v1 -> v2:
* add broken commit id to commit message

Signed-off-by: Zhang Haoyu 
Reviewed-by: Markus Armbruster 
---
 savevm.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/savevm.c b/savevm.c
index e19ae0a..2d8eb96 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1245,19 +1245,18 @@ int load_vmstate(const char *name)
 
 void do_delvm(Monitor *mon, const QDict *qdict)
 {
-BlockDriverState *bs, *bs1;
+BlockDriverState *bs;
 Error *err = NULL;
 const char *name = qdict_get_str(qdict, "name");
 
-bs = find_vmstate_bs();
-if (!bs) {
+if (!find_vmstate_bs()) {
 monitor_printf(mon, "No block device supports snapshots\n");
 return;
 }
 
-bs1 = NULL;
-while ((bs1 = bdrv_next(bs1))) {
-if (bdrv_can_snapshot(bs1)) {
+bs = NULL;
+while ((bs = bdrv_next(bs))) {
+if (bdrv_can_snapshot(bs)) {
 bdrv_snapshot_delete_by_id_or_name(bs, name, &err);
 if (err) {
 monitor_printf(mon,
-- 
1.7.12.4




[Qemu-devel] [PATCH bugfix] snapshot: fix referencing wrong variable in while loop in do_delvm

2014-09-29 Thread Zhang Haoyu
The while loop variable is "bs1", but "bs" is always passed to
bdrv_snapshot_delete_by_id_or_name.

Signed-off-by: Zhang Haoyu 
---
 savevm.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/savevm.c b/savevm.c
index e19ae0a..2d8eb96 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1245,19 +1245,18 @@ int load_vmstate(const char *name)
 
 void do_delvm(Monitor *mon, const QDict *qdict)
 {
-BlockDriverState *bs, *bs1;
+BlockDriverState *bs;
 Error *err = NULL;
 const char *name = qdict_get_str(qdict, "name");
 
-bs = find_vmstate_bs();
-if (!bs) {
+if (!find_vmstate_bs()) {
 monitor_printf(mon, "No block device supports snapshots\n");
 return;
 }
 
-bs1 = NULL;
-while ((bs1 = bdrv_next(bs1))) {
-if (bdrv_can_snapshot(bs1)) {
+bs = NULL;
+while ((bs = bdrv_next(bs))) {
+if (bdrv_can_snapshot(bs)) {
 bdrv_snapshot_delete_by_id_or_name(bs, name, &err);
 if (err) {
 monitor_printf(mon,
-- 
1.7.12.4




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-22 Thread Zhang Haoyu
>> > >>> Hi, all
>> > >>> 
>> > >>> I start a VM with virtio-serial (default ports number: 31), and found 
>> > >>> that virtio-blk performance degradation happened, about 25%, this 
>> > >>> problem can be reproduced 100%.
>> > >>> without virtio-serial:
>> > >>> 4k-read-random 1186 IOPS
>> > >>> with virtio-serial:
>> > >>> 4k-read-random 871 IOPS
>> > >>> 
>> > >>> but if use max_ports=2 option to limit the max number of virio-serial 
>> > >>> ports, then the IO performance degradation is not so serious, about 5%.
>> > >>> 
>> > >>> And, ide performance degradation does not happen with virtio-serial.
>> > >>
>> > >>Pretty sure it's related to MSI vectors in use.  It's possible that
>> > >>the virtio-serial device takes up all the avl vectors in the guests,
>> > >>leaving old-style irqs for the virtio-blk device.
>> > >>
>> > >I don't think so,
>> > >I use iometer to test 64k-read(or write)-sequence case, if I disable the 
>> > >virtio-serial dynamically via device manager->virtio-serial => disable,
>> > >then the performance get promotion about 25% immediately, then I 
>> > >re-enable the virtio-serial via device manager->virtio-serial => enable,
>> > >the performance got back again, very obvious.
>> > add comments:
>> > Although the virtio-serial is enabled, I don't use it at all, the 
>> > degradation still happened.
>> 
>> Using the vectors= option as mentioned below, you can restrict the
>> number of MSI vectors the virtio-serial device gets.  You can then
>> confirm whether it's MSI that's related to these issues.
>
>Amit,
>
>It's related to the big number of ioeventfds used in virtio-serial-pci. With
>virtio-serial-pci's ioeventfd=off, the performance is not affected no matter if
>guest initializes it or not.
>
>In my test, there are 12 fds to poll in qemu_poll_ns before loading guest
>virtio_console.ko, whereas 76 once modprobe virtio_console.
>
>Looks like the ppoll takes more time to poll more fds.
>
>Some trace data with systemtap:
>
>12 fds:
>
>time  rel_time  symbol
>15    (+1)  qemu_poll_ns  [enter]
>18(+3)  qemu_poll_ns  [return]
>
>76 fd:
>
>12(+2)  qemu_poll_ns  [enter]
>18(+6)  qemu_poll_ns  [return]
>
>I haven't looked at virtio-serial code, I'm not sure if we should reduce the
>number of ioeventfds in virtio-serial-pci or focus on lower level efficiency.
>
Does ioeventfd=off hamper the performance of virtio-serial itself?
In my opinion, virtio-serial's use cases do not need very high throughput,
so ioeventfd=off should have only a slight impact on its performance.
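
For anyone who wants to reproduce the comparison, the two knobs discussed in
this thread can be set on the qemu command line, for example (illustrative
only, not a recommendation):

    -device virtio-serial-pci                  # baseline: 31 ports, ioeventfd on
    -device virtio-serial-pci,vectors=4        # restrict the MSI vectors, as Amit suggested
    -device virtio-serial-pci,ioeventfd=off    # disable ioeventfd, as in Fam's test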

Thanks,
Zhang Haoyu

>Haven't compared with g_poll but I think the underlying syscall should be the
>same.
>
>Any ideas?
>
>Fam
>
>
>> 
>> > >So, I think it has no business with legacy interrupt mode, right?
>> > >
>> > >I am going to observe the difference of perf top data on qemu and perf 
>> > >kvm stat data when disable/enable virtio-serial in guest,
>> > >and the difference of perf top data on guest when disable/enable 
>> > >virtio-serial in guest,
>> > >any ideas?
>> > >
>> > >Thanks,
>> > >Zhang Haoyu
>> > >>If you restrict the number of vectors the virtio-serial device gets
>> > >>(using the -device virtio-serial-pci,vectors= param), does that make
>> > >>things better for you?
>> 
>> 
>> 
>>  Amit




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-16 Thread Zhang Haoyu
>>>>>> If virtio-blk and virtio-serial share an IRQ, the guest operating
system has to check each virtqueue for activity. Maybe there is some
inefficiency doing that.
>>>>>> AFAIK virtio-serial registers 64 virtqueues (on 31 ports +
console) even if everything is unused.
>>>>>
>>>>> That could be the case if MSI is disabled.
>>>>
>>>> Do the windows virtio drivers enable MSIs, in their inf file?
>>>
>>> It depends on the version of the drivers, but it is a reasonable guess
>>> at what differs between Linux and Windows. Haoyu, can you give us the
>>> output of lspci from a Linux guest?
>>>
>> I made a test with fio on rhel-6.5 guest, the same degradation
happened too, this degradation can be reproduced on rhel6.5 guest 100%.
>> virtio_console module installed:
>> 64K-write-sequence: 285 MBPS, 4380 IOPS
>> virtio_console module uninstalled:
>> 64K-write-sequence: 370 MBPS, 5670 IOPS
>>
>I use top -d 1 -H -p  to monitor the cpu usage, and found that,
>virtio_console module installed:
>qemu main thread cpu usage: 98%
>virtio_console module uninstalled:
>qemu main thread cpu usage: 60%
>

I found that the statement "err = register_virtio_driver(&virtio_console);"
in the virtio_console module's init() function causes the degradation: if I
return directly before "err = register_virtio_driver(&virtio_console);", the
degradation disappears; if I return directly after it, the degradation is
still there.
I will try the test cases below:
1. Do not emulate the virtio-serial device, then install/uninstall the
virtio_console driver in the guest, to see whether there is any difference
in virtio-blk performance and cpu usage.
2. Do not emulate the virtio-serial device, then install the virtio_balloon
driver (and also do not emulate the virtio-balloon device), to see whether
the virtio-blk performance degradation still happens.
3. Emulate the virtio-balloon device instead of the virtio-serial device,
then see whether virtio-blk performance is hampered.

Based on the test results, the corresponding analysis will be performed.
Any ideas?

Thanks,
Zhang Haoyu



Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-11 Thread Zhang Haoyu
>Physical Slot: 18
>Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx+
>Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> SERR- Interrupt: pin A routed to IRQ 10
>Region 0: I/O ports at c0c0 [size=32]
>Region 1: Memory at febd4000 (32-bit, non-prefetchable) [size=4K]
>Expansion ROM at feb8 [disabled] [size=256K]
>Capabilities: [40] MSI-X: Enable+ Count=3 Masked-
>Vector table: BAR=1 offset=
>PBA: BAR=1 offset=0800
>Kernel driver in use: virtio-pci
>Kernel modules: virtio_pci
>
>Thanks,
>Zhang Haoyu




[Qemu-devel] [PATCH v2] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-11 Thread Zhang Haoyu
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI write and irq delivery (system bus latency?). So we need to emulate
this behavior. Otherwise, a guest that hasn't registered a proper irq handler
would stay in the interrupt routine, as this irq would be re-injected
immediately after the guest enables interrupts. This would prevent the guest
from moving forward and it may miss the possibility to get a proper irq
handler registered (one example is a windows guest resuming from hibernation).

As there's no way to tell an unhandled irq from newly raised ones, this patch
solves the problem by scheduling a delayed work item when the count of irqs
injected during eoi broadcast exceeds a threshold value. After this patch, the
guest can move a little forward when there's no suitable irq handler, in case
it registers one very soon, and for a guest that has a bad irq detection
routine (such as note_interrupt() in linux), this bad irq would be recognized
as quickly as before.

v1 -> v2:
  - delete the TODO comment
  - adjust the coding style to kernel style
  - add the missing "}" for if (ioapic->irq_eoi[i] == 
IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT) {

Cc: Michael S. Tsirkin 
Cc: Jan Kiszka 
Signed-off-by: Jason Wang 
Signed-off-by: Zhang Haoyu 
---
 include/trace/events/kvm.h | 20 +++
 virt/kvm/ioapic.c  | 50 --
 virt/kvm/ioapic.h  |  6 ++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 908925a..ab679c3 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -95,6 +95,26 @@ TRACE_EVENT(kvm_ioapic_set_irq,
  __entry->coalesced ? " (coalesced)" : "")
 );
 
+TRACE_EVENT(kvm_ioapic_delayed_eoi_inj,
+   TP_PROTO(__u64 e),
+   TP_ARGS(e),
+
+   TP_STRUCT__entry(
+   __field(__u64,  e   )
+   ),
+
+   TP_fast_assign(
+   __entry->e  = e;
+   ),
+
+   TP_printk("dst %x vec=%u (%s|%s|%s%s)",
+ (u8)(__entry->e >> 56), (u8)__entry->e,
+ __print_symbolic((__entry->e >> 8 & 0x7), kvm_deliver_mode),
+ (__entry->e & (1<<11)) ? "logical" : "physical",
+ (__entry->e & (1<<15)) ? "level" : "edge",
+ (__entry->e & (1<<16)) ? "|masked" : "")
+);
+
 TRACE_EVENT(kvm_msi_set_irq,
TP_PROTO(__u64 address, __u64 data),
TP_ARGS(address, data),
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index e8ce34c..8e1dc67 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -405,6 +405,24 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int 
irq_source_id)
spin_unlock(&ioapic->lock);
 }
 
+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
+{
+   int i;
+   struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
+eoi_inject.work);
+   spin_lock(&ioapic->lock);
+   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+   if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
+   continue;
+
+   if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
+   ioapic_service(ioapic, i, false);
+   }
+   spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
@@ -435,8 +453,32 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
 
ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
ent->fields.remote_irr = 0;
-   if (ioapic->irr & (1 << i))
-   ioapic_service(ioapic, i, false);
+   if (!ent->fields.mask && (ioapic->irr & (1 << i))) {
+   ++ioapic->irq_eoi[i];
+   if (ioapic->irq_eoi[i] == 
IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT) {
+   /*
+* Real hardware does not deliver the irq so
+* immediately during eoi broadcast, so we need
+* to emulate this behavior. Otherwise, for
+* guests who has not registered handler of a
+* level irq, this irq would be injected
+* immediately after guest enabl

Re: [Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-11 Thread Zhang Haoyu
>> Subject: [Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery
>> duringeoi broadcast
>> 
>> Currently, we call ioapic_service() immediately when we find the irq is still
>> active during eoi broadcast. But for real hardware, there's some dealy 
>> between
>> the EOI writing and irq delivery (system bus latency?). So we need to emulate
>> this behavior. Otherwise, for a guest who haven't register a proper irq 
>> handler
>> , it would stay in the interrupt routine as this irq would be re-injected
>> immediately after guest enables interrupt. This would lead guest can't move
>> forward and may miss the possibility to get proper irq handler registered 
>> (one
>> example is windows guest resuming from hibernation).
>> 
>> As there's no way to differ the unhandled irq from new raised ones, this 
>> patch
>> solve this problems by scheduling a delayed work when the count of irq 
>> injected
>> during eoi broadcast exceeds a threshold value. After this patch, the guest 
>> can
>> move a little forward when there's no suitable irq handler in case it may
>> register one very soon and for guest who has a bad irq detection routine ( 
>> such
>> as note_interrupt() in linux ), this bad irq would be recognized soon as in 
>> the
>> past.
>> 
>> Cc: Michael S. Tsirkin 
>> Signed-off-by: Jason Wang 
>> Signed-off-by: Zhang Haoyu 
>> ---
>>  include/trace/events/kvm.h | 20 +++
>>  virt/kvm/ioapic.c  | 50
>> --
>>  virt/kvm/ioapic.h  |  6 ++
>>  3 files changed, 74 insertions(+), 2 deletions(-)
>> 
>If this is a new version, please add a v2/v3 suffix and describe the changes at
>those different versions .
>
>You can get more information from:
>http://wiki.qemu.org/Contribute/SubmitAPatch
>
Okay, thanks.

>Best regards,
>-Gonglei




[Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-11 Thread Zhang Haoyu
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI write and irq delivery (system bus latency?). So we need to emulate
this behavior. Otherwise, a guest that hasn't registered a proper irq handler
would stay in the interrupt routine, as this irq would be re-injected
immediately after the guest enables interrupts. This would prevent the guest
from moving forward and it may miss the possibility to get a proper irq
handler registered (one example is a windows guest resuming from hibernation).

As there's no way to tell an unhandled irq from newly raised ones, this patch
solves the problem by scheduling a delayed work item when the count of irqs
injected during eoi broadcast exceeds a threshold value. After this patch, the
guest can move a little forward when there's no suitable irq handler, in case
it registers one very soon, and for a guest that has a bad irq detection
routine (such as note_interrupt() in linux), this bad irq would be recognized
as quickly as before.

Cc: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
Signed-off-by: Zhang Haoyu 
---
 include/trace/events/kvm.h | 20 +++
 virt/kvm/ioapic.c  | 50 --
 virt/kvm/ioapic.h  |  6 ++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 908925a..ab679c3 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -95,6 +95,26 @@ TRACE_EVENT(kvm_ioapic_set_irq,
  __entry->coalesced ? " (coalesced)" : "")
 );
 
+TRACE_EVENT(kvm_ioapic_delayed_eoi_inj,
+   TP_PROTO(__u64 e),
+   TP_ARGS(e),
+
+   TP_STRUCT__entry(
+   __field(__u64,  e   )
+   ),
+
+   TP_fast_assign(
+   __entry->e  = e;
+   ),
+
+   TP_printk("dst %x vec=%u (%s|%s|%s%s)",
+ (u8)(__entry->e >> 56), (u8)__entry->e,
+ __print_symbolic((__entry->e >> 8 & 0x7), kvm_deliver_mode),
+ (__entry->e & (1<<11)) ? "logical" : "physical",
+ (__entry->e & (1<<15)) ? "level" : "edge",
+ (__entry->e & (1<<16)) ? "|masked" : "")
+);
+
 TRACE_EVENT(kvm_msi_set_irq,
TP_PROTO(__u64 address, __u64 data),
TP_ARGS(address, data),
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index e8ce34c..8e1dc67 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -405,6 +405,24 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int 
irq_source_id)
spin_unlock(&ioapic->lock);
 }
 
+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
+{
+   int i;
+   struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
+eoi_inject.work);
+   spin_lock(&ioapic->lock);
+   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+   if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
+   continue;
+
+   if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
+   ioapic_service(ioapic, i, false);
+   }
+   spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
@@ -435,8 +453,32 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
 
ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
ent->fields.remote_irr = 0;
-   if (ioapic->irr & (1 << i))
-   ioapic_service(ioapic, i, false);
+   if (!ent->fields.mask && (ioapic->irr & (1 << i))) {
+   ++ioapic->irq_eoi[i];
+   if (ioapic->irq_eoi[i] == 
IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT) {
+   /*
+* Real hardware does not deliver the irq so
+* immediately during eoi broadcast, so we need
+* to emulate this behavior. Otherwise, for
+* guests who has not registered handler of a
+* level irq, this irq would be injected
+* immediately after guest enables interrupt
+* (which happens usually at the end of the
+* common interrupt routi

Re: [Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-11 Thread Zhang Haoyu
>> Currently, we call ioapic_service() immediately when we find the irq is still
>> active during eoi broadcast. But for real hardware, there's some dealy 
>> between
>> the EOI writing and irq delivery (system bus latency?). So we need to emulate
>> this behavior. Otherwise, for a guest who haven't register a proper irq 
>> handler
>> , it would stay in the interrupt routine as this irq would be re-injected
>> immediately after guest enables interrupt. This would lead guest can't move
>> forward and may miss the possibility to get proper irq handler registered 
>> (one
>> example is windows guest resuming from hibernation).
>> 
>> As there's no way to differ the unhandled irq from new raised ones, this 
>> patch
>> solve this problems by scheduling a delayed work when the count of irq 
>> injected
>> during eoi broadcast exceeds a threshold value. After this patch, the guest 
>> can
>> move a little forward when there's no suitable irq handler in case it may
>> register one very soon and for guest who has a bad irq detection routine ( 
>> such
>> as note_interrupt() in linux ), this bad irq would be recognized soon as in 
>> the
>> past.
>> 
>> Cc: Michael S. Tsirkin 
>> Signed-off-by: Jason Wang 
>> Signed-off-by: Zhang Haoyu 
>> ---
>>  include/trace/events/kvm.h | 20 ++
>>  virt/kvm/ioapic.c  | 51 
>> --
>>  virt/kvm/ioapic.h  |  6 ++
>>  3 files changed, 75 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
>> index 908925a..b05f688 100644
>> --- a/include/trace/events/kvm.h
>> +++ b/include/trace/events/kvm.h
>> @@ -95,6 +95,26 @@ TRACE_EVENT(kvm_ioapic_set_irq,
>>__entry->coalesced ? " (coalesced)" : "")
>>  );
>>  
>> +TRACE_EVENT(kvm_ioapic_delayed_eoi_inj,
>> +TP_PROTO(__u64 e),
>> +TP_ARGS(e),
>> +
>> +TP_STRUCT__entry(
>> +__field(__u64,  e   )
>> +),
>> +
>> +TP_fast_assign(
>> +__entry->e  = e;
>> +),
>> +
>> +TP_printk("dst %x vec=%u (%s|%s|%s%s)",
>> +  (u8)(__entry->e >> 56), (u8)__entry->e,
>> +  __print_symbolic((__entry->e >> 8 & 0x7), kvm_deliver_mode),
>> +  (__entry->e & (1<<11)) ? "logical" : "physical",
>> +  (__entry->e & (1<<15)) ? "level" : "edge",
>> +  (__entry->e & (1<<16)) ? "|masked" : "")
>> +);
>> +
>>  TRACE_EVENT(kvm_msi_set_irq,
>>  TP_PROTO(__u64 address, __u64 data),
>>  TP_ARGS(address, data),
>> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
>> index e8ce34c..a36c4c4 100644
>> --- a/virt/kvm/ioapic.c
>> +++ b/virt/kvm/ioapic.c
>> @@ -405,6 +405,24 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, 
>> int irq_source_id)
>>  spin_unlock(&ioapic->lock);
>>  }
>>  
>> +static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
>> +{
>> +int i;
>> +struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
>> + eoi_inject.work);
>> +spin_lock(&ioapic->lock);
>> +for (i = 0; i < IOAPIC_NUM_PINS; i++) {
>> +union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
>> +
>> +if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
>> +continue;
>> +
>> +if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
>> +ioapic_service(ioapic, i, false);
>> +}
>> +spin_unlock(&ioapic->lock);
>> +}
>> +
>>  static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
>>  struct kvm_ioapic *ioapic, int vector, int trigger_mode)
>>  {
>> @@ -435,8 +453,33 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu 
>> *vcpu,
>>  
>>  ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
>>  ent->fields.remote_irr = 0;
>> -if (ioapic->irr & (1 << i))
>> -ioapic_service(ioapic, i, false);
>> +if (!ent->fields.mask && (ioapic->irr & (1 <

[Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-10 Thread Zhang Haoyu
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI write and irq delivery (system bus latency?). So we need to emulate
this behavior. Otherwise, a guest that hasn't registered a proper irq handler
would stay in the interrupt routine, as this irq would be re-injected
immediately after the guest enables interrupts. This would prevent the guest
from moving forward and it may miss the possibility to get a proper irq
handler registered (one example is a windows guest resuming from hibernation).

As there's no way to tell an unhandled irq from newly raised ones, this patch
solves the problem by scheduling a delayed work item when the count of irqs
injected during eoi broadcast exceeds a threshold value. After this patch, the
guest can move a little forward when there's no suitable irq handler, in case
it registers one very soon, and for a guest that has a bad irq detection
routine (such as note_interrupt() in linux), this bad irq would be recognized
as quickly as before.

Cc: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
Signed-off-by: Zhang Haoyu 
---
 include/trace/events/kvm.h | 20 ++
 virt/kvm/ioapic.c  | 51 --
 virt/kvm/ioapic.h  |  6 ++
 3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 908925a..b05f688 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -95,6 +95,26 @@ TRACE_EVENT(kvm_ioapic_set_irq,
  __entry->coalesced ? " (coalesced)" : "")
 );
 
+TRACE_EVENT(kvm_ioapic_delayed_eoi_inj,
+   TP_PROTO(__u64 e),
+   TP_ARGS(e),
+
+   TP_STRUCT__entry(
+   __field(__u64,  e   )
+   ),
+
+   TP_fast_assign(
+   __entry->e  = e;
+   ),
+
+   TP_printk("dst %x vec=%u (%s|%s|%s%s)",
+ (u8)(__entry->e >> 56), (u8)__entry->e,
+ __print_symbolic((__entry->e >> 8 & 0x7), kvm_deliver_mode),
+ (__entry->e & (1<<11)) ? "logical" : "physical",
+ (__entry->e & (1<<15)) ? "level" : "edge",
+ (__entry->e & (1<<16)) ? "|masked" : "")
+);
+
 TRACE_EVENT(kvm_msi_set_irq,
TP_PROTO(__u64 address, __u64 data),
TP_ARGS(address, data),
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index e8ce34c..a36c4c4 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -405,6 +405,24 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int 
irq_source_id)
spin_unlock(&ioapic->lock);
 }
 
+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
+{
+   int i;
+   struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
+eoi_inject.work);
+   spin_lock(&ioapic->lock);
+   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+   if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
+   continue;
+
+   if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
+   ioapic_service(ioapic, i, false);
+   }
+   spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
@@ -435,8 +453,33 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
 
ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
ent->fields.remote_irr = 0;
-   if (ioapic->irr & (1 << i))
-   ioapic_service(ioapic, i, false);
+   if (!ent->fields.mask && (ioapic->irr & (1 << i))) {
+   ++ioapic->irq_eoi[i];
+   if (ioapic->irq_eoi[i] == 
IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT) {
+   /*
+* Real hardware does not deliver the irq so
+* immediately during eoi broadcast, so we need
+* to emulate this behavior. Otherwise, for
+* guests who has not registered handler of a
+* level irq, this irq would be injected
+* immediately after guest enables interrupt
+* (which happens usually at the end of the
+* common interrupt routine). This would lead

Re: [Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-10 Thread Zhang Haoyu
>+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
>+{
>+  int i, ret;
>+  struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
>+   eoi_inject.work);
>+  spin_lock(&ioapic->lock);
>+  for (i = 0; i < IOAPIC_NUM_PINS; i++) {
>+  union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
>+
>+  if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
>+  continue;
>+
>+  if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
>+  ret = ioapic_service(ioapic, i, false);
>+  }
>+  spin_unlock(&ioapic->lock);
>+}
>+
Delete unused variable, ret.

Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI write and irq delivery (system bus latency?). So we need to emulate
this behavior. Otherwise, a guest that hasn't registered a proper irq handler
would stay in the interrupt routine, as this irq would be re-injected
immediately after the guest enables interrupts. This would prevent the guest
from moving forward and it may miss the possibility to get a proper irq
handler registered (one example is a windows guest resuming from hibernation).

As there's no way to tell an unhandled irq from newly raised ones, this patch
solves the problem by scheduling a delayed work item when the count of irqs
injected during eoi broadcast exceeds a threshold value. After this patch, the
guest can move a little forward when there's no suitable irq handler, in case
it registers one very soon, and for a guest that has a bad irq detection
routine (such as note_interrupt() in linux), this bad irq would be recognized
as quickly as before.

Cc: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
Signed-off-by: Zhang Haoyu 
---
 include/trace/events/kvm.h | 20 ++
 virt/kvm/ioapic.c  | 51 --
 virt/kvm/ioapic.h  |  6 ++
 3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 908925a..b05f688 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -95,6 +95,26 @@ TRACE_EVENT(kvm_ioapic_set_irq,
  __entry->coalesced ? " (coalesced)" : "")
 );
 
+TRACE_EVENT(kvm_ioapic_delayed_eoi_inj,
+   TP_PROTO(__u64 e),
+   TP_ARGS(e),
+
+   TP_STRUCT__entry(
+   __field(__u64,  e   )
+   ),
+
+   TP_fast_assign(
+   __entry->e  = e;
+   ),
+
+   TP_printk("dst %x vec=%u (%s|%s|%s%s)%s",
+ (u8)(__entry->e >> 56), (u8)__entry->e,
+ __print_symbolic((__entry->e >> 8 & 0x7), kvm_deliver_mode),
+ (__entry->e & (1<<11)) ? "logical" : "physical",
+ (__entry->e & (1<<15)) ? "level" : "edge",
+ (__entry->e & (1<<16)) ? "|masked" : "")
+);
+
 TRACE_EVENT(kvm_msi_set_irq,
TP_PROTO(__u64 address, __u64 data),
TP_ARGS(address, data),
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index e8ce34c..a36c4c4 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -405,6 +405,24 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int 
irq_source_id)
spin_unlock(&ioapic->lock);
 }
 
+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
+{
+   int i;
+   struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
+eoi_inject.work);
+   spin_lock(&ioapic->lock);
+   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+   if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
+   continue;
+
+   if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
+   ioapic_service(ioapic, i, false);
+   }
+   spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
@@ -435,8 +453,33 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
 
ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
ent->fields.remote_irr = 0;
-   if (ioapic->irr & (1 << i))
-   ioapic_service(ioapic, i, false);
+   if (!ent->fields.mask && (ioapic->irr & (1 << i))) {
+   

[Qemu-devel] [PATCH] kvm: ioapic: conditionally delay irq delivery during eoi broadcast

2014-09-10 Thread Zhang Haoyu
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI write and irq delivery (system bus latency?). So we need to emulate
this behavior. Otherwise, a guest that hasn't registered a proper irq handler
would stay in the interrupt routine, as this irq would be re-injected
immediately after the guest enables interrupts. This would prevent the guest
from moving forward and it may miss the possibility to get a proper irq
handler registered (one example is a windows guest resuming from hibernation).

As there's no way to tell an unhandled irq from newly raised ones, this patch
solves the problem by scheduling a delayed work item when the count of irqs
injected during eoi broadcast exceeds a threshold value. After this patch, the
guest can move a little forward when there's no suitable irq handler, in case
it registers one very soon, and for a guest that has a bad irq detection
routine (such as note_interrupt() in linux), this bad irq would be recognized
as quickly as before.

Cc: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
Signed-off-by: Zhang Haoyu 
---
 include/trace/events/kvm.h | 20 ++
 virt/kvm/ioapic.c  | 51 --
 virt/kvm/ioapic.h  |  6 ++
 3 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 908925a..b05f688 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -95,6 +95,26 @@ TRACE_EVENT(kvm_ioapic_set_irq,
  __entry->coalesced ? " (coalesced)" : "")
 );
 
+TRACE_EVENT(kvm_ioapic_delayed_eoi_inj,
+   TP_PROTO(__u64 e),
+   TP_ARGS(e),
+
+   TP_STRUCT__entry(
+   __field(__u64,  e   )
+   ),
+
+   TP_fast_assign(
+   __entry->e  = e;
+   ),
+
+   TP_printk("dst %x vec=%u (%s|%s|%s%s)%s",
+ (u8)(__entry->e >> 56), (u8)__entry->e,
+ __print_symbolic((__entry->e >> 8 & 0x7), kvm_deliver_mode),
+ (__entry->e & (1<<11)) ? "logical" : "physical",
+ (__entry->e & (1<<15)) ? "level" : "edge",
+ (__entry->e & (1<<16)) ? "|masked" : "")
+);
+
 TRACE_EVENT(kvm_msi_set_irq,
TP_PROTO(__u64 address, __u64 data),
TP_ARGS(address, data),
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index e8ce34c..a36c4c4 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -405,6 +405,24 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int 
irq_source_id)
spin_unlock(&ioapic->lock);
 }
 
+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
+{
+   int i, ret;
+   struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
+eoi_inject.work);
+   spin_lock(&ioapic->lock);
+   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+   if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
+   continue;
+
+   if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
+   ret = ioapic_service(ioapic, i, false);
+   }
+   spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
struct kvm_ioapic *ioapic, int vector, int trigger_mode)
 {
@@ -435,8 +453,33 @@ static void __kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu,
 
ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
ent->fields.remote_irr = 0;
-   if (ioapic->irr & (1 << i))
-   ioapic_service(ioapic, i, false);
+   if (!ent->fields.mask && (ioapic->irr & (1 << i))) {
+   ++ioapic->irq_eoi[i];
+   if (ioapic->irq_eoi[i] == 
IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT) {
+   /*
+* Real hardware does not deliver the irq so
+* immediately during eoi broadcast, so we need
+* to emulate this behavior. Otherwise, for
+* guests who has not registered handler of a
+* level irq, this irq would be injected
+* immediately after guest enables interrupt
+* (which happens usually at the end of the
+* common interrupt routine). This would lead

Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-07 Thread Zhang Haoyu
Hi, Paolo, Amit,
any ideas?

Thanks,
Zhang Haoyu


On 2014-9-4 15:56, Zhang Haoyu wrote:
>>>>> If virtio-blk and virtio-serial share an IRQ, the guest operating system 
>>>>> has to check each virtqueue for activity. Maybe there is some 
>>>>> inefficiency doing that.
>>>>> AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console) even 
>>>>> if everything is unused.
>>>> That could be the case if MSI is disabled.
>>> Do the windows virtio drivers enable MSIs, in their inf file?
>> It depends on the version of the drivers, but it is a reasonable guess
>> at what differs between Linux and Windows.  Haoyu, can you give us the
>> output of lspci from a Linux guest?
>>
> I made a test with fio on rhel-6.5 guest, the same degradation happened too,  
> this degradation can be reproduced on rhel6.5 guest 100%.
> virtio_console module installed:
> 64K-write-sequence: 285 MBPS, 4380 IOPS
> virtio_console module uninstalled:
> 64K-write-sequence: 370 MBPS, 5670 IOPS
>
> And, virio-blk's interrupt mode always is MSI, no matter if virtio_console 
> module is installed or uninstalled.
> 25:2245933   PCI-MSI-edge  virtio1-requests
>
> fio command:
> fio -filename /dev/vda -direct=1 -iodepth=1 -thread -rw=write -ioengine=psync 
> -bs=64k -size=30G -numjobs=1 -name=mytest
>
> QEMU comamnd:
> /usr/bin/kvm -id 5497356709352 -chardev 
> socket,id=qmp,path=/var/run/qemu-server/5497356709352.qmp,server,nowait -mon 
> chardev=qmp,mode=control -vnc :0,websocket,to=200 -enable-kvm -pidfile 
> /var/run/qemu-server/5497356709352.pid -daemonize -name io-test-rhel-6.5 -smp 
> sockets=1,cores=1 -cpu core2duo -nodefaults -vga cirrus -no-hpet -k en-us 
> -boot menu=on,splash-time=8000 -m 4096 -usb -drive 
> file=/sf/data/local/zhanghaoyu/rhel-server-6.5-x86_64-dvd.iso,if=none,id=drive-ide0,media=cdrom,aio=native,forecast=disable
>  -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200 
> -drive 
> file=/sf/data/local/images/host-1051721dff13/io-test-rhel-6.5.vm/vm-disk-1.qcow2,if=none,id=drive-virtio1,cache=none,aio=native
>  -device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb 
> -drive 
> file=/sf/data/local/images/host-1051721dff13/io-test-rhel-6.5.vm/vm-disk-2.qcow2,if=none,id=drive-virtio2,cache=none,aio=native
>  -device virtio-blk-pci,drive=drive-virtio2,id=virtio2,bus=pci
>  .0,addr=0xc,bootindex=101 -netdev 
> type=tap,id=net0,ifname=164922379979200,script=/sf/etc/kvm/vtp-bridge,vhost=on,vhostforce=on
>  -device 
> virtio-net-pci,mac=FE:FC:FE:C6:47:F6,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
>  -rtc driftfix=slew,clock=rt -global kvm-pit.lost_tick_policy=discard -global 
> PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -chardev 
> socket,path=/run/virtser/1649223799792.sock,server,nowait,id=channelser 
> -device virtio-serial,vectors=4 -device 
> virtserialport,chardev=channelser,name=channelser.virtserial0.0
>
> [environment]
> Host:linux-3.10(RHEL7-rc1)
> QEMU: qemu-2.0.1
> Guest: RHEL6.5
>
> # lspci -tv
> -[:00]-+-00.0  Intel Corporation 440FX - 82441FX PMC [Natoma]
>+-01.0  Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
>+-01.1  Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
>+-01.2  Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II]
>+-01.3  Intel Corporation 82371AB/EB/MB PIIX4 ACPI
>+-02.0  Cirrus Logic GD 5446
>+-03.0  Red Hat, Inc Virtio console
>+-0b.0  Red Hat, Inc Virtio block device
>+-0c.0  Red Hat, Inc Virtio block device
>\-12.0  Red Hat, Inc Virtio network device
>
> # lspci -vvv
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> Subsystem: Red Hat, Inc Qemu virtual machine
> Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> SERR- 
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> Subsystem: Red Hat, Inc Qemu virtual machine
> Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- 
> SERR- 
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 
> (prog-if 80 [Master])
> Subsystem: Red Hat, Inc Qemu virtual machine
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap- 66MHz- UDF- FastB2B+ ParErr- D

[Qemu-devel] [question] git clone kvm.git failed

2014-09-04 Thread Zhang Haoyu
Hi, all
I encountered the errors below while cloning kvm.git:

# git clone git://git.kernel.org/pub/scm/virt/kvm/kvm.git kvm_0905
Cloning into 'kvm_0905'...
remote: Counting objects: 3819711, done.
remote: Compressing objects: 100% (575699/575699), done.
remote: Total 3819711 (delta 3219203), reused 3812285 (delta 3211836)
Receiving objects: 100% (3819711/3819711), 804.71 MiB | 122 KiB/s, done.
Resolving deltas: 100% (3219203/3219203), done.
error: unable to create file include/linux/types.h (File too large)
error: unable to create file include/linux/u64_stats_sync.h (File too large)
error: unable to create file include/linux/uaccess.h (File too large)
error: unable to create file include/linux/ucb1400.h (File too large)
error: unable to create file include/linux/ucs2_string.h (File too large)
error: unable to create file include/linux/udp.h (File too large)
error: unable to create file include/linux/uidgid.h (File too large)
error: unable to create file include/linux/uinput.h (File too large)
error: unable to create file include/linux/uio.h (File too large)
error: unable to create file include/linux/uio_driver.h (File too large)
fatal: cannot create directory at 'include/linux/unaligned': File too large

How to resolve these errors?

Thanks,
Zhang Haoyu




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-04 Thread Zhang Haoyu
>> > > If virtio-blk and virtio-serial share an IRQ, the guest operating system 
>> > > has to check each virtqueue for activity. Maybe there is some 
>> > > inefficiency doing that.
>> > > AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console) even 
>> > > if everything is unused.
>> > 
>> > That could be the case if MSI is disabled.
>> 
>> Do the windows virtio drivers enable MSIs, in their inf file?
>
>It depends on the version of the drivers, but it is a reasonable guess
>at what differs between Linux and Windows.  Haoyu, can you give us the
>output of lspci from a Linux guest?
>
I made a test with fio on rhel-6.5 guest, the same degradation happened too,  
this degradation can be reproduced on rhel6.5 guest 100%.
virtio_console module installed:
64K-write-sequence: 285 MBPS, 4380 IOPS
virtio_console module uninstalled:
64K-write-sequence: 370 MBPS, 5670 IOPS

And, virtio-blk's interrupt mode is always MSI, no matter whether the
virtio_console module is installed or uninstalled.
25:2245933   PCI-MSI-edge  virtio1-requests

fio command:
fio -filename /dev/vda -direct=1 -iodepth=1 -thread -rw=write -ioengine=psync 
-bs=64k -size=30G -numjobs=1 -name=mytest

QEMU comamnd:
/usr/bin/kvm -id 5497356709352 -chardev 
socket,id=qmp,path=/var/run/qemu-server/5497356709352.qmp,server,nowait -mon 
chardev=qmp,mode=control -vnc :0,websocket,to=200 -enable-kvm -pidfile 
/var/run/qemu-server/5497356709352.pid -daemonize -name io-test-rhel-6.5 -smp 
sockets=1,cores=1 -cpu core2duo -nodefaults -vga cirrus -no-hpet -k en-us -boot 
menu=on,splash-time=8000 -m 4096 -usb -drive 
file=/sf/data/local/zhanghaoyu/rhel-server-6.5-x86_64-dvd.iso,if=none,id=drive-ide0,media=cdrom,aio=native,forecast=disable
 -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200 -drive 
file=/sf/data/local/images/host-1051721dff13/io-test-rhel-6.5.vm/vm-disk-1.qcow2,if=none,id=drive-virtio1,cache=none,aio=native
 -device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb 
-drive 
file=/sf/data/local/images/host-1051721dff13/io-test-rhel-6.5.vm/vm-disk-2.qcow2,if=none,id=drive-virtio2,cache=none,aio=native
 -device virtio-blk-pci,drive=drive-virtio2,id=virtio2,bus=pci
 .0,addr=0xc,bootindex=101 -netdev 
type=tap,id=net0,ifname=164922379979200,script=/sf/etc/kvm/vtp-bridge,vhost=on,vhostforce=on
 -device 
virtio-net-pci,mac=FE:FC:FE:C6:47:F6,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
 -rtc driftfix=slew,clock=rt -global kvm-pit.lost_tick_policy=discard -global 
PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -chardev 
socket,path=/run/virtser/1649223799792.sock,server,nowait,id=channelser -device 
virtio-serial,vectors=4 -device 
virtserialport,chardev=channelser,name=channelser.virtserial0.0

[environment]
Host:linux-3.10(RHEL7-rc1)
QEMU: qemu-2.0.1
Guest: RHEL6.5

# lspci -tv
-[:00]-+-00.0  Intel Corporation 440FX - 82441FX PMC [Natoma]
   +-01.0  Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
   +-01.1  Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
   +-01.2  Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II]
   +-01.3  Intel Corporation 82371AB/EB/MB PIIX4 ACPI
   +-02.0  Cirrus Logic GD 5446
   +-03.0  Red Hat, Inc Virtio console
   +-0b.0  Red Hat, Inc Virtio block device
   +-0c.0  Red Hat, Inc Virtio block device
   \-12.0  Red Hat, Inc Virtio network device

# lspci -vvv
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
Subsystem: Red Hat, Inc Qemu virtual machine
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
[remainder of the lspci -vvv output garbled in the archive]

Paolo




Re: [Qemu-devel] [question] e1000 interrupt storm happened because of its corresponding ioapic->irr bit always set

2014-09-03 Thread Zhang Haoyu
ould lead
>> >+* guest can't move forward and may miss the
>> >+* possibility to get proper irq handler
>> >+* registered. So we need to give some breath to
>> >+* guest. TODO: 1 is too long?
>> >+*/
>> >+   schedule_delayed_work(&ioapic->eoi_inject, 1);
>> >+   ioapic->irq_eoi = 0;
>> -+   ioapic->irq_eoi = 0;
>> ++   ioapic->irq_eoi[i] = 0;
>> >+   } else {
>> >+   ioapic_service(ioapic, i);
>> >+   }
>> >+   }
>> ++   else {
>> ++   ioapic->irq_eoi[i] = 0;
>> ++   }
>> >}
>> > }
>> I think ioapic->irq_eoi can easily reach 100, because during an eoi
>> broadcast it's possible that another interrupt's irr (not the one
>> corresponding to the current eoi) is set, so ioapic->irq_eoi will keep
>> growing and before long it will reach 100.
>> I want to add "u32 irq_eoi[IOAPIC_NUM_PINS];" instead of "u32 irq_eoi;".
>> Any ideas?
>> 
>> Zhang Haoyu
>
>I'm a bit concerned how this will affect realtime guests.
>Worth adding a flag to enable this, so that e.g. virtio is not
>affected?
>
Your concern is reasonable.
With Jason's original patch applied, the virtio interrupt delay sometimes exceeds 
4ms (my host's HZ=250), but that happens very rarely.
With my change on top of it (a per-irq counter instead of a total irq counter), a 
delayed virtio interrupt happens even more rarely, and after switching from 100 to 
1000 in "if (ioapic->irq_eoi[i] == 1000)" I ran a 10-minute test in which no delayed 
virtio interrupt occurred at all.
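For clarity, below is a minimal, self-contained C sketch of the per-pin counter 
logic being proposed (a userspace model for illustration only, not the kernel 
patch itself; the toy_ioapic struct, the EOI_IRQ_THRESHOLD constant and the stub 
helpers are made-up names, the real logic lives in __kvm_ioapic_update_eoi()):

#include <stdio.h>
#include <string.h>

#define IOAPIC_NUM_PINS   24
#define EOI_IRQ_THRESHOLD 1000  /* per-pin limit discussed above (100 in the original patch) */

struct toy_ioapic {
    unsigned int irr;                        /* pending-irq bitmap, one bit per pin */
    unsigned int irq_eoi[IOAPIC_NUM_PINS];   /* per-pin re-injection counter */
};

/* Stand-in for ioapic_service(): would deliver the irq to the guest here. */
static void toy_service(struct toy_ioapic *ioapic, int pin)
{
    (void)ioapic;
    (void)pin;
}

/* Stand-in for schedule_delayed_work(&ioapic->eoi_inject, 1). */
static void toy_defer(struct toy_ioapic *ioapic)
{
    (void)ioapic;
    printf("pin looks stuck: defer the re-injection to delayed work\n");
}

/*
 * Per-pin variant of the EOI handling: if the pin is still pending right
 * after its EOI, count it; at the threshold, back off and defer the
 * re-injection; otherwise re-inject immediately.  A pin that is no longer
 * pending gets its counter reset, so only a genuinely stuck pin can reach
 * the threshold.
 */
static void handle_eoi(struct toy_ioapic *ioapic, int pin)
{
    if (ioapic->irr & (1u << pin)) {
        if (++ioapic->irq_eoi[pin] == EOI_IRQ_THRESHOLD) {
            toy_defer(ioapic);
            ioapic->irq_eoi[pin] = 0;
        } else {
            toy_service(ioapic, pin);
        }
    } else {
        ioapic->irq_eoi[pin] = 0;
    }
}

static void reset_counters(struct toy_ioapic *ioapic)
{
    /* sizeof(), not IOAPIC_NUM_PINS: the entries are wider than one byte. */
    memset(ioapic->irq_eoi, 0, sizeof(ioapic->irq_eoi));
}

int main(void)
{
    struct toy_ioapic ioapic = { .irr = 1u << 3 };
    int i;

    reset_counters(&ioapic);
    for (i = 0; i < EOI_IRQ_THRESHOLD; i++)
        handle_eoi(&ioapic, 3);   /* the 1000th consecutive hit takes the deferred path */
    return 0;
}

Because the counter is reset whenever a pin's irr bit is clear, an unrelated 
interrupt raised during an EOI broadcast can no longer push a shared counter 
toward the threshold, which is exactly the concern about the single irq_eoi above.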

Thanks,
Zhang Haoyu

>
>> >
>> > -375,12 +414,14  void kvm_ioapic_reset(struct 
>> > kvm_ioapic *ioapic)
>> > {
>> >int i;
>> >
>> >+   cancel_delayed_work_sync(&ioapic->eoi_inject);
>> >for (i = 0; i < IOAPIC_NUM_PINS; i++)
>> >ioapic->redirtbl[i].fields.mask = 1;
>> >ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS;
>> >ioapic->ioregsel = 0;
>> >ioapic->irr = 0;
>> >ioapic->id = 0;
>> >+   ioapic->irq_eoi = 0;
>> -+   ioapic->irq_eoi = 0;
>> ++   memset(ioapic->irq_eoi, 0x00, sizeof(ioapic->irq_eoi));
>> >update_handled_vectors(ioapic);
>> > }
>> >
>> > -398,6 +439,7  int kvm_ioapic_init(struct kvm *kvm)
>> >if (!ioapic)
>> >return -ENOMEM;
>> >spin_lock_init(&ioapic->lock);
>> >+   INIT_DELAYED_WORK(&ioapic->eoi_inject, kvm_ioapic_eoi_inject_work);
>> >kvm->arch.vioapic = ioapic;
>> >kvm_ioapic_reset(ioapic);
>> >kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
>> > -418,6 +460,7  void kvm_ioapic_destroy(struct kvm 
>> > *kvm)
>> > {
>> >struct kvm_ioapic *ioapic = kvm->arch.vioapic;
>> >
>> >+   cancel_delayed_work_sync(&ioapic->eoi_inject);
>> >if (ioapic) {
>> >kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &ioapic->dev);
>> >kvm->arch.vioapic = NULL;
>> >diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
>> >index 0b190c3..8938e66 100644
>> >--- a/virt/kvm/ioapic.h
>> >+++ b/virt/kvm/ioapic.h
>> > -47,6 +47,8  struct kvm_ioapic {
>> >void (*ack_notifier)(void *opaque, int irq);
>> >spinlock_t lock;
>> >DECLARE_BITMAP(handled_vectors, 256);
>> >+   struct delayed_work eoi_inject;
>> >+   u32 irq_eoi;
>> -+   u32 irq_eoi;
>> ++   u32 irq_eoi[IOAPIC_NUM_PINS];
>> > };
>> >
>> > #ifdef DEBUG
>> 




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-03 Thread Zhang Haoyu
>> >>> Hi, all
>> >>> 
>> >>> I start a VM with virtio-serial (default ports number: 31), and found 
>> >>> that virtio-blk performance degradation happened, about 25%, this 
>> >>> problem can be reproduced 100%.
>> >>> without virtio-serial:
>> >>> 4k-read-random 1186 IOPS
>> >>> with virtio-serial:
>> >>> 4k-read-random 871 IOPS
>> >>> 
>> >>> but if use max_ports=2 option to limit the max number of virio-serial 
>> >>> ports, then the IO performance degradation is not so serious, about 5%.
>> >>> 
>> >>> And, ide performance degradation does not happen with virtio-serial.
>> >>
>> >>Pretty sure it's related to MSI vectors in use.  It's possible that
>> >>the virtio-serial device takes up all the avl vectors in the guests,
>> >>leaving old-style irqs for the virtio-blk device.
>> >>
>> >I don't think so,
>> >I use iometer to test 64k-read(or write)-sequence case, if I disable the 
>> >virtio-serial dynamically via device manager->virtio-serial => disable,
>> >then the performance get promotion about 25% immediately, then I re-enable 
>> >the virtio-serial via device manager->virtio-serial => enable,
>> >the performance got back again, very obvious.
>> add comments:
>> Although the virtio-serial is enabled, I don't use it at all, the 
>> degradation still happened.
>
>Using the vectors= option as mentioned below, you can restrict the
>number of MSI vectors the virtio-serial device gets.  You can then
>confirm whether it's MSI that's related to these issues.
>
I use "-device virtio-serial,vectors=4" instead of "-device virtio-serial", but 
the degradation still happened, nothing changed.
with virtio-serial enabled:
64k-write-sequence: 4200 IOPS
with virtio-serial disabled:
64k-write-sequence: 5300 IOPS

How can I confirm whether MSI is in use in Windows?

Thanks,
Zhang Haoyu

>> >So, I think it has no business with legacy interrupt mode, right?
>> >
>> >I am going to observe the difference of perf top data on qemu and perf kvm 
>> >stat data when disable/enable virtio-serial in guest,
>> >and the difference of perf top data on guest when disable/enable 
>> >virtio-serial in guest,
>> >any ideas?
>> >
>> >Thanks,
>> >Zhang Haoyu
>> >>If you restrict the number of vectors the virtio-serial device gets
>> >>(using the -device virtio-serial-pci,vectors= param), does that make
>> >>things better for you?




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-01 Thread Zhang Haoyu
>> >> Hi, all
>> >> 
>> >> I start a VM with virtio-serial (default ports number: 31), and found 
>> >> that virtio-blk performance degradation happened, about 25%, this problem 
>> >> can be reproduced 100%.
>> >> without virtio-serial:
>> >> 4k-read-random 1186 IOPS
>> >> with virtio-serial:
>> >> 4k-read-random 871 IOPS
>> >> 
>> >> but if use max_ports=2 option to limit the max number of virio-serial 
>> >> ports, then the IO performance degradation is not so serious, about 5%.
>> >> 
>> >> And, ide performance degradation does not happen with virtio-serial.
>> >
>> >Pretty sure it's related to MSI vectors in use.  It's possible that
>> >the virtio-serial device takes up all the avl vectors in the guests,
>> >leaving old-style irqs for the virtio-blk device.
>> >
>> I don't think so,
>> I use iometer to test 64k-read(or write)-sequence case, if I disable the 
>> virtio-serial dynamically via device manager->virtio-serial => disable,
>> then the performance get promotion about 25% immediately, then I re-enable 
>> the virtio-serial via device manager->virtio-serial => enable,
>> the performance got back again, very obvious.
>> So, I think it has no business with legacy interrupt mode, right?
>> 
>> I am going to observe the difference of perf top data on qemu and perf kvm 
>> stat data when disable/enable virtio-serial in guest,
>> and the difference of perf top data on guest when disable/enable 
>> virtio-serial in guest,
>> any ideas?
>
>So it's a windows guest; it could be something windows driver
>specific, then?  Do you see the same on Linux guests too?
>
I suspect it is Windows-driver specific, too.
I have not tested a Linux guest yet; I'll test it later.

Thanks,
Zhang Haoyu
>   Amit




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-01 Thread Zhang Haoyu
>>> Hi, all
>>> 
>>> I start a VM with virtio-serial (default ports number: 31), and found that 
>>> virtio-blk performance degradation happened, about 25%, this problem can be 
>>> reproduced 100%.
>>> without virtio-serial:
>>> 4k-read-random 1186 IOPS
>>> with virtio-serial:
>>> 4k-read-random 871 IOPS
>>> 
>>> but if use max_ports=2 option to limit the max number of virio-serial 
>>> ports, then the IO performance degradation is not so serious, about 5%.
>>> 
>>> And, ide performance degradation does not happen with virtio-serial.
>>
>>Pretty sure it's related to MSI vectors in use.  It's possible that
>>the virtio-serial device takes up all the avl vectors in the guests,
>>leaving old-style irqs for the virtio-blk device.
>>
>I don't think so,
>I use iometer to test 64k-read(or write)-sequence case, if I disable the 
>virtio-serial dynamically via device manager->virtio-serial => disable,
>then the performance get promotion about 25% immediately, then I re-enable the 
>virtio-serial via device manager->virtio-serial => enable,
>the performance got back again, very obvious.
To add: although virtio-serial is enabled, I don't use it at all, and the 
degradation still happens.

>So, I think it has no business with legacy interrupt mode, right?
>
>I am going to observe the difference of perf top data on qemu and perf kvm 
>stat data when disable/enable virtio-serial in guest,
>and the difference of perf top data on guest when disable/enable virtio-serial 
>in guest,
>any ideas?
>
>Thanks,
>Zhang Haoyu
>>If you restrict the number of vectors the virtio-serial device gets
>>(using the -device virtio-serial-pci,vectors= param), does that make
>>things better for you?
>>
>>
>>  Amit




Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-09-01 Thread Zhang Haoyu
>> Hi, all
>> 
>> I start a VM with virtio-serial (default ports number: 31), and found that 
>> virtio-blk performance degradation happened, about 25%, this problem can be 
>> reproduced 100%.
>> without virtio-serial:
>> 4k-read-random 1186 IOPS
>> with virtio-serial:
>> 4k-read-random 871 IOPS
>> 
>> but if use max_ports=2 option to limit the max number of virio-serial ports, 
>> then the IO performance degradation is not so serious, about 5%.
>> 
>> And, ide performance degradation does not happen with virtio-serial.
>
>Pretty sure it's related to MSI vectors in use.  It's possible that
>the virtio-serial device takes up all the avl vectors in the guests,
>leaving old-style irqs for the virtio-blk device.
>
I don't think so.
I use iometer to test the 64k-read (or -write) sequence case: if I disable 
virtio-serial dynamically via Device Manager -> virtio-serial => disable, the 
performance improves by about 25% immediately, and when I re-enable virtio-serial 
via Device Manager -> virtio-serial => enable, the performance drops back again, 
very noticeably.
So I think it has nothing to do with legacy interrupt mode, right?

I am going to compare perf top data on QEMU and perf kvm stat data with 
virtio-serial disabled/enabled in the guest, and also perf top data inside the 
guest with virtio-serial disabled/enabled.
Any ideas?

Thanks,
Zhang Haoyu
>If you restrict the number of vectors the virtio-serial device gets
>(using the -device virtio-serial-pci,vectors= param), does that make
>things better for you?
>
>
>   Amit




[Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial

2014-08-29 Thread Zhang Haoyu
Hi, all

I start a VM with virtio-serial (default number of ports: 31) and find that 
virtio-blk performance degrades by about 25%; this problem is 100% reproducible.
without virtio-serial:
4k-read-random 1186 IOPS
with virtio-serial:
4k-read-random 871 IOPS

But if I use the max_ports=2 option to limit the maximum number of virtio-serial 
ports, the IO performance degradation is much less serious, about 5%.

And IDE performance does not degrade with virtio-serial present.

[environment]
Host OS: linux-3.10
QEMU: 2.0.1
Guest OS: Windows Server 2008

[qemu command]
/usr/bin/kvm -id 1587174272642 -chardev 
socket,id=qmp,path=/var/run/qemu-server/1587174272642.qmp,server,nowait -mon 
chardev=qmp,mode=control -vnc :0,websocket,to=200 -enable-kvm -pidfile 
/var/run/qemu-server/1587174272642.pid -daemonize -name win2008-32 -smp 
sockets=1,cores=1 -cpu core2duo -nodefaults -vga cirrus -no-hpet -k en-us -boot 
menu=on,splash-time=8000 -m 2048 -usb -drive 
if=none,id=drive-ide0,media=cdrom,aio=native -device 
ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200 -drive 
file=/sf/data/local/images/host-00e081de43d7/cea072c4294f/win2008-32.vm/vm-disk-1.qcow2,if=none,id=drive-ide2,cache=none,aio=native
 -device ide-hd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=100 -netdev 
type=tap,id=net0,ifname=158717427264200,script=/sf/etc/kvm/vtp-bridge -device 
e1000,mac=FE:FC:FE:D3:F9:2B,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300
 -rtc driftfix=slew,clock=rt,base=localtime -global 
kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1

Any ideas?

Thanks,
Zhang Haoyu




Re: [Qemu-devel] [question] e1000 interrupt storm happened because of its corresponding ioapic->irr bit always set

2014-08-28 Thread Zhang Haoyu
Hi Yang, Gleb, Michael,
Could you please help review the patch below?

Thanks,
Zhang Haoyu

>> Hi Jason,
>> I tested below patch, it's okay, the e1000 interrupt storm disappeared.
>> But I am going to make a bit change on it, could you help review it?
>>
>>> Currently, we call ioapic_service() immediately when we find the irq is 
>>> still
>>> active during eoi broadcast. But for real hardware, there's some dealy 
>>> between
>>> the EOI writing and irq delivery (system bus latency?). So we need to 
>>> emulate
>>> this behavior. Otherwise, for a guest who haven't register a proper irq 
>>> handler
>>> , it would stay in the interrupt routine as this irq would be re-injected
>>> immediately after guest enables interrupt. This would lead guest can't move
>>> forward and may miss the possibility to get proper irq handler registered 
>>> (one
>>> example is windows guest resuming from hibernation).
>>>
>>> As there's no way to differ the unhandled irq from new raised ones, this 
>>> patch
>>> solve this problems by scheduling a delayed work when the count of irq 
>>> injected
>>> during eoi broadcast exceeds a threshold value. After this patch, the guest 
>>> can
>>> move a little forward when there's no suitable irq handler in case it may
>>> register one very soon and for guest who has a bad irq detection routine ( 
>>> such
>>> as note_interrupt() in linux ), this bad irq would be recognized soon as in 
>>> the
>>> past.
>>>
>>> Signed-off-by: Jason Wang  redhat.com>
>>> ---
>>> virt/kvm/ioapic.c |   47 +--
>>> virt/kvm/ioapic.h |2 ++
>>> 2 files changed, 47 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
>>> index dcaf272..892253e 100644
>>> --- a/virt/kvm/ioapic.c
>>> +++ b/virt/kvm/ioapic.c
>>> -221,6 +221,24  int kvm_ioapic_set_irq(struct 
>>> kvm_ioapic *ioapic, int irq, int level)
>>> return ret;
>>> }
>>>
>>> +static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
>>> +{
>>> +   int i, ret;
>>> +   struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
>>> +eoi_inject.work);
>>> +   spin_lock(&ioapic->lock);
>>> +   for (i = 0; i < IOAPIC_NUM_PINS; i++) {
>>> +   union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
>>> +
>>> +   if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
>>> +   continue;
>>> +
>>> +   if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
>>> +   ret = ioapic_service(ioapic, i);
>>> +   }
>>> +   spin_unlock(&ioapic->lock);
>>> +}
>>> +
>>> static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
>>>  int trigger_mode)
>>> {
>>> -249,8 +267,29  static void 
>>> __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
>>>
>>> ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
>>> ent->fields.remote_irr = 0;
>>> -   if (!ent->fields.mask && (ioapic->irr & (1 << i)))
>>> -   ioapic_service(ioapic, i);
>>> +   if (!ent->fields.mask && (ioapic->irr & (1 << i))) {
>>> +   ++ioapic->irq_eoi;
>> -+   ++ioapic->irq_eoi;
>> ++   ++ioapic->irq_eoi[i];
>>> +   if (ioapic->irq_eoi == 100) {
>> -+   if (ioapic->irq_eoi == 100) {
>> ++   if (ioapic->irq_eoi[i] == 100) {
>>> +   /*
>>> +* Real hardware does not deliver the irq so
>>> +* immediately during eoi broadcast, so we need
>>> +* to emulate this behavior. Otherwise, for
>>> +* guests who has not registered handler of a
>>> +* level irq, this irq would be injected
>>> +* immediately after guest enables interrupt
>>> +* (which happens usually at the end of the
>>

Re: [Qemu-devel] [question] e1000 interrupt storm happened because of its corresponding ioapic->irr bit always set

2014-08-28 Thread Zhang Haoyu
;+  }
++  else {
++  ioapic->irq_eoi[i] = 0;
++  }
>   }
> }
I think ioapic->irq_eoi is prone to reaching 100, because during an EOI broadcast 
it is possible that another interrupt's irr bit (not the one corresponding to the 
current EOI) is set, so ioapic->irq_eoi keeps growing and will reach 100 before long.
I want to use "u32 irq_eoi[IOAPIC_NUM_PINS];" instead of "u32 irq_eoi;".
Any ideas?

Zhang Haoyu
>
> -375,12 +414,14  void kvm_ioapic_reset(struct 
> kvm_ioapic *ioapic)
> {
>   int i;
>
>+  cancel_delayed_work_sync(&ioapic->eoi_inject);
>   for (i = 0; i < IOAPIC_NUM_PINS; i++)
>   ioapic->redirtbl[i].fields.mask = 1;
>   ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS;
>   ioapic->ioregsel = 0;
>   ioapic->irr = 0;
>   ioapic->id = 0;
>+  ioapic->irq_eoi = 0;
-+  ioapic->irq_eoi = 0;
++  memset(ioapic->irq_eoi, 0x00, sizeof(ioapic->irq_eoi));
>   update_handled_vectors(ioapic);
> }
>
> -398,6 +439,7  int kvm_ioapic_init(struct kvm *kvm)
>   if (!ioapic)
>   return -ENOMEM;
>   spin_lock_init(&ioapic->lock);
>+  INIT_DELAYED_WORK(&ioapic->eoi_inject, kvm_ioapic_eoi_inject_work);
>   kvm->arch.vioapic = ioapic;
>   kvm_ioapic_reset(ioapic);
>   kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
> -418,6 +460,7  void kvm_ioapic_destroy(struct kvm 
> *kvm)
> {
>   struct kvm_ioapic *ioapic = kvm->arch.vioapic;
>
>+  cancel_delayed_work_sync(&ioapic->eoi_inject);
>   if (ioapic) {
>   kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &ioapic->dev);
>   kvm->arch.vioapic = NULL;
>diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
>index 0b190c3..8938e66 100644
>--- a/virt/kvm/ioapic.h
>+++ b/virt/kvm/ioapic.h
> -47,6 +47,8  struct kvm_ioapic {
>   void (*ack_notifier)(void *opaque, int irq);
>   spinlock_t lock;
>   DECLARE_BITMAP(handled_vectors, 256);
>+  struct delayed_work eoi_inject;
>+  u32 irq_eoi;
-+  u32 irq_eoi;
++  u32 irq_eoi[IOAPIC_NUM_PINS];
> };
>
> #ifdef DEBUG




Re: [Qemu-devel] [question] e1000 interrupt storm happened because of its corresponding ioapic->irr bit always set

2014-08-27 Thread Zhang Haoyu
>>>>>> Hi, all
>>>>>>
>>>>>> I use a qemu-1.4.1/qemu-2.0.0 to run win7 guest, and encounter e1000 NIC 
>>>>>> interrupt storm, 
>>>>>> because "if (!ent->fields.mask && (ioapic->irr & (1 << i)))" is always 
>>>>>> true in __kvm_ioapic_update_eoi().
>>>>>>
>>>>>> Any ideas?
>>>>> We meet this several times: search the autoneg patches for an example of
>>>>> workaround for this in qemu, and patch kvm: ioapic: conditionally delay
>>>>> irq delivery during eoi broadcast for an workaround in kvm (rejected).
>>>>>
>>>> Thanks, Jason,
>>>> I searched "e1000 autoneg" in gmane.comp.emulators.qemu, and found below 
>>>> patches, 
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/143001/focus=143007
>>> This series is the first try to fix the guest hang during guest
>>> hibernation or driver enable/disable.
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/284105/focus=284765
>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/186159/focus=187351
>>> Those are follow-up that tries to fix the bugs introduced by the autoneg
>>> hack.
>>>> which one tries to fix this problem, or all of them?
>>> As you can see, those kinds of hacking may not as good as we expect
>>> since we don't know exactly how e1000 works. Only the register function
>>> description from Intel's manual may not be sufficient. And you can
>>> search e1000 in the archives and you can find some behaviour of e1000
>>> registers were not fictionalized like what spec said. It was really
>>> suggested to use virtio-net instead of e1000 in guest. 
>> Will the "[PATCH] kvm: ioapic: conditionally delay irq delivery during eoi 
>> broadcast" add delay to virtual interrupt injection sometimes,
>> then some time delay sensitive applications will be impacted?
>
>I don't test it too much but it only give a minor delay of 1% irq in the
>hope of guest irq handler will be registered shortly. But I suspect it's
>the bug of e1000 who inject the irq in the wrong time. Under what cases
>did you meet this issue?
Some scenarios, though not constant and not 100% reproducible, e.g. rebooting the 
VM, bringing the e1000 NIC down with ifdown, installing Kaspersky (network 
configuration is performed during the installation stage), etc.

Thanks,
Zhang Haoyu

>>
>> Thanks,
>> Zhang Haoyu




Re: [Qemu-devel] [PULL 3/3] vnc: fix screen updates

2014-08-26 Thread Zhang Haoyu
>Bug was added by 38ee14f4f33f8836fc0e209ca59c6ae8c6edf380.
>vnc_jobs_join call is missing in one code path.
>
>Reported-by: Anthony PERARD 
>Signed-off-by: Gerd Hoffmann 
>---
> ui/vnc.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/ui/vnc.c b/ui/vnc.c
>index 6c9d4f3..aac93f0 100644
>--- a/ui/vnc.c
>+++ b/ui/vnc.c
>@@ -935,6 +935,9 @@ static int vnc_update_client(VncState *vs, int has_dirty, 
>bool sync)
> }
> 
> vnc_job_push(job);
>+if (sync) {
>+vnc_jobs_join(vs);
>+}
> vs->force_update = 0;
> return n;
> }
>-- 
>1.8.3.1

What about the raw patch below?
@@ -935,14 +935,13 @@ static int vnc_update_client(VncState *vs, int has_dirty, 
bool sync)
 }
 
 vnc_job_push(job);
 vs->force_update = 0;
-return n;
 }

 if (vs->csock == -1) {
 vnc_disconnect_finish(vs);
 } else if (sync) {
 vnc_jobs_join(vs);
 }

-return 0;
+return n;
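
For readability, this is roughly how the tail of vnc_update_client() would read 
with the raw patch above applied (a sketch pieced together only from the hunks 
quoted in this thread, not checked against the tree; in particular it assumes n 
is visible at function scope, otherwise its declaration would have to be hoisted 
out of the enclosing if block as well):

        /* dirty-region scan and job setup omitted */
        vnc_job_push(job);
        vs->force_update = 0;
    }

    if (vs->csock == -1) {
        vnc_disconnect_finish(vs);
    } else if (sync) {
        vnc_jobs_join(vs);
    }

    return n;   /* was: return n inside the block above and return 0 here */
}

That is, instead of adding vnc_jobs_join() inside the early-return path, the 
early return is dropped so the existing join/disconnect handling at the end of 
the function also covers the case where a job was pushed.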

Thanks,
Zhang Haoyu



