Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Avi Kivity

 On 09/11/2010 05:20 PM, Joerg Roedel wrote:

On Sat, Sep 11, 2010 at 03:43:02PM +0200, Alexander Graf wrote:

@@ -305,6 +322,8 @@ static x86_def_t builtin_x86_defs[] = {
 CPUID_EXT3_OSVW, CPUID_EXT3_IBS */
 .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM |
 CPUID_EXT3_ABM | CPUID_EXT3_SSE4A,
+.svm_features = CPUID_SVM_NPT | CPUID_SVM_LBRV | CPUID_SVM_NRIPSAVE |
+CPUID_SVM_VMCBCLEAN,
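(For context, these CPUID_SVM_* bits are reported to the guest in CPUID leaf 0x8000000A. A rough, illustrative sketch of that mapping, assuming the usual AMD layout where EDX carries the feature bits; the helper and the ASID count below are made up, this is not the actual qemu-kvm code:)

#include <stdint.h>

#define CPUID_SVM_NPT        (1 << 0)
#define CPUID_SVM_LBRV       (1 << 1)
#define CPUID_SVM_NRIPSAVE   (1 << 3)
#define CPUID_SVM_VMCBCLEAN  (1 << 5)

/* Illustrative only: expose the cpu model's svm_features in leaf 0x8000000A. */
static void fill_svm_leaf(uint32_t svm_features,
                          uint32_t *eax, uint32_t *ebx,
                          uint32_t *ecx, uint32_t *edx)
{
    *eax = 0x01;            /* SVM revision */
    *ebx = 0x10;            /* number of ASIDs, illustrative value */
    *ecx = 0;
    *edx = svm_features;    /* NPT | LBRV | NRIPSAVE | VMCBCLEAN */
}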

Does that phenom already do all those? It does NPT, but I'm not sure
about NRIPSAVE for example.

Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
Phenoms don't have it. For the SVM features it is not that important
what the host hardware supports but what KVM can emulate. VMCBCLEAN can
be emulated without supporting it in the host for example.


Well, let's have a phenom2 type for those new features (and any other 
features the phenom 2 has).  What's the point of using the name of 
existing hardware if it doesn't match that hardware?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Alexander Graf

On 12.09.2010, at 08:05, Avi Kivity wrote:

 On 09/11/2010 05:20 PM, Joerg Roedel wrote:
 On Sat, Sep 11, 2010 at 03:43:02PM +0200, Alexander Graf wrote:
 @@ -305,6 +322,8 @@ static x86_def_t builtin_x86_defs[] = {
 CPUID_EXT3_OSVW, CPUID_EXT3_IBS */
 .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM |
 CPUID_EXT3_ABM | CPUID_EXT3_SSE4A,
 +.svm_features = CPUID_SVM_NPT | CPUID_SVM_LBRV | 
 CPUID_SVM_NRIPSAVE |
 +CPUID_SVM_VMCBCLEAN,
 Does that phenom already do all those? It does NPT, but I'm not sure
 about NRIPSAVE for example.
 Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
 Phenoms don't have it. For the SVM features it is not that important
 what the host hardware supports but what KVM can emulate. VMCBCLEAN can
 be emulated without supporting it in the host for example.
 
 Well, let's have a phenom2 type for those new features (and any other 
 features the phenom 2 has).  What's the point of using the name of existing 
 hardware if it doesn't match that hardware?

Isn't that what cpu=host is there for? I don't want to see cpu type cluttering 
like we have it on ppc. I added the core2duo type for Mac OS X guests for which 
those are basically the oldest supported CPUs.

For the Phenom type, I honestly don't remember why, but there was also a good 
reason to add it. In fact, I use it today to have nested virt without -cpu host 
on hardware that's too new for my guests.

Either way, I don't think we need a phenom2 type. The additional features are 
minor enough to not really matter and all use cases I can come up with require 
either -cpu host (local virt) or -cpu phenom (migration).


Alex




[Qemu-devel] Re: [PATCH] Introduce qemu_madvise()

2010-09-12 Thread Blue Swirl
On Sat, Sep 11, 2010 at 5:05 PM, Andreas Färber andreas.faer...@web.de wrote:
 From: Andreas Färber afaer...@opensolaris.org

 vl.c has a Sun-specific hack to supply a prototype for madvise(),
 but the call site has apparently moved to arch_init.c.
 The underlying issue is that madvise() is not a POSIX function,
 therefore Solaris' _POSIX_C_SOURCE suppresses the prototype.

 Haiku doesn't implement madvise() at all.
 OpenBSD however doesn't implement posix_madvise().

 Check for posix_madvise() in configure and supply qemu_madvise() as wrapper.
 Use qemu_madvise() in arch_init.c's ram_load().

 http://www.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html

 Remaining madvise() users:
 exec.c: limited to __linux__ and/or MADV_MERGEABLE (no POSIX equivalent)
 kvm-all.c: limited to MADV_DONTFORK (no POSIX equivalent),
           otherwise runtime error if !kvm_has_sync_mmu()
 hw/virtio-balloon.c: limited to __linux__

For consistency, I'd convert all users. If an OS doesn't support a
flag, we can return something like -ENOTSUP which may be checked by
the caller if it cares.


 v1 -> v2:
 * Don't rely on posix_madvise() availability, add qemu_madvise().
  Suggested by Blue Swirl.

 Signed-off-by: Andreas Färber afaer...@opensolaris.org
 Cc: Blue Swirl blauwir...@gmail.com
 ---
  arch_init.c |    2 +-
  configure   |   11 +++
  osdep.c     |   26 ++
  osdep.h     |    4 
  vl.c        |    3 ---
  5 files changed, 42 insertions(+), 4 deletions(-)

 diff --git a/arch_init.c b/arch_init.c
 index e468c0c..a910033 100644
 --- a/arch_init.c
 +++ b/arch_init.c
 @@ -396,7 +396,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
  #ifndef _WIN32
             if (ch == 0 &&
                 (!kvm_enabled() || kvm_has_sync_mmu())) {
 -                madvise(host, TARGET_PAGE_SIZE, MADV_DONTNEED);
 +                qemu_madvise(host, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED);
             }
  #endif
         } else if (flags & RAM_SAVE_FLAG_PAGE) {
 diff --git a/configure b/configure
 index 4061cb7..5e6f3e1 100755
 --- a/configure
 +++ b/configure
 @@ -2069,6 +2069,17 @@ if compile_prog "" "" ; then
  fi

  ##
 +# check if we have posix_madvise
 +
 +cat > $TMPC << EOF
 +#include <sys/mman.h>
 +int main(void) { posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); return 0; }
 +EOF
 +if compile_prog "" "" ; then
 +    QEMU_CFLAGS="-DCONFIG_POSIX_MADVISE ${QEMU_CFLAGS}"

Please take a look around configure:2444 how to add new config flags.
I'd also add a similar check for madvise() if posix_madvise() check
fails.

 +fi
 +
 +##
  # check if trace backend exists

  sh $source_path/tracetool --$trace_backend --check-backend > /dev/null 2> /dev/null
 diff --git a/osdep.c b/osdep.c
 index 30426ff..8c09597 100644
 --- a/osdep.c
 +++ b/osdep.c
 @@ -28,6 +28,7 @@
   #include <errno.h>
   #include <unistd.h>
   #include <fcntl.h>
  +#include <sys/mman.h>

  /* Needed early for CONFIG_BSD etc. */
   #include "config-host.h"
 @@ -139,6 +140,31 @@ void qemu_vfree(void *ptr)

  #endif

 +int qemu_madvise(void *addr, size_t len, int advice)
 +{
 +#ifdef CONFIG_POSIX_MADVISE
 +    switch (advice) {
 +        case QEMU_MADV_DONTNEED:
 +            advice = POSIX_MADV_DONTNEED;
 +            break;
 +        default:
 +            fprintf(stderr, "Advice %d unhandled.\n", advice);
 +            abort();

This should be an assert, it's an internal error. It's also common to
both cases.

 +    }
 +    return posix_madvise(addr, len, advice);
 +#else

#elif defined(CONFIG_MADVISE)

 +    switch (advice) {
 +        case QEMU_MADV_DONTNEED:
 +            advice = MADV_DONTNEED;
 +            break;

case QEMU_MADV_DONTFORK:
#ifndef MADV_DONTFORK
return -ENOTSUP;
#else
advice = MADV_DONTFORK;
break;
#endif

Same goes for MADV_MERGEABLE.

 +        default:
 +            fprintf(stderr, "Advice %d unhandled.\n", advice);
 +            abort();
 +    }
 +    return madvise(addr, len, advice);

#else
return -ENOTSUP;
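Putting those comments together, the wrapper could end up looking roughly like this (a sketch only, not the final patch: it assumes CONFIG_MADVISE/CONFIG_POSIX_MADVISE flags from configure, distinct QEMU_MADV_* values as in the v2 patch, and osdep.c's usual includes of <assert.h>, <errno.h> and <sys/mman.h>):

int qemu_madvise(void *addr, size_t len, int advice)
{
#if defined(CONFIG_MADVISE)
    switch (advice) {
    case QEMU_MADV_DONTNEED:
        advice = MADV_DONTNEED;
        break;
    case QEMU_MADV_DONTFORK:
#ifndef MADV_DONTFORK
        return -ENOTSUP;      /* host lacks this flag, caller may ignore it */
#else
        advice = MADV_DONTFORK;
        break;
#endif
    default:
        assert(0);            /* internal error: unknown QEMU_MADV_* value */
    }
    return madvise(addr, len, advice);
#elif defined(CONFIG_POSIX_MADVISE)
    switch (advice) {
    case QEMU_MADV_DONTNEED:
        advice = POSIX_MADV_DONTNEED;
        break;
    default:
        return -ENOTSUP;      /* no POSIX equivalent for the other hints */
    }
    return posix_madvise(addr, len, advice);
#else
    return -ENOTSUP;
#endif
}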



Re: [Qemu-devel] Re: [PATCH v2 2/9] pcie: helper functions for pcie extended capability.

2010-09-12 Thread Blue Swirl
On Wed, Sep 8, 2010 at 5:38 PM, Wei Xu we...@cisco.com wrote:
 Isaku:

 For binary constants below, to achieve max compatibility with gcc versions,
 I recommend to change to hex (0x...):

Yes, binary constants were only added to GCC 4.3.x. Since they are
also GCC extensions with no obvious way to circumvent their use (as
with GCC attributes), they shouldn't be used.



Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Avi Kivity

 On 09/12/2010 09:16 AM, Alexander Graf wrote:


Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
Phenoms don't have it. For the SVM features it is not that important
what the host hardware supports but what KVM can emulate. VMCBCLEAN can
be emulated without supporting it in the host for example.

Well, let's have a phenom2 type for those new features (and any other features 
the phenom 2 has).  What's the point of using the name of existing hardware if 
it doesn't match that hardware?

Isn't that what cpu=host is there for? I don't want to see cpu type cluttering 
like we have it on ppc. I added the core2duo type for Mac OS X guests for which 
those are basically the oldest supported CPUs.


-cpu host is to pass all supported features into your guest.
-cpu phenom is to pretend you are running on a phenom cpu.  This is 
useful for a migration farm for which the greatest common denominator is 
a phenom.


Those are separate use cases.


For the Phenom type, I honestly don't remember why, but there was also a good 
reason to add it. In fact, I use it today to have nested virt without -cpu host 
on hardware that's too new for my guests.


Curious, what guests balk at modern hardware but are fine with phenom?


Either way, I don't think we need a phenom2 type. The additional features are 
minor enough to not really matter and all use cases I can come up with require 
either -cpu host (local virt) or -cpu phenom (migration).


I'm fine with this (or with adding phenom2).  But don't make phenom 
contain flags that real phenoms don't have.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH] Introduce qemu_madvise()

2010-09-12 Thread Andreas Färber

On 12.09.2010 at 09:20, Blue Swirl wrote:

On Sat, Sep 11, 2010 at 5:05 PM, Andreas Färber andreas.faer...@web.de 
 wrote:

Use qemu_madvise() in arch_init.c's ram_load().

http://www.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html

Remaining madvise() users:
exec.c: limited to __linux__ and/or MADV_MERGEABLE (no POSIX  
equivalent)

kvm-all.c: limited to MADV_DONTFORK (no POSIX equivalent),
  otherwise runtime error if !kvm_has_sync_mmu()
hw/virtio-balloon.c: limited to __linux__


For consistency, I'd convert all users. If an OS doesn't support a
flag, we can return something like -ENOTSUP which may be checked by
the caller if it cares.


Will do.


diff --git a/configure b/configure
index 4061cb7..5e6f3e1 100755
--- a/configure
+++ b/configure
@@ -2069,6 +2069,17 @@ if compile_prog "" "" ; then
 fi

 ##
+# check if we have posix_madvise
+
+cat > $TMPC << EOF
+#include <sys/mman.h>
+int main(void) { posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); return 0; }
+EOF
+if compile_prog "" "" ; then
+QEMU_CFLAGS="-DCONFIG_POSIX_MADVISE ${QEMU_CFLAGS}"


Please take a look around configure:2444 how to add new config flags.


I did. It's just not obvious that they find their way from the  
Makefile to a C header, unlike autoconf.



I'd also add a similar check for madvise() if posix_madvise() check
fails.


Fine with me.


diff --git a/osdep.c b/osdep.c
index 30426ff..8c09597 100644
--- a/osdep.c
+++ b/osdep.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <unistd.h>
 #include <fcntl.h>
+#include <sys/mman.h>

 /* Needed early for CONFIG_BSD etc. */
 #include "config-host.h"
@@ -139,6 +140,31 @@ void qemu_vfree(void *ptr)

 #endif

+int qemu_madvise(void *addr, size_t len, int advice)
+{
+#ifdef CONFIG_POSIX_MADVISE
+switch (advice) {
+case QEMU_MADV_DONTNEED:
+advice = POSIX_MADV_DONTNEED;
+break;
+default:
+fprintf(stderr, "Advice %d unhandled.\n", advice);
+abort();


This should be an assert, it's an internal error. It's also common to
both cases.


+}
+return posix_madvise(addr, len, advice);
+#else


#elif defined(CONFIG_MADVISE)


+switch (advice) {
+case QEMU_MADV_DONTNEED:
+advice = MADV_DONTNEED;
+break;


case QEMU_MADV_DONTFORK:
#ifndef MADV_DONTFORK
return -ENOTSUP;
#else
advice = MADV_DONTFORK;
break;
#endif

Same goes for MADV_MERGEABLE.


So you disagree with or didn't yet read Alex' suggestion of  
eliminating this mapping code in qemu_madvise() altogether?

Doing that in a sensible way would allow code to do:

#ifdef QEMU_MADV_MERGEABLE
...

as before at compile-time. The only qemu_madvise() error condition  
would then be the #else below.
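For the archives, the define-reuse approach Alex suggested would look something along these lines in osdep.h (a sketch under the assumption that configure sets CONFIG_MADVISE/CONFIG_POSIX_MADVISE; the names simply mirror the madvise()/posix_madvise() flags):

#if defined(CONFIG_MADVISE)
#define QEMU_MADV_WILLNEED   MADV_WILLNEED
#define QEMU_MADV_DONTNEED   MADV_DONTNEED
#ifdef MADV_DONTFORK
#define QEMU_MADV_DONTFORK   MADV_DONTFORK
#endif
#ifdef MADV_MERGEABLE
#define QEMU_MADV_MERGEABLE  MADV_MERGEABLE
#endif
#elif defined(CONFIG_POSIX_MADVISE)
#define QEMU_MADV_WILLNEED   POSIX_MADV_WILLNEED
#define QEMU_MADV_DONTNEED   POSIX_MADV_DONTNEED
#endif

With that, callers keep their compile-time #ifdef QEMU_MADV_MERGEABLE checks and qemu_madvise() can pass the value straight through to the host call.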



+default:
+fprintf(stderr, "Advice %d unhandled.\n", advice);
+abort();
+}
+return madvise(addr, len, advice);


#else
return -ENOTSUP;


Thanks,

Andreas


[Qemu-devel] Re: [PATCH] Introduce qemu_madvise()

2010-09-12 Thread Blue Swirl
On Sun, Sep 12, 2010 at 8:50 AM, Andreas Färber andreas.faer...@web.de wrote:
 On 12.09.2010 at 09:20, Blue Swirl wrote:

 On Sat, Sep 11, 2010 at 5:05 PM, Andreas Färber andreas.faer...@web.de
 wrote:

 Use qemu_madvise() in arch_init.c's ram_load().


 http://www.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html

 Remaining madvise() users:
 exec.c: limited to __linux__ and/or MADV_MERGEABLE (no POSIX equivalent)
 kvm-all.c: limited to MADV_DONTFORK (no POSIX equivalent),
          otherwise runtime error if !kvm_has_sync_mmu()
 hw/virtio-balloon.c: limited to __linux__

 For consistency, I'd convert all users. If an OS doesn't support a
 flag, we can return something like -ENOTSUP which may be checked by
 the caller if it cares.

 Will do.

 diff --git a/configure b/configure
 index 4061cb7..5e6f3e1 100755
 --- a/configure
 +++ b/configure
 @@ -2069,6 +2069,17 @@ if compile_prog "" "" ; then
  fi

  ##
 +# check if we have posix_madvise
 +
 +cat > $TMPC << EOF
 +#include <sys/mman.h>
 +int main(void) { posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); return 0; }
 +EOF
 +if compile_prog "" "" ; then
 +    QEMU_CFLAGS="-DCONFIG_POSIX_MADVISE ${QEMU_CFLAGS}"

 Please take a look around configure:2444 how to add new config flags.

 I did. It's just not obvious that they find their way from the Makefile to a
 C header, unlike autoconf.

 I'd also add a similar check for madvise() if posix_madvise() check
 fails.

 Fine with me.

 diff --git a/osdep.c b/osdep.c
 index 30426ff..8c09597 100644
 --- a/osdep.c
 +++ b/osdep.c
 @@ -28,6 +28,7 @@
  #include <errno.h>
  #include <unistd.h>
  #include <fcntl.h>
 +#include <sys/mman.h>

  /* Needed early for CONFIG_BSD etc. */
  #include "config-host.h"
 @@ -139,6 +140,31 @@ void qemu_vfree(void *ptr)

  #endif

 +int qemu_madvise(void *addr, size_t len, int advice)
 +{
 +#ifdef CONFIG_POSIX_MADVISE
 +    switch (advice) {
 +        case QEMU_MADV_DONTNEED:
 +            advice = POSIX_MADV_DONTNEED;
 +            break;
 +        default:
 +            fprintf(stderr, "Advice %d unhandled.\n", advice);
 +            abort();

 This should be an assert, it's an internal error. It's also common to
 both cases.

 +    }
 +    return posix_madvise(addr, len, advice);
 +#else

 #elif defined(CONFIG_MADVISE)

 +    switch (advice) {
 +        case QEMU_MADV_DONTNEED:
 +            advice = MADV_DONTNEED;
 +            break;

 case QEMU_MADV_DONTFORK:
 #ifndef MADV_DONTFORK
 return -ENOTSUP;
 #else
 advice = MADV_DONTFORK;
 break;
 #endif

 Same goes for MADV_MERGEABLE.

 So you disagree with or didn't yet read Alex' suggestion of eliminating this
 mapping code in qemu_madvise() altogether?
 Doing that in a sensible way would allow code to do:

 #ifdef QEMU_MADV_MERGEABLE
 ...

 as before at compile-time. The only qemu_madvise() error condition would
 then be the #else below.

That's one way too, if nobody cares about madvise() return values for
MADV_MERGEABLE. In any case I'd like to eliminate the #ifdefs in other
places than osdep.[ch].
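As a sketch of what that looks like at a call site (the exec.c KSM hint, for instance), the #ifdef simply disappears and the result can be checked or ignored:

    /* Merging is only an optimization hint; hosts without it just return
     * -ENOTSUP from qemu_madvise() and we carry on. */
    qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);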



Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Alexander Graf

On 12.09.2010 at 10:01, Avi Kivity a...@redhat.com wrote:

 On 09/12/2010 09:16 AM, Alexander Graf wrote:
 
 Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
 Phenoms don't have it. For the SVM features it is not that important
 what the host hardware supports but what KVM can emulate. VMCBCLEAN can
 be emulated without supporting it in the host for example.
 Well, let's have a phenom2 type for those new features (and any other 
 features the phenom 2 has).  What's the point of using the name of existing 
 hardware if it doesn't match that hardware?
 Isn't that what cpu=host is there for? I don't want to see cpu type 
 cluttering like we have it on ppc. I added the core2duo type for Mac OS X 
 guests for which those are basically the oldest supported CPUs.
 
 -cpu host is to pass all supported features into your guest.
 -cpu phenom is to pretend you are running on a phenom cpu.  This is useful 
 for a migration farm for which the greatest common denominator is a phenom.
 
 Those are separate use cases.

Exactly. And the benefit from phenom -> phenom2 is so minor that it doesn't 
make sense.

 
 For the Phenom type, I honestly don't remember why, but there was also a 
 good reason to add it. In fact, I use it today to have nested virt without 
 -cpu host on hardware that's too new for my guests.
 
 Curious, what guests balk at modern hardware but are fine with phenom?

Sles11 GA ;).

 
 Either way, I don't think we need a phenom2 type. The additional features 
 are minor enough to not really matter and all use cases I can come up with 
 require either -cpu host (local virt) or -cpu phenom (migration).
 
 I'm fine with this (or with adding phenom2).  But don't make phenom contain 
 flags that real phenoms don't have.

Those were my words :).


Alex
 



Re: [Qemu-devel] Stable patch tracking

2010-09-12 Thread Avi Kivity

 On 09/07/2010 09:24 PM, Anthony Liguori wrote:
A lot of folks tend to reply to patches and suggest them for stable 
which is certainly appreciated but is very difficult for me to track 
reliably.


So here's another way to nominate a patch for inclusion in -stable:

http://wiki.qemu.org/Releases/0.13.0

Just copy/paste the template and fill out with the appropriate details.


Can the captcha for logged in users be removed?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Avi Kivity

 On 09/12/2010 12:06 PM, Alexander Graf wrote:

For the Phenom type, I honestly don't remember why, but there was also a good 
reason to add it. In fact, I use it today to have nested virt without -cpu host 
on hardware that's too new for my guests.

Curious, what guests balk at modern hardware but are fine with phenom?

Sles11 GA ;).


Still curious, how does -cpu host break it?


Either way, I don't think we need a phenom2 type. The additional features are 
minor enough to not really matter and all use cases I can come up with require 
either -cpu host (local virt) or -cpu phenom (migration).

I'm fine with this (or with adding phenom2).  But don't make phenom contain 
flags that real phenoms don't have.

Those were my words :).


Then we are in agreement.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files

2010-09-12 Thread Avi Kivity

 On 09/11/2010 05:04 PM, Anthony Liguori wrote:

Image files have two types of data: immutable data that describes things like
image size, backing files, etc. and mutable data that includes offset and
reference count tables.



Note: even the logical size is, in principle, mutable.  If we introduce 
external snapshots, then so are backing files?


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts

2010-09-12 Thread Avi Kivity

 On 09/11/2010 05:04 PM, Anthony Liguori wrote:

This fixes a couple nasty problems relating to live migration.

1) When dealing with shared storage with weak coherence (i.e. NFS), even if
we re-read, we may end up with undesired caching.  By delaying any reads
until we absolutely have to, we decrease the likelihood of any undesirable
caching.

2) When dealing with copy-on-read, the local storage acts as a cache.  We need
to make sure to avoid any reads to avoid polluting the local cache.

+
  static void ide_identify(IDEState *s)
  {
  uint16_t *p;
@@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
return;
  }

+guess_geometry(s);
+


This can cause a disk read, no?  Shouldn't it be made asynchronous?

Or just move it to just before the guest starts?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration

2010-09-12 Thread Avi Kivity

 On 09/11/2010 05:04 PM, Anthony Liguori wrote:

Today, live migration only works when using shared storage that is fully
cache coherent using raw images.

The failure case with weak coherence (i.e. NFS) is subtle but nonetheless still
exists.  NFS only guarantees close-to-open coherence and when performing a live
migration, we do an open on the source and an open on the destination.  We
fsync() on the source before launching the destination but since we have two
simultaneous opens, we're not guaranteed coherence.

This is not necessarily a problem except that we are a bit gratuitous in reading
from the disk before launching a guest.  This means that as things stand today,
we're guaranteed to read the first 64k of the disk and as such, if a client
writes to that region during live migration, corruption will result.

The second failure condition has to do with image files (such as qcow2).  Today,
we aggressively cache metadata in all image formats and that cache is definitely
not coherent even with fully coherent shared storage.

In all image formats, we prefetch at least the L1 table in open() which means
that if there is a write operation that causes a modification to an L1 table,
corruption will ensue.

This series attempts to address both of these issues.  Technically, if an NFS
client aggressively prefetches this solution is not enough but in practice,
Linux doesn't do that.


I think it is unlikely that it will, but I prefer to be on the right 
side of the standards.  Why not delay image open until after migration 
completes?  I know your concern about the image not being there, but we 
can verify that with access().  If the image is deleted between access() 
and open() then the user has much bigger problems.


Note that on NFS, removing (and I think chmoding) a file after it is 
opened will cause subsequent data access to fail, unlike posix.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

2010-09-12 Thread Avi Kivity

 On 09/07/2010 04:41 PM, Anthony Liguori wrote:

Hi,

We've got copy-on-read and image streaming working in QED and before 
going much further, I wanted to bounce some interfaces off of the 
libvirt folks to make sure our final interface makes sense.


Here's the basic idea:

Today, you can create images based on base images that are copy on 
write.  With QED, we also support copy on read which forces a copy 
from the backing image on read requests and write requests.


Is copy on read QED specific?  It looks very similar to the commit 
command, except with I/O directions reversed.


IIRC, commit looks like

  for each sector:
      if image.mapped(sector):
          backing_image.write(sector, image.read(sector))

whereas copy-on-read looks like:

  def copy_on_read():
      set_ioprio(idle)
      for each sector:
          if not image.mapped(sector):
              image.write(sector, backing_image.read(sector))

  run_in_thread(copy_on_read)

With appropriate locking.



In additional to copy on read, we introduce a notion of streaming a 
block device which means that we search for an unallocated region of 
the leaf image and force a copy-on-read operation.


The combination of copy-on-read and streaming means that you can start 
a guest based on slow storage (like over the network) and bring in 
blocks on demand while also having a deterministic mechanism to 
complete the transfer.


The interface for copy-on-read is just an option within qemu-img 
create.  Streaming, on the other hand, requires a bit more thought.  
Today, I have a monitor command that does the following:


stream device sector offset

Which will try to stream the minimal amount of data for a single I/O 
operation and then return how many sectors were successfully streamed.


The idea about how to drive this interface is a loop like:

offset = 0;
while offset < image_size:
   wait_for_idle_time()
   count = stream(device, offset)
   offset += count



This is way too low level for the management stack.

Have you considered using the idle class I/O priority to implement 
this?  That would allow host-wide prioritization.  Not sure how to do 
cluster-wide, I don't think NFS has the concept of I/O priority.
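For what it's worth, a minimal sketch of the idle-class idea on Linux (there is no glibc wrapper for ioprio_set, so it goes through syscall(); the constants follow the kernel's ioprio ABI):

#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_SHIFT  13

/* Put the calling process/thread into the idle I/O scheduling class. */
static int set_idle_ioprio(void)
{
    return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0 /* self */,
                   IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT);
}

As noted, this only helps on the local host; it does nothing for NFS or other remote storage.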



--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Alexander Graf

On 12.09.2010 at 12:22, Avi Kivity a...@redhat.com wrote:

 On 09/12/2010 12:06 PM, Alexander Graf wrote:
 For the Phenom type, I honestly don't remember why, but there was also a 
 good reason to add it. In fact, I use it today to have nested virt without 
 -cpu host on hardware that's too new for my guests.
 Curious, what guests balk at modern hardware but are fine with phenom?
 Sles11 GA ;).
 
 Still curious, how does -cpu host break it?

Uh, something in the power management module. Just give it a try on an Istanbul 
system. Just use the 11.1 iso - the kernels are close enough.

Alex

 



[Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Andreas Färber
Public bug reported:

Running `LANG=C LC_ALL=C ./configure --prefix=...
--install=/usr/ucb/install` on Solaris 10 amd64 results in the following
errors:

./configure: bad substitution
./configure: !: not found
./configure: curl-config: not found
./configure: curl-config: not found

Error: invalid trace backend
Please choose a supported trace backend.


Unfortunately it doesn't print the line numbers of the errors. It must be 
somewhere after the check for `install`.

The first few can be resolved by running `bash ./configure ...` instead.

The check if trace backend exists hardcodes `sh
$source_path/tracetool ...` in configure. Replacing sh with bash makes
it work.

`gmake` complains Makefile:331: no file name for -include, which is a filter 
for *.d files.
`create_config` gets the 'bad substitution' error as well. Replacing sh with 
bash in rules.mak works.
etc.

To sum it up,
a) there are shell script incompatibilities with Solaris 10's /bin/sh shell, and
b) hardcoding 'sh' in configure or Makefiles seems like a bad idea.

QEMU Git 73d7434279e3905164afd02360eebe4b43c7fa (ESP: fix ESP DMA
access...)

$ uname -a
SunOS sonnengoettin 5.10 Generic_142901-03 i86pc i386 i86pc

# No banner output for /bin/sh

$ bash --version
GNU bash, version 3.00.16(1)-release (i386-pc-solaris2.10)
Copyright (C) 2004 Free Software Foundation, Inc.

** Affects: qemu
 Importance: Undecided
 Status: New


** Tags: solaris

-- 
configure and build errors on Solaris 10 due to /bin/sh usage
https://bugs.launchpad.net/bugs/636315
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: New

Bug description:
Running `LANG=C LC_ALL=C ./configure --prefix=... --install=/usr/ucb/install` 
on Solaris 10 amd64 results in the following errors:

./configure: bad substitution
./configure: !: not found
./configure: curl-config: not found
./configure: curl-config: not found

Error: invalid trace backend
Please choose a supported trace backend.


Unfortunately it doesn't print the line numbers of the errors. It must be 
somewhere after the check for `install`.

The first few can be resolved by running `bash ./configure ...` instead. 

The check if trace backend exists hardcodes `sh $source_path/tracetool ...` 
in configure. Replacing sh with bash makes it work.

`gmake` complains Makefile:331: no file name for -include, which is a filter 
for *.d files.
`create_config` gets the 'bad substitution' error as well. Replacing sh with 
bash in rules.mak works.
etc.

To sum it up,
a) there are shell script incompatibilities with Solaris 10's /bin/sh shell, and
b) hardcoding 'sh' in configure or Makefiles seems like a bad idea.

QEMU Git 73d7434279e3905164afd02360eebe4b43c7fa (ESP: fix ESP DMA access...)

$ uname -a
SunOS sonnengoettin 5.10 Generic_142901-03 i86pc i386 i86pc

# No banner output for /bin/sh

$ bash --version
GNU bash, version 3.00.16(1)-release (i386-pc-solaris2.10)
Copyright (C) 2004 Free Software Foundation, Inc.





Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Avi Kivity

 On 09/12/2010 01:14 PM, Alexander Graf wrote:

Uh, something in the power management module. Just give it a try on an Istanbul 
system. Just use the 11.1 iso - the kernels are close enough.


Oh, probably some silly msr.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH v3] Introduce qemu_madvise()

2010-09-12 Thread Andreas Färber
From: Andreas Färber afaer...@opensolaris.org

vl.c has a Sun-specific hack to supply a prototype for madvise(),
but the call site has apparently moved to arch_init.c.

Haiku doesn't implement madvise() in favor of posix_madvise().
OpenBSD and Solaris 10 don't implement posix_madvise() but madvise().

Check for madvise() and posix_madvise() in configure and supply qemu_madvise()
as wrapper. Prefer madvise() over posix_madvise() due to flag availability.
Convert all callers to use qemu_madvise() and QEMU_MADV_*. No functional change
except for arch_init.c:ram_load() now potentially falling back to posix_madvise()
or a no-op if both are missing.

v2 -> v3:
* Reuse the *_MADV_* defines for QEMU_MADV_*. Suggested by Alexander Graf.
* Add configure check for madvise(), too.
  Add defines to Makefile, not QEMU_CFLAGS.
  Convert all callers, untested. Suggested by Blue Swirl.
* Keep Solaris' madvise() prototype around. Pointed out by Alexander Graf.

v1 -> v2:
* Don't rely on posix_madvise() availability, add qemu_madvise().
  Suggested by Blue Swirl.

Signed-off-by: Andreas Färber afaer...@opensolaris.org
Cc: Blue Swirl blauwir...@gmail.com
Cc: Alexander Graf ag...@suse.de
---
 arch_init.c |2 +-
 configure   |   33 +
 exec.c  |8 
 hw/virtio-balloon.c |4 ++--
 kvm-all.c   |6 +++---
 osdep.c |   15 +++
 osdep.h |   25 +
 vl.c|3 ---
 8 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index e468c0c..a910033 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -396,7 +396,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
 #ifndef _WIN32
 if (ch == 0 &&
 (!kvm_enabled() || kvm_has_sync_mmu())) {
-madvise(host, TARGET_PAGE_SIZE, MADV_DONTNEED);
+qemu_madvise(host, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED);
 }
 #endif
 } else if (flags & RAM_SAVE_FLAG_PAGE) {
diff --git a/configure b/configure
index 4061cb7..86558eb 100755
--- a/configure
+++ b/configure
@@ -2069,6 +2069,31 @@ if compile_prog "" "" ; then
 fi
 
 ##
+# check if we have madvise
+
+madvise=no
+cat > $TMPC << EOF
+#include <sys/types.h>
+#include <sys/mman.h>
+int main(void) { return madvise(NULL, 0, MADV_DONTNEED); }
+EOF
+if compile_prog "" "" ; then
+madvise=yes
+fi
+
+##
+# check if we have posix_madvise
+
+posix_madvise=no
+cat > $TMPC << EOF
+#include <sys/mman.h>
+int main(void) { return posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); }
+EOF
+if compile_prog "" "" ; then
+posix_madvise=yes
+fi
+
+##
 # check if trace backend exists
 
sh $source_path/tracetool --$trace_backend --check-backend > /dev/null 2> /dev/null
@@ -2226,6 +2251,8 @@ echo "KVM support       $kvm"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
+echo "madvise           $madvise"
+echo "posix_madvise     $posix_madvise"
 echo "uuid support      $uuid"
 echo "vhost-net support $vhost_net"
 echo "Trace backend     $trace_backend"
@@ -2466,6 +2493,12 @@ fi
 if test "$fdatasync" = "yes" ; then
   echo "CONFIG_FDATASYNC=y" >> $config_host_mak
 fi
+if test "$madvise" = "yes" ; then
+  echo "CONFIG_MADVISE=y" >> $config_host_mak
+fi
+if test "$posix_madvise" = "yes" ; then
+  echo "CONFIG_POSIX_MADVISE=y" >> $config_host_mak
+fi
 
 # XXX: suppress that
 if [ $bsd = yes ] ; then
diff --git a/exec.c b/exec.c
index 380dab5..b1fe3e9 100644
--- a/exec.c
+++ b/exec.c
@@ -2841,8 +2841,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, 
const char *name,
 new_block->host = file_ram_alloc(new_block, size, mem_path);
 if (!new_block->host) {
 new_block->host = qemu_vmalloc(size);
-#ifdef MADV_MERGEABLE
-madvise(new_block->host, size, MADV_MERGEABLE);
+#ifdef QEMU_MADV_MERGEABLE
+qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
 #endif
 }
 #else
@@ -2858,8 +2858,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, 
const char *name,
 #else
 new_block->host = qemu_vmalloc(size);
 #endif
-#ifdef MADV_MERGEABLE
-madvise(new_block->host, size, MADV_MERGEABLE);
+#ifdef QEMU_MADV_MERGEABLE
+qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
 #endif
 }
 }
diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 9fe3886..1e74674 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -51,8 +51,8 @@ static void balloon_page(void *addr, int deflate)
 {
 #if defined(__linux__)
 if (!kvm_enabled() || kvm_has_sync_mmu())
-madvise(addr, TARGET_PAGE_SIZE,
-deflate ? MADV_WILLNEED : MADV_DONTNEED);
+qemu_madvise(addr, TARGET_PAGE_SIZE,
+deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
 #endif
 }
 
diff --git 

Re: [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration

2010-09-12 Thread Anthony Liguori

On 09/12/2010 05:46 AM, Avi Kivity wrote:

 On 09/11/2010 05:04 PM, Anthony Liguori wrote:

Today, live migration only works when using shared storage that is fully
cache coherent using raw images.

The failure case with weak coherence (i.e. NFS) is subtle but 
nonetheless still
exists.  NFS only guarantees close-to-open coherence and when 
performing a live
migration, we do an open on the source and an open on the 
destination.  We
fsync() on the source before launching the destination but since we 
have two

simultaneous opens, we're not guaranteed coherence.

This is not necessarily a problem except that we are a bit gratuitous 
in reading
from the disk before launching a guest.  This means that as things 
stand today,
we're guaranteed to read the first 64k of the disk and as such, if a 
client

writes to that region during live migration, corruption will result.

The second failure condition has to do with image files (such as 
qcow2).  Today,
we aggressively cache metadata in all image formats and that cache is 
definitely

not coherent even with fully coherent shared storage.

In all image formats, we prefetch at least the L1 table in open() 
which means
that if there is a write operation that causes a modification to an 
L1 table,

corruption will ensue.

This series attempts to address both of these issues.  Technically, if 
an NFS
client aggressively prefetches this solution is not enough but in 
practice,

Linux doesn't do that.


I think it is unlikely that it will, but I prefer to be on the right 
side of the standards.


I've been asking around about this and one thing that was suggested was 
acquiring a file lock as NFS requires that a lock acquisition drops any 
client cache for a file.  I need to understand this a bit more so it's 
step #2.
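(For reference, the lock-based invalidation being looked into is on the order of the following sketch; whether a plain advisory lock is actually enough for this case is exactly what still needs to be confirmed:)

#include <fcntl.h>
#include <unistd.h>

/* Take and drop a whole-file advisory lock; NFS is supposed to drop the
 * client's cached data for the file when the lock is acquired. */
static int nfs_cache_barrier(int fd)
{
    struct flock fl = {
        .l_type   = F_RDLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,            /* 0 = whole file */
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        return -1;
    }
    fl.l_type = F_UNLCK;
    return fcntl(fd, F_SETLK, &fl);
}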


  Why not delay image open until after migration completes?  I know 
your concern about the image not being there, but we can verify that 
with access().  If the image is deleted between access() and open() 
then the user has much bigger problems.


3/3 would still be needed because if we delay the open we obviously can't 
do a read until an open.


So it's only really a choice between invalidate_cache and delaying 
open.  It's a far less invasive change to just do invalidate_cache 
though and it has some nice properties.


Regards,

Anthony Liguori

Note that on NFS, removing (and I think chmoding) a file after it is 
opened will cause subsequent data access to fail, unlike posix.







Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files

2010-09-12 Thread Anthony Liguori

On 09/12/2010 05:37 AM, Avi Kivity wrote:

 On 09/11/2010 05:04 PM, Anthony Liguori wrote:
Image files have two types of data: immutable data that describes 
things like
image size, backing files, etc. and mutable data that includes offset 
and

reference count tables.



Note: even the logical size is, in principle, mutable.  If we 
introduce external snapshots, then so are backing files?


Maybe it should be, "data that can be changed during a live migration".

Backing files and logical size shouldn't change during live migration.

But even so, I think the interface makes sense.  It's basically, drop 
anything you have cached that may change during migration.  What needs 
to be read is dependent on features/format.


Regards,

Anthony Liguori





Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts

2010-09-12 Thread Anthony Liguori

On 09/12/2010 05:42 AM, Avi Kivity wrote:

 On 09/11/2010 05:04 PM, Anthony Liguori wrote:

This fixes a couple nasty problems relating to live migration.

1) When dealing with shared storage with weak coherence (i.e. NFS), 
even if
we re-read, we may end up with undesired caching.  By delaying 
any reads
until we absolutely have to, we decrease the likelihood of any 
undesirable

caching.

2) When dealing with copy-on-read, the local storage acts as a 
cache.  We need

to make sure to avoid any reads to avoid polluting the local cache.

+
  static void ide_identify(IDEState *s)
  {
  uint16_t *p;
@@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
  return;
  }

+guess_geometry(s);
+


This can cause a disk read, no?  Shouldn't it be made asynchronous?


Yes, it should.  I'm not sure there's an easy way to make it 
asynchronous though not because of the block layer but because of how 
these functions are called.




Or just move it to just before the guest starts?


We don't really have a notion of guest starts today although maybe we 
should.


Regards,

Anthony Liguori




Re: [Qemu-devel] Stable patch tracking

2010-09-12 Thread Anthony Liguori

On 09/12/2010 05:08 AM, Avi Kivity wrote:

 On 09/07/2010 09:24 PM, Anthony Liguori wrote:
A lot of folks tend to reply to patches and suggest them for stable 
which is certainly appreciated but is very difficult for me to track 
reliably.


So here's another way to nominate a patch for inclusion in -stable:

http://wiki.qemu.org/Releases/0.13.0

Just copy/paste the template and fill out with the appropriate details.


Can the captcha for logged in users be removed?


Can you be more specific about what you're seeing?

There should only be a captcha for creating an account.  I've never seen 
it since I created my account.


Regards,

Anthony Liguori





Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-12 Thread Avi Kivity

 On 09/10/2010 08:07 PM, Anthony Liguori wrote:

On 09/10/2010 10:49 AM, Avi Kivity wrote:
  If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a 
naive copy of the image file and end up with a 2GB image when 
there's nothing in it, that's badness.


Only if you crash in the middle.  If not, you free the preallocation 
during shutdown (or when running a guest, when it isn't actively 
writing at 100 MB/s).


Which is potentially guest exploitable.


If this worries you, run a scrubber in the background after an 
uncontrolled crash.  Like qed fsck, this will recover the free list from 
L2.  Unlike qed fsck, it will not delay starting of large guests.




And what do you do when you shutdown and start up?  You're setting a 
reference count on blocks and keeping metadata in memory that those 
blocks are really free.  Do you need an atexit hook to decrement the 
reference counts? 


Not atexit, just when we close the image.


Just a detail, but we need an atexit() handler to make sure block 
devices get closed because we have too many exit()s in the code today.


Right.



Do you need to create a free list structure that gets written out on 
close?


Yes, the same freelist that we allocate from.  It's an allocated but 
not yet referenced list.


Does it get written to disk?


On exit or when there is no allocation activity.



Just saying we can do batching is not solving the problem.  If you 
want to claim that the formats are equally, then in the very least, 
you have to give a very exact description of how this would work 
because it's not entirely straight forward.


I thought I did, but I realize it is spread over multiple email 
messages.  If you like, I can try to summarize it.  It will be 
equally useful for qed once you add a freelist for UNMAP support.


Yes, please consolidate so we can debate specifics.  If there's a 
reasonable way to fix qcow2, I'm happy to walk away from qed.  But 
we've studied the problem and do not believe there's a reasonable 
approach to fixing qcow2 whereas reasonable considers the amount of 
development effort, the time line required to get things right, and 
the confidence we would have in the final product compared against the 
one time cost of introducing a new format.


I've started something and will post it soon.  When considering 
development time, also consider the time it will take users to actually 
use qed (6 months for qemu release users, ~9 months on average for 
semiannual community distro releases, 12-18 months for enterprise 
distros).  Consider also that we still have to support qcow2 since people 
do use the extra features, and since I don't see us forcing them to migrate.



I don't think you have any grounds to make such a statement.


No, it's a forward-looking statement.  But you're already looking at 
adding a freelist for UNMAP support and three levels for larger 
images.  So it's safe to say that qed will not remain as nice and 
simple as it is now.


I have a lot of faith in starting from a strong base and avoiding 
making it weaker vs. starting from a weak base and trying to make it 
stronger.


This has led to many rewrites in the past.



I realize it's somewhat subjective though.


While qed looks like a good start, it has at least three flaws already 
(relying on physical image size, relying on fsck, and limited logical 
image size).  Just fixing those will introduce complication.  What about 
new features or newly discovered flaws?






4) We have looked at trying to fix qcow2.  It appears to be a 
monumental amount of work that starts with a rewrite where it's 
unclear if we can even keep supporting all of the special 
features.  IOW, there is likely to be a need for users to 
experience some type of image conversion or optimization process.


I don't see why.


Because you're oversimplifying what it takes to make qcow2 perform 
well.


Maybe.  With all its complexity, it's nowhere near as complex as even the 
simplest filesystem.  The biggest burden is the state machine design.


Maybe I'm broken with respect to how I think, but I find state 
machines very easy to rationalize.


Your father's state machine. Not as clumsy or random as a thread; an 
elegant weapon for a more civilized age


To me, the biggest burden in qcow2 is thinking through how you deal 
with shared resources.  Because you can block for a long period of 
time during write operations, it's not enough to just carry a mutex 
during all metadata operations.  You have to stage operations and 
commit them at very specific points in time.


The standard way of dealing with this is to have a hash table for 
metadata that contains a local mutex:


l2cache = defaultdict(L2)

def get_l2(pos):
    l2 = l2cache[pos]
    l2.mutex.lock()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False

Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

2010-09-12 Thread Anthony Liguori

On 09/12/2010 07:41 AM, Avi Kivity wrote:

 On 09/07/2010 05:57 PM, Anthony Liguori wrote:

I agree that streaming should be generic, like block migration.  The
trivial generic implementation is:

void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {


Three problems here.  First problem is that bdrv_is_allocated is 
synchronous. 


Put the whole thing in a thread.


It doesn't fix anything.  You don't want stream to serialize all I/O 
operations.


The second problem is that streaming makes the most sense when it's 
the smallest useful piece of work whereas bdrv_is_allocated() may 
return a very large range.


You could cap it here but you then need to make sure that cap is at 
least cluster_size to avoid a lot of unnecessary I/O.


That seems like a nice solution.  You probably want a multiple of the 
cluster size to retain efficiency.


What you basically do is:

stream_step_three():
   complete()

stream_step_two(offset, length):
   bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)

And that's exactly what the current code looks like.  The only change to 
the patch that this does is make some of qed's internals be block layer 
interfaces.


One of the things Stefan has mentioned is that a lot of the QED code 
could be reused by other formats.  All formats implement things like CoW 
on their own today but if you exposed interfaces like 
bdrv_aio_find_free_cluster(), you could actually implement a lot more in 
the generic block layer.


So, I agree with you in principle that this all should be common code.  
I think it's a larger effort though.




The QED streaming implementation is 140 LOCs too so you quickly end 
up adding more code to the block formats to support these new 
interfaces than it takes to just implement it in the block format.


bdrv_is_allocated() already exists (and is needed for commit), what 
else is needed?  cluster size?


Synchronous implementations are not reusable to implement asynchronous 
anything.  But you need the code to be cluster aware too.


Third problem is that  streaming really requires being able to do 
zero write detection in a meaningful way.  You don't want to always 
do zero write detection so you need another interface to mark a 
specific write as a write that should be checked for zeros.


You can do that in bdrv_stream(), above, before the actual write, and 
call bdrv_unmap() if you detect zeros.


My QED branch now does that FWIW.  At the moment, it only detects zero 
reads to unallocated clusters and writes a special zero cluster marker.  
However, the detection code is in the generic path so once the fsck() 
logic is working, we can implement a free list in QED.
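(The detection itself is conceptually simple, something along these lines, though a real version would be vectorized and work on cluster-sized, sector-aligned buffers; the name is illustrative:)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool buf_is_zero(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}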


In QED, the detection code needs to have a lot of knowledge about 
cluster boundaries and the format of the device.  In principle, this 
should be common code but it's not for the same reason copy-on-write is 
not common code today.


Regards,

Anthony Liguori




Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

2010-09-12 Thread Avi Kivity

 On 09/12/2010 03:25 PM, Anthony Liguori wrote:

On 09/12/2010 07:41 AM, Avi Kivity wrote:

 On 09/07/2010 05:57 PM, Anthony Liguori wrote:

I agree that streaming should be generic, like block migration.  The
trivial generic implementation is:

void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {


Three problems here.  First problem is that bdrv_is_allocated is 
synchronous. 


Put the whole thing in a thread.


It doesn't fix anything.  You don't want stream to serialize all I/O 
operations.


Why would it serialize all I/O operations?  It's just like another vcpu 
issuing reads.




The second problem is that streaming makes the most sense when it's 
the smallest useful piece of work whereas bdrv_is_allocated() may 
return a very large range.


You could cap it here but you then need to make sure that cap is at 
least cluster_size to avoid a lot of unnecessary I/O.


That seems like a nice solution.  You probably want a multiple of the 
cluster size to retain efficiency.


What you basically do is:

stream_step_three():
   complete()

stream_step_two(offset, length):
   bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)


Isn't there a write() missing somewhere?



And that's exactly what the current code looks like.  The only change 
to the patch that this does is make some of qed's internals be block 
layer interfaces.


Why do you need find_free_cluster()?  That's a physical offset thing.  
Just write to the same logical offset.


IOW:

bdrv_aio_stream():
    bdrv_aio_read(offset, stream_2)

stream_2():
    if all zeros:
        increment offset
        if more:
            bdrv_aio_stream()
    bdrv_aio_write(offset, stream_3)

stream_3():
    bdrv_aio_write(offset, stream_4)

stream_4():
    increment offset
    if more:
        bdrv_aio_stream()


Of course, need to serialize wrt guest writes, which adds a bit more 
complexity.  I'll leave it to you to code the state machine for that.




One of the things Stefan has mentioned is that a lot of the QED code 
could be reused by other formats.  All formats implement things like 
CoW on their own today but if you exposed interfaces like 
bdrv_aio_find_free_cluster(), you could actually implement a lot more 
in the generic block layer.


So, I agree with you in principle that this all should be common 
code.  I think it's a larger effort though.


Not that large I think; and it will make commit async as a side effect.



The QED streaming implementation is 140 LOCs too so you quickly end 
up adding more code to the block formats to support these new 
interfaces than it takes to just implement it in the block format.


bdrv_is_allocated() already exists (and is needed for commit), what 
else is needed?  cluster size?


Synchronous implementations are not reusable to implement asynchronous 
anything. 


Surely this is easy to fix, at least for qed.

What we need is thread infrastructure that allows us to convert between 
the two methods.



But you need the code to be cluster aware too.


Yes, another variable in BlockDriverState.



Third problem is that  streaming really requires being able to do 
zero write detection in a meaningful way.  You don't want to always 
do zero write detection so you need another interface to mark a 
specific write as a write that should be checked for zeros.


You can do that in bdrv_stream(), above, before the actual write, and 
call bdrv_unmap() if you detect zeros.


My QED branch now does that FWIW.  At the moment, it only detects zero 
reads to unallocated clusters and writes a special zero cluster 
marker.  However, the detection code is in the generic path so once 
the fsck() logic is working, we can implement a free list in QED.


In QED, the detection code needs to have a lot of knowledge about 
cluster boundaries and the format of the device.  In principle, this 
should be common code but it's not for the same reason copy-on-write 
is not common code today.


Parts of it are: commit.  Of course, that's horribly synchronous.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 2/9] pcie: helper functions for pcie extended capability.

2010-09-12 Thread Michael S. Tsirkin
On Wed, Sep 08, 2010 at 04:39:35PM +0900, Isaku Yamahata wrote:
 +#define PCI_EXP_SLTCTL_AIC_SHIFT    6
 +#define PCI_EXP_SLTCTL_AIC_ON       (PCI_EXP_HP_IND_ON << PCI_EXP_SLTCTL_AIC_SHIFT)
 +#define PCI_EXP_SLTCTL_AIC_BLINK    (PCI_EXP_HP_IND_BLINK << PCI_EXP_SLTCTL_AIC_SHIFT)
 +#define PCI_EXP_SLTCTL_AIC_OFF      (PCI_EXP_HP_IND_OFF << PCI_EXP_SLTCTL_AIC_SHIFT)
 +
 +#define PCI_EXP_SLTCTL_PIC_SHIFT    8
 +#define PCI_EXP_SLTCTL_PIC_ON       (PCI_EXP_HP_IND_ON << PCI_EXP_SLTCTL_PIC_SHIFT)
 +#define PCI_EXP_SLTCTL_PIC_BLINK    (PCI_EXP_HP_IND_BLINK << PCI_EXP_SLTCTL_PIC_SHIFT)
 +#define PCI_EXP_SLTCTL_PIC_OFF      (PCI_EXP_HP_IND_OFF << PCI_EXP_SLTCTL_PIC_SHIFT)

It might be better to simply define the 6 macros we are using directly.
The duplication here is minimal, and I guess it will be easier to get
them into Linux this way.
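Spelling the six values out directly would give something like this (values per the PCIe Slot Control register layout, attention indicator in bits 7:6 and power indicator in bits 9:8; a sketch of the suggestion, not a tested hunk):

+#define PCI_EXP_SLTCTL_AIC_ON     0x0040
+#define PCI_EXP_SLTCTL_AIC_BLINK  0x0080
+#define PCI_EXP_SLTCTL_AIC_OFF    0x00c0
+#define PCI_EXP_SLTCTL_PIC_ON     0x0100
+#define PCI_EXP_SLTCTL_PIC_BLINK  0x0200
+#define PCI_EXP_SLTCTL_PIC_OFF    0x0300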

-- 
MST



Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files

2010-09-12 Thread Avi Kivity

 On 09/12/2010 03:06 PM, Anthony Liguori wrote:


Backing files and logical size shouldn't change during live migration.


Why not?



But even so, I think the interface makes sense.  It's basically, drop 
anything you have cached that may change during migration.  What needs 
to be read is dependent on features/format.


Sure.  Certainly as an incremental step.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Stable patch tracking

2010-09-12 Thread Avi Kivity

 On 09/12/2010 03:04 PM, Anthony Liguori wrote:

Can the captcha for logged in users be removed?



Can you be more specific about what you're seeing?

There should only be a captcha for creating an account.  I've never 
seen it since I created my account.




I updated the page (logged in as avi) and it asked me for a captcha.  I 
don't recall anything else.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts

2010-09-12 Thread Avi Kivity

 On 09/12/2010 03:08 PM, Anthony Liguori wrote:

This can cause a disk read, no?  Shouldn't it be made asynchronous?



Yes, it should.  I'm not sure there's an easy way to make it 
asynchronous though not because of the block layer but because of how 
these functions are called.


Sorry to harp on the subject, but that's the standard problem with state 
machines.  Every time you want to do a blocking operation in a function, 
you have to put all its locals in some structure, split the function 
into two, do some scheduling, etc.
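(Illustrating the pattern with made-up names: the locals move into a request structure, the function is split at the blocking point, and its tail becomes a completion callback:)

#include <stdlib.h>

typedef struct GuessGeometryReq {
    void *dev;                                   /* former locals live here */
    void (*done)(void *dev, int cyls, int heads, int secs);
} GuessGeometryReq;

/* Second half: runs when the asynchronous read completes. */
static void guess_geometry_cb(void *opaque, int ret)
{
    GuessGeometryReq *req = opaque;
    /* parse the partition table from the read buffer instead of constants */
    req->done(req->dev, 1024, 16, 63);
    free(req);
}

/* First half: only kicks off the read and returns immediately. */
static void guess_geometry_async(void *dev,
                                 void (*done)(void *, int, int, int))
{
    GuessGeometryReq *req = malloc(sizeof(*req));
    req->dev = dev;
    req->done = done;
    /* a bdrv_aio_readv(..., guess_geometry_cb, req) call would go here */
}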




Or just move it to just before the guest starts?


We don't really have a notion of guest starts today although maybe 
we should.


Wasn't there some qdev callback that represents this?  Faint memory from 
the reset thread.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Stable patch tracking

2010-09-12 Thread Avi Kivity

 On 09/12/2010 03:42 PM, Avi Kivity wrote:

 On 09/12/2010 03:04 PM, Anthony Liguori wrote:

Can the captcha for logged in users be removed?



Can you be more specific about what you're seeing?

There should only be a captcha for creating an account.  I've never 
seen it since I created my account.




I updated the page (logged in as avi) and it asked me for a captcha.  
I don't recall anything else.




I did another edit (to my user page) and it didn't this time.  Maybe it 
only happens on the first edit?


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Copy, Cut, Paste - Files/Text

2010-09-12 Thread Dhananjay Goel
Hi there!

I've been using qemu for quite a while and was searching for clipboard sharing
between guests and host. I found some info about copy/paste using VNC. I would
appreciate if somebody can help me in finding if the following things are
supported directly using Qemu:

1) Copy/Paste Files between Host and Guests.
2) Drag and drop files from/to VMs
3) Clipboard sharing to copy paste text across VMs (host and guest(s))

Thanks,
Dhananjay


Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Avi Kivity

 On 09/12/2010 04:30 PM, Joerg Roedel wrote:



Either way, I don't think we need a phenom2 type. The additional
features are minor enough to not really matter and all use cases I
can come up with require either -cpu host (local virt) or -cpu phenom
(migration).

I'm fine with this (or with adding phenom2).  But don't make phenom
contain flags that real phenoms don't have.

How about features that are not supported by the hardware but can be
supported in emulation? The VMCBCLEAN feature is one of those which
makes a lot of sense to reduce the emulated world-switch times. I guess
its ok to enable those with -cpu host?




It's a good question.  We do that with x2apic, so yes.

Userspace can do four things with KVM_GET_SUPPORTED_CPUID:
- feed it right back to KVM_SET_CPUID2, getting maximum features and 
hopefully performance
- use it to mask the real cpuid which it fetches directly, getting 
something as similar as possible to the host
- use it to mask a predefined cpu type, getting something as similar as 
possible to that cpu type
- use it to verify that a predefined cpu type is supported, getting 
something exactly the same as that cpu type


-cpu host is the first option.  If the need comes for the second option, 
we can provide it.
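(A sketch of option one against the plain KVM ioctl interface; the entry count and error handling are simplified, and a real user would trim or edit the entries for options two to four:)

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MAX_CPUID_ENTRIES 100

static int cpuid_passthrough(int kvm_fd, int vcpu_fd)
{
    struct {
        struct kvm_cpuid2 cpuid;
        struct kvm_cpuid_entry2 entries[MAX_CPUID_ENTRIES];
    } data;

    memset(&data, 0, sizeof(data));
    data.cpuid.nent = MAX_CPUID_ENTRIES;

    /* everything KVM can virtualize or emulate on this host... */
    if (ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, &data.cpuid) < 0) {
        return -1;
    }
    /* ...fed straight back in, giving maximum features */
    return ioctl(vcpu_fd, KVM_SET_CPUID2, &data.cpuid);
}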


I'll note in the documentation that KVM_GET_SUPPORTED_CPUID can return values 
not present in the host cpuid.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features

2010-09-12 Thread Joerg Roedel
On Sun, Sep 12, 2010 at 10:01:28AM +0200, Avi Kivity wrote:
  On 09/12/2010 09:16 AM, Alexander Graf wrote:

 Either way, I don't think we need a phenom2 type. The additional
 features are minor enough to not really matter and all use cases I
 can come up with require either -cpu host (local virt) or -cpu phenom
 (migration).

 I'm fine with this (or with adding phenom2).  But don't make phenom  
 contain flags that real phenoms don't have.

How about features that are not supported by the hardware but can be
supported in emulation? The VMCBCLEAN feature is one of those which
makes a lot of sense to reduce the emulated world-switch times. I guess
it's OK to enable those with -cpu host?

Joerg




Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-12 Thread Anthony Liguori

On 09/12/2010 08:24 AM, Avi Kivity wrote:

Not atexit, just when we close the image.


Just a detail, but we need an atexit() handler to make sure block 
devices get closed because we have too many exit()s in the code today.



Right.


So when you click the 'X' on the qemu window, we get to wait a few 
seconds for it to actually disappear because it's flushing metadata to 
disk..



I've started something and will post it soon.


Excellent, thank you.

  When considering development time, also consider the time it will 
take users to actually use qed (6 months for qemu release users, ~9 
months on average for semiannual community distro releases, 12-18 
months for enterprise distros).  Consider also that we still have to 
support qcow2 since people do use the extra features, and since I 
don't see us forcing them to migrate.


I'm of the opinion that qcow2 is unfit for production use for the type 
of production environments I care about.  The amount of changes needed 
to make qcow2 fit for production use puts it on at least the same 
timeline as you cite above.


Yes, there are people today for whom qcow2 is appropriate, but by the same 
token, it will continue to be appropriate for them in the future.


In my view, we don't have an image format fit for production use.  
You're arguing we should make qcow2 fit for production use whereas I am 
arguing we should start from scratch.  My reasoning for starting from 
scratch is that it simplifies the problem.  Your reasoning for improving 
qcow2 is simplifying the transition for non-production users of qcow2.


We have an existence proof that we can achieve good data integrity and 
good performance by simplifying the problem.  The burden still is 
establishing that it's possible to improve qcow2 in a reasonable amount 
of effort.


NB, you could use qcow2 today if you had all of the data integrity fixes 
or didn't care about data integrity in the event of power failure or 
didn't care about performance.  I don't have any customers that fit that 
bill so from my perspective, qcow2 isn't production fit.  That doesn't 
mean that it's not fit for someone else's production use.




I realize it's somewhat subjective though.


While qed looks like a good start, it has at least three flaws already 
(relying on physical image size, relying on fsck, and limited logical 
image size).  Just fixing those will introduce complication.  What 
about new features or newly discovered flaws?


Let's quantify fsck.  My suspicion is that if you've got the storage for 
1TB disk images, it's fast enough that fsck cannot be that bad.


Keep in mind, we don't have to completely pause the guest while 
fsck'ing.  We simply have to prevent cluster allocations.  We can allow 
reads and we can allow writes to allocated clusters.


Consequently, if you had a 1TB disk image, it's extremely likely that 
the vast majority of I/O is just to allocated clusters which means that 
fsck() is entirely a background task.  The worst case scenario is 
actually a half-allocated disk.


But since you have to boot before you can run any serious test, if it 
takes 5 seconds to do an fsck(), it's highly likely that it's not even 
noticeable.
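
(A rough back-of-envelope behind that 5 second figure, assuming one read per allocated L2 table, the 2GB-per-L2 coverage discussed later in this thread, and an assumed ~100 random reads/s from a single disk:)

# crude fsck cost model: one random read per allocated L2 table
image_size_gb = 1024        # 1TB image
l2_coverage_gb = 2          # each L2 table covers 2GB of logical space
reads_per_second = 100      # assumed random-read rate of one spindle

nb_l2 = image_size_gb // l2_coverage_gb         # 512 tables
fsck_seconds = nb_l2 / reads_per_second         # ~5s even if every L2 is allocated
print(nb_l2, "L2 tables, ~%.1f seconds" % fsck_seconds)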


Maybe I'm broken with respect to how I think, but I find state 
machines very easy to rationalize.


Your father's state machine. Not as clumsy or random as a thread; an 
elegant weapon for a more civilized age


I find your lack of faith in QED disturbing.

To me, the biggest burden in qcow2 is thinking through how you deal 
with shared resources.  Because you can block for a long period of 
time during write operations, it's not enough to just carry a mutex 
during all metadata operations.  You have to stage operations and 
commit them at very specific points in time.


The standard way of dealing with this is to have a hash table for 
metadata that contains a local mutex:


l2cache = defaultdict(L2)

def get_l2(pos):
    l2 = l2cache[pos]
    l2.mutex.lock()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.unlock()


You're missing how you create entries.  That means you've got to do:

def put_l2(l2):
    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()

And this really illustrates my point.  It's a harder problem than it 
seems.  You're also keeping l2 reads from occurring when flushing a 
dirty l2 entry which is less parallel than what qed achieves today.


This is part of why I prefer state machines.  Acquiring a mutex is too 
easy and it makes it easy to not think through what all could be 
running.  When you are more explicit about when you are allowing 
concurrency, I think it's easier to be more aggressive.


It's a 

Re: [Qemu-devel] Stable patch tracking

2010-09-12 Thread Anthony Liguori

On 09/12/2010 08:42 AM, Avi Kivity wrote:

 On 09/12/2010 03:04 PM, Anthony Liguori wrote:

Can the captcha for logged in users be removed?



Can you be more specific about what you're seeing?

There should only be a captcha for creating an account.  I've never 
seen it since I created my account.




I updated the page (logged in as avi) and it asked me for a captcha.  
I don't recall anything else.


I assume it was something weird with cookies and/or browser caching.  
Let me know if it happens again.  Edits by anonymous users require 
captcha to prevent spam so something probably thought you were still an 
anonymous user even though you had logged in.


Regards,

Anthony Liguori





[Qemu-devel] [PATCH] trace: Fix user emulator dependency on trace objects

2010-09-12 Thread Andreas Färber
On a clean build, after generating trace.h, make would recurse into *-*-user
without a clue how to build ../trace.o (added to $(obj-y) in Makefile.target)
since its generation rule is in the main Makefile.
The softmmus are seemingly unaffected because the $(TOOLS), which each have
a dependency on $(trace-obj-y), are built first for the build-all target.

Add a dependency on $(trace-obj-y) for %-user, as done for the qemu-* tools.

Let's be paranoid and do the same for %-softmmu while at it, just in case
someone messes with $(TOOLS) or calls the Makefile target directly.

Signed-off-by: Andreas Färber andreas.faer...@web.de
Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
Cc: Prerna Saxena pre...@linux.vnet.ibm.com
Cc: Blue Swirl blauwir...@gmail.com
Cc: Anthony Liguori aligu...@us.ibm.com
---
 Makefile |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index ab91d42..090d632 100644
--- a/Makefile
+++ b/Makefile
@@ -80,9 +80,9 @@ include $(SRC_PATH)/Makefile.objs
 endif
 
 $(common-obj-y): $(GENERATED_HEADERS)
-$(filter %-softmmu,$(SUBDIR_RULES)): $(common-obj-y) subdir-libdis
+$(filter %-softmmu,$(SUBDIR_RULES)): $(trace-obj-y) $(common-obj-y) subdir-libdis
 
-$(filter %-user,$(SUBDIR_RULES)): $(GENERATED_HEADERS) subdir-libdis-user subdir-libuser
+$(filter %-user,$(SUBDIR_RULES)): $(GENERATED_HEADERS) $(trace-obj-y) subdir-libdis-user subdir-libuser
 
 ROMSUBDIR_RULES=$(patsubst %,romsubdir-%, $(ROMS))
 romsubdir-%:
-- 
1.7.2.2




Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

2010-09-12 Thread Anthony Liguori

On 09/12/2010 08:40 AM, Avi Kivity wrote:
Why would it serialize all I/O operations?  It's just like another 
vcpu issuing reads.


Because the block layer isn't re-entrant.


What you basically do is:

stream_step_three():
   complete()

stream_step_two(offset, length):
   bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)


Isn't there a write() missing somewhere?


Streaming relies on copy-on-read to do the writing.



And that's exactly what the current code looks like.  The only change 
to the patch that this does is make some of qed's internals be block 
layer interfaces.


Why do you need find_free_cluster()?  That's a physical offset thing.  
Just write to the same logical offset.


IOW:

bdrv_aio_stream():
    bdrv_aio_read(offset, stream_2)


It's an optimization.  If you've got a fully missing L1 entry, then 
you're going to memset() 2GB worth of zeros.  That's just wasted work.  
With a 1TB image with a 1GB allocation, it's a huge amount of wasted work.



stream_2():
    if all zeros:
        increment offset
        if more:
            bdrv_aio_stream()
    bdrv_aio_write(offset, stream_3)

stream_3():
    bdrv_aio_write(offset, stream_4)


I don't understand why stream_3() is needed.


stream_4():
    increment offset
    if more:
        bdrv_aio_stream()


Of course, need to serialize wrt guest writes, which adds a bit more 
complexity.  I'll leave it to you to code the state machine for that.


http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d



Third problem is that  streaming really requires being able to do 
zero write detection in a meaningful way.  You don't want to always 
do zero write detection so you need another interface to mark a 
specific write as a write that should be checked for zeros.


You can do that in bdrv_stream(), above, before the actual write, 
and call bdrv_unmap() if you detect zeros.


My QED branch now does that FWIW.  At the moment, it only detects 
zero reads to unallocated clusters and writes a special zero cluster 
marker.  However, the detection code is in the generic path so once 
the fsck() logic is working, we can implement a free list in QED.
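
(The detection itself is cheap; the check amounts to no more than this, shown purely for illustration:)

def buffer_is_zero(buf):
    # True when the data read back is all zeroes, i.e. the cluster can be
    # recorded as a zero-cluster marker instead of being allocated
    return not any(buf)

print(buffer_is_zero(bytes(65536)))          # True
print(buffer_is_zero(b"\0" * 100 + b"x"))    # False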


In QED, the detection code needs to have a lot of knowledge about 
cluster boundaries and the format of the device.  In principle, this 
should be common code but it's not for the same reason copy-on-write 
is not common code today.


Parts of it are: commit.  Of course, that's horribly synchronous.


If you've got AIO internally, making commit work is pretty easy.  Doing 
asynchronous commit at a generic layer is not easy though unless you 
expose lots of details.


Generally, I think the block layer makes more sense if the interface to 
the formats are high level and code sharing is achieved not by mandating 
a world view but rather by making libraries of common functionality.   
This is more akin to how the FS layer works in Linux.


So IMHO, we ought to add a bdrv_aio_commit function, turn the current 
code into a generic_aio_commit, implement a qed_aio_commit, then somehow 
do qcow2_aio_commit, and look at what we can refactor into common code.
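
(For illustration only -- this is not the existing commit code and the image class is invented -- a sketch of the kind of format-independent loop a generic commit could be built around, written synchronously for brevity:)

class MemImage:
    # toy in-memory image: only allocated clusters are stored
    def __init__(self, size, cluster_size):
        self.size, self.cluster_size = size, cluster_size
        self.clusters = {}
    def is_allocated(self, offset):
        return offset in self.clusters
    def read(self, offset):
        return self.clusters.get(offset, bytes(self.cluster_size))
    def write(self, offset, buf):
        self.clusters[offset] = bytes(buf)

def generic_commit(overlay, backing):
    # push every allocated cluster of the overlay down into the backing image;
    # an aio version would chain these steps as completion callbacks
    for offset in range(0, overlay.size, overlay.cluster_size):
        if overlay.is_allocated(offset):
            backing.write(offset, overlay.read(offset))

backing = MemImage(1 << 20, 65536)
overlay = MemImage(1 << 20, 65536)
overlay.write(65536, b"x" * 65536)
generic_commit(overlay, backing)
print(backing.is_allocated(65536))   # True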


Regards,

Anthony Liguori



[Qemu-devel] Re: [PATCH 1/3] pci: sorting out type confusion in pci_register_bar().

2010-09-12 Thread Michael S. Tsirkin
On Thu, Sep 09, 2010 at 11:48:55AM +0900, Isaku Yamahata wrote:
 This patch sorts out invalid use of pcibus_t.
 
 In pci_register_bar(), pcibus_t wmask is used.
 It should, however, be uint64_t because it is used to set
 the pci configuration space value (PCIDevice::wmask) by pci_set_quad()
 or pci_set_long().
 So make its type uint64_t and cast from pcibus_t to uint64_t where
 necessary.


I don't think we need the cast. Applied without the cast.

 Maybe for strict type safety conversion functions would
 be desirable. Something like pcibus_to_uintN(), uintN_to_pcibus().
 But conversions are done only a few place, so it wouldn't be
 worthwhile.
 
 Signed-off-by: Isaku Yamahata yamah...@valinux.co.jp
 ---
  hw/pci.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/hw/pci.c b/hw/pci.c
 index f03b83e..8d6b299 100644
 --- a/hw/pci.c
 +++ b/hw/pci.c
 @@ -763,7 +763,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
  {
  PCIIORegion *r;
  uint32_t addr;
 -pcibus_t wmask;
 +uint64_t wmask;
  
  if ((unsigned int)region_num >= PCI_NUM_REGIONS)
  return;
 @@ -781,7 +781,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
  r->type = type;
  r->map_func = map_func;
  
 -wmask = ~(size - 1);
 +wmask = (uint64_t)(~(size - 1));
  addr = pci_bar(pci_dev, region_num);
  if (region_num == PCI_ROM_SLOT) {
  /* ROM enable bit is writeable */
 -- 
 1.7.1.1



Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files

2010-09-12 Thread Anthony Liguori

On 09/12/2010 08:28 AM, Avi Kivity wrote:

 On 09/12/2010 03:06 PM, Anthony Liguori wrote:


Backing files and logical size shouldn't change during live migration.


Why not?


To make our lives easier.

Regards,

Anthony Liguori



But even so, I think the interface make sense.  It's basically, drop 
anything you have cached that may change during migration.  What 
needs to be read is dependent on features/format.


Sure.  Certainly as an incremental step.






[Qemu-devel] Re: [PATCH 2/3] pci: don't ignore invalid parameter for pci_register_bar().

2010-09-12 Thread Michael S. Tsirkin
On Thu, Sep 09, 2010 at 11:48:56AM +0900, Isaku Yamahata wrote:
 Abort when invalid value for region_num is passed to pci_register_bar.
 That is caller's bug. Abort instead of silently ignoring invalid value.
 
 Signed-off-by: Isaku Yamahata yamah...@valinux.co.jp
 ---
  hw/pci.c |5 ++---
  1 files changed, 2 insertions(+), 3 deletions(-)

Applied.

 diff --git a/hw/pci.c b/hw/pci.c
 index 8d6b299..31eba9a 100644
 --- a/hw/pci.c
 +++ b/hw/pci.c
 @@ -765,9 +765,8 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
  uint32_t addr;
  uint64_t wmask;
  
 -if ((unsigned int)region_num >= PCI_NUM_REGIONS)
 -return;
 -
 +assert(region_num >= 0);
 +assert(region_num < PCI_NUM_REGIONS);
  if (size & (size-1)) {
  fprintf(stderr, "ERROR: PCI region size must be pow2 "
  "type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
 -- 
 1.7.1.1



[Qemu-devel] Re: [PATCH 3/3] pci: improve signature of pci_register_bar().

2010-09-12 Thread Michael S. Tsirkin
On Thu, Sep 09, 2010 at 11:48:57AM +0900, Isaku Yamahata wrote:
 Make type uint8_t from int because PCIIORegion::type is uint8_t.
 
 Signed-off-by: Isaku Yamahata yamah...@valinux.co.jp
 ---
  hw/pci.c |2 +-
  hw/pci.h |2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)


Applied, thanks.

 diff --git a/hw/pci.c b/hw/pci.c
 index 31eba9a..ceee291 100644
 --- a/hw/pci.c
 +++ b/hw/pci.c
 @@ -758,7 +758,7 @@ static int pci_unregister_device(DeviceState *dev)
  }
  
  void pci_register_bar(PCIDevice *pci_dev, int region_num,
 -pcibus_t size, int type,
 +pcibus_t size, uint8_t type,
  PCIMapIORegionFunc *map_func)
  {
  PCIIORegion *r;
 diff --git a/hw/pci.h b/hw/pci.h
 index bccab3a..25179eb 100644
 --- a/hw/pci.h
 +++ b/hw/pci.h
 @@ -192,7 +192,7 @@ PCIDevice *pci_register_device(PCIBus *bus, const char *name,
 PCIConfigWriteFunc *config_write);
  
  void pci_register_bar(PCIDevice *pci_dev, int region_num,
 -pcibus_t size, int type,
 +pcibus_t size, uint8_t type,
  PCIMapIORegionFunc *map_func);
  
  int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
 -- 
 1.7.1.1



Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts

2010-09-12 Thread Anthony Liguori

On 09/12/2010 08:26 AM, Avi Kivity wrote:

 On 09/12/2010 03:08 PM, Anthony Liguori wrote:

This can cause a disk read, no?  Shouldn't it be made asynchronous?



Yes, it should.  I'm not sure there's an easy way to make it 
asynchronous though not because of the block layer but because of how 
these functions are called.


Sorry to harp on the subject, but that's the standard problem with 
state machines.  Every time you want to do a blocking operation in a 
function, you have to put all its locals in some structure, split the 
function into two, do some scheduling, etc.


We can't block the VCPU thread for arbitrarily long periods of time.  If 
we get a PIO operation requesting information about geometry, we can't 
wait for a disk read in order to satisfy that request.


We need to kick off the I/O operations in the background such that the 
data is available before the PIO operation happens.  This isn't SM vs. 
threads at all, this is simply about the fact that we can't do block I/O 
during a PIO operation.




Or just move it to just before the guest starts?


We don't really have a notion of guest starts today although maybe 
we should.


Wasn't there some qdev callback that represents this?  Faint memory 
from the reset thread.


Yes, realize().  I guess that's a reasonable approach.

Regards,

Anthony Liguori



[Qemu-devel] [Bug 623852] Re: PPC emulation loops on booting a FreeBSD kernel

2010-09-12 Thread Massimo Montecchi
Also I can confirm that I have this problem on QEMU.
I had tried booting FreeBSD8.1-ppc under QEMU (Linux x86_64 host; PPC guest) 
but there seems to be a problem with whatever the FreeBSD8.1 kernel does, that 
QEMU's PPC emulation can't handle.

I am using the latest version of QEMU from GIT as of 11/9/10.
The kernel starts OK then loops after Kernel entry at 0x100100 

-- 
PPC emulation loops on booting a FreeBSD kernel
https://bugs.launchpad.net/bugs/623852
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: New

Bug description:
Has anyone tried booting FreeBSD8.1-ppc under QEMU (Linux x86_64 host; PPC 
guest)?  I can get Linux/PPC to run fine, and FreeBSD8.1-i386 as well; but 
there seems to be a problem with whatever the FreeBSD8.1 kernel does, that 
QEMU's PPC emulation can't handle.

I am using the latest version of QEMU from GIT as of 25/8/10.  I don't know how 
to get a git commit hash, so I can't quote it.

The kernel starts OK then loops after Kernel entry at 0x100100 

The command I am running is

qemu-system-ppc -cdrom FreeBSD-8.1-RELEASE-powerpc-disc1.iso -hda 
freebsd8.1-ppc -m 94 -boot d

I obtained the kernel from 
ftp://ftp.freebsd.org/pub/FreeBSD/releases/powerpc/ISO-IMAGES/8.1/FreeBSD-8.1-RELEASE-powerpc-disc1.iso.





[Qemu-devel] [Bug 636446] [NEW] prep ppc machine no more working

2010-09-12 Thread Massimo Montecchi
Public bug reported:

I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if I
want to use the PREP PPC machine the error is the same:


qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS

CPU #0:
(register dump: NIP, LR, CTR, XER, MSR, HID0, HF (cpu idx 0), TB, DECR, GPR00-GPR31, CR, FPR00-FPR31, FPSCR, SRR0, SRR1, SDR1 -- values not preserved in this archive)

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
prep ppc machine no more working
https://bugs.launchpad.net/bugs/636446
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: New

Bug description:
I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to 
use PREP PPC machine the error is the same:


qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS

CPU #0:
(register dump: NIP, LR, CTR, XER, MSR, HID0, HF (cpu idx 0), TB, DECR, GPR00-GPR31, CR, FPR00-FPR31, FPSCR, SRR0, SRR1, SDR1 -- values not preserved in this archive)





Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-12 Thread Avi Kivity

 On 09/12/2010 05:13 PM, Anthony Liguori wrote:

On 09/12/2010 08:24 AM, Avi Kivity wrote:

Not atexit, just when we close the image.


Just a detail, but we need an atexit() handler to make sure block 
devices get closed because we have too many exit()s in the code today.



Right.


So when you click the 'X' on the qemu window, we get to wait a few 
seconds for it to actually disappear because it's flushing metadata to 
disk..


If it was doing heavy write I/O, you'll need to wait a bit (a few 
seconds correspond to a few hundred clusters' worth of metadata).  If it 
managed to flush while you were moving your mouse, no delay.




  When considering development time, also consider the time it will 
take users to actually use qed (6 months for qemu release users, ~9 
months on average for semiannual community distro releases, 12-18 
months for enterprise distros.  Consider also that we still have to 
support qcow2 since people do use the extra features, and since I 
don't see us forcing them to migrate.


I'm of the opinion that qcow2 is unfit for production use for the type 
of production environments I care about.  The amount of changes needed 
to make qcow2 fit for production use put it on at least the same 
timeline as you cite above.


If it's exactly the same time, we gain by having one less format.



Yes, there are people today that qcow2 is appropriate but by the same 
respect, it will continue to be appropriate for them in the future.


In my view, we don't have an image format fit for production use.  
You're arguing we should make qcow2 fit for production use whereas I 
am arguing we should start from scratch.  My reasoning for starting 
from scratch is that it simplifies the problem.  Your reasoning for 
improving qcow2 is simplifying the transition for non-production users 
of qcow2.


We have an existence proof that we can achieve good data integrity and 
good performance by simplifying the problem.  The burden still is 
establishing that it's possible to improve qcow2 in a reasonable 
amount of effort.


Agreed.



I realize it's somewhat subjective though.


While qed looks like a good start, it has at least three flaws 
already (relying on physical image size, relying on fsck, and limited 
logical image size).  Just fixing those will introduce complication.  
What about new features or newly discovered flaws?


Let's quantify fsck.  My suspicion is that if you've got the storage 
for 1TB disk images, it's fast enough that fsck can not be so bad.


It doesn't follow.  The storage is likely to be shared among many 
guests.  The image size (or how full it is) doesn't really matter; startup 
time is the aggregate number of L2s over all images starting now, 
divided by the number of spindles, divided by the number of IOPS each 
spindle provides.


Since an L2 spans a lot of logical address space, it is likely that many 
L2s will be allocated (in fact, it makes sense to preallocate them).
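
(Plugging made-up but plausible numbers into that formula, purely to show its shape:)

# startup_time = total_L2s / (spindles * iops_per_spindle)
guests_starting = 100
l2s_per_image = 50            # e.g. a 100GB image at 2GB of coverage per L2
spindles = 10
iops_per_spindle = 100        # assumed random-read rate

total_l2s = guests_starting * l2s_per_image                    # 5000 reads
startup_seconds = total_l2s / (spindles * iops_per_spindle)    # 5 seconds
print(startup_seconds)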




Keep in mind, we don't have to completely pause the guest while 
fsck'ing.  We simply have to prevent cluster allocations.  We can 
allow reads and we can allow writes to allocated clusters.


True.



Consequently, if you had a 1TB disk image, it's extremely likely that 
the vast majority of I/O is just to allocated clusters which means 
that fsck() is entirely a background task.  The worst case scenario is 
actually a half-allocated disk.


No, the worst case is 0.003% allocated disk, with the allocated clusters 
distributed uniformly.  That means all your L2s are allocated, but 
almost none of your clusters are.




But since you have to boot before you can run any serious test, if it 
takes 5 seconds to do an fsck(), it's highly likely that it's not even 
noticeable.


What if it takes 300 seconds?



Maybe I'm broken with respect to how I think, but I find state 
machines very easy to rationalize.


Your father's state machine. Not as clumsy or random as a thread; an 
elegant weapon for a more civilized age


I find your lack of faith in QED disturbing.


When 900 years old you reach, state machines you will not find so easy 
to understand.


To me, the biggest burden in qcow2 is thinking through how you deal 
with shared resources.  Because you can block for a long period of 
time during write operations, it's not enough to just carry a mutex 
during all metadata operations.  You have to stage operations and 
commit them at very specific points in time.


The standard way of dealing with this is to have a hash table for 
metadata that contains a local mutex:


l2cache = defaultdict(L2)

def get_l2(pos):
    l2 = l2cache[pos]
    l2.mutex.lock()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.unlock()


You're missing how you create entries.  That means you've got to do:

def put_l2(l2):
   if l2.committed:

Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files

2010-09-12 Thread Avi Kivity

 On 09/12/2010 05:26 PM, Anthony Liguori wrote:

On 09/12/2010 08:28 AM, Avi Kivity wrote:

 On 09/12/2010 03:06 PM, Anthony Liguori wrote:


Backing files and logical size shouldn't change during live migration.


Why not?


To make our lives easier.


It means management needs to block volume resize while a live migration 
takes place.  Since live migration is typically done by the system 
automatically, while volume resize happens in response to user request, 
this isn't a good idea.  Both in terms of user experience, and in terms 
of pushing more complexity to management.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts

2010-09-12 Thread Avi Kivity

 On 09/12/2010 05:29 PM, Anthony Liguori wrote:

On 09/12/2010 08:26 AM, Avi Kivity wrote:

 On 09/12/2010 03:08 PM, Anthony Liguori wrote:

This can cause a disk read, no?  Shouldn't it be made asynchronous?



Yes, it should.  I'm not sure there's an easy way to make it 
asynchronous though not because of the block layer but because of 
how these functions are called.


Sorry to harp on the subject, but that's the standard problem with 
state machines.  Every time you want to do a blocking operation in a 
function, you have to put all its locals in some structure, split the 
function into two, do some scheduling, etc.


We can't block the VCPU thread for arbitrarily long periods of time.  
If we get a PIO operation requesting information about geometry, we 
can't wait for a disk read in order to satisfy that request.


We need to kick off the I/O operations in the background such that the 
data is available before the PIO operation happens.  This isn't SM vs. 
threads at all, this is simply about the fact that we can't do block 
I/O during a PIO operation.


Isn't this an identify command, where the guest can only read the data 
after the interface indicates it is ready (or posts an interrupt)?  I 
thought the guest interface was already async.


The code appears to support this:

    switch(val) {
    case WIN_IDENTIFY:
        if (s->bs && s->drive_kind != IDE_CD) {
            if (s->drive_kind != IDE_CFATA)
                ide_identify(s);
            else
                ide_cfata_identify(s);
            s->status = READY_STAT | SEEK_STAT;
            ide_transfer_start(s, s->io_buffer, 512, ide_transfer_stop);
        } else {
            if (s->drive_kind == IDE_CD) {
                ide_set_signature(s);
            }
            ide_abort_command(s);
        }
        ide_set_irq(s->bus);
        break;

but I may be misinterpreting it.  If I'm right, all we need to do is 
push this whole block to a thread.
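
(A hedged sketch of that idea -- in Python with a stand-in for the blocking work rather than the actual C, just to show the shape: the vcpu path submits the work and the completion callback finishes the command later:)

import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def build_identify_data():
    time.sleep(0.1)          # stands in for the part that may touch the disk
    return bytes(512)

def handle_identify(complete_cmd):
    # vcpu thread: hand the blocking work to a worker and return immediately;
    # complete_cmd would fill io_buffer, set the status and raise the IRQ
    future = pool.submit(build_identify_data)
    future.add_done_callback(lambda f: complete_cmd(f.result()))
    return future

f = handle_identify(lambda data: print("IDENTIFY data ready:", len(data), "bytes"))
f.result()                   # only the demo waits; the emulated CPU keeps running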


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-12 Thread Anthony Liguori

On 09/12/2010 10:56 AM, Avi Kivity wrote:
No, the worst case is 0.003% allocated disk, with the allocated 
clusters distributed uniformly.  That means all your L2s are 
allocated, but almost none of your clusters are.


But in this case, you're so sparse that your metadata is pretty much 
co-located which means seek performance won't matter much.




But since you have to boot before you can run any serious test, if it 
takes 5 seconds to do an fsck(), it's highly likely that it's not 
even noticeable.


What if it takes 300 seconds?


That means for a 1TB disk you're taking 500ms per L2 entry, you're fully 
allocated and yet still doing an fsck.  That seems awfully unlikely.



    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()


The in-memory L2 is created by defaultdict().  I did omit linking L2 
into L1, but that's a function call.  With a state machine, it's a new 
string of states and calls.


But you have to write the L2 to disk first before you link it so it's 
not purely in memory.


It's far easier to just avoid internal snapshots altogether and 
this is exactly the thought process that led to QED.  Once you drop 
support for internal snapshots, you can dramatically simplify.


The amount of metadata is O(nb_L2 * nb_snapshots).  For qed, 
nb_snapshots = 1 but nb_L2 can be still quite large.  If fsck is too 
long for one, it is too long for the other.


nb_L2 is very small.  It's exactly n / 2GB + 1 where n is image 
size.  Since image size is typically < 100GB, practically speaking 
it's less than 50.
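
(Worked out with QED-style assumptions -- 2GB of logical coverage per L2 as stated above, and an assumed 256KB table size:)

image_size = 100 * (1 << 30)            # 100GB image
l2_coverage = 2 * (1 << 30)             # 2GB of logical space per L2 table
l2_table_size = 256 * 1024              # assumed table size

nb_l2 = image_size // l2_coverage + 1   # 51 tables
metadata = nb_l2 * l2_table_size        # ~13MB, roughly 0.01% of the image
print(nb_l2, metadata / image_size)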


OTOH, nb_snapshots in qcow2 can be very large.  In fact, it's not 
unrealistic for nb_snapshots to be > 50.  What that means is that 
instead of metadata being O(n) as it is today, it's at least O(n^2).


Why is it n^2?  It's still n*m.  If your image is 4TB instead of 
100GB, the time increases by a factor of 40 for both.


It's n*m but either n ~= m, in which case it's n^2, or m << n, in which 
case it's just n, or m >> n, in which case it's just O(m).


This is where asymptotic complexity ends up not being terribly helpful :-)

Let me put this another way though, if you support internal snapshots, 
what's a reasonable number of snapshots to expect reasonable performance 
with?  10?  100?  1000? 1?


Not doing qed-on-lvm is definitely a limitation.  The one use case 
I've heard is qcow2 on top of clustered LVM as clustered LVM is 
simpler than a clustered filesystem.  I don't know the space well 
enough so I need to think more about it.


I don't either.  If this use case survives, and if qed isn't changed 
to accomodate it, it means that that's another place where qed can't 
supplant qcow2.


I'm okay with that.  An image file should require a file system.  If 
I was going to design an image file to be used on top of raw storage, 
I would take an entirely different approach.


That spreads our efforts further.


No.  I don't think we should be in the business of designing on top of 
raw storage.  Either assume fixed partitions, LVM, or a file system.  We 
shouldn't reinvent the wheel at every opportunity (just the carefully 
chosen opportunities).


Refcount table.  See above discussion  for my thoughts on refcount 
table.


Ok.  It boils down to whether fsck on startup is acceptable.  Without a 
freelist, you need fsck for both unclean shutdown and for UNMAP.


To rebuild the free list on unclean shutdown.


If you have an on-disk compact freelist, you don't need that fsck.


If you have an on-disk compact [consistent] freelist, you don't need 
that fsck.


Consistency is the key point.  We go out of our way to avoid a 
consistent freelist in QED because it's the path to best performance.  
The key goal for a file format should be to have exactly as much 
consistency as required and not one bit more as consistency always means 
worse performance.


On the other hand, allocating a cluster in qcow2 as it is now requires 
scanning the refcount table.  Not very pretty.  Kevin, how does that 
perform?


(an aside: with cache!=none we're bouncing in the kernel as well; we 
really need to make it work for cache=none, perhaps use O_DIRECT for 
data and writeback for metadata and shared backing images).


QED achieves zero-copy with cache=none today.  In fact, our 
performance testing that we'll publish RSN is exclusively with 
cache=none.


In this case, preallocation should really be cheap, since there isn't 
a ton of dirty data that needs to be flushed.  You issue an extra 
flush once in a while so your truncate (or physical image size in the 
header) gets to disk, but that doesn't block new writes.


It makes qed/lvm work, and it replaces the need to fsck for the next 
allocation with the need for a background scrubber to reclaim storage 
(you need that anyway for UNMAP).  It makes the whole thing a lot more 
attractive IMO.


For a 1PB disk image with qcow2, 

Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files

2010-09-12 Thread Anthony Liguori

On 09/12/2010 11:06 AM, Avi Kivity wrote:

 On 09/12/2010 05:26 PM, Anthony Liguori wrote:

On 09/12/2010 08:28 AM, Avi Kivity wrote:

 On 09/12/2010 03:06 PM, Anthony Liguori wrote:


Backing files and logical size shouldn't change during live migration.


Why not?


To make our lives easier.


It means management needs to block volume resize while a live 
migration takes place.  Since live migration is typically done by the 
system automatically, while volume resize happens in response to user 
request, this isn't a good idea.  Both in terms of user experience, 
and in terms of pushing more complexity to management.


We don't do volume resize today so it's a moot point.

Regards,

Anthony Liguori




[Qemu-devel] [Bug 611646] Re: isa bus not working

2010-09-12 Thread Victor Shkamerda
** Changed in: qemu
   Status: New = Fix Committed

-- 
isa bus not working
https://bugs.launchpad.net/bugs/611646
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: Fix Committed

Bug description:
isa bus emulation not working anymore.

Try running qemu -M isapc. It will crash with segmentation fault.

This is a qemu HEAD from git on Fedora linux.





[Qemu-devel] [Bug 636446] Re: prep ppc machine no more working

2010-09-12 Thread Andreas Färber
** Changed in: qemu
   Status: New = Confirmed

-- 
prep ppc machine no more working
https://bugs.launchpad.net/bugs/636446
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: Confirmed

Bug description:
I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to 
use PREP PPC machine the error is the same:


qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS

CPU #0:
(register dump: NIP, LR, CTR, XER, MSR, HID0, HF (cpu idx 0), TB, DECR, GPR00-GPR31, CR, FPR00-FPR31, FPSCR, SRR0, SRR1, SDR1 -- values not preserved in this archive)





Re: [Qemu-devel] [PATCH 01/14] trace: Add trace-events file for declaring trace events

2010-09-12 Thread Stefan Hajnoczi
On Sat, Sep 11, 2010 at 10:53 PM, Andreas Färber andreas.faer...@web.de wrote:
 Am 06.09.2010 um 17:13 schrieb Stefan Hajnoczi:

 diff --git a/Makefile b/Makefile
 index f95cc2f..3c5e6a0 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -1,6 +1,6 @@
 # Makefile for QEMU.

 -GENERATED_HEADERS = config-host.h
 +GENERATED_HEADERS = config-host.h trace.h

 ifneq ($(wildcard config-host.mak),)
 # Put the all: rule here so that config-host.mak can contain dependencies.
 @@ -104,16 +104,24 @@ ui/vnc.o: QEMU_CFLAGS += $(VNC_TLS_CFLAGS)

 bt-host.o: QEMU_CFLAGS += $(BLUEZ_CFLAGS)

 +trace.h: $(SRC_PATH)/trace-events config-host.mak
 +       $(call quiet-command,sh $(SRC_PATH)/tracetool --$(TRACE_BACKEND) -h < $< > $@,"  GEN   $@")
 +
 +trace.c: $(SRC_PATH)/trace-events config-host.mak
 +       $(call quiet-command,sh $(SRC_PATH)/tracetool --$(TRACE_BACKEND) -c < $< > $@,"  GEN   $@")
 +
 +trace.o: trace.c $(GENERATED_HEADERS)

 There's a dependency issue hidden here: The user emulators recurse into
 *-user with just a dependency on trace.h. The build then fails because the
 target Makefile does not know how to build ../trace.o since the rule is in
 this Makefile instead.

 I noticed it with ppc-haiku-user but this should be visible with linux-user
 on a clean build, too.

 Adding trace.o or $(trace-obj-y) somewhere before subdir-libuser seems to
 help. I'll post a patch later on if no one beats me.

Your patch would be appreciated.  For the record, in order to reproduce it:
$ make distclean
$ ./configure --target-list=i386-linux-user
$ make
make[1]: *** No rule to make target `../trace.o', needed by `qemu-i386'.  Stop.

If one of the targets does build trace.o then the build completes
successfully.  That's why all targets build works fine.

Stefan



[Qemu-devel] [Bug 611640] Re: snapshot mode is broken for raw images

2010-09-12 Thread Victor Shkamerda
** Changed in: qemu
   Status: New = Fix Committed

-- 
snapshot mode is broken for raw images
https://bugs.launchpad.net/bugs/611640
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: Fix Committed

Bug description:
The snapshot mode is not working if you use raw format for image instead of 
qcow2. Create a raw disk image by running dd of=xxx.img bs=1 count=0 seek=8G. 
Then run qemu -snapshot xxx.img. In monitor console run info block. There 
should be a qcow2 temporary image backed by the raw image. Now run commit 
ide0-hd0 and after that again info block. You should see now that there is 
no more image, so any operations on it will fail from now on. This is for 
latest HEAD in git.





Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

2010-09-12 Thread Anthony Liguori

On 09/12/2010 11:45 AM, Avi Kivity wrote:

Streaming relies on copy-on-read to do the writing.



Ah.  You can avoid the copy-on-read implementation in the block format 
driver and do it completely in generic code.


Copy on read takes advantage of temporal locality.  You wouldn't want to 
stream without copy on read because you decrease your idle I/O time by 
not effectively caching.



stream_4():
    increment offset
    if more:
        bdrv_aio_stream()


Of course, need to serialize wrt guest writes, which adds a bit more 
complexity.  I'll leave it to you to code the state machine for that.


http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d 



Clever - it pushes all the synchronization into the copy-on-read 
implementation.  But the serialization there hardly jumps out of the 
code.


Do I understand correctly that you can only have one allocating read 
or write running?


Cluster allocation, L2 cache allocation, or on-disk L2 allocation?

You only have one on-disk L2 allocation at one time.  That's just an 
implementation detail at the moment.  An on-disk L2 allocation happens 
only when writing to a new cluster that requires a totally new L2 
entry.  Since L2s cover 2GB of logical space, it's a rare event so this 
turns out to be pretty reasonable for a first implementation.


Parallel on-disk L2 allocations is not that difficult, it's just a 
future TODO.




Generally, I think the block layer makes more sense if the interface 
to the formats are high level and code sharing is achieved not by 
mandating a world view but rather by making libraries of common 
functionality.   This is more akin to how the FS layer works in Linux.


So IMHO, we ought to add a bdrv_aio_commit function, turn the current 
code into a generic_aio_commit, implement a qed_aio_commit, then 
somehow do qcow2_aio_commit, and look at what we can refactor into 
common code.


What Linux does is have an equivalent of bdrv_generic_aio_commit() 
which most implementations call (or default to), and only do something 
if they want something special.  Something like commit (or 
copy-on-read, or copy-on-write, or streaming) can be implement 100% in 
terms of the generic functions (and indeed qcow2 backing files can be 
any format).


Yes, what I'm really saying is that we should take the 
bdrv_generic_aio_commit() approach.  I think we're in agreement here.


Regards,

Anthony Liguori




Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Blue Swirl
On Sun, Sep 12, 2010 at 11:26 AM, Andreas Färber
636...@bugs.launchpad.net wrote:
 Public bug reported:

 Running `LANG=C LC_ALL=C ./configure --prefix=...
 --install=/usr/ucb/install` on Solaris 10 amd64 results in the following
 errors:

 ./configure: bad substitution
 ./configure: !: not found
 ./configure: curl-config: not found
 ./configure: curl-config: not found

 Error: invalid trace backend
 Please choose a supported trace backend.

What is the output of sh ./tracetool --nop --check-backend?



 Unfortunately it doesn't print the line numbers of the errors. It must be 
 somewhere after the check for `install`.

 The first few can be resolved by running `bash ./configure ...` instead.

 The check if trace backend exists hardcodes `sh
 $source_path/tracetool ...` in configure. Replacing sh with bash makes
 it work.

 `gmake` complains Makefile:331: no file name for -include, which is a 
 filter for *.d files.
 `create_config` gets the 'bad substitution' error as well. Replacing sh with 
 bash in rules.mak works.
 etc.

 To sum it up,
 a) there are shell script incompatibilities with Solaris 10's /bin/sh shell, 
 and

I fixed one in 2184d75b4a6a253e8b1e002b3dbcc85c20ba6041 and now
Milax's /bin/sh is happy.



[Qemu-devel] Re: [PATCH v3] Introduce qemu_madvise()

2010-09-12 Thread Blue Swirl
On Sun, Sep 12, 2010 at 12:55 PM, Andreas Färber andreas.faer...@web.de wrote:
 From: Andreas Färber afaer...@opensolaris.org

 vl.c has a Sun-specific hack to supply a prototype for madvise(),
 but the call site has apparently moved to arch_init.c.

 Haiku doesn't implement madvise() in favor of posix_madvise().
 OpenBSD and Solaris 10 don't implement posix_madvise() but madvise().

 Check for madvise() and posix_madvise() in configure and supply qemu_madvise()
 as wrapper. Prefer madvise() over posix_madvise() due to flag availability.
 Convert all callers to use qemu_madvise() and QEMU_MADV_*. No functional 
 change
 except for arch_init.c:ram_load() now potentially falling back to 
 posix_madvise()
 or no-op in lack of both.

 v2 - v3:
 * Reuse the *_MADV_* defines for QEMU_MADV_*. Suggested by Alexander Graf.
 * Add configure check for madvise(), too.
  Add defines to Makefile, not QEMU_CFLAGS.
  Convert all callers, untested. Suggested by Blue Swirl.
 * Keep Solaris' madvise() prototype around. Pointed out by Alexander Graf.

 v1 - v2:
 * Don't rely on posix_madvise() availability, add qemu_madvise().
  Suggested by Blue Swirl.

 Signed-off-by: Andreas Färber afaer...@opensolaris.org
 Cc: Blue Swirl blauwir...@gmail.com
 Cc: Alexander Graf ag...@suse.de
 ---
  arch_init.c         |    2 +-
  configure           |   33 +
  exec.c              |    8 
  hw/virtio-balloon.c |    4 ++--
  kvm-all.c           |    6 +++---
  osdep.c             |   15 +++
  osdep.h             |   25 +
  vl.c                |    3 ---
  8 files changed, 83 insertions(+), 13 deletions(-)

 diff --git a/arch_init.c b/arch_init.c
 index e468c0c..a910033 100644
 --- a/arch_init.c
 +++ b/arch_init.c
 @@ -396,7 +396,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
  #ifndef _WIN32
             if (ch == 0 
                 (!kvm_enabled() || kvm_has_sync_mmu())) {
 -                madvise(host, TARGET_PAGE_SIZE, MADV_DONTNEED);
 +                qemu_madvise(host, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED);
             }
  #endif
         } else if (flags  RAM_SAVE_FLAG_PAGE) {
 diff --git a/configure b/configure
 index 4061cb7..86558eb 100755
 --- a/configure
 +++ b/configure
 @@ -2069,6 +2069,31 @@ if compile_prog "" "" ; then
  fi

  ##
 +# check if we have madvise
 +
 +madvise=no
 +cat > $TMPC << EOF
 +#include <sys/types.h>
 +#include <sys/mman.h>
 +int main(void) { return madvise(NULL, 0, MADV_DONTNEED); }
 +EOF
 +if compile_prog "" "" ; then
 +    madvise=yes
 +fi
 +
 +##
 +# check if we have posix_madvise
 +
 +posix_madvise=no
 +cat > $TMPC << EOF
 +#include <sys/mman.h>
 +int main(void) { return posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); }
 +EOF
 +if compile_prog "" "" ; then
 +    posix_madvise=yes
 +fi
 +
 +##
  # check if trace backend exists

  sh $source_path/tracetool --$trace_backend --check-backend > /dev/null 2> /dev/null
 @@ -2226,6 +2251,8 @@ echo "KVM support       $kvm"
  echo "fdt support       $fdt"
  echo "preadv support    $preadv"
  echo "fdatasync         $fdatasync"
 +echo "madvise           $madvise"
 +echo "posix_madvise     $posix_madvise"
  echo "uuid support      $uuid"
  echo "vhost-net support $vhost_net"
  echo "Trace backend     $trace_backend"
 @@ -2466,6 +2493,12 @@ fi
  if test "$fdatasync" = "yes" ; then
    echo "CONFIG_FDATASYNC=y" >> $config_host_mak
  fi
 +if test "$madvise" = "yes" ; then
 +  echo "CONFIG_MADVISE=y" >> $config_host_mak
 +fi
 +if test "$posix_madvise" = "yes" ; then
 +  echo "CONFIG_POSIX_MADVISE=y" >> $config_host_mak
 +fi

  # XXX: suppress that
  if [ $bsd = yes ] ; then
 diff --git a/exec.c b/exec.c
 index 380dab5..b1fe3e9 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -2841,8 +2841,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
             new_block->host = file_ram_alloc(new_block, size, mem_path);
             if (!new_block->host) {
                 new_block->host = qemu_vmalloc(size);
 -#ifdef MADV_MERGEABLE
 -                madvise(new_block->host, size, MADV_MERGEABLE);
 +#ifdef QEMU_MADV_MERGEABLE

I'd like to avoid these #ifdefs. How about always #defining
QEMU_MADV_MERGEABLE? QEMU_MADV_* values could be synthetic and not
rely on MADV_*.

 +                qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
  #endif
             }
  #else
 @@ -2858,8 +2858,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
  #else
             new_block->host = qemu_vmalloc(size);
  #endif
 -#ifdef MADV_MERGEABLE
 -            madvise(new_block->host, size, MADV_MERGEABLE);
 +#ifdef QEMU_MADV_MERGEABLE
 +            qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
  #endif
         }
     }
 diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
 index 9fe3886..1e74674 100644
 --- a/hw/virtio-balloon.c
 +++ b/hw/virtio-balloon.c
 @@ -51,8 +51,8 @@ static void 

Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

2010-09-12 Thread Avi Kivity

 On 09/12/2010 07:19 PM, Anthony Liguori wrote:

On 09/12/2010 11:45 AM, Avi Kivity wrote:

Streaming relies on copy-on-read to do the writing.



Ah.  You can avoid the copy-on-read implementation in the block 
format driver and do it completely in generic code.


Copy on read takes advantage of temporal locality.  You wouldn't want 
to stream without copy on read because you decrease your idle I/O time 
by not effectively caching.


I meant, implement copy-on-read in generic code side by side with 
streaming.  Streaming becomes just a prefetch operation (read and 
discard) which lets copy-on-read do the rest.  This is essentially your 
implementation, yes?
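
(A toy sketch of that reading, with invented classes: streaming is nothing but a sequential read-and-discard pass, and copy-on-read does all of the writing as a side effect:)

class CowImage:
    # toy copy-on-read image: reads fall through to the backing data and
    # are copied into the local image on the way back
    def __init__(self, backing, cluster_size):
        self.backing, self.cluster_size = backing, cluster_size
        self.local = {}
    def length(self):
        return len(self.backing)
    def read(self, offset, length):
        if offset not in self.local:
            self.local[offset] = self.backing[offset:offset + length]
        return self.local[offset]

def stream_image(img):
    # sequential prefetch; the result of each read is intentionally discarded
    for offset in range(0, img.length(), img.cluster_size):
        img.read(offset, img.cluster_size)

img = CowImage(bytes(4 * 65536), 65536)
stream_image(img)
print(len(img.local))   # 4 -- every cluster now lives in the local image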





stream_4():
    increment offset
    if more:
        bdrv_aio_stream()


Of course, need to serialize wrt guest writes, which adds a bit 
more complexity.  I'll leave it to you to code the state machine 
for that.


http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d 



Clever - it pushes all the synchronization into the copy-on-read 
implementation.  But the serialization there hardly jumps out of the 
code.


Do I understand correctly that you can only have one allocating read 
or write running?


Cluster allocation, L2 cache allocation, or on-disk L2 allocation?

You only have one on-disk L2 allocation at one time.  That's just an 
implementation detail at the moment.  An on-disk L2 allocation happens 
only when writing to a new cluster that requires a totally new L2 
entry.  Since L2s cover 2GB of logical space, it's a rare event so 
this turns out to be pretty reasonable for a first implementation.


Parallel on-disk L2 allocations is not that difficult, it's just a 
future TODO.


Really, you can just preallocate all L2s.  Most filesystems will touch 
all of them very soon.  qcow2 might save some space for snapshots which 
share L2s (doubtful) or for 4k clusters (historical) but for qed with 
64k clusters, it doesn't save any space.


Linear L2s will also make your fsck *much* quicker.  Size is .01% of 
logical image size.  1MB for a 10GB guest, by the time you install 
something on it that's a drop in the bucket.


If you install a guest on a 100GB disk, what percentage of L2s are 
allocated?






Generally, I think the block layer makes more sense if the interface 
to the formats are high level and code sharing is achieved not by 
mandating a world view but rather by making libraries of common 
functionality.   This is more akin to how the FS layer works in Linux.


So IMHO, we ought to add a bdrv_aio_commit function, turn the 
current code into a generic_aio_commit, implement a qed_aio_commit, 
then somehow do qcow2_aio_commit, and look at what we can refactor 
into common code.


What Linux does is have an equivalent of bdrv_generic_aio_commit() 
which most implementations call (or default to), and only do 
something if they want something special.  Something like commit (or 
copy-on-read, or copy-on-write, or streaming) can be implement 100% 
in terms of the generic functions (and indeed qcow2 backing files can 
be any format).


Yes, what I'm really saying is that we should take the 
bdrv_generic_aio_commit() approach.  I think we're in agreement here.




Strange feeling.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] [Bug 636446] Re: prep ppc machine no more working

2010-09-12 Thread Andreas Färber
This is a known issue, nobody has volunteered to update PReP emulation
yet. The last partially working version was v0.9.1, using an
OpenHackWare ROM file.

Depending on why you want to emulate a PReP machine, you can either try
supplying some proprietary ROM file, or contribute PReP support to
OpenBIOS [1].

[1] http://lists.openbios.org/mailman/listinfo/openbios

-- 
prep ppc machine no more working
https://bugs.launchpad.net/bugs/636446
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: Confirmed

Bug description:
I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to 
use PREP PPC machine the error is the same:


qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS

CPU #0:
(register dump: NIP, LR, CTR, XER, MSR, HID0, HF (cpu idx 0), TB, DECR, GPR00-GPR31, CR, FPR00-FPR31, FPSCR, SRR0, SRR1, SDR1 -- values not preserved in this archive)





Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Andreas Färber

Am 12.09.2010 um 19:22 schrieb Blue Swirl:


On Sun, Sep 12, 2010 at 11:26 AM, Andreas Färber
636...@bugs.launchpad.net wrote:

Error: invalid trace backend
Please choose a supported trace backend.


What is the output of sh ./tracetool --nop --check-backend?


./tracetool: syntax error at line 51: `$' unexpected


Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Blue Swirl
On Sun, Sep 12, 2010 at 5:35 PM, Andreas Färber andreas.faer...@web.de wrote:
 Am 12.09.2010 um 19:22 schrieb Blue Swirl:

 On Sun, Sep 12, 2010 at 11:26 AM, Andreas Färber
 636...@bugs.launchpad.net wrote:

 Error: invalid trace backend
 Please choose a supported trace backend.

 What is the output of sh ./tracetool --nop --check-backend?

 ./tracetool: syntax error at line 51: `$' unexpected

Does this patch fix the problem?

diff --git a/tracetool b/tracetool
index 534cc70..c7582bf 100755
--- a/tracetool
+++ b/tracetool
@@ -48,7 +48,8 @@ get_argnames()
 {
 local nfields field name
 nfields=0
-for field in $(get_args $1); do
+args=`get_args $1`
+for field in $args; do
 nfields=$((nfields + 1))

 # Drop pointer star



Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-12 Thread Avi Kivity

 On 09/12/2010 07:09 PM, Anthony Liguori wrote:

On 09/12/2010 10:56 AM, Avi Kivity wrote:
No, the worst case is 0.003% allocated disk, with the allocated 
clusters distributed uniformly.  That means all your L2s are 
allocated, but almost none of your clusters are.


But in this case, you're so sparse that your metadata is pretty much 
co-located which means seek performance won't matter much.


You still get the rotational delay.  But yes, the hit is reduced.





But since you have to boot before you can run any serious test, if 
it takes 5 seconds to do an fsck(), it's highly likely that it's not 
even noticeable.


What if it takes 300 seconds?


That means for a 1TB disk you're taking 500ms per L2 entry, you're 
fully allocated and yet still doing an fsck.  That seems awfully 
unlikely.


I meant for a fully populated L1.  That's 10ms per L2.

But since that's 64TB, that's unlikely too.  It can still take 10s for a 
2TB disk.





    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()


The in-memory L2 is created by defaultdict().  I did omit linking L2 
into L1, but that's a function call.  With a state machine, it's a new 
string of states and calls.


But you have to write the L2 to disk first before you link it so it's 
not purely in memory.


That's fine.  Threading allows you to have blocking calls.  It's slower, 
but very rare anyway.


Why is in n^2?  It's still n*m.  If your image is 4TB instead of 
100GB, the time increases by a factor of 40 for both.


It's n*m but either n ~= m, in which case it's n^2, or m << n, in which 
case it's just n, or m >> n, in which case it's just O(m).


This is where asymptotic complexity ends up not being terribly helpful 
:-)


Let me put this another way though, if you support internal snapshots, 
what's a reasonable number of snapshots to expect reasonable 
performance with?  10?  100?  1000? 1?


I'd say 10.  Not that I really want to support internal snapshots, it 
doesn't work well with multiple disks.


I'm okay with that.  An image file should require a file system.  If 
I was going to design an image file to be used on top of raw 
storage, I would take an entirely different approach.


That spreads our efforts further.


No.  I don't think we should be in the business of designing on top of 
raw storage.  Either assume fixed partitions, LVM, or a file system.  
We shouldn't reinvent the wheel at every opportunity (just the 
carefully chosen opportunities).


I agree, but in this case there was no choice.



Refcount table.  See above discussion  for my thoughts on refcount 
table.


Ok.  It boils down to whether fsck on startup is acceptable.  Without a 
freelist, you need fsck for both unclean shutdown and for UNMAP.


To rebuild the free list on unclean shutdown.


If you have an on-disk compact freelist, you don't need that fsck.


If you have an on-disk compact [consistent] freelist, you don't need 
that fsck.


Consistency is the key point.  We go out of our way to avoid a 
consistent freelist in QED because it's the path to best performance.  
The key goal for a file format should be to have exactly as much 
consistency as required and not one bit more as consistency always 
means worse performance.


Preallocation lets you have a consistent (or at least conservative) free 
list, with just a bit of extra consistency.  If you piggy back 
preallocation on guest syncs, you don't even pay for that.


On the other hand, linear L2 (which now become L1) means your fsck is 
just a linear scan of the table, which is probably faster than qcow2 
allocation...
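
(A toy sketch of what that scan amounts to, with plain lists standing in for the on-disk tables:)

def rebuild_freelist(l2_tables, all_clusters):
    # collect every cluster offset referenced by some L2 entry; anything
    # in the image that is not referenced is free (or leaked, hence scrub)
    used = set()
    for table in l2_tables:
        for cluster_offset in table:
            if cluster_offset != 0:
                used.add(cluster_offset)
    return [c for c in all_clusters if c not in used]

l2_tables = [[0, 131072, 0], [196608, 0, 0]]
all_clusters = [131072, 196608, 262144]
print(rebuild_freelist(l2_tables, all_clusters))   # [262144]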




(an aside: with cache!=none we're bouncing in the kernel as well; 
we really need to make it work for cache=none, perhaps use O_DIRECT 
for data and writeback for metadata and shared backing images).


QED achieves zero-copy with cache=none today.  In fact, our 
performance testing that we'll publish RSN is exclusively with 
cache=none.


In this case, preallocation should really be cheap, since there isn't 
a ton of dirty data that needs to be flushed.  You issue an extra 
flush once in a while so your truncate (or physical image size in the 
header) gets to disk, but that doesn't block new writes.


It makes qed/lvm work, and it replaces the need to fsck for the next 
allocation with the need for a background scrubber to reclaim storage 
(you need that anyway for UNMAP).  It makes the whole thing a lot 
more attractive IMO.



For a 1PB disk image with qcow2, the reference count table is 128GB.  
For a 1TB image, the reference count table is 128MB.   For a 128GB 
image, the reference table is 16MB which is why we get away with it 
today.


Anytime you grow the freelist with qcow2, you have to write a brand 
new freelist table and update the metadata synchronously to point to a 
new version of it.  That means for a 1TB image, you're potentially 
writing out 

Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Andreas Färber

Am 12.09.2010 um 19:47 schrieb Blue Swirl:

On Sun, Sep 12, 2010 at 5:35 PM, Andreas Färber andreas.faer...@web.de 
 wrote:

Am 12.09.2010 um 19:22 schrieb Blue Swirl:


What is the output of sh ./tracetool --nop --check-backend?


./tracetool: syntax error at line 51: `$' unexpected


Does this patch fix the problem?

diff --git a/tracetool b/tracetool
index 534cc70..c7582bf 100755
--- a/tracetool
+++ b/tracetool
@@ -48,7 +48,8 @@ get_argnames()
{
local nfields field name
nfields=0
-for field in $(get_args $1); do
+args=`get_args $1`
+for field in $args; do


This part yes. (I took the liberty of adding args to the local line  
above)



nfields=$((nfields + 1))


Next error is here:
./tracetool: syntax error at line 53: `nfields=$' unexpected


Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Adhyas Avasthi
Thanks for the response. I tried './configure' without any arguments
and still the same issue.
Find the output from ./configure and make attached.

Thanks,
Adhyas

On Sat, Sep 11, 2010 at 10:37 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 Please  include the full output of configure.

 Why are you explicitly using --disable-kvm and --disable-xen?

 Regards,

 Anthony Liguori


out_configure
Description: Binary data


out_make
Description: Binary data


Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Anthony Liguori

On 09/12/2010 01:06 PM, Adhyas Avasthi wrote:

Thanks for the response. I tried './configure' without any arguments
and still the same issue.
Find the output from ./configure and make attached.
   


You've got a weird version of make installed.

It's throwing an error when no file is specified for -include but the 
point of -include vs. include is that -include is supposed to be silent 
on error.


What's the output of 'make --version'?

Regards,

Anthony Liguori


Thanks,
Adhyas

On Sat, Sep 11, 2010 at 10:37 AM, Anthony Liguorianth...@codemonkey.ws  wrote:
   

Please  include the full output of configure.

Why are you explicitly using --disable-kvm and --disable-xen?

Regards,

Anthony Liguori
 





Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-12 Thread Anthony Liguori

On 09/12/2010 12:51 PM, Avi Kivity wrote:

 On 09/12/2010 07:09 PM, Anthony Liguori wrote:

On 09/12/2010 10:56 AM, Avi Kivity wrote:
No, the worst case is 0.003% allocated disk, with the allocated 
clusters distributed uniformly.  That means all your L2s are 
allocated, but almost none of your clusters are.


But in this case, you're so sparse that your metadata is pretty much 
co-located which means seek performance won't matter much.


You still get the rotational delay.  But yes, the hit is reduced.





But since you have to boot before you can run any serious test, if 
it takes 5 seconds to do an fsck(), it's highly likely that it's 
not even noticeable.


What if it takes 300 seconds?


That means for a 1TB disk you're taking 500ms per L2 entry, you're 
fully allocated and yet still doing an fsck.  That seems awfully 
unlikely.


I meant for a fully populated L1.  That's 10ms per L2.


Your math is off.  A single L2 entry covers 2GB worth of logical space.  
That means a 1TB image consists of 512 L2s.  300 / 512 == .585 which is 
585ms.


That's a fully populated L1 on a 1TB image.

But since that's 64TB, that's unlikely too.  It can still take 10s for 
a 2TB disk.


Ah, you're talking about a 64TB image.  Recall that we can read L2s in 
parallel.  I have trouble imagining that we'd get serialized performance 
with a 64TB backing store.  It's much more likely you've got more than 
one spindle in this scenario.


Why is it n^2?  It's still n*m.  If your image is 4TB instead of 
100GB, the time increases by a factor of 40 for both.


It's n*m but either n ~= m in which case it's n^2 or m << n, in which 
case, it's just n, or m >> n in which case, it's just O(m).


This is where asymptotic complexity ends up not being terribly 
helpful :-)


Let me put this another way though, if you support internal 
snapshots, what's a reasonable number of snapshots to expect 
reasonable performance with?  10?  100?  1000?  10000?


I'd say 10.  Not that I really want to support internal snapshots, it 
doesn't work well with multiple disks.


I don't think that's reasonable.  The folks that I've talked to about 
snapshots seem to want to do crazy things like use it for 
checkpointing.  TBH, I think they're looking for the ability to do 
thousands of checkpoints with an efficient way to release old checkpoints.


I imagine that's the design point things like btrfs are trying to achieve.

On the other hand, linear L2 (which now becomes L1) means your fsck is 
just a linear scan of the table, which is probably faster than qcow2 
allocation...


And this is just a data layout optimization which is the sort of thing 
that we should let performance data drive.


For a 1PB disk image with qcow2, the reference count table is 128GB.  
For a 1TB image, the reference count table is 128MB.   For a 128GB 
image, the reference table is 16MB which is why we get away with it 
today.
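
As a quick sanity check of those figures, here is a back-of-the-envelope
calculation (my own sketch; it assumes 16 KB clusters and 2-byte refcount
entries, which is what makes the numbers above come out; with qcow2's
default 64 KB clusters they would be 4x smaller):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Assumed layout parameters (not taken from the thread). */
    const uint64_t cluster_size   = 16 * 1024;  /* bytes per cluster */
    const uint64_t refcount_entry = 2;          /* bytes per refcount entry */
    const uint64_t image_sizes[]  = { 128ULL << 30, 1ULL << 40, 1ULL << 50 };

    for (int i = 0; i < 3; i++) {
        uint64_t clusters = image_sizes[i] / cluster_size;
        uint64_t refcount_bytes = clusters * refcount_entry;
        /* 128 GB -> 16 MB, 1 TB -> 128 MB, 1 PB -> 128 GB */
        printf("%8llu GB image -> %llu MB of refcounts\n",
               (unsigned long long)(image_sizes[i] >> 30),
               (unsigned long long)(refcount_bytes >> 20));
    }
    return 0;
}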


Anytime you grow the freelist with qcow2, you have to write a brand 
new freelist table and update the metadata synchronously to point to 
a new version of it.  That means for a 1TB image, you're potentially 
writing out 128MB of data just to allocate a new cluster.


s/freelist/refcount table/ to translate to current qcow2 
nomenclature.  This is certainly not fast.  You can add a bunch of 
free blocks each time to mitigate the growth but I can't think of many 
circumstances where a 128MB write isn't going to be noticeable.  And 
it only gets worse as time moves on because 1TB disk images are 
already in use today.




That's a strong point.  qcow2 doubles on each allocation, it 
amortizes, but the delay is certainly going to be noticeable.


You can do it ahead of time (so guest writes don't need to wait) but 
it's still expensive.


The trouble is, safe growth of the reference count table is hard because 
it's contiguous.  That means you need to copy the table to another 
location all at once instead of just creating a new L1 table and reusing 
most of the existing L2 entries.


It's a damning flaw in the format for large images.  You can preallocate 
the whole thing up front to try to avoid the cost at run time but even 
then, that's a huge cost to pay in disk space up front.


It's very easy to neglect the details in something like qcow2.  We've 
been talking like the refcount table is basically free to read and 
write but it's absolutely not.  With large disk images, you're 
caching an awful lot of metadata to read the refcount table in fully.


If you reduce the reference count table to exactly two bits, you can 
store that within the L1/L2 metadata since we have an extra 12 bits 
worth of storage space.  Since you need the L1/L2 metadata anyway, we 
might as well just use that space as the authoritative source of the 
free list information.
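
To illustrate that idea (a sketch of my own, not an existing qcow2 or QED
on-disk layout): with clusters aligned to at least 4 KB, the low 12 bits of
a 64-bit L2 entry are otherwise unused, so a 2-bit reference count can live
in the entry itself:

#include <stdint.h>

#define ENTRY_OFFSET_MASK   (~UINT64_C(0xfff))  /* cluster-aligned offset */
#define ENTRY_REFCOUNT_MASK UINT64_C(0x3)       /* 2-bit refcount in the low bits */

static inline uint64_t entry_offset(uint64_t entry)
{
    return entry & ENTRY_OFFSET_MASK;
}

static inline unsigned entry_refcount(uint64_t entry)
{
    return (unsigned)(entry & ENTRY_REFCOUNT_MASK);
}

static inline uint64_t entry_make(uint64_t offset, unsigned refcount)
{
    return (offset & ENTRY_OFFSET_MASK) | (refcount & ENTRY_REFCOUNT_MASK);
}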


The only difference between qcow2 and qed is that since we use an 
on-demand table for L1/L2, our free list may be non-contiguous.  
Since we store virtual -> physical instead of physical -> virtual, you 
have to do a 

Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Blue Swirl
On Sun, Sep 12, 2010 at 5:58 PM, Andreas Färber andreas.faer...@web.de wrote:
 Am 12.09.2010 um 19:47 schrieb Blue Swirl:

 On Sun, Sep 12, 2010 at 5:35 PM, Andreas Färber andreas.faer...@web.de
 wrote:

 Am 12.09.2010 um 19:22 schrieb Blue Swirl:

 What is the output of sh ./tracetool --nop --check-backend?

 ./tracetool: syntax error at line 51: `$' unexpected

 Does this patch fix the problem?

 diff --git a/tracetool b/tracetool
 index 534cc70..c7582bf 100755
 --- a/tracetool
 +++ b/tracetool
 @@ -48,7 +48,8 @@ get_argnames()
 {
    local nfields field name
    nfields=0
 -    for field in $(get_args $1); do
  +    args=`get_args $1`
 +    for field in $args; do

 This part yes. (I took the liberty of adding args to the local line above)

        nfields=$((nfields + 1))

 Next error is here:
 ./tracetool: syntax error at line 53: `nfields=$' unexpected

That looks fully standards compliant, so Solaris' /bin/sh is not.
Can you try what happens with /usr/xpg4/bin/sh?



Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Andreas Färber

Am 12.09.2010 um 21:26 schrieb Anthony Liguori:


On 09/12/2010 01:06 PM, Adhyas Avasthi wrote:

Thanks for the response. I tried './configure' without any arguments
and still the same issue.
Find the output from ./configure and make attached.



You've got a weird version of make installed.

It's throwing an error when no file is specified for -include


A warning, not an error. I reported a similar issue here:
https://bugs.launchpad.net/bugs/636315

but the point of -include vs. include is that -include is supposed  
to be silent on error.


The manual is very brief and doesn't touch on advanced -include use  
cases like filters that might result in an empty set of files:
If you want make to simply ignore a makefile which does not exist or  
cannot be remade, with no error message, use the -include directive  
instead of include, like this:


 -include filenames...
This acts like include in every way except that there is no error  
(not even a warning) if any of the filenames (or any prerequisites  
of any of the filenames) do not exist or cannot be remade.



http://www.gnu.org/software/make/manual/html_node/Include.html#Include


Makefile:331 will be the include for *.d files, and so is Makefile.hw: 
24.




What's the output of 'make --version'?


bash-3.00$ gmake --version
GNU Make 3.80
Copyright (C) 2002  Free Software Foundation, Inc.
This is free software; see the sources for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

Adhyas' looked like a 3.79.1 preview, judging by the command name  
make-3.79.1-p7. ;)


Regards,

Andreas



[Qemu-devel] [PATCH 1/5] use qemu_blockalign consistently

2010-09-12 Thread Christoph Hellwig
Use qemu_blockalign for all allocations in the block layer.  This allows
increasing the required alignment, which is need to support O_DIRECT on
devices with large block sizes.

Signed-off-by: Christoph Hellwig h...@lst.de
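
For context, qemu_blockalign is presumably just a thin wrapper that picks the
device's required buffer alignment, falling back to 512 bytes when none is
known, before calling qemu_memalign; a minimal sketch:

/* Minimal sketch (assumed, not quoted from the tree): allocate a buffer
 * aligned to the device's buffer_alignment, defaulting to 512 bytes. */
void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
    size_t align = (bs && bs->buffer_alignment) ? bs->buffer_alignment : 512;

    return qemu_memalign(align, size);
}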

Index: qemu/hw/scsi-disk.c
===
--- qemu.orig/hw/scsi-disk.c2010-09-12 14:42:40.942759377 -0300
+++ qemu/hw/scsi-disk.c 2010-09-12 14:43:04.694759377 -0300
@@ -70,14 +70,15 @@ struct SCSIDiskState
 char *serial;
 };
 
-static SCSIDiskReq *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun)
+static SCSIDiskReq *scsi_new_request(SCSIDiskState *s, uint32_t tag,
+uint32_t lun)
 {
 SCSIRequest *req;
 SCSIDiskReq *r;
 
-req = scsi_req_alloc(sizeof(SCSIDiskReq), d, tag, lun);
+req = scsi_req_alloc(sizeof(SCSIDiskReq), &s->qdev, tag, lun);
 r = DO_UPCAST(SCSIDiskReq, req, req);
-r->iov.iov_base = qemu_memalign(512, SCSI_DMA_BUF_SIZE);
+r->iov.iov_base = qemu_blockalign(s->bs, SCSI_DMA_BUF_SIZE);
 return r;
 }
 
@@ -939,7 +940,7 @@ static int32_t scsi_send_command(SCSIDev
 }
 /* ??? Tags are not unique for different luns.  We only implement a
single lun, so this should not matter.  */
-r = scsi_new_request(d, tag, lun);
+r = scsi_new_request(s, tag, lun);
 outbuf = (uint8_t *)r->iov.iov_base;
 is_write = 0;
 DPRINTF("Command: lun=%d tag=0x%x data=0x%02x\n", lun, tag, buf[0]);
Index: qemu/hw/sd.c
===
--- qemu.orig/hw/sd.c   2010-09-12 14:31:04.387759376 -0300
+++ qemu/hw/sd.c2010-09-12 14:43:04.695759377 -0300
@@ -440,7 +440,7 @@ SDState *sd_init(BlockDriverState *bs, i
 SDState *sd;
 
 sd = (SDState *) qemu_mallocz(sizeof(SDState));
-sd->buf = qemu_memalign(512, 512);
+sd->buf = qemu_blockalign(bs, 512);
 sd->spi = is_spi;
 sd->enable = 1;
 sd_reset(sd, bs);
Index: qemu/qemu-io.c
===
--- qemu.orig/qemu-io.c 2010-09-12 14:31:04.394759376 -0300
+++ qemu/qemu-io.c  2010-09-12 14:43:04.699759377 -0300
@@ -61,7 +61,7 @@ static void *qemu_io_alloc(size_t len, i
 
if (misalign)
len += MISALIGN_OFFSET;
-   buf = qemu_memalign(512, len);
+   buf = qemu_blockalign(bs, len);
memset(buf, pattern, len);
if (misalign)
buf += MISALIGN_OFFSET;
Index: qemu/qemu-nbd.c
===
--- qemu.orig/qemu-nbd.c2010-09-12 14:31:04.401759376 -0300
+++ qemu/qemu-nbd.c 2010-09-12 14:43:04.706759377 -0300
@@ -446,7 +446,7 @@ int main(int argc, char **argv)
 max_fd = sharing_fds[0];
 nb_fds++;
 
-data = qemu_memalign(512, NBD_BUFFER_SIZE);
+data = qemu_blockalign(bs, NBD_BUFFER_SIZE);
 if (data == NULL)
 errx(EXIT_FAILURE, Cannot allocate data buffer);
 
Index: qemu/posix-aio-compat.c
===
--- qemu.orig/posix-aio-compat.c2010-09-12 14:42:46.725759377 -0300
+++ qemu/posix-aio-compat.c 2010-09-12 14:43:04.711759377 -0300
@@ -270,7 +270,7 @@ static ssize_t handle_aiocb_rw(struct qe
  * Ok, we have to do it the hard way, copy all segments into
  * a single aligned buffer.
  */
-buf = qemu_memalign(512, aiocb->aio_nbytes);
+buf = qemu_blockalign(aiocb->common.bs, aiocb->aio_nbytes);
 if (aiocb->aio_type & QEMU_AIO_WRITE) {
 char *p = buf;
 int i;



[Qemu-devel] [PATCH 2/5] raw-posix: handle 512 byte alignment correctly

2010-09-12 Thread Christoph Hellwig
Replace the hardcoded handling of 512 byte alignment with bs->buffer_alignment
to handle larger sector size devices correctly.

Note that we can not rely on it being initialized in bdrv_open, so deal
with the worst case there.

Signed-off-by: Christoph Hellwig h...@lst.de
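
The conversion below relies on the usual power-of-two masking idiom: with
mask = alignment - 1, offset & mask gives the misalignment and
(x + mask) & ~mask rounds up to the next boundary. A small standalone
sketch (names are mine, not from the patch):

#include <stdint.h>
#include <assert.h>

/* Round x up to the next multiple of a power-of-two alignment. */
static uint64_t round_up(uint64_t x, uint64_t alignment)
{
    uint64_t mask = alignment - 1;

    assert(alignment != 0 && (alignment & mask) == 0);
    return (x + mask) & ~mask;
}

/* How far offset sits past the previous aligned boundary. */
static uint64_t misalignment(uint64_t offset, uint64_t alignment)
{
    return offset & (alignment - 1);
}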

Index: qemu/block/raw-posix.c
===
--- qemu.orig/block/raw-posix.c 2010-09-12 14:31:04.344759376 -0300
+++ qemu/block/raw-posix.c  2010-09-12 14:46:02.120759376 -0300
@@ -97,12 +97,12 @@
 #define FTYPE_CD 1
 #define FTYPE_FD 2
 
-#define ALIGNED_BUFFER_SIZE (32 * 512)
-
 /* if the FD is not accessed during that time (in ms), we try to
reopen it to see if the disk has been changed */
 #define FD_OPEN_TIMEOUT 1000
 
+#define MAX_BLOCKSIZE  4096
+
 typedef struct BDRVRawState {
 int fd;
 int type;
@@ -118,7 +118,8 @@ typedef struct BDRVRawState {
 int use_aio;
 void *aio_ctx;
 #endif
-uint8_t* aligned_buf;
+uint8_t *aligned_buf;
+unsigned aligned_buf_size;
 } BDRVRawState;
 
 static int fd_open(BlockDriverState *bs);
@@ -161,7 +162,12 @@ static int raw_open_common(BlockDriverSt
 s->aligned_buf = NULL;
 
 if ((bdrv_flags & BDRV_O_NOCACHE)) {
-s->aligned_buf = qemu_blockalign(bs, ALIGNED_BUFFER_SIZE);
+/*
+ * Allocate a buffer for read/modify/write cycles.  Chose the size
+ * pessimistically as we don't know the block size yet.
+ */
+s->aligned_buf_size = 32 * MAX_BLOCKSIZE;
+s->aligned_buf = qemu_memalign(MAX_BLOCKSIZE, s->aligned_buf_size);
 if (s->aligned_buf == NULL) {
 goto out_close;
 }
@@ -278,8 +284,9 @@ static int raw_pread_aligned(BlockDriver
 }
 
 /*
- * offset and count are in bytes, but must be multiples of 512 for files
- * opened with O_DIRECT. buf must be aligned to 512 bytes then.
+ * offset and count are in bytes, but must be multiples of the sector size
+ * for files opened with O_DIRECT. buf must be aligned to sector size bytes
+ * then.
  *
  * This function may be called without alignment if the caller ensures
  * that O_DIRECT is not in effect.
@@ -316,24 +323,25 @@ static int raw_pread(BlockDriverState *b
  uint8_t *buf, int count)
 {
 BDRVRawState *s = bs->opaque;
+unsigned sector_mask = bs->buffer_alignment - 1;
 int size, ret, shift, sum;
 
 sum = 0;
 
 if (s->aligned_buf != NULL)  {
 
-if (offset & 0x1ff) {
-/* align offset on a 512 bytes boundary */
+if (offset & sector_mask) {
+/* align offset on a sector size bytes boundary */
 
-shift = offset & 0x1ff;
-size = (shift + count + 0x1ff) & ~0x1ff;
-if (size > ALIGNED_BUFFER_SIZE)
-size = ALIGNED_BUFFER_SIZE;
+shift = offset & sector_mask;
+size = (shift + count + sector_mask) & ~sector_mask;
+if (size > s->aligned_buf_size)
+size = s->aligned_buf_size;
 ret = raw_pread_aligned(bs, offset - shift, s->aligned_buf, size);
 if (ret < 0)
 return ret;
 
-size = 512 - shift;
+size = bs->buffer_alignment - shift;
 if (size > count)
 size = count;
 memcpy(buf, s->aligned_buf + shift, size);
@@ -346,15 +354,15 @@ static int raw_pread(BlockDriverState *b
 if (count == 0)
 return sum;
 }
-if (count & 0x1ff || (uintptr_t) buf & 0x1ff) {
+if (count & sector_mask || (uintptr_t) buf & sector_mask) {
 
 /* read on aligned buffer */
 
 while (count) {
 
-size = (count + 0x1ff) & ~0x1ff;
-if (size > ALIGNED_BUFFER_SIZE)
-size = ALIGNED_BUFFER_SIZE;
+size = (count + sector_mask) & ~sector_mask;
+if (size > s->aligned_buf_size)
+size = s->aligned_buf_size;
 
 ret = raw_pread_aligned(bs, offset, s->aligned_buf, size);
 if (ret < 0) {
@@ -404,25 +412,28 @@ static int raw_pwrite(BlockDriverState *
   const uint8_t *buf, int count)
 {
 BDRVRawState *s = bs->opaque;
+unsigned sector_mask = bs->buffer_alignment - 1;
 int size, ret, shift, sum;
 
 sum = 0;
 
 if (s->aligned_buf != NULL) {
 
-if (offset & 0x1ff) {
-/* align offset on a 512 bytes boundary */
-shift = offset & 0x1ff;
-ret = raw_pread_aligned(bs, offset - shift, s->aligned_buf, 512);
+if (offset & sector_mask) {
+/* align offset on a sector size bytes boundary */
+shift = offset & sector_mask;
+ret = raw_pread_aligned(bs, offset - shift, s->aligned_buf,
+bs->buffer_alignment);
 if (ret < 0)
 return ret;
 
-size = 512 - shift;
+size = bs->buffer_alignment - shift;
 if 

[Qemu-devel] [PATCH 4/5] scsi-disk: propagate the required alignment

2010-09-12 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/scsi-disk.c
===
--- qemu.orig/hw/scsi-disk.c2010-09-12 14:43:04.694759377 -0300
+++ qemu/hw/scsi-disk.c 2010-09-12 14:50:19.049759376 -0300
@@ -1178,6 +1178,7 @@ static int scsi_disk_initfn(SCSIDevice *
 s->qdev.blocksize = s->qdev.conf.logical_block_size;
 }
 s->cluster_size = s->qdev.blocksize / 512;
+s->bs->buffer_alignment = s->qdev.blocksize;
 
 s->qdev.type = TYPE_DISK;
 qemu_add_vm_change_state_handler(scsi_dma_restart_cb, s);



[Qemu-devel] [PATCH 5/5] ide: propagate the required alignment

2010-09-12 Thread Christoph Hellwig
IDE is a bit ugly in this respect.  For one it doesn't really keep track
of a sector size - most of the protocol is in units of 512 bytes, and we
assume 2048 bytes for CDROMs which is correct most of the time.

Second IDE allocates an I/O buffer long before we know if we're dealing
with a CDROM or not, so increase the alignment for the io_buffer
unconditionally.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/ide/core.c
===
--- qemu.orig/hw/ide/core.c 2010-09-12 18:30:06.0 -0300
+++ qemu/hw/ide/core.c  2010-09-12 18:32:29.133759395 -0300
@@ -2645,6 +2645,7 @@ int ide_init_drive(IDEState *s, BlockDri
 if (bdrv_get_type_hint(bs) == BDRV_TYPE_CDROM) {
 s->drive_kind = IDE_CD;
 bdrv_set_change_cb(bs, cdrom_change_cb, s);
+bs->buffer_alignment = 2048;
 } else {
 if (!bdrv_is_inserted(s->bs)) {
 error_report("Device needs media, but drive is empty");
@@ -2679,7 +2680,8 @@ static void ide_init1(IDEBus *bus, int u
 s->bus = bus;
 s->unit = unit;
 s->drive_serial = drive_serial++;
-s->io_buffer = qemu_blockalign(s->bs, IDE_DMA_BUF_SECTORS*512 + 4);
+/* we need at least 2k alignment for accessing CDROMs using O_DIRECT */
+s->io_buffer = qemu_memalign(2048, IDE_DMA_BUF_SECTORS*512 + 4);
 s->io_buffer_total_len = IDE_DMA_BUF_SECTORS*512 + 4;
 s->smart_selftest_data = qemu_blockalign(s->bs, 512);
 s->sector_write_timer = qemu_new_timer(vm_clock,



[Qemu-devel] [PATCH 3/5] virtio-blk: propagate the required alignment

2010-09-12 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/virtio-blk.c
===
--- qemu.orig/hw/virtio-blk.c   2010-09-12 14:42:46.721759377 -0300
+++ qemu/hw/virtio-blk.c2010-09-12 14:51:09.737759375 -0300
@@ -540,6 +540,7 @@ VirtIODevice *virtio_blk_init(DeviceStat
 register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
 virtio_blk_save, virtio_blk_load, s);
 bdrv_set_removable(s->bs, 0);
+s->bs->buffer_alignment = conf->logical_block_size;
 
 return &s->vdev;
 }



Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage

2010-09-12 Thread Andreas Färber

Am 12.09.2010 um 23:05 schrieb Blue Swirl:

On Sun, Sep 12, 2010 at 5:58 PM, Andreas Färber andreas.faer...@web.de 
 wrote:

Am 12.09.2010 um 19:47 schrieb Blue Swirl:

   nfields=$((nfields + 1))


./tracetool: syntax error at line 53: `nfields=$' unexpected


That looks fully standards compliant, so Solaris' /bin/sh is not.
Can you try what happens with /usr/xpg4/bin/sh?


Works fine! Must've done something wrong when testing that earlier  
today.


configure, create_config and tracetool with your fix are silent when
/usr/xpg4/bin is in the $PATH.

If you commit it, we can close this ticket. Thanks for your help, Blue.

Build still fails, in qemu-nbd.c due to err.h, but that's unrelated  
to /bin/sh.

After disabling the tools in configure, sparc-softmmu builds fine again.


Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Adhyas Avasthi
bash-3.2$ make --version
GNU Make version 3.79.1-p7, by Richard Stallman and Roland McGrath.
Built for i686-pc-linux-gnu
Copyright (C) 1988, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 2000
Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This is really blocking my work at this point as I cannot even
compile. Is there a workaround that can unblock me for the time being
?

Thanks,
Adhyas

On Sun, Sep 12, 2010 at 12:26 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 What's the output of 'make --version'?



Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Andreas Färber

Am 13.09.2010 um 00:08 schrieb Adhyas Avasthi:


Is there a workaround that can unblock me for the time being
?


Try giving an explicit --target-list= to configure, to narrow down  
what build step causes this. There may be subtle dependency issues,  
cf. my recent patch trace: Fix user emulator dependency on trace  
objects which fixes a similar issue.


As for blocking your work, just try the last stable version.

Cheers,
Andreas



Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Anthony Liguori

On 09/12/2010 04:16 PM, Andreas Färber wrote:

Am 12.09.2010 um 21:26 schrieb Anthony Liguori:


On 09/12/2010 01:06 PM, Adhyas Avasthi wrote:

Thanks for the response. I tried './configure' without any arguments
and still the same issue.
Find the output from ./configure and make attached.



You've got a weird version of make installed.

It's throwing an error when no file is specified for -include


A warning, not an error. I reported a similar issue here:
https://bugs.launchpad.net/bugs/636315

but the point of -include vs. include is that -include is supposed to 
be silent on error.


The manual is very brief and doesn't touch on advanced -include use 
cases like filters that might result in an empty set of files:
If you want make to simply ignore a makefile which does not exist or 
cannot be remade, with no error message, use the -include directive 
instead of include, like this:


 -include filenames...
This acts like include in every way except that there is no error 
(not even a warning) if any of the filenames (or any prerequisites of 
any of the filenames) do not exist or cannot be remade.



http://www.gnu.org/software/make/manual/html_node/Include.html#Include


Makefile:331 will be the include for *.d files, and so is Makefile.hw:24.



What's the output of 'make --version'?


bash-3.00$ gmake --version
GNU Make 3.80
Copyright (C) 2002  Free Software Foundation, Inc.
This is free software; see the sources for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

Adhyas' looked like a 3.79.1 preview, judging by the command name 
make-3.79.1-p7. ;)


I have:

anth...@titi:~/build/qemu$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for x86_64-pc-linux-gnu

Which exhibits none of this silliness.

Are ya'll using Gentoo?

Regards,

Anthony Liguori





Regards,

Andreas





Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread malc
On Fri, 10 Sep 2010, Adhyas Avasthi wrote:

 When I try to run the following commands on rc1 or fresh git (synced 5
 minutes back), I get the following error:
 
 bash-3.2$ ./configure --disable-kvm --disable-xen
 bash-3.2$ make
 Makefile:24: no file name for `-include'
 make-3.79.1-p7[1]: *** No rule to make target `loader.o', needed by
 `all'.  Stop.
 make-3.79.1-p7: *** [subdir-libhw32] Error 2
 
 Any pointers as to what might be causing it?
 

Adding /dev/null to the list might work.

 --
 Adhyas
 
 Two types have compatible type if their types are the same.
     ? ANSI C Standard, 3.1.2.6.
 
 

-- 
mailto:av1...@comtv.ru

Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1

2010-09-12 Thread Adhyas Avasthi
I tried qemu-0.13.0-rc1 and it has the same issues on this machine.
Let me try what you suggest.

Thanks,
Adhyas

On Sun, Sep 12, 2010 at 3:21 PM, Andreas Färber andreas.faer...@web.de wrote:
 Try giving an explicit --target-list= to configure, to narrow down what
 build step causes this. There may be subtle dependency issues, cf. my recent
 patch trace: Fix user emulator dependency on trace objects which fixes a
 similar issue.

 As for blocking your work, just try the last stable version.



[Qemu-devel] [Bug 636446] Re: prep ppc machine no more working

2010-09-12 Thread Massimo Montecchi
I want to emulate PReP machine to be able to install AIX for testing
purpose.

I will ask on openbios list.

-- 
prep ppc machine no more working
https://bugs.launchpad.net/bugs/636446
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.

Status in QEMU: Confirmed

Bug description:
I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if I want to 
use the PReP PPC machine the error is the same:


qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS

CPU #0:
NIP    LR  CTR  XER 
MSR  HID0   HF  idx 0
TB   DECR 
GPR00    
GPR04    
GPR08    
GPR12    
GPR16    
GPR20    
GPR24    
GPR28    
CR   [ -  -  -  -  -  -  -  -  ] RES 
FPR00    
FPR04    
FPR08    
FPR12    
FPR16    
FPR20    
FPR24    
FPR28    
FPSCR 
SRR0  SRR1  SDR1 





Re: [Qemu-devel] Arm : How to investigate Qemu blank screen on target system problem?

2010-09-12 Thread Mulyadi Santosa
On Sun, Sep 12, 2010 at 02:43, Robert Smith wmk...@yahoo.com wrote:
 Sirs,

 I've built kernel  v2.6.33.2 and tryintg to simulate realview (or integrator) 
 arm board on qemu under Ubuntu 9.10.

 make ARCH=arm realview_defconfig
 make ARCH=arm CROSS_COMPILE=arm-linux-
 qemu-system-arm -M realview -kernel zImage.2.6.33.2.realview -initrd 
 initramfs.cpio.gz

 or

 make ARCH=arm integrator_defconfig
 make ARCH=arm CROSS_COMPILE=arm-linux-
 qemu-system-arm -M integratorcp -kernel zImage.2.6.33.2.integrator -initrd 
 initramfs.cpio.gz

 In both cases I am getting blank Qemu screen on the target system.

I am thinking, is it due to the serial output? I.e., it's not displaying
to the standard console but to the serial console?

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com



[Qemu-devel] PowerPC code generation and the program counter

2010-09-12 Thread Stu Grossman
I've been using qemu-0.12.4 to trace accesses to non-existent addresses, but I've
found that the PC is incorrect when cpu_abort() is called from within the
unassigned memory helper routines (unassigned_mem_read[bwl] and
unassigned_mem_write[bwl]).  Even nearby instructions (plus or minus 10
instructions or so) don't match the type of load or store being done, so this
isn't a PC being current_instr+4 kind of thing.

I ended up modifying the GEN_LD* and GEN_ST* macros (in target-ppc/translate.c)
to call gen_update_nip(ctx, ctx->nip - 4).  This fixed the above problem, which
has helped enormously.

Since I'm not a qemu expert, I was wondering about several things:

1) Was it really necessary to add gen_update_nip to the load and store
   instructions in order to get the correct PC?  Could the correct PC
   have been derived some other way, without a runtime cost for all
   basic loads and stores?
2) As the current code lacks that fix, the basic load and store
   instructions will save an incorrect PC if an exception occurs.  If
   so, how come nobody noticed this before?  I think that exceptions
   would have srr0 pointing at the last instruction which called
   gen_update_nip.  So when the target returns from a data exception,
   it might re-execute some instructions.  Possibly harmless, but could
   lead to subtle bugs...

Thanks, Stu
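
For readers without translate.c at hand, here is roughly what one GEN_LD
expansion (e.g. lwz) looks like with the proposed change; this is a sketch
based on the macros the patch touches, not a verbatim expansion:

static void gen_lwz(DisasContext *ctx)
{
    TCGv EA;

    /* Record the guest PC of this instruction before emitting the access,
     * so a fault taken inside the load helper reports a precise nip;
     * ctx->nip already points past the instruction, hence the -4. */
    gen_update_nip(ctx, ctx->nip - 4);
    gen_set_access_type(ctx, ACCESS_INT);
    EA = tcg_temp_new();
    gen_addr_imm_index(ctx, EA, 0);
    gen_qemu_ld32u(ctx, cpu_gpr[rD(ctx->opcode)], EA);
    tcg_temp_free(EA);
}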

Here's the patch if anybody is interested:

*** translate.c~	Sat Sep 11 23:43:25 2010
--- translate.c	Sun Sep 12 20:49:53 2010
***************
*** 2549,2554 ****
--- 2549,2555 ----
  {                                                           \
      TCGv EA;                                                \
      gen_set_access_type(ctx, ACCESS_INT);                   \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      EA = tcg_temp_new();                                    \
      gen_addr_imm_index(ctx, EA, 0);                         \
      gen_qemu_##ldop(ctx, cpu_gpr[rD(ctx->opcode)], EA);     \
***************
*** 2564,2569 ****
--- 2565,2571 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL); \
          return;                                             \
      }                                                       \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      if (type == PPC_64B)                                    \
***************
*** 2584,2589 ****
--- 2586,2592 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL); \
          return;                                             \
      }                                                       \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      gen_addr_reg_index(ctx, EA);                            \
***************
*** 2596,2601 ****
--- 2599,2605 ----
  static void glue(gen_, name##x)(DisasContext *ctx)          \
  {                                                           \
      TCGv EA;                                                \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      gen_addr_reg_index(ctx, EA);                            \
***************
*** 2693,2698 ****
--- 2697,2703 ----
  static void glue(gen_, name)(DisasContext *ctx)             \
  {                                                           \
      TCGv EA;                                                \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      gen_addr_imm_index(ctx, EA, 0);                         \
***************
*** 2708,2713 ****
--- 2713,2719 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL); \
          return;                                             \
      }                                                       \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      if (type == PPC_64B)                                    \
***************
*** 2727,2732 ****
--- 2733,2739 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL); \
          return;                                             \
      }                                                       \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      gen_addr_reg_index(ctx, EA);                            \
***************
*** 2739,2744 ****
--- 2746,2752 ----
  static void glue(gen_, name##x)(DisasContext *ctx)          \
  {                                                           \
      TCGv EA;                                                \
+     gen_update_nip(ctx, ctx->nip - 4);                      \
      gen_set_access_type(ctx, ACCESS_INT);                   \
      EA = tcg_temp_new();                                    \
      gen_addr_reg_index(ctx, EA);                            \