Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/11/2010 05:20 PM, Joerg Roedel wrote:
> On Sat, Sep 11, 2010 at 03:43:02PM +0200, Alexander Graf wrote:
>>> @@ -305,6 +322,8 @@ static x86_def_t builtin_x86_defs[] = {
>>>          CPUID_EXT3_OSVW, CPUID_EXT3_IBS */
>>>      .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM |
>>>          CPUID_EXT3_ABM | CPUID_EXT3_SSE4A,
>>> +    .svm_features = CPUID_SVM_NPT | CPUID_SVM_LBRV | CPUID_SVM_NRIPSAVE |
>>> +        CPUID_SVM_VMCBCLEAN,
>> Does that phenom already do all those? It does NPT, but I'm not sure
>> about NRIPSAVE for example.
> Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
> Phenoms don't have it. For the SVM features it is not that important
> what the host hardware supports but what KVM can emulate. VMCBCLEAN can
> be emulated without supporting it in the host for example.

Well, let's have a phenom2 type for those new features (and any other
features the phenom 2 has). What's the point of using the name of
existing hardware if it doesn't match that hardware?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 12.09.2010, at 08:05, Avi Kivity wrote:
> On 09/11/2010 05:20 PM, Joerg Roedel wrote:
>> On Sat, Sep 11, 2010 at 03:43:02PM +0200, Alexander Graf wrote:
>>>> @@ -305,6 +322,8 @@ static x86_def_t builtin_x86_defs[] = {
>>>>          CPUID_EXT3_OSVW, CPUID_EXT3_IBS */
>>>>      .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM |
>>>>          CPUID_EXT3_ABM | CPUID_EXT3_SSE4A,
>>>> +    .svm_features = CPUID_SVM_NPT | CPUID_SVM_LBRV | CPUID_SVM_NRIPSAVE |
>>>> +        CPUID_SVM_VMCBCLEAN,
>>> Does that phenom already do all those? It does NPT, but I'm not sure
>>> about NRIPSAVE for example.
>> Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
>> Phenoms don't have it. For the SVM features it is not that important
>> what the host hardware supports but what KVM can emulate. VMCBCLEAN can
>> be emulated without supporting it in the host for example.
> Well, let's have a phenom2 type for those new features (and any other
> features the phenom 2 has). What's the point of using the name of
> existing hardware if it doesn't match that hardware?

Isn't that what cpu=host is there for? I don't want to see cpu type
cluttering like we have it on ppc. I added the core2duo type for Mac OS X
guests, for which those are basically the oldest supported CPUs.

For the Phenom type, I honestly don't remember why, but there was also a
good reason to add it. In fact, I use it today to have nested virt without
-cpu host on hardware that's too new for my guests.

Either way, I don't think we need a phenom2 type. The additional features
are minor enough to not really matter, and all use cases I can come up
with require either -cpu host (local virt) or -cpu phenom (migration).

Alex
[Qemu-devel] Re: [PATCH] Introduce qemu_madvise()
On Sat, Sep 11, 2010 at 5:05 PM, Andreas Färber <andreas.faer...@web.de> wrote:
> From: Andreas Färber <afaer...@opensolaris.org>
>
> vl.c has a Sun-specific hack to supply a prototype for madvise(), but
> the call site has apparently moved to arch_init.c.
>
> The underlying issue is that madvise() is not a POSIX function,
> therefore Solaris' _POSIX_C_SOURCE suppresses the prototype.
> Haiku doesn't implement madvise() at all. OpenBSD however doesn't
> implement posix_madvise().
>
> Check for posix_madvise() in configure and supply qemu_madvise() as
> wrapper. Use qemu_madvise() in arch_init.c's ram_load().
>
> http://www.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html
>
> Remaining madvise() users:
> exec.c: limited to __linux__ and/or MADV_MERGEABLE (no POSIX equivalent)
> kvm-all.c: limited to MADV_DONTFORK (no POSIX equivalent), otherwise
> runtime error if !kvm_has_sync_mmu()
> hw/virtio-balloon.c: limited to __linux__

For consistency, I'd convert all users. If an OS doesn't support a flag,
we can return something like -ENOTSUP, which may be checked by the caller
if it cares.

> v1 -> v2:
> * Don't rely on posix_madvise() availability, add qemu_madvise().
>   Suggested by Blue Swirl.
> Signed-off-by: Andreas Färber <afaer...@opensolaris.org>
> Cc: Blue Swirl <blauwir...@gmail.com>
> ---
>  arch_init.c |  2 +-
>  configure   | 11 +++++++++++
>  osdep.c     | 26 ++++++++++++++++++++++++++
>  osdep.h     |  4 ++++
>  vl.c        |  3 ---
>  5 files changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index e468c0c..a910033 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -396,7 +396,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
>  #ifndef _WIN32
>              if (ch == 0 && (!kvm_enabled() || kvm_has_sync_mmu())) {
> -                madvise(host, TARGET_PAGE_SIZE, MADV_DONTNEED);
> +                qemu_madvise(host, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED);
>              }
>  #endif
>          } else if (flags & RAM_SAVE_FLAG_PAGE) {
> diff --git a/configure b/configure
> index 4061cb7..5e6f3e1 100755
> --- a/configure
> +++ b/configure
> @@ -2069,6 +2069,17 @@ if compile_prog ; then
>  fi
>
>  ##########################################
> +# check if we have posix_madvise
> +
> +cat > $TMPC <<EOF
> +#include <sys/mman.h>
> +int main(void) { posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); return 0; }
> +EOF
> +if compile_prog ; then
> +  QEMU_CFLAGS="-DCONFIG_POSIX_MADVISE ${QEMU_CFLAGS}"

Please take a look around configure:2444 how to add new config flags.
I'd also add a similar check for madvise() if the posix_madvise() check
fails.

> +fi
> +
> +##########################################
>  # check if trace backend exists
>
>  sh "$source_path/tracetool" --$trace_backend --check-backend > /dev/null 2> /dev/null
> diff --git a/osdep.c b/osdep.c
> index 30426ff..8c09597 100644
> --- a/osdep.c
> +++ b/osdep.c
> @@ -28,6 +28,7 @@
>  #include <errno.h>
>  #include <unistd.h>
>  #include <fcntl.h>
> +#include <sys/mman.h>
>
>  /* Needed early for CONFIG_BSD etc. */
>  #include "config-host.h"
> @@ -139,6 +140,31 @@ void qemu_vfree(void *ptr)
>  #endif
>
> +int qemu_madvise(void *addr, size_t len, int advice)
> +{
> +#ifdef CONFIG_POSIX_MADVISE
> +    switch (advice) {
> +    case QEMU_MADV_DONTNEED:
> +        advice = POSIX_MADV_DONTNEED;
> +        break;
> +    default:
> +        fprintf(stderr, "Advice %d unhandled.\n", advice);
> +        abort();

This should be an assert, it's an internal error. It's also common to
both cases.
> +    }
> +    return posix_madvise(addr, len, advice);
> +#else

#elif defined(CONFIG_MADVISE)

> +    switch (advice) {
> +    case QEMU_MADV_DONTNEED:
> +        advice = MADV_DONTNEED;
> +        break;

    case QEMU_MADV_DONTFORK:
#ifndef MADV_DONTFORK
        return -ENOTSUP;
#else
        advice = MADV_DONTFORK;
        break;
#endif

Same goes for MADV_MERGEABLE.

> +    default:
> +        fprintf(stderr, "Advice %d unhandled.\n", advice);
> +        abort();
> +    }
> +    return madvise(addr, len, advice);

#else
    return -ENOTSUP;
Re: [Qemu-devel] Re: [PATCH v2 2/9] pcie: helper functions for pcie extended capability.
On Wed, Sep 8, 2010 at 5:38 PM, Wei Xu <we...@cisco.com> wrote:
> Isaku: For binary constants below, to achieve max compatibility with gcc
> versions, I recommend to change to hex (0x...):

Yes, binary constants were only added in GCC 4.3.x. Since they are also a
GCC extension with no obvious way to circumvent their use (as with GCC
attributes), they shouldn't be used.
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 09:16 AM, Alexander Graf wrote:
>>> Depends on which Phenom you have. A Phenom II has NRIPSAVE but the old
>>> Phenoms don't have it. For the SVM features it is not that important
>>> what the host hardware supports but what KVM can emulate. VMCBCLEAN
>>> can be emulated without supporting it in the host for example.
>> Well, let's have a phenom2 type for those new features (and any other
>> features the phenom 2 has). What's the point of using the name of
>> existing hardware if it doesn't match that hardware?
> Isn't that what cpu=host is there for? I don't want to see cpu type
> cluttering like we have it on ppc. I added the core2duo type for Mac OS
> X guests, for which those are basically the oldest supported CPUs.

-cpu host is to pass all supported features into your guest. -cpu phenom
is to pretend you are running on a Phenom CPU. This is useful for a
migration farm for which the greatest common denominator is a Phenom.
Those are separate use cases.

> For the Phenom type, I honestly don't remember why, but there was also a
> good reason to add it. In fact, I use it today to have nested virt
> without -cpu host on hardware that's too new for my guests.

Curious, what guests balk at modern hardware but are fine with a Phenom?

> Either way, I don't think we need a phenom2 type. The additional
> features are minor enough to not really matter and all use cases I can
> come up with require either -cpu host (local virt) or -cpu phenom
> (migration).

I'm fine with this (or with adding phenom2). But don't make phenom
contain flags that real Phenoms don't have.

--
error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH] Introduce qemu_madvise()
Am 12.09.2010 um 09:20 schrieb Blue Swirl:
> On Sat, Sep 11, 2010 at 5:05 PM, Andreas Färber <andreas.faer...@web.de> wrote:
>> Use qemu_madvise() in arch_init.c's ram_load().
>>
>> http://www.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html
>>
>> Remaining madvise() users:
>> exec.c: limited to __linux__ and/or MADV_MERGEABLE (no POSIX equivalent)
>> kvm-all.c: limited to MADV_DONTFORK (no POSIX equivalent), otherwise
>> runtime error if !kvm_has_sync_mmu()
>> hw/virtio-balloon.c: limited to __linux__
>
> For consistency, I'd convert all users. If an OS doesn't support a flag,
> we can return something like -ENOTSUP, which may be checked by the
> caller if it cares.

Will do.

>> diff --git a/configure b/configure
>> index 4061cb7..5e6f3e1 100755
>> --- a/configure
>> +++ b/configure
>> @@ -2069,6 +2069,17 @@ if compile_prog ; then
>>  fi
>>
>> +# check if we have posix_madvise
>> +
>> +cat > $TMPC <<EOF
>> +#include <sys/mman.h>
>> +int main(void) { posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); return 0; }
>> +EOF
>> +if compile_prog ; then
>> +    QEMU_CFLAGS="-DCONFIG_POSIX_MADVISE ${QEMU_CFLAGS}"
>
> Please take a look around configure:2444 how to add new config flags.

I did. It's just not obvious that they find their way from the Makefile
to a C header, unlike autoconf.

> I'd also add a similar check for madvise() if the posix_madvise() check
> fails.

Fine with me.

>> diff --git a/osdep.c b/osdep.c
>> index 30426ff..8c09597 100644
>> --- a/osdep.c
>> +++ b/osdep.c
>> @@ -28,6 +28,7 @@
>>  #include <errno.h>
>>  #include <unistd.h>
>>  #include <fcntl.h>
>> +#include <sys/mman.h>
>>
>>  /* Needed early for CONFIG_BSD etc. */
>>  #include "config-host.h"
>> @@ -139,6 +140,31 @@ void qemu_vfree(void *ptr)
>>  #endif
>>
>> +int qemu_madvise(void *addr, size_t len, int advice)
>> +{
>> +#ifdef CONFIG_POSIX_MADVISE
>> +    switch (advice) {
>> +    case QEMU_MADV_DONTNEED:
>> +        advice = POSIX_MADV_DONTNEED;
>> +        break;
>> +    default:
>> +        fprintf(stderr, "Advice %d unhandled.\n", advice);
>> +        abort();
>
> This should be an assert, it's an internal error. It's also common to
> both cases.
>
>> +    }
>> +    return posix_madvise(addr, len, advice);
>> +#else
>
> #elif defined(CONFIG_MADVISE)
>
>> +    switch (advice) {
>> +    case QEMU_MADV_DONTNEED:
>> +        advice = MADV_DONTNEED;
>> +        break;
>
>     case QEMU_MADV_DONTFORK:
> #ifndef MADV_DONTFORK
>         return -ENOTSUP;
> #else
>         advice = MADV_DONTFORK;
>         break;
> #endif
>
> Same goes for MADV_MERGEABLE.

So you disagree with, or didn't yet read, Alex' suggestion of eliminating
this mapping code in qemu_madvise() altogether? Doing that in a sensible
way would allow code to do:

#ifdef QEMU_MADV_MERGEABLE
...

as before at compile-time. The only qemu_madvise() error condition would
then be the #else below.

>> +    default:
>> +        fprintf(stderr, "Advice %d unhandled.\n", advice);
>> +        abort();
>> +    }
>> +    return madvise(addr, len, advice);
>
> #else
>     return -ENOTSUP;

Thanks,
Andreas
[Qemu-devel] Re: [PATCH] Introduce qemu_madvise()
On Sun, Sep 12, 2010 at 8:50 AM, Andreas Färber <andreas.faer...@web.de> wrote:
> Am 12.09.2010 um 09:20 schrieb Blue Swirl:
>> On Sat, Sep 11, 2010 at 5:05 PM, Andreas Färber <andreas.faer...@web.de> wrote:
>>> Use qemu_madvise() in arch_init.c's ram_load().
>>>
>>> http://www.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html
>>>
>>> Remaining madvise() users:
>>> exec.c: limited to __linux__ and/or MADV_MERGEABLE (no POSIX equivalent)
>>> kvm-all.c: limited to MADV_DONTFORK (no POSIX equivalent), otherwise
>>> runtime error if !kvm_has_sync_mmu()
>>> hw/virtio-balloon.c: limited to __linux__
>>
>> For consistency, I'd convert all users. If an OS doesn't support a
>> flag, we can return something like -ENOTSUP, which may be checked by
>> the caller if it cares.
>
> Will do.
>
>>> diff --git a/configure b/configure
>>> index 4061cb7..5e6f3e1 100755
>>> --- a/configure
>>> +++ b/configure
>>> @@ -2069,6 +2069,17 @@ if compile_prog ; then
>>>  fi
>>>
>>> +# check if we have posix_madvise
>>> +
>>> +cat > $TMPC <<EOF
>>> +#include <sys/mman.h>
>>> +int main(void) { posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); return 0; }
>>> +EOF
>>> +if compile_prog ; then
>>> +    QEMU_CFLAGS="-DCONFIG_POSIX_MADVISE ${QEMU_CFLAGS}"
>>
>> Please take a look around configure:2444 how to add new config flags.
>
> I did. It's just not obvious that they find their way from the Makefile
> to a C header, unlike autoconf.
>
>> I'd also add a similar check for madvise() if the posix_madvise()
>> check fails.
>
> Fine with me.
>
>>> diff --git a/osdep.c b/osdep.c
>>> index 30426ff..8c09597 100644
>>> --- a/osdep.c
>>> +++ b/osdep.c
>>> @@ -28,6 +28,7 @@
>>>  #include <errno.h>
>>>  #include <unistd.h>
>>>  #include <fcntl.h>
>>> +#include <sys/mman.h>
>>>
>>>  /* Needed early for CONFIG_BSD etc. */
>>>  #include "config-host.h"
>>> @@ -139,6 +140,31 @@ void qemu_vfree(void *ptr)
>>>  #endif
>>>
>>> +int qemu_madvise(void *addr, size_t len, int advice)
>>> +{
>>> +#ifdef CONFIG_POSIX_MADVISE
>>> +    switch (advice) {
>>> +    case QEMU_MADV_DONTNEED:
>>> +        advice = POSIX_MADV_DONTNEED;
>>> +        break;
>>> +    default:
>>> +        fprintf(stderr, "Advice %d unhandled.\n", advice);
>>> +        abort();
>>
>> This should be an assert, it's an internal error. It's also common to
>> both cases.
>>
>>> +    }
>>> +    return posix_madvise(addr, len, advice);
>>> +#else
>>
>> #elif defined(CONFIG_MADVISE)
>>
>>> +    switch (advice) {
>>> +    case QEMU_MADV_DONTNEED:
>>> +        advice = MADV_DONTNEED;
>>> +        break;
>>
>>     case QEMU_MADV_DONTFORK:
>> #ifndef MADV_DONTFORK
>>         return -ENOTSUP;
>> #else
>>         advice = MADV_DONTFORK;
>>         break;
>> #endif
>>
>> Same goes for MADV_MERGEABLE.
>
> So you disagree with, or didn't yet read, Alex' suggestion of
> eliminating this mapping code in qemu_madvise() altogether? Doing that
> in a sensible way would allow code to do:
>
> #ifdef QEMU_MADV_MERGEABLE
> ...
>
> as before at compile-time. The only qemu_madvise() error condition
> would then be the #else below.
>
>>> +    default:
>>> +        fprintf(stderr, "Advice %d unhandled.\n", advice);
>>> +        abort();
>>> +    }
>>> +    return madvise(addr, len, advice);
>>
>> #else
>>     return -ENOTSUP;

That's one way too, if nobody cares about madvise() return values for
MADV_MERGEABLE. In any case I'd like to eliminate the #ifdefs in other
places than osdep.[ch].
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
Am 12.09.2010 um 10:01 schrieb Avi Kivity <a...@redhat.com>:
> On 09/12/2010 09:16 AM, Alexander Graf wrote:
>>>> Depends on which Phenom you have. A Phenom II has NRIPSAVE but the
>>>> old Phenoms don't have it. For the SVM features it is not that
>>>> important what the host hardware supports but what KVM can emulate.
>>>> VMCBCLEAN can be emulated without supporting it in the host for
>>>> example.
>>> Well, let's have a phenom2 type for those new features (and any other
>>> features the phenom 2 has). What's the point of using the name of
>>> existing hardware if it doesn't match that hardware?
>> Isn't that what cpu=host is there for? I don't want to see cpu type
>> cluttering like we have it on ppc. I added the core2duo type for Mac
>> OS X guests, for which those are basically the oldest supported CPUs.
>
> -cpu host is to pass all supported features into your guest. -cpu phenom
> is to pretend you are running on a Phenom CPU. This is useful for a
> migration farm for which the greatest common denominator is a Phenom.
> Those are separate use cases.

Exactly. And the benefit from phenom -> phenom2 is so minor that it
doesn't make sense.

>> For the Phenom type, I honestly don't remember why, but there was also
>> a good reason to add it. In fact, I use it today to have nested virt
>> without -cpu host on hardware that's too new for my guests.
>
> Curious, what guests balk at modern hardware but are fine with phenom?

Sles11 GA ;).

>> Either way, I don't think we need a phenom2 type. The additional
>> features are minor enough to not really matter and all use cases I can
>> come up with require either -cpu host (local virt) or -cpu phenom
>> (migration).
>
> I'm fine with this (or with adding phenom2). But don't make phenom
> contain flags that real Phenoms don't have.

Those were my words :).

Alex
Re: [Qemu-devel] Stable patch tracking
On 09/07/2010 09:24 PM, Anthony Liguori wrote:
> A lot of folks tend to reply to patches and suggest them for stable,
> which is certainly appreciated but is very difficult for me to track
> reliably. So here's another way to nominate a patch for inclusion in
> -stable:
>
> http://wiki.qemu.org/Releases/0.13.0
>
> Just copy/paste the template and fill out with the appropriate details.

Can the captcha for logged-in users be removed?

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 12:06 PM, Alexander Graf wrote:
>>> For the Phenom type, I honestly don't remember why, but there was also
>>> a good reason to add it. In fact, I use it today to have nested virt
>>> without -cpu host on hardware that's too new for my guests.
>> Curious, what guests balk at modern hardware but are fine with phenom?
> Sles11 GA ;).

Still curious, how does -cpu host break it?

>>> Either way, I don't think we need a phenom2 type. The additional
>>> features are minor enough to not really matter and all use cases I
>>> can come up with require either -cpu host (local virt) or -cpu phenom
>>> (migration).
>> I'm fine with this (or with adding phenom2). But don't make phenom
>> contain flags that real Phenoms don't have.
> Those were my words :).

Then we are in agreement.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> Image files have two types of data: immutable data that describes
> things like image size, backing files, etc. and mutable data that
> includes offset and reference count tables.

Note: even the logical size is, in principle, mutable. If we introduce
external snapshots, then so are backing files?

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> This fixes a couple of nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS),
> even if we re-read, we may end up with undesired caching. By delaying
> any reads until we absolutely have to, we decrease the likelihood of
> any undesirable caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache.
> We need to make sure to avoid any reads to avoid polluting the local
> cache.
>
> +
>  static void ide_identify(IDEState *s)
>  {
>      uint16_t *p;
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>          return;
>      }
>
> +    guess_geometry(s);
> +

This can cause a disk read, no? Shouldn't it be made asynchronous? Or
just moved to just before the guest starts?

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration
On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> Today, live migration only works when using shared storage that is
> fully cache coherent using raw images. The failure case with weak
> coherence (i.e. NFS) is subtle but nonetheless still exists.
>
> NFS only guarantees close-to-open coherence and when performing a live
> migration, we do an open on the source and an open on the destination.
> We fsync() on the source before launching the destination but since we
> have two simultaneous opens, we're not guaranteed coherence.
>
> This is not necessarily a problem except that we are a bit gratuitous
> in reading from the disk before launching a guest. This means that as
> things stand today, we're guaranteed to read the first 64k of the disk
> and as such, if a client writes to that region during live migration,
> corruption will result.
>
> The second failure condition has to do with image files (such as
> qcow2). Today, we aggressively cache metadata in all image formats and
> that cache is definitely not coherent even with fully coherent shared
> storage. In all image formats, we prefetch at least the L1 table in
> open() which means that if there is a write operation that causes a
> modification to an L1 table, corruption will ensue.
>
> This series attempts to address both of these issues. Technically, if
> an NFS client aggressively prefetches, this solution is not enough, but
> in practice Linux doesn't do that.

I think it is unlikely that it will, but I prefer to be on the right
side of the standards.

Why not delay image open until after migration completes? I know your
concern about the image not being there, but we can verify that with
access(). If the image is deleted between access() and open() then the
user has much bigger problems.

Note that on NFS, removing (and I think chmoding) a file after it is
opened will cause subsequent data access to fail, unlike POSIX.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
On 09/07/2010 04:41 PM, Anthony Liguori wrote:
> Hi,
>
> We've got copy-on-read and image streaming working in QED and before
> going much further, I wanted to bounce some interfaces off of the
> libvirt folks to make sure our final interface makes sense.
>
> Here's the basic idea:
>
> Today, you can create images based on base images that are copy on
> write. With QED, we also support copy on read which forces a copy from
> the backing image on read requests and write requests.

Is copy on read QED-specific? It looks very similar to the commit
command, except with I/O directions reversed. IIRC, commit looks like:

    for each sector:
        if image.mapped(sector):
            backing_image.write(sector, image.read(sector))

whereas copy-on-read looks like:

    def copy_on_read():
        set_ioprio(idle)
        for each sector:
            if not image.mapped(sector):
                image.write(sector, backing_image.read(sector))

    run_in_thread(copy_on_read)

With appropriate locking.

> In addition to copy-on-read, we introduce a notion of streaming a block
> device which means that we search for an unallocated region of the leaf
> image and force a copy-on-read operation.
>
> The combination of copy-on-read and streaming means that you can start
> a guest based on slow storage (like over the network) and bring in
> blocks on demand while also having a deterministic mechanism to
> complete the transfer.
>
> The interface for copy-on-read is just an option within qemu-img
> create. Streaming, on the other hand, requires a bit more thought.
> Today, I have a monitor command that does the following:
>
>     stream <device> <sector offset>
>
> Which will try to stream the minimal amount of data for a single I/O
> operation and then return how many sectors were successfully streamed.
>
> The idea about how to drive this interface is a loop like:
>
>     offset = 0
>     while offset < image_size:
>         wait_for_idle_time()
>         count = stream(device, offset)
>         offset += count

This is way too low-level for the management stack. Have you considered
using the idle class I/O priority to implement this? That would allow
host-wide prioritization. Not sure how to do cluster-wide; I don't think
NFS has the concept of I/O priority.

--
error compiling committee.c: too many arguments to function
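The management-side driver loop under discussion can be sketched end to end with a stubbed-out monitor call. Everything below is illustrative: `stream_cmd` stands in for the proposed `stream <device> <sector offset>` monitor command, and the image size, chunk size, and device name are made up:

```python
import time

IMAGE_SIZE = 1024          # total sectors (hypothetical)
CHUNK = 128                # sectors one stream call copies (hypothetical)
streamed = set()           # sectors already copied into the leaf image

def stream_cmd(device, offset):
    """Stub for the 'stream <device> <sector offset>' monitor command:
    stream a minimal amount of data, return how many sectors were done."""
    count = min(CHUNK, IMAGE_SIZE - offset)
    streamed.update(range(offset, offset + count))
    return count

def wait_for_idle_time():
    time.sleep(0)          # placeholder for a real idle-detection policy

def drive_streaming(device):
    """The driver loop from the proposal: advance by whatever each
    stream call reports until the whole image is allocated locally."""
    offset = 0
    while offset < IMAGE_SIZE:
        wait_for_idle_time()
        offset += stream_cmd(device, offset)
    return offset

print(drive_streaming("virtio0"))   # streams the whole image
```

The sketch makes the objection concrete: the management stack has to poll and pace every chunk itself, whereas an idle-class I/O priority on a background copy thread would push that policy into the host kernel.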
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
Am 12.09.2010 um 12:22 schrieb Avi Kivity <a...@redhat.com>:
> On 09/12/2010 12:06 PM, Alexander Graf wrote:
>>> For the Phenom type, I honestly don't remember why, but there was
>>> also a good reason to add it. In fact, I use it today to have nested
>>> virt without -cpu host on hardware that's too new for my guests.
>> Curious, what guests balk at modern hardware but are fine with phenom?
> Sles11 GA ;).
>
> Still curious, how does -cpu host break it?

Uh, something in the power management module. Just give it a try on an
Istanbul system. Just use the 11.1 iso - the kernels are close enough.

Alex
[Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
Public bug reported:

Running `LANG=C LC_ALL=C ./configure --prefix=... --install=/usr/ucb/install`
on Solaris 10 amd64 results in the following errors:

./configure: bad substitution
./configure: !: not found
./configure: curl-config: not found
./configure: curl-config: not found
Error: invalid trace backend
Please choose a supported trace backend.

Unfortunately it doesn't print the line numbers of the errors. It must be
somewhere after the check for `install`. The first few can be resolved by
running `bash ./configure ...` instead.

The check whether the trace backend exists hardcodes `sh
$source_path/tracetool ...` in configure. Replacing sh with bash makes it
work.

`gmake` complains "Makefile:331: no file name for -include", which is a
filter for *.d files. `create_config` gets the 'bad substitution' error as
well. Replacing sh with bash in rules.mak works. Etc.

To sum it up, a) there are shell script incompatibilities with Solaris
10's /bin/sh shell, and b) hardcoding 'sh' in configure or Makefiles seems
like a bad idea.

QEMU Git 73d7434279e3905164afd02360eebe4b43c7fa (ESP: fix ESP DMA access...)

$ uname -a
SunOS sonnengoettin 5.10 Generic_142901-03 i86pc i386 i86pc
# No banner output for /bin/sh
$ bash --version
GNU bash, version 3.00.16(1)-release (i386-pc-solaris2.10)
Copyright (C) 2004 Free Software Foundation, Inc.

** Affects: qemu
   Importance: Undecided
   Status: New

** Tags: solaris

--
configure and build errors on Solaris 10 due to /bin/sh usage
https://bugs.launchpad.net/bugs/636315
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
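The "bad substitution" class of failure comes from Solaris 10's /bin/sh being a pre-POSIX Bourne shell. A small sketch of portable alternatives (the file names here are purely illustrative, not taken from the QEMU tree):

```shell
#!/bin/sh
# A bash-only expansion such as ${path/.sh/} dies with "bad substitution"
# on dash and on Solaris 10 /bin/sh. A portable replacement uses sed:
path="scripts/tracetool.sh"
name=`echo "$path" | sed 's/\.sh$//'`

# Even the POSIX ${path##*/} form is missing from the pre-POSIX Bourne
# shell; the external `basename` utility works everywhere:
base=`basename "$path"`

echo "$name $base"   # -> scripts/tracetool tracetool.sh
```

This is why invoking scripts through a hardcoded `sh` is fragile: the same script may be fine under bash but break under whatever `/bin/sh` happens to be, so either the scripts must stick to Bourne-compatible constructs or the interpreter must be chosen explicitly.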
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 01:14 PM, Alexander Graf wrote:
> Uh, something in the power management module. Just give it a try on an
> Istanbul system. Just use the 11.1 iso - the kernels are close enough.

Oh, probably some silly MSR.

--
error compiling committee.c: too many arguments to function
[Qemu-devel] [PATCH v3] Introduce qemu_madvise()
From: Andreas Färber <afaer...@opensolaris.org>

vl.c has a Sun-specific hack to supply a prototype for madvise(), but the
call site has apparently moved to arch_init.c.

Haiku doesn't implement madvise() in favor of posix_madvise(). OpenBSD and
Solaris 10 don't implement posix_madvise() but madvise().

Check for madvise() and posix_madvise() in configure and supply
qemu_madvise() as wrapper. Prefer madvise() over posix_madvise() due to
flag availability.

Convert all callers to use qemu_madvise() and QEMU_MADV_*. No functional
change except for arch_init.c:ram_load() now potentially falling back to
posix_madvise() or a no-op in lack of both.

v2 -> v3:
* Reuse the *_MADV_* defines for QEMU_MADV_*. Suggested by Alexander Graf.
* Add configure check for madvise(), too. Add defines to Makefile, not
  QEMU_CFLAGS. Convert all callers, untested. Suggested by Blue Swirl.
* Keep Solaris' madvise() prototype around. Pointed out by Alexander Graf.

v1 -> v2:
* Don't rely on posix_madvise() availability, add qemu_madvise().
  Suggested by Blue Swirl.
Signed-off-by: Andreas Färber <afaer...@opensolaris.org>
Cc: Blue Swirl <blauwir...@gmail.com>
Cc: Alexander Graf <ag...@suse.de>
---
 arch_init.c         |  2 +-
 configure           | 33 +++++++++++++++++++++++++++++++++
 exec.c              |  8 ++++----
 hw/virtio-balloon.c |  4 ++--
 kvm-all.c           |  6 +++---
 osdep.c             | 15 +++++++++++++++
 osdep.h             | 25 +++++++++++++++++++++++++
 vl.c                |  3 ---
 8 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index e468c0c..a910033 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -396,7 +396,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
 #ifndef _WIN32
             if (ch == 0 && (!kvm_enabled() || kvm_has_sync_mmu())) {
-                madvise(host, TARGET_PAGE_SIZE, MADV_DONTNEED);
+                qemu_madvise(host, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED);
             }
 #endif
         } else if (flags & RAM_SAVE_FLAG_PAGE) {
diff --git a/configure b/configure
index 4061cb7..86558eb 100755
--- a/configure
+++ b/configure
@@ -2069,6 +2069,31 @@ if compile_prog ; then
 fi

 ##########################################
+# check if we have madvise
+
+madvise=no
+cat > $TMPC <<EOF
+#include <sys/types.h>
+#include <sys/mman.h>
+int main(void) { return madvise(NULL, 0, MADV_DONTNEED); }
+EOF
+if compile_prog ; then
+  madvise=yes
+fi
+
+##########################################
+# check if we have posix_madvise
+
+posix_madvise=no
+cat > $TMPC <<EOF
+#include <sys/mman.h>
+int main(void) { return posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); }
+EOF
+if compile_prog ; then
+  posix_madvise=yes
+fi
+
+##########################################
 # check if trace backend exists

 sh "$source_path/tracetool" --$trace_backend --check-backend > /dev/null 2> /dev/null
@@ -2226,6 +2251,8 @@ echo "KVM support       $kvm"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
+echo "madvise           $madvise"
+echo "posix_madvise     $posix_madvise"
 echo "uuid support      $uuid"
 echo "vhost-net support $vhost_net"
 echo "Trace backend     $trace_backend"
@@ -2466,6 +2493,12 @@ fi
 if test "$fdatasync" = "yes" ; then
   echo "CONFIG_FDATASYNC=y" >> $config_host_mak
 fi
+if test "$madvise" = "yes" ; then
+  echo "CONFIG_MADVISE=y" >> $config_host_mak
+fi
+if test "$posix_madvise" = "yes" ; then
+  echo "CONFIG_POSIX_MADVISE=y" >> $config_host_mak
+fi

 # XXX: suppress that
 if [ "$bsd" = "yes" ] ; then
diff --git a/exec.c b/exec.c
index 380dab5..b1fe3e9 100644
--- a/exec.c
+++ b/exec.c
@@ -2841,8 +2841,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
         new_block->host = file_ram_alloc(new_block, size, mem_path);
         if (!new_block->host) {
             new_block->host = qemu_vmalloc(size);
-#ifdef MADV_MERGEABLE
-            madvise(new_block->host, size, MADV_MERGEABLE);
+#ifdef QEMU_MADV_MERGEABLE
+            qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
 #endif
         }
 #else
@@ -2858,8 +2858,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
 #else
         new_block->host = qemu_vmalloc(size);
 #endif
-#ifdef MADV_MERGEABLE
-        madvise(new_block->host, size, MADV_MERGEABLE);
+#ifdef QEMU_MADV_MERGEABLE
+        qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
 #endif
     }
 }
diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 9fe3886..1e74674 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -51,8 +51,8 @@ static void balloon_page(void *addr, int deflate)
 {
 #if defined(__linux__)
     if (!kvm_enabled() || kvm_has_sync_mmu())
-        madvise(addr, TARGET_PAGE_SIZE,
-                deflate ? MADV_WILLNEED : MADV_DONTNEED);
+        qemu_madvise(addr, TARGET_PAGE_SIZE,
+                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
 #endif
 }

diff --git
Re: [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration
On 09/12/2010 05:46 AM, Avi Kivity wrote: On 09/11/2010 05:04 PM, Anthony Liguori wrote: Today, live migration only works when using shared storage that is fully cache coherent using raw images. The failure case with weak coherent (i.e. NFS) is subtle but nontheless still exists. NFS only guarantees close-to-open coherence and when performing a live migration, we do an open on the source and an open on the destination. We fsync() on the source before launching the destination but since we have two simultaneous opens, we're not guaranteed coherence. This is not necessarily a problem except that we are a bit gratituous in reading from the disk before launching a guest. This means that as things stand today, we're guaranteed to read the first 64k of the disk and as such, if a client writes to that region during live migration, corruption will result. The second failure condition has to do with image files (such as qcow2). Today, we aggressively cache metadata in all image formats and that cache is definitely not coherent even with fully coherent shared storage. In all image formats, we prefetch at least the L1 table in open() which means that if there is a write operation that causes a modification to an L1 table, corruption will ensue. This series attempts to address both of these issue. Technically, if a NFS client aggressively prefetches this solution is not enough but in practice, Linux doesn't do that. I think it is unlikely that it will, but I prefer to be on the right side of the standards. I've been asking around about this and one thing that was suggested was acquiring a file lock as NFS requires that a lock acquisition drops any client cache for a file. I need to understand this a bit more so it's step #2. Why not delay image open until after migration completes? I know your concern about the image not being there, but we can verify that with access(). If the image is deleted between access() and open() then the user has much bigger problems. 
3/3 would still be needed because if we delay the open we obviously can't do a read until an open. So it's only really a choice between invalidate_cache and delaying open. It's a far less invasive change to just do invalidate_cache though and it has some nice properties. Regards, Anthony Liguori Note that on NFS, removing (and I think chmoding) a file after it is opened will cause subsequent data access to fail, unlike POSIX.
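The lock-based invalidation idea mentioned in this thread could look roughly like the sketch below. This is a hypothetical illustration, not QEMU code: `open_with_lock` is an invented helper, and the cache-revalidation effect of taking a POSIX lock only matters when the file actually lives on NFS — on a local filesystem the script merely demonstrates the call sequence (source writes and fsyncs, destination locks before its first read).

```python
import fcntl
import os
import tempfile

def open_with_lock(path):
    """Open a file and take a whole-file POSIX lock.

    On NFS, acquiring a POSIX lock forces the client to revalidate its
    cached pages for the file, which is the property the thread relies
    on; locally it is just a lock.  Unix-only (fcntl module).
    """
    fd = os.open(path, os.O_RDWR)
    fcntl.lockf(fd, fcntl.LOCK_EX)  # cache-invalidation point on NFS
    return fd

# "Source" side: write the image data and flush it before handover.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    f.write(b"guest disk data")
    f.flush()
    os.fsync(f.fileno())

# "Destination" side: lock first, then read.
fd = open_with_lock(path)
data = os.read(fd, 64)
fcntl.lockf(fd, fcntl.LOCK_UN)
os.close(fd)
os.unlink(path)
```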
Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
On 09/12/2010 05:37 AM, Avi Kivity wrote: On 09/11/2010 05:04 PM, Anthony Liguori wrote: Image files have two types of data: immutable data that describes things like image size, backing files, etc. and mutable data that includes offset and reference count tables. Note: even the logical size is, in principle, mutable. If we introduce external snapshots, then so are backing files? Maybe it should be: data that can be changed during a live migration. Backing files and logical size shouldn't change during live migration. But even so, I think the interface makes sense. It's basically: drop anything you have cached that may change during migration. What needs to be read is dependent on features/format. Regards, Anthony Liguori
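The "drop anything you have cached that may change during migration" interface could be modeled like this. A toy sketch, not QEMU's API: the class and method names are invented, and the on-disk read is a stand-in.

```python
class ImageMetadata:
    """Toy model of an image driver's cached state, split into the
    immutable part (size, backing file) and the mutable part
    (allocation/refcount tables) the thread distinguishes."""

    def __init__(self, size, backing_file):
        self.size = size                  # treated as immutable during migration
        self.backing_file = backing_file  # likewise
        self.l1_table = None              # mutable: may change under us

    def open(self):
        self.l1_table = self._read_l1_from_disk()

    def invalidate_cache(self):
        # Drop anything that may have changed while the source was
        # still writing; keep only what cannot change mid-migration.
        self.l1_table = None

    def _read_l1_from_disk(self):
        return [0] * 16  # stand-in for a real read

img = ImageMetadata(size=10 << 30, backing_file=None)
img.open()
# On the migration destination, before resuming the guest:
img.invalidate_cache()
```

The L1 table is then re-read lazily on first access, so what needs re-reading stays a per-format decision.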
Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
On 09/12/2010 05:42 AM, Avi Kivity wrote: On 09/11/2010 05:04 PM, Anthony Liguori wrote: This fixes a couple of nasty problems relating to live migration. 1) When dealing with shared storage with weak coherence (i.e. NFS), even if we re-read, we may end up with undesired caching. By delaying any reads until we absolutely have to, we decrease the likelihood of any undesirable caching. 2) When dealing with copy-on-read, the local storage acts as a cache. We need to make sure to avoid any reads to avoid polluting the local cache.

+
 static void ide_identify(IDEState *s)
 {
     uint16_t *p;
@@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
         return;
     }
+    guess_geometry(s);
+

This can cause a disk read, no? Shouldn't it be made asynchronous? Yes, it should. I'm not sure there's an easy way to make it asynchronous though not because of the block layer but because of how these functions are called. Or just move it to just before the guest starts? We don't really have a notion of guest starts today although maybe we should. Regards, Anthony Liguori
Re: [Qemu-devel] Stable patch tracking
On 09/12/2010 05:08 AM, Avi Kivity wrote: On 09/07/2010 09:24 PM, Anthony Liguori wrote: A lot of folks tend to reply to patches and suggest them for stable which is certainly appreciated but is very difficult for me to track reliably. So here's another way to nominate a patch for inclusion in -stable: http://wiki.qemu.org/Releases/0.13.0 Just copy/paste the template and fill it out with the appropriate details. Can the captcha for logged in users be removed? Can you be more specific about what you're seeing? There should only be a captcha for creating an account. I've never seen it since I created my account. Regards, Anthony Liguori
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
On 09/10/2010 08:07 PM, Anthony Liguori wrote: On 09/10/2010 10:49 AM, Avi Kivity wrote: If I do a qemu-img create -f qcow2 foo.img 10GB, and then do a naive copy of the image file and end up with a 2GB image when there's nothing in it, that's badness. Only if you crash in the middle. If not, you free the preallocation during shutdown (or when running a guest, when it isn't actively writing at 100 MB/s). Which is potentially guest exploitable. If this worries you, run a scrubber in the background after an uncontrolled crash. Like qed fsck, this will recover the free list from L2. Unlike qed fsck, it will not delay starting of large guests. And what do you do when you shutdown and start up? You're setting a reference count on blocks and keeping metadata in memory that those blocks are really free. Do you need an atexit hook to decrement the reference counts? Not atexit, just when we close the image. Just a detail, but we need an atexit() handler to make sure block devices get closed because we have too many exit()s in the code today. Right. Do you need to create a free list structure that gets written out on close? Yes, the same freelist that we allocate from. It's an allocated but not yet referenced list. Does it get written to disk? On exit or when there is no allocation activity. Just saying we can do batching is not solving the problem. If you want to claim that the formats are equally capable, then at the very least, you have to give a very exact description of how this would work because it's not entirely straightforward. I thought I did, but I realize it is spread over multiple email messages. If you like, I can try to summarize it. It will be equally useful for qed once you add a freelist for UNMAP support. Yes, please consolidate so we can debate specifics. If there's a reasonable way to fix qcow2, I'm happy to walk away from qed.
But we've studied the problem and do not believe there's a reasonable approach to fixing qcow2, where reasonable considers the amount of development effort, the time line required to get things right, and the confidence we would have in the final product, compared against the one-time cost of introducing a new format. I've started something and will post it soon. When considering development time, also consider the time it will take users to actually use qed (6 months for qemu release users, ~9 months on average for semiannual community distro releases, 12-18 months for enterprise distros). Consider also that we still have to support qcow2 since people do use the extra features, and since I don't see us forcing them to migrate. I don't think you have any grounds to make such a statement. No, it's a forward-looking statement. But you're already looking at adding a freelist for UNMAP support and three levels for larger images. So it's safe to say that qed will not remain as nice and simple as it is now. I have a lot of faith in starting from a strong base and avoiding making it weaker vs. starting from a weak base and trying to make it stronger. This has led to many rewrites in the past. I realize it's somewhat subjective though. While qed looks like a good start, it has at least three flaws already (relying on physical image size, relying on fsck, and limited logical image size). Just fixing those will introduce complication. What about new features or newly discovered flaws? 4) We have looked at trying to fix qcow2. It appears to be a monumental amount of work that starts with a rewrite where it's unclear if we can even keep supporting all of the special features. IOW, there is likely to be a need for users to experience some type of image conversion or optimization process. I don't see why. Because you're oversimplifying what it takes to make qcow2 perform well. Maybe. With all its complexity, it doesn't come close to even the simplest filesystem.
The biggest burden is the state machine design. Maybe I'm broken with respect to how I think, but I find state machines very easy to rationalize. Your father's state machine. Not as clumsy or random as a thread; an elegant weapon for a more civilized age. To me, the biggest burden in qcow2 is thinking through how you deal with shared resources. Because you can block for a long period of time during write operations, it's not enough to just carry a mutex during all metadata operations. You have to stage operations and commit them at very specific points in time. The standard way of dealing with this is to have a hash table for metadata that contains a local mutex:

l2cache = defaultdict(L2)

def get_l2(pos):
    l2 = l2cache[pos]
    l2.mutex.lock()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.unlock()
Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
On 09/12/2010 07:41 AM, Avi Kivity wrote: On 09/07/2010 05:57 PM, Anthony Liguori wrote: I agree that streaming should be generic, like block migration. The trivial generic implementation is:

void bdrv_stream(BlockDriverState *bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, n)) {

Three problems here. First problem is that bdrv_is_allocated is synchronous. Put the whole thing in a thread. It doesn't fix anything. You don't want stream to serialize all I/O operations. The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range. You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O. That seems like a nice solution. You probably want a multiple of the cluster size to retain efficiency. What you basically do is:

stream_step_three():
    complete()

stream_step_two(offset, length):
    bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)

And that's exactly what the current code looks like. The only change to the patch that this does is make some of qed's internals be block layer interfaces. One of the things Stefan has mentioned is that a lot of the QED code could be reused by other formats. All formats implement things like CoW on their own today but if you exposed interfaces like bdrv_aio_find_free_cluster(), you could actually implement a lot more in the generic block layer. So, I agree with you in principle that this all should be common code. I think it's a larger effort though. The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format. bdrv_is_allocated() already exists (and is needed for commit), what else is needed? cluster size?
Synchronous implementations are not reusable to implement asynchronous anything. But you need the code to be cluster aware too. Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros. You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros. My QED branch now does that FWIW. At the moment, it only detects zero reads to unallocated clusters and writes a special zero cluster marker. However, the detection code is in the generic path so once the fsck() logic is working, we can implement a free list in QED. In QED, the detection code needs to have a lot of knowledge about cluster boundaries and the format of the device. In principle, this should be common code but it's not for the same reason copy-on-write is not common code today. Regards, Anthony Liguori
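The zero-detection idea discussed here — check a cluster-aligned buffer before allocating and record a zero-cluster marker instead — can be sketched as follows. This is a hypothetical helper, not QED's actual code: the function name and the dict-based cluster map are invented, and it deliberately handles only cluster-aligned writes, mirroring the point that the detector must know cluster boundaries.

```python
CLUSTER = 4096

def write_with_zero_detection(clusters, offset, buf):
    """Before allocating a cluster for `buf`, check whether it is all
    zeros; if so, record a special zero-cluster marker instead of
    allocating space.  Cluster-aligned, cluster-sized writes only."""
    assert offset % CLUSTER == 0 and len(buf) == CLUSTER
    if buf.count(0) == len(buf):       # every byte is zero
        clusters[offset] = "ZERO"      # zero-cluster marker, no allocation
    else:
        clusters[offset] = buf         # normal allocation

clusters = {}
write_with_zero_detection(clusters, 0, b"\0" * CLUSTER)
write_with_zero_detection(clusters, CLUSTER, b"data" + b"\0" * (CLUSTER - 4))
```

With a free list, the "ZERO" branch is also where a previously allocated cluster would be returned to the allocator.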
Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
On 09/12/2010 03:25 PM, Anthony Liguori wrote: On 09/12/2010 07:41 AM, Avi Kivity wrote: On 09/07/2010 05:57 PM, Anthony Liguori wrote: I agree that streaming should be generic, like block migration. The trivial generic implementation is:

void bdrv_stream(BlockDriverState *bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, n)) {

Three problems here. First problem is that bdrv_is_allocated is synchronous. Put the whole thing in a thread. It doesn't fix anything. You don't want stream to serialize all I/O operations. Why would it serialize all I/O operations? It's just like another vcpu issuing reads. The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range. You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O. That seems like a nice solution. You probably want a multiple of the cluster size to retain efficiency. What you basically do is:

stream_step_three():
    complete()

stream_step_two(offset, length):
    bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)

Isn't there a write() missing somewhere? And that's exactly what the current code looks like. The only change to the patch that this does is make some of qed's internals be block layer interfaces. Why do you need find_free_cluster()? That's a physical offset thing. Just write to the same logical offset. IOW:

bdrv_aio_stream():
    bdrv_aio_read(offset, stream_2)

stream_2():
    if all zeros:
        increment offset
        if more:
            bdrv_aio_stream()
    bdrv_aio_write(offset, stream_3)

stream_3():
    bdrv_aio_write(offset, stream_4)

stream_4():
    increment offset
    if more:
        bdrv_aio_stream()

Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that.
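The callback chain sketched above can be made concrete like this. Purely an illustration: the `aio_read`/`aio_write` helpers are invented stand-ins for bdrv_aio_readv/writev (and invoke their callbacks synchronously), the backing file is a dict of clusters, the apparently redundant second write in stream_3 is collapsed into one, and the serialization against guest writes Avi mentions is omitted.

```python
CLUSTER = 4096

class StreamJob:
    """Callback-style rendering of the stream_2/3/4 chain."""

    def __init__(self, backing, size):
        self.backing = backing   # {offset: bytes} still on the backing file
        self.image = {}          # clusters copied into the local image
        self.size = size
        self.offset = 0

    def aio_read(self, offset, cb):
        # Unallocated backing clusters read back as zeros.
        cb(self.backing.get(offset, b"\0" * CLUSTER))

    def aio_write(self, offset, data, cb):
        self.image[offset] = data
        cb()

    def start(self):             # bdrv_aio_stream()
        if self.offset < self.size:
            self.aio_read(self.offset, self.stream_2)

    def stream_2(self, data):
        if data.count(0) == len(data):   # all zeros: nothing to copy
            self.offset += CLUSTER
            self.start()
        else:
            self.aio_write(self.offset, data, self.stream_4)

    def stream_4(self):
        self.offset += CLUSTER
        self.start()

backing = {0: b"x" * CLUSTER, 2 * CLUSTER: b"y" * CLUSTER}
job = StreamJob(backing, size=4 * CLUSTER)
job.start()
```

After `start()` returns, only the two non-zero clusters have been written into the local image; the all-zero ones were skipped rather than materialized.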
One of the things Stefan has mentioned is that a lot of the QED code could be reused by other formats. All formats implement things like CoW on their own today but if you exposed interfaces like bdrv_aio_find_free_cluster(), you could actually implement a lot more in the generic block layer. So, I agree with you in principle that this all should be common code. I think it's a larger effort though. Not that large I think; and it will make commit async as a side effect. The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format. bdrv_is_allocated() already exists (and is needed for commit), what else is needed? cluster size? Synchronous implementations are not reusable to implement asynchronous anything. Surely this is easy to fix, at least for qed. What we need is thread infrastructure that allows us to convert between the two methods. But you need the code to be cluster aware too. Yes, another variable in BlockDriverState. Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros. You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros. My QED branch now does that FWIW. At the moment, it only detects zero reads to unallocated clusters and writes a special zero cluster marker. However, the detection code is in the generic path so once the fsck() logic is working, we can implement a free list in QED. In QED, the detection code needs to have a lot of knowledge about cluster boundaries and the format of the device. In principle, this should be common code but it's not for the same reason copy-on-write is not common code today. Parts of it are: commit. Of course, that's horribly synchronous. 
-- error compiling committee.c: too many arguments to function
[Qemu-devel] Re: [PATCH v2 2/9] pcie: helper functions for pcie extended capability.
On Wed, Sep 08, 2010 at 04:39:35PM +0900, Isaku Yamahata wrote:

+#define PCI_EXP_SLTCTL_AIC_SHIFT    6
+#define PCI_EXP_SLTCTL_AIC_ON       (PCI_EXP_HP_IND_ON << PCI_EXP_SLTCTL_AIC_SHIFT)
+#define PCI_EXP_SLTCTL_AIC_BLINK    (PCI_EXP_HP_IND_BLINK << PCI_EXP_SLTCTL_AIC_SHIFT)
+#define PCI_EXP_SLTCTL_AIC_OFF      (PCI_EXP_HP_IND_OFF << PCI_EXP_SLTCTL_AIC_SHIFT)
+
+#define PCI_EXP_SLTCTL_PIC_SHIFT    8
+#define PCI_EXP_SLTCTL_PIC_ON       (PCI_EXP_HP_IND_ON << PCI_EXP_SLTCTL_PIC_SHIFT)
+#define PCI_EXP_SLTCTL_PIC_BLINK    (PCI_EXP_HP_IND_BLINK << PCI_EXP_SLTCTL_PIC_SHIFT)
+#define PCI_EXP_SLTCTL_PIC_OFF      (PCI_EXP_HP_IND_OFF << PCI_EXP_SLTCTL_PIC_SHIFT)

It might be better to simply define the 6 macros we are using directly. The duplication here is minimal, and I guess it will be easier to get them into Linux this way. -- MST
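For context on what these macros encode: the Slot Control register packs the Attention Indicator Control into bits 7:6 and the Power Indicator Control into bits 9:8, with indicator values 1=on, 2=blink, 3=off per the PCIe spec. A quick sketch of the encode/decode (treat the bit positions and values as spec-derived assumptions, not copied from this patch):

```python
# Indicator values from the PCIe spec for hot-plug indicator control.
PCI_EXP_HP_IND_ON, PCI_EXP_HP_IND_BLINK, PCI_EXP_HP_IND_OFF = 1, 2, 3

PCI_EXP_SLTCTL_AIC_SHIFT = 6   # Attention Indicator Control, bits 7:6
PCI_EXP_SLTCTL_PIC_SHIFT = 8   # Power Indicator Control, bits 9:8

PCI_EXP_SLTCTL_AIC_ON = PCI_EXP_HP_IND_ON << PCI_EXP_SLTCTL_AIC_SHIFT
PCI_EXP_SLTCTL_PIC_OFF = PCI_EXP_HP_IND_OFF << PCI_EXP_SLTCTL_PIC_SHIFT

# Compose a register value, then decode the two 2-bit fields back out.
sltctl = PCI_EXP_SLTCTL_AIC_ON | PCI_EXP_SLTCTL_PIC_OFF
aic = (sltctl >> PCI_EXP_SLTCTL_AIC_SHIFT) & 0x3
pic = (sltctl >> PCI_EXP_SLTCTL_PIC_SHIFT) & 0x3
```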
Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
On 09/12/2010 03:06 PM, Anthony Liguori wrote: Backing files and logical size shouldn't change during live migration. Why not? But even so, I think the interface makes sense. It's basically: drop anything you have cached that may change during migration. What needs to be read is dependent on features/format. Sure. Certainly as an incremental step. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Stable patch tracking
On 09/12/2010 03:04 PM, Anthony Liguori wrote: Can the captcha for logged in users be removed? Can you be more specific about what you're seeing? There should only be a captcha for creating an account. I've never seen it since I created my account. I updated the page (logged in as avi) and it asked me for a captcha. I don't recall anything else. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
On 09/12/2010 03:08 PM, Anthony Liguori wrote: This can cause a disk read, no? Shouldn't it be made asynchronous? Yes, it should. I'm not sure there's an easy way to make it asynchronous though not because of the block layer but because of how these functions are called. Sorry to harp on the subject, but that's the standard problem with state machines. Every time you want to do a blocking operation in a function, you have to put all its locals in some structure, split the function into two, do some scheduling, etc. Or just move it to just before the guest starts? We don't really have a notion of guest starts today although maybe we should. Wasn't there some qdev callback that represents this? Faint memory from the reset thread. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Stable patch tracking
On 09/12/2010 03:42 PM, Avi Kivity wrote: On 09/12/2010 03:04 PM, Anthony Liguori wrote: Can the captcha for logged in users be removed? Can you be more specific about what you're seeing? There should only be a captcha for creating an account. I've never seen it since I created my account. I updated the page (logged in as avi) and it asked me for a captcha. I don't recall anything else. I did another edit (to my user page) and it didn't this time. Maybe it only happens on the first edit? -- error compiling committee.c: too many arguments to function
[Qemu-devel] Copy, Cut, Paste - Files/Text
Hi there! I've been using qemu for quite a while and was searching for clipboard sharing between guests and host. I found some info about copy/paste using VNC. I would appreciate it if somebody can help me find out whether the following things are supported directly by Qemu: 1) Copy/paste files between host and guests. 2) Drag and drop files from/to VMs. 3) Clipboard sharing to copy/paste text across VMs (host and guest(s)). Thanks, Dhananjay
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On 09/12/2010 04:30 PM, Joerg Roedel wrote: Either way, I don't think we need a phenom2 type. The additional features are minor enough to not really matter and all use cases I can come up with require either -cpu host (local virt) or -cpu phenom (migration). I'm fine with this (or with adding phenom2). But don't make phenom contain flags that real phenoms don't have. How about features that are not supported by the hardware but can be supported in emulation? The VMCBCLEAN feature is one of those which makes a lot of sense to reduce the emulated world-switch times. I guess it's ok to enable those with -cpu host? It's a good question. We do that with x2apic, so yes. Userspace can do four things with KVM_GET_SUPPORTED_CPUID:

- feed it right back to KVM_SET_CPUID2, getting maximum features and hopefully performance
- use it to mask the real cpuid which it fetches directly, getting something as similar as possible to the host
- use it to mask a predefined cpu type, getting something as similar as possible to that cpu type
- use it to verify that a predefined cpu type is supported, getting something exactly the same as that cpu type

-cpu host is the first option. If the need comes for the second option, we can provide it. I'll note in the documentation that KVM_GET_SUPPORTED_CPUID can return values not present in the host cpuid. -- error compiling committee.c: too many arguments to function
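The third and fourth options above are just bitmask operations over feature words. A minimal sketch, with the SVM feature bit positions taken from the AMD manual (CPUID Fn8000_000A EDX) as an assumption, and `mask_model`/`model_fully_supported` being invented names:

```python
# Assumed SVM feature bits in CPUID 8000_000A EDX (per the AMD manual):
SVM_NPT       = 1 << 0
SVM_LBRV      = 1 << 1
SVM_NRIPSAVE  = 1 << 3
SVM_VMCBCLEAN = 1 << 5

def mask_model(kvm_supported, model_features):
    """Option 3: mask a predefined cpu model with what KVM reports,
    yielding something as similar as possible to that model."""
    return model_features & kvm_supported

def model_fully_supported(kvm_supported, model_features):
    """Option 4: accept the model only if every feature it advertises
    is available, so the guest sees exactly that cpu type."""
    return model_features & ~kvm_supported == 0

# KVM may report features the host lacks (e.g. emulated VMCBCLEAN).
kvm = SVM_NPT | SVM_NRIPSAVE | SVM_VMCBCLEAN
phenom = SVM_NPT | SVM_LBRV
masked = mask_model(kvm, phenom)            # NPT survives, LBRV is dropped
exact = model_fully_supported(kvm, phenom)  # False: LBRV is missing
```

The difference matters for migration: option 3 silently degrades the model, option 4 refuses to start so the guest-visible cpuid is stable across hosts.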
Re: [Qemu-devel] [PATCH 2/2] qemu-kvm: Add svm cpuid features
On Sun, Sep 12, 2010 at 10:01:28AM +0200, Avi Kivity wrote: On 09/12/2010 09:16 AM, Alexander Graf wrote: Either way, I don't think we need a phenom2 type. The additional features are minor enough to not really matter and all use cases I can come up with require either -cpu host (local virt) or -cpu phenom (migration). I'm fine with this (or with adding phenom2). But don't make phenom contain flags that real phenoms don't have. How about features that are not supported by the hardware but can be supported in emulation? The VMCBCLEAN feature is one of those which makes a lot of sense to reduce the emulated world-switch times. I guess it's ok to enable those with -cpu host? Joerg
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
On 09/12/2010 08:24 AM, Avi Kivity wrote: Not atexit, just when we close the image. Just a detail, but we need an atexit() handler to make sure block devices get closed because we have too many exit()s in the code today. Right. So when you click the 'X' on the qemu window, we get to wait a few seconds for it to actually disappear because it's flushing metadata to disk.. I've started something and will post it soon. Excellent, thank you. When considering development time, also consider the time it will take users to actually use qed (6 months for qemu release users, ~9 months on average for semiannual community distro releases, 12-18 months for enterprise distros). Consider also that we still have to support qcow2 since people do use the extra features, and since I don't see us forcing them to migrate. I'm of the opinion that qcow2 is unfit for production use for the type of production environments I care about. The amount of changes needed to make qcow2 fit for production use put it on at least the same timeline as you cite above. Yes, there are people today for whom qcow2 is appropriate but by the same token, it will continue to be appropriate for them in the future. In my view, we don't have an image format fit for production use. You're arguing we should make qcow2 fit for production use whereas I am arguing we should start from scratch. My reasoning for starting from scratch is that it simplifies the problem. Your reasoning for improving qcow2 is simplifying the transition for non-production users of qcow2. We have an existence proof that we can achieve good data integrity and good performance by simplifying the problem. The burden still is establishing that it's possible to improve qcow2 in a reasonable amount of effort. NB, you could use qcow2 today if you had all of the data integrity fixes or didn't care about data integrity in the event of power failure or didn't care about performance.
I don't have any customers that fit that bill so from my perspective, qcow2 isn't production fit. That doesn't mean that it's not fit for someone else's production use. I realize it's somewhat subjective though. While qed looks like a good start, it has at least three flaws already (relying on physical image size, relying on fsck, and limited logical image size). Just fixing those will introduce complication. What about new features or newly discovered flaws? Let's quantify fsck. My suspicion is that if you've got the storage for 1TB disk images, it's fast enough that fsck cannot be that bad. Keep in mind, we don't have to completely pause the guest while fsck'ing. We simply have to prevent cluster allocations. We can allow reads and we can allow writes to allocated clusters. Consequently, if you had a 1TB disk image, it's extremely likely that the vast majority of I/O is just to allocated clusters which means that fsck() is entirely a background task. The worst case scenario is actually a half-allocated disk. But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable. Maybe I'm broken with respect to how I think, but I find state machines very easy to rationalize. Your father's state machine. Not as clumsy or random as a thread; an elegant weapon for a more civilized age. I find your lack of faith in QED disturbing. To me, the biggest burden in qcow2 is thinking through how you deal with shared resources. Because you can block for a long period of time during write operations, it's not enough to just carry a mutex during all metadata operations. You have to stage operations and commit them at very specific points in time.
The standard way of dealing with this is to have a hash table for metadata that contains a local mutex:

l2cache = defaultdict(L2)

def get_l2(pos):
    l2 = l2cache[pos]
    l2.mutex.lock()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.unlock()

You're missing how you create entries. That means you've got to do:

def put_l2(l2):
    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()

And this really illustrates my point. It's a harder problem than it seems. You also are keeping l2 reads from occurring when flushing a dirty l2 entry which is less parallel than what qed achieves today. This is part of why I prefer state machines. Acquiring a mutex is too easy and it makes it easy to not think through what all could be running. When you are more explicit about when you are allowing concurrency, I think it's easier to be more aggressive. It's a
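A runnable rendering of the per-entry-mutex cache being debated here, using Python's threading purely for illustration (the `Mutex` wrapper exists only so the code can mirror the lock()/unlock() pseudocode; the read/write methods are stand-ins; a lock around the dict is added because a shared defaultdict needs one):

```python
import threading
from collections import defaultdict

class Mutex:
    """Thin wrapper so the code can mirror the lock()/unlock() pseudocode."""
    def __init__(self):
        self._lock = threading.Lock()
    def lock(self):
        self._lock.acquire()
    def unlock(self):
        self._lock.release()

class L2:
    """One cached L2 table with its own mutex."""
    def __init__(self):
        self.mutex = Mutex()
        self.valid = False
        self.dirty = False
        self.pos = None
        self.data = None
    def read(self):
        self.data = [0] * 512   # stand-in for a disk read
    def write(self):
        pass                    # stand-in for a disk write

l2cache = defaultdict(L2)
cache_lock = threading.Lock()   # guards the dict itself

def get_l2(pos):
    with cache_lock:
        l2 = l2cache[pos]
    l2.mutex.lock()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.unlock()

def worker():
    l2 = get_l2(0)
    l2.data[0] += 1             # update under the entry's mutex
    l2.dirty = True
    put_l2(l2)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The per-entry mutex serializes the eight updates (the counter ends at 8), which also demonstrates Anthony's objection: a reader of the same entry blocks for the whole duration of a holder's write-out.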
Re: [Qemu-devel] Stable patch tracking
On 09/12/2010 08:42 AM, Avi Kivity wrote: On 09/12/2010 03:04 PM, Anthony Liguori wrote: Can the captcha for logged in users be removed? Can you be more specific about what you're seeing? There should only be a captcha for creating an account. I've never seen it since I created my account. I updated the page (logged in as avi) and it asked me for a captcha. I don't recall anything else. I assume it was something weird with cookies and/or browser caching. Let me know if it happens again. Edits by anonymous users require a captcha to prevent spam, so something probably thought you were still an anonymous user even though you had logged in. Regards, Anthony Liguori
[Qemu-devel] [PATCH] trace: Fix user emulator dependency on trace objects
On a clean build, after generating trace.h, make would recurse into *-*-user without a clue how to build ../trace.o (added to $(obj-y) in Makefile.target) since its generation rule is in the main Makefile. The softmmus are seemingly unaffected because the $(TOOLS), which each have a dependency on $(trace-obj-y), are built first for the build-all target. Add a dependency on $(trace-obj-y) for %-user, as done for the qemu-* tools. Let's be paranoid and do the same for %-softmmu while at it, just in case someone messes with $(TOOLS) or calls the Makefile target directly. Signed-off-by: Andreas Färber andreas.faer...@web.de Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com Cc: Prerna Saxena pre...@linux.vnet.ibm.com Cc: Blue Swirl blauwir...@gmail.com Cc: Anthony Liguori aligu...@us.ibm.com --- Makefile | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index ab91d42..090d632 100644
--- a/Makefile
+++ b/Makefile
@@ -80,9 +80,9 @@ include $(SRC_PATH)/Makefile.objs
 endif
 
 $(common-obj-y): $(GENERATED_HEADERS)
-$(filter %-softmmu,$(SUBDIR_RULES)): $(common-obj-y) subdir-libdis
+$(filter %-softmmu,$(SUBDIR_RULES)): $(trace-obj-y) $(common-obj-y) subdir-libdis
 
-$(filter %-user,$(SUBDIR_RULES)): $(GENERATED_HEADERS) subdir-libdis-user subdir-libuser
+$(filter %-user,$(SUBDIR_RULES)): $(GENERATED_HEADERS) $(trace-obj-y) subdir-libdis-user subdir-libuser
 
 ROMSUBDIR_RULES=$(patsubst %,romsubdir-%, $(ROMS))
 romsubdir-%:
-- 
1.7.2.2
Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
On 09/12/2010 08:40 AM, Avi Kivity wrote: Why would it serialize all I/O operations? It's just like another vcpu issuing reads. Because the block layer isn't re-entrant. What you basically do is:

stream_step_three():
    complete()

stream_step_two(offset, length):
    bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)

Isn't there a write() missing somewhere? Streaming relies on copy-on-read to do the writing. And that's exactly what the current code looks like. The only change to the patch that this does is make some of qed's internals be block layer interfaces. Why do you need find_free_cluster()? That's a physical offset thing. Just write to the same logical offset. IOW:

bdrv_aio_stream():
    bdrv_aio_read(offset, stream_2)

It's an optimization. If you've got a fully missing L1 entry, then you're going to memset() 2GB worth of zeros. That's just wasted work. With a 1TB image with a 1GB allocation, it's a huge amount of wasted work.

stream_2():
    if all zeros:
        increment offset
        if more:
            bdrv_aio_stream()
    bdrv_aio_write(offset, stream_3)

stream_3():
    bdrv_aio_write(offset, stream_4)

I don't understand why stream_3() is needed.

stream_4():
    increment offset
    if more:
        bdrv_aio_stream()

Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that. http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros. You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros. My QED branch now does that FWIW.
At the moment, it only detects zero reads to unallocated clusters and writes a special zero cluster marker. However, the detection code is in the generic path so once the fsck() logic is working, we can implement a free list in QED. In QED, the detection code needs to have a lot of knowledge about cluster boundaries and the format of the device. In principle, this should be common code but it's not for the same reason copy-on-write is not common code today. Parts of it are: commit. Of course, that's horribly synchronous. If you've got AIO internally, making commit work is pretty easy. Doing asynchronous commit at a generic layer is not easy though unless you expose lots of details. Generally, I think the block layer makes more sense if the interface to the formats is high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux. So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code. Regards, Anthony Liguori
[Qemu-devel] Re: [PATCH 1/3] pci: sorting out type confusion in pci_register_bar().
On Thu, Sep 09, 2010 at 11:48:55AM +0900, Isaku Yamahata wrote: This patch sorts out invalid use of pcibus_t. In pci_register_bar(), pcibus_t wmask is used. It should, however, be uint64_t because it is used to set a pci configuration space value (PCIDevice::wmask) by pci_set_quad() or pci_set_long(). So make its type uint64_t and cast from pcibus_t to uint64_t where necessary. I don't think we need the cast. Applied without the cast. Maybe for strict type safety conversion functions would be desirable. Something like pcibus_to_uintN(), uintN_to_pcibus(). But conversions are done in only a few places, so it wouldn't be worthwhile. Signed-off-by: Isaku Yamahata yamah...@valinux.co.jp --- hw/pci.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index f03b83e..8d6b299 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -763,7 +763,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 {
     PCIIORegion *r;
     uint32_t addr;
-    pcibus_t wmask;
+    uint64_t wmask;
 
     if ((unsigned int)region_num >= PCI_NUM_REGIONS)
         return;
@@ -781,7 +781,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
     r->type = type;
     r->map_func = map_func;
 
-    wmask = ~(size - 1);
+    wmask = (uint64_t)(~(size - 1));
     addr = pci_bar(pci_dev, region_num);
     if (region_num == PCI_ROM_SLOT) {
         /* ROM enable bit is writeable */
-- 
1.7.1.1
Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
On 09/12/2010 08:28 AM, Avi Kivity wrote: On 09/12/2010 03:06 PM, Anthony Liguori wrote: Backing files and logical size shouldn't change during live migration. Why not? To make our lives easier. Regards, Anthony Liguori But even so, I think the interface makes sense. It's basically, drop anything you have cached that may change during migration. What needs to be read is dependent on features/format. Sure. Certainly as an incremental step.
[Qemu-devel] Re: [PATCH 2/3] pci: don't ignore invalid parameter for pci_register_bar().
On Thu, Sep 09, 2010 at 11:48:56AM +0900, Isaku Yamahata wrote: Abort when an invalid value for region_num is passed to pci_register_bar. That is a caller's bug. Abort instead of silently ignoring the invalid value. Signed-off-by: Isaku Yamahata yamah...@valinux.co.jp --- hw/pci.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) Applied. diff --git a/hw/pci.c b/hw/pci.c index 8d6b299..31eba9a 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -765,9 +765,8 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num, uint32_t addr; uint64_t wmask; -if ((unsigned int)region_num >= PCI_NUM_REGIONS) -return; - +assert(region_num >= 0); +assert(region_num < PCI_NUM_REGIONS); if (size & (size-1)) { fprintf(stderr, "ERROR: PCI region size must be pow2 type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size); -- 1.7.1.1
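The size & (size-1) test in the error path above is the usual power-of-two check; as a sketch:

```python
def is_pow2(size: int) -> bool:
    # A power of two has exactly one bit set, so clearing the lowest
    # set bit (size & (size - 1)) must leave zero.
    return size > 0 and (size & (size - 1)) == 0

assert is_pow2(0x1000)        # 4 KiB region: accepted
assert not is_pow2(0x1800)    # 6 KiB region: rejected with the error above
```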
[Qemu-devel] Re: [PATCH 3/3] pci: improve signature of pci_register_bar().
On Thu, Sep 09, 2010 at 11:48:57AM +0900, Isaku Yamahata wrote: Make type uint8_t from int because PCIIORegion::type is uint8_t. Signed-off-by: Isaku Yamahata yamah...@valinux.co.jp --- hw/pci.c |2 +- hw/pci.h |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Applied, thanks. diff --git a/hw/pci.c b/hw/pci.c index 31eba9a..ceee291 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -758,7 +758,7 @@ static int pci_unregister_device(DeviceState *dev) } void pci_register_bar(PCIDevice *pci_dev, int region_num, -pcibus_t size, int type, +pcibus_t size, uint8_t type, PCIMapIORegionFunc *map_func) { PCIIORegion *r; diff --git a/hw/pci.h b/hw/pci.h index bccab3a..25179eb 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -192,7 +192,7 @@ PCIDevice *pci_register_device(PCIBus *bus, const char *name, PCIConfigWriteFunc *config_write); void pci_register_bar(PCIDevice *pci_dev, int region_num, -pcibus_t size, int type, +pcibus_t size, uint8_t type, PCIMapIORegionFunc *map_func); int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, -- 1.7.1.1
Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
On 09/12/2010 08:26 AM, Avi Kivity wrote: On 09/12/2010 03:08 PM, Anthony Liguori wrote: This can cause a disk read, no? Shouldn't it be made asynchronous? Yes, it should. I'm not sure there's an easy way to make it asynchronous though not because of the block layer but because of how these functions are called. Sorry to harp on the subject, but that's the standard problem with state machines. Every time you want to do a blocking operation in a function, you have to put all its locals in some structure, split the function into two, do some scheduling, etc. We can't block the VCPU thread for arbitrarily long periods of time. If we get a PIO operation requesting information about geometry, we can't wait for a disk read in order to satisfy that request. We need to kick off the I/O operations in the background such that the data is available before the PIO operation happens. This isn't SM vs. threads at all, this is simply about the fact that we can't do block I/O during a PIO operation. Or just move it to just before the guest starts? We don't really have a notion of guest starts today although maybe we should. Wasn't there some qdev callback that represents this? Faint memory from the reset thread. Yes, realize(). I guess that's a reasonable approach. Regards, Anthony Liguori
[Qemu-devel] [Bug 623852] Re: PPC emulation loops on booting a FreeBSD kernel
Also I can confirm that I have this problem on QEMU. I had tried booting FreeBSD8.1-ppc under QEMU (Linux x86_64 host; PPC guest) but there seems to be a problem with whatever the FreeBSD8.1 kernel does, that QEMU's PPC emulation can't handle. I am using the latest version of QEMU from GIT as of 11/9/10. The kernel starts OK then loops after Kernel entry at 0x100100 -- PPC emulation loops on booting a FreeBSD kernel https://bugs.launchpad.net/bugs/623852 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: New Bug description: Has anyone tried booting FreeBSD8.1-ppc under QEMU (Linux x86_64 host; PPC guest)? I can get Linux/PPC to run fine, and FreeBSD8.1-i386 as well; but there seems to be a problem with whatever the FreeBSD8.1 kernel does, that QEMU's PPC emulation can't handle. I am using the latest version of QEMU from GIT as of 25/8/10. I don't know how to get a git commit hash, so I can't quote it. The kernel starts OK then loops after Kernel entry at 0x100100 The command I am running is qemu-system-ppc -cdrom FreeBSD-8.1-RELEASE-powerpc-disc1.iso -hda freebsd8.1-ppc -m 94 -boot d I obtained the kernel from ftp://ftp.freebsd.org/pub/FreeBSD/releases/powerpc/ISO-IMAGES/8.1/FreeBSD-8.1-RELEASE-powerpc-disc1.iso.
[Qemu-devel] [Bug 636446] [NEW] prep ppc machine no more working
Public bug reported: I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to use PREP PPC machine the error is the same: qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS CPU #0: NIP LR CTR XER MSR HID0 HF idx 0 TB DECR GPR00 GPR04 GPR08 GPR12 GPR16 GPR20 GPR24 GPR28 CR [ - - - - - - - - ] RES FPR00 FPR04 FPR08 FPR12 FPR16 FPR20 FPR24 FPR28 FPSCR SRR0 SRR1 SDR1 ** Affects: qemu Importance: Undecided Status: New -- prep ppc machine no more working https://bugs.launchpad.net/bugs/636446 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: New Bug description: I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to use PREP PPC machine the error is the same: qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS CPU #0: NIP LR CTR XER MSR HID0 HF idx 0 TB DECR GPR00 GPR04 GPR08 GPR12 GPR16 GPR20 GPR24 GPR28 CR [ - - - - - - - - ] RES FPR00 FPR04 FPR08 FPR12 FPR16 FPR20 FPR24 FPR28 FPSCR SRR0 SRR1 SDR1
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
On 09/12/2010 05:13 PM, Anthony Liguori wrote: On 09/12/2010 08:24 AM, Avi Kivity wrote: Not atexit, just when we close the image. Just a detail, but we need an atexit() handler to make sure block devices get closed because we have too many exit()s in the code today. Right. So when you click the 'X' on the qemu window, we get to wait a few seconds for it to actually disappear because it's flushing metadata to disk. If it was doing heavy write I/O, you'll need to wait a bit (a few seconds are a few hundreds of clusters worth of metadata). If it managed to flush while you were moving your mouse, no delay. When considering development time, also consider the time it will take users to actually use qed (6 months for qemu release users, ~9 months on average for semiannual community distro releases, 12-18 months for enterprise distros). Consider also that we still have to support qcow2 since people do use the extra features, and since I don't see us forcing them to migrate. I'm of the opinion that qcow2 is unfit for production use for the type of production environments I care about. The amount of changes needed to make qcow2 fit for production use puts it on at least the same timeline as you cite above. If it's exactly the same time, we gain by having one less format. Yes, there are people today for whom qcow2 is appropriate but by the same respect, it will continue to be appropriate for them in the future. In my view, we don't have an image format fit for production use. You're arguing we should make qcow2 fit for production use whereas I am arguing we should start from scratch. My reasoning for starting from scratch is that it simplifies the problem. Your reasoning for improving qcow2 is simplifying the transition for non-production users of qcow2. We have an existence proof that we can achieve good data integrity and good performance by simplifying the problem. The burden still is establishing that it's possible to improve qcow2 in a reasonable amount of effort. Agreed.
I realize it's somewhat subjective though. While qed looks like a good start, it has at least three flaws already (relying on physical image size, relying on fsck, and limited logical image size). Just fixing those will introduce complication. What about new features or newly discovered flaws? Let's quantify fsck. My suspicion is that if you've got the storage for 1TB disk images, it's fast enough that fsck can not be so bad. It doesn't follow. The storage is likely to be shared among many guests. The image size (or how full it is) don't really matter; startup time is the aggregate number of L2s over all images starting now, divided by the number of spindles, divided by the number of IOPS each spindle provides. Since an L2 spans a lot of logical address space, it is likely that many L2s will be allocated (in fact, it makes sense to preallocate them). Keep in mind, we don't have to completely pause the guest while fsck'ing. We simply have to prevent cluster allocations. We can allow reads and we can allow writes to allocated clusters. True. Consequently, if you had a 1TB disk image, it's extremely likely that the vast majority of I/O is just to allocated clusters which means that fsck() is entirely a background task. The worst case scenario is actually a half-allocated disk. No, the worst case is 0.003% allocated disk, with the allocated clusters distributed uniformly. That means all your L2s are allocated, but almost none of your clusters are. But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable. What if it takes 300 seconds? Maybe I'm broken with respect to how I think, but I find state machines very easy to rationalize. Your father's state machine. Not as clumsy or random as a thread; an elegant weapon for a more civilized age I find your lack of faith in QED disturbing. When 900 years old you reach, state machines you will not find so easy to understand. 
To me, the biggest burden in qcow2 is thinking through how you deal with shared resources. Because you can block for a long period of time during write operations, it's not enough to just carry a mutex during all metadata operations. You have to stage operations and commit them at very specific points in time. The standard way of dealing with this is to have a hash table for metadata that contains a local mutex:

    l2cache = defaultdict(L2)

    def get_l2(pos):
        l2 = l2cache[pos]
        l2.mutex.lock()
        if not l2.valid:
            l2.pos = pos
            l2.read()
            l2.valid = True
        return l2

    def put_l2(l2):
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()

You're missing how you create entries. That means you've got to do:

    def put_l2(l2):
        if l2.committed:
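Avi's pseudocode can be fleshed out into a runnable sketch (threading.Lock in place of the per-entry mutex, with the disk I/O stubbed out; this illustrates the locking pattern, it is not qcow2 code):

```python
import threading
from collections import defaultdict

class L2:
    def __init__(self):
        self.mutex = threading.Lock()
        self.valid = False
        self.dirty = False
        self.pos = None
        self.disk_reads = 0

    def read(self):            # stand-in for reading the table from disk
        self.disk_reads += 1

    def write(self):           # stand-in for writing the table back
        pass

l2cache = defaultdict(L2)      # creating an entry is implicit, as in the sketch

def get_l2(pos):
    # The per-entry lock serializes a fill with concurrent updates of the
    # same table, so a blocking read needs no global metadata lock.
    l2 = l2cache[pos]
    l2.mutex.acquire()
    if not l2.valid:
        l2.pos = pos
        l2.read()
        l2.valid = True
    return l2

def put_l2(l2):
    if l2.dirty:
        l2.write()
        l2.dirty = False
    l2.mutex.release()
```

A second get_l2() on the same position finds the entry valid and skips the disk read, which is the caching behavior the thread is arguing about.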
Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
On 09/12/2010 05:26 PM, Anthony Liguori wrote: On 09/12/2010 08:28 AM, Avi Kivity wrote: On 09/12/2010 03:06 PM, Anthony Liguori wrote: Backing files and logical size shouldn't change during live migration. Why not? To make our lives easier. It means management needs to block volume resize while a live migration takes place. Since live migration is typically done by the system automatically, while volume resize happens in response to user request, this isn't a good idea. Both in terms of user experience, and in terms of pushing more complexity to management. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
On 09/12/2010 05:29 PM, Anthony Liguori wrote: On 09/12/2010 08:26 AM, Avi Kivity wrote: On 09/12/2010 03:08 PM, Anthony Liguori wrote: This can cause a disk read, no? Shouldn't it be made asynchronous? Yes, it should. I'm not sure there's an easy way to make it asynchronous though not because of the block layer but because of how these functions are called. Sorry to harp on the subject, but that's the standard problem with state machines. Every time you want to do a blocking operation in a function, you have to put all its locals in some structure, split the function into two, do some scheduling, etc. We can't block the VCPU thread for arbitrarily long periods of time. If we get a PIO operation requesting information about geometry, we can't wait for a disk read in order to satisfy that request. We need to kick off the I/O operations in the background such that the data is available before the PIO operation happens. This isn't SM vs. threads at all, this is simply about the fact that we can't do block I/O during a PIO operation. Isn't this an identify command, where the guest can only read the data after the interface indicates it is ready (or posts an interrupt)? I thought the guest interface was already async. The code appears to support this:

    switch(val) {
    case WIN_IDENTIFY:
        if (s->bs && s->drive_kind != IDE_CD) {
            if (s->drive_kind != IDE_CFATA)
                ide_identify(s);
            else
                ide_cfata_identify(s);
            s->status = READY_STAT | SEEK_STAT;
            ide_transfer_start(s, s->io_buffer, 512, ide_transfer_stop);
        } else {
            if (s->drive_kind == IDE_CD) {
                ide_set_signature(s);
            }
            ide_abort_command(s);
        }
        ide_set_irq(s->bus);
        break;

but I may be misinterpreting it. If I'm right, all we need to do is push this whole block to a thread. -- error compiling committee.c: too many arguments to function
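The rule Anthony states — kick the I/O off in the background and never block the VCPU thread during a PIO access — reduces to a familiar pattern; a generic illustration with a worker thread and a queue, not the IDE code itself:

```python
import threading
import queue

completions = queue.Queue()

def slow_read():
    # Stands in for the blocking disk read that must never run
    # in the VCPU thread.
    return b"identify-data"

def kick_off_read():
    # The "VCPU thread" returns immediately; the completion is
    # delivered later (where the device model would raise its IRQ
    # and mark the transfer buffer ready).
    worker = threading.Thread(target=lambda: completions.put(slow_read()),
                              daemon=True)
    worker.start()

kick_off_read()
data = completions.get(timeout=5)
assert data == b"identify-data"
```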
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
On 09/12/2010 10:56 AM, Avi Kivity wrote: No, the worst case is 0.003% allocated disk, with the allocated clusters distributed uniformly. That means all your L2s are allocated, but almost none of your clusters are. But in this case, you're so sparse that your metadata is pretty much co-located which means seek performance won't matter much. But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable. What if it takes 300 seconds? That means for a 1TB disk you're taking 500ms per L2 entry, you're fully allocated and yet still doing an fsck. That seems awfully unlikely.

    if l2.committed:
        if l2.dirty:
            l2.write()
            l2.dirty = False
        l2.mutex.unlock()
    else:
        l2.mutex.lock()
        l2cache[l2.pos] = l2
        l2.mutex.unlock()

The in-memory L2 is created by defaultdict(). I did omit linking L2 into L1, but that's a function call. With a state machine, it's a new string of states and calls. But you have to write the L2 to disk first before you link it so it's not purely in memory. It's far easier to just avoid internal snapshots altogether and this is exactly the thought process that led to QED. Once you drop support for internal snapshots, you can dramatically simplify. The amount of metadata is O(nb_L2 * nb_snapshots). For qed, nb_snapshots = 1 but nb_L2 can be still quite large. If fsck is too long for one, it is too long for the other. nb_L2 is very small. It's exactly n / 2GB + 1 where n is image size. Since image size is typically 100GB, practically speaking it's less than 50. OTOH, nb_snapshots in qcow2 can be very large. In fact, it's not unrealistic for nb_snapshots to be 50. What that means is that instead of metadata being O(n) as it is today, it's at least O(n^2). Why is it n^2? It's still n*m. If your image is 4TB instead of 100GB, the time increases by a factor of 40 for both.
It's n*m but either n ~= m in which case it's n^2 or m << n, in which case, it's just n, or m >> n in which case, it's just O(m). This is where asymptotic complexity ends up not being terribly helpful :-) Let me put this another way though, if you support internal snapshots, what's a reasonable number of snapshots to expect reasonable performance with? 10? 100? 1000? 1? Not doing qed-on-lvm is definitely a limitation. The one use case I've heard is qcow2 on top of clustered LVM as clustered LVM is simpler than a clustered filesystem. I don't know the space well enough so I need to think more about it. I don't either. If this use case survives, and if qed isn't changed to accommodate it, it means that that's another place where qed can't supplant qcow2. I'm okay with that. An image file should require a file system. If I was going to design an image file to be used on top of raw storage, I would take an entirely different approach. That spreads our efforts further. No. I don't think we should be in the business of designing on top of raw storage. Either assume fixed partitions, LVM, or a file system. We shouldn't reinvent the wheel at every opportunity (just the carefully chosen opportunities). Refcount table. See above discussion for my thoughts on refcount table. Ok. It boils down to is fsck on startup acceptable. Without a freelist, you need fsck for both unclean shutdown and for UNMAP. To rebuild the free list on unclean shutdown. If you have an on-disk compact freelist, you don't need that fsck. If you have an on-disk compact [consistent] freelist, you don't need that fsck. Consistency is the key point. We go out of our way to avoid a consistent freelist in QED because it's the path to best performance. The key goal for a file format should be to have exactly as much consistency as required and not one bit more as consistency always means worse performance. On the other hand, allocating a cluster in qcow2 as it is now requires scanning the refcount table.
Not very pretty. Kevin, how does that perform? (an aside: with cache!=none we're bouncing in the kernel as well; we really need to make it work for cache=none, perhaps use O_DIRECT for data and writeback for metadata and shared backing images). QED achieves zero-copy with cache=none today. In fact, our performance testing that we'll publish RSN is exclusively with cache=none. In this case, preallocation should really be cheap, since there isn't a ton of dirty data that needs to be flushed. You issue an extra flush once in a while so your truncate (or physical image size in the header) gets to disk, but that doesn't block new writes. It makes qed/lvm work, and it replaces the need to fsck for the next allocation with the need for a background scrubber to reclaim storage (you need that anyway for UNMAP). It makes the whole thing a lot more attractive IMO. For a 1PB disk image with qcow2,
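The back-of-envelope numbers traded in this thread (about 50 L2s for a typical 100GB image, a factor of 40 going to 4TB, a few seconds of fsck for 1TB) all follow from each L2 table covering 2GB of logical space; a sketch of the arithmetic, where the 100-IOPS single-spindle figure is my assumption, not a number from the thread:

```python
L2_SPAN = 2 << 30                      # logical bytes covered by one L2 table

def nb_l2(image_bytes):
    # Number of L2 tables when the whole address space is covered
    # (the worst case for fsck): ceiling division.
    return -(-image_bytes // L2_SPAN)

def fsck_seconds(image_bytes, images=1, iops=100):
    # Worst case: every L2 read is a seek, all images on one spindle.
    return nb_l2(image_bytes) * images / iops

assert nb_l2(100 << 30) == 50          # ~50 L2s for a 100 GB image
assert nb_l2(4 << 40) == 2048          # 4 TB: ~40x the metadata
assert fsck_seconds(1 << 40) == 5.12   # 1 TB image, one spindle: ~5 s
```

Multiplying `images` out shows Avi's point about aggregate startup cost: twenty such guests sharing the spindle push the same scan past 100 seconds.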
Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
On 09/12/2010 11:06 AM, Avi Kivity wrote: On 09/12/2010 05:26 PM, Anthony Liguori wrote: On 09/12/2010 08:28 AM, Avi Kivity wrote: On 09/12/2010 03:06 PM, Anthony Liguori wrote: Backing files and logical size shouldn't change during live migration. Why not? To make our lives easier. It means management needs to block volume resize while a live migration takes place. Since live migration is typically done by the system automatically, while volume resize happens in response to user request, this isn't a good idea. Both in terms of user experience, and in terms of pushing more complexity to management. We don't do volume resize today so it's a moot point. Regards, Anthony Liguori
[Qemu-devel] [Bug 611646] Re: isa bus not working
** Changed in: qemu Status: New => Fix Committed -- isa bus not working https://bugs.launchpad.net/bugs/611646 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: Fix Committed Bug description: isa bus emulation not working anymore. Try running qemu -M isapc. It will crash with segmentation fault. This is a qemu HEAD from git on Fedora linux.
[Qemu-devel] [Bug 636446] Re: prep ppc machine no more working
** Changed in: qemu Status: New => Confirmed -- prep ppc machine no more working https://bugs.launchpad.net/bugs/636446 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: Confirmed Bug description: I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to use PREP PPC machine the error is the same: qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS CPU #0: NIP LR CTR XER MSR HID0 HF idx 0 TB DECR GPR00 GPR04 GPR08 GPR12 GPR16 GPR20 GPR24 GPR28 CR [ - - - - - - - - ] RES FPR00 FPR04 FPR08 FPR12 FPR16 FPR20 FPR24 FPR28 FPSCR SRR0 SRR1 SDR1
Re: [Qemu-devel] [PATCH 01/14] trace: Add trace-events file for declaring trace events
On Sat, Sep 11, 2010 at 10:53 PM, Andreas Färber andreas.faer...@web.de wrote: On 06.09.2010 at 17:13, Stefan Hajnoczi wrote: diff --git a/Makefile b/Makefile index f95cc2f..3c5e6a0 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ # Makefile for QEMU. -GENERATED_HEADERS = config-host.h +GENERATED_HEADERS = config-host.h trace.h ifneq ($(wildcard config-host.mak),) # Put the all: rule here so that config-host.mak can contain dependencies. @@ -104,16 +104,24 @@ ui/vnc.o: QEMU_CFLAGS += $(VNC_TLS_CFLAGS) bt-host.o: QEMU_CFLAGS += $(BLUEZ_CFLAGS) +trace.h: $(SRC_PATH)/trace-events config-host.mak + $(call quiet-command,sh $(SRC_PATH)/tracetool --$(TRACE_BACKEND) -h < $< > $@,"  GEN   $@") + +trace.c: $(SRC_PATH)/trace-events config-host.mak + $(call quiet-command,sh $(SRC_PATH)/tracetool --$(TRACE_BACKEND) -c < $< > $@,"  GEN   $@") + +trace.o: trace.c $(GENERATED_HEADERS) There's a dependency issue hidden here: The user emulators recurse into *-user with just a dependency on trace.h. The build then fails because the target Makefile does not know how to build ../trace.o since the rule is in this Makefile instead. I noticed it with ppc-haiku-user but this should be visible with linux-user on a clean build, too. Adding trace.o or $(trace-obj-y) somewhere before subdir-libuser seems to help. I'll post a patch later on if no one beats me. Your patch would be appreciated. For the record, in order to reproduce it: $ make distclean $ ./configure --target-list=i386-linux-user $ make make[1]: *** No rule to make target `../trace.o', needed by `qemu-i386'. Stop. If one of the targets does build trace.o then the build completes successfully. That's why an all-targets build works fine. Stefan
[Qemu-devel] [Bug 611640] Re: snapshot mode is broken for raw images
** Changed in: qemu Status: New => Fix Committed -- snapshot mode is broken for raw images https://bugs.launchpad.net/bugs/611640 You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. Status in QEMU: Fix Committed Bug description: The snapshot mode is not working if you use raw format for image instead of qcow2. Create a raw disk image by running dd of=xxx.img bs=1 count=0 seek=8G. Then run qemu -snapshot xxx.img. In monitor console run info block. There should be qcow2 temporary image backed by the raw image. Now run commit ide0-hd0 and after that again info block. You should see now that there is no more image, so any operations on it will fail form now on. This is for latest HEAD in git.
Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
On 09/12/2010 11:45 AM, Avi Kivity wrote: Streaming relies on copy-on-read to do the writing. Ah. You can avoid the copy-on-read implementation in the block format driver and do it completely in generic code. Copy on read takes advantage of temporal locality. You wouldn't want to stream without copy on read because you decrease your idle I/O time by not effectively caching.

    stream_4():
        increment offset
        if more:
            bdrv_aio_stream()

Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that. http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d Clever - it pushes all the synchronization into the copy-on-read implementation. But the serialization there hardly jumps out of the code. Do I understand correctly that you can only have one allocating read or write running? Cluster allocation, L2 cache allocation, or on-disk L2 allocation? You only have one on-disk L2 allocation at one time. That's just an implementation detail at the moment. An on-disk L2 allocation happens only when writing to a new cluster that requires a totally new L2 entry. Since L2s cover 2GB of logical space, it's a rare event so this turns out to be pretty reasonable for a first implementation. Parallel on-disk L2 allocations is not that difficult, it's just a future TODO. Generally, I think the block layer makes more sense if the interface to the formats is high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux. So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code.
What Linux does is have an equivalent of bdrv_generic_aio_commit() which most implementations call (or default to), and only do something if they want something special. Something like commit (or copy-on-read, or copy-on-write, or streaming) can be implemented 100% in terms of the generic functions (and indeed qcow2 backing files can be any format). Yes, what I'm really saying is that we should take the bdrv_generic_aio_commit() approach. I think we're in agreement here. Regards, Anthony Liguori
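The streaming-as-prefetch idea both sides converge on — read and discard every cluster once, letting copy-on-read do the actual writing — can be modeled in a few lines; a toy dict-backed model for illustration, not the qemu block layer:

```python
def copy_on_read(image, backing, offset):
    # Reads through to the backing file and commits the data locally,
    # so later reads of the same cluster never touch the backing file.
    if offset not in image:
        image[offset] = backing[offset]
    return image[offset]

def stream(image, backing, cluster_size, total):
    # The prefetch pass: read every cluster once, discard the result.
    # Copy-on-read does the writing as a side effect.
    for offset in range(0, total, cluster_size):
        copy_on_read(image, backing, offset)

backing = {0: b"a", 64: b"b"}   # two 64-byte clusters in the backing file
image = {}
stream(image, backing, 64, 128)
assert image == backing          # fully populated; backing can be dropped
```

The serialization against concurrent guest writes that Avi mentions is exactly what this toy model omits.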
Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
On Sun, Sep 12, 2010 at 11:26 AM, Andreas Färber 636...@bugs.launchpad.net wrote: Public bug reported: Running `LANG=C LC_ALL=C ./configure --prefix=... --install=/usr/ucb/install` on Solaris 10 amd64 results in the following errors: ./configure: bad substitution ./configure: !: not found ./configure: curl-config: not found ./configure: curl-config: not found Error: invalid trace backend Please choose a supported trace backend. What is the output of sh ./tracetool --nop --check-backend? Unfortunately it doesn't print the line numbers of the errors. It must be somewhere after the check for `install`. The first few can be resolved by running `bash ./configure ...` instead. The check if trace backend exists hardcodes `sh $source_path/tracetool ...` in configure. Replacing sh with bash makes it work. `gmake` complains Makefile:331: no file name for -include, which is a filter for *.d files. `create_config` gets the 'bad substitution' error as well. Replacing sh with bash in rules.mak works. etc. To sum it up, a) there are shell script incompatibilities with Solaris 10's /bin/sh shell, and I fixed one in 2184d75b4a6a253e8b1e002b3dbcc85c20ba6041 and now Milax's /bin/sh is happy.
[Qemu-devel] Re: [PATCH v3] Introduce qemu_madvise()
On Sun, Sep 12, 2010 at 12:55 PM, Andreas Färber andreas.faer...@web.de wrote: From: Andreas Färber afaer...@opensolaris.org vl.c has a Sun-specific hack to supply a prototype for madvise(), but the call site has apparently moved to arch_init.c. Haiku doesn't implement madvise() in favor of posix_madvise(). OpenBSD and Solaris 10 don't implement posix_madvise() but madvise(). Check for madvise() and posix_madvise() in configure and supply qemu_madvise() as wrapper. Prefer madvise() over posix_madvise() due to flag availability. Convert all callers to use qemu_madvise() and QEMU_MADV_*. No functional change except for arch_init.c:ram_load() now potentially falling back to posix_madvise() or no-op in lack of both. v2 -> v3: * Reuse the *_MADV_* defines for QEMU_MADV_*. Suggested by Alexander Graf. * Add configure check for madvise(), too. Add defines to Makefile, not QEMU_CFLAGS. Convert all callers, untested. Suggested by Blue Swirl. * Keep Solaris' madvise() prototype around. Pointed out by Alexander Graf. v1 -> v2: * Don't rely on posix_madvise() availability, add qemu_madvise(). Suggested by Blue Swirl.
Signed-off-by: Andreas Färber afaer...@opensolaris.org Cc: Blue Swirl blauwir...@gmail.com Cc: Alexander Graf ag...@suse.de --- arch_init.c | 2 +- configure | 33 + exec.c | 8 hw/virtio-balloon.c | 4 ++-- kvm-all.c | 6 +++--- osdep.c | 15 +++ osdep.h | 25 + vl.c | 3 --- 8 files changed, 83 insertions(+), 13 deletions(-) diff --git a/arch_init.c b/arch_init.c index e468c0c..a910033 100644 --- a/arch_init.c +++ b/arch_init.c @@ -396,7 +396,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id) #ifndef _WIN32 if (ch == 0 && (!kvm_enabled() || kvm_has_sync_mmu())) { - madvise(host, TARGET_PAGE_SIZE, MADV_DONTNEED); + qemu_madvise(host, TARGET_PAGE_SIZE, QEMU_MADV_DONTNEED); } #endif } else if (flags & RAM_SAVE_FLAG_PAGE) { diff --git a/configure b/configure index 4061cb7..86558eb 100755 --- a/configure +++ b/configure @@ -2069,6 +2069,31 @@ if compile_prog ; then fi ## +# check if we have madvise + +madvise=no +cat > $TMPC <<EOF +#include <sys/types.h> +#include <sys/mman.h> +int main(void) { return madvise(NULL, 0, MADV_DONTNEED); } +EOF +if compile_prog ; then + madvise=yes +fi + +## +# check if we have posix_madvise + +posix_madvise=no +cat > $TMPC <<EOF +#include <sys/mman.h> +int main(void) { return posix_madvise(NULL, 0, POSIX_MADV_DONTNEED); } +EOF +if compile_prog ; then + posix_madvise=yes +fi + +## # check if trace backend exists sh $source_path/tracetool --$trace_backend --check-backend > /dev/null 2> /dev/null @@ -2226,6 +2251,8 @@ echo KVM support $kvm echo fdt support $fdt echo preadv support $preadv echo fdatasync $fdatasync +echo madvise $madvise +echo posix_madvise $posix_madvise echo uuid support $uuid echo vhost-net support $vhost_net echo Trace backend $trace_backend @@ -2466,6 +2493,12 @@ fi if test $fdatasync = yes ; then echo CONFIG_FDATASYNC=y >> $config_host_mak fi +if test $madvise = yes ; then + echo CONFIG_MADVISE=y >> $config_host_mak +fi +if test $posix_madvise = yes ; then + echo CONFIG_POSIX_MADVISE=y >> $config_host_mak +fi # XXX: suppress that if [ $bsd = yes ]
; then diff --git a/exec.c b/exec.c index 380dab5..b1fe3e9 100644 --- a/exec.c +++ b/exec.c @@ -2841,8 +2841,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name, new_block->host = file_ram_alloc(new_block, size, mem_path); if (!new_block->host) { new_block->host = qemu_vmalloc(size); -#ifdef MADV_MERGEABLE - madvise(new_block->host, size, MADV_MERGEABLE); +#ifdef QEMU_MADV_MERGEABLE I'd like to avoid these #ifdefs. How about always #defining QEMU_MADV_MERGEABLE? QEMU_MADV_* values could be synthetic and not rely on MADV_*. + qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE); #endif } #else @@ -2858,8 +2858,8 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name, #else new_block->host = qemu_vmalloc(size); #endif -#ifdef MADV_MERGEABLE - madvise(new_block->host, size, MADV_MERGEABLE); +#ifdef QEMU_MADV_MERGEABLE + qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE); #endif } } diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c index 9fe3886..1e74674 100644 --- a/hw/virtio-balloon.c +++ b/hw/virtio-balloon.c @@ -51,8 +51,8 @@ static void
Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
On 09/12/2010 07:19 PM, Anthony Liguori wrote: On 09/12/2010 11:45 AM, Avi Kivity wrote: Streaming relies on copy-on-read to do the writing. Ah. You can avoid the copy-on-read implementation in the block format driver and do it completely in generic code. Copy on read takes advantage of temporal locality. You wouldn't want to stream without copy on read because you decrease your idle I/O time by not effectively caching. I meant, implement copy-on-read in generic code side by side with streaming. Streaming becomes just a prefetch operation (read and discard) which lets copy-on-read do the rest. This is essentially your implementation, yes? stream_4(): increment offset if more: bdrv_aio_stream() Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that. http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a04a105f4cb7637?hp=34ed425e7dd39c511bc247d1ab900e19b8c74a5d Clever - it pushes all the synchronization into the copy-on-read implementation. But the serialization there hardly jumps out of the code. Do I understand correctly that you can only have one allocating read or write running? Cluster allocation, L2 cache allocation, or on-disk L2 allocation? You only have one on-disk L2 allocation at one time. That's just an implementation detail at the moment. An on-disk L2 allocation happens only when writing to a new cluster that requires a totally new L2 entry. Since L2s cover 2GB of logical space, it's a rare event so this turns out to be pretty reasonable for a first implementation. Parallel on-disk L2 allocations is not that difficult, it's just a future TODO. Really, you can just preallocate all L2s. Most filesystems will touch all of them very soon. qcow2 might save some space for snapshots which share L2s (doubtful) or for 4k clusters (historical) but for qed with 64k clusters, it doesn't save any space. Linear L2s will also make your fsck *much* quicker. 
Size is .01% of logical image size. 1MB for a 10GB guest, by the time you install something on it that's a drop in the bucket. If you install a guest on a 100GB disk, what percentage of L2s are allocated? Generally, I think the block layer makes more sense if the interface to the formats is high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux. So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code. What Linux does is have an equivalent of bdrv_generic_aio_commit() which most implementations call (or default to), and only do something if they want something special. Something like commit (or copy-on-read, or copy-on-write, or streaming) can be implemented 100% in terms of the generic functions (and indeed qcow2 backing files can be any format). Yes, what I'm really saying is that we should take the bdrv_generic_aio_commit() approach. I think we're in agreement here. Strange feeling. -- error compiling committee.c: too many arguments to function
[Qemu-devel] [Bug 636446] Re: prep ppc machine no more working
This is a known issue, nobody has volunteered to update PReP emulation yet. The last partially working version was v0.9.1, using an OpenHackWare ROM file. Depending on why you want to emulate a PReP machine, you can either try supplying some proprietary ROM file, or contribute PReP support to OpenBIOS [1]. [1] http://lists.openbios.org/mailman/listinfo/openbios -- prep ppc machine no more working https://bugs.launchpad.net/bugs/636446 You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. Status in QEMU: Confirmed Bug description: I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if i want to use PREP PPC machine the error is the same: qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS [CPU #0 register dump follows: NIP, LR, CTR, XER, MSR, HID0, TB, DECR, GPR00-GPR31, CR, FPR00-FPR31, FPSCR, SRR0, SRR1, SDR1; the register values were lost in extraction]
Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
Am 12.09.2010 um 19:22 schrieb Blue Swirl: On Sun, Sep 12, 2010 at 11:26 AM, Andreas Färber 636...@bugs.launchpad.net wrote: Error: invalid trace backend Please choose a supported trace backend. What is the output of sh ./tracetool --nop --check-backend? ./tracetool: syntax error at line 51: `$' unexpected
Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
On Sun, Sep 12, 2010 at 5:35 PM, Andreas Färber andreas.faer...@web.de wrote: Am 12.09.2010 um 19:22 schrieb Blue Swirl: On Sun, Sep 12, 2010 at 11:26 AM, Andreas Färber 636...@bugs.launchpad.net wrote: Error: invalid trace backend Please choose a supported trace backend. What is the output of sh ./tracetool --nop --check-backend? ./tracetool: syntax error at line 51: `$' unexpected Does this patch fix the problem?

diff --git a/tracetool b/tracetool
index 534cc70..c7582bf 100755
--- a/tracetool
+++ b/tracetool
@@ -48,7 +48,8 @@ get_argnames()
 {
     local nfields field name
     nfields=0
-    for field in $(get_args $1); do
+    args=`get_args $1`
+    for field in $args; do
         nfields=$((nfields + 1))

         # Drop pointer star
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
On 09/12/2010 07:09 PM, Anthony Liguori wrote: On 09/12/2010 10:56 AM, Avi Kivity wrote: No, the worst case is 0.003% allocated disk, with the allocated clusters distributed uniformly. That means all your L2s are allocated, but almost none of your clusters are. But in this case, you're so sparse that your metadata is pretty much co-located which means seek performance won't matter much. You still get the rotational delay. But yes, the hit is reduced. But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable. What if it takes 300 seconds? That means for a 1TB disk you're taking 500ms per L2 entry, you're fully allocated and yet still doing an fsck. That seems awfully unlikely. I meant for a fully populated L1. That's 10ms per L2. But since that's 64TB, that's unlikely too. It can still take 10s for a 2TB disk.

  if l2.committed:
      if l2.dirty:
          l2.write()
          l2.dirty = False
      l2.mutex.unlock()
  else:
      l2.mutex.lock()
      l2cache[l2.pos] = l2
      l2.mutex.unlock()

The in-memory L2 is created by defaultdict(). I did omit linking L2 into L1, but that's a function call. With a state machine, it's a new string of states and calls. But you have to write the L2 to disk first before you link it so it's not purely in memory. That's fine. Threading allows you to have blocking calls. It's slower, but very rare anyway. Why is it n^2? It's still n*m. If your image is 4TB instead of 100GB, the time increases by a factor of 40 for both. It's n*m but either n ~= m in which case it's n^2, or m << n, in which case, it's just n, or m >> n in which case, it's just O(m). This is where asymptotic complexity ends up not being terribly helpful :-) Let me put this another way though, if you support internal snapshots, what's a reasonable number of snapshots to expect reasonable performance with? 10? 100? 1000? 1? I'd say 10. Not that I really want to support internal snapshots, it doesn't work well with multiple disks. 
I'm okay with that. An image file should require a file system. If I was going to design an image file to be used on top of raw storage, I would take an entirely different approach. That spreads our efforts further. No. I don't think we should be in the business of designing on top of raw storage. Either assume fixed partitions, LVM, or a file system. We shouldn't reinvent the wheel at every opportunity (just the carefully chosen opportunities). I agree, but in this case there was no choice. Refcount table. See above discussion for my thoughts on refcount table. Ok. It boils down to is fsck on startup acceptable. Without a freelist, you need fsck for both unclean shutdown and for UNMAP. To rebuild the free list on unclean shutdown. If you have an on-disk compact freelist, you don't need that fsck. If you have an on-disk compact [consistent] freelist, you don't need that fsck. Consistency is the key point. We go out of our way to avoid a consistent freelist in QED because it's the path to best performance. The key goal for a file format should be to have exactly as much consistency as required and not one bit more as consistency always means worse performance. Preallocation lets you have a consistent (or at least conservative) free list, with just a bit of extra consistency. If you piggy back preallocation on guest syncs, you don't even pay for that. On the other hand, linear L2 (which now become L1) means your fsck is just a linear scan of the table, which is probably faster than qcow2 allocation... (an aside: with cache!=none we're bouncing in the kernel as well; we really need to make it work for cache=none, perhaps use O_DIRECT for data and writeback for metadata and shared backing images). QED achieves zero-copy with cache=none today. In fact, our performance testing that we'll publish RSN is exclusively with cache=none. In this case, preallocation should really be cheap, since there isn't a ton of dirty data that needs to be flushed. 
You issue an extra flush once in a while so your truncate (or physical image size in the header) gets to disk, but that doesn't block new writes. It makes qed/lvm work, and it replaces the need to fsck for the next allocation with the need for a background scrubber to reclaim storage (you need that anyway for UNMAP). It makes the whole thing a lot more attractive IMO. For a 1PB disk image with qcow2, the reference count table is 128GB. For a 1TB image, the reference count table is 128MB. For a 128GB image, the reference table is 16MB which is why we get away with it today. Anytime you grow the freelist with qcow2, you have to write a brand new freelist table and update the metadata synchronously to point to a new version of it. That means for a 1TB image, you're potentially writing out
Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
Am 12.09.2010 um 19:47 schrieb Blue Swirl: On Sun, Sep 12, 2010 at 5:35 PM, Andreas Färber andreas.faer...@web.de wrote: Am 12.09.2010 um 19:22 schrieb Blue Swirl: What is the output of sh ./tracetool --nop --check-backend? ./tracetool: syntax error at line 51: `$' unexpected Does this patch fix the problem?

diff --git a/tracetool b/tracetool
index 534cc70..c7582bf 100755
--- a/tracetool
+++ b/tracetool
@@ -48,7 +48,8 @@ get_argnames()
 {
     local nfields field name
     nfields=0
-    for field in $(get_args $1); do
+    args=`get_args $1`
+    for field in $args; do

This part yes. (I took the liberty of adding args to the local line above)

         nfields=$((nfields + 1))

Next error is here: ./tracetool: syntax error at line 53: `nfields=$' unexpected
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
Thanks for the response. I tried './configure' without any arguments and still the same issue. Find the output from ./configure and make attached. Thanks, Adhyas On Sat, Sep 11, 2010 at 10:37 AM, Anthony Liguori anth...@codemonkey.ws wrote: Please include the full output of configure. Why are you explicitly using --disable-kvm and --disable-xen? Regards, Anthony Liguori out_configure Description: Binary data out_make Description: Binary data
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
On 09/12/2010 01:06 PM, Adhyas Avasthi wrote: Thanks for the response. I tried './configure' without any arguments and still the same issue. Find the output from ./configure and make attached. You've got a weird version of make installed. It's throwing an error when no file is specified for -include but the point of -include vs. include is that -include is supposed to be silent on error. What's the output of 'make --version'? Regards, Anthony Liguori Thanks, Adhyas On Sat, Sep 11, 2010 at 10:37 AM, Anthony Liguorianth...@codemonkey.ws wrote: Please include the full output of configure. Why are you explicitly using --disable-kvm and --disable-xen? Regards, Anthony Liguori
Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
On 09/12/2010 12:51 PM, Avi Kivity wrote: On 09/12/2010 07:09 PM, Anthony Liguori wrote: On 09/12/2010 10:56 AM, Avi Kivity wrote: No, the worst case is 0.003% allocated disk, with the allocated clusters distributed uniformly. That means all your L2s are allocated, but almost none of your clusters are. But in this case, you're so sparse that your metadata is pretty much co-located which means seek performance won't matter much. You still get the rotational delay. But yes, the hit is reduced. But since you have to boot before you can run any serious test, if it takes 5 seconds to do an fsck(), it's highly likely that it's not even noticeable. What if it takes 300 seconds? That means for a 1TB disk you're taking 500ms per L2 entry, you're fully allocated and yet still doing an fsck. That seems awfully unlikely. I meant for a fully populated L1. That's 10ms per L2. Your math is off. A single L2 entry covers 2GB worth of logical space. That means a 1TB image consists of 512 L2s. 300 / 512 == .585 which is 585ms. That's a fully populated L1 on a 1TB image. But since that's 64TB, that's unlikely too. It can still take 10s for a 2TB disk. Ah, you're talking about a 64TB image. Recall that we can read L2s in parallel. I have trouble imagining that we'd get serialized performance with a 64TB backing store. It's much more likely you've got more than one spindle in this scenario. Why is it n^2? It's still n*m. If your image is 4TB instead of 100GB, the time increases by a factor of 40 for both. It's n*m but either n ~= m in which case it's n^2, or m << n, in which case, it's just n, or m >> n in which case, it's just O(m). This is where asymptotic complexity ends up not being terribly helpful :-) Let me put this another way though, if you support internal snapshots, what's a reasonable number of snapshots to expect reasonable performance with? 10? 100? 1000? 1? I'd say 10. Not that I really want to support internal snapshots, it doesn't work well with multiple disks. 
I don't think that's reasonable. The folks that I've talked to about snapshots seem to want to do crazy things like use it for checkpointing. TBH, I think they're looking for the ability to do thousands of checkpoints with an efficient way to release old checkpoints. I imagine that's the design point things like btrfs are trying to achieve. On the other hand, linear L2 (which now become L1) means your fsck is just a linear scan of the table, which is probably faster than qcow2 allocation... And this is just a data layout optimization which is the sort of thing that we should let performance data drive. For a 1PB disk image with qcow2, the reference count table is 128GB. For a 1TB image, the reference count table is 128MB. For a 128GB image, the reference table is 16MB which is why we get away with it today. Anytime you grow the freelist with qcow2, you have to write a brand new freelist table and update the metadata synchronously to point to a new version of it. That means for a 1TB image, you're potentially writing out 128MB of data just to allocate a new cluster. s/freelist/refcount table/ to translate to current qcow2 nomenclature. This is certainly not fast. You can add a bunch of free blocks each time to mitigate the growth, but I can't think of many circumstances where a 128MB write isn't going to be noticeable. And it only gets worse as time moves on because 1TB disk images are already in use today. That's a strong point. qcow2 doubles on each allocation, it amortizes, but the delay is certainly going to be noticeable. You can do it ahead of time (so guest writes don't need to wait) but it's still expensive. The trouble is, safe growth of the reference count table is hard because it's contiguous. That means you need to copy the table to another location all at once instead of just creating a new L1 table and reusing most of the existing L2 entries. It's a damning flaw in the format for large images. 
You can preallocate the whole thing up front to try to avoid the cost at run time but even then, that's a huge cost to pay in disk space up front. It's very easy to neglect the details in something like qcow2. We've been talking like the refcount table is basically free to read and write but it's absolutely not. With large disk images, you're caching an awful lot of metadata to read the refcount table in fully. If you reduce the reference count table to exactly two bits, you can store that within the L1/L2 metadata since we have an extra 12 bits worth of storage space. Since you need the L1/L2 metadata anyway, we might as well just use that space as the authoritative source of the free list information. The only difference between qcow2 and qed is that since we use an on-demand table for L1/L2, our free list may be non-contiguous. Since we store virtual -> physical instead of physical -> virtual, you have to do a
Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
On Sun, Sep 12, 2010 at 5:58 PM, Andreas Färber andreas.faer...@web.de wrote: Am 12.09.2010 um 19:47 schrieb Blue Swirl: On Sun, Sep 12, 2010 at 5:35 PM, Andreas Färber andreas.faer...@web.de wrote: Am 12.09.2010 um 19:22 schrieb Blue Swirl: What is the output of sh ./tracetool --nop --check-backend? ./tracetool: syntax error at line 51: `$' unexpected Does this patch fix the problem?

diff --git a/tracetool b/tracetool
index 534cc70..c7582bf 100755
--- a/tracetool
+++ b/tracetool
@@ -48,7 +48,8 @@ get_argnames()
 {
     local nfields field name
     nfields=0
-    for field in $(get_args $1); do
+    args=`get_args $1`
+    for field in $args; do

This part yes. (I took the liberty of adding args to the local line above)

         nfields=$((nfields + 1))

Next error is here: ./tracetool: syntax error at line 53: `nfields=$' unexpected That looks fully standards compliant, so Solaris' /bin/sh is not. Can you try what happens with /usr/xpg4/bin/sh?
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
Am 12.09.2010 um 21:26 schrieb Anthony Liguori: On 09/12/2010 01:06 PM, Adhyas Avasthi wrote: Thanks for the response. I tried './configure' without any arguments and still the same issue. Find the output from ./configure and make attached. You've got a weird version of make installed. It's throwing an error when no file is specified for -include A warning, not an error. I reported a similar issue here: https://bugs.launchpad.net/bugs/636315 but the point of -include vs. include is that -include is supposed to be silent on error. The manual is very brief and doesn't touch on advanced -include use cases like filters that might result in an empty set of files: If you want make to simply ignore a makefile which does not exist or cannot be remade, with no error message, use the -include directive instead of include, like this: -include filenames... This acts like include in every way except that there is no error (not even a warning) if any of the filenames (or any prerequisites of any of the filenames) do not exist or cannot be remade. http://www.gnu.org/software/make/manual/html_node/Include.html#Include Makefile:331 will be the include for *.d files, and so is Makefile.hw: 24. What's the output of 'make --version'? bash-3.00$ gmake --version GNU Make 3.80 Copyright (C) 2002 Free Software Foundation, Inc. This is free software; see the sources for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Adhyas' looked like a 3.79.1 preview, judging by the command name make-3.79.1-p7. ;) Regards, Andreas
[Qemu-devel] [PATCH 1/5] use qemu_blockalign consistently
Use qemu_blockalign for all allocations in the block layer. This allows increasing the required alignment, which is needed to support O_DIRECT on devices with large block sizes.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/scsi-disk.c
===================================================================
--- qemu.orig/hw/scsi-disk.c	2010-09-12 14:42:40.942759377 -0300
+++ qemu/hw/scsi-disk.c	2010-09-12 14:43:04.694759377 -0300
@@ -70,14 +70,15 @@ struct SCSIDiskState
     char *serial;
 };
 
-static SCSIDiskReq *scsi_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun)
+static SCSIDiskReq *scsi_new_request(SCSIDiskState *s, uint32_t tag,
+                                     uint32_t lun)
 {
     SCSIRequest *req;
     SCSIDiskReq *r;
 
-    req = scsi_req_alloc(sizeof(SCSIDiskReq), d, tag, lun);
+    req = scsi_req_alloc(sizeof(SCSIDiskReq), &s->qdev, tag, lun);
     r = DO_UPCAST(SCSIDiskReq, req, req);
-    r->iov.iov_base = qemu_memalign(512, SCSI_DMA_BUF_SIZE);
+    r->iov.iov_base = qemu_blockalign(s->bs, SCSI_DMA_BUF_SIZE);
     return r;
 }
 
@@ -939,7 +940,7 @@ static int32_t scsi_send_command(SCSIDev
     }
     /* ??? Tags are not unique for different luns.  We only implement a
        single lun, so this should not matter. */
-    r = scsi_new_request(d, tag, lun);
+    r = scsi_new_request(s, tag, lun);
     outbuf = (uint8_t *)r->iov.iov_base;
     is_write = 0;
     DPRINTF("Command: lun=%d tag=0x%x data=0x%02x", lun, tag, buf[0]);
Index: qemu/hw/sd.c
===================================================================
--- qemu.orig/hw/sd.c	2010-09-12 14:31:04.387759376 -0300
+++ qemu/hw/sd.c	2010-09-12 14:43:04.695759377 -0300
@@ -440,7 +440,7 @@ SDState *sd_init(BlockDriverState *bs, i
     SDState *sd;
 
     sd = (SDState *) qemu_mallocz(sizeof(SDState));
-    sd->buf = qemu_memalign(512, 512);
+    sd->buf = qemu_blockalign(bs, 512);
     sd->spi = is_spi;
     sd->enable = 1;
     sd_reset(sd, bs);
Index: qemu/qemu-io.c
===================================================================
--- qemu.orig/qemu-io.c	2010-09-12 14:31:04.394759376 -0300
+++ qemu/qemu-io.c	2010-09-12 14:43:04.699759377 -0300
@@ -61,7 +61,7 @@ static void *qemu_io_alloc(size_t len, i
     if (misalign)
         len += MISALIGN_OFFSET;
-    buf = qemu_memalign(512, len);
+    buf = qemu_blockalign(bs, len);
     memset(buf, pattern, len);
     if (misalign)
         buf += MISALIGN_OFFSET;
Index: qemu/qemu-nbd.c
===================================================================
--- qemu.orig/qemu-nbd.c	2010-09-12 14:31:04.401759376 -0300
+++ qemu/qemu-nbd.c	2010-09-12 14:43:04.706759377 -0300
@@ -446,7 +446,7 @@ int main(int argc, char **argv)
         max_fd = sharing_fds[0];
         nb_fds++;
 
-        data = qemu_memalign(512, NBD_BUFFER_SIZE);
+        data = qemu_blockalign(bs, NBD_BUFFER_SIZE);
         if (data == NULL)
             errx(EXIT_FAILURE, "Cannot allocate data buffer");
Index: qemu/posix-aio-compat.c
===================================================================
--- qemu.orig/posix-aio-compat.c	2010-09-12 14:42:46.725759377 -0300
+++ qemu/posix-aio-compat.c	2010-09-12 14:43:04.711759377 -0300
@@ -270,7 +270,7 @@ static ssize_t handle_aiocb_rw(struct qe
      * Ok, we have to do it the hard way, copy all segments into
      * a single aligned buffer.
      */
-    buf = qemu_memalign(512, aiocb->aio_nbytes);
+    buf = qemu_blockalign(aiocb->common.bs, aiocb->aio_nbytes);
     if (aiocb->aio_type & QEMU_AIO_WRITE) {
         char *p = buf;
         int i;
[Qemu-devel] [PATCH 2/5] raw-posix: handle 512 byte alignment correctly
Replace the hardcoded handling of 512 byte alignment with bs->buffer_alignment to handle larger sector size devices correctly. Note that we can not rely on it to be initialized in bdrv_open, so deal with the worst case there.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/block/raw-posix.c
===================================================================
--- qemu.orig/block/raw-posix.c	2010-09-12 14:31:04.344759376 -0300
+++ qemu/block/raw-posix.c	2010-09-12 14:46:02.120759376 -0300
@@ -97,12 +97,12 @@
 #define FTYPE_CD     1
 #define FTYPE_FD     2
 
-#define ALIGNED_BUFFER_SIZE (32 * 512)
-
 /* if the FD is not accessed during that time (in ms), we try to
    reopen it to see if the disk has been changed */
 #define FD_OPEN_TIMEOUT 1000
 
+#define MAX_BLOCKSIZE 4096
+
 typedef struct BDRVRawState {
     int fd;
     int type;
@@ -118,7 +118,8 @@ typedef struct BDRVRawState {
     int use_aio;
     void *aio_ctx;
 #endif
-    uint8_t* aligned_buf;
+    uint8_t *aligned_buf;
+    unsigned aligned_buf_size;
 } BDRVRawState;
 
 static int fd_open(BlockDriverState *bs);
@@ -161,7 +162,12 @@ static int raw_open_common(BlockDriverSt
     s->aligned_buf = NULL;
 
     if ((bdrv_flags & BDRV_O_NOCACHE)) {
-        s->aligned_buf = qemu_blockalign(bs, ALIGNED_BUFFER_SIZE);
+        /*
+         * Allocate a buffer for read/modify/write cycles.  Chose the size
+         * pessimistically as we don't know the block size yet.
+         */
+        s->aligned_buf_size = 32 * MAX_BLOCKSIZE;
+        s->aligned_buf = qemu_memalign(MAX_BLOCKSIZE, s->aligned_buf_size);
         if (s->aligned_buf == NULL) {
             goto out_close;
         }
@@ -278,8 +284,9 @@ static int raw_pread_aligned(BlockDriver
 }
 
 /*
- * offset and count are in bytes, but must be multiples of 512 for files
- * opened with O_DIRECT. buf must be aligned to 512 bytes then.
+ * offset and count are in bytes, but must be multiples of the sector size
+ * for files opened with O_DIRECT. buf must be aligned to sector size bytes
+ * then.
  *
  * This function may be called without alignment if the caller ensures
  * that O_DIRECT is not in effect.
@@ -316,24 +323,25 @@ static int raw_pread(BlockDriverState *b
                      uint8_t *buf, int count)
 {
     BDRVRawState *s = bs->opaque;
+    unsigned sector_mask = bs->buffer_alignment - 1;
     int size, ret, shift, sum;
 
     sum = 0;
 
     if (s->aligned_buf != NULL) {
 
-        if (offset & 0x1ff) {
-            /* align offset on a 512 bytes boundary */
+        if (offset & sector_mask) {
+            /* align offset on a sector size bytes boundary */
 
-            shift = offset & 0x1ff;
-            size = (shift + count + 0x1ff) & ~0x1ff;
-            if (size > ALIGNED_BUFFER_SIZE)
-                size = ALIGNED_BUFFER_SIZE;
+            shift = offset & sector_mask;
+            size = (shift + count + sector_mask) & ~sector_mask;
+            if (size > s->aligned_buf_size)
+                size = s->aligned_buf_size;
             ret = raw_pread_aligned(bs, offset - shift, s->aligned_buf, size);
             if (ret < 0)
                 return ret;
 
-            size = 512 - shift;
+            size = bs->buffer_alignment - shift;
             if (size > count)
                 size = count;
             memcpy(buf, s->aligned_buf + shift, size);
@@ -346,15 +354,15 @@ static int raw_pread(BlockDriverState *b
             if (count == 0)
                 return sum;
         }
-        if (count & 0x1ff || (uintptr_t) buf & 0x1ff) {
+        if (count & sector_mask || (uintptr_t) buf & sector_mask) {
 
             /* read on aligned buffer */
 
             while (count) {
 
-                size = (count + 0x1ff) & ~0x1ff;
-                if (size > ALIGNED_BUFFER_SIZE)
-                    size = ALIGNED_BUFFER_SIZE;
+                size = (count + sector_mask) & ~sector_mask;
+                if (size > s->aligned_buf_size)
+                    size = s->aligned_buf_size;
 
                 ret = raw_pread_aligned(bs, offset, s->aligned_buf, size);
                 if (ret < 0) {
@@ -404,25 +412,28 @@ static int raw_pwrite(BlockDriverState *
                       const uint8_t *buf, int count)
 {
     BDRVRawState *s = bs->opaque;
+    unsigned sector_mask = bs->buffer_alignment - 1;
     int size, ret, shift, sum;
 
     sum = 0;
 
     if (s->aligned_buf != NULL) {
 
-        if (offset & 0x1ff) {
-            /* align offset on a 512 bytes boundary */
-            shift = offset & 0x1ff;
-            ret = raw_pread_aligned(bs, offset - shift, s->aligned_buf, 512);
+        if (offset & sector_mask) {
+            /* align offset on a sector size bytes boundary */
+            shift = offset & sector_mask;
+            ret = raw_pread_aligned(bs, offset - shift, s->aligned_buf,
+                                    bs->buffer_alignment);
             if (ret < 0)
                 return ret;
 
-            size = 512 - shift;
+            size = bs->buffer_alignment - shift;
             if
[Qemu-devel] [PATCH 4/5] scsi-disk: propagate the required alignment
Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/scsi-disk.c
===================================================================
--- qemu.orig/hw/scsi-disk.c	2010-09-12 14:43:04.694759377 -0300
+++ qemu/hw/scsi-disk.c	2010-09-12 14:50:19.049759376 -0300
@@ -1178,6 +1178,7 @@ static int scsi_disk_initfn(SCSIDevice *
         s->qdev.blocksize = s->qdev.conf.logical_block_size;
     }
     s->cluster_size = s->qdev.blocksize / 512;
+    s->bs->buffer_alignment = s->qdev.blocksize;
     s->qdev.type = TYPE_DISK;
 
     qemu_add_vm_change_state_handler(scsi_dma_restart_cb, s);
[Qemu-devel] [PATCH 5/5] ide: propagate the required alignment
IDE is a bit ugly in this respect. For one it doesn't really keep track of a sector size - most of the protocol is in units of 512 bytes, and we assume 2048 bytes for CDROMs which is correct most of the time. Second IDE allocates an I/O buffer long before we know if we're dealing with a CDROM or not, so increase the alignment for the io_buffer unconditionally.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/ide/core.c
===================================================================
--- qemu.orig/hw/ide/core.c	2010-09-12 18:30:06.000000000 -0300
+++ qemu/hw/ide/core.c	2010-09-12 18:32:29.133759395 -0300
@@ -2645,6 +2645,7 @@ int ide_init_drive(IDEState *s, BlockDri
     if (bdrv_get_type_hint(bs) == BDRV_TYPE_CDROM) {
         s->drive_kind = IDE_CD;
         bdrv_set_change_cb(bs, cdrom_change_cb, s);
+        bs->buffer_alignment = 2048;
     } else {
         if (!bdrv_is_inserted(s->bs)) {
             error_report("Device needs media, but drive is empty");
@@ -2679,7 +2680,8 @@ static void ide_init1(IDEBus *bus, int u
     s->bus = bus;
     s->unit = unit;
     s->drive_serial = drive_serial++;
-    s->io_buffer = qemu_blockalign(s->bs, IDE_DMA_BUF_SECTORS*512 + 4);
+    /* we need at least 2k alignment for accessing CDROMs using O_DIRECT */
+    s->io_buffer = qemu_memalign(2048, IDE_DMA_BUF_SECTORS*512 + 4);
     s->io_buffer_total_len = IDE_DMA_BUF_SECTORS*512 + 4;
     s->smart_selftest_data = qemu_blockalign(s->bs, 512);
     s->sector_write_timer = qemu_new_timer(vm_clock,
[Qemu-devel] [PATCH 3/5] virtio-blk: propagate the required alignment
Signed-off-by: Christoph Hellwig h...@lst.de

Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c	2010-09-12 14:42:46.721759377 -0300
+++ qemu/hw/virtio-blk.c	2010-09-12 14:51:09.737759375 -0300
@@ -540,6 +540,7 @@ VirtIODevice *virtio_blk_init(DeviceStat
     register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
                     virtio_blk_save, virtio_blk_load, s);
     bdrv_set_removable(s->bs, 0);
+    s->bs->buffer_alignment = conf->logical_block_size;
 
     return &s->vdev;
 }
Re: [Qemu-devel] [Bug 636315] [NEW] configure and build errors on Solaris 10 due to /bin/sh usage
Am 12.09.2010 um 23:05 schrieb Blue Swirl: On Sun, Sep 12, 2010 at 5:58 PM, Andreas Färber andreas.faer...@web.de wrote: Am 12.09.2010 um 19:47 schrieb Blue Swirl: nfields=$((nfields + 1)) ./tracetool: syntax error at line 53: `nfields=$' unexpected That looks like fully standards compliant, so Solaris' /bin/sh is not. Can you try what happens with /usr/xpg4/bin/sh? Works fine! Must've done something wrong when testing that earlier today. configure, create_config and tracetool with your fix are silent when /usr/xpg4/bin is in the $PATH. If you commit it, we can close this ticket. Thanks for your help, Blue. Build still fails, in qemu-nbd.c due to err.h, but that's unrelated to /bin/sh. After disabling the tools in configure, sparc-softmmu builds fine again.
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
bash-3.2$ make --version GNU Make version 3.79.1-p7, by Richard Stallman and Roland McGrath. Built for i686-pc-linux-gnu Copyright (C) 1988, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 2000 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. This is really blocking my work at this point as I cannot even compile. Is there a workaround that can unblock me for the time being ? Thanks, Adhyas On Sun, Sep 12, 2010 at 12:26 PM, Anthony Liguori anth...@codemonkey.ws wrote: What's the output of 'make --version'?
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
Am 13.09.2010 um 00:08 schrieb Adhyas Avasthi: Is there a workaround that can unblock me for the time being ? Try giving an explicit --target-list= to configure, to narrow down what build step causes this. There may be subtle dependency issues, cf. my recent patch trace: Fix user emulator dependency on trace objects which fixes a similar issue. As for blocking your work, just try the last stable version. Cheers, Andreas
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
On 09/12/2010 04:16 PM, Andreas Färber wrote: Am 12.09.2010 um 21:26 schrieb Anthony Liguori: On 09/12/2010 01:06 PM, Adhyas Avasthi wrote: Thanks for the response. I tried './configure' without any arguments and still the same issue. Find the output from ./configure and make attached. You've got a weird version of make installed. It's throwing an error when no file is specified for -include A warning, not an error. I reported a similar issue here: https://bugs.launchpad.net/bugs/636315 but the point of -include vs. include is that -include is supposed to be silent on error. The manual is very brief and doesn't touch on advanced -include use cases like filters that might result in an empty set of files: If you want make to simply ignore a makefile which does not exist or cannot be remade, with no error message, use the -include directive instead of include, like this: -include filenames... This acts like include in every way except that there is no error (not even a warning) if any of the filenames (or any prerequisites of any of the filenames) do not exist or cannot be remade. http://www.gnu.org/software/make/manual/html_node/Include.html#Include Makefile:331 will be the include for *.d files, and so is Makefile.hw:24. What's the output of 'make --version'? bash-3.00$ gmake --version GNU Make 3.80 Copyright (C) 2002 Free Software Foundation, Inc. This is free software; see the sources for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Adhyas' looked like a 3.79.1 preview, judging by the command name make-3.79.1-p7. ;) I have: anth...@titi:~/build/qemu$ make --version GNU Make 3.81 Copyright (C) 2006 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. This program built for x86_64-pc-linux-gnu Which exhibits none of this silliness. Are ya'll using Gentoo? 
Regards, Anthony Liguori Regards, Andreas
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
On Fri, 10 Sep 2010, Adhyas Avasthi wrote: When I try to run the following commands on rc1 or fresh git (synced 5 minutes back), I get the following error: bash-3.2$ ./configure --disable-kvm --disable-xen bash-3.2$ make Makefile:24: no file name for `-include' make-3.79.1-p7[1]: *** No rule to make target `loader.o', needed by `all'. Stop. make-3.79.1-p7: *** [subdir-libhw32] Error 2 Any pointers as to what might be causing it? Adding /dev/null to the list might work. -- Adhyas "Two types have compatible type if their types are the same." -- ANSI C Standard, 3.1.2.6. -- mailto:av1...@comtv.ru
Re: [Qemu-devel] Issue with compiling qemu-0.13.0.-rc1
I tried qemu-0.13.0-rc1 and it has the same issues on this machine. Let me try what you suggest. Thanks, Adhyas On Sun, Sep 12, 2010 at 3:21 PM, Andreas Färber andreas.faer...@web.de wrote: Try giving an explicit --target-list= to configure, to narrow down what build step causes this. There may be subtle dependency issues, cf. my recent patch trace: Fix user emulator dependency on trace objects which fixes a similar issue. As for blocking your work, just try the last stable version.
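Narrowing the target list as Andreas suggests would look like this (the target name here is only an example; substitute whichever emulation target is actually needed):

```
./configure --disable-kvm --disable-xen --target-list=i386-softmmu
make
```

Building one target at a time both speeds up the bisection and makes it obvious which per-target makefile triggers the faulty -include.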
[Qemu-devel] [Bug 636446] Re: prep ppc machine no more working
I want to emulate a PReP machine to be able to install AIX for testing purposes. I will ask on the openbios list. -- prep ppc machine no more working https://bugs.launchpad.net/bugs/636446 You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. Status in QEMU: Confirmed Bug description: I have tried qemu from 0.11 to latest git (as of 11/09/2010), but if I want to use the PREP PPC machine the error is the same: qemu: hardware error: PowerPC 601 / 620 / 970 need a 1MB BIOS (followed by a CPU #0 register dump: NIP, LR, CTR, XER, MSR, HID0, TB, DECR, GPR00-GPR31, CR, FPR00-FPR31, FPSCR, SRR0, SRR1, SDR1; the register values were not preserved in this archive)
Re: [Qemu-devel] Arm : How to investigate Qemu blank screen on target system problem?
On Sun, Sep 12, 2010 at 02:43, Robert Smith wmk...@yahoo.com wrote: Sirs, I've built kernel v2.6.33.2 and am trying to simulate a RealView (or Integrator) ARM board on qemu under Ubuntu 9.10. make ARCH=arm realview_defconfig make ARCH=arm CROSS_COMPILE=arm-linux- qemu-system-arm -M realview -kernel zImage.2.6.33.2.realview -initrd initramfs.cpio.gz or make ARCH=arm integrator_defconfig make ARCH=arm CROSS_COMPILE=arm-linux- qemu-system-arm -M integratorcp -kernel zImage.2.6.33.2.integrator -initrd initramfs.cpio.gz In both cases I am getting a blank Qemu screen on the target system. I am thinking, is it due to the serial output? I.e., the kernel is writing to the serial console rather than the standard (graphical) console? -- regards, Mulyadi Santosa Freelance Linux trainer and consultant blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com
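If the kernel really is logging to the serial port, routing the guest's serial output to the host terminal will show it. A sketch using the kernel image names from the original command (the console= argument assumes the PL011 UART that the RealView and Integrator models provide as ttyAMA0; adjust if the kernel config differs):

```
qemu-system-arm -M realview -kernel zImage.2.6.33.2.realview \
    -initrd initramfs.cpio.gz \
    -append "console=ttyAMA0" -serial stdio
```

Replacing -serial stdio with -nographic drops the graphical window entirely and multiplexes the serial console onto the terminal, which is often the quickest way to tell whether the kernel booted at all.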
[Qemu-devel] PowerPC code generation and the program counter
I've been using qemu-0.12.4 to trace accesses to non-existent addresses, but I've found that the PC is incorrect when cpu_abort() is called from within the unassigned memory helper routines (unassigned_mem_read[bwl] and unassigned_mem_write[bwl]). Even nearby instructions (plus or minus 10 instructions or so) don't match the type of load or store being done, so this isn't a PC being current_instr+4 kind of thing. I ended up modifying the GEN_LD* and GEN_ST* macros (in target-ppc/translate.c) to call gen_update_nip(ctx, ctx->nip - 4). This fixed the above problem, which has helped enormously. Since I'm not a qemu expert, I was wondering about several things: 1) Was it really necessary to add gen_update_nip to the load and store instructions in order to get the correct PC? Could the correct PC have been derived some other way, without a runtime cost for all basic loads and stores? 2) As the current code lacks that fix, the basic load and store instructions will save an incorrect PC if an exception occurs. If so, how come nobody noticed this before? I think that exceptions would have srr0 pointing at the last instruction which called gen_update_nip. So when the target returns from a data exception, it might re-execute some instructions. Possibly harmless, but could lead to subtle bugs... 
Thanks, Stu

Here's the patch if anybody is interested:

*** translate.c~	Sat Sep 11 23:43:25 2010
--- translate.c	Sun Sep 12 20:49:53 2010
***************
*** 2549,2554 ****
--- 2549,2555 ----
  {                                                             \
      TCGv EA;                                                  \
      gen_set_access_type(ctx, ACCESS_INT);                     \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      EA = tcg_temp_new();                                      \
      gen_addr_imm_index(ctx, EA, 0);                           \
      gen_qemu_##ldop(ctx, cpu_gpr[rD(ctx->opcode)], EA);       \
***************
*** 2564,2569 ****
--- 2565,2571 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);   \
          return;                                               \
      }                                                         \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      if (type == PPC_64B)                                      \
***************
*** 2584,2589 ****
--- 2586,2592 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);   \
          return;                                               \
      }                                                         \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      gen_addr_reg_index(ctx, EA);                              \
***************
*** 2596,2601 ****
--- 2599,2605 ----
  static void glue(gen_, name##x)(DisasContext *ctx)            \
  {                                                             \
      TCGv EA;                                                  \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      gen_addr_reg_index(ctx, EA);                              \
***************
*** 2693,2698 ****
--- 2697,2703 ----
  static void glue(gen_, name)(DisasContext *ctx)               \
  {                                                             \
      TCGv EA;                                                  \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      gen_addr_imm_index(ctx, EA, 0);                           \
***************
*** 2708,2713 ****
--- 2713,2719 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);   \
          return;                                               \
      }                                                         \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      if (type == PPC_64B)                                      \
***************
*** 2727,2732 ****
--- 2733,2739 ----
          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);   \
          return;                                               \
      }                                                         \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      gen_addr_reg_index(ctx, EA);                              \
***************
*** 2739,2744 ****
--- 2746,2752 ----
  static void glue(gen_, name##x)(DisasContext *ctx)            \
  {                                                             \
      TCGv EA;                                                  \
+     gen_update_nip(ctx, ctx->nip - 4);                        \
      gen_set_access_type(ctx, ACCESS_INT);                     \
      EA = tcg_temp_new();                                      \
      gen_addr_reg_index(ctx, EA);                              \