Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-10 Thread Jamie Lokier
Kevin Wolf wrote:
 Am 08.01.2013 11:39, schrieb Liu Yuan:
  On 01/08/2013 06:00 PM, Kevin Wolf wrote:
  Am 08.01.2013 10:45, schrieb Liu Yuan:
  On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
  Otherwise use sheepdog writeback and let QEMU block.c decide when to
  flush.  Never use sheepdog writethrough because it's redundant here.
 
  I don't get it. What do you mean by 'redundant'? If we use virtio 
  sheepdog block driver, how can we specify writethrough mode for Sheepdog
  cache? Here 'writethrough' means use a pure read cache, which doesn't
  need flush at all.
 
  A writethrough cache is equivalent to a write-back cache where each
  write is followed by a flush. qemu makes sure to send these flushes, so
  there is no need to use Sheepdog's writethrough mode.
  
  Implement writethrough as writeback + flush will cause considerable
  overhead for network block device like Sheepdog: a single write request
  will be executed as two requests: write + flush
 
 Yeah, maybe we should have some kind of a FUA flag with write requests
 instead of sending a separate flush.

Note that write+FUA has different semantics than write+flush, at least
with regular disks.

write+FUA commits just what was written, while write+flush commits
everything that was written before.
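
To make the difference concrete, here is a plain-POSIX sketch (illustrative only -- not QEMU or Sheepdog code, and the helper names are made up).  A write on an O_DSYNC descriptor behaves roughly like write+FUA, committing only that write's data, while write followed by fdatasync() behaves like write+flush and commits everything written so far:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* FUA-like: fd_dsync is assumed to have been opened with O_WRONLY | O_DSYNC,
 * so only this write's data must reach stable storage before pwrite() returns. */
ssize_t write_fua_like(int fd_dsync, const void *buf, size_t len, off_t off)
{
    return pwrite(fd_dsync, buf, len, off);
}

/* flush-like: fdatasync() commits this write *and* every earlier completed
 * write on fd -- the stronger, and usually costlier, semantics. */
ssize_t write_then_flush(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0) {
        return n;
    }
    return fdatasync(fd) < 0 ? -1 : n;
}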

-- Jamie



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-10 Thread Jamie Lokier
Paolo Bonzini wrote:
 Il 09/01/2013 14:04, Liu Yuan ha scritto:
  2 The upper layer software which relies on the 'cache=xxx' to choose
cache mode will fail its assumption against new QEMU.
   
   Which assumptions do you mean? As far as I can say the behaviour hasn't
   changed, except possibly for the performance.
 
  When users set 'cache=writethrough' to export only a writethrough cache
  to Guest, but with new QEMU, it will actually get a writeback cache as
  default.
 
 They get a writeback cache implementation-wise, but they get a
 writethrough cache safety-wise.  How the cache is implemented doesn't
 matter, as long as it looks like a writethrough cache.
 
 In fact, consider a local disk that doesn't support FUA.  In old QEMU,
 images used to be opened with O_DSYNC and that splits each write into
 WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
 flushes are created.  Old QEMU changes it in the kernel, new QEMU
 changes it in userspace.
 
  We don't need to communicate to the guest. I think 'cache=xxx' means
  what kind of cache the users *expect* to export to Guest OS. So if
  cache=writethrough set, Guest OS couldn't turn it to writeback cache
  magically. This is like I bought a disk with 'writethrough' cache
  built-in, I didn't expect that it turned to be a disk with writeback
  cache under the hood which could possible lose data when power outage
  happened.
 
 It's not by magic.  It's by explicitly requesting the disk to do this.
 
 Perhaps it's a bug that the cache mode is not reset when the machine is
 reset.  I haven't checked that, but it would be a valid complaint.

The question is, is cache=writeback/cache=writethrough an initial
setting of guest-visible WCE that the guest is allowed to change, or
is cache=writethrough a way of saying don't have a write cache
(which may or may not be reflected in the guest-visible disk id).

I couldn't tell from QEMU documentation which is intended.  It would
be a bit silly if it means different things for different backend
storage.

I have seen (obscure) guest code which toggled WCE to simulate FUA,
and there is plenty of advice out there saying to set WCE=0 for
certain kinds of databases because of its presumed crash safety.  Even
very ancient Linux and Windows guests can set WCE=0 with IDE and
SCSI.

So from the guest's point of view, I think the guest setting WCE=0 should
mean exactly the same as FUA on every write, or a flush after every write,
until the guest sets WCE=1.

-- Jamie



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-10 Thread Jamie Lokier
Paolo Bonzini wrote:
 Il 10/01/2013 16:25, Jamie Lokier ha scritto:
   Perhaps it's a bug that the cache mode is not reset when the machine is
   reset.  I haven't checked that, but it would be a valid complaint.
  The question is, is cache=writeback/cache=writethrough an initial
  setting of guest-visible WCE that the guest is allowed to change, or
  is cache=writethrough a way of saying don't have a write cache
  (which may or may not be reflected in the guest-visible disk id).
 
 It used to be the latter (with reflection in the disk data), but now it
 is the former.

Interesting.  It could be worth a note in the manual.

  I couldn't tell from QEMU documentation which is intended.  It would
  be a bit silly if it means different things for different backend
  storage.
 
 It means the same thing for IDE, SCSI and virtio-blk.  Other backends,
 such as SD, do not even have flush, and are really slow with
 cache=writethrough because they write one sector at a time.  For this
 reason they cannot really be used in a safe manner.
 
  I have seen (obscure) guest code which toggled WCE to simulate FUA,
 
 That's quite useless, since WCE=1->WCE=0 is documented to cause a flush
 (and it does).  Might as well send a real flush.

It was because the ATA spec seemed to permit the combination of WCE
with no flush command supported.  So WCE=1->WCE=0 was used to flush,
and kept at WCE=0 for the subsequent logging write-FUA(s), until a
non-FUA write was wanted.

-- Jamie



Re: [Qemu-devel] [PATCH v7 00/10] i8254, i8259 and running Microport UNIX (ca 1987)

2012-12-11 Thread Jamie Lokier
Matthew Ogilvie wrote:
 2. Just fix it immediately, and don't worry about migration.  Squash
the last few patches together.  A single missed periodic
timer tick that only happens when migrating
between versions of qemu is probably not a significant
concern.  (Unless someone knows of an OS that actually runs
the i8254 in single shot mode 4, where a missed interrupt
could cause a hang or something?)

Hi Matthew,

Such as Linux?  0x38 looks like mode 4 to me.  I suspect it's used in
tickless mode when there isn't a better clock event source.

linux/drivers/clocksource/i8253.c:

#ifdef CONFIG_CLKEVT_I8253
   /* ... */

case CLOCK_EVT_MODE_ONESHOT:
/* One shot setup */
outb_p(0x38, PIT_MODE);

   /* ... */

/*
 * Program the next event in oneshot mode
 *
 * Delta is given in PIT ticks
 */
static int pit_next_event(unsigned long delta, struct clock_event_device *evt)
{
        raw_spin_lock(&i8253_lock);
        outb_p(delta & 0xff , PIT_CH0); /* LSB */
        outb_p(delta >> 8 , PIT_CH0);   /* MSB */
        raw_spin_unlock(&i8253_lock);

        return 0;
}

   /* ... */
#endif

 4. Support both old and fixed i8254 models, selectable at runtime
with a command line option.  (Question: What should such an
option look like?)  This may be the best way to actually
change the 8254, but I'm not sure changes are even needed.
It's certainly getting rather far afield from running Microport
UNIX...

I can't see a reason to have the old behaviour, if every guest works
with the new one, except for this awkward cross-version migration
thing.

I guess ideally, device emulations would be versioned when their
behaviour changes, rather like shared libraries are, and the
appropriate old version kept around to be loaded for a particular
machine that's still running with it.  Sounds a bit complicated though.

Best,
-- Jamie



Re: [Qemu-devel] [PATCH] qemu-timer: Don't use RDTSC on 386s and 486s

2012-11-23 Thread Jamie Lokier
Peter Maydell wrote:
 On 23 November 2012 15:15, Paolo Bonzini pbonz...@redhat.com wrote:
  Il 23/11/2012 16:12, Peter Maydell ha scritto:
  Adjust the conditional which guards the implementation of
 
  -#elif defined(__i386__)
  +#elif defined(__i586__)
 
   static inline int64_t cpu_get_real_ticks(void)
   {
 
 
  You should at least test __i686__ too:
 
  $ gcc -m32 -dM -E -x c /dev/null |grep __i
  #define __i686 1
  #define __i686__ 1
  #define __i386 1
  #define __i386__ 1
 
 Yuck. I had assumed gcc would define everything from i386
 on up when building for later cores.

No, and it doesn't define __i686__ on all x86-32 targets after i686 either:

$ gcc -march=core2 -dM -E -x c /dev/null | grep __[0-9a-z] | sort
#define __core2 1
#define __core2__ 1
#define __gnu_linux__ 1
#define __i386 1
#define __i386__ 1
#define __linux 1
#define __linux__ 1
#define __tune_core2__ 1
#define __unix 1
#define __unix__ 1

x86 instruction sets haven't followed a linear progression of features
for quite a while, especially including non-Intel chips, so it stopped
making sense for GCC to indicate the instruction set in that way.

GCC 4.6.3 defines __i586__ only when the target arch is set by -march
(or default) to i586, pentium or pentium-mmx.

And it defines __i686__ only when -march= is set (or default) to c3-2,
i686, pentiumpro, pentium2, pentium3, pentium3m or pentium-m.

Otherwise it's just things like __athlon__, __corei7__, etc.

The only one that's consistent is __i386__ (and __i386).

-- Jamie



Re: [Qemu-devel] [PATCH] qemu-timer: Don't use RDTSC on 386s and 486s

2012-11-23 Thread Jamie Lokier
Peter Maydell wrote:
 On 23 November 2012 15:17, Peter Maydell peter.mayd...@linaro.org wrote:
  On 23 November 2012 15:15, Paolo Bonzini pbonz...@redhat.com wrote:
  You should at least test __i686__ too:
 
  $ gcc -m32 -dM -E -x c /dev/null |grep __i
  #define __i686 1
  #define __i686__ 1
  #define __i386 1
  #define __i386__ 1
 
  Yuck. I had assumed gcc would define everything from i386
  on up when building for later cores.
 
 ...and there's an enormous list of x86 cores too. This bites
 us already -- if you use '-march=native' to get best for my
 cpu then on a Core2, say, it will define __i386__ and __core2__
 but not __i686__, so TCG won't use cmov :-(
 
 Anybody got any good ideas for how to say "is this at least
 a 586/686?" in a way that won't fail for any newly introduced
 x86 core types?

Fwiw, cmov doesn't work on some VIA 686 class CPUs.

Shouldn't TCG decide whether to use cmov at runtime anyway, using
cpuid?  For dynamically generated code it would seem not very
expensive to do that.
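
As a sketch of what such a runtime check could look like (assuming GCC's <cpuid.h>; CMOV is feature bit 15 in EDX of CPUID leaf 1 -- not a proposed patch, just an illustration):

#include <cpuid.h>
#include <stdbool.h>

/* Runtime CMOV check: CPUID leaf 1, EDX bit 15 (bit_CMOV in GCC's cpuid.h). */
static bool host_has_cmov(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;           /* CPUID leaf 1 not supported */
    }
    return (edx & bit_CMOV) != 0;
}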

Looking at GCC source, it has an internal flag to say whether the
target has cmov, but doesn't expose it in preprocessor conditionals.

-- Jamie



Re: [Qemu-devel] [PATCH] qemu-timer: Don't use RDTSC on 386s and 486s

2012-11-23 Thread Jamie Lokier
Peter Maydell wrote:
 On 23 November 2012 15:31, Jamie Lokier ja...@shareable.org wrote:
  x86 instruction sets haven't followed a linear progression of features
  for quite a while, especially including non-Intel chips, so it stopped
  making sense for GCC to indicate the instruction set in that way.
 
 If you're going to go down that route you need to start defining
 #defines for features then, so we could say defined(__rdtsc__)
 or defined(__cmov__) and so on. I don't see any of those either :-(

It does for some major architectural instruction groups like MMX,
different kinds of SSE, etc.  But not everything, and I don't see cmov
among them.  I agree it's unfortunate.

-- Jamie



Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server

2012-10-25 Thread Jamie Lokier
Kevin Wolf wrote:
 Am 24.10.2012 16:32, schrieb Jamie Lokier:
  Kevin Wolf wrote:
  Am 24.10.2012 14:16, schrieb Nicholas Thomas:
  On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
  Since the I/O _order_ before, and sometimes after, flush, is important
  for data integrity, this needs to be maintained when I/Os are queued in
  the disconnected state -- including those which were inflight at the
  time disconnect was detected and then retried on reconnect.
 
  Hmm, discussing this on IRC I was told that it wasn't necessary to
  preserve order - although I forget the fine detail. Depending on the
  implementation of qemu's coroutine mutexes, operations may not actually
  be performed in order right now - it's not too easy to work out what's
  happening.
 
  It's possible to reorder, but it must be consistent with the order in
  which completion is signalled to the guest. The semantics of flush is
  that at the point that the flush completes, all writes to the disk that
  already have completed successfully are stable. It doesn't say anything
  about writes that are still in flight, they may or may not be flushed to
  disk.
  
  I admit I wasn't thinking clearly how much ordering NBD actually
  guarantees (or if there's ordering the guest depends on implicitly
  even if it's not guaranteed in specification), and how that is related
  within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
  that the guest expects for various emulated devices and their settings.
  
  The ordering (if any) needed from the NBD driver (or any backend) is
  going to depend on the assumptions baked into the interface between
   QEMU device emulation and the backend.
  
  E.g. if every device emulation waited for all outstanding writes to
  complete before sending a flush, then it wouldn't matter how the
  backend reordered its requests, even getting the completions out of
  order.
  
  Is that relationship documented (and conformed to)?
 
 No, like so many other things in qemu it's not spelt out explicitly.
 However, as I understand it it's the same behaviour as real hardware
 has, so device emulation at least for the common devices doesn't have to
 implement anything special for it. If the hardware even supports
 parallel requests, otherwise it would automatically only have a single
 request in flight (like IDE).

That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
common.

They are features of devices which support multiple parallel requests,
but with certain ordering constraints conveyed by or expected by the
guest, which has to be ensured when it's mapped onto a QEMU fully
asynchronous backend.

That means they are features of the hardware which device emulations
_do_ have to implement.  If they don't, the storage is unreliable on
things like host power removal and virtual power removal.

If the backends are allowed to explicitly have no coupling between
different request types (even flush/discard and write), and ordering
constraints are being enforced by the order in which device emulations
submit and wait, that's fine.

I mention this, because POSIX aio_fsync() is _not_ fully decoupled
according to its specification.

So it might be that some device emulations are depending on the
semantics of aio_fsync() or the QEMU equivalent by now; and randomly
reordering in the NBD driver in unusual circumstances (or any other
backend), would break those semantics.

-- Jamie



Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server

2012-10-24 Thread Jamie Lokier
Kevin Wolf wrote:
 Am 24.10.2012 14:16, schrieb Nicholas Thomas:
  On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
  Since the I/O _order_ before, and sometimes after, flush, is important
  for data integrity, this needs to be maintained when I/Os are queued in
  the disconnected state -- including those which were inflight at the
  time disconnect was detected and then retried on reconnect.
  
  Hmm, discussing this on IRC I was told that it wasn't necessary to
  preserve order - although I forget the fine detail. Depending on the
  implementation of qemu's coroutine mutexes, operations may not actually
  be performed in order right now - it's not too easy to work out what's
  happening.
 
 It's possible to reorder, but it must be consistent with the order in
 which completion is signalled to the guest. The semantics of flush is
 that at the point that the flush completes, all writes to the disk that
 already have completed successfully are stable. It doesn't say anything
 about writes that are still in flight, they may or may not be flushed to
 disk.

I admit I wasn't thinking clearly how much ordering NBD actually
guarantees (or if there's ordering the guest depends on implicitly
even if it's not guaranteed in specification), and how that is related
within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
that the guest expects for various emulated devices and their settings.

The ordering (if any) needed from the NBD driver (or any backend) is
going to depend on the assumptions baked into the interface between
QEMU device emulation and the backend.

E.g. if every device emulation waited for all outstanding writes to
complete before sending a flush, then it wouldn't matter how the
backend reordered its requests, even getting the completions out of
order.

Is that relationship documented (and conformed to)?

-- Jamie



Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server

2012-10-23 Thread Jamie Lokier
Nicholas Thomas wrote:
 On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
  Am 22.10.2012 13:09, schrieb n...@bytemark.co.uk:
   
   This is unlikely to come up now, but is a necessary prerequisite for 
   reconnection
   behaviour.
   
   Signed-off-by: Nick Thomas n...@bytemark.co.uk
   ---
block/nbd.c |   13 +++--
1 files changed, 11 insertions(+), 2 deletions(-)
  
  What's the real requirement here? Silently ignoring a flush and
  returning success for it feels wrong. Why is it correct?
  
  Kevin
 
 I just needed to avoid socket operations while s->sock == -1, and
 extending the existing case of "can't do the command, so pretend I did
 it" to "can't do the command right now, so pretend..." seemed like an
 easy way out.

Hi Nicholas,

Ignoring a flush is another way of saying "corrupt my data in some
circumstances".  We already have options in QEMU to say whether flushes
are ignored on normal discs, but if someone's chosen the "I really
care about my database/filesystem" option, and verified that their NBD
setup really performs flushes (in normal circumstances), silently
dropping them from time to time isn't nice.

I would much rather the guest were forced to wait until reconnection and
then get a successful flush, if the problem is just that the server
was down briefly.  Or, if that is too hard, that the flush is 

Since the I/O _order_ before, and sometimes after, flush, is important
for data integrity, this needs to be maintained when I/Os are queued in
the disconnected state -- including those which were inflight at the
time disconnect was detected and then retried on reconnect.

Ignoring a discard is not too bad.  However, if discard is retried,
then I/O order is important in relation to those as well.

 In the Bytemark case, the NBD server always opens the file O_SYNC, so
 nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
 but I'd be surprised if that's true for all NBD servers. Should we be
 returning 1 here for both not supported and can't do it right now,
 instead?

When the server is opening the file O_SYNC, wouldn't it make sense to
tell QEMU -- and the guest -- that there's no need to send flushes at
all, as it's equivalent to a disk with no write-cache (or disabled)?

Best,
-- Jamie



Re: [Qemu-devel] Using PCI config space to indicate config location

2012-10-09 Thread Jamie Lokier
Rusty Russell wrote:
 I don't think it'll be that bad; reset clears the device to unknown,
 bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
 unknown->modern mode, and anything else is bad (I prefer being strict so
 we catch bad implementations from the beginning).

Will that work if a guest kernel that uses modern mode kexecs to an
older (but presumed reliable) kernel that only knows about legacy mode?

I.e. will the replacement kernel, or (ideally) replacement driver on
the rare occasion that is needed on a running kernel, be able to reset
the device hard enough?

-- Jamie



Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations

2012-09-19 Thread Jamie Lokier
Avi Kivity wrote:
 On 09/13/2012 09:54 AM, liu ping fan wrote:
 
  +typedef struct Atomic {
  +int counter;
  +} Atomic;
 
  Best to mark counter 'volatile'.
 
  +
  +static inline void atomic_set(Atomic *v, int i)
  +{
   +v->counter = i;
  +}
  +
  +static inline int atomic_read(Atomic *v)
  +{
   +return v->counter;
  +}
 
 
  So these two operations don't get mangled by the optimizer.
 
  Browsing linux code and reading lkml, find some similar material. But
  they have moved volatile from ->counter to the function atomic_read().
  As to atomic_read(), I think it need to prevent optimizer from
  refetching issue, but as to atomic_set(), do we need ?
 
 I think so, to prevent reordering.

Hi,

I don't think volatile makes any difference to reordering here.

The compiler is not going to move the atomic_set() store before or
after another instruction on the same atomic variable anyway, just
like it wouldn't do that for an ordinary assignment.

If you're concerned about ordering with respect to other memory, then
volatile wouldn't make much difference.  barrier() before and after would.

If you're copying Linux's semantics, Linux's atomic_set() doesn't
include any barriers, nor imply any.  atomic_read() uses volatile to
ensure that each call re-reads the value, for example in a loop
(same as ACCESS_ONCE()).  If there were a call to atomic_set() in a
loop, there is no guarantee the value would be written each time.
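
A minimal sketch of those semantics -- my reading of the Linux convention, not a drop-in replacement for the patch:

typedef struct Atomic {
    int counter;
} Atomic;

#define barrier() __asm__ __volatile__("" ::: "memory") /* compiler-only barrier */

/* Like ACCESS_ONCE(): the volatile cast forces a real load on every call
 * (e.g. when polled in a loop), but implies no ordering vs. other memory. */
static inline int atomic_read(Atomic *v)
{
    return *(volatile int *)&v->counter;
}

/* Plain store: no barrier implied and no volatile, so repeated sets in a
 * loop may be merged; callers needing ordering must add barrier() or real
 * memory barriers themselves. */
static inline void atomic_set(Atomic *v, int i)
{
    v->counter = i;
}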

-- Jamie



Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations

2012-09-19 Thread Jamie Lokier
liu ping fan wrote:
  +static inline void atomic_set(Atomic *v, int i)
  +{
   +v->counter = i;
  +}

Hi,

When running on ARM Linux kernels prior to 2.6.32, userspace
atomic_set() needs to use clrex or strex too.

See Linux commit 200b812d, "Clear the exclusive monitor when returning
from an exception".

You can see ARM's atomic_set() used to use strex, and warns it's
important.  The kernel patch allows atomic_set() to be simplified, and
that includes for userspace, by putting clrex/strex in the exception
return path instead.

However, someone may run QEMU on a kernel before 2.6.32, which isn't
that old.  (E.g. my phone is running 2.6.28).

Otherwise you can have this situation:

Initially: a = 0.

Thread
  atomic_inc(a, 1)
  = ldrex, add, [strex interrupted]

 Interrupted by signal handler
  atomic_set(a, 3)
  = str
 Signal return

Resume thread
  = strex (succeeds because CPU-local exclusive-flag still set)

Result: a = 1, should be impossible when the signal triggered, and
information about the signal is lost.

A more realistic example would use atomic_compare_exchange(), to
atomically read-and-clear or read-and-decrement-if-not-zero a variable set
in a signal handler; however, I've used atomic_inc() to illustrate
because that's in your patch.
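
For reference, the strex-based atomic_set() that older ARM Linux used looked roughly like this (reconstructed from memory of the pre-2.6.32 sources, so treat it as a sketch rather than an exact copy):

static inline void atomic_set(atomic_t *v, int i)
{
        unsigned long tmp;

        /* Loop until the strex succeeds; as a side effect this clears any
         * exclusive reservation held by an interrupted ldrex/strex pair. */
        __asm__ __volatile__("@ atomic_set\n"
"1:     ldrex   %0, [%1]\n"
"       strex   %0, %2, [%1]\n"
"       teq     %0, #0\n"
"       bne     1b"
        : "=&r" (tmp)
        : "r" (&v->counter), "r" (i)
        : "cc");
}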

Best,
-- Jamie



Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations

2012-09-19 Thread Jamie Lokier
Peter Maydell wrote:
 On 19 September 2012 14:32, Jamie Lokier ja...@shareable.org wrote:
  However, someone may run QEMU on a kernel before 2.6.32, which isn't
  that old.  (E.g. my phone is running 2.6.28).
 
 NB that ARM kernels that old have other amusing bugs, such
 as not saving the floating point registers when invoking
 signal handlers.

Hi Peter,

It's not that old (< 3 years).  Granted that's not a nice one, but I'm
under the impression it occurs only when the signal handler uses (VFP
hardware) floating point.  I.e. most programs don't do that, they keep
the signal handlers simple (probably including QEMU).

(I've read about other platforms that have similar issues using
floating point in signal handlers; best avoided.)

Anyway, people are running those kernels, and someone will try to run
QEMU on them unless...

 I would be happy for QEMU to just say "your kernel is too old!"...

I'd be quite happy with that as well, if you want to put a check in
and refuse to run (like Glibc does).

Less happy with obscure, rare failures of atomicity that are
practically undebuggable, and easily fixed.

Cheers,
-- Jamie



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-09 Thread Jamie Lokier
Anthony Liguori wrote:
 The new API will do away with the IOAPIC/PIC/PIT emulation and defer
 them to userspace.
 
 I'm a big fan of this.

I agree with getting rid of unnecessary emulations.
(Why were those things emulated in the first place?)

But it would be good to retain some way to plug in device emulations
in the kernel, separate from the KVM core, with a well-defined API boundary.

Then it wouldn't matter to the KVM core whether there's PIT emulation
or whatever; that would just be a separate module.  Perhaps even with
its own /dev device, and maybe not tightly bound to KVM.

 Note: this may cause a regression for older guests that don't
 support MSI or kvmclock.  Device assignment will be done using
 VFIO, that is, without direct kvm involvement.

I don't like the sound of regressions.

I tend to think of a VM as something that needs to have consistent
behaviour over a long time, for keeping working systems running for
years despite changing hardware, or reviving old systems to test
software and make patches for things in long-term maintenance etc.

But I haven't noticed problems from upgrading kernelspace-KVM yet,
only upgrading the userspace parts.  If a kernel upgrade is risky,
that makes upgrading host kernels difficult and all or nothing for
all the guests within.

However it looks like you mean only the performance characteristics
will change because of moving things back to userspace?

 Local APICs will be mandatory, but it will be possible to hide them from
 the guest.  This means that it will no longer be possible to emulate an
 APIC in userspace, but it will be possible to virtualize an APIC-less
 core - userspace will play with the LINT0/LINT1 inputs (configured as
 EXITINT and NMI) to queue interrupts and NMIs.
 
 I think this makes sense.  An interesting consequence of this is
 that it's no longer necessary to associate the VCPU context with an
 MMIO/PIO operation.  I'm not sure if there's an obvious benefit to
 that but it's interesting nonetheless.

Would that be useful for using VCPUs to run sandboxed userspace code
with the ability to trap and control the whole environment (as opposed to
guest OSes, or ptrace which is rather incomplete and unsuitable for
sandboxing code meant for other OSes)?

Thanks,
-- Jamie



Re: [Qemu-devel] Get only TCG code without execution

2012-02-09 Thread Jamie Lokier
陳韋任 wrote:
  As x86 doesn't use or need barrier instructions, when translating x86
  to (say) run on ARM host, multi-threaded code that needs barriers
  isn't easy to detect, so barriers may be required between every memory
  access in the generated ARM code.
 
   Sounds awful to me. Regardless of current QEMU's support for multi-threaded
 applications, is it possible to emulate an architecture with a stronger memory
 model on a weaker one?

It's possible; unfortunately those barriers tend to be quite
expensive and they are needed often, so it would run slowly. Probably
a lot slower than using a single host thread with preemption to
simulate multiple guest CPUs. But someone should try it and find out.

It might be possible to do some deep analysis of the guest to work out
which memory accesses don't need barriers, but it's a hard research
problem with no guarantee of a good solution.

One strategy which comes to mind is simulated MESI or MOESI (cache
coherency protocols) at the page level, so independent guest threads
never have unsynchronised access to the same page. Or at finer
granularity, with more emulation overhead (but still maybe less than
barriers). Another is software transactional memory techniques.

Neither will run system software at great speed, but certain kinds of
mostly-independent processing, for example a guest running mainly
userspace number crunching in independent processes, might work
alright.

-- Jamie



Re: [Qemu-devel] [PATCH] main-loop: For tools, initialize timers as part of qemu_init_main_loop()

2012-01-21 Thread Jamie Lokier
Michael Roth wrote:
 In some cases initializing the alarm timers can lead to non-negligable
 overhead from programs that link against qemu-tool.o. At least,
 setting a max-resolution WinMM alarm timer via mm_start_timer() (the
 current default for Windows) can increase the tick rate on Windows
 OSs and affect frequency scaling, and in the case of tools that run
 in guest OSs such has qemu-ga, the impact can be fairly dramatic
 (+20%/20% user/sys time on a core 2 processor was observed from an idle
 Windows XP guest).
 
 This patch doesn't address the issue directly (not sure what a good
 solution would be for Windows, or what other situations it might be
 noticeable),

Is this a timer that needs to fire soon after setting, every time?

I wonder if a different kind of Windows timer, lower-resolution, could
be used if the timeout is longer.  If it has insufficient resolution,
it could be set to trigger a little early, then set a high-resolution
timer at that point.

Maybe that could help for Linux CONFIG_NOHZ guests?

-- Jamie



Re: [Qemu-devel] qemu-kvm upstreaming: Do we need -no-kvm-pit and -no-kvm-pit-reinjection semantics?

2012-01-20 Thread Jamie Lokier
Jan Kiszka wrote:
 Usability. Users should not have to care about individual tick-based
 clocks. They care about "my OS requires lost ticks compensation, yes or no".

Conceivably an OS may require lost ticks compensation depending on
boot options given to the OS telling it which clock sources to use.

However I like the idea of a global default, which you can set and all
the devices inherit it unless overridden in each device.

-- Jamie



Re: [Qemu-devel] Get only TCG code without execution

2012-01-20 Thread Jamie Lokier
Peter Maydell wrote:
   guest binaries don't actually rely that much on the memory model.
 
  I think the reason is those guest binaries are single thread. Memory model 
  is
  important in multi-threaded case. BTW, our binary translator now can 
  translate
  x86 binary to ARM binary, and ARM has weaker memory model than x86.
 
 Yes. At the moment this works for QEMU on ARM hosts because in
 system mode QEMU itself is single-threaded so the nastier interactions
 between multiple guest CPUs don't occur (just about every memory model
 defines that memory interactions within a single thread of execution
 behave in the obvious manner).

 I also had in mind that guest binaries
 tend to make fairly stereotypical use of things like LDREX/STREX
 rather than relying on obscure details like their interaction with
 plain load/stores.

As x86 doesn't use or need barrier instructions, when translating x86
to (say) run on ARM host, multi-threaded code that needs barriers
isn't easy to detect, so barriers may be required between every memory
access in the generated ARM code.

-- Jamie



Re: [Qemu-devel] Get only TCG code without execution

2012-01-20 Thread Jamie Lokier
陳韋任 wrote:
   What's load/store exclusive implementation?

It's how some architectures do atomic operations, instead of having
atomic instructions like x86 does.

 And as a general emulator, QEMU shouldn't implement any
 architecture-specific memory model, right? What comes into my mind
 is QEMU only need to follow guest memory operations when translates
 guest binary to TCG ops. When translate TCG ops to host binary, it
 also has to be careful not to mess up the memory ordering.

The error occurs when emulating two or more guest CPUs in parallel
using two or more host CPUs for speed.  Then "not mess up the memory
ordering" may require barrier instructions in the host binary code,
depending on the guest and host architectures.  Without barrier
instructions, the CPUs reorder memory accesses even if the instruction
order is kept the same.  The rules for this reordering are what is
meant by the memory model.  TCG cannot currently produce these barrier
instructions, and it's not clear if it will ever be able to do so
efficiently.
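
A small illustration of the kind of ordering that differs (plain C, not TCG output): x86 guarantees that other CPUs observe these two stores in program order, ARM does not, so translated code would need a barrier -- GCC's __sync_synchronize() emits dmb on ARM:

extern void use(int value);     /* hypothetical consumer of the data */

int data;
int ready;

void producer(int value)
{
    data = value;
    __sync_synchronize();       /* full barrier: dmb on ARM; x86 already keeps
                                   ordinary stores in program order */
    ready = 1;
}

void consumer(void)
{
    if (ready) {
        __sync_synchronize();   /* matching read-side barrier needed on ARM */
        use(data);
    }
}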

-- Jamie



Re: [Qemu-devel] [PATCH 2/2] qemu-ga: Add the guest-suspend command

2012-01-17 Thread Jamie Lokier
Eric Blake wrote:
 On 01/16/2012 03:51 AM, Jamie Lokier wrote:
  I'm not sure if it's relevant to the this code, but on Glibc fork() is
  not async-signal-safe and has been known to crash in signal handlers.
  This is why fork() was removed from SUS async-signal-safe functions.
 
 fork() is still in the list of async-signal-safe functions [1];

You're right, but it looks like it may be removed in the next edition:

   https://www.opengroup.org/austin/docs/austin_446.txt

 it was only pthread_atfork() which was removed.

I didn't think pthread_atfork() ever was async-signal-safe.

 That is, fork() is _required_
 to be async-signal-safe (and usable from signal handlers), provided that
 the actions following the fork also follow safety rules.

Nonethless, Glibc fork() isn't async-signal-safe even if it should be:

http://sourceware.org/bugzilla/show_bug.cgi?id=4737

  In general, why is multithreadedness relevant to async-signal-safety here?
 
 Because POSIX 2008 (SUS inherits from POSIX, so it has the same
 restriction) states that if a multithreaded app calls fork, the child
 can only portably use async-signal-safe functions up until a successful
 exec or _exit.  Even though the child is _not_ operating in a signal
 handler context, it _is_ operating in a context of a single thread where
 other threads from the parent may have been holding locks, and thus
 calling any unsafe function (that is, any function that tries to obtain
 a lock) may deadlock.

Somewhat confusing, when you have pthread_atfork() existing for the
entire purpose of allowing non-async-signal-safe functions, provided
the application isn't multithreaded, but libraries can be (I'm not
sure what the difference between application and library is in this
context).

http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html

It is suggested that programs that use fork() call an exec function
very soon afterwards in the child process, thus resetting all
states. In the meantime, only a short list of async-signal-safe
library routines are promised to be available.

Unfortunately, this solution does not address the needs of
multi-threaded libraries. Application programs may not be aware that a
multi-threaded library is in use, and they feel free to call any
number of library routines between the fork() and exec calls, just as
they always have. Indeed, they may be extant single-threaded programs
and cannot, therefore, be expected to obey new restrictions imposed by
the threads library.

 I don't know if qemu-ga is intended to be a multi-threaded app, so I
 don't know if being paranoid about async-signal-safety matters in this
 particular case, but I _do_ know that libvirt has encountered issues
 with using non-safe functions prior to exec, which is why it always
 raises red flags when I see unsafe code between fork and exec.

Quite right, I agree. :-)

-- Jamie



Re: [Qemu-devel] [PATCH 2/2] qemu-ga: Add the guest-suspend command

2012-01-17 Thread Jamie Lokier
Michael Roth wrote:
 STDIO is one of the major areas of code that is definitely not
 async signal safe. Consider Thread A doing something like
 fwrite(stderr, Foo\n), while another thread forks, and then
 its child also does an fwrite(stderr, Foo\n). Given that
 every stdio function will lock/unlock a mutex, you easily get
 this sequence of events:
 
 1. Thread A: lock(stderr)
 2. Thread A: write(stderr, "foo\n");
 3. Thread B: fork() -  Process B1
 4. Thread A: unlock(stderr)
 5.   Process B1: lock(stderr)
 
 When the child process is started at step 3, the FILE *stderr
 object will be locked by thread A.  When Thread A does the
 unlock in step 4, it has no effect on Process B1. So process
 B1 hangs forever in step 5.
 
 Ahh, thanks for the example. I missed that these issues were
 specifically WRT to code that was fork()'d from a multi-threaded
 application. Seemed pretty scary otherwise :)

The pthread_atfork() mechanism, or equivalent in libc, should be
sorting out those stdio locks, but I don't know for sure what Glibc does.

I do know it traverses a stdio list on fork() though:

   http://sourceware.org/bugzilla/show_bug.cgi?id=4737#c4

Which is why Glibc's fork() is not async-signal-safe even though it's
supposed to be.

stdio in a fork() child is historical unix stuff; I expect there are
quite a lot of old applications that use stdio in a child process.
Not multi-threaded applications, but they can link to multi-threaded
libraries these days without knowing.

Still there are bugs around (like Glibc's fork() not being
async-signal-safe).  It pays to be cautious.
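
For what it's worth, the usual pthread_atfork() pattern for a library's own lock looks something like this (a generic sketch, not what Glibc actually does for stdio):

#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock before fork() so no other thread can hold it at the moment
 * the address space is copied; release it again in both parent and child. */
static void lib_prepare(void) { pthread_mutex_lock(&lib_lock); }
static void lib_parent(void)  { pthread_mutex_unlock(&lib_lock); }
static void lib_child(void)   { pthread_mutex_unlock(&lib_lock); }

static void lib_init(void)
{
    pthread_atfork(lib_prepare, lib_parent, lib_child);
}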

-- Jamie



Re: [Qemu-devel] [PATCH 2/2] qemu-ga: Add the guest-suspend command

2012-01-16 Thread Jamie Lokier
Eric Blake wrote:
 On 01/13/2012 12:15 PM, Luiz Capitulino wrote:
  This might look complex, but the final code is quite simple. The
  purpose of that approach is to allow qemu-ga to reap its children
  (semi-)automatically from its SIGCHLD handler.
 
 Yes, given your desire for the top-level qemu-ga signal handler to be
 simple, I can see why you did a double fork, so that the intermediate
 child can change the SIGCHLD behavior and actually do a blocking wait in
 the case where status should not be ignored.

An alternative is for SIGCHLD to write a byte to a non-blocking pipe
and do nothing else.  A main loop outside signal context reads from
the pipe, and on each read triggers a subloop of non-blocking
waitpid() getting child statuses until there are no more.  Because
it's outside signal context, it's safe to do anything with the child
statuses.

(A long time ago, on other unixes, this wasn't possible because
SIGCHLD would be retriggered until wait(), but it's not relevant on
anything modern.)
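
A minimal sketch of that self-pipe approach (assumed helper names, not qemu-ga code; the handler only write()s a byte, and all status handling happens in the main loop):

#include <fcntl.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static int sigchld_pipe[2];

static void sigchld_handler(int sig)
{
    char c = 0;
    (void)sig;
    /* write() is async-signal-safe; if the pipe is full the main loop is
     * already going to wake up, so the result can be ignored. */
    (void)write(sigchld_pipe[1], &c, 1);
}

static void reap_children(void)         /* called from the main loop */
{
    char buf[64];

    while (read(sigchld_pipe[0], buf, sizeof(buf)) > 0) {
        /* drain the wakeup pipe */
    }
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid <= 0) {
            break;                      /* no more exited children for now */
        }
        /* safe to do anything with (pid, status) here: not signal context */
    }
}

static int setup_sigchld(void)
{
    struct sigaction sa = { .sa_handler = sigchld_handler };

    if (pipe(sigchld_pipe) < 0) {
        return -1;
    }
    fcntl(sigchld_pipe[0], F_SETFL, O_NONBLOCK);
    fcntl(sigchld_pipe[1], F_SETFL, O_NONBLOCK);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}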

  +execlp(pmutils_bin, pmutils_bin, arg, NULL);
 
 Do we really want to be relying on a PATH lookup, or should we be using
 an absolute path in pmutils_bin?

Since you mention async-signal-safety, execlp() isn't
async-signal-safe!  Last time I checked, in Glibc execlp() could call
malloc().  Also reading PATH looks at the environment, which isn't
always thread-safe either, depending on what else is going on.

I'm not sure if it's relevant to the this code, but on Glibc fork() is
not async-signal-safe and has been known to crash in signal handlers.
This is why fork() was removed from SUS async-signal-safe functions.

 I didn't check whether slog() is async-signal safe (probably not, since
 even snprintf() is not async-signal safe, and you are passing a printf
 style format string).  But strerror() is not, so you shouldn't be using
 it in the child if qemu-ga is multithreaded.

In general, why is multithreadedness relevant to async-signal-safety here?

Thanks,
-- Jamie



Re: [Qemu-devel] converging around a single guest agent

2011-11-17 Thread Jamie Lokier
 On 11/16/2011 03:36 PM, Anthony Liguori wrote:
 We have another requirement. We need to embed the source for the guest
 agent in the QEMU release tarball. This is for GPL compliance since we
 want to include an ISO (eventually) that contains binaries.

Paolo Bonzini wrote:
 ovirt-guest-agent is licensed under GPLv3, so you do not need to;
 the options in GPLv3 include this one:
 
 d) Convey the object code by offering access from a designated
 place (gratis or for a charge), and offer equivalent access to the
 Corresponding Source in the same way through the same place at no
 further charge.  You need not require recipients to copy the
 Corresponding Source along with the object code.  If the place to
 copy the object code is a network server, the Corresponding Source
 may be on a different server (operated by you or a third party)
 that supports equivalent copying facilities, provided you maintain
 clear directions next to the object code saying where to find the
 Corresponding Source.  Regardless of what server hosts the
 Corresponding Source, you remain obligated to ensure that it is
 available for as long as needed to satisfy these requirements.

Hi,

GPLv2 also has a clause similar to the above.  In GPLv2 it's not
enumerated, but says:

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

I'm not sure why "mere aggregation" (GPLv2) and "aggregate" (GPLv3)
aren't sufficient to allow shipping the different binaries together in
a single ISO regardless of where the source code lives or how it's licensed.

-- Jamie



Re: [Qemu-devel] [PATCH 5/6] Do constant folding for shift operations.

2011-05-27 Thread Jamie Lokier
Richard Henderson wrote:
 On 05/26/2011 01:25 PM, Blue Swirl wrote:
  I don't see the point.  The C99 implementation defined escape hatch
  exists for weird cpus.  Which we won't be supporting as a QEMU host.
  
  Maybe not, but a compiler with this property could arrive. For
  example, GCC developers could decide that since this weirdness is
  allowed by the standard, it may be implemented as well.
 
 If you like, you can write a configure test for it.  But, honestly,
 essentially every place in qemu that uses shifts on signed types
 would have to be audited.  Really.

I agree, the chance of qemu ever working, or needing to work, on a
non-two's-complement machine is pretty remote!

 The C99 hook exists to efficiently support targets that don't have
 arithmetic shift operations.  Honestly.

If you care, this should be portable without a configure test, as
constant folding should have the same behaviour:

(((int32_t)-3 >> 1 == (int32_t)-2)
 ? (int32_t)x >> (int32_t)y
 : long_winded_portable_shift_right(x, y))
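
long_winded_portable_shift_right() is of course just a placeholder; one hypothetical definition, purely for illustration:

#include <stdint.h>

/* Arithmetic right shift of a negative value without relying on the
 * implementation-defined behaviour of signed >>. */
static inline int32_t long_winded_portable_shift_right(int32_t x, unsigned y)
{
    if (x >= 0) {
        return x >> y;                          /* well-defined case */
    }
    /* Complement, shift zeros in, then map back: -(~x >> y) - 1. */
    return -(int32_t)(~(uint32_t)x >> y) - 1;
}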

-- Jamie



Re: [Qemu-devel] [PATCH] Add support for fd: protocol

2011-05-24 Thread Jamie Lokier
Stefan Hajnoczi wrote:
 On Mon, May 23, 2011 at 11:49 PM, Jamie Lokier ja...@shareable.org wrote:
  Being able to override the backing file path would be useful anyway.
 
  I've already had problems when moving established qcow2 files between
  systems, that for historical reasons contain either an absolute path
  inside for the backing file, or some messy ../../whatever, or
  foo/bar/whatever, or backing.img (lots of different ones with the
  same name), all of which are a pain to work around.
...
 Try the qemu-img rebase -f command:
 
 qemu-img uses the unsafe mode if -u is specified. In this mode, only the
 backing file name and format of filename is changed without any
 checks on the
 file contents. The user must take care of specifying the correct new 
 backing
 file, or the guest-visible content of the image will be corrupted.
 
 This mode is useful for renaming or moving the backing file to somewhere
 else.  It can be used without an accessible old backing file, i.e. you can
 use it to fix an image whose backing file has already been moved/renamed.

Yes indeed.  That feature was added after the last time I dealt with this 
problem.

However, I have wanted to open *precious*, *read-only* qcow2 images,
for example with -snapshot or the explicit equivalent, and for those
precious images I am loath to let any tool write a single byte to
them.  The files are kept read-only, and often with the immutable
attribute on Linux, backed up and checksummed just to be sure.

I'd rather just override the value on the command line, so if that
feature may turn up for fd: related reasons, it'll be handy for the
read-only moved qcow2 file reason too.

-- Jamie



Re: [Qemu-devel] [PATCH 1/2] coroutine: introduce coroutines

2011-05-24 Thread Jamie Lokier
Daniel P. Berrange wrote:
 On Wed, May 11, 2011 at 03:45:39PM +0200, Paolo Bonzini wrote:
  On 05/11/2011 03:05 PM, Anthony Liguori wrote:
  
  A very slow way, too (on Windows at least if you use qemu_cond...).
  
  That doesn't mean you can't do a fiber implementation for Windows... but
  having a highly portable fallback is a good thing.
  
  I agree but where would you place it, since QEMU is only portable to
  POSIX and Windows?
  
  osdep-$(CONFIG_POSIX) += coroutine-posix.c
  osdep-$(CONFIG_WIN32) += coroutine-win32.c
  osdep-??? += coroutine-fallback.c
 
 NetBSD forbids the use of 'makecontext' in any application
 which also links to libpthread.so[1]. We used makecontext in
 GTK-VNC's coroutines and got random crashes in threaded
 apps running on NetBSD. So for NetBSD we tell people to use
 the thread based coroutines instead.

You have to use swapcontext(), no wait, you have to use setjmp(), no wait,
_setjmp(), no wait, threads...  Read on.

From Glibc's FAQ, setjmp/longjmp are not portable choices:

- UNIX provides no other (portable) way of effecting a synchronous
  context switch (also known as co-routine switch).  Some versions
  support this via setjmp()/longjmp() but this does not work
  universally.

So in principle you should use swapcontext() in portable code.

(By the way, Glibc goes on about how it won't support swapcontext()
from async signal handlers, i.e. preemption, on some architectures
(IA-64/S-390), and I know it has been very subtly broken from a signal
handler on ARM.  Fair enough, somehow disappointing, but doesn't
matter for QEMU coroutines.)

But swapcontext() etc. have been withdrawn from POSIX 2008:

- Functions to be deleted

  Legacy: Delete all legacy functions except utimes (which should not be 
legacy).
  OB: Default position is to delete all OB functions.

  XSI Functions to change state

  
  _setjmp and _longjmp. Should become obsolete.
  
  getcontext, setcontext, makecontext and swapcontext are already
  marked OB and should be withdrawn. And header file ucontext.h. 

OB means obsolescent.  They were marked obsolescent a few versions
prior, with the rationale that you can use threads instead...

It's not surprising that NetBSD forbids makecontext() with
libpthread.so.  I suspect old versions of FreeBSD, OpenBSD, DragonFly
BSD, (and Mac OS X?), have the same restriction, because they have a
similar pthreads evolutionary history to LinuxThreads.  LinuxThreads
also breaks when using coroutines that switch stacks, because it uses
the stack pointer to know the current thread.

(LinuxThreads is old now, but that particular quirk still affects me
because some uCLinux platforms, on which I wish to use coroutines, still
don't have working NPTL - but they aren't likely to be running QEMU :-)

Finally, if you are using setjmp/longjmp, consider (from FreeBSD man page):

The setjmp()/longjmp() pairs save and restore the signal mask
while _setjmp()/_longjmp() pairs save and restore only the
register set and the stack.  (See sigprocmask(2).)

As setjmp/longjmp were chosen for performance, you may wish to use
_setjmp/_longjmp instead (when available), as swizzling the signal
mask on each switch may involve a system call and be rather slow.
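
For readers unfamiliar with the ucontext API being discussed, minimal usage looks something like this (a toy sketch, not QEMU's coroutine code):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void coroutine_body(void)
{
    printf("coroutine: first entry\n");
    swapcontext(&co_ctx, &main_ctx);     /* yield back to main */
    printf("coroutine: resumed\n");
}                                        /* returning follows uc_link */

int main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;          /* where to go when the body ends */
    makecontext(&co_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &co_ctx);     /* enter the coroutine */
    printf("main: coroutine yielded\n");
    swapcontext(&main_ctx, &co_ctx);     /* resume it */
    printf("main: coroutine finished\n");
    return 0;
}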

-- Jamie



Re: [Qemu-devel] [PATCH 1/2] coroutine: introduce coroutines

2011-05-24 Thread Jamie Lokier
Stefan Hajnoczi wrote:
 On Thu, May 12, 2011 at 10:51 AM, Jan Kiszka jan.kis...@siemens.com wrote:
  On 2011-05-11 12:15, Stefan Hajnoczi wrote:
  From: Kevin Wolf kw...@redhat.com
 
  Asynchronous code is becoming very complex.  At the same time
  synchronous code is growing because it is convenient to write.
  Sometimes duplicate code paths are even added, one synchronous and the
  other asynchronous.  This patch introduces coroutines which allow code
  that looks synchronous but is asynchronous under the covers.
 
  A coroutine has its own stack and is therefore able to preserve state
  across blocking operations, which traditionally require callback
  functions and manual marshalling of parameters.
 
  Creating and starting a coroutine is easy:
 
    coroutine = qemu_coroutine_create(my_coroutine);
    qemu_coroutine_enter(coroutine, my_data);
 
  The coroutine then executes until it returns or yields:
 
    void coroutine_fn my_coroutine(void *opaque) {
        MyData *my_data = opaque;
 
        /* do some work */
 
        qemu_coroutine_yield();
 
        /* do some more work */
    }
 
  Yielding switches control back to the caller of qemu_coroutine_enter().
  This is typically used to switch back to the main thread's event loop
  after issuing an asynchronous I/O request.  The request callback will
  then invoke qemu_coroutine_enter() once more to switch back to the
  coroutine.
 
  Note that coroutines never execute concurrently and should only be used
  from threads which hold the global mutex.  This restriction makes
  programming with coroutines easier than with threads.  Race conditions
  cannot occur since only one coroutine may be active at any time.  Other
  coroutines can only run across yield.
 
  Mmh, is there anything that conceptually prevent fixing this limitation
  later on? I would really like to remove such dependency long-term as
  well to have VCPUs operate truly independently on independent device models.
 
 The use case that has motivated coroutines is the block layer.  It is
 synchronous in many places and definitely not thread-safe.  Coroutines
 is a step that solves the synchronous part of the problem but does
 not tackle the not thread-safe part.
 
 It is possible to move from coroutines to threads but we need to
 remove single-thread assumptions from all the block layer code, which
 isn't a small task.  Coroutines does not prevent us from making the
 block layer thread-safe!

Keeping in mind that you may have to do some of the work even with
coroutines.  If the code is not thread safe, it may contain
assumptions that certain state does not change when it makes blocking
I/O calls, which stops being true once you have coroutines and replace
the I/O calls with async calls.  But at least the checking can be
confined to those places in the code.

It's quite similar to the Linux BKL - scheduling points have to be
checked but nowhere else does.  And, like the BKL, it could be pushed
down in stages over a long time period, to convert the coroutine code
over to concurrent threads over time, rather than in a single step.

By the end, even with full concurrency, there is still some potential
for coroutines, and/or async calls, to be useful for performance
balancing.

-- Jamie



Re: [Qemu-devel] [PATCH 1/2] coroutine: introduce coroutines

2011-05-24 Thread Jamie Lokier
Stefan Hajnoczi wrote:
 My current plan is to try using sigaltstack(2) instead of
 makecontext()/swapcontext() as a hack since OpenBSD doesn't have
 makecontext()/swapcontext().

sigaltstack() is just a system call to tell the system about an
alternative signal stack - that you have allocated yourself using
malloc().  According to 'info libc Signal Stack'.  It won't help you
get a new stack by itself.

Maybe take a look at what GNU Pth does.  It has a similar matrix of
tested platforms using different strategies on each, though it is
slightly different because it obviously doesn't link with
libpthread.so (it provides it!), and it has to context switch from the
SIGALRM handler for pre-emption.

 TBH I'm almost at the stage where I think we should just use threads
 and/or async callbacks, as appropriate.  Hopefully I'll be able to cook
 up a reasonably portable implementation of coroutines though, because
 the prospect of having to go fully threaded or do async callbacks isn't
 attractive in many cases.

Another classic trick is just to call a function recursively which has
a large local array(*), setjmp() every M calls, and longjmp() back to
the start after M*N calls.  That gets you N setjmp() contexts to
switch between, all in the same larger stack so it's fine even with
old pthread implementations, providing the total stack used isn't too
big, and the individual stacks you've allocated aren't too small for
the program.

If the large local array insists on being optimised away, it's
probably better anyway to track the address of a local variable, and
split the stack whenever the address has changed by enough.  Try to
make sure the compiler doesn't optimise away the tail recursion :-)

It works better on non-threaded programs as per-thread stacks are more
likely to have limited size.  *But* the initial thread often has a
large growable stack, just like a single-threaded program.  So it's a
good idea to do the stack carving in the initial thread (doesn't
necessarily have to be at the start of the program).  You may be able
to add guard pages afterwards with mprotect() if you're paranoid :-)

-- Jamie



Re: [Qemu-devel] [0/25] Async threading for VirtFS using glib threads coroutines.

2011-05-24 Thread Jamie Lokier
Venkateswararao Jujjuri wrote:
 This model makes the code simple and also in one shot we can convert
 all v9fs_do_syscalls into asynchronous threads. But as Aneesh raised
 will there be any additional overhead for the additional jumps?  We
 can quickly test it out too.

I'm not sure if this is exactly the right place (I haven't followed
the whole discussion), but there is a useful trick for getting rid of
one of the thread context switches:

Swizzle *which* thread is your main coroutine thread.

Instead of queuing up an item on the work queue, waking the worker
thread pool, and having a worker thread pick up the coroutine, you:

Declare the current thread to *be* a worker thread from this point on,
and queue the calling context for a worker thread to pick up.  When it
picks it up, *that* thread declares itself to be the main (coroutine)
thread.

So the coroutine entry step is just queuing a context for another
thread to pick up, and then diving into the blocking system call
(optimising out the enqueue/dequeue and thread switch).

In a sense, you make the main thread a long-lived work queue entry,
and have a symmetric pool, except that the main thread tends to behave
differently than the other work items.

This only works if the main thread's state is able to follow the
swizzling.  I don't know if KVM VCPUs will do that, for example, or if
there's other per-thread state that won't work.

If the main thread can't be swizzled, you can still use this trick
when doing the coroutine-syscall step starting form an existing
worker thread.

-- Jamie




Re: [Qemu-devel] [RFC] Memory API

2011-05-23 Thread Jamie Lokier
Gleb Natapov wrote:
 On Sun, May 22, 2011 at 10:50:22AM +0300, Avi Kivity wrote:
  On 05/20/2011 02:25 PM, Gleb Natapov wrote:
  
A) Removing regions will change significantly. So far this is done by
setting a region to IO_MEM_UNASSIGNED, keeping truncation. With the new
API that will be a true removal which will additionally restore hidden
regions.
  
  And what problem do you expect may arise from that? Currently accessing
  such region after unassign will result in undefined behaviour, so this
  code is non working today, you can't make it worse.
  
  
  If the conversion were perfect then yes.  However there is a
  possibility that the conversion will not be perfect.
  
  It's also good to have to have the code document its intentions.  If
  you see _overlap() you know there is dynamic address decoding going
  on, or something clever.
  
B) Uncontrolled overlapping is a bug that should be caught by the core,
and a new API is a perfect chance to do this.
  
  Well, this will indeed introduce the difference in behaviour :) The guest
  that ran before will abort now. Are you actually aware of any such
  overlaps in the current code base?
  
  Put a BAR over another BAR, then unmap it.
  
 _overlap will not help with that. PCI BARs can overlap, so _overlap will
 be used to register them. You do not want to abort qemu when the guest
 configures overlapping PCI BARs, do you?

I'd rather guests have no way to abort qemu, except by explicit
agreement... even if they program BARs randomly or do anything else.
Right now my virtual server provider won't let me run my own kernels
because they are paranoid that a non-approved kernel might crash KVM.
Which is reasonable.  Even so, it's possible to reprogram BARs from
guest userspace.

Hot-adding devices, including ones with MMIO or IO addresses that
overlap another existing device, shouldn't make qemu abort either.
Perhaps disable the device, perhaps respond with an error, that's all.

Even then, if hot-adding some ISA device overlaps an existing PCI BAR,
it would be preferable if the devices (probably both of them) simply
didn't receive any bus cycles until the BARs were moved elsewhere,
maybe triggered PCI bus errors or MCEs or something like that, rather
than introducing never-tested-in-practice management-visible state
such as a disabled or refused device.

I don't know if qemu has devices like this, but many real ISA devices
have software-configurable IO, MMIO and IRQ settings (ISAPNP) - it's
not just PCI.

I thoroughly approve of the plan to keep track of overlapping regions
so that adding/removing them has no side effect.  When they conflict
at equal priorities I suggest a good behaviour would be:

   - No access to the underlying device
   - MCE interrupt or equivalent, signalling a bus error

Then the order of registration doesn't make any difference, which is good.

-- Jamie



Re: [Qemu-devel] [PATCH] Add support for fd: protocol

2011-05-23 Thread Jamie Lokier
Markus Armbruster wrote:
 Anthony Liguori anth...@codemonkey.ws writes:
 
  On 05/23/2011 05:30 AM, Daniel P. Berrange wrote:
  It feels to me that turning the current block driver code which just does
  open(2) on files, into something which issues events & asynchronously
  waits for a file would potentially be quite complex.
 
  You also need to be much more careful from a security POV if the mgmt
  app is accepting requests to open arbitrary files from QEMU, to ensure
  the filenames are correctly/strictly validated before opening them and
  giving them back to QEMU. An architecture where the mgmt app decides
  what FDs to supply upfront, has less potential for security errors.
 
  To me the ideal would thus be that we can supply FDs for the backing
  store with -blockdev syntax, and that places where QEMU re-opens files
  would be enhanced to avoid that need. If there are things we can't do
  without closing & re-opening the same file, then perhaps we need some
  new ioctl()/fcntl() calls to change those file attributes on the fly.
 
  I agree.  But my view of blockdev is that you wouldn't set an fd
  attribute but rather the backing file name and use the fd protocol.
  For instance:
 
  -blockdev id=foo-base,path=fd:4,format=raw
  -blockdev id=foo,path=fd:3,format=qcow2,backing_file=foo
 
 I guess you mean backing_file=foo-base.
 
 If none is specified, use the backing file specification stored in the
 image.
 
 Matches my current thinking.

Being able to override the backing file path would be useful anyway.

I've already had problems when moving established qcow2 files between
systems that, for historical reasons, contain either an absolute path
to the backing file, some messy ../../whatever or foo/bar/whatever,
or just backing.img (lots of different files with the same name),
all of which are a pain to work around.

(Imho, it would also make sense if qcow2 files contained a UUID for
their backing file to verify you've given the correct backing file,
and maybe help find it (much like Linux finds real disk devices and
filesystems when mounting these days).)

-- Jamie



[Qemu-devel] Use a hex string (was: [PATCH] qemu: json: Fix parsing of integers = 0x8000000000000000)

2011-05-23 Thread Jamie Lokier
Richard W.M. Jones wrote:
 The problem is to be able to send 64 bit memory and disk offsets
 faithfully.  This doesn't just fail to solve the problem, it's
 actually going to make it a whole lot worse.

Such offsets would be so much more readable in hexadecimal.

So why not use a string 0x80001234 instead?

That is universally Javascript compatible as well as much more
convenient for humans.

Or at least, *accept* a hex string wherever a number is required by
QMP (just because hex is convenient anyway, no compatibility issue),
and *emit* a hex string where the number may be out of Javascript's
unambiguous range, or where a hex string would make more sense anyway.
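
For example (purely illustrative, not a proposed schema), a
block-device offset could travel as { "offset": "0x8000000000001234" }
rather than as a bare JSON number, and consumers would treat it as an
opaque string unless they actually need to do arithmetic on it.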

-- Jamie



Re: [Qemu-devel] Use a hex string

2011-05-23 Thread Jamie Lokier
Anthony Liguori wrote:
 On 05/23/2011 06:02 PM, Jamie Lokier wrote:
 Richard W.M. Jones wrote:
 The problem is to be able to send 64 bit memory and disk offsets
 faithfully.  This doesn't just fail to solve the problem, it's
 actually going to make it a whole lot worse.
 
 Such offsets would be so much more readable in hexadecimal.
 
 So why not use a string 0x80001234 instead?
 
 This doesn't change the fundamental issue here.  Javascript's internal 
 representation for integers isn't 2's complement, but IEEE 754.  This 
 means the expectations about how truncation/overflow is handled is 
 fundamentally different.

No, the point is it's a string so Javascript numerics doesn't come
into it, no overflow, no truncation, no arithmetic.  Every program
that wants to handle them handles them as a *string-valued attribute*
externally, and whatever representation it needs for a particular
attribute internally.  (Just as enum values are represented with
strings too).

In the unlikely event that someone wants to do arithmetic on these
values *in actual Javascript*, it'll be tricky for them, but the
representation doesn't have much to do with that.

-- Jamie



Re: [Qemu-devel] [PATCH 2/2 V7] qemu, qmp: add inject-nmi qmp command

2011-05-03 Thread Jamie Lokier
Gleb Natapov wrote:
 On Thu, Apr 07, 2011 at 04:39:58PM -0500, Anthony Liguori wrote:
  On 04/07/2011 01:51 PM, Gleb Natapov wrote:
  NMI does not have to generate crash dump on every guest we support.
  Actually even for windows guest it does not generate one without
  tweaking registry. For all I know there is a guest that checks mail when
  NMI arrives.
  
  And for all we know, a guest can respond to an ACPI poweroff event
  by tweeting the star spangled banner but we still call the
  corresponding QMP command system_poweroff.
  
 Correct :) But at least system_poweroff implements ACPI poweroff as
 defined by ACPI spec. NMI is not designed as core dump event and is not
 used as such by majority of the guests.

Imho acpi_poweroff or poweroff_button would have been a clearer name.
Or even 'sendkey poweroff' - it's just a button somewhere on the
keyboard on a lot of systems anyway.  Next to the email button and what
looks, on my laptop, like the play-a-tune button :-)

I put system_poweroff into some QEMU-controlling scripts once, and was
disappointed when several guests ignored it.

But it's done now.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2 V7] qemu, qmp: add inject-nmi qmp command

2011-05-03 Thread Jamie Lokier
Blue Swirl wrote:
 On Fri, Apr 8, 2011 at 9:04 AM, Gleb Natapov g...@redhat.com wrote:
  On Thu, Apr 07, 2011 at 04:41:03PM -0500, Anthony Liguori wrote:
  On 04/07/2011 02:17 PM, Gleb Natapov wrote:
  On Thu, Apr 07, 2011 at 10:04:00PM +0300, Blue Swirl wrote:
  On Thu, Apr 7, 2011 at 9:51 PM, Gleb Natapovg...@redhat.com  wrote:
  
  I'd prefer something more generic like these:
  raise /apic@fee0:l1int
  lower /i44FX-pcihost/e1000@03.0/pinD
  
  The clumsier syntax shouldn't be a problem, since this would be a
  system developer tool.
  
  Some kind of IRQ registration would be needed for this to work without
  lots of changes.
  True. The ability to trigger any interrupt line is very useful for
  debugging. I often re-implement it during debug.
 
  And it's a good thing to have, but exposing this as the only API to
  do something as simple as generating a guest crash dump is not the
  friendliest thing in the world to do to users.
 
  Well, this is not intended to be used by regular users directly and
  management can provide nicer interface for issuing NMI. But really,
  my point is that NMI actually generates guest core dump in such rare
  cases (only preconfigured Windows guests) that it doesn't warrant to
  name command as such. Management is in much better position to implement
  functionality with such name since it knows what type of guest it runs
  and can tell agent to configure guest accordingly.
 
 Does the management need to know about each and every debugging
 oriented interface? For example, info regs,  info mem, info irq
 and tracepoints?

Linux uses NMI for performance tracing, profiling, watchdog etc. so in
practice, NMI is very similar to the other IRQs.  I.e. highly guest
specific and depending on what's wired up to it.  Injecting NMI to all
CPUs at once does not make any sense for those Linux guests.

For Windows crash dumps, I think it makes sense to have a button
wired to NMI device, rather than inject-nmi directly, but I can see
that inject-nmi solves the intended problem quite neatly.

For Linux crash dumps, for example, there are other key combinations,
as well as watchdog devices, that can be used to trigger them.  A
virtual button wired to GPIO/PCI-IRQ/etc. device might be quite
handy for debugging Linux guests, and would fit comfortably in a
management interface.

-- Jamie



Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

2011-01-19 Thread Jamie Lokier
Chunqiang Tang wrote:
   Moreover, using a host file system not only adds overhead, but
   also introduces data integrity issues. Specifically, if I/Os uses 
 O_DSYNC,
   it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data
   integrity in the event of a host crash. See
   http://lwn.net/Articles/348739/ .
  
   You have the same issue with O_DIRECT when using a raw disk device
   too.  That is, O_DIRECT on a raw device does not guarantee integrity
   in the event of a host crash either, for mostly the same reasons.
  
  QEMU has semantics that use O_DIRECT safely; there is no issue here.
  When a drive is added with cache=none, QEMU not only uses O_DIRECT but
  also advertises an enabled write cache to the guest.
  
  The guest *must* flush the cache when it wants to ensure data is
  stable.  In the event of a host crash, all, some, or none of the I/O
  since the last flush may have made it to disk.  Each of these
  possibilities is fair game since the guest may only depend on writes
  being on disk if they completed and a successful flush was issued
  afterwards.
 
 Thank both of you for the explanation, which is very helpful to me. With 
 FVD's capability of eliminating the host file system and storing the image 
 on a logical volume, then perhaps we can always use O_DSYNC, because there 
 is little (or no?) LVM metadata that needs a flush on every write and 
 hence O_DSYNC  does not add overhead? I am not certain on this, and need 
 help for confirmation. If this is true, the guest does not need to flush 
 the cache. 

I think O_DSYNC does not work as you might expect on raw disk devices
and logical volumes.

That doesn't mean you don't need something for crash durability!
Instead, you need to issue the disk cache flushes in whatever way works.

It actually has a very *high* overhead.

The overhead isn't from metadata - it is from needing to flush the
disk cache after every write, which prevents the disk from reordering
writes.

If you don't issue the flushes, and the physical device has a volatile
write cache, then you cannot guarantee integrity in the event of a
host crash.

This can make a filesystem faster than a raw disk or logical volume in
some configurations, if the filesystem journals data writes to limit
the seeking needed to commit durably.
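
To make it concrete, a rough sketch of what crash durability on a raw
device or LV looks like on Linux, assuming a kernel recent enough that
fdatasync() on a block device actually sends a cache-flush command to
the drive (I believe newer kernels do; names and error handling here
are simplified):

    #include <sys/types.h>
    #include <unistd.h>

    /* Write, then ask the drive to empty its volatile write cache.
     * The fdatasync() is the expensive part discussed above; without
     * it the data may sit in the disk's cache across a power failure. */
    static int write_durably(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        return fdatasync(fd);
    }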

-- Jamie



Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

2011-01-18 Thread Jamie Lokier
Chunqiang Tang wrote:
 Doing both fault injection and verification together introduces some 
 subtlety. For example, even under the random failure mode, two disk writes 
 triggered by one VM-issued write must either fail together or succeed 
 together. Otherwise, the truth image and the test image will diverge and 
 verification won't succeed. Currently, qemu-test carefully works with the 
 'sim' driver to guarantee those conditions. Those conditions need be 
 retained after code restructure. 

If the real backend is a host system file or device, and AIO or
multi-threaded writes are used, you can't depend on two parallel disk
writes (triggered by one VM-issued write) failing together or
succeeding together.  All you can do is look at the error code after
each operation completes, and use it to prevent issuing later
operations.  You can't stop the other parallel operations that are
already in progress.

Is that an issue in your design assumptions?

Thanks,
-- Jamie



Re: [Qemu-devel] [RFC] Propose the Fast Virtual Disk (FVD) image format that outperforms QCOW2 by 249%

2011-01-18 Thread Jamie Lokier
Chunqiang Tang wrote:
  Based on my limited understanding, I think FVD shares a 
  lot in common with the COW format (block/cow.c).
  
  But I think most of the advantages you mention could be considered as 
  additions to either qcow2 or qed.  At any rate, the right way to have 
  that discussion is in the form of patches on the ML.
 
 FVD is much more advanced than block/cow.c. I would be happy to discuss 
 possible leverage, but setting aside the details of QCOW2, QED, and FVD, 
 let’s start with a discussion of what is needed for the next generation 
 image format. 

Thank you for the detailed description.

FVD looks quite good to me; it seems simple yet performant, due to
its smart design.

 Moreover, using a host file system not only adds overhead, but 
 also introduces data integrity issues. Specifically, if I/Os uses O_DSYNC, 
 it may be too slow. If I/Os use O_DIRECT, it cannot guarantee data 
 integrity in the event of a host crash. See 
 http://lwn.net/Articles/348739/ . 

You have the same issue with O_DIRECT when using a raw disk device
too.  That is, O_DIRECT on a raw device does not guarantee integrity
in the event of a host crash either, for mostly the same reasons.

-- Jamie



Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

2010-09-10 Thread Jamie Lokier
Stefan Hajnoczi wrote:
 Since there is no ordering imposed between the data write and metadata
 update, the following scenarios may occur on crash:
 1. Neither data write nor metadata update reach the disk.  This is
 fine, qed metadata has not been corrupted.
 2. Data reaches disk but metadata update does not.  We have leaked a
 cluster but not corrupted metadata.  Leaked clusters can be detected
 with qemu-img check.
 3. Metadata update reaches disk but data does not.  The interesting
 case!  The L2 table now points to a cluster which is beyond the last
 cluster in the image file.  Remember that file size is rounded down by
 cluster size, so partial data writes are discarded and this case
 applies.

Better add:

4. File size is extended fully, but the data didn't all reach the disk.
5. Metadata is partially updated.
6. (Nasty) Metadata partial write has clobbered neighbouring
   metadata which wasn't meant to be changed.  (This may happen up
   to a sector size on normal hard disks - data is hard to come by.
   This happens to a much larger file range on flash and RAIDs
   sometimes - I call it the radius of destruction).

6 can also happen when doing the L1 update mentioned earlier, in
which case you might lose a much larger part of the guest image.

-- Jamie



Re: [Qemu-devel] Anyone seeing huge slowdown launching qemu with Linux 2.6.35?

2010-08-03 Thread Jamie Lokier
Richard W.M. Jones wrote:
 We could demand that OSes write device drivers for more qemu devices
 -- already OS vendors write thousands of device drivers for all sorts
 of obscure devices, so this isn't really much of a demand for them.
 In fact, they're already doing it.

Result: Most OSes not working with qemu?

Actually we seem to be going that way.  Recent qemus don't work with
older versions of Windows any more, so we have to use different
versions of qemu for different guests.

-- Jamie



Re: [Qemu-devel] [PATCH] move 'unsafe' to end of caching modes in help

2010-07-21 Thread Jamie Lokier
Anthony Liguori wrote:
 On 07/21/2010 04:58 PM, Daniel P. Berrange wrote:
 Yes there is.  Use the version number.
  
 The version number is not suitable, because features can be removed at
 compile time and/or
 
 I don't see any features that libvirt would need to know about that are 
 disabled at compile time that aren't disabled by platform features (i.e. 
 being on a Linux vs. Windows host).
 
   added via patch backports.
 
 If a distro backports a feature, it should change the QEMU version 
 string.  If it doesn't, that's a distro problem.

To what version?  It can't use the newer version if it only backports
a subset of features; it would have to use a distro-specific version
number or a version string that somehow encodes features independently of
the version number itself, by some agreed libvirt standard.  Which
isn't far off advertising features in the help string :-)

-- Jamie



Re: [Qemu-devel] [Bug 595117] Re: qemu-nbd slow and missing writeback cache option

2010-06-23 Thread Jamie Lokier
Serge Hallyn wrote:
 The default of qemu-img (of using O_SYNC) is not very sensible
 because anyway, the client (the kernel) uses caches (write-back),
 (and qemu-nbd -d doesn't flush those by the way). So if for
 instance qemu-nbd is killed, regardless of whether qemu-nbd uses
 O_SYNC, O_DIRECT or not, the data in the image will not be
 consistent anyway, unless syncs are done by the client (like fsync
 on the nbd device or sync mount option), and with qemu-nbd's O_SYNC
 mode, those syncs will be extremely slow.

Do the client syncs cause the nbd server to fsync or fdatasync the file?

 It appears it is because by default the disk image it serves is open
 with O_SYNC. The --nocache option, unintuitively, makes matters a
 bit better because it causes the image to be open with O_DIRECT
 instead of O_SYNC.
[...]
 --cache=off is the same as --nocache (that is use O_DIRECT),
 writethrough is using O_SYNC and is still the default so this patch
 doesn't change the functionality. writeback is none of those flags,
 so is the addition of this patch. The patch also does an fsync upon
 qemu-nbd -d to make sure data is flushed to the image before
 removing the nbd.

I really wish qemu's options didn't give the false impression
nocache does less caching than writethrough.  O_DIRECT does
caching in the disk controller/hardware, while O_SYNC hopefully does
not, nowadays.

-- Jamie



Re: [Qemu-devel] Re: block: format vs. protocol, and how they stack

2010-06-22 Thread Jamie Lokier
Christoph Hellwig wrote:
 On Mon, Jun 21, 2010 at 09:51:23AM -0500, Anthony Liguori wrote:
  I can appreciate the desire to keep protocols and formats as an internal 
  distinction but as a user visible concept, I think your two examples 
  highlight why exposing protocols as formats make sense.  A user doesn't 
  necessarily care what's happening under the cover.  I think:
  
  -blockdev format=qcow2,file=image.qcow2,id=blk1
  
  and:
  
  -blockdev protocol=vvfat,file=/tmp/dir,id=blk1
  
  Would cause a bit of confusion.  It's not immediately clear why vvfat is 
  a protocol and qcow2 isn't.  It's really an implementation detail that 
  we implement qcow2 on top of a protocol called file.
 
 Everything involving vvfat will end up in sheer confusion, and that's
 because vvfat is such a beast.  But it's a rather traditional example
 of a protocol.  Unlike qcow2 / vmdk / vpc it can not be stacked on
 an arbitrary protocol (file/nbd/http), but rather accesses a directory
 tree.

There is no technical reason why vvfat couldn't be stacked on top of
FTP or HTTP-DAV or RSYNC or SCP, or even wget -R.  Basically
anything with multiple files addressed by paths, and a way to retrieve
directories to find all the paths.

vvfat doesn't stack on top of file-like protocols, it stacks
conceptually on top of directory tree-like protocols, of which there
is currently one.  The arrival of Plan9fs may motivate the addition of
more.

You can't meaningfully stack qcow2 or any other format than raw on
top of the virtual file image created by vvfat.  So that's another reason
it isn't the same as other protocols.

-- Jamie



Re: [Qemu-devel] Re: block: format vs. protocol, and how they stack

2010-06-22 Thread Jamie Lokier
Kevin Wolf wrote:
  The protocol parlance breaks down when we move away from the simple
  stuff.  For instance, qcow2 needs two children: the block driver
  providing the delta bits (in qcow2 format), and the block driver
  providing the base bits (whose configuration happens to be stored in the
  delta bits).  
 
 Backing files are different. When talking about opening images (which is
 what we do here) the main difference is that they can be opened only
 after the image itself has been opened. I don't think we should include
 them in this discussion.

Imho, being unable to override the qcow2 backing file from the command
line / monitor is very annoying, if you've moved files from another
machine or just renamed them for tidiness.  It's especially bad if the
supplied qcow2 file has an absolute path in it, quite bad if it has
subdirectories or .. components, annoying if you've been given
several qcow2 files all of which have the name backing-file stored
in them but are different images because they were originally on
different machines, and awful if it has the name of a block device in it.

So, imho, for the discussion of command line / QMP options, a place
should be reserved for giving the name of the backing file
through command line/monitor/QMP, along with the backing file's
formats/protocols/transports/options, and so on recursively in a tree
structure of arbitrary depth.

There is also the matter of qcow2 files encoding the path, but not
necessarily all the blockdev options that you might want to use to
access the backing file, such as cache=.

In QMP it's obviously quite simple to accept a full child blockdev
specification object as a qcow2-specific parameter, thus not needing
any further discussion in this thread.  It's less obvious how to do it
on the command line or human monitor.

-- Jamie



Re: [Qemu-devel] Re: block: format vs. protocol, and how they stack

2010-06-22 Thread Jamie Lokier
Markus Armbruster wrote:
 A possible reason why we currently expose format and protocol at the
 user interface is to avoid stacking there.

Pragmatic solution?: A few generic flags in each stacking module
(format/protocol/transport), which govern which other modules are
allowed to stack on top or underneath.

For example, vvfat may provide a blockdev-like abstraction, along with
flags STACK_ABOVE_ONLY_RAW | STACK_BELOW_ONLY_DIRECTORY, which means
raw and blkdebug are allowed above (of course ;-) but other things
like the image formats shouldn't be.  And below, it can't stack on a
blockdev-like abstraction, but needs a directory and uses filesystem
operations on it - the thing that Plan9fs needs.

Btw, I think we expose format because virtual disk image file
format is a useful and meaningful concept to users.  When someone
needs to use a .VMDK file, they know it as "a VMDK format file", not
"I must use the VMDK protocol with this file".

-- Jamie



Re: [Qemu-devel] Re: [Bug 596106] Re: kvm to emulate 64 bit cpu on 32 bit host

2010-06-20 Thread Jamie Lokier
Paolo Bonzini wrote:
 On 06/19/2010 03:01 PM, Natalia Portillo wrote:
 VMWare is able to do it, we should be able.
 
 They do it like TCG does it, not like KVM.

I heard rumours VMWare use KVM-style chip virtualisation when running
a 64-bit guest on a 32-bit host kernel on 64-bit hardware.

If true, that makes particular sense for Windows host users, who can't
just drop in a 64-bit host kernel without breaking their userspace
thoroughly.  (If it was that easy, 64-bit Windows wouldn't use
a surreptitious VM to run 32-bit apps :-).

It seems like a good way for Windows users to run a single 64-bit app
on an otherwise 32-bit system that's working fine.

On Linux hosts I would expect you can drop in a 64-bit kernel, while
continuing to run a 32-bit userspace.  But I don't know if (a) that's
entirely true, and (b) whether distro packaging blocks that sort of thing
from being easy.

Unfortunately even that doesn't help people who just want to run a
64-bit VM as an ordinary user and aren't permitted to change their
Linux host kernel, e.g. a shared system, or some rented servers.

-- Jamie



Re: [Qemu-devel] [Bug 596106] Re: kvm to emulate 64 bit cpu on 32 bit host

2010-06-20 Thread Jamie Lokier
Natalia Portillo wrote:
You got the point wrong, I'm talking running WITH 64 bit hardware in a
32 bit guest.
This is done in Mac OS X Leopard (kernel is only 32 bit) and Mac OS X
Snow Leopard (using 32 bit kernel not 64 bit one) by VMWare, Parallels
and VirtualBox, as well as on Windows 32 bit using VMWare (dunno about
VBox and Parallels, VirtualPC is unable to run 64 bit guests at all
even on 64 bit hosts), just provided of course, the hardware is 64
bit.

Ah yes, Mac OS X too.

Apart from breaking userspace, the other reason people stick with
32-bit host kernels on both Windows and Macs is the 64-bit device
drivers often don't work properly, or aren't available at all.  They
continue to improve, but still aren't as mature and dependable as
32-bit drivers.

This is also true of Linux 64-bit kernels - both bugs and unavailable
third party drivers/firmware.  (But less so than the other OSes.)
So even with Linux people cannot assume dropping in a 64-bit host
kernel is always free of kernel/driver issues.

Marking this feature request "won't fix" is just a statement that KVM
developers aren't inclined to support this feature.

But there's nothing to stop an interested contributor having a go.
I'm sure if it works and the code is clean enough it will be accepted.

VirtualPC is unable to run 64 bit guests at all even on 64 bit
hosts

Are you sure?  Microsoft provides numerous downloadable 64-bit guest
Windows images, and VirtualPC is Microsoft's; they must be running on
something.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 09/10] pci: set PCI multi-function bit appropriately.

2010-06-18 Thread Jamie Lokier
Isaku Yamahata wrote:
 On Fri, Jun 18, 2010 at 03:44:04PM +0300, Michael S. Tsirkin wrote:
  If we really want the ability to put unrelated devices
  as functions in a single one, let's just add
  a 'multifunction' qdev property, and validate that
  it is set appropriately.
 
 I think unrelated is policy. There is no obvious way to determine
 which functions can be in a same device.
 For example, popular chipset contains isa bridge, ide controller,
 usb controller, sound and modem in a single device as functions.
 It's up to hardware designer policy which functions are grouped into
 a device.

In hardware terms, quad-port ethernet controllers often present
themselves as a PCI bridge with four independent PCI ethernet
controllers, so they work with standard drivers.  Even though all four
are in a single chip.  Some USB devices do the same, presenting a
small bulk storage device to ship Windows drivers [:roll-eyes: ;-)]
alongside the actual device.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 1/2] qemu-io: retry fgets() when errno is EINTRg

2010-06-17 Thread Jamie Lokier
Kevin Wolf wrote:
 Am 16.06.2010 18:52, schrieb MORITA Kazutaka:
  At Wed, 16 Jun 2010 13:04:47 +0200,
  Kevin Wolf wrote:
 
  Am 15.06.2010 19:53, schrieb MORITA Kazutaka:
  posix-aio-compat sends a signal in aio operations, so we should
  consider that fgets() could be interrupted here.
 
  Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
  ---
   cmd.c |3 +++
   1 files changed, 3 insertions(+), 0 deletions(-)
 
  diff --git a/cmd.c b/cmd.c
  index 2336334..460df92 100644
  --- a/cmd.c
  +++ b/cmd.c
  @@ -272,7 +272,10 @@ fetchline(void)
return NULL;
printf(%s, get_prompt());
fflush(stdout);
  +again:
if (!fgets(line, MAXREADLINESZ, stdin)) {
  + if (errno == EINTR)
  + goto again;
free(line);
return NULL;
}
 
  This looks like a loop replaced by goto (and braces are missing). What
  about this instead?
 
  do {
  ret = fgets(...)
  } while (ret == NULL  errno == EINTR)
 
  if (ret == NULL) {
 fail
  }
 
  
  I agree.
  
  However, it seems that my second patch have already solved the
  problem.  We register this readline routines as an aio handler now, so
  fgets() does not block and cannot return with EINTR.
  
  This patch looks no longer needed, sorry.
 
 Good point. Thanks for having a look.

Anyway, are you sure stdio functions can be interrupted with EINTR?
Linus reminds us that some stdio functions have to retry internally
anyway:

http://comments.gmane.org/gmane.comp.version-control.git/18285
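
For what it's worth, if a retry did turn out to be needed, I think the
loop would have to be a little more careful than the sketch above,
because fgets() also returns NULL on plain end-of-file and doesn't
touch errno in that case - something like this (untested, names only
for illustration):

    #include <errno.h>
    #include <stdio.h>

    /* Retry fgets() only when the failure really was an interrupted
     * read; distinguish EOF from error and clear the error flag
     * before retrying. */
    static char *fetch_line(char *line, size_t size)
    {
        char *ret;

        do {
            clearerr(stdin);
            errno = 0;
            ret = fgets(line, size, stdin);
        } while (ret == NULL && ferror(stdin) && errno == EINTR);

        return ret;
    }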

-- Jamie



Re: [Qemu-devel] VLIW?

2010-06-17 Thread Jamie Lokier
Gibbons, Scott wrote:
 My architecture is an Interleaved Multithreading VLIW architecture.  One 
 bundle (packet) executes per processor cycle, rotating between threads (i.e., 
 thread 0 executes at time 0, thread 1 executes at time 1, then thread 0 
 executes at time 2, etc.).  Each thread has its own context (including a 
 program counter).  I'm not sure what kind of performance I would get in 
 translating a single bundle at a time (or maybe I'm misunderstanding).
 
 I think I'll get basic single-thread operation working first, then attempt 
 multithreading when I have a spare month or so.

I know of another CPU architecture that has fine-grained hardware
threads and has working qemu emulation at a useful performance for
debugging kernels, but it's not public as far as I know, and I don't
know if it's ok to name it.  I don't think it's VLIW, only that it has
lots of hardware threads and a working qemu model.

-- Jamie



Re: [Qemu-devel] Re: [PATCH V4 2/3] qemu: Generic task offloading framework: threadlets

2010-06-16 Thread Jamie Lokier
Paolo Bonzini wrote:
 These should be (at least for now) block-obj-$(CONFIG_POSIX).
 
 +while (QTAILQ_EMPTY(&(queue->request_list)) &&
 +   (ret != ETIMEDOUT)) {
 +ret = qemu_cond_timedwait(&(queue->cond),
 +&(queue->lock), 10*10);
 +}
 
 Using qemu_cond_timedwait is a hack for not properly broadcasting the 
 condvar in flush_threadlet_queue.

Are you sure?  It looks like it also expires idle threads after a
fixed amount of idle time.

-- Jamie



Re: [Qemu-devel] Re: [PATCH V4 2/3] qemu: Generic task offloading framework: threadlets

2010-06-16 Thread Jamie Lokier
Anthony Liguori wrote:
 On 06/16/2010 09:29 AM, Paolo Bonzini wrote:
 On 06/16/2010 04:22 PM, Jamie Lokier wrote:
 Paolo Bonzini wrote:
 These should be (at least for now) block-obj-$(CONFIG_POSIX).
 
 +while (QTAILQ_EMPTY(&(queue->request_list)) &&
 +   (ret != ETIMEDOUT)) {
 +ret = qemu_cond_timedwait(&(queue->cond),
 + &(queue->lock), 10*10);
 +}
 
 Using qemu_cond_timedwait is a hack for not properly broadcasting the
 condvar in flush_threadlet_queue.
 
 Are you sure?  It looks like it also expires idle threads after a
 fixed amount of idle time.
 
 Unnecessary idle threads are immediately expired as soon as the 
 threadlet exits if necessary, since here
 
 If a threadlet is waiting to consume more work, unless we do a 
 pthread_cancel (I dislike cancellation) it will keep waiting until it 
 gets more work (which would mean it's not actually idle)...

There's some mild abuse of the mutex/condvar going on.

As (queue->exit || queue->idle_threads > queue->min_threads) is a
condition for breaking out of the loop, that condition ought to be
checked in the mutex-cond_wait region, but it isn't.

It doesn't matter here because the queue is empty when queue->exit,
and the idle > min_threads condition can't become true.

 The min/max_threads parameters of the queue are currently immutable, 
 so it can never happen that a thread has to be expired while it's 
 waiting.  It may well become true in the future, in which case the 
 condvar will have to be broadcast when min_threads changes.

Broadcasting when min_threads decreases wouldn't be enough, because
min_threads isn't checked inside the mutex-cond_wait region.

-- Jamie



Re: [Qemu-devel] Re: [PATCH V4 2/3] qemu: Generic task offloading framework: threadlets

2010-06-16 Thread Jamie Lokier
Jamie Lokier wrote:
 Anthony Liguori wrote:
  On 06/16/2010 09:29 AM, Paolo Bonzini wrote:
  On 06/16/2010 04:22 PM, Jamie Lokier wrote:
  Paolo Bonzini wrote:
  These should be (at least for now) block-obj-$(CONFIG_POSIX).
  
  +while (QTAILQ_EMPTY(&(queue->request_list)) &&
  +   (ret != ETIMEDOUT)) {
  +ret = qemu_cond_timedwait(&(queue->cond),
  + &(queue->lock), 10*10);
  +}
  
  Using qemu_cond_timedwait is a hack for not properly broadcasting the
  condvar in flush_threadlet_queue.
  
  Are you sure?  It looks like it also expires idle threads after a
  fixed amount of idle time.
  
  Unnecessary idle threads are immediately expired as soon as the 
  threadlet exits if necessary, since here
  
  If a threadlet is waiting to consume more work, unless we do a 
  pthread_cancel (I dislike cancellation) it will keep waiting until it 
  gets more work (which would mean it's not actually idle)...
 
 There's some mild abuse of the mutex/condvar going on.
 
 As (queue->exit || queue->idle_threads > queue->min_threads) is a
 condition for breaking out of the loop, that condition ought to be
 checked in the mutex-cond_wait region, but it isn't.
 
 It doesn't matter here because the queue is empty when queue->exit,
 and the idle > min_threads condition can't become true.

Sorry, thinko.  It does matter when queue->exit, precisely because the
queue is empty :-)

Even cond_broadcast after queue->exit is set isn't enough to remove
the need for the timed wait hack.

Putting the whole condition inside the mutex-cond_wait region, not
just empty queue test, will remove the need for timed wait.  Broadcast
is still needed, or alternatively a cond_signal from each exiting
thread will allow them to wake and close without a thundering herd.
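
Concretely, the shape I have in mind is something like this - a
minimal sketch with plain pthreads and a stand-in struct, not the
actual threadlet code, with field names only loosely following the
patch:

    #include <pthread.h>
    #include <stdbool.h>

    struct queue {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        bool exit;
        int pending;              /* stands in for the request list */
        int idle_threads, min_threads;
    };

    /* Worker: evaluate the whole wake-up condition while holding the
     * lock, so a plain cond_wait is enough and no timeout is needed. */
    static void worker_wait(struct queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->pending == 0 && !q->exit &&
               q->idle_threads <= q->min_threads) {
            pthread_cond_wait(&q->cond, &q->lock);
        }
        pthread_mutex_unlock(&q->lock);
    }

    /* Shutdown: set exit under the lock and wake every waiter;
     * alternatively each exiting thread can cond_signal the next
     * one to avoid a thundering herd. */
    static void queue_shutdown(struct queue *q)
    {
        pthread_mutex_lock(&q->lock);
        q->exit = true;
        pthread_cond_broadcast(&q->cond);
        pthread_mutex_unlock(&q->lock);
    }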

-- Jamie



Re: [Qemu-devel] Re: [CFR 6/10] cont command

2010-06-16 Thread Jamie Lokier
Anthony Liguori wrote:
 On 06/16/2010 11:17 AM, Juan Quintela wrote:
 Consider the example that I showed you:
 
 (host A) (host B)
 launch qemu launch qemu -incoming
 migrate host B
  .
  do your things
  exit/poweroff/...
 
 At this point you have a qemu launched on machine A, with nothing on
 machine B.  Running cont on machine A has disastrous consequences,
 and there is no way to prevent it :(

 
 If there was a reasonable belief that it wouldn't result in disaster, I 
 would fully support you.  However, I can't think of any rational reason 
 why someone would do this.  I can't think of a better analogy to 
 shooting yourself in the foot.

That looks like a useful way to fork a guest for testing, if host B is
launched with -snapshot, or a copy of the disk image, or a qcow2 child of it.

Does it work? :-)

-- Jamie



Re: [Qemu-devel] [PATCH] virtio-blk: Avoid zeroing every request structure

2010-05-29 Thread Jamie Lokier
Alexander Graf wrote:
 Anthony Liguori wrote:
  I'd prefer to stick to bug fixes for stable releases.  Performance
  improvements are a good motivation for people to upgrade to 0.13 :-)
 
 In general I agree, but this one looks like a really simple one.

Besides, there are too many reported guest regressions at the moment
to upgrade if using any of them.

-- Jamie



Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-19 Thread Jamie Lokier
Michael Tokarev wrote:
 Anthony Liguori wrote:
 []
  For the Bug Day, anything is interesting IMHO.  My main interest is to
  get as many people involved in testing and bug fixing as possible.  If
  folks are interested in testing specific things like unusual or older
  OSes, I'm happy to see it!
 
 Well, interesting or not, but I for one don't know what to do with the
 results.  There was a thread on kvm@ about sigsegv in cirrus code when
 running winNT. The issue has been identified and appears to be fixed,
 as in, kvm process does not SIGSEGV anymore, but it does not work anyway,
 now printing:
 
  BUG: kvm_dirty_pages_log_enable_slot: invalid parameters
 
 with garbled guest display.  Thanks goes to Stefano Stabellini for
 finding the SIGSEGV case, but unfortunately his hard work isn't quite
 useful since the behavour isn't very much different from the previous
 version... ;)

A BUG: is good to see in a bug report: It gives you something
specific to analyse.  Good luck ;-)

Imho, it'd be quite handy to keep a timeline of working/non-working
guests in a table somewhere, and which qemu versions and options they
were observed to work or break with.

 Also, thanks to Andre Przywara, whole winNT thing works but it requires
 -cpu qemu64,level=1 (or level=2 or =3), -- _not_ with default CPU.  This
 is also testing, but it's not obvious what to do with the result...

Doesn't WinNT work with qemu32 or kvm32?
It's a 32-bit OS after all.

- Jamie



Re: [Qemu-devel] [PATCH] Add QEMU DirectFB display driver

2010-05-19 Thread Jamie Lokier
Julian Pidancet wrote:
 So after all, why not implementing our own VT switching and using
 directly the fbdev interface.

It's a good idea.  VT switching isn't hard to track reliably.

Being able to tell qemu, through the monitor, to attach/detach from a
particular VT might be a nice easy bonus too.

 I just checked the linux fbdev code to
 find out if it provides a blitting method that could perform
 the pixel color conversion automatically for Qemu.
 
 Unfortunately, from what I have read from the
 drivers/video/cfbimgblt.c file in the linux tree, there's no such
 thing, and it also means that we cannot take advantage of any kind
 of hardware pixel format conversion.

I'm not sure if DirectFB provides that particular operation, but I
have the impression it's the sort of thing DirectFB is intended for: A
framebuffer, plus a variety of 2d acceleration methods (and other
things like multi-buffering, video and alpha channel overlay).

-- Jamie



Re: [Qemu-devel] [RFC] Bug Day - June 1st, 2010

2010-05-18 Thread Jamie Lokier
Natalia Portillo wrote:
 Hi,
 
  - We'll try to migrate as many confirmable bugs from the Source Forge 
  tracker to Launchpad.
 I think that part of the bug day should also include retesting OSes that 
 appear in the OS Support List as having bugs and confirming if the bug is still
 present and if it's in Launchpad or not.

There have been reports of several legacy OSes being unable to install
or boot in the newer qemu while working in the older one.  They're
probably not in the OS Support List though.  Are they effectively
uninteresting for the purpose of the 0.13 release?

Unfortunately I doubt I will have time to participate in the Bug Day.

Thanks,
-- Jamie




[Qemu-devel] A20 line control (was Re: [PATCH 0/2] pckbd improvements)

2010-05-17 Thread Jamie Lokier
Blue Swirl wrote:
 On 5/16/10, Jamie Lokier ja...@shareable.org wrote:
  Blue Swirl wrote:
On 5/16/10, Paolo Bonzini pbonz...@redhat.com wrote:
 On 05/15/2010 11:49 AM, Blue Swirl wrote:

  In 2/2, A20 logic changes a bit but I doubt any guest would be broken
  if A20 line written through I/O port 92 couldn't be read via i8042.
  The reverse (write using i8042 and read port 92) will work.
 

  Why take the risk?
   
The alternative is to route a signal from port 92 to i8042. Or maybe
port 92 should belong to i8042, that could make things simpler but
then the port would appear on non-PC architectures as well.
   
But I doubt any OS would depend on such details, because the details
seem to be murky:
http://www.win.tue.nl/~aeb/linux/kbd/A20.html
 
 
  It's not hard to imagine some DOS memory driver or 286/386 DOS
   extender expecting to read the bit, if that's normal on PCs.
 
   The earlier PCs didn't have port 92h, so presumably older DOS software
   uses the keyboard controller exclusively.
 
   The details are murky, but on the other hand, I remember back in day,
   A20 line was common knowledge amongst DOS hackers on 286s and 386s,
   and the time I remember it from, port 92h was not available, so it
   can't have been too murky to use the i8042.
 
 Right, but with this patch, writing to and reading from i8042 would
 still work, likewise for writing to and reading from port 92. Even
 writing via i8042, but reading via port 92 would work. What would not
 work reliably (even then, 50% probability of being correct) is when
 port 92 is written, but reading happens with i8042.
 
    i8042 emulation isn't the same on a PC as on a non-PC because of the
   PC-specific wiring (outside the chip), such as its ability to reset
   the motherboard.  I don't see that it makes sense for qemu to pretend
   there are no differences at all.  Or, perhaps it makes sense to imply
   different GPIO wiring, separate from the i8042 itself.
 
   On the other hand, something which makes sense to me:
 
   In a PC, are port 92h and i8042's outputs OR'd together or AND'd
   together to control A20 proper?  Then they'd be two independent
   signals, and shouldn't mirror each other.
 
  That's exactly what I meant, how could a random OS designer trust
 that the signals are combined the same way on every PC? With logic
 circuits, i8042 would still see its own output only, not the combined
 signal. If instead the signals were wired together, with some
 combination of inputs the output would not match what QEMU generates.
 Currently QEMU does not implement any logic for A20 line, which
 obviously can't match real hardware (or maybe some kind of dual port
 memory).

http://www.openwatcom.org/index.php/A20_Line

According to that page, MS-DOS's HIMEM.SYS tries 17 different methods
to control the A20 line! :-) Meanwhile, DOS/4GW, a DOS extender (there
are lots of those) allows the method to be set manually.

But there are only two common ones that are still implemented in
modern PC hardware: The keyboard commands to read, modify and write
the output port, and port 92h.

The random DOS-extender designers had to try each method, by checking
if the address space was actually wrapped.

With port 92h being known as the fast A20 gate, I'm pretty sure any
program which includes that method will try that one first.

According to the wiki page (actually, my interpretation of it), the
output signal from port 92h is usually OR'd with the output signal
from the keyboard controller port.  That is, they are independent signals.
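
For reference, the fast A20 method itself is tiny.  A hedged sketch
(user-space flavoured so that it compiles; the real thing runs on bare
metal or under DOS): bit 1 of port 92h is the A20 gate, and bit 0 is a
fast-reset line that must be left alone.

    #include <sys/io.h>   /* inb()/outb(); needs ioperm() and root on a Linux host */

    static void enable_fast_a20(void)
    {
        unsigned char v = inb(0x92);

        if (!(v & 0x02))
            outb(v | 0x02, 0x92);   /* set the A20 bit, never touch bit 0 */
    }

It's easy to see why HIMEM.SYS and friends try this method first when
it looks available.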

-- Jamie






Re: [Qemu-devel] Re: [PATCH] Add cache=volatile parameter to -drive

2010-05-17 Thread Jamie Lokier
Alexander Graf wrote:
 
 On 17.05.2010, at 18:26, Anthony Liguori wrote:
 
  On 05/17/2010 11:23 AM, Paul Brook wrote:
  I don't see a difference between the results. Apparently the barrier
  option doesn't change a thing.

  Ok.  I don't like it, but I can see how it's compelling.  I'd like to
  see the documentation improved though.  I also think a warning printed
  on stdio about the safety of the option would be appropriate.
  
  I disagree with this last bit.
  
  Errors should be issued if the user did something wrong.
  Warnings should be issued if qemu did (or will soon do) something other 
  than
  what the user requested, or otherwise made questionable decisions on the
  user's behalf.
  
  In this case we're doing exactly what the user requested. The only 
  plausible
  failure case is where a user is blindly trying options that they clearly 
  don't
  understand or read the documentation for. I have zero sympathy for 
  complaints
   like Someone on the Internet told me to use --breakme, and it broke things.

  
  I see it as the equivalent to the Taint bit in Linux.  I want to make it 
  clear to users up front that if you use this option, and you have data loss 
  issues, don't complain.
  
  Just putting something in qemu-doc.texi is not enough IMHO.  Few people 
  actually read it.
 
 But that's why it's no default and also called volatile. If you prefer, we 
 can call it cache=destroys_your_image.

With that semantic, a future iteration of cache=volatile could even
avoid writing to the backing file at all.  I wonder whether that
would be faster still.  Anyone fancy doing a hack with the
whole guest image as a big malloc inside qemu?  I don't have enough RAM :-)

-- Jamie



[Qemu-devel] Re: [PATCH 3/8] Add QBuffer

2010-05-17 Thread Jamie Lokier
Jan Kiszka wrote:
 Jamie Lokier wrote:
  Anthony Liguori wrote:
  Instead of encoding just as a string, it would be a good idea to encode 
  it as something like:
 
  {'__class__': 'base64', 'data': ...}
  
  Is there a benefit to the class indirection, over simply a keyword?:
  
  {'__base64__': ...}
  
  __class__ seems to suggest much more than it's being used for here.
  
 
 Depending on how sophisticated your parser is, you could directly push
 the result into an object of the proper type. And we can add more
 complex objects in the future that do not only consists of a single data
 key. Note that this extension is not just about encoding, it is about
 typecasting (dict - custom type).

Sure, if that's the plan.

Does it make sense to combine encoding and custom types in this way?
It looks like mixing syntax and semantics, which has consequences for
code using generic parsers with separate semantic layer, but I realise
there's no correct answer.

Back to the syntax: I'm under the impression from earlier discussion
that the '__*__' keyspace is reserved, so even types could use the
compact syntax?

Or is there something Javascript-ish (and not merely JSON-ish) about
'__class__' in particular which makes it appropriate?

-- Jamie



Re: [Qemu-devel] [PATCH 3/8] Add QBuffer

2010-05-16 Thread Jamie Lokier
Anthony Liguori wrote:
 Instead of encoding just as a string, it would be a good idea to encode 
 it as something like:
 
 {'__class__': 'base64', 'data': ...}

Is there a benefit to the class indirection, over simply a keyword?:

{'__base64__': ...}

__class__ seems to suggest much more than it's being used for here.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 0/2] pckbd improvements

2010-05-16 Thread Jamie Lokier
Blue Swirl wrote:
 On 5/16/10, Paolo Bonzini pbonz...@redhat.com wrote:
  On 05/15/2010 11:49 AM, Blue Swirl wrote:
 
   In 2/2, A20 logic changes a bit but I doubt any guest would be broken
   if A20 line written through I/O port 92 couldn't be read via i8042.
   The reverse (write using i8042 and read port 92) will work.
  
 
   Why take the risk?
 
 The alternative is to route a signal from port 92 to i8042. Or maybe
 port 92 should belong to i8042, that could make things simpler but
 then the port would appear on non-PC architectures as well.
 
 But I doubt any OS would depend on such details, because the details
 seem to be murky:
 http://www.win.tue.nl/~aeb/linux/kbd/A20.html

It's not hard to imagine some DOS memory driver or 286/386 DOS
extender expecting to read the bit, if that's normal on PCs.

The earlier PCs didn't have port 92h, so presumably older DOS software
uses the keyboard controller exclusively.

The details are murky, but on the other hand, I remember back in day,
A20 line was common knowledge amongst DOS hackers on 286s and 386s,
and the time I remember it from, port 92h was not available, so it
can't have been too murky to use the i8042.

i8042 emulation isn't the same on a PC as on a non-PC because of the
PC-specific wiring (outside the chip), such as its ability to reset
the motherboard.  I don't see that it makes sense for qemu to pretend
there are no differences at all.  Or, perhaps it makes sense to imply
different GPIO wiring, separate from the i8042 itself.

On the other hand, something which makes sense to me:

In a PC, are port 92h and i8042's outputs OR'd together or AND'd
together to control A20 proper?  Then they'd be two independent
signals, and shouldn't mirror each other.

-- Jamie



Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus

2010-05-13 Thread Jamie Lokier
Stefano Stabellini wrote:
  I think we need to consider only dstpitch for a full invalidate.  We 
  might be copying an offscreen bitmap into the screen, and srcpitch is 
  likely to be the bitmap width instead of the screen pitch.
 
 Agreed.

Even when copying on-screen (or partially on-screen), the srcpitch
does not affect the invalidated area.  The source area might be
strange (parallelogram, single line repeated), but srcpitch should
only affect whether qemu_console_copy can be used, not the
invalidation.

-- Jamie



Re: [Qemu-devel] [PATCH 2/4] Add support for execution from ROMs in IO device mode

2010-05-13 Thread Jamie Lokier
Jan Kiszka wrote:
 While IO_MEM_ROMD marks an I/O memory region as read/execute from RAM,
 but write to I/O handler, there is no flag indicating that an I/O
 region which is fully managed by I/O handlers can still be hosting
 executable code. One use case for this are flash device models that
 switch to I/O mode during reprogramming. Not all reprogramming states
 modify to read data, thus practically allow to continue execution.
 Moreover, we need to avoid switching the modes too frequently for
 performance reasons which requires fetching opcodes while still in I/O
 device mode.

I like this change.

Does fetching opcodes while still in I/O device mode fetch opcodes
from the RAM backing, or via the I/O read handlers?

If the latter, I'm wondering how KVM would cope with that.

Thanks,
-- Jamie



Re: [Qemu-devel] [PATCH 0/2] Enable qemu block layer to not flush

2010-05-12 Thread Jamie Lokier
Stefan Hajnoczi wrote:
 Why add a nop AIO operation instead of setting
 BlockDriverState->enable_write_cache to zero?  In that case no write
 cache would be reported to the guest (just like cache=writethrough).

Hmm.  If the guest sees write cache absent, that prevents changing the
cache policy on the host later (from not flushing to flushing), which
you might want to do after an OS install has finished and booted up.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-12 Thread Jamie Lokier
Paul Brook wrote:
   Paul Brook wrote:
cache=none:
  No host caching. Reads and writes both go directly to underlying
  storage.

Useful to avoid double-caching.

cache=writethrough

  Reads are cached. Writes go directly to underlying storage.  Useful
  for

broken guests that aren't aware of drive caches.
   
   These are misleading descriptions - because cache=none does not push
   writes down to powerfail-safe storage, while cache=writethrough might.
  
  If so, then this is a serious bug.
 
 .. though it may be a kernel bug rather that a qemu bug, depending on the 
 exact details.

It's not a kernel bug.  cache=none uses O_DIRECT, and O_DIRECT must
not force writes to powerfail-safe storage.  If it did, it would be
unusably slow for applications using O_DIRECT as a performance
enhancer / memory saver.  They can call fsync/fdatasync when they need
to for integrity.  (There might be kernel bugs in the latter department.)

 Either way, I consider any mode that inhibits host filesystem write
 cache but not volatile drive cache to be pretty worthless.

On the contrary, it greatly reduces host memory consumption so that
guest data isn't cached twice (it's already cached in the guest), and
it may improve performance by relaxing the POSIX write-serialisation
constraint (not sure if Linux cares; Solaris does).

 Either we guaranteed data integrity on completion or we don't.

The problem with the description of cache=none is that it uses O_DIRECT,
which does not always push writes to powerfail-safe storage.

O_DIRECT is effectively a hint.  It requests less caching in kernel
memory, may reduce memory usage and copying, may invoke direct DMA.

O_DIRECT does not tell the disk hardware to commit to powerfail-safe
storage.  I.e. it doesn't issue barriers or disable disk write caching.
(However, depending on a host setup, it might have that effect if disk
write cache is disabled by the admin).

Also, it doesn't even always write to disk: It falls back to buffered
in some circumstances, even on filesystems which support it - see
recent patches for btrfs which use buffered I/O for O_DIRECT for some
parts of some files.  (Many non-Linux OSes fall back to buffered
when any other process holds a non-O_DIRECT file descriptor, or when
requests don't meet some criteria).

The POSIX thing to use for cache=none would be O_DSYNC|O_RSYNC, and
that should work on some hosts, but Linux doesn't implement real O_RSYNC.

A combination which ought to work is O_DSYNC|O_DIRECT.  O_DIRECT is
the performance hint; O_DSYNC provides the commit request.  Christoph
Hellwig has mentioned that combination elsewhere on this thread.
It makes sense to me for cache=none.

O_DIRECT by itself is a useful performance  memory hint, so there
does need to be some option which maps onto O_DIRECT alone.  But it
shouldn't be documented as stronger than cache=writethrough, because
it isn't.
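
A minimal sketch of that combination, assuming Linux/glibc (O_DIRECT
needs _GNU_SOURCE, and the usual alignment rules for buffers and
offsets apply):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* O_DIRECT as the "bypass the host page cache" performance hint,
     * O_DSYNC as the actual commit-to-stable-storage request. */
    static int open_for_cache_none(const char *path)
    {
        return open(path, O_RDWR | O_DIRECT | O_DSYNC);
    }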

--  Jamie



Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-05-12 Thread Jamie Lokier
Gerhard Wiesinger wrote:
 On Wed, 21 Apr 2010, Jamie Lokier wrote:
 
 Gerhard Wiesinger wrote:
 Hmmm. I'm very new to QEMU and KVM but at least accessing the virtual HW
 of QEMU even from KVM must be possible (e.g. memory and port accesses are
 done on nearly every virtual device) and therefore I'm ending in C code in
 the QEMU hw/*.c directory. Therefore also the VGA memory area should be
 able to be accessable from KVM but with the specialized and fast memory
 access of QEMU.  Am I missing something?
 
 What you're missing is that when KVM calls out to QEMU to handle
 hw/*.c traps, that call is very slow.  It's because the hardware-VM
 support is a bit slow when the trap happens, and then the the call
 from KVM in the kernel up to QEMU is a bit slow again.  Then all the
 way back.  It adds up to a lot, for every I/O operation.
 
 Isn't that then a general problem of KVM virtualization (or hardware 
 virtualization) in general? Is this CPU dependent (AMD vs. Intel)?

Yes it is a general problem, but KVM emulates some time-critical
things in the kernel (like APIC and CPU instructions), so it's not too bad.

KVM is about 5x faster than TCG for most things, and slower for a few
things, so on balance it is usually faster.

The slow 256-colour mode writes sound like just a simple bug, though.
No need for complicated changes.

 In 256-colour mode, KVM should be writing to the VGA memory at high
 speed a lot like normal RAM, not trapping at the hardware-VM level,
 and not calling up to the code in hw/*.c for every byte.
 
 Yes, same picture to me: 256 color mode should be only a memory write (16 
 color mode is more difficult as pixel/byte mapping is not the same).
 But it looks like this isn't the case in this test scenario.
 
 You might double-check if your guest is using VGA Mode X.  (See 
 Wikipedia.)
 
 That was a way to accelerate VGA on real PCs, but it will be slow in
 KVM for the same reasons as 16-colour mode.
 
 Which way do you mean?

Look up Mode X on Wikipedia if you're interested, but it isn't
relevant to the problem you've reported.  Mode X cannot be enabled
with a BIOS call; it's a VGA hardware programming trick.  It would not
be useful in a VM environment.

-- Jamie



Re: [Qemu-devel] [PATCH 0/2] Enable qemu block layer to not flush

2010-05-12 Thread Jamie Lokier
Stefan Hajnoczi wrote:
 On Wed, May 12, 2010 at 10:42 AM, Jamie Lokier ja...@shareable.org wrote:
  Stefan Hajnoczi wrote:
  Why add a nop AIO operation instead of setting
  BlockDriverState->enable_write_cache to zero?  In that case no write
  cache would be reported to the guest (just like cache=writethrough).
 
  Hmm.  If the guest sees write cache absent, that prevents changing the
  cache policy on the host later (from not flushing to flushing), which
  you might want to do after an OS install has finished and booted up.
 
 Right.  There are 3 cases from the guest perspective:
 
 1. Disable write cache or no write cache.  Flushing not needed.
 2. Disable flushing but leave write cache enabled.
 3. Enable write cache and use flushing.
 
 When we don't report a write cache at all, the guest is always stuck at 1.
 
 If you're going to do this for installs and other temporary workloads,
 then enabling the write cache again isn't an issue.  After installing
 successfully, restart the guest with a sane cache= mode.

That only works if you're happy to reboot the guest after the process
finishes.  I guess that is usually fine, but it is a restriction.

Is it possible via QMP to request that the guest is paused when it
next reboots, so that QMP operations to change the cache= mode can be
done (as it's not safe to change the guest-visible disk write cache
availability when it's running, and probably a request to do so should
be denied).

-- Jamie



Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus

2010-05-12 Thread Jamie Lokier
Stefano Stabellini wrote:
 On Wed, 12 May 2010, Avi Kivity wrote:
  It's useful if you have a one-line horizontal pattern you want to 
  propagate all over.
  
 It might be useful all right, but it is not entirely clear what the
 hardware should do in this situation from the documentation we have, and
 certainly the current state of the cirrus emulation code doesn't help.

It's quite a reasonable thing for hardware to do, even if not documented.
It would be surprising if the hardware didn't copy the one-line pattern.

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
 On 05/11/2010 08:12 AM, Paul Brook wrote:
 cache=always (or a more scary name like cache=lie to defend against
 idiots)
 
 Reads and writes are cached. Guest flushes are ignored.  Useful for
 dumb guests in non-critical environments.

 I really don't believe that we should support a cache=lie.  There are
 many other obtain the same results.  For instance, mount your guest
 filesystem with barrier=0.
  
 Ideally yes. However in practice I suspect this is still a useful option. 
 Is
 it even possible to disable barriers in all cases (e.g. NTFS under 
 windows)?
 
 In a production environment it's probably not so useful - you're generally
 dealing with long lived, custom configured guests.
 
 In a development environment the rules can be a bit different. For example 
 if
 you're testing an OS installer then you really don't want to be passing 
 magic
 mount options. If the host machine dies then you don't care about the 
 state of
 the guest because you're going to start from scratch anyway.

 
 Then create a mount point on your host and mount the host file system 
 under that mount with barrier=0.

Two reasons that advice doesn't work:

1. It doesn't work in many environments.  You can't mount a filesystem
with barrier=0 at one mount point and barrier=1 at another, and
there's often only one host partition.

2. barrier=0 does _not_ provide the cache=off behaviour.  It only
disables barriers; it does not prevent writing to the disk hardware.

If you are doing a transient OS install, ideally you want an amount of
data roughly equal to your free RAM to stay unwritten to disk until the
end.  barrier=0 does not achieve that.

 The problem with options added for developers is that those options are 
 very often accidentally used for production.

We already have risky cache= options.  Also, do we call fdatasync
(with barrier) on _every_ write for guests which disable the
emulated disk cache?

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
 qemu-img create -f raw foo.img 10G
 mkfs.ext3 foo.img
 mount -oloop,rw,barrier=1 -t ext3 foo.img mnt
 
 Works perfectly fine.

Hmm, interesting.  Didn't know loop propagated barriers.

So you're suggesting to use qemu with a loop device, and ext2 (bit
faster than ext3) and barrier=0 (well, that's implied if you use
ext2), and a raw image file on the ext2/3 filesystem, to provide the
effect of flush=off, becuase the loop device caches block writes on
the host, except for explicit barrier requests from the fs, which are
turned off?

That wasn't obvious the first time :-)

Does the loop device cache fs writes instead of propagating them
immediately to the underlying fs?  I guess it probably does.

Does the loop device allow the backing file to grow sparsely, to get
behaviour like qcow2?

That's ugly but it might just work.

 2. barrier=0 does _not_ provide the cache=off behaviour.  It only
 disables barriers; it does not prevent writing to the disk hardware.
 
 The proposal has nothing to do with cache=off.

Sorry, I meant flush=off (the proposal).  Mounting the host filesystem
(i.e. not using a loop device anywhere) with barrier=0 doesn't have
even close to the same effect.

 The problem with options added for developers is that those options are
 very often accidentally used for production.
  
 We already have risky cache= options.  Also, do we call fdatasync
 (with barrier) on _every_ write for guests which disable the
 emulated disk cache?
 
 None of our cache= options should result in data corruption on power 
 loss.  If they do, it's a bug.

(I might have the details below a bit off.)

If cache=none uses O_DIRECT without calling fdatasync for guest
barriers, then it will get data corruption on power loss.

If cache=none does call fdatasync for guest barriers, then it might
still get corruption on power loss; I am not sure if recent Linux host
behaviour of O_DIRECT+fdatasync (with no buffered writes to commit)
issues the necessary barriers.  I am quite sure that older kernels did not.

cache=writethrough will get data corruption on power loss with older
Linux host kernels.  O_DSYNC did not issue barriers.  I'm not sure if
the behaviour of O_DSYNC that was recently changed is now issuing
barriers after every write.

Provided all the cache= options call fdatasync/fsync when the guest
issues a cache flush, and call fdatasync/fsync following _every_ write
when the guest has disabled the emulated write cache, that should be
as good as Qemu can reasonably do.  It's up to the host from there.
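
As a rough sketch in C of that policy (not qemu's actual block-layer
code; the function and variable names here are made up for illustration):

    #include <sys/types.h>
    #include <unistd.h>

    static int write_cache_enabled;   /* guest-visible WCE bit */

    int handle_guest_write(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        /* Emulated write cache off: commit every write before completing it. */
        if (!write_cache_enabled && fdatasync(fd) != 0)
            return -1;
        return 0;
    }

    int handle_guest_flush(int fd)
    {
        /* Guest issued a cache flush: commit everything written so far. */
        return fdatasync(fd);
    }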

-- Jamie



Re: [Qemu-devel] Re: [PATCH 2/2] Add flush=off parameter to -drive

2010-05-11 Thread Jamie Lokier
Paul Brook wrote:
 cache=none:
   No host caching. Reads and writes both go directly to underlying storage. 
 Useful to avoid double-caching.
 
 cache=writethrough
   Reads are cached. Writes go directly to underlying storage.  Useful for 
 broken guests that aren't aware of drive caches.

These are misleading descriptions - because cache=none does not push
writes down to powerfail-safe storage, while cache=writethrough might.

 cache=always (or a more scary name like cache=lie to defend against idiots)
   Reads and writes are cached. Guest flushes are ignored.  Useful for dumb 
 guests in non-critical environments.

cache=unsafe would tell it like it is.

Even non-idiots could be excused for getting the wrong impression from
cache=always.

-- Jamie



Re: [Qemu-devel] [PATCH 0/2] Enable qemu block layer to not flush

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
 There's got to be a better place to fix this.  Disable barriers in your 
 guests?

If only it were that easy.

OS installs are the thing that this feature would most help.  They
take ages, do a huge amount of writing with lots of seeking, and if
the host fails you're going to discard the image.

I'm not sure how I would change that setting for most OS install GUIs,
especially Windows, or if it's even possible.

It's usually much easier to change barrier settings after installing
and you've got a command line or registry editing tool.  But by then,
it's not useful any more.

Any other ideas?

-- Jamie



Re: [Qemu-devel] [RFC] default mac address issue

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
 Hi Bruce,
 
 On 05/10/2010 02:07 PM, Bruce Rogers wrote:
 I know this behavior has worked this way all along, but I wanted to bring 
 up the following concern and float a few ideas about possible solutions. 
 Please provide your perspective, opinion, etc.
 
 qemu (or qemu-kvm) users can easily get into trouble when they don't 
 specify the mac address for their vm's nic and don't realize that 
 multiple vm's running this way on the same network segment are colliding, 
 since they all get a default mac address that is the same. They may be 
 under the assumption that a random mac would be the default, as in many 
 higher level tools for vm creation

 
 This is certainly an important issue but it's one that's difficult to 
 resolve.
 
 Does it make sense to do any of the following:
 
 1) have qemu print a warning to stdout/stderr that the default mac address 
 is being used and that it will interfere with other vms running the same 
 way on a common network segment

 
 This is definitely reasonable.
 
 2) what about changing the default behavior to randomizing the mac, and 
 provide the legacy behavior with -net nic,macaddr=default or just 
 -use-default-mac
 
 (or, as a flip side to #2):
 
 3) to at least make it easy for people to get around the problem, and just
 use qemu directly (without additional tools to launch qemu), add an option 
 such as -net nic,macaddr=randomize or -use-random-mac which randomizes 
 the mac for you
 each time the machine is brought up, and hence avoids possible collisions.

 
 A random mac address is almost always wrong.  If you run a guest twice 
 with this option, it's usually enough to trigger a new network detection 
 which will rename the network device to ethN + 1.  The result would be 
 broken networking for naive users since distros don't bother configuring 
 interfaces that weren't present during installation.

Yes, I've seen this when moving disk images between (real)
motherboards.  In the good old days it Just Worked.

Now, current distros using udev remember the MAC from the old board,
so the new board gets an interface called eth1 instead of eth0.

That's fine, but rather stupidly they've configured a useful default
for eth0 which is DHCP, but the default for eth1 etc. is to leave
it down.  Result: Disk moved to a replacement motherboard, and the
machine no longer responds to network connections.  Quite annoying if
it's a headless box, or one which boots up as a kiosk or something
with no console access.

Anyway, Anthony's right: Changing the MAC address of a guest each time
it is run (with the same disk image) is likely to be annoying.

It might be a good idea to store the chosen MAC in the qcow2 metadata,
if qcow2 is used?

For my Perl-managed qemu/kvm VMs, I find I need a small config file,
and a small state file which records run time state that survives
reboots (like the MAC address, and things like which CD and floppy
images were loaded).

(Perhaps in the search for a holy grail of a qemu config file format,
it might also be worth a mention that it's handy to store non-config
state somewhere too.)

-- Jamie



Re: [Qemu-devel] [PATCH 07/22] qemu-error: Introduce get_errno_string()

2010-05-11 Thread Jamie Lokier
Anthony Liguori wrote:
 QMP should insulate users from underlying platform quirks.  We should 
 translate errnos to appropriate QMP error types.

Fair enough.  What should it do when the platform returns an errno
value that qemu doesn't know about, and wants to pass to the QMP caller?

-- Jamie



Re: [Qemu-devel] QLicense chaos

2010-05-07 Thread Jamie Lokier
Jan Kiszka wrote:
 Moreover, some of the QObject files are LGPL, some GPL. I bet this was
 also not intended. But what was the idea behind the LGPL? Some libqmp which
 can be used by closed source apps?

I believe LGPL is needed for open source apps that have GPLv2-incompatible
licensing.  E.g. GPLv3, Apache license, OpenSSL?  (I'm not sure exactly.)

And for those who want to keep their own apps BSD-like.

-- Jamie




Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-06 Thread Jamie Lokier
Rusty Russell wrote:
  Seems over-zealous.
  If the recovery_header held a strong checksum of the recovery_data you would
  not need the first fsync, and as long as you have two places to write 
  recovery
  data, you don't need the 3rd and 4th syncs.
  Just:

  write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
fsync / msync
overwrite_with_new_data()
  
  To recover you choose the most recent log_space and replay the content.
  That may be a redundant operation, but that is no loss.
 
 I think you missed a checksum for the new data?  Otherwise we can't tell if
 the new data is completely written.

The data checksum can go in the recovery-data block.  If there's
enough slack in the log, by the time that recovery-data block is
overwritten, you can be sure that an fsync has been done for that
data (by a later commit).
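
For illustration only - this is not TDB's real on-disk format, and all
the field names below are invented - a self-checksummed recovery record
along those lines might look roughly like:

    #include <stdint.h>

    /* Carries checksums for both the header and the data, so a torn or
     * partial write can be detected at replay time without the extra
     * fsyncs between the steps. */
    struct recovery_record {
        uint32_t magic;         /* identifies a valid record             */
        uint64_t seq;           /* pick the most recent record at replay */
        uint64_t data_offset;   /* where the saved data will be replayed */
        uint32_t data_len;
        uint32_t data_crc;      /* checksum of the saved data            */
        uint32_t new_data_crc;  /* checksum of the new data, as above    */
        uint32_t header_crc;    /* checksum of this header itself        */
        /* followed by data_len bytes of saved data */
    };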

 But yes, I will steal this scheme for TDB2, thanks!

Take a look at the filesystems.  I think ext4 did some optimisations
in this area, and that checksums had to be added anyway due to a
subtle replay-corruption problem that happens when the log is
partially corrupted, and followed by non-corrupt blocks.

Also, you can remove even more fsyncs by adding a bit of slack to the
data space and writing into unused/fresh areas some of the time -
i.e. a bit like btrfs/zfs or anything log-structured, but you don't
have to go all the way with that.

 In practice, it's the first sync which is glacial, the rest are pretty cheap.

The 3rd and 4th fsyncs imply a disk seek each, just because the
preceding writes are to different areas of the disk.  Seeks are quite
slow - but not as slow as ext3 fsyncs :-) What do you mean by cheap?
That it's only a couple of seeks, or that you don't see even that?

 
  Also cannot see the point of msync if you have already performed an fsync,
  and if there is a point, I would expect you to call msync before
  fsync... Maybe there is some subtlety there that I am not aware of.
 
 I assume it's this from the msync man page:
 
    msync()  flushes  changes  made to the in-core copy of a file that was
    mapped into memory using mmap(2) back to disk.  Without use of this
    call there is no guarantee that changes are written back before
    munmap(2) is called.

Historically, that means msync() ensures dirty mapping data is written
to the file as if with write(), and that mapping pages are removed or
refreshed to get the effect of read() (possibly a lazy one).  It's
more obvious in the early mmap implementations where mappings don't
share pages with the filesystem cache, so msync() has explicit
behaviour.

Like with write(), after calling msync() you would then call fsync()
to ensure the data is flushed to disk.
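
In other words, roughly this (a minimal sketch, assuming an ordinary
writable mapping of the file):

    #include <sys/mman.h>
    #include <unistd.h>

    int commit_mapping(void *addr, size_t len, int fd)
    {
        /* Push dirty mapped pages into the file... */
        if (msync(addr, len, MS_SYNC) != 0)
            return -1;
        /* ...then ask the kernel to commit the file to stable storage. */
        return fsync(fd);
    }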

If you've been calling fsync then msync, I guess that's another fine
example of how these functions are so hard to test, that they aren't.

Historically on Linux, msync has been iffy on some architectures, and
I'm still not sure it has the same semantics as other unixes.  fsync
as we know has also been iffy, and even now that fsync is tidier it
does not always issue a hardware-level cache commit.

But then historically writable mmap has been iffy on a boatload of
unixes.

   It's an implementation detail; barrier has less flexibility because it has
   less information about what is required. I'm saying I want to give you as
   much information as I can, even if you don't use it yet.
  
  Only we know that approach doesn't work.
  People will learn that they don't need to give the extra information to 
  still
  achieve the same result - just like they did with ext3 and fsync.
  Then when we improve the implementation to only provide the guarantees that
  you asked for, people will complain that they are getting empty files that
  they didn't expect.
 
 I think that's an oversimplification: IIUC that occurred to people *not*
 using fsync().  They weren't using it because it was too slow.  Providing
 a primitive which is as fast or faster and more specific doesn't have the
 same magnitude of social issues.

I agree with Rusty.  Let's make it perform well so there is no reason
to deliberately avoid using it, and let's make it say what apps actually
want to request without being way too strong.

And please, if anyone has ideas on how we could make correct use of
these functions *testable* by app authors, I'm all ears.  Right now it
is quite difficult - pulling power on hard disks mid-transaction is
not a convenient method :)

  The abstraction I would like to see is a simple 'barrier' that contains no
  data and has a filesystem-wide effect.
 
 I think you lack ambition ;)
 
 Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
 suboptimal for md?  Since you have to hand your barrier to every device
 whereas a file-wide primitive may theoretically only go to some.

Yes.

Note that database-like programs still need fsync-like behaviour
*sometimes*: The D in ACID depends on it, and the C 

Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-06 Thread Jamie Lokier
Rusty Russell wrote:
 On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
  Jens Axboe wrote:
   On Tue, May 04 2010, Rusty Russell wrote:
ISTR someone mentioning a desire for such an API years ago, so CC'ing 
the
usual I/O suspects...
   
   It would be nice to have a more fuller API for this, but the reality is
   that only the flush approach is really workable. Even just strict
   ordering of requests could only be supported on SCSI, and even there the
   kernel still lacks proper guarantees on error handling to prevent
   reordering there.
  
  There's a few I/O scheduling differences that might be useful:
  
  1. The I/O scheduler could freely move WRITEs before a FLUSH but not
 before a BARRIER.  That might be useful for time-critical WRITEs,
 and those issued by high I/O priority.
 
 This is only because no one actually wants flushes or barriers, though
 I/O people seem to only offer that.  We really want "these writes must
 occur before this write".  That offers maximum choice to the I/O subsystem
 and potentially to smart (virtual?) disks.

We do want flushes for the D in ACID - such things as after
receiving a mail, or blog update into a database file (could be TDB),
and confirming that to the sender, to have high confidence that the
update won't disappear on system crash or power failure.

Less obviously, it's also needed for the C in ACID when more than
one file is involved.  C is about differently updated things staying
consistent with each other.

For example, imagine you have a TDB file mapping Samba usernames to
passwords, and another mapping Samba usernames to local usernames.  (I
don't know if you do this; it's just an illustration).

To rename a Samba user involves updating both.  Let's ignore transient
transactional issues :-) and just think about what happens with
per-file barriers and no sync, when a crash happens long after the
updates, and before the system has written out all data and issued low
level cache flushes.

After restarting, due to lack of sync, the Samba username could be
present in one file and not the other.

  2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
 only for data belonging to a particular file (e.g. fdatasync with
 no file size change, even on btrfs if O_DIRECT was used for the
 writes being committed).  That would entail tagging FLUSHes and
 WRITEs with a fs-specific identifier (such as inode number), opaque
 to the scheduler which only checks equality.
 
 This is closer.  In userspace I'd be happy with "all prior writes to this
 struct file before all future writes".  Even if the original guarantees were
 stronger (ie. inode basis).  We currently implement transactions using 4 fsync
 /msync pairs.
 
   write_recovery_data(fd);
   fsync(fd);
   msync(mmap);
   write_recovery_header(fd);
   fsync(fd);
   msync(mmap);
   overwrite_with_new_data(fd);
   fsync(fd);
   msync(mmap);
   remove_recovery_header(fd);
   fsync(fd);
   msync(mmap);
 
 Yet we really only need ordering, not guarantees about it actually hitting
 disk before returning.
 
  In other words, FLUSH can be more relaxed than BARRIER inside the
  kernel.  It's ironic that we think of fsync as stronger than
  fbarrier outside the kernel :-)
 
 It's an implementation detail; barrier has less flexibility because it has
 less information about what is required. I'm saying I want to give you as
 much information as I can, even if you don't use it yet.

I agree, and I've started a few threads about it over the last couple of years.

An fsync_range() system call would be very easy to use and
(most importantly) easy to understand.

With optional flags to weaken it (into fdatasync, barrier without sync,
sync without barrier, one-sided barrier, no lowlevel cache-flush, don't rush,
etc.), it would be very versatile, and still easy to understand.

With an AIO version, and another flag meaning don't rush, just return
when satisfied, and I suspect it would be useful for the most
demanding I/O apps.
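
Purely as an interface sketch - no such system call exists on Linux, and
the flag names here are invented - it might look something like:

    #include <sys/types.h>

    #define FSYNC_DATA_ONLY      0x01  /* fdatasync-like: skip inode metadata   */
    #define FSYNC_BARRIER_ONLY   0x02  /* order against later writes, no commit */
    #define FSYNC_NO_CACHE_FLUSH 0x04  /* skip the low-level disk cache flush   */
    #define FSYNC_DONT_RUSH      0x08  /* complete whenever convenient          */

    /* Hypothetical system call: commit (or order) just this byte range. */
    int fsync_range(int fd, off_t offset, off_t length, unsigned flags);

    /* Example: durably commit one record that was just overwritten in place. */
    static inline int commit_record(int fd, off_t off, size_t len)
    {
        return fsync_range(fd, off, (off_t)len, FSYNC_DATA_ONLY);
    }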

-- Jamie




Re: [Qemu-devel] question on virtio

2010-05-06 Thread Jamie Lokier
Michael S. Tsirkin wrote:
 Hi!
 I see this in virtio_ring.c:
 
 /* Put entry in available array (but don't update avail->idx
  * until they do sync). */
 
 Why is it done this way?
 It seems that updating the index straight away would be simpler, while
 this might allow the host to speculatively look up the buffer and handle
 it, without waiting for the kick.

Even better, if the host updates a location containing which index it
has seen recently, you can avoid the kick entirely during sustained
flows - just like your recent patch to avoid sending irqs to the
guest.
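
Roughly the idea, as an illustrative sketch (the field and function
names are mine, not the virtio spec's):

    #include <stdint.h>

    struct ring_shared {
        volatile uint16_t avail_idx;     /* written by the guest */
        volatile uint16_t host_seen_idx; /* written by the host: last avail_idx consumed */
    };

    static void notify_host(void) { /* I/O write that causes a VM exit ("kick") */ }

    void guest_add_buffer(struct ring_shared *r)
    {
        uint16_t old = r->avail_idx;

        /* ... fill in a descriptor and put the entry in the available array ... */
        __sync_synchronize();            /* make the entry visible before the index */
        r->avail_idx = (uint16_t)(old + 1);
        __sync_synchronize();

        /* Kick only if the host has already caught up (and may have gone idle);
         * while it is still behind, it will pick up the new entry anyway. */
        if (r->host_seen_idx == old)
            notify_host();
    }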

-- Jamie




Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-05 Thread Jamie Lokier
Jens Axboe wrote:
 On Tue, May 04 2010, Rusty Russell wrote:
  ISTR someone mentioning a desire for such an API years ago, so CC'ing the
  usual I/O suspects...
 
 It would be nice to have a more fuller API for this, but the reality is
 that only the flush approach is really workable. Even just strict
 ordering of requests could only be supported on SCSI, and even there the
 kernel still lacks proper guarantees on error handling to prevent
 reordering there.

There's a few I/O scheduling differences that might be useful:

1. The I/O scheduler could freely move WRITEs before a FLUSH but not
   before a BARRIER.  That might be useful for time-critical WRITEs,
   and those issued by high I/O priority.

2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
   only for data belonging to a particular file (e.g. fdatasync with
   no file size change, even on btrfs if O_DIRECT was used for the
   writes being committed).  That would entail tagging FLUSHes and
   WRITEs with a fs-specific identifier (such as inode number), opaque
   to the scheduler which only checks equality.

3. By delaying FLUSHes through reordering as above, the I/O scheduler
   could merge multiple FLUSHes into a single command.

4. On MD/RAID, BARRIER requires every backing device to quiesce before
   sending the low-level cache-flush, and all of those to finish
   before resuming each backing device.  FLUSH doesn't require as much
   synchronising.  (With per-file FLUSH; see 2; it could even avoid
   FLUSH altogether to some backing devices for small files).

In other words, FLUSH can be more relaxed than BARRIER inside the
kernel.  It's ironic that we think of fsync as stronger than
fbarrier outside the kernel :-)
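
To illustrate point 2, a sketch of what such tagging might look like
(this is not a real block-layer API; the names are invented):

    #include <stdbool.h>
    #include <stdint.h>

    enum io_op { IO_WRITE, IO_FLUSH };

    struct io_request {
        enum io_op op;
        uint64_t   tag;  /* opaque to the scheduler (e.g. inode number); 0 = untagged */
        /* sector, length, buffer, ... */
    };

    /* A WRITE may be moved after a FLUSH when the FLUSH is scoped to a
     * different file (different tag), because the FLUSH never promised
     * to commit that WRITE in the first place. */
    static bool may_move_write_after_flush(const struct io_request *wr,
                                           const struct io_request *fl)
    {
        return fl->op == IO_FLUSH && fl->tag != 0 && wr->tag != fl->tag;
    }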

-- Jamie




Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-05 Thread Jamie Lokier
Rusty Russell wrote:
 On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
  I took a stub at documenting CMD and FLUSH request types in virtio
  block.  Christoph, could you look over this please?
  
  I note that the interface seems full of warts to me,
  this might be a first step to cleaning them.
 
 ISTR Christoph had withdrawn some patches in this area, and was waiting
 for him to resubmit?
 
 I've given up on figuring out the block device.  What seem to me to be sane
 semantics along the lines of memory barriers are foreign to disk people: they
 want (and depend on) flushing everywhere.
 
 For example, tdb transactions do not require a flush, they only require what
 I would call a barrier: that prior data be written out before any future data.
 Surely that would be more efficient in general than a flush!  In fact, TDB
 wants only writes to *that file* (and metadata) written out first; it has no
 ordering issues with other I/O on the same device.

I've just posted elsewhere on this thread, that an I/O level flush can
be more efficient than an I/O level barrier (implemented using a
cache-flush really), because the barrier has stricter ordering
requirements at the I/O scheduling level.

By the time you work up to tdb, another way to think of it is
distinguishing "eager fsync" from "fsync, but I'm not in a hurry -
delay as long as is convenient".  The latter makes much more sense
with AIO.

 A generic I/O interface would allow you to specify "this request
 depends on these outstanding requests" and leave it at that.  It
 might have some sync flush command for dumb applications and OSes.

For filesystems, it would probably be easy to label in-place
overwrites and fdatasync data flushes when there's no file extension
with an opaque per-file identifier for certain operations.  Typically
over-writing in place and fdatasync would match up and wouldn't need
ordering against anything else.  Other operations would tend to get
labelled as ordered against everything including these.

-- Jamie




Re: [Qemu-devel] Re: Bug#573439: qemu-kvm: fail to set hdd serial number

2010-04-26 Thread Jamie Lokier
Michael Tokarev wrote:
 24.04.2010 17:05, Andreas Färber wrote:
 Am 22.04.2010 um 11:40 schrieb Michael Tokarev:
 
 11.03.2010 18:34, Michael Tokarev wrote:
 []
 On version 0.12.3, -drive serial=XXX option does not work.
 Below patch fixes it. 'serial' is pointer, not array.
 
 
 --- qemu-kvm-0.12.3+dfsg/vl.c 2010-02-26 11:34:00.0 +0900
 +++ qemu-kvm-0.12.3+dfsg.old/vl.c 2010-03-11 02:26:00.134217787 +0900
 [...]
 
 Folks, can we please add this trivial one-liner to -stable or something?
 It has been one and a half months since it has been fixed in debian...
 
 Try submitting it as a proper Git patch with summary so that it can be
 applied to master first; if it's already in master, post the commit id
 so it can be cherry-picked. Also, mark the subject as [STABLE] or [0.12]
 or something for Anthony to find it.
 
  Well, it's not that difficult to carry it in the debian package.
  Hopefully other distros will follow (the ones who are not already),
  so that support requests in #...@freenode won't mention that again.

It would be nice for such trivial fixes to be committed to the stable
branch for those of us compiling stable versions from source.

Especially with so many guest regressions lately, so that keeping
multiple qemu versions around is an unfortunate necessity for the time
being.

-- Jamie




Re: [libvirt] [Qemu-devel] Re: Libvirt debug API

2010-04-26 Thread Jamie Lokier
Daniel P. Berrange wrote:
  Much better to exact a commitment from libvirt to track all QMP (and 
  command line) capabilities.  Instead of adding cleverness to QMP, add 
  APIs to libvirt.
 
 Agreed. Despite adding this monitor / XML passthrough capability, we still
 do not want apps to be using this at all. If there is some capability
 missing that apps need then the default mode of operation is to add the
  necessary bits to libvirt. The monitor/XML passthrough is just a short
 term quick workaround until the official support is done. As such I do
  not really think we need to put huge amounts of effort in the weird 
  complex racy edge cases. The effort is better spent on getting the 
 features in libvirt.

All the features?  The qemu API is quite large already (look at all
the command line options and monitor commands).  I'll be very
surprised if libvirt provides all the parts of it that obscure apps may use.

I'm thinking of features which are relatively obscure but nonetheless
useful to a small number of deployments.  Probably not enough to
justify the effort building data models, specifying the XML and remote
protocol and so on in libvirt.

(Unless that becomes so easily mapped to qemu's API that it's almost an
automatic thing... Which sounds like QMP, doesn't it?)

Is libvirt ever likely to go to the effort of providing all the
easily-usable API, or hooks, for:

- sending keys to a guest, driven by a timed host script?

- rebooting the guest while switching between USB touchpad and
  mouse devices, because one of them is needed during an OS
  install and the other is needed after?

- changing the amount of RAM available to the guest at the next
  reboot, for OS install needing more memory than run time, in a
  scripted fashion when building new VMs from install disk images?

- switching the guest between qemu mode and kvm mode on the next
  guest reset, because qemu is faster for some things (VGA
  updates) and kvm is faster for other things, so the best choice
  depends on which app you need to run on that guest

- pausing a VM, making a copy, and resuming it, so as to fork it
  into two VMs (literally fork)?

- setting up the host network container and NAT IP forwarding, on
  demand as guests are stopped and started, so that it works in
  the above scenario despite clashing IP addresses?

- running a copy of the same guest, or perhaps an entire OS
  install process (scripted), many times for different qemu and
  qemu-kvm versions, different BIOSes, and different
  almost-equivalent hardware emulations (i.e. different NIC types,
  SMP count, CPU features, disk controller type, AIO/cache type) -
  for testing guests and apps on them - with some parallelism?

None of those, except perhaps the first, are what I think of as typical
virtualisation workloads, and they all seem like obscure things probably
outside libvirt's remit.  Probably not many users either :-)

Yet you can do them all today with qemu and scripting the monitor, and
it's getting easier with QMP.

Which is fine, qemu works, but it would be great to be able to see
those guests and interact in the basic ways through the libvirt-based
GUIs?

QMP pass-through or QMP multiple monitors seems to provide most of
that, although I can see libvirt getting a bit confused about which
devices and how much RAM the guest has installed at different times.

The bit about forking guests, I'm not sure how complicated it is to
tie in to libvirt's notion of which disk images are being used, and
hooking into its network configuration to handle the clashing
addresses.

If those things are considered to be entirely outside libvirt's remit,
that's fine with me.  Fair enough: I will continue to live with ssh
and vinagre.

I'm just raising my hand as a potential user who might like to monitor
a bunch of active and inactive guests, remotely, see how much memory
they report using, etc., launch a VNC viewer from the GUI, even choose
the target host based on load and migrate on demand, while also
needing a fair bit of non-standardness and qemu-level scripting too.

Imho, that probably comes under the heading of apps using pass-through
or multiple QMP monitors, which use features that probably won't and
probably shouldn't ever be handled by libvirt itself.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-23 Thread Jamie Lokier
Ian Molton wrote:
 Jamie Lokier wrote:
  First of all: Why are your egd daemons with open connections dying
  anyway?  Buggy egd?
 
 No, they aren't buggy, but occasionally, the sysadmin of said server
 farms may want to, you know, update the daemon?

Many daemons don't kill active connections on upgrade.  For example
sshd, telnetd, ftpd, rsyncd...  Only new connections get the new daemon.

But let's approach this from a different angle:

What do _other_ long-lived EGD clients do?  Is it:

   1. When egd is upgraded, the clients break.
   2. Active connections aren't killed on egd upgrade.
   3. They keep trying to reconnect, as you've implemented in qemu.
   4. Whenever they want entropy, they're expected to open a
  connection, request what they want, read it, and close.  Each time.

Whatever other long-lived clients do, that's probably best for qemu
too.

4 is interesting because it's an alternative approach to rate-limiting
the byte stream: Instead, fetch a batch of bytes in a single
open/read/close transaction when needed.  Rate-limit _that_, and you
don't need separate reconnection code.
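
A sketch of such a transaction in C, assuming the usual EGD command
bytes (0x02 = blocking read of N bytes) - treat the protocol details and
the socket path as assumptions, not a reference implementation:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/un.h>
    #include <unistd.h>

    static ssize_t egd_fetch(const char *path, unsigned char *buf, unsigned char n)
    {
        struct sockaddr_un sa;
        unsigned char req[2] = { 0x02, n };   /* 0x02: blocking read of n bytes */
        ssize_t got = -1;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);

        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
            write(fd, req, sizeof(req)) == (ssize_t)sizeof(req))
            got = read(fd, buf, n);   /* a real client would loop on short reads */

        close(fd);                    /* each batch is a fresh connection */
        return got;
    }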

So I tried checking whether egd kills connections when upgraded, and found...

No 'egd' package for my Debian and Ubuntu systems, nor anything which
looks obvious.  There are several other approaches to gathering
entropy from hardware sources, for example rng-tools, haveged, ekeyd, and
ekeyd-egd-linux (aha... it's a client).

All of those have in common: they fetch entropy from something, and
write it to the kernel's /dev/random pool.  Applications are expected
to read from that pool.

In particular, if you do have a hardware or network EGD entropy
source, you can run ekeyd-egd-linux, which is an EGD client that
transfers entropy from EGD to the kernel, so that applications can read from
/dev/random.

That means, on Debian and Ubuntu Linux at least, there is no need for
applications to talk EGD protocol themselves, even to get network or
hardware entropy - it's better left to egd-linux, rng-tools etc. to
manage.

But the situation is no doubt different on non-Linux hosts.

By the way, ekeyd-egd-linux is a bit thoughtful: For example it has a
shannons-per-byte option, and it doesn't drain the EGD server at all
when the local pool is sufficiently full.

Does your EGD client + virtio-rng support do that - avoid draining the
source when the guest's pool is full enough?

  If guests need a _reliable_ source of data for security, silently not
  complaining when it's gone away and hoping it comes back isn't good
  enough.
 
 Why? it's not like the guest:
 
 a) Has a choice in the matter
 b) Would carry on without the entropy (it knows it has no entropy)

Because one might prefer a big red light, a halted machine removed
from the cluster which can resume its work when ready, and an email to
warn you that the machine isn't able to operate normally _without_
having to configure each guest's email, rather than a working machine
with increasing numbers of stuck crypto processes waiting on
/dev/random, which runs out of memory, gets into swap hell, and has to
be rebooted, losing the other work that it was in the middle of doing.

Well, you personally might not prefer that.  But that's why we
separate policy from mechanism...

  But then it would need to sync with the guest on reconnection, so that
  the guest can restart whatever protocol it's using over the byte
  stream.
 
 Er, why? we're not talking about port forwarding here, we're talking
 about emulation of device hardware.

virtio-serial isn't emulating a normal serial port.  It supports apps
like "send machine status blobs regularly", without having to be
robust against half a blob being delivered.

You can design packets so that doesn't matter, but virtio-serial
supports not needing to do that, making the apps simpler.

  I don't think it'll happen.  I think egd is a rather unusual
  If another backend ever needs it, it's easy to move code around.
 
 *bangs head on wall*
 
 That was the exact same argument I made about the rate limiting code.
 Why is that apparently only valid if it's not me that says it?

Because you're talking to multiple people who hold different opinions,
and opinions change as more is learned and thought about.  It's
iterative, and I, for one, am not in a position to make merging
decisions, only give my view on it.  Can't speak for the others.

  I'm not convinced there's a need for it even for egd.
 
 So what? I'm not convinced there's a need for about 90% of what's out
 there,

Ah, that's not quite what I meant.  I meant I wasn't convinced it is
needed for egd, not "I don't think anyone should use egd".  (But now I
see that egd-linux has a reconnect time option, perhaps reconnecting
_is_ de facto part of EGD protocol.)

But now that we've confirmed that on Debian and Ubuntu, all hardware
entropy sources are injected into /dev/random by userspace daemons
rather than serving EGD protocol, and if you do have an EGD server you
can run egd-linux and apps can

Re: [Qemu-devel] Atomicity of i386 guest atomic instructions

2010-04-23 Thread Jamie Lokier
Alexander Graf wrote:
 They should be atomic. TCG SMP swaps between different vCPUs only
 after translation blocks are done. In fact, the only way I'm aware
 of to stop the execution of a TB mid-way is a page fault.

A page fault would interrupt it if the atomic is implemented as
a read followed by a write, and the write faults.

 You can as always check things with the -d parameter.

-- Jamie




Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Jamie Lokier
Yoshiaki Tamura wrote:
 Jamie Lokier wrote:
 Yoshiaki Tamura wrote:
 Dor Laor wrote:
 On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
 Event tapping is the core component of Kemari, and it decides on which
 event the
 primary should synchronize with the secondary. The basic assumption
 here is
 that outgoing I/O operations are idempotent, which is usually true for
 disk I/O
 and reliable network protocols such as TCP.
 
  IMO any type of network event should be stalled too. What if the VM runs
 non tcp protocol and the packet that the master node sent reached some
 remote client and before the sync to the slave the master failed?
 
 In current implementation, it is actually stalling any type of network
 that goes through virtio-net.
 
 However, if the application was using unreliable protocols, it should have
 its own recovering mechanism, or it should be completely stateless.
 
 Even with unreliable protocols, if slave takeover causes the receiver
 to have received a packet that the sender _does not think it has ever
 sent_, expect some protocols to break.
 
 If the slave replaying master's behaviour since the last sync means it
 will definitely get into the same state of having sent the packet,
 that works out.
 
 That's something we're expecting now.
 
 But you still have to be careful that the other end's responses to
 that packet are not seen by the slave too early during that replay.
 Otherwise, for example, the slave may observe a TCP ACK to a packet
 that it hasn't yet sent, which is an error.
 
 Even current implementation syncs just before network output, what you 
 pointed out could happen.  In this case, would the connection going to be 
 lost, or would client/server recover from it?  If latter, it would be fine, 
 otherwise I wonder how people doing similar things are handling this 
 situation.

In the case of TCP in a synchronised state, I think it will recover
according to the rules in RFC793.  In an unsynchronised state
(during connection), I'm not sure if it recovers or if it looks like a
"Connection reset" error.  I suspect it does recover but I'm not certain.

But that's TCP.  Other protocols, such as over UDP, may behave
differently, because this is not an anticipated behaviour of a
network.

 However there is one respect in which they're not idempotent:
 
 The TTL field should be decreased if packets are delayed.  Packets
 should not appear to live in the network for longer than TTL seconds.
 If they do, some protocols (like TCP) can react to the delayed ones
 differently, such as sending a RST packet and breaking a connection.
 
 It is acceptable to reduce TTL faster than the minimum.  After all, it
 is reduced by 1 on every forwarding hop, in addition to time delays.
 
 So the problem is, when the slave takes over, it sends a packet with same 
 TTL which client may have received.

Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.

Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-23 Thread Jamie Lokier
Ian Molton wrote:
 You can configure any chardev to be a tcp client. I never do that though
 as I find it much more convenient to configure it as server.
 
 Perhaps thats because chardev clients are nearly useless right now 
 because they just die if the connection drops...

Which is why a dropped connection or missing server should be a QMP event + action, same
as other triggers like disk full and watchdog trigger.

I do not want my guests to continue running if they are configured to
depend on Qemu entropy and it's not available.

 Or are you suggesting that we create another type of chardev, thats 
 nearly like a socket, but speaks egd and can reconnect? That seems 
 hideous to me.

Why hideous?

An egd chardev is a good thing because you can then trivially
use it as a random byte source for virtio-serial, isa-serial,
pci-serial, custom-soc-serial, debug-port even :-), and anything
else which might want random bytes as Gerd said.

That's way more useful than restricting to virtio-rng, because most
guests don't support virtio at all, but they can probably all take
entropy from a serial-like device.

Similarly the ability to connect to /dev/urandom directly, with the
rate-limiting but no auto-reconnection, looking like a chardev in
the same way, would make sense.  Reconnection is not needed in this
case - missing device should be an error at startup.

Your idea for an 'egd line discipline' would need to look exactly like
a chardev internally, or all the devices which might find it useful
would have to be changed to know about line disciplines, or it just
wouldn't be available as a random byte source to everything that uses
a chardev - unnecessarily limiting.

There's nothing wrong with the egd chardev actually _being
implemented_ like a line discipline on top of another chardev, with a
chardev interface so everything can use it.

In which case it's quite natural to expose the options as a
user-visible chardev 'egd', defined to return random bytes on input
and ignore output, which takes all the same options as 'socket' and
actually uses a 'socket' chardev (passing along the options).

(Is there any actual point in supporting egd over non-sockets?)

I think rate-limiting is more generically useful as a 'line
discipline'-like feature, to work with any chardev type.  But it
should then have properties governing incoming and outgoing rate
limiting separately, which won't get much testing for the only
imminent user which is input-only.
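
A minimal token-bucket sketch of what such a rate-limiting layer might
look like, with one instance per direction (generic illustration only,
not qemu's actual rate-limiting code):

    #include <stddef.h>
    #include <stdint.h>

    struct rate_limit {
        uint64_t bytes_per_sec;   /* 0 = unlimited */
        double   tokens;          /* bytes allowed to pass right now */
        uint64_t last_ns;         /* when tokens were last topped up */
    };

    /* Returns how many of 'want' bytes may pass at time 'now_ns'. */
    static size_t rate_limit_allow(struct rate_limit *rl, size_t want, uint64_t now_ns)
    {
        if (rl->bytes_per_sec == 0)
            return want;

        rl->tokens += (double)(now_ns - rl->last_ns) * rl->bytes_per_sec / 1e9;
        if (rl->tokens > (double)rl->bytes_per_sec)   /* cap the burst at ~1 second */
            rl->tokens = (double)rl->bytes_per_sec;
        rl->last_ns = now_ns;

        if ((double)want > rl->tokens)
            want = (size_t)rl->tokens;
        rl->tokens -= (double)want;
        return want;
    }

    /* A chardev wrapper would keep two of these, one per direction:
     *     struct rate_limit in_limit, out_limit;
     * and apply rate_limit_allow() before passing bytes along. */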

 That way it wouldn't matter if it were a socket or anything else that 
 the data came in via, which is the case with the patch as I wrote it - 
 you can feed in EGD from a file, a socket, anything, and it just works.

What's the point in feeding egd protocol from a file?
If you want entropy from a file, it should probably be raw, not egd protocol.

-- Jamie




Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Jamie Lokier
Yoshiaki Tamura wrote:
 Dor Laor wrote:
 On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
 Event tapping is the core component of Kemari, and it decides on which
 event the
 primary should synchronize with the secondary. The basic assumption
 here is
 that outgoing I/O operations are idempotent, which is usually true for
 disk I/O
 and reliable network protocols such as TCP.
 
  IMO any type of network event should be stalled too. What if the VM runs
 non tcp protocol and the packet that the master node sent reached some
 remote client and before the sync to the slave the master failed?
 
 In current implementation, it is actually stalling any type of network 
 that goes through virtio-net.
 
 However, if the application was using unreliable protocols, it should have 
 its own recovering mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.

But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.

About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.

 I currently don't have good numbers that I can share right now.
 Snapshots/sec depends on what kind of workload is running, and if the 
 guest was almost idle, there will be no snapshots in 5sec.  On the other 
 hand, if the guest was running I/O intensive workloads (netperf, iozone 
 for example), there will be about 50 snapshots/sec.

That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-22 Thread Jamie Lokier
Ian Molton wrote:
  It might make sense to have the reconnect logic in the egd chardev
  backend then, thereby obsoleting the socket reconnect patch.
 
 Im not sure I agree there... surely there are other things which would
  benefit from generic socket reconnection support (virtio-rng can't be the
 only driver that might want to rely on a reliable source of data via a
 socket in a server-farm type situation?)

First of all: Why are your egd daemons with open connections dying
anyway?  Buggy egd?

Secondly: why isn't egd death an event reported over QMP, with a
monitor command to reconnect manually?

If guests need a _reliable_ source of data for security, silently not
complaining when it's gone away and hoping it comes back isn't good
enough.  It should be an error condition known to management, which
can halt the guest until egd is fixed or restarts if running without
entropy isn't acceptable in its policy.

Thirdly, which other things do you think would use it?

Maybe some virtio-serial apps would like it.

But then it would need to sync with the guest on reconnection, so that
the guest can restart whatever protocol it's using over the byte
stream.

In which case, it's better to tell the guest that the connection died,
and give the guest a way to request a new one when it's ready.

Reconnecting and resuming in the middle of the byte stream would be bad
(even for egd protocol?).  Pure /dev/urandom fetching is quite unusual in not
caring about this, but you shouldn't need to reconnect to that.

 Do we really want to re-implement reconnection (and reconnection retry
 anti-flood limiting) in every single backend?

I don't think it'll happen.  I think egd is a rather unusual

If another backend ever needs it, it's easy to move code around.

I'm not convinced there's a need for it even for egd.  Either egd
shouldn't be killing open connections (and is buggy if it is), or this
is normal egd behavior and so it's part of the egd protocol to
repeatedly reconnect, and therefore can go in the egd client code.

Meanwhile, because the egd might not return, it should be reported as
an error condition over QMP for management to do what it deems
appropriate.  In which case, management could tell it to reconnect
when it thinks is a good time, or do other things like switch the
randomness source to something else, or stop the guest, or warn the
admin that a guest is running without entropy.

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-21 Thread Jamie Lokier
Gerd Hoffmann wrote:
 On 04/20/10 23:31, Ian Molton wrote:
 
 Using virtio-rng means that the data is going into the guest
 kernel's hwrng subsystem.
 
 Which is *the* major advantage of the virtio-rng driver.  In case the 
  guest kernel is recent enough to have support for it, it will 
 JustWork[tm].  No need for guest configuration, no need for some 
 userspace tool.  I'd like to see this driver being merged.
 
 With any kind of serial port (be it a emulated 16550 or virtio-serial) 
 you'll need some daemon running inside the guest grabbing entropy data 
 from the port and feeding it back into the kernel.

That's a bunch of false assumptions.

There's no reason a hwrng connector to virtio-serial could not be
automatic in a similar way to the console.

But enough of that: It's history now; the guest virtio-rng has existed
for more than a year.  It is also amazingly short and simple.  Yay for Rusty!

I don't object to virtio-rng; I think it's fine in principle and would
be happy to see the existing guests which have a virtio-rng driver
make use of it.

A bit of overlapping functionality is rife in emulators anyway :-)

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Avi Kivity wrote:
 On 04/19/2010 10:14 PM, Gerhard Wiesinger wrote:
 Hello,
 
 Finally I got QEMU-KVM to work but video performance under DOS is very 
 low (QEMU 0.12.3 stable and QEMU GIT master branch is fast, QEMU KVM 
 is slow)
 
 I'm measuring 2 performance critical video performance parameters:
 1.) INT 10h, function AX=4F05h (set same window/set window/get window)
 2.) Memory performance to segment page A000h
 
 So BIOS performance (which might be port performance to VGA 
 index/value port) is about factor 5 slower, memory performance is 
 about factor 100 slower.
 
 QEMU 0.12.3 and QEMU GIT performance is the same (in the measurement 
 tolerance) and listed only once, QEMU KVM is much more slower (details 
 see below).
 
 Test programs can be provided, source code will be release soon.
 
 Any ideas why KVM is so slow? 
 
 16-color vga is slow because kvm cannot map the framebuffer to the guest 
 (writes are not interpreted as RAM writes).  256+-color vga should be 
 fast, except when switching the vga window.  Note it's only fast on 
 average, the first write into a page will be slow as kvm maps it in.

I don't understand: why is 256+-colour mappable and 16-colour not mappable?

Is this a case where TCG would run significantly faster for code blocks
that have been detected to access the VGA memory?

 Which mode are you using?
 
 Any ideas for improvement?
 
 Currently when the physical memory map changes (which is what happens 
 when the vga window is updated), kvm drops the entire shadow cache.  
 It's possible to do this only for vga memory, but not easy.

If it's a page fault handled in the kernel, I would expect it to be
about as fast as those old VGA DOS-extender drivers which provide the
illusion of a single flat mapping, and bank switch on page faults -
multiplied by the speed of modern CPUs compared with then.  For many
graphics things those DOS-extender drivers worked perfectly well.

If it's a trap out to qemu on every vga window change, perhaps not
quite so well.

-- Jamie




Re: [Qemu-devel] Re: [PATCH 04/22] savevm: do_loadvm(): Always resume the VM

2010-04-21 Thread Jamie Lokier
Juan Quintela wrote:
 Luiz Capitulino lcapitul...@redhat.com wrote:
  On Wed, 21 Apr 2010 15:28:16 +0200
  Kevin Wolf kw...@redhat.com wrote:
 I tried a variation of this in the past, and was not a clear agreement.
 
  Basically, after a working migration to another host, you don't want to
  allow cont on the source node (if the target has ever changed anything, it
 would give disk corruption).

This is not true if the target is using a copy of the disk.

Making copies is cheap on some hosts (Linux btrfs with its COW features).

Forking a guest can be handy for testing things, starting from a known
run state.  The only thing to get confused is networking because of
duplicate addresses, and there are host-side ways around that (Linux
network namespaces).

If I understand correctly, we can already do this by migrating to a
file and copying the files.  There's no reason to block the live
equivalent, provided there is a way to copy the disk image when it's
quiesced.

So it's wrong to block cont on the source, but 
cont --I_know_what_I_am_doing might be good advice :-)

 But my suggestion to disable cont after that got complains that people
 wanted a I_know_what_I_am_doing_cont. (not the real syntax).  Perhaps
 it is time to revise this issue?

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Gerhard Wiesinger wrote:
 I'm using VESA mode 0x101 (640x480 256 colors), but performance is 
 there very low (~1MB/s). Test is also WITHOUT any vga window change, so 
 there isn't any page switching overhead involved in this test case.
 
 Any ideas for improvement?
 
 Currently when the physical memory map changes (which is what happens 
 when the vga window is updated), kvm drops the entire shadow cache.  It's 
 possible to do this only for vga memory, but not easy.
 
 I don't think changing VGA window is a problem because there are 
 500.000-1Mio changes/s possible.

1MB/s, 500k-1M changes/s - coincidence?  Is it taking a page fault
or trap on every write?

 Would it be possible to handle these writes through QEMU directly (without 
 KVM), because performance is there very well (looking at the code there 
 is some pointer arithmetic and some memory write done)?

I've noticed extremely slow VGA performance too, when installing OSes.
It makes the difference between installing in a few minutes, and
installing taking hours - just because of the slow VGA.

So generally I use qemu for installing old versions of Windows, then
change to KVM to run them after installing.

Switching between KVM and qemu automatically based on guest code
behaviour, and making both memory models and device models compatible
at run time, is a difficult thing.  I guess it's not worth the
difficulty just to speed up VGA.

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Avi Kivity wrote:
 Writes to vga in 16-color mode don't just set a memory location to a
 value; instead they change multiple memory locations.

While code is just writing to the VGA memory, not reading(*) and not
touching the VGA I/O register that control the write latches, is it
possible in principle to swizzle the format around in memory to make
regular writes work?

(*) Reading should be ok for some settings of the write latches, I
think.

I wonder if guests of interest behave like that.

 Is this a case where TCG would run significantly faster for code blocks
 that have been detected to access the VGA memory?
 
 Yes.

$ date
Wed Apr 21 19:37:38 2015
$ modprobe ktcg
;-)

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Gerhard Wiesinger wrote:
 Would it be possible to handle these writes through QEMU directly 
 (without
 KVM), because performance is there very well (looking at the code there
 is some pointer arithmetic and some memory write done)?
 
 I've noticed extremely slow VGA performance too, when installing OSes.
 It makes the difference between installing in a few minutes, and
 installing taking hours - just because of the slow VGA.
 
 So generally I use qemu for installing old versions of Windows, then
 change to KVM to run them after installing.
 
 Switching between KVM and qemu automatically based on guest code
 behaviour, and making both memory models and device models compatible
 at run time, is a difficult thing.  I guess it's not worth the
 difficulty just to speed up VGA.
 
 I think this is very easy to distingish:
 1.) VGA Segment A000 is legacy and should be handled through QEMU 
 and not through KVM (because it is much more faster). Also 16 color modes 
 should be fast enough there.
 2.) All other flat PCI memory accesses should be handled through KVM 
 (there is a specialized driver loaded for that PCI device in the non 
 legacy OS).
 
 Is that easily possible?

No it isn't.  Distinguishing addresses is trivial.  You've ignored the
hard part, which is switching between different virtualisation
architectures...

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Avi Kivity wrote:
 On 04/21/2010 09:39 PM, Jamie Lokier wrote:
 Avi Kivity wrote:

  Writes to vga in 16-color mode don't just set a memory location to a
  value; instead they change multiple memory locations.
  
 While code is just writing to the VGA memory, not reading(*) and not
 touching the VGA I/O register that control the write latches, is it
 possible in principle to swizzle the format around in memory to make
 regular writes work?

 
 Not in software.  We can map pages, not cross address lines.

Hence swizzle.  You rearrange the data inside the page for the
crossed address lines, and undo the swizzle later on demand.  That
doesn't work for other VGA magic though.

 Guests that use 16 color vga are usually of little interest.

Fair enough.  We can move on :-)

It's been said that the super-slow VGA writes triggering this thread
are in 256-colour mode, so there's a different problem.  That should
be fast, shouldn't it?

I vaguely recall extremely slow OS installs I've seen in KVM, which
were fast in QEMU (and fast in KVM after installing), were using text
mode.  Possibly it was Windows 2000, or Windows Server 2003.  Text
mode should be fast too, shouldn't it?  I suppose it's possible that
it just looked like text mode and was really 16-colour mode.

-- Jamie




Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-21 Thread Jamie Lokier
Gerhard Wiesinger wrote:
 Hmmm. I'm very new to QEMU and KVM, but at least accessing the virtual
 HW of QEMU even from KVM must be possible (e.g. memory and port accesses
 are done on nearly every virtual device), and that ends up in C code in
 the QEMU hw/*.c directory. Therefore the VGA memory area should also be
 accessible from KVM, but with the specialized and fast memory access of
 QEMU.  Am I missing something?

What you're missing is that when KVM calls out to QEMU to handle
hw/*.c traps, that call is very slow.  It's because the hardware-VM
support is a bit slow when the trap happens, and then the call
from KVM in the kernel up to QEMU is a bit slow again.  Then all the
way back.  It adds up to a lot, for every I/O operation.

When QEMU does the same thing, it's fast because it's inside the same
process; it's just a function call.

That's why the most often called devices are emulated separately in
KVM's kernel code, things like the interrupt controller, timer chip
etc.  It's also why individual instructions that need help are
emulated in KVM's kernel code, instead of passing control up to QEMU
just for one instruction.
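
Roughly, the userspace side of that round trip looks like the loop
below (a simplified sketch, not the real QEMU code; handle_mmio() is a
stand-in for the hw/*.c dispatch).  Every trapped guest load or store
pays for a hardware VM exit, a return from the KVM_RUN ioctl, the
dispatch, and a re-entry:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void handle_mmio(uint64_t addr, uint8_t *data, uint32_t len,
                        int is_write)
{
    /* Look up the emulated device covering 'addr' and call its
     * read/write handler -- the hw/*.c code, in QEMU's case. */
}

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);   /* VM exit + syscall return: the slow part */

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:
            /* One trapped load/store = one full trip through here. */
            handle_mmio(run->mmio.phys_addr, run->mmio.data,
                        run->mmio.len, run->mmio.is_write);
            break;
        default:
            break;                    /* port I/O, signals, etc. elided */
        }
    }
}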

 BTW: It's still not clear why performance is low with KVM, since there
 are no window changes involved in the testcase which could cause a
 (slow) page fault.

It sounds like a bug.  Avi gave suggestions about what to look for.
If it fixes my OS install speeds too, I'll be very happy :-)

In 256-colour mode, KVM should be writing to the VGA memory at high
speed a lot like normal RAM, not trapping at the hardware-VM level,
and not calling up to the code in hw/*.c for every byte.

You might double-check if your guest is using VGA Mode X.  (See Wikipedia.)

That was a way to accelerate VGA on real PCs, but it will be slow in
KVM for the same reasons as 16-colour mode.
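
For reference, a typical Mode X pixel write looks something like this
(an assumed example of guest-side code, not taken from any particular
OS): it interleaves sequencer port writes with the memory write, so
under KVM every pixel traps for both the OUTs and the A000 store.

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

/* 320x240 unchained: each plane holds every 4th pixel, 80 bytes/line. */
static void modex_putpixel(volatile uint8_t *vga /* 0xA0000 */,
                           int x, int y, uint8_t colour)
{
    outb(0x3C4, 0x02);                /* sequencer index: Map Mask */
    outb(0x3C5, 1 << (x & 3));        /* enable only this pixel's plane */
    vga[y * 80 + (x >> 2)] = colour;  /* the store into the A000 window */
}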

-- Jamie




Re: [Qemu-devel] [PATCH 2/2] VirtIO RNG

2010-04-20 Thread Jamie Lokier
Ian Molton wrote:
 One last and quite important point - where should the EGD protocol
 implementation go? Really it needs to work as kind of a line discipline,
 but AFAICT that's not supported. It would be a mistake to put rate
 limiting in the chardev layer and leave the EGD protocol implementation
 in the driver.

What do the other hypervisors supporting virtio-rng do?

Personally I'm failing to see why EGD support is needed in Qemu: none
of the crypto services on my Linux machines seem to need it, so why
should Qemu be special?  But I acknowledge there might be some obscure
reason.
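
For reference, there isn't much protocol to put anywhere.  As I
understand the egd.pl/prngd wire format (treat the details below as my
assumption, not a spec quote), a blocking read is a 0x02 command byte
plus a one-byte length, and the daemon replies with exactly that many
bytes of entropy:

#include <stdint.h>
#include <unistd.h>

/* Ask an EGD-style daemon on 'fd' for 'len' bytes of entropy (blocking). */
static ssize_t egd_read_blocking(int fd, uint8_t *buf, uint8_t len)
{
    uint8_t req[2] = { 0x02, len };  /* command, number of bytes wanted */

    if (write(fd, req, sizeof(req)) != sizeof(req)) {
        return -1;
    }
    return read(fd, buf, len);       /* a real client would loop on short reads */
}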

 TBH, for the sake of one very simple driver, and given that apparently
 no other users in qemu seem to want rate-limiting, *I* think that you
 are massively over-complicating matters right now. If more drivers need
 rate limiting, perhaps, but that doesn't seem to be the case.

Rate limiting both networking and serial ports may be a handy little
option sometimes.  IIRC, there are some guests which get confused when
data transfers are gazillions of times faster than they expected, or
gazillions of times more bursty in the case of networking.
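
If rate limiting ever does grow beyond one driver, the mechanism itself
is small.  A plain token bucket along these lines (a hypothetical
sketch; the names and the caller-supplied clock are invented for the
example) would do for a chardev or a NIC alike:

#include <stddef.h>
#include <stdint.h>

struct rate_limit {
    double  bytes_per_sec;  /* configured rate                        */
    double  burst;          /* bucket size: bytes we may accumulate   */
    double  tokens;         /* bytes we are allowed to send right now */
    int64_t last_ns;        /* timestamp of the previous refill       */
};

/* Returns how many of 'want' bytes may be sent now; the caller queues
 * or delays the remainder until more tokens have accumulated. */
static size_t rate_limit_allow(struct rate_limit *rl, int64_t now_ns,
                               size_t want)
{
    rl->tokens += rl->bytes_per_sec * (double)(now_ns - rl->last_ns) / 1e9;
    if (rl->tokens > rl->burst) {
        rl->tokens = rl->burst;
    }
    rl->last_ns = now_ns;

    size_t allowed = want < (size_t)rl->tokens ? want : (size_t)rl->tokens;
    rl->tokens -= allowed;
    return allowed;
}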

  We already have a virtual serial port implementation designed for 
  exactly this kind of application.
 
 Except that it doesn't speak to the kernel's virtio-rng implementation.
 And that interface is not going to change just because you don't like
 it. (Unless you'd like to rewrite the kernel's hwrng core? Feel free! I'm
 sure it'd be appreciated - but if not, then don't complain.)

Would it be much work to change the guest to use virtio-serial
instead?  Would it fit the problem or does virtio-rng need more
metadata than just a bytestream?

-- Jamie



