Re: [Qemu-devel] coroutines and block I/O considerations

2011-07-25 Thread Paolo Bonzini

On 07/19/2011 12:57 PM, Stefan Hajnoczi wrote:

 From what I understand, "committed" on Windows means that physical
pages have been allocated and pagefile space has been set aside:
http://msdn.microsoft.com/en-us/library/ms810627.aspx


Yes, memory that is reserved on Windows is just a contiguous part of 
the address space that is set aside, like MAP_NORESERVE under Linux. 
Memory that is committed is really allocated.



The question is how can we get the same effect on Windows and does the
current Fibers implementation not already work?


Windows thread and fiber stacks have both a reserved and a committed 
part.  The dwStackSize argument to CreateFiber indeed represents 
_committed_ stack size, so we're now committing 4 MB of stack per fiber. 
 The maximum size that the stack can grow to is set to the 
(per-executable) default.


If you want to specify both the reserved and committed stack sizes, you 
can do that with CreateFiberEx.


http://msdn.microsoft.com/en-us/library/ms682406%28v=vs.85%29.aspx

4 MB is quite a lot of address space to waste on a thread anyway.  A 
coroutine should not need that much, even on Linux.  I think for Windows 
64 KB of initial stack size and 1 MB of maximum size should do (for 
Linux it would be 1 MB overall).


Paolo



Re: [Qemu-devel] coroutines and block I/O considerations

2011-07-25 Thread Stefan Hajnoczi
On Mon, Jul 25, 2011 at 9:56 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 07/19/2011 12:57 PM, Stefan Hajnoczi wrote:

  From what I understand, "committed" on Windows means that physical
 pages have been allocated and pagefile space has been set aside:
 http://msdn.microsoft.com/en-us/library/ms810627.aspx

 Yes, memory that is reserved on Windows is just a contiguous part of the
 address space that is set aside, like MAP_NORESERVE under Linux. Memory that
 is committed is really allocated.

 The question is how can we get the same effect on Windows and does the
 current Fibers implementation not already work?

 Windows thread and fiber stacks have both a reserved and a committed part.
  The dwStackSize argument to CreateFiber indeed represents _committed_ stack
 size, so we're now committing 4 MB of stack per fiber.  The maximum size
 that the stack can grow to is set to the (per-executable) default.

 If you want to specify both the reserved and committed stack sizes, you can
 do that with CreateFiberEx.

 http://msdn.microsoft.com/en-us/library/ms682406%28v=vs.85%29.aspx

 4 MB is quite a lot of address space to waste on a thread anyway.  A
 coroutine should not need that much, even on Linux.  I think for Windows 64
 KB of initial stack size and 1 MB of maximum size should do (for Linux it
 would be 1 MB overall).

I agree, let's make sure not to commit all this memory upfront.

Stefan



[Qemu-devel] coroutines and block I/O considerations

2011-07-19 Thread Frediano Ziglio
Hi,
  I'm exercising myself in the block I/O layer and I decided to test
the coroutine branch because I find it easier to use than the normal
callbacks. Looking at the normal code there are a lot of lines in the
source to save/restore state and declare callbacks, and it is not that
easy to understand the normal flow. In the end I would like to create
a new image format to get rid of some performance problems I encounter
using writethrough and snapshots. I have some questions regarding
block I/O and also coroutines.

1- threading model. I don't understand it. I can see that the aio pool
routines do not contain locking code, so I think the aio layer is
mainly executed in a single thread. I saw the introduction of some
locking using coroutines, so I think coroutines are now called from
different threads and need locking (the current implementation
serializes all device operations).

2- memory considerations on coroutines. Besides allowing more readable
code, I wonder if somebody has considered memory. For every coroutine
a separate stack has to be allocated. For instance the ucontext and
win32 implementations use 4 MB. Assuming 128 concurrent AIO requests
this requires about 512 MB of RAM (mostly only committed but not used,
and coroutines are reused).

About snapshots and block I/O, I think that using external snapshots
would make some things easier. By external snapshot I mean creating a
new image with the current image file as backing file and using this
new image for future operations. This would allow, for instance:
- supporting snapshots with every format (even raw)
- making snapshot backups using external programs (even from different
hosts using a clustered file system and without many locking issues,
as the original image is now read-only)
- converting images live (just snapshot, qemu-img convert, remove
snapshot)
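The external snapshot flow I mean could be sketched with qemu-img like
this (file names are just examples):

```shell
# Hypothetical file names; requires qemu-img in PATH.
qemu-img create -f qcow2 disk.qcow2 1G             # original image
qemu-img create -f qcow2 -b disk.qcow2 snap.qcow2  # external snapshot on top

# From now on the guest writes to snap.qcow2; disk.qcow2 is effectively
# read-only, so it can be backed up or converted while the guest runs:
qemu-img convert -O qcow2 disk.qcow2 backup.qcow2
```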

Regards
  Frediano



Re: [Qemu-devel] coroutines and block I/O considerations

2011-07-19 Thread Kevin Wolf
On 19.07.2011 10:06, Frediano Ziglio wrote:
   I'm exercising myself in the block I/O layer and I decided to test
 the coroutine branch because I find it easier to use than the normal
 callbacks. Looking at the normal code there are a lot of lines in the
 source to save/restore state and declare callbacks, and it is not
 that easy to understand the normal flow. 

Yes. This is one of the reasons why we're trying to switch to
coroutines. QED is a prototype for a fully asynchronous callback-based
image format, and sometimes it's really hard to follow its code paths.
That the real functionality gets lost in the noise of transferring state
doesn't really help with readability either.

 In the end I would like to create a new
 image format to get rid of some performance problems I encounter using
 writethrough and snapshots. I have some questions regarding block I/O
 and also coroutines.

No. A new image format is the wrong answer, whatever the question may
be. :-)

If writethrough doesn't perform well with the existing format drivers,
fix the existing format drivers. You need very good reasons to convince
me that qcow2 can't do what your new format could do.

The solution for slow writethrough mode in qcow2 is probably to make
requests parallel, even if they touch metadata. This is a change that
becomes possible relatively easily once we have switched to coroutines.

What exactly is the problem with snapshots? Saving/loading internal
snapshots is too slow, or general performance with an image that has
snapshots? I think Luiz reported the first one a while ago, and it
should be easy enough to fix (use Qcow2Cache in writeback mode during
the refcount update).

 1- threading model. I don't understand it. I can see that the aio pool
 routines do not contain locking code, so I think the aio layer is
 mainly executed in a single thread. I saw the introduction of some
 locking using coroutines, so I think coroutines are now called from
 different threads and need locking (the current implementation
 serializes all device operations).

You can view coroutines as threads with cooperative scheduling. That is,
unlike threads a coroutine is never interrupted by a scheduler, but it
can only call qemu_coroutine_yield(), which transfers control to a
different coroutine. Compared to threads this simplifies locking a bit
because you exactly know at which point other code may run.

But of course, even though you know where it happens, you still have
other code running in the middle of your function, so there can be a
need to lock things, which is why there are things like a CoMutex.

They are still all running in the same thread.

 2- memory considerations on coroutines. Besides allowing more
 readable code, I wonder if somebody has considered memory. For every
 coroutine a separate stack has to be allocated. For instance the
 ucontext and win32 implementations use 4 MB. Assuming 128 concurrent
 AIO requests this requires about 512 MB of RAM (mostly only committed
 but not used, and coroutines are reused).

128 concurrent requests is a lot. And even then, it's only virtual
memory. I doubt that we're actually using much more than we do in the
old code with the AIOCBs (which will disappear and become local
variables when we complete the conversion).

 About snapshots and block I/O, I think that using external snapshots
 would make some things easier. By external snapshot I mean creating a
 new image with the current image file as backing file and using this
 new image for future operations. This would allow, for instance:
 - supporting snapshots with every format (even raw)
 - making snapshot backups using external programs (even from different
 hosts using a clustered file system and without many locking issues,
 as the original image is now read-only)
 - converting images live (just snapshot, qemu-img convert, remove
 snapshot)

These are things that are actively worked on. snapshot_blkdev is a
monitor command that already exists and does exactly what you describe.
For the rest, live block copy and image streaming are the keywords that
you should be looking for. We've had quite some discussions on these in
the past few weeks. You may also be interested in this wiki page:
http://wiki.qemu.org/Features/LiveBlockMigration

Kevin



Re: [Qemu-devel] coroutines and block I/O considerations

2011-07-19 Thread Stefan Hajnoczi
On Tue, Jul 19, 2011 at 11:10 AM, Kevin Wolf kw...@redhat.com wrote:
 On 19.07.2011 10:06, Frediano Ziglio wrote:
 2- memory considerations on coroutines. Besides allowing more
 readable code, I wonder if somebody has considered memory. For every
 coroutine a separate stack has to be allocated. For instance the
 ucontext and win32 implementations use 4 MB. Assuming 128 concurrent
 AIO requests this requires about 512 MB of RAM (mostly only committed
 but not used, and coroutines are reused).

 128 concurrent requests is a lot. And even then, it's only virtual
 memory. I doubt that we're actually using much more than we do in the
 old code with the AIOCBs (which will disappear and become local
 variables when we complete the conversion).

From what I understand, "committed" on Windows means that physical
pages have been allocated and pagefile space has been set aside:
http://msdn.microsoft.com/en-us/library/ms810627.aspx

On Linux memory is overcommitted and will not require swap space or
any actual pages.  This behavior can be configured differently IIRC
but the default is to be lazy about claiming memory resources so that
even 4 MB thread/coroutine stacks are not an issue.

The question is how can we get the same effect on Windows and does the
current Fibers implementation not already work?

Stefan



Re: [Qemu-devel] coroutines and block I/O considerations

2011-07-19 Thread Anthony Liguori

On 07/19/2011 05:10 AM, Kevin Wolf wrote:

On 19.07.2011 10:06, Frediano Ziglio wrote:
They are still all running in the same thread.


2- memory considerations on coroutines. Besides allowing more readable
code, I wonder if somebody has considered memory. For every coroutine
a separate stack has to be allocated. For instance the ucontext and
win32 implementations use 4 MB. Assuming 128 concurrent AIO requests
this requires about 512 MB of RAM (mostly only committed but not used,
and coroutines are reused).


128 concurrent requests is a lot. And even then, it's only virtual
memory. I doubt that we're actually using much more than we do in the
old code with the AIOCBs (which will disappear and become local
variables when we complete the conversion).


A 4 MB stack is probably overkill anyway.  It's easiest to just start 
with a large stack and then, once all of the functionality is worked 
out, optimize to a smaller stack.


The same problem exists with using threads FWIW since the default thread 
stack is usually quite large.


Regards,

Anthony Liguori