Re: [Qemu-devel] coroutines and block I/O considerations
On 07/19/2011 12:57 PM, Stefan Hajnoczi wrote:
> From what I understand committed on Windows means that physical pages
> have been allocated and pagefile space has been set aside:
> http://msdn.microsoft.com/en-us/library/ms810627.aspx

Yes, memory that is reserved on Windows is just a contiguous part of the
address space that is set aside, like MAP_NORESERVE under Linux. Memory
that is committed is really allocated.

> The question is how can we get the same effect on Windows and does the
> current Fibers implementation not already work?

Windows thread and fiber stacks have both a reserved and a committed
part. The dwStackSize argument to CreateFiber indeed represents
_committed_ stack size, so we're now committing 4 MB of stack per fiber.
The maximum size that the stack can grow to is set to the
(per-executable) default.

If you want to specify both the reserved and committed stack sizes, you
can do that with CreateFiberEx.

http://msdn.microsoft.com/en-us/library/ms682406%28v=vs.85%29.aspx

4 MB is quite a lot of address space anyway to waste for a thread. A
coroutine should not need that much, even on Linux. I think for Windows
64 KB of initial stack size and 1 MB of maximum size should do (for
Linux it would be 1 MB overall).

Paolo
Re: [Qemu-devel] coroutines and block I/O considerations
On Mon, Jul 25, 2011 at 9:56 AM, Paolo Bonzini pbonz...@redhat.com wrote:
> On 07/19/2011 12:57 PM, Stefan Hajnoczi wrote:
>> From what I understand committed on Windows means that physical pages
>> have been allocated and pagefile space has been set aside:
>> http://msdn.microsoft.com/en-us/library/ms810627.aspx
>
> Yes, memory that is reserved on Windows is just a contiguous part of the
> address space that is set aside, like MAP_NORESERVE under Linux. Memory
> that is committed is really allocated.
>
>> The question is how can we get the same effect on Windows and does the
>> current Fibers implementation not already work?
>
> Windows thread and fiber stacks have both a reserved and a committed
> part. The dwStackSize argument to CreateFiber indeed represents
> _committed_ stack size, so we're now committing 4 MB of stack per fiber.
> The maximum size that the stack can grow to is set to the
> (per-executable) default.
>
> If you want to specify both the reserved and committed stack sizes, you
> can do that with CreateFiberEx.
>
> http://msdn.microsoft.com/en-us/library/ms682406%28v=vs.85%29.aspx
>
> 4 MB is quite a lot of address space anyway to waste for a thread. A
> coroutine should not need that much, even on Linux. I think for Windows
> 64 KB of initial stack size and 1 MB of maximum size should do (for
> Linux it would be 1 MB overall).

I agree, let's make sure not to commit all this memory upfront.

Stefan
[Qemu-devel] coroutines and block I/O considerations
Hi,
I'm exercising myself in the block I/O layer, and I decided to test the
coroutine branch because I find it easier to use than the normal
callbacks. Looking at the normal code, a lot of the source lines are
spent saving/restoring state and declaring callbacks, and it is not easy
to follow the normal flow. In the end I would like to create a new image
format to get rid of some performance problems I encounter using
writethrough and snapshots.

I have some questions regarding block I/O and also coroutines:

1- threading model. I don't understand it. I can see that the aio pool
routines do not contain locking code, so I think the aio layer is mainly
executed in a single thread. I saw the introduction of some locking
using coroutines, so I think coroutines are now called from different
threads and need locks (the current implementation serializes all device
operations).

2- memory considerations on coroutines. Besides coroutines allowing more
readable code, I wonder if somebody has considered memory use. Every
coroutine needs its own stack. For instance, the ucontext and win32
implementations use 4 MB. Assuming 128 concurrent AIO requests, this
requires about 512 MB of RAM (mostly only committed but not used, and
coroutines are reused).

About snapshots and block I/O, I think that using external snapshots
would make some things easier. By external snapshot I mean creating a
new image whose backing file is the current image file and using this
new image for future operations. This would allow, for instance:
- supporting snapshots with every format (even raw)
- making snapshot backups using external programs (even from different
  hosts using a clustered file system, and without many locking issues,
  as the original image is now read-only)
- converting images live (just snapshot, qemu-img convert, remove
  snapshot)

Regards
Frediano
Re: [Qemu-devel] coroutines and block I/O considerations
Am 19.07.2011 10:06, schrieb Frediano Ziglio:
> I'm exercising myself in the block I/O layer, and I decided to test the
> coroutine branch because I find it easier to use than the normal
> callbacks. Looking at the normal code, a lot of the source lines are
> spent saving/restoring state and declaring callbacks, and it is not
> easy to follow the normal flow.

Yes. This is one of the reasons why we're trying to switch to
coroutines. QED is a prototype for a fully asynchronous callback-based
image format, and sometimes it's really hard to follow its code paths.
That the real functionality gets lost in the noise of transferring state
doesn't really help with readability either.

> In the end I would like to create a new image format to get rid of some
> performance problems I encounter using writethrough and snapshots. I
> have some questions regarding block I/O and also coroutines.

No. A new image format is the wrong answer, whatever the question may
be. :-)

If writethrough doesn't perform well with the existing format drivers,
fix the existing format drivers. You need very good reasons to convince
me that qcow2 can't do what your new format could do.

The solution for slow writethrough mode in qcow2 is probably to make
requests parallel, even if they touch metadata. This is a change that
becomes possible relatively easily once we have switched to coroutines.

What exactly is the problem with snapshots? Saving/loading internal
snapshots is too slow, or general performance with an image that has
snapshots? I think Luiz reported the first one a while ago, and it
should be easy enough to fix (use Qcow2Cache in writeback mode during
the refcount update).

> 1- threading model. I don't understand it. I can see that the aio pool
> routines do not contain locking code, so I think the aio layer is
> mainly executed in a single thread. I saw the introduction of some
> locking using coroutines, so I think coroutines are now called from
> different threads and need locks (the current implementation serializes
> all device operations).

You can view coroutines as threads with cooperative scheduling. That is,
unlike threads, a coroutine is never interrupted by a scheduler; it can
only call qemu_coroutine_yield(), which transfers control to a different
coroutine.

Compared to threads this simplifies locking a bit because you know
exactly at which points other code may run. But of course, even though
you know where it happens, you have other code running in the middle of
your function, so there can still be a need to lock things, which is why
there are things like CoMutex.

They are still all running in the same thread.

> 2- memory considerations on coroutines. Besides coroutines allowing
> more readable code, I wonder if somebody has considered memory use.
> Every coroutine needs its own stack. For instance, the ucontext and
> win32 implementations use 4 MB. Assuming 128 concurrent AIO requests,
> this requires about 512 MB of RAM (mostly only committed but not used,
> and coroutines are reused).

128 concurrent requests is a lot. And even then, it's only virtual
memory. I doubt that we're actually using much more than we do in the
old code with the AIOCBs (which will disappear and become local
variables when we complete the conversion).

> About snapshots and block I/O, I think that using external snapshots
> would make some things easier. By external snapshot I mean creating a
> new image whose backing file is the current image file and using this
> new image for future operations. This would allow, for instance:
> - supporting snapshots with every format (even raw)
> - making snapshot backups using external programs (even from different
>   hosts using a clustered file system, and without many locking issues,
>   as the original image is now read-only)
> - converting images live (just snapshot, qemu-img convert, remove
>   snapshot)

These are things that are actively being worked on.

snapshot_blkdev is a monitor command that already exists and does
exactly what you describe. For the rest, live block copy and image
streaming are the keywords that you should be looking for. We've had
quite a few discussions on these in the past few weeks.

You may also be interested in this wiki page:
http://wiki.qemu.org/Features/LiveBlockMigration

Kevin
Re: [Qemu-devel] coroutines and block I/O considerations
On Tue, Jul 19, 2011 at 11:10 AM, Kevin Wolf kw...@redhat.com wrote:
> Am 19.07.2011 10:06, schrieb Frediano Ziglio:
>> 2- memory considerations on coroutines. Besides coroutines allowing
>> more readable code, I wonder if somebody has considered memory use.
>> Every coroutine needs its own stack. For instance, the ucontext and
>> win32 implementations use 4 MB. Assuming 128 concurrent AIO requests,
>> this requires about 512 MB of RAM (mostly only committed but not used,
>> and coroutines are reused).
>
> 128 concurrent requests is a lot. And even then, it's only virtual
> memory. I doubt that we're actually using much more than we do in the
> old code with the AIOCBs (which will disappear and become local
> variables when we complete the conversion).

From what I understand, committed on Windows means that physical pages
have been allocated and pagefile space has been set aside:
http://msdn.microsoft.com/en-us/library/ms810627.aspx

On Linux memory is overcommitted and will not require swap space or any
actual pages. This behavior can be configured differently IIRC, but the
default is to be lazy about claiming memory resources, so that even 4 MB
thread/coroutine stacks are not an issue.

The question is how we can get the same effect on Windows, and does the
current Fibers implementation not already work?

Stefan
Re: [Qemu-devel] coroutines and block I/O considerations
On 07/19/2011 05:10 AM, Kevin Wolf wrote:
> Am 19.07.2011 10:06, schrieb Frediano Ziglio:
>
> They are still all running in the same thread.
>
>> 2- memory considerations on coroutines. Besides coroutines allowing
>> more readable code, I wonder if somebody has considered memory use.
>> Every coroutine needs its own stack. For instance, the ucontext and
>> win32 implementations use 4 MB. Assuming 128 concurrent AIO requests,
>> this requires about 512 MB of RAM (mostly only committed but not used,
>> and coroutines are reused).
>
> 128 concurrent requests is a lot. And even then, it's only virtual
> memory. I doubt that we're actually using much more than we do in the
> old code with the AIOCBs (which will disappear and become local
> variables when we complete the conversion).

A 4 MB stack is probably overkill anyway. It's easiest to just start
with a large stack and then, once all of the functionality is worked
out, optimize to a smaller stack.

The same problem exists with threads, FWIW, since the default thread
stack is usually quite large.

Regards,

Anthony Liguori