Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
Avi Kivity wrote:

At such a tiny difference, I'm wondering why Linux-AIO exists at all, as it complicates the kernel rather a lot. I can see the theoretical appeal, but if performance is so marginal, I'm surprised it's in there.

Linux aio exists, but that's all that can be said for it. It works mostly for raw disks, doesn't integrate with networking, and doesn't advance at the same pace as the rest of the kernel. I believe only databases use it (and a userspace filesystem I wrote some time ago).

And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)

Could use threads as well, no?

I'm also surprised the Glibc implementation of AIO using ordinary threads is so close to it.

Why are you surprised?

Because I've read that Glibc AIO (which uses a thread pool) is a relatively poor performer as AIO implementations go, and is only there for API compatibility, not suggested for performance. But I read that quite a while ago; perhaps it's changed.

It's me at fault here. I just assumed that because it's easy to do aio in a thread pool efficiently, that's what glibc does. Unfortunately the code does some ridiculous things like not service multiple requests on a single fd in parallel. I see absolutely no reason for it (the code says "fight for resources"). So my comments only apply to linux-aio vs a sane thread pool. Sorry for spreading confusion.

Actually the glibc implementation could be improved from what I've heard. My estimates are for a thread pool implementation, but there is no reason why glibc couldn't achieve exactly the same performance.

Erm... I thought you said it _does_ achieve nearly the same performance, not that it _could_.
Do you mean it could achieve exactly the same performance by using Linux AIO when possible?

It could and should. It probably doesn't. A simple thread pool implementation could come within 10% of Linux aio for most workloads. It will never be exactly the same, but for small numbers of disks, close enough.

And then, I'm wondering why use AIO at all: it suggests QEMU would run about as fast doing synchronous I/O in a few dedicated I/O threads. Posix aio is the unix API for this, why not use it?

Because far more host platforms have threads than have POSIX AIO. (I suspect both options will end up supported in the end, as dedicated I/O threads were already suggested for other things.)

Agree.

Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN.

Does the separate LUN make any difference? I thought O_DIRECT on a filesystem was meant to be pretty close to block device performance.

On a good extent-based filesystem like XFS you will get good performance (though more cpu overhead due to needing to go through additional mapping layers). Old clunkers like ext3 will require additional seeks or a ton of cache (1 GB per 1 TB).

Hmm. Thanks. I may consider switching to XFS now.

I'm rooting for btrfs myself.

-- error compiling committee.c: too many arguments to function

- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
Avi Kivity wrote:

And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)

Could use threads as well, no?

Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.

Regards, Anthony Liguori
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
Avi Kivity wrote:

And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)

Could use threads as well, no?

Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

Yes, unless the implementation in the kernel (or glibc) is threaded.

With threads this isn't guaranteed and scheduling makes it quite likely to issue the parallel synchronous reads out of order, and for them to reach the disk out of order because the elevator doesn't see them simultaneously.

If the disk is busy, it doesn't matter. The requests will queue and the elevator will sort them out. So it's just the first few requests that may get to disk out of order.

With AIO (non-Glibc! (and non-kthreads)) it might be better at keeping the intended issue order, I'm not sure. It is highly desirable: O_DIRECT streaming performance depends on avoiding seeks (no reordering) and on keeping the request queue non-empty (no gap). I read a man page for some other unix, describing AIO as better than threaded parallel reads for reading tape drives because of this (tape seeks are very expensive). But the rest of the man page didn't say anything more. Unfortunately I don't remember where I read it. I have no idea whether AIO submission order is nearly always preserved in general, or expected to be.

I haven't considered tape, but this is a good point indeed. I expect it doesn't make much of a difference for a loaded disk.

It's me at fault here. I just assumed that because it's easy to do aio in a thread pool efficiently, that's what glibc does. Unfortunately the code does some ridiculous things like not service multiple requests on a single fd in parallel. I see absolutely no reason for it (the code says "fight for resources").

Ouch. Perhaps that relates to my thought above, about multiple requests to the same file causing seek storms when thread scheduling is unlucky?

My first thought on seeing this is that it relates to a deficiency on older kernels servicing multiple requests on a single fd (i.e. a per-file lock). I don't know if such a deficiency ever existed, though.

It could and should. It probably doesn't. A simple thread pool implementation could come within 10% of Linux aio for most workloads. It will never be exactly the same, but for small numbers of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential reading and writing, because of potential for seeks caused by request reordering, before being confident of that.

I did have measurements (and a test rig) at a previous job (where I did a lot of I/O work); IIRC the performance of a tuned thread pool was not far behind aio, both for seeks and sequential. It was a while back though.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote:

If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee. With a thread pool if the threads are scheduled out of order, so are your requests. If the elevator doesn't plug the queue, the first few requests may not be optimally sorted.

-- error compiling committee.c: too many arguments to function
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote:

Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.

You misunderstand. I'm not talking about guarantees, I'm talking about expectations for the performance effect.

Basically, to do performant streaming read with O_DIRECT you need two things:

1. Overlap at least 2 requests, so the device is kept busy.
2. Requests be sent to the disk in a good order, which is usually (but not always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead. It works very well, unless you have other problems caused by readahead. With O_DIRECT, an application has to do the equivalent of readahead itself to get performant streaming.

If the app uses two threads calling pread(), it's hard to ensure the kernel even _sees_ the first two calls in sequential offset order. You spawn two threads, and then both threads call pread() with non-deterministic scheduling. The problem starts before even entering the kernel. Then, depending on I/O scheduling in the kernel, it might send the less good pread() to the disk immediately, then later a backward head seek and the other one. The elevator cannot fix this: it doesn't have enough information, unless it adds artificial delays. But artificial delays may harm too; it's not optimal. After that, the two threads tend to call pread() in the best order provided there's no scheduling conflicts, but are easily disrupted by other tasks, especially on SMP (one reading thread per CPU, so when one of them is descheduled, the other continues and issues a request in the 'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can be sure the kernel receives aio_read() calls in the exact order which is most likely to perform well. Application knowledge of its access pattern is passed along better. As I've said, I saw a man page which described why this makes AIO superior to using threads for reading tapes on that OS. So it's not a completely spurious point. This has nothing to do with guarantees.

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
Anthony Liguori wrote:

If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee. With a thread pool if the threads are scheduled out of order, so are your requests. If the elevator doesn't plug the queue, the first few requests may not be optimally sorted.

That's right. Then they tend to settle to a good order. But any delay in scheduling one of the threads, or a signal received by one of them, can make it lose order briefly, making the streaming stutter as the disk performs a few local seeks until it settles to good order again.

You can mitigate the disruption in various ways.

1. If all threads share an offset variable, and each reads and increments it atomically just prior to calling pread(), that helps especially at the start. (If threaded I/O is used for QEMU disk emulation, I would suggest doing that, in the more general form of popping a request from QEMU's internal shared queue at the last moment.)

2. Using more threads helps keep it sustained, at the cost of more wasted I/O when there's a cancellation (changed mind), and more memory.

However, AIO, in principle (if not implementations...) could be better at keeping the suggested I/O order than threads, without special tricks.

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:

Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

Yes, unless the implementation in the kernel (or glibc) is threaded.

With threads this isn't guaranteed and scheduling makes it quite likely to issue the parallel synchronous reads out of order, and for them to reach the disk out of order because the elevator doesn't see them simultaneously.

If the disk is busy, it doesn't matter. The requests will queue and the elevator will sort them out. So it's just the first few requests that may get to disk out of order.

There are two cases where it matters to a read-streaming app:

1. Disk isn't busy with anything else, maximum streaming performance is desired.

2. Disk is busy with unrelated things, but you're using I/O priorities to give the streaming app near-absolute priority. Then you need to maintain overlapped streaming requests, otherwise the disk is given to a lower priority I/O. If that happens often, you lose, priority is ineffective. Because one of the streaming requests is usually being serviced, the elevator has similar limitations as for a disk which is not busy with anything else.

I haven't considered tape, but this is a good point indeed. I expect it doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O priority, the elevator has time to sort requests and hide thread scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's elevator, then submits them to the host's elevator. If the guest and host elevators are both configured 'anticipatory', do the anticipatory delays add up?

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:

And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)

Could use threads as well, no?

Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite likely to issue the parallel synchronous reads out of order, and for them to reach the disk out of order because the elevator doesn't see them simultaneously.

With AIO (non-Glibc! (and non-kthreads)) it might be better at keeping the intended issue order, I'm not sure. It is highly desirable: O_DIRECT streaming performance depends on avoiding seeks (no reordering) and on keeping the request queue non-empty (no gap).

I read a man page for some other unix, describing AIO as better than threaded parallel reads for reading tape drives because of this (tape seeks are very expensive). But the rest of the man page didn't say anything more. Unfortunately I don't remember where I read it. I have no idea whether AIO submission order is nearly always preserved in general, or expected to be.

It's me at fault here. I just assumed that because it's easy to do aio in a thread pool efficiently, that's what glibc does. Unfortunately the code does some ridiculous things like not service multiple requests on a single fd in parallel. I see absolutely no reason for it (the code says "fight for resources").

Ouch. Perhaps that relates to my thought above, about multiple requests to the same file causing seek storms when thread scheduling is unlucky?

So my comments only apply to linux-aio vs a sane thread pool. Sorry for spreading confusion.

Thanks. I thought you'd measured it :-)

It could and should. It probably doesn't. A simple thread pool implementation could come within 10% of Linux aio for most workloads. It will never be exactly the same, but for small numbers of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential reading and writing, because of potential for seeks caused by request reordering, before being confident of that.

Hmm. Thanks. I may consider switching to XFS now

I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll be happy to give it a try! :-)

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
On Tue, Apr 22, 2008 at 3:10 AM, Avi Kivity [EMAIL PROTECTED] wrote: I'm rooting for btrfs myself. but could btrfs (when stable) work for migration? i'm curious about OCFS2 performance on this kind of load... when i manage to sell the idea of a KVM cluster i'd like to know if i should try first EVMS-HA (cluster LV's) or OCFS (cluster FS) -- Javier
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
Avi Kivity wrote:

Does that mean for the majority of deployments, the slow version is sufficient. The few that care about performance can use Linux AIO?

In essence, yes. s/slow/slower/ and s/performance/ultimate block device performance/. Many deployments don't care at all about block device performance; they care mostly about networking performance.

That's interesting. I'd have expected block device performance to be important for most things, for the same reason that disk performance is (well, reasonably) important for non-virtual machines.

Seek time is important. Bandwidth is somewhat important. But for one- and two-spindle workloads (the majority), the cpu utilization induced by getting requests to the disk is not important, and that's what we're optimizing here. Disks work at around 300 Hz. Processors at around 3 GHz. That's seven orders of magnitude difference. Even if you spent 100 usec calculating what's the next best seek, even if it saves you only 10% of seeks it's a win. And of course modern processors spend a few microseconds at most getting a request out. You really need 50+ disks or a large write-back cache to make microoptimizations around the submission path felt. But as you say next:

I'm under the impression that the entire and only point of Linux AIO is that it's faster than POSIX AIO on Linux.

It is. I estimate posix aio adds a few microseconds above linux aio per I/O request, when using O_DIRECT. Assuming 10 microseconds, you will need 10,000 I/O requests per second per vcpu to have a 10% performance difference. That's definitely rare.

Oh, I didn't realise the difference was so small. At such a tiny difference, I'm wondering why Linux-AIO exists at all, as it complicates the kernel rather a lot. I can see the theoretical appeal, but if performance is so marginal, I'm surprised it's in there.

Linux aio exists, but that's all that can be said for it. It works mostly for raw disks, doesn't integrate with networking, and doesn't advance at the same pace as the rest of the kernel. I believe only databases use it (and a userspace filesystem I wrote some time ago).

I'm also surprised the Glibc implementation of AIO using ordinary threads is so close to it.

Why are you surprised? Actually the glibc implementation could be improved from what I've heard. My estimates are for a thread pool implementation, but there is no reason why glibc couldn't achieve exactly the same performance.

And then, I'm wondering why use AIO at all: it suggests QEMU would run about as fast doing synchronous I/O in a few dedicated I/O threads. Posix aio is the unix API for this, why not use it?

Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN.

Does the separate LUN make any difference? I thought O_DIRECT on a filesystem was meant to be pretty close to block device performance.

On a good extent-based filesystem like XFS you will get good performance (though more cpu overhead due to needing to go through additional mapping layers). Old clunkers like ext3 will require additional seeks or a ton of cache (1 GB per 1 TB).

I base this on messages here and there which say swapping to a file is about as fast as swapping to a block device, nowadays.

Swapping to a file preloads the block mapping into memory, so the filesystem is not involved at all in the I/O path.

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Javier Guerra Giraldez wrote: On Sunday 20 April 2008, Avi Kivity wrote: Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN. i think that too; but still that LUN would be accessed by the VM's via one of these IO emulation layers, right? Yes. Hopefully Linux aio. or maybe you're advocating using the SAN initiator in the VM instead of the host? That works too, especially for iSCSI, but that's not what I'm advocating. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:

At such a tiny difference, I'm wondering why Linux-AIO exists at all, as it complicates the kernel rather a lot. I can see the theoretical appeal, but if performance is so marginal, I'm surprised it's in there.

Linux aio exists, but that's all that can be said for it. It works mostly for raw disks, doesn't integrate with networking, and doesn't advance at the same pace as the rest of the kernel. I believe only databases use it (and a userspace filesystem I wrote some time ago).

And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)

I'm also surprised the Glibc implementation of AIO using ordinary threads is so close to it.

Why are you surprised?

Because I've read that Glibc AIO (which uses a thread pool) is a relatively poor performer as AIO implementations go, and is only there for API compatibility, not suggested for performance. But I read that quite a while ago; perhaps it's changed.

Actually the glibc implementation could be improved from what I've heard. My estimates are for a thread pool implementation, but there is no reason why glibc couldn't achieve exactly the same performance.

Erm... I thought you said it _does_ achieve nearly the same performance, not that it _could_. Do you mean it could achieve exactly the same performance by using Linux AIO when possible?

And then, I'm wondering why use AIO at all: it suggests QEMU would run about as fast doing synchronous I/O in a few dedicated I/O threads. Posix aio is the unix API for this, why not use it?

Because far more host platforms have threads than have POSIX AIO. (I suspect both options will end up supported in the end, as dedicated I/O threads were already suggested for other things.)

Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN.

Does the separate LUN make any difference? I thought O_DIRECT on a filesystem was meant to be pretty close to block device performance.

On a good extent-based filesystem like XFS you will get good performance (though more cpu overhead due to needing to go through additional mapping layers). Old clunkers like ext3 will require additional seeks or a ton of cache (1 GB per 1 TB).

Hmm. Thanks. I may consider switching to XFS now

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:

For the majority of deployments posix aio should be sufficient. The few that need something else can use Linux aio.

Does that mean for the majority of deployments, the slow version is sufficient. The few that care about performance can use Linux AIO? I'm under the impression that the entire and only point of Linux AIO is that it's faster than POSIX AIO on Linux.

Of course, a managed environment can use Linux aio unconditionally if it knows the kernel has all the needed goodies.

Does that mean a managed environment can have some code which checks the host kernel version + filesystem type holding the VM image, to conditionally enable Linux AIO? (Since if you care about performance, which is the sole reason for using Linux AIO, you wouldn't want to enable Linux AIO on any host in your cluster where it will trash performance.)

Just wondering. Thanks,

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote: Avi Kivity wrote: For the majority of deployments posix aio should be sufficient. The few that need something else can use Linux aio. Does that mean for the majority of deployments, the slow version is sufficient. The few that care about performance can use Linux AIO? In essence, yes. s/slow/slower/ and s/performance/ultimate block device performance/. Many deployments don't care at all about block device performance; they care mostly about networking performance. I'm under the impression that the entire and only point of Linux AIO is that it's faster than POSIX AIO on Linux. It is. I estimate posix aio adds a few microseconds above linux aio per I/O request, when using O_DIRECT. Assuming 10 microseconds, you will need 10,000 I/O requests per second per vcpu to have a 10% performance difference. That's definitely rare. Of course, a managed environment can use Linux aio unconditionally if knows the kernel has all the needed goodies. Does that mean a managed environment can have some code which check the host kernel version + filesystem type holding the VM image, to conditionally enable Linux AIO? (Since if you care about performance, which is the sole reason for using Linux AIO, you wouldn't want to enable Linux AIO on any host in your cluster where it will trash performance.) Either that, or mandate that all hosts use a filesystem and kernel which provide the necessary performance. Take ovirt for example, which provides the entire hypervisor environment, and so can guarantee this. Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN. Just wondering. Hope this clarifies. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
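Avi's 10% figure above is simple arithmetic; as a sanity check, here is a tiny Python sketch (the function name and the numbers plugged in are just for illustration) computing what fraction of a vCPU-second a given per-request overhead consumes:

```python
# Back-of-envelope check of the overhead estimate quoted above:
# 'extra_us' microseconds of per-request overhead at 'iops' requests
# per second costs iops * extra_us microseconds of CPU time per second,
# i.e. that fraction of one vCPU's time.

def aio_overhead_fraction(iops: int, extra_us: float) -> float:
    """Fraction of a vCPU-second spent on per-request AIO overhead."""
    return iops * extra_us / 1_000_000.0

# 10 us of posix-aio overhead at 10,000 IOPS per vcpu ~= 10% of a vCPU.
print(aio_overhead_fraction(10_000, 10.0))  # 0.1
# At a more typical 1,000 IOPS the difference is only ~1%.
print(aio_overhead_fraction(1_000, 10.0))   # 0.01
```

This is why the difference only matters for unusually I/O-intensive guests.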
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
On Sunday 20 April 2008, Avi Kivity wrote: Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN. I think that too; but still, that LUN would be accessed by the VMs via one of these I/O emulation layers, right? Or maybe you're advocating using the SAN initiator in the VM instead of the host? -- Javier
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Daniel P. Berrange wrote: Those cases aren't always discoverable. Linux-aio just falls back to using synchronous IO. It's pretty terrible. We need a new AIO interface for Linux (and yes, we're working on this). Once we have something better, we'll change that to be the default and things will Just Work for most users. If QEMU can't discover the cases where it won't work, what criteria should the end user use to decide between the impls, or for that matter, what criteria should a management api/app like libvirt use? If the only decision logic is 'try it and benchmark your VM' then it's not a particularly useful option. Good use of Linux-AIO requires that you basically know which cases it handles well, and which ones it doesn't. Falling back to synchronous I/O with no indication (except speed) is a pretty atrocious API imho. But that's what the Linux folks decided to do. I suspect what you have to do is:

1. Try opening the file with O_DIRECT.
2. Use fstat to check the filesystem type and block device type.
3. If it's on a whitelist of filesystem types,
4. and a whitelist of block device types,
5. and the kernel version is later than an fs+bd-dependent value,
6. then select an alignment size (kernel version dependent) and use Linux-AIO with it.

Otherwise don't use Linux-AIO. You may then decide to use Glibc's POSIX-AIO (which uses threads), or use threads for I/O yourself. In future, the above recipe will be more complicated, in that you have to use the same decision tree to decide between:

- Synchronous IO.
- Your own thread based IO.
- Glibc POSIX-AIO using threads.
- Linux-AIO.
- Virtio thing or whatever is based around vringfd.
- Syslets if they gain traction and perform well.

I've basically got a choice of making libvirt always add '-aio linux' or never add it at all. My inclination is to the latter since it is compatible with existing QEMU which has no -aio option. Presumably '-aio linux' is intended to provide some performance benefit so it'd be nice to use it. 
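Jamie's whitelist recipe could be sketched as a pure decision function. Everything below is illustrative: the filesystem names and kernel-version cutoffs are placeholders, not real recommendations, and a real implementation would read the filesystem magic via fstatfs(2) and the running kernel via uname(2).

```python
# Illustrative sketch of the whitelist-style decision described above.
# The whitelisted filesystems and minimum kernel versions are placeholder
# values only; real cutoffs would come from testing each fs/kernel combo.

# fs_type -> minimum kernel version for which Linux-AIO is trusted
LINUX_AIO_WHITELIST = {
    "ext3": (2, 6, 18),   # placeholder cutoff
    "xfs":  (2, 6, 20),   # placeholder cutoff
}

def choose_aio(o_direct_ok: bool, fs_type: str, kernel: tuple) -> str:
    """Pick 'linux-aio' only when every whitelist test passes;
    otherwise fall back to a thread-pool implementation."""
    if (o_direct_ok
            and fs_type in LINUX_AIO_WHITELIST
            and kernel >= LINUX_AIO_WHITELIST[fs_type]):
        return "linux-aio"
    return "threads"  # glibc POSIX-AIO or your own thread pool

print(choose_aio(True, "ext3", (2, 6, 24)))  # linux-aio
print(choose_aio(True, "nfs",  (2, 6, 24)))  # threads: fs not whitelisted
print(choose_aio(False, "xfs", (2, 6, 24)))  # threads: no O_DIRECT
```

The point of keeping the decision a pure function of probed facts is that it can live in one place (QEMU) instead of being re-implemented by every management app.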
If we can't express some criteria under which it should be turned on, I can't enable it; whereas if you can express some criteria, then QEMU should apply them automatically. I'm of the view that '-aio auto' would be a really good option - and when it's proven itself, it should be the default. It could work on all QEMU hosts: it would pick synchronous IO when there is nothing else. The criteria for selecting a good AIO strategy on Linux are quite complex, and might be worth hard-coding. In that case, putting them into QEMU itself would be much better than every program which launches QEMU having its own implementation of the criteria. Pushing this choice of AIO impls to the app or user invoking QEMU just does not seem like a win here. I think having the choice is very good, because whatever the hard-coded selection criteria, there will be times when they're wrong (ideally in conservative ways - it should always be functional, just suboptimal). So I do support this patch to add the switch. But _forcing_ the user to decide is not good, since the criteria are rather obscure and change with things like the filesystem. At least, a set of command line options to QEMU ought to work when you copy a VM to another machine! So I think '-aio auto', which invokes the selection criteria of the day and is guaranteed to work (conservatively picking a slower method if it cannot be sure a faster one will work), would be the most useful option of all. -- Jamie
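A conservative '-aio auto' along the lines described above could be as simple as an ordered preference list: take the first backend that probes as both available and known-safe, and fall back to synchronous I/O, which always works. The backend names and the probe interface here are hypothetical, purely for illustration:

```python
# Sketch of a conservative '-aio auto' backend selection.
# 'available' is the (hypothetical) result of probing the host; the
# ordering encodes "fastest first, but never pick something unproven".

PREFERENCE = ["linux-aio", "posix-aio", "threads", "sync"]

def pick_auto(available: set) -> str:
    """Return the best backend that probed as usable; 'sync' is the
    unconditional fallback since synchronous I/O always works."""
    for backend in PREFERENCE:
        if backend in available or backend == "sync":
            return backend
    return "sync"

print(pick_auto({"linux-aio", "posix-aio"}))  # linux-aio
print(pick_auto({"posix-aio", "threads"}))    # posix-aio
print(pick_auto(set()))                       # sync
```

Because the fallback is always functional, copying a VM (and its command line) to a host with a different filesystem or kernel degrades performance at worst, never correctness.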
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote: I've basically got a choice of making libvirt always add '-aio linux' or never add it at all. My inclination is to the latter since it is compatible with existing QEMU which has no -aio option. Presumably '-aio linux' is intended to provide some performance benefit so it'd be nice to use it. If we can't express some criteria under which it should be turned on, I can't enable it; whereas if you can express some criteria, then QEMU should apply them automatically. I'm of the view that '-aio auto' would be a really good option - and when it's proven itself, it should be the default. It could work on all QEMU hosts: it would pick synchronous IO when there is nothing else. Right now, not specifying the -aio option is equivalent to your proposed -aio auto. I guess I should include an 'info aio' command to let the user know what type of aio they are using. We can add selection criteria later, but semantically, not specifying an explicit -aio option allows QEMU to choose whichever one it thinks is best. Regards, Anthony Liguori
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote: I'm of the view that '-aio auto' would be a really good option - and when it's proven itself, it should be the default. It could work on all QEMU hosts: it would pick synchronous IO when there is nothing else. Right now, not specifying the -aio option is equivalent to your proposed -aio auto. I guess I should include an info aio to let the user know what type of aio they are using. We can add selection criteria later but semantically, not specifying an explicit -aio option allows QEMU to choose whichever one it thinks is best. Great. I guess the next step is to add the selection criteria, otherwise a million wikis will tell everyone to use '-aio linux' :-) Do you know what the selection criteria should be - or is there a document/paper somewhere that spells them out (ideally backed by benchmarks)? I'm interested for an unrelated project using AIO, so I'm willing to help get this right to some extent. -- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote: Right now, not specifying the -aio option is equivalent to your proposed -aio auto. I guess I should include an info aio to let the user know what type of aio they are using. We can add selection criteria later but semantically, not specifying an explicit -aio option allows QEMU to choose whichever one it thinks is best. For the majority of deployments posix aio should be sufficient. The few that need something else can use Linux aio. Of course, a managed environment can use Linux aio unconditionally if it knows the kernel has all the needed goodies. -- Any sufficiently difficult bug is indistinguishable from a feature.