Re: cfq performance gap
In article <[EMAIL PROTECTED]>,
Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
>Miquel van Smoorenburg wrote on Wednesday, December 13, 2006 1:57 AM
>> Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
>> >This rawio test plows through sequential I/O and distributes each small record
>> >over the threads by modulo. So each thread appears to be non-contiguous within
>> >its own process context, while the overall requests hitting the device are sequential.
>> >I can't see how any application does that kind of I/O pattern.
>>
>> An NNTP server that has many incoming connections, handled by
>> multiple threads, that stores the data in cyclic buffers?
>
>Then whichever thread dumps the buffer content to the storage
>will do one large contiguous I/O.

In this context, "cyclic buffer" means "large fixed-size file" or
"disk partition", and when the end of that file/partition is reached,
writing resumes at the start (wraps around, starts the next cycle).

Each thread writes an article to disk, which can differ in size
from 1K to 1M. The writes all together are sequential, but the
writes from one thread are definitely not.

This is a real-world example - I have written software that does
exactly this, multithreaded versions of INN exist that, with CNFS
storage, do exactly this, and Diablo does something comparable
(only it uses processes instead of threads).

Mike.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
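A minimal sketch of the write pattern Mike describes, not taken from INN, CNFS, or Diablo: a shared cursor hands out the next free offset in a fixed-size region and wraps to the start when an article no longer fits. The names `CyclicBuffer`, `reserve`, and `simulate`, and the round-robin assignment of articles to writers, are assumptions made for illustration only.

```python
import threading


class CyclicBuffer:
    """Hands out write offsets in a fixed-size region, wrapping at the end."""

    def __init__(self, size):
        self.size = size
        self.cursor = 0   # next free byte in the current cycle
        self.cycle = 0    # number of times the cursor has wrapped
        self.lock = threading.Lock()

    def reserve(self, length):
        """Return the offset at which `length` bytes may be written."""
        if length > self.size:
            raise ValueError("article larger than buffer")
        with self.lock:
            if self.cursor + length > self.size:
                # Not enough room before the end: wrap and start a new cycle.
                self.cursor = 0
                self.cycle += 1
            offset = self.cursor
            self.cursor += length
            return offset


def simulate(num_writers, article_sizes, buffer_size):
    """Assign articles to writers round-robin (an assumption) and
    return a list of (writer, offset, length) tuples in arrival order."""
    buf = CyclicBuffer(buffer_size)
    log = []
    for i, length in enumerate(article_sizes):
        writer = i % num_writers
        log.append((writer, buf.reserve(length), length))
    return log


if __name__ == "__main__":
    sizes = [1024, 4096, 2048, 1024, 8192, 1024]
    for writer, offset, length in simulate(3, sizes, 16384):
        print(f"writer {writer}: offset {offset:6d} length {length}")
```

In this toy run the offsets taken as a whole advance sequentially until the wrap, while any single writer's consecutive offsets are separated by the other writers' articles, which is the pattern under discussion.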
RE: cfq performance gap
Miquel van Smoorenburg wrote on Wednesday, December 13, 2006 1:57 AM
> Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> >This rawio test plows through sequential I/O and distributes each small record
> >over the threads by modulo. So each thread appears to be non-contiguous within
> >its own process context, while the overall requests hitting the device are sequential.
> >I can't see how any application does that kind of I/O pattern.
>
> An NNTP server that has many incoming connections, handled by
> multiple threads, that stores the data in cyclic buffers?

Then whichever thread dumps the buffer content to the storage
will do one large contiguous I/O.
Re: cfq performance gap
In article <[EMAIL PROTECTED]>,
Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
>This rawio test plows through sequential I/O and distributes each small record
>over the threads by modulo. So each thread appears to be non-contiguous within
>its own process context, while the overall requests hitting the device are sequential.
>I can't see how any application does that kind of I/O pattern.

An NNTP server that has many incoming connections, handled by
multiple threads, that stores the data in cyclic buffers?

Mike.
Re: cfq performance gap
On Tue, Dec 12 2006, AVANTIKA R. MATHUR wrote:
> >That said, I might add some logic to detect when we can cheaply switch
> >queues instead of waiting for a new request from the same queue.
> >Averaging slice times over a period of time instead of 1:1 with that
> >logic, should help cases like this while still being fair.
> >
> Thank you for looking at this issue.
> I've found an IBM/SUSE bugzilla bug for the same performance gap on
> rawio. There was a fix for this bug included in SLES10-RC1, do you know
> why it was not added in mainline?

Which bug do you mean? It was likely me doing the fixing on that bug,
and I'm certain that the patch is in mainline. If you had included the
bug number, I could have expanded on that.

--
Jens Axboe
RE: cfq performance gap
AVANTIKA R. MATHUR wrote on Tuesday, December 12, 2006 5:33 PM
> >> rawio is actually performing sequential reads, but I don't believe it is
> >> purely sequential with the multiple processes.
> >> I am currently running the test with longer runtimes and will post
> >> results once it is complete.
> >> I've also attached the rawio source.
> >>
> >
> > It's certainly the slice and idling hurting here. But at the same time,
> > I don't really think your test case is very interesting. The test area
> > is very small and you have 16 threads trying to read the same thing,
> > optimizing for that would be silly as I don't think it has much real
> > world relevance.
>
> Could a database have similar workload to this test?

No. Nothing I have seen in db workloads exhibits such a pattern. There are
basically two types of db workloads: one does transaction processing, and
its I/O patterns are truly random with a large stride, both within a process
context and in the overall I/O seen at the device level. The second is the
decision-making type of db query, which does large sequential I/O within one
process context.

This rawio test plows through sequential I/O and distributes each small record
over the threads by modulo. So each thread appears to be non-contiguous within
its own process context, while the overall requests hitting the device are
sequential. I can't see how any application does that kind of I/O pattern.

- Ken
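The access pattern Ken describes can be sketched as follows. This is an interpretation of his wording, not code from the rawio source: it assumes record i (at byte offset i * record_size) is read by thread i % num_threads, and that the threads advance in lockstep. The function names are hypothetical.

```python
def thread_offsets(thread_id, num_threads, record_size, num_records):
    """Byte offsets read by one thread when record i goes to thread i % num_threads."""
    return [i * record_size
            for i in range(thread_id, num_records, num_threads)]


def device_view(num_threads, record_size, num_records):
    """Offsets in the order the device would see them if all threads
    issued one request per round, in thread order (a simplification)."""
    per_thread = [thread_offsets(t, num_threads, record_size, num_records)
                  for t in range(num_threads)]
    merged = []
    for round_no in range(max(len(p) for p in per_thread)):
        for offsets in per_thread:
            if round_no < len(offsets):
                merged.append(offsets[round_no])
    return merged


if __name__ == "__main__":
    # 16 threads and 4096-byte records, as in the rawread invocation in this thread.
    print("thread 0:", thread_offsets(0, 16, 4096, 64))
    print("device  :", device_view(16, 4096, 64)[:8], "...")
```

Each thread on its own strides by num_threads * record_size (64 KiB here), so a per-process scheduler sees non-contiguous requests, while the merged stream at the device is plain sequential I/O.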
Re: cfq performance gap
Jens Axboe wrote:
> On Fri, Dec 08 2006, Avantika Mathur wrote:
>> On Fri, 2006-12-08 at 13:05 +0100, Jens Axboe wrote:
>>> On Thu, Dec 07 2006, Avantika Mathur wrote:
>>>> Hi Jens,
>>>
>>> (you probably noticed now, but the [EMAIL PROTECTED] email is no longer
>>> valid)
>>
>> I saw that, thanks!
>>
>>>> I've noticed a performance gap between the cfq scheduler and other io
>>>> schedulers when running the rawio benchmark.
>>>> The benchmark workload is 16 processes running 4k random reads.
>>>>
>>>> Is this performance gap a known issue?
>>>
>>> CFQ could be a little slower at this benchmark, but your results are
>>> much worse than I would expect. What is the queueing depth of sda? How
>>> are you invoking rawio?
>>
>> I am running rawio with the following options:
>> rawread -p 16 -m 1 -d 1 -x -z -t 0 -s 4096
>>
>> The queue depth on sda is 4.
>>
>>> Your runtime is very low, how does it look if you allow the test to run
>>> for much longer? 30MiB/sec random read bandwidth seems very high, I'm
>>> wondering what exactly is being tested here.
>>
>> rawio is actually performing sequential reads, but I don't believe it is
>> purely sequential with the multiple processes.
>> I am currently running the test with longer runtimes and will post
>> results once it is complete.
>> I've also attached the rawio source.
>
> It's certainly the slice and idling hurting here. But at the same time,
> I don't really think your test case is very interesting. The test area
> is very small and you have 16 threads trying to read the same thing,
> optimizing for that would be silly as I don't think it has much real
> world relevance.

Could a database have similar workload to this test?

> That said, I might add some logic to detect when we can cheaply switch
> queues instead of waiting for a new request from the same queue.
> Averaging slice times over a period of time instead of 1:1 with that
> logic, should help cases like this while still being fair.

Thank you for looking at this issue.
I've found an IBM/SUSE bugzilla bug for the same performance gap on
rawio. There was a fix for this bug included in SLES10-RC1, do you know
why it was not added in mainline?

Thanks again,
Avantika Mathur
Re: cfq performance gap
On Fri, Dec 08 2006, Avantika Mathur wrote:
> On Fri, 2006-12-08 at 13:05 +0100, Jens Axboe wrote:
> > On Thu, Dec 07 2006, Avantika Mathur wrote:
> > > Hi Jens,
> >
> > (you probably noticed now, but the [EMAIL PROTECTED] email is no longer
> > valid)
>
> I saw that, thanks!
>
> > > I've noticed a performance gap between the cfq scheduler and other io
> > > schedulers when running the rawio benchmark.
> > > Results from rawio on 2.6.19, cfq and noop schedulers:
> > >
> > > CFQ:
> > >
> > > procs  device    num read  KB/sec  I/O Ops/sec
> > > -----  --------  --------  ------  -----------
> > > 16     /dev/sda     16412    8338         2084
> > > -----  --------  --------  ------  -----------
> > > 16                  16412    8338         2084
> > >
> > > Total run time 0.492072 seconds
> > >
> > >
> > > NOOP:
> > >
> > > procs  device    num read  KB/sec  I/O Ops/sec
> > > -----  --------  --------  ------  -----------
> > > 16     /dev/sda     16399   29224         7306
> > > -----  --------  --------  ------  -----------
> > > 16                  16399   29224         7306
> > >
> > > Total run time 0.140284 seconds
> > >
> > > The benchmark workload is 16 processes running 4k random reads.
> > >
> > > Is this performance gap a known issue?
> >
> > CFQ could be a little slower at this benchmark, but your results are
> > much worse than I would expect. What is the queueing depth of sda? How
> > are you invoking rawio?
>
> I am running rawio with the following options:
> rawread -p 16 -m 1 -d 1 -x -z -t 0 -s 4096
>
> The queue depth on sda is 4.
>
> >
> > Your runtime is very low, how does it look if you allow the test to run
> > for much longer? 30MiB/sec random read bandwidth seems very high, I'm
> > wondering what exactly is being tested here.
> >
>
> rawio is actually performing sequential reads, but I don't believe it is
> purely sequential with the multiple processes.
> I am currently running the test with longer runtimes and will post
> results once it is complete.
> I've also attached the rawio source.

It's certainly the slice and idling hurting here. But at the same time,
I don't really think your test case is very interesting. The test area
is very small and you have 16 threads trying to read the same thing,
optimizing for that would be silly as I don't think it has much real
world relevance.

That said, I might add some logic to detect when we can cheaply switch
queues instead of waiting for a new request from the same queue.
Averaging slice times over a period of time instead of 1:1 with that
logic, should help cases like this while still being fair.

--
Jens Axboe
Re: cfq performance gap
On Fri, 2006-12-08 at 13:05 +0100, Jens Axboe wrote:
> On Thu, Dec 07 2006, Avantika Mathur wrote:
> > Hi Jens,
>
> (you probably noticed now, but the [EMAIL PROTECTED] email is no longer
> valid)

I saw that, thanks!

> > I've noticed a performance gap between the cfq scheduler and other io
> > schedulers when running the rawio benchmark.
> > Results from rawio on 2.6.19, cfq and noop schedulers:
> >
> > CFQ:
> >
> > procs  device    num read  KB/sec  I/O Ops/sec
> > -----  --------  --------  ------  -----------
> > 16     /dev/sda     16412    8338         2084
> > -----  --------  --------  ------  -----------
> > 16                  16412    8338         2084
> >
> > Total run time 0.492072 seconds
> >
> >
> > NOOP:
> >
> > procs  device    num read  KB/sec  I/O Ops/sec
> > -----  --------  --------  ------  -----------
> > 16     /dev/sda     16399   29224         7306
> > -----  --------  --------  ------  -----------
> > 16                  16399   29224         7306
> >
> > Total run time 0.140284 seconds
> >
> > The benchmark workload is 16 processes running 4k random reads.
> >
> > Is this performance gap a known issue?
>
> CFQ could be a little slower at this benchmark, but your results are
> much worse than I would expect. What is the queueing depth of sda? How
> are you invoking rawio?

I am running rawio with the following options:
rawread -p 16 -m 1 -d 1 -x -z -t 0 -s 4096

The queue depth on sda is 4.

>
> Your runtime is very low, how does it look if you allow the test to run
> for much longer? 30MiB/sec random read bandwidth seems very high, I'm
> wondering what exactly is being tested here.
>

rawio is actually performing sequential reads, but I don't believe it is
purely sequential with the multiple processes.
I am currently running the test with longer runtimes and will post
results once it is complete.
I've also attached the rawio source.

Thanks,
Avantika

rawio-2.4.2.tar.gz
Description: application/compressed-tar
Re: cfq performance gap
On Thu, Dec 07 2006, Avantika Mathur wrote:
> Hi Jens,

(you probably noticed now, but the [EMAIL PROTECTED] email is no longer
valid)

> I've noticed a performance gap between the cfq scheduler and other io
> schedulers when running the rawio benchmark.
> Results from rawio on 2.6.19, cfq and noop schedulers:
>
> CFQ:
>
> procs  device    num read  KB/sec  I/O Ops/sec
> -----  --------  --------  ------  -----------
> 16     /dev/sda     16412    8338         2084
> -----  --------  --------  ------  -----------
> 16                  16412    8338         2084
>
> Total run time 0.492072 seconds
>
>
> NOOP:
>
> procs  device    num read  KB/sec  I/O Ops/sec
> -----  --------  --------  ------  -----------
> 16     /dev/sda     16399   29224         7306
> -----  --------  --------  ------  -----------
> 16                  16399   29224         7306
>
> Total run time 0.140284 seconds
>
> The benchmark workload is 16 processes running 4k random reads.
>
> Is this performance gap a known issue?

CFQ could be a little slower at this benchmark, but your results are
much worse than I would expect. What is the queueing depth of sda? How
are you invoking rawio?

Your runtime is very low, how does it look if you allow the test to run
for much longer? 30MiB/sec random read bandwidth seems very high, I'm
wondering what exactly is being tested here.

--
Jens Axboe
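As a quick sanity check on the figures being questioned here, the arithmetic can be sketched as below. It assumes the run-together result columns read as 16412 / 8338 / 2084 for CFQ and 16399 / 29224 / 7306 for NOOP (num read, KB/sec, I/O Ops/sec), and that each operation is a 4 KB read as stated in the report; the dictionary layout and function names are illustrative only.

```python
RECORD_KB = 4  # "4k" reads, per the original report

# Assumed reading of the reported columns (KB/sec and I/O Ops/sec).
RESULTS = {
    "cfq": {"kb_per_sec": 8338, "ops_per_sec": 2084},
    "noop": {"kb_per_sec": 29224, "ops_per_sec": 7306},
}


def expected_kb_per_sec(ops_per_sec, record_kb=RECORD_KB):
    """Bandwidth implied by the operation rate and record size."""
    return ops_per_sec * record_kb


def to_mib_per_sec(kb_per_sec):
    """Convert KB/sec (taken as KiB/sec) to MiB/sec."""
    return kb_per_sec / 1024


if __name__ == "__main__":
    for name, row in RESULTS.items():
        implied = expected_kb_per_sec(row["ops_per_sec"])
        print(f"{name}: reported {row['kb_per_sec']} KB/s, "
              f"implied {implied} KB/s, "
              f"{to_mib_per_sec(row['kb_per_sec']):.1f} MiB/s")
    ratio = RESULTS["noop"]["kb_per_sec"] / RESULTS["cfq"]["kb_per_sec"]
    print(f"noop/cfq bandwidth ratio: {ratio:.2f}")
```

Under that reading the two rate columns agree with each other to within rounding, the NOOP figure is a little under 30 MiB/sec as Jens notes, and NOOP delivers roughly 3.5 times the CFQ bandwidth.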