Re: fuse scalability part 1
Hi Miklos,

On 09/25/2015 05:11 AM, Miklos Szeredi wrote:
> On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant wrote:
>> We did some performance testing without these patches and with these
>> patches (with the -o clone_fd option specified). We did 2 types of tests:
>>
>> 1. Throughput test: We did some parallel dd tests to read/write to a
>> FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
>> performance here is almost equal to the per-NUMA patches we submitted a
>> while back. Please find the results attached.
>
> Interesting. This means that serving the request on a different NUMA node
> than the one where the request originated doesn't appear to make the
> performance much worse.

With the new change, contention on the spinlock is significantly reduced,
so the latency caused by NUMA is not visible. Even in the earlier case,
scalability was not a big problem if we bound all processes (the fuse
worker and the user dd threads) to a single NUMA node. The problem was
only seen when threads spread out across NUMA nodes and contended for the
spinlock.

> Thanks,
> Miklos
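For reference, pinning everything to one node for such an experiment can
be done with numactl, or programmatically with libnuma; below is a minimal
sketch under that assumption (node number and the use of libnuma are
illustrative, not details from the original tests):

/*
 * Run a command (e.g. the fuse worker or a dd thread) restricted to the
 * CPUs and memory of NUMA node 0, roughly what
 * "numactl --cpunodebind=0 --membind=0 <cmd>" does.
 * Build with: gcc -o on_node0 on_node0.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        struct bitmask *nodes;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <cmd> [args...]\n", argv[0]);
                return EXIT_FAILURE;
        }
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not supported on this system\n");
                return EXIT_FAILURE;
        }
        if (numa_run_on_node(0) < 0) {          /* CPUs of node 0 only */
                perror("numa_run_on_node");
                return EXIT_FAILURE;
        }
        nodes = numa_parse_nodestring("0");     /* node 0 memory too */
        if (nodes) {
                numa_set_membind(nodes);
                numa_bitmask_free(nodes);
        }
        execvp(argv[1], &argv[1]);
        perror("execvp");
        return EXIT_FAILURE;
}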
Re: fuse scalability part 1
On 09/25/2015 05:11 AM, Miklos Szeredi wrote:
> On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant wrote:
>> We did some performance testing without these patches and with these
>> patches (with the -o clone_fd option specified). We did 2 types of tests:
>>
>> 1. Throughput test: We did some parallel dd tests to read/write to a
>> FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
>> performance here is almost equal to the per-NUMA patches we submitted a
>> while back. Please find the results attached.
>
> Interesting. This means that serving the request on a different NUMA node
> than the one where the request originated doesn't appear to make the
> performance much worse.
>
> Thanks,
> Miklos

Yes. The main performance gain is due to the reduced contention on the one
spinlock (fc->lock), especially with a large number of requests. Splitting
fc->fiq per cloned device will definitely improve performance further, and
we can experiment further with per-NUMA/per-CPU cloned devices.

Thanks,
Ashish
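For illustration, a per-clone input queue along the lines Ashish suggests
might look roughly like this; the names are modeled on fs/fuse/fuse_i.h,
but this is a sketch of the idea, not code from any posted patch:

/*
 * Sketch: move the input queue out of struct fuse_conn and into the
 * per-clone device state, so request submission contends only on the
 * lock of its own queue instead of a single fc->fiq.
 */
struct fuse_iqueue {
        spinlock_t lock;                /* protects just this queue    */
        struct list_head pending;       /* requests waiting to be read */
        wait_queue_head_t waitq;        /* daemon readers sleep here   */
};

struct fuse_dev {
        struct fuse_conn *fc;           /* shared connection state     */
        struct fuse_iqueue iq;          /* own queue, replacing fc->fiq */
        struct list_head entry;         /* linked on fc->devices       */
};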
Re: fuse scalability part 1
On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant wrote:
> We did some performance testing without these patches and with these
> patches (with the -o clone_fd option specified). We did 2 types of tests:
>
> 1. Throughput test: We did some parallel dd tests to read/write to a
> FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
> performance here is almost equal to the per-NUMA patches we submitted a
> while back. Please find the results attached.

Interesting. This means that serving the request on a different NUMA node
than the one where the request originated doesn't appear to make the
performance much worse.

Thanks,
Miklos
Re: fuse scalability part 1
On 05/18/2015 08:13 AM, Miklos Szeredi wrote:
> This part splits out an "input queue" and a "processing queue" from the
> monolithic "fuse connection", each of those having their own spinlock.
>
> The end of the patchset adds the ability to "clone" a fuse connection.
> This means that instead of having to read/write requests/answers on a
> single fuse device fd, the fuse daemon can have multiple distinct file
> descriptors open. Each of those can be used to receive requests and send
> answers; currently the only constraint is that a request must be
> answered on the same fd as it was read from.
>
> This can be extended further to allow binding a device clone to a
> specific CPU or NUMA node.
>
> Patchset is available here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> Libfuse patches adding support for the "clone_fd" option:
>
>   git://git.code.sf.net/p/fuse/fuse clone_fd
>
> Thanks,
> Miklos

Resending the numbers as attachments because my email client messes up
the formatting of the message. Sorry for the noise.

We did some performance testing without these patches and with these
patches (with the -o clone_fd option specified). We did 2 types of tests:

1. Throughput test: We did some parallel dd tests to read/write to a
FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
performance here is almost equal to the per-NUMA patches we submitted a
while back. Please find the results attached.

2. Spinlock access times test: We also ran some tests within the kernel
to check the time spent accessing the spinlocks per request in both
cases. As can be seen, the time taken per request to access the spinlocks
in the kernel code throughout the lifetime of the request is 30x to 100x
better in the second case (with the patchset). Please find the results
attached.

Thanks,
Ashish

1) Writes to single mount

dd processes    throughput            throughput
in parallel     (without patchset)    (with patchset)
  4             633 Mb/s              606 Mb/s
  8             583.2 Mb/s            561.6 Mb/s
 16             436 Mb/s              640.6 Mb/s
 32             500.5 Mb/s            718.1 Mb/s
 64             440.7 Mb/s            1276.8 Mb/s
128             526.2 Mb/s            2343.4 Mb/s

2) Reading from single mount

dd processes    throughput            throughput
in parallel     (without patchset)    (with patchset)
  4             1171 Mb/s             1059 Mb/s
  8             1626 Mb/s             1677 Mb/s
 16             1014 Mb/s             2240.6 Mb/s
 32             807.6 Mb/s            2512.9 Mb/s
 64             985.8 Mb/s            2870.3 Mb/s
128             1355 Mb/s             2996.5 Mb/s

3) Spinlock access time per request

dd processes    Time/req              Time/req
in parallel     (without patchset)    (with patchset)
  4             0.025 ms              0.00685 ms
  8             0.174 ms              0.0071 ms
 16             0.9825 ms             0.0115 ms
 32             2.4965 ms             0.0315 ms
 64             4.8335 ms             0.071 ms
128             5.972 ms              0.1812 ms
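For reference, the kind of parallel dd run described above can be driven
with a trivial harness like the one below; the mount point, file names,
and dd arguments are made up for illustration:

/*
 * Spawn N parallel dd writers against a fuse mount and wait for all of
 * them. Usage: ./ddbench <N>   (assumes a mount at /mnt/fuse)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        int i, n = (argc > 1) ? atoi(argv[1]) : 4;

        for (i = 0; i < n; i++) {
                if (fork() == 0) {
                        char out[64];

                        snprintf(out, sizeof(out), "of=/mnt/fuse/f%d", i);
                        execlp("dd", "dd", "if=/dev/zero", out,
                               "bs=1M", "count=1024", (char *)NULL);
                        _exit(1);       /* exec failed */
                }
        }
        while (wait(NULL) > 0)          /* reap every writer */
                ;
        return 0;
}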
Re: [fuse-devel] fuse scalability part 1
On Fri, Aug 14, 2015 at 12:14 PM, Goswin von Brederlow wrote:
> On Mon, May 18, 2015 at 05:13:36PM +0200, Miklos Szeredi wrote:
>> This part splits out an "input queue" and a "processing queue" from the
>> monolithic "fuse connection", each of those having their own spinlock.
>>
>> The end of the patchset adds the ability to "clone" a fuse connection.
>> This means that instead of having to read/write requests/answers on a
>> single fuse device fd, the fuse daemon can have multiple distinct file
>> descriptors open. Each of those can be used to receive requests and
>> send answers; currently the only constraint is that a request must be
>> answered on the same fd as it was read from.
>>
>> This can be extended further to allow binding a device clone to a
>> specific CPU or NUMA node.
>
> How will requests be distributed across clones?
>
> Is the idea here to start one clone per core and have IO requests
> originating from one core be processed by the fuse clone on the same
> core? I remember there was a noticeable speedup when request and
> processing were on the same core.
>
> How is the clone for each request chosen? What if there is no clone
> pinned to the same core? Will it pick the clone nearest in NUMA terms?
> Will it round-robin? Will it load-balance to the clone with the least
> number of requests pending? What if one clone stops processing requests?

Good questions. I guess the first implementation should be the simplest
possible, e.g. use the queue that matches (in this order):

 - CPU
 - NUMA node
 - any (round robin or whatever)

I wouldn't worry about load balancing and unresponsive queues until such
issues come up in real life.

Thanks,
Miklos
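To make the fallback order concrete, a sketch of what such queue selection
could look like on the kernel side; fuse_pick_queue() and the iq_for_cpu /
iq_for_node / iqs fields are hypothetical, invented here purely to
illustrate the ordering Miklos describes:

/*
 * Hypothetical sketch only: pick the input queue for a new request,
 * preferring an exact CPU match, then the local NUMA node, then any
 * queue round robin. None of these fields exist in the posted patches.
 */
static struct fuse_iqueue *fuse_pick_queue(struct fuse_conn *fc)
{
        int cpu = raw_smp_processor_id();       /* placement hint only */
        struct fuse_iqueue *iq;

        iq = fc->iq_for_cpu[cpu];               /* 1. same CPU */
        if (iq)
                return iq;
        iq = fc->iq_for_node[cpu_to_node(cpu)]; /* 2. same NUMA node */
        if (iq)
                return iq;
        /* 3. any queue: simple round robin */
        return fc->iqs[atomic_inc_return(&fc->next_iq) % fc->nr_iqs];
}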
Re: fuse scalability part 1
On 05/18/2015 08:13 AM, Miklos Szeredi wrote:
> This part splits out an "input queue" and a "processing queue" from the
> monolithic "fuse connection", each of those having their own spinlock.
>
> The end of the patchset adds the ability to "clone" a fuse connection.
> This means that instead of having to read/write requests/answers on a
> single fuse device fd, the fuse daemon can have multiple distinct file
> descriptors open. Each of those can be used to receive requests and send
> answers; currently the only constraint is that a request must be
> answered on the same fd as it was read from.
>
> This can be extended further to allow binding a device clone to a
> specific CPU or NUMA node.
>
> Patchset is available here:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> Libfuse patches adding support for the "clone_fd" option:
>
>   git://git.code.sf.net/p/fuse/fuse clone_fd
>
> Thanks,
> Miklos

We did some performance testing without these patches and with these
patches (with the -o clone_fd option specified). Sorry for the delay in
getting these done.

We did 2 types of tests:

1. Throughput test: We did some parallel dd tests to read/write to a
FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
performance here is almost equal to the per-NUMA patches we submitted a
while back.

1) Writes to single mount

dd processes    throughput            throughput
in parallel     (without patchset)    (with patchset)
  4             633 Mb/s              606 Mb/s
  8             583.2 Mb/s            561.6 Mb/s
 16             436 Mb/s              640.6 Mb/s
 32             500.5 Mb/s            718.1 Mb/s
 64             440.7 Mb/s            1276.8 Mb/s
128             526.2 Mb/s            2343.4 Mb/s

2) Reading from single mount

dd processes    throughput            throughput
in parallel     (without patchset)    (with patchset)
  4             1171 Mb/s             1059 Mb/s
  8             1626 Mb/s             1677 Mb/s
 16             1014 Mb/s             2240.6 Mb/s
 32             807.6 Mb/s            2512.9 Mb/s
 64             985.8 Mb/s            2870.3 Mb/s
128             1355 Mb/s             2996.5 Mb/s

2. Spinlock access times test: We also ran some tests within the kernel
to check the time spent accessing the spinlocks per request in both
cases. As can be seen, the time taken per request to access the spinlocks
in the kernel code throughout the lifetime of the request is 30x to 100x
better in the second case (with the patchset).

dd processes    Time/req              Time/req
in parallel     (without patchset)    (with patchset)
  4             0.025 ms              0.00685 ms
  8             0.174 ms              0.0071 ms
 16             0.9825 ms             0.0115 ms
 32             2.4965 ms             0.0315 ms
 64             4.8335 ms             0.071 ms
128             5.972 ms              0.1812 ms

In conclusion, splitting fc->lock into multiple locks and splitting the
request queues definitely helps performance.

Thanks,
Ashish
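The instrumentation behind the time-per-request numbers was not posted;
one plausible way to collect such numbers is to wrap each acquisition of
fc->lock with a timestamp and accumulate the wait in the request, roughly
like the sketch below (req->lock_wait_ns is an invented field, added only
for the measurement):

/*
 * Sketch: measure how long a request spends acquiring fc->lock over
 * its lifetime. ktime_get() is usable in this context; lock_wait_ns
 * would be a field added to struct fuse_req just for the experiment.
 */
static void fuse_spin_lock_timed(struct fuse_conn *fc,
                                 struct fuse_req *req)
{
        ktime_t start = ktime_get();

        spin_lock(&fc->lock);
        req->lock_wait_ns += ktime_to_ns(ktime_sub(ktime_get(), start));
}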
fuse scalability part 1
This part splits out an "input queue" and a "processing queue" from the
monolithic "fuse connection", each of those having their own spinlock.

The end of the patchset adds the ability to "clone" a fuse connection.
This means that instead of having to read/write requests/answers on a
single fuse device fd, the fuse daemon can have multiple distinct file
descriptors open. Each of those can be used to receive requests and send
answers; currently the only constraint is that a request must be answered
on the same fd as it was read from.

This can be extended further to allow binding a device clone to a
specific CPU or NUMA node.

Patchset is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

Libfuse patches adding support for the "clone_fd" option:

  git://git.code.sf.net/p/fuse/fuse clone_fd

Thanks,
Miklos
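In user space the cloning step itself is small; below is a minimal sketch,
assuming the FUSE_DEV_IOC_CLONE ioctl interface used by the device-clone
patch (session setup and the request read/answer loop are elided):

/*
 * Clone an existing fuse session fd: open a fresh /dev/fuse fd and
 * attach it to the same connection. Each daemon worker thread can then
 * read requests from, and answer on, its own fd, which satisfies the
 * "answer on the same fd" constraint described above.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fuse.h>

int fuse_clone_session_fd(int session_fd)
{
        uint32_t src = session_fd;
        int clonefd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

        if (clonefd < 0)
                return -1;
        if (ioctl(clonefd, FUSE_DEV_IOC_CLONE, &src) < 0) {
                close(clonefd);
                return -1;
        }
        return clonefd;
}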