Re: [fuse-devel] [PATCH] fuse: Fix fuse_get_user_pages() return value

2016-04-19 Thread Ashish Samant

Hi Seth,
On 04/19/2016 03:43 PM, Seth Forshee wrote:

fuse_direct_io() expects fuse_get_user_pages() to return either 0 or a
negative error code, but on success it may actually return a positive
value. fuse_direct_io() then returns that same value when the subsequent
I/O operation doesn't transfer any data, which means it can return a
positive value even though no bytes were transferred. This is obviously
problematic.

Fix fuse_get_user_pages() to return 0 on success. This in turn makes
fuse_direct_io() return 0 when no bytes are transferred.

Fixes: 742f992708df ("fuse: return patrial success from fuse_direct_io()")
Signed-off-by: Seth Forshee 
---
  fs/fuse/file.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b5c616c5ec98..78af5c0996b8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1295,7 +1295,7 @@ static int fuse_get_user_pages(struct fuse_req *req, struct iov_iter *ii,
 
 	*nbytesp = nbytes;
 
-	return ret;
+	return ret < 0 ? ret : 0;
 }
 
 static inline int fuse_iter_npages(const struct iov_iter *ii_p)
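
For context, a minimal userspace sketch of the convention the fix restores:
the helper reports the byte count through an out-parameter and returns only 0
or a negative error, so the caller never mistakes a leftover positive value
for a transfer count. This is illustrative only, not the kernel source, and
the stub names are invented.

#include <stddef.h>
#include <stdio.h>

/* Invented stand-in for fuse_get_user_pages(): bytes go out via *nbytesp,
 * the return value is strictly 0 on success or a negative error code. */
static int get_pages_stub(size_t *nbytesp, int err, size_t mapped)
{
	*nbytesp = mapped;
	return err < 0 ? err : 0;	/* never a positive value */
}

/* Invented stand-in for the fuse_direct_io() call site. */
static long direct_io_stub(void)
{
	size_t nbytes = 0;
	int ret = get_pages_stub(&nbytes, 0, 0);

	if (ret < 0)
		return ret;		/* genuine error */
	return (long)nbytes;		/* now 0 when nothing was transferred */
}

int main(void)
{
	printf("direct_io_stub() -> %ld\n", direct_io_stub());
	return 0;
}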



I have already sent a patch to the list that does exactly the same thing :)

https://sourceforge.net/p/fuse/mailman/message/34966327/

Thanks,
Ashish


Re: fuse scalability part 1

2015-09-25 Thread Ashish Samant


On 09/25/2015 05:11 AM, Miklos Szeredi wrote:

On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant  wrote:


We did some performance testing without these patches and with these patches
(with the -o clone_fd option specified). We did two types of tests:

1. Throughput test: We did some parallel dd tests to read/write to a
FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
performance here is almost equal to that of the per-NUMA patches we submitted
a while back. Please find results attached.

Interesting. This means that serving the request on a different NUMA
node than the one where the request originated doesn't appear to make
the performance much worse.

Thanks,
Miklos
Yes. The main performance gain is due to the reduced contention on the single
spinlock (fc->lock), especially with a large number of requests.
Splitting fc->fiq per cloned device will definitely improve performance
further, and we can experiment with per-NUMA / per-CPU cloned devices.
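
To make the contention point concrete, here is a rough before/after sketch of
the locking layout under discussion. The struct layouts are illustrative only
(they are not the actual kernel definitions), and pthread_mutex_t merely
stands in for a kernel spinlock.

#include <pthread.h>

/* Before: one connection-wide lock serializes every queue operation. */
struct conn_before {
	pthread_mutex_t lock;	/* stands in for fc->lock */
	/* pending, processing and io lists all hang off this one lock */
};

/* After: each queue carries its own lock.  Readers on different cloned
 * device fds mostly touch their own processing queue, so contention on
 * the shared input queue is what remains; splitting that per cloned
 * device (per NUMA node or per CPU) would reduce it further. */
struct input_queue {
	pthread_mutex_t lock;	/* stands in for the fc->fiq lock */
};

struct processing_queue {
	pthread_mutex_t lock;	/* one per cloned /dev/fuse fd */
};

struct conn_after {
	struct input_queue iq;	/* shared input queue */
	/* each cloned device fd owns its own processing_queue */
};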


Thanks,
Ashish




Re: fuse scalability part 1

2015-09-24 Thread Ashish Samant


On 05/18/2015 08:13 AM, Miklos Szeredi wrote:

This part splits out an "input queue" and a "processing queue" from the
monolithic "fuse connection", each with its own spinlock.

The end of the patchset adds the ability to "clone" a fuse connection. This
means that instead of having to read/write requests/answers on a single fuse
device fd, the fuse daemon can have multiple distinct file descriptors open.
Each of those can be used to receive requests and send answers; currently the
only constraint is that a request must be answered on the same fd it was read
from.

This can be extended further to allow binding a device clone to a specific CPU
or NUMA node.

Patchset is available here:

   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

Libfuse patches adding support for "clone_fd" option:

   git://git.code.sf.net/p/fuse/fuse clone_fd

Thanks,
Miklos
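
As a concrete illustration of the cloning step, below is a rough userspace
sketch of how a daemon worker might obtain its own device fd. It assumes the
FUSE_DEV_IOC_CLONE ioctl introduced by this patchset (the fallback define
below reflects that assumption) and omits error reporting and all libfuse
plumbing.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef FUSE_DEV_IOC_CLONE
/* Assumed to match the uapi definition added by the patchset. */
#define FUSE_DEV_IOC_CLONE	_IOR(229, 0, uint32_t)
#endif

/* Open a fresh /dev/fuse fd and attach it to an existing session, so a
 * worker can read requests and write replies on its own descriptor.
 * Requests read on this fd must also be answered on it. */
static int clone_fuse_fd(int session_fd)
{
	uint32_t src = (uint32_t)session_fd;
	int fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FUSE_DEV_IOC_CLONE, &src) == -1) {
		close(fd);
		return -1;
	}
	return fd;
}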


Resending the numbers as attachments because my email client messes up the
formatting of the message. Sorry for the noise.


We did some performance testing without these patches and with these patches
(with the -o clone_fd option specified). We did two types of tests:


1. Throughput test: We did some parallel dd tests to read/write to a
FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
performance here is almost equal to that of the per-NUMA patches we submitted
a while back. Please find results attached.


2. Spinlock access times test: We also ran some tests within the kernel to
check the time spent in accessing the spinlocks per request in both cases. As
can be seen, the time taken per request to access the spinlocks in the kernel
code, over the lifetime of the request, is 30x to 100x lower in the second
case (with the patchset). Please find results attached.


Thanks,
Ashish


1) Writes to single mount

   dd processes   throughput           throughput
   in parallel    (without patchset)   (with patchset)
   ----------------------------------------------------
     4              633   Mb/s           606   Mb/s
     8              583.2 Mb/s           561.6 Mb/s
    16              436   Mb/s           640.6 Mb/s
    32              500.5 Mb/s           718.1 Mb/s
    64              440.7 Mb/s          1276.8 Mb/s
   128              526.2 Mb/s          2343.4 Mb/s

2) Reading from single mount

   dd processes   throughput           throughput
   in parallel    (without patchset)   (with patchset)
   ----------------------------------------------------
     4             1171   Mb/s          1059   Mb/s
     8             1626   Mb/s          1677   Mb/s
    16             1014   Mb/s          2240.6 Mb/s
    32              807.6 Mb/s          2512.9 Mb/s
    64              985.8 Mb/s          2870.3 Mb/s
   128             1355   Mb/s          2996.5 Mb/s

Spinlock access time per request

   dd processes   time/req             time/req
   in parallel    (without patchset)   (with patchset)
   ----------------------------------------------------
     4              0.025  ms            0.00685 ms
     8              0.174  ms            0.0071  ms
    16              0.9825 ms            0.0115  ms
    32              2.4965 ms            0.0315  ms
    64              4.8335 ms            0.071   ms
   128              5.972  ms            0.1812  ms



Re: fuse scalability part 1

2015-09-23 Thread Ashish Samant


On 05/18/2015 08:13 AM, Miklos Szeredi wrote:

This part splits out an "input queue" and a "processing queue" from the
monolithic "fuse connection", each with its own spinlock.

The end of the patchset adds the ability to "clone" a fuse connection. This
means that instead of having to read/write requests/answers on a single fuse
device fd, the fuse daemon can have multiple distinct file descriptors open.
Each of those can be used to receive requests and send answers; currently the
only constraint is that a request must be answered on the same fd it was read
from.

This can be extended further to allow binding a device clone to a specific CPU
or NUMA node.

Patchset is available here:

   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

Libfuse patches adding support for "clone_fd" option:

   git://git.code.sf.net/p/fuse/fuse clone_fd

Thanks,
Miklos


We did some performance testing without these patches and with these patches
(with the -o clone_fd option specified). Sorry for the delay in getting these
done. We did two types of tests:


1. Throughput test: We did some parallel dd tests to read/write to a
FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
performance here is almost equal to that of the per-NUMA patches we submitted
a while back.


1) Writes to single mount

   dd processes   throughput           throughput
   in parallel    (without patchset)   (with patchset)
   ----------------------------------------------------
     4              633   Mb/s           606   Mb/s
     8              583.2 Mb/s           561.6 Mb/s
    16              436   Mb/s           640.6 Mb/s
    32              500.5 Mb/s           718.1 Mb/s
    64              440.7 Mb/s          1276.8 Mb/s
   128              526.2 Mb/s          2343.4 Mb/s

2) Reading from single mount

   dd processes   throughput           throughput
   in parallel    (without patchset)   (with patchset)
   ----------------------------------------------------
     4             1171   Mb/s          1059   Mb/s
     8             1626   Mb/s          1677   Mb/s
    16             1014   Mb/s          2240.6 Mb/s
    32              807.6 Mb/s          2512.9 Mb/s
    64              985.8 Mb/s          2870.3 Mb/s
   128             1355   Mb/s          2996.5 Mb/s




2. Spinlock access times test: We also ran some tests within the kernel to
check the time spent in accessing the spinlocks per request in both cases. As
can be seen, the time taken per request to access the spinlocks in the kernel
code, over the lifetime of the request, is 30x to 100x lower in the second
case (with the patchset).



   dd processes   time/req             time/req
   in parallel    (without patchset)   (with patchset)
   ----------------------------------------------------
     4              0.025  ms            0.00685 ms
     8              0.174  ms            0.0071  ms
    16              0.9825 ms            0.0115  ms
    32              2.4965 ms            0.0315  ms
    64              4.8335 ms            0.071   ms
   128              5.972  ms            0.1812  ms


In conclusion, splitting fc->lock into multiple locks and splitting the
request queues definitely helps performance.
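
The measurement code itself was not posted; one plausible way to gather such
per-request lock times is sketched below (kernel-style, with the invented
names timed_req and timed_spin_lock; this is not the instrumentation actually
used in the tests above).

#include <linux/ktime.h>
#include <linux/spinlock.h>

/* Accumulate, per request, the time spent acquiring spinlocks along the
 * request's path; dividing the total by the number of requests gives a
 * per-request figure like the ones in the table above. */
struct timed_req {
	ktime_t lock_time;
};

static inline void timed_spin_lock(spinlock_t *lock, struct timed_req *req)
{
	ktime_t start = ktime_get();

	spin_lock(lock);
	req->lock_time = ktime_add(req->lock_time, ktime_sub(ktime_get(), start));
}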


Thanks,
Ashish

