Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread Tom Talpey

On 11/29/2018 10:00 PM, John Hubbard wrote:

On 11/29/18 6:30 PM, Tom Talpey wrote:

On 11/29/2018 9:21 PM, John Hubbard wrote:

On 11/29/18 6:18 PM, Tom Talpey wrote:

On 11/29/2018 8:39 PM, John Hubbard wrote:

On 11/28/18 5:59 AM, Tom Talpey wrote:

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]

Excerpting from below:


Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


vs


With patches applied:
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.



Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:


Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.


     cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73


Good result - a correct implementation, and faster.



Thanks, Tom, I really appreciate your experience and help on what performance
should look like here. (I'm sure you can guess that this is the first time
I've worked with fio, heh.)


No problem, happy to chip in. Feel free to add my

Tested-By: Tom Talpey 

I know, that's not the personal email I'm posting from, but it's me.

I'll hopefully be trying the code with the Linux SMB client (cifs.ko)
next week; Long Li is implementing direct I/O in it, and we'll see how
it helps.

Mainly, I'm looking forward to seeing this enable RDMA-to-DAX.

Tom.



I'll send out a new, non-RFC patchset soon, then.

thanks,



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread John Hubbard
On 11/29/18 6:30 PM, Tom Talpey wrote:
> On 11/29/2018 9:21 PM, John Hubbard wrote:
>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>>>>> [...]
>>> Excerpting from below:
>>>
>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>   cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> vs
>>>
>>>> With patches applied:
>>>>   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>   cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> Perfect results, not CPU limited, and full IOPS.
>>>
>>> Curiously identical, so I trust you've checked that you measured
>>> both targets, but if so, I say it's good.
>>>
>>
>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>> better, at 194K IOPS and 759 MB/s:
> 
> Definitely better - note the system CPU is lower, which is probably the
> reason for the increased IOPS.
> 
>>    cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
> 
> Good result - a correct implementation, and faster.
> 

Thanks, Tom, I really appreciate your experience and help on what performance 
should look like here. (I'm sure you can guess that this is the first time 
I've worked with fio, heh.)

I'll send out a new, non-RFC patchset soon, then.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread Tom Talpey

On 11/29/2018 9:21 PM, John Hubbard wrote:

On 11/29/18 6:18 PM, Tom Talpey wrote:

On 11/29/2018 8:39 PM, John Hubbard wrote:

On 11/28/18 5:59 AM, Tom Talpey wrote:

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]

I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.



So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference: your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.



OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.


Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.


Continuing on, then: running a before and after test, I don't see any 
significant
difference in the fio results:


Excerpting from below:


Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
  read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
     cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


vs


With patches applied:
  read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
     cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73


Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.



Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:


Definitely better - note the system CPU is lower, which is probably the
reason for the increased IOPS.

   cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73

Good result - a correct implementation, and faster.

Tom.




  $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
 slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
 clat (usec): min=148, max=755, avg=326.85, stdev=18.13
  lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
  | 99.00th=[  355], 99.50th=[  537], 99.90th=[  553], 99.95th=[  553],
  | 99.99th=[  619]
bw (  KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, 
stdev=10804.59, samples=2
iops: min=191954, max=195774, avg=193864.00, stdev=2701.15, 
samples=2
   lat (usec)   : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
   cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), 
io=1024MiB (1074MB), run=1350-1350msec

Disk stats (read/write):
   nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread Tom Talpey

On 11/29/2018 8:39 PM, John Hubbard wrote:

On 11/28/18 5:59 AM, Tom Talpey wrote:

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]

I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.



So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference: your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.



OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.


Oh, good! I'm especially glad because I was having a heck of a time
reconfiguring the one machine I have available for this.


Continuing on, then: running a before and after test, I don't see any 
significant
difference in the fio results:


Excerpting from below:

> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>    cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

vs

> With patches applied:
> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>    cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73

Perfect results, not CPU limited, and full IOPS.

Curiously identical, so I trust you've checked that you measured
both targets, but if so, I say it's good.
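
(As a quick consistency check, the reported bandwidth and IOPS do agree
with each other for a 4 KiB block size:

   193,000 IO/s x 4096 bytes/IO = 790,528,000 B/s ~= 790 MB/s = 753 MiB/s

so the figures are at least internally coherent, whichever kernel
produced them.)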

Tom.



fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

-
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
 slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
  lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
  | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
  | 99.99th=[12125]
bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, 
stdev=22112.64, samples=2
iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, 
samples=2
   lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
   lat (msec)   : 20=0.02%
   cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), 
io=1024MiB (1074MB), run=1360-1360msec

Disk stats (read/write):
   nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

-
With patches applied:

 fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
 slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
  lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread John Hubbard
On 11/29/18 6:18 PM, Tom Talpey wrote:
> On 11/29/2018 8:39 PM, John Hubbard wrote:
>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>>> [...]
>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>> to try testing with the patched kernel.
>>>>>
>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>> test, and without your change.
>>>>>

>>>> So just to double check (again): you are running fio with these parameters,
>>>> right?
>>>>
>>>> [reader]
>>>> direct=1
>>>> ioengine=libaio
>>>> blocksize=4096
>>>> size=1g
>>>> numjobs=1
>>>> rw=read
>>>> iodepth=64
>>>
>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>> the 1g provides a really small sample set.
>>>
>>> There was one other difference: your results indicated fio 3.3 was used.
>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>> goal is to compare before/after, which I haven't done yet.
>>>
>>
>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
>> options
>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
>> rated
>> speed of the Samsung NVMe device, so now we should have a clearer picture of 
>> the
>> performance that real users will see.
> 
> Oh, good! I'm especially glad because I was having a heck of a time
> reconfiguring the one machine I have available for this.
> 
>> Continuing on, then: running a before and after test, I don't see any 
>> significant
>> difference in the fio results:
> 
> Excerpting from below:
> 
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> 
> vs
> 
>> With patches applied:
>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> 
> Perfect results, not CPU limited, and full IOPS.
> 
> Curiously identical, so I trust you've checked that you measured
> both targets, but if so, I say it's good.
> 

Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:

 $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
   read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
clat (usec): min=148, max=755, avg=326.85, stdev=18.13
 lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
clat percentiles (usec):
 |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  355], 99.50th=[  537], 99.90th=[  553], 99.95th=[  553],
 | 99.99th=[  619]
   bw (  KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, 
stdev=10804.59, samples=2
   iops: min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
  lat (usec)   : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
  cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB 
(1074MB), run=1350-1350msec

Disk stats (read/write):
  nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,
-- 
John Hubbard
NVIDIA
> 
>>
>> fio.conf:
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>>
>> -
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>
[deleted with prejudice. See the correction above, instead] --jhubbard
>>
>> -
>> With patches applied:
>>
>>  fast_256GB $ fio ./experimental-fio.conf
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 40

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread John Hubbard
On 11/28/18 5:59 AM, Tom Talpey wrote:
> On 11/27/2018 9:52 PM, John Hubbard wrote:
>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>> [...]
>>> I'm super-limited here this week hardware-wise and have not been able
>>> to try testing with the patched kernel.
>>>
>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>> test, and without your change.
>>>
>>
>> So just to double check (again): you are running fio with these parameters,
>> right?
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
> 
> Correct, I copy/pasted these directly. I also ran with size=10g because
> the 1g provides a really small sample set.
> 
> There was one other difference: your results indicated fio 3.3 was used.
> My Bionic install has fio 3.1. I don't find that relevant because our
> goal is to compare before/after, which I haven't done yet.
> 

OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.

Continuing on, then: running a before and after test, I don't see any 
significant 
difference in the fio results:

fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

-
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf 
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
clat percentiles (usec):
 |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
 | 99.99th=[12125]
   bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, 
stdev=22112.64, samples=2
   iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec)   : 20=0.02%
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB 
(1074MB), run=1360-1360msec

Disk stats (read/write):
  nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

-
With patches applied:

 fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
clat percentiles (usec):
 |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
 | 99.99th=[12125]
   bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, 
stdev=22112.64, samples=2
   iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec)   : 20=0.02%
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-28 Thread Tom Talpey

On 11/27/2018 9:52 PM, John Hubbard wrote:

On 11/27/18 5:21 PM, Tom Talpey wrote:

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

[...]


What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the spec'd 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.


I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see.



Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?



Here:

     g...@github.com:johnhubbard/linux (branch: gup_dma_testing)


I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.



So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


Correct, I copy/pasted these directly. I also ran with size=10g because
the 1g provides a really small sample set.

There was one other difference: your results indicated fio 3.3 was used.
My Bionic install has fio 3.1. I don't find that relevant because our
goal is to compare before/after, which I haven't done yet.

Tom.






Say, that branch reports it has not had a commit since June 30. Is that
the right one? What about gup_dma_for_lpc_2018?



That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed
it, so the head commit now shows Nov 27:

https://github.com/johnhubbard/linux/commits/gup_dma_testing


The actual code is the same, though. (It is still based on Nov 19th's 
f2ce1065e767
commit.)


thanks,



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-27 Thread John Hubbard
On 11/27/18 5:21 PM, Tom Talpey wrote:
> On 11/21/2018 5:06 PM, John Hubbard wrote:
>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> [...]
>>>
>>> What I'd really like to see is to go back to the original fio parameters
>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>> to the spec'd 200K IOPS of the NVMe device. There seems to be something
>>> wrong with yours, currently.
>>
>> I'll dig into what has gone wrong with the test. I see fio putting data files
>> in the right place, so the obvious "using the wrong drive" is (probably)
>> not it. Even though it really feels like that sort of thing. We'll see.
>>
>>>
>>> Then of course, the result with the patched get_user_pages, and
>>> compare whichever of IOPS or CPU% changes, and how much.
>>>
>>> If these are within a few percent, I agree it's good to go. If it's
>>> roughly 25% like the result just above, that's a rocky road.
>>>
>>> I can try this after the holiday on some basic hardware and might
>>> be able to scrounge up better. Can you post that github link?
>>>
>>
>> Here:
>>
>>     g...@github.com:johnhubbard/linux (branch: gup_dma_testing)
> 
> I'm super-limited here this week hardware-wise and have not been able
> to try testing with the patched kernel.
> 
> I was able to compare my earlier quick test with a Bionic 4.15 kernel
> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
> test, and without your change.
> 

So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64



> Say, that branch reports it has not had a commit since June 30. Is that
> the right one? What about gup_dma_for_lpc_2018?
> 

That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed 
it, so the head commit now shows Nov 27:

   https://github.com/johnhubbard/linux/commits/gup_dma_testing


The actual code is the same, though. (It is still based on Nov 19th's 
f2ce1065e767
commit.)
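
(For reference, a sketch of the usual commands for re-dating a head
commit like that; these are illustrative, not quoted from the thread:

  $ git commit --amend --no-edit --date="now"
  $ git push -f origin gup_dma_testing

The force-push is needed because amending rewrites the head commit.)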


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-27 Thread Tom Talpey

On 11/21/2018 5:06 PM, John Hubbard wrote:

On 11/21/18 8:49 AM, Tom Talpey wrote:

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

~14000 4KB read IOPS is really, really low for an NVMe disk.


Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance, as you can see by the numjobs and direct IO parameters:

cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!


OK, then something really is wrong here...




So I'm thinking that this is not a "tainted" test, but rather, we're 
constraining
things a lot with these choices. It's hard to find a good test config to run 
that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.


I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...


Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.


Yes, it's a nice new system; I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
  (Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB


The Samsung Evo 970 250GB is spec'd to yield 200,000 random read IOPS
with a 4KB QD32 workload:


https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the i7-7800X is a 6-core processor (12 hyperthreads).


So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:

 -- IOPS are similar, around 60k.
 -- BW gets worse, dropping from 290 to 220 MB/s.
 -- CPU is well under 100%.
 -- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20


Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:



Patched:

 Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): 
[_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0
 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
...
Thoughts?


Concern - the 74.2K IOPS unpatched drops to 56.8K patched!


ACK. :)



What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the spec'd 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.


I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see.



Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?



Here:

g...@github.com:johnhubbard/linux (branch: gup_dma_testing)


I'm super-limited here this week hardware-wise and have not been able
to try testing with the patched kernel.

I was able to compare my earlier quick test with a Bionic 4.15 kernel
(400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
test, and without your change.

Say, that branch reports it has not had a commit since June 30. Is that
the right one? What about gup_dma_for_lpc_2018?

Tom.


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-21 Thread John Hubbard
On 11/21/18 8:49 AM, Tom Talpey wrote:
> On 11/21/2018 1:09 AM, John Hubbard wrote:
>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>
>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>> performance, as you can see by the numjobs and direct IO parameters:
>>
>> cat fio.conf
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
> 
> To be clear - I used those identical parameters, on my lower-spec
> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
> higher than yours!

OK, then something really is wrong here...

> 
>> So I'm thinking that this is not a "tainted" test, but rather, we're 
>> constraining
>> things a lot with these choices. It's hard to find a good test config to run 
>> that
>> allows decisions, but so far, I'm not really seeing anything that says "this
>> is so bad that we can't afford to fix the brokenness." I think.
> 
> I'm not suggesting we tune the benchmark, I'm suggesting the results
> on your system are not meaningful since they are orders of magnitude
> low. And without meaningful data it's impossible to see the performance
> impact of the change...
> 
>>> Can you confirm what type of hardware you're running this test on?
>>> CPU, memory speed and capacity, and NVMe device especially?
>>>
>>> Tom.
>>
>> Yes, it's a nice new system; I don't expect any strange perf problems:
>>
>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>>  (Intel X299 chipset)
>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>> DRAM: 32 GB
> 
> The Samsung Evo 970 250GB is spec'd to yield 200,000 random read IOPS
> with a 4KB QD32 workload:
> 
> 
> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
> 
> And the i7-7800X is a 6-core processor (12 hyperthreads).
> 
>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>> patched kernel (below). Highlights:
>>
>> -- IOPS are similar, around 60k.
>> -- BW gets worse, dropping from 290 to 220 MB/s.
>> -- CPU is well under 100%.
>> -- latency is incredibly long, but...20 threads.
>>
>> Baseline:
>>
>> $ ./run.sh
>> fio configuration:
>> [reader]
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> rw=read
>> group_reporting
>> iodepth=256
>> direct=1
>> numjobs=20
> 
> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
> That's going to cause tremendous queuing, and context switching, far
> outside of the get_user_pages() change.
> 
> But even so, it only brings IOPS to 74.2K, which is still far short of
> the device's 200K spec.
> 
> Comparing anyway:
> 
> 
>> Patched:
>>
>>  Running fio:
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=256
>> ...
>> fio-3.3
>> Starting 20 processes
>> Jobs: 13 (f=8): 
>> [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0
>>  IOPS][eta 00m:02s]
>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>>     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>> ...
>> Thoughts?
> 
> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

ACK. :)

> 
> What I'd really like to see is to go back to the original fio parameters
> (1 thread, 64 iodepth) and try to get a result that gets at least close
> to the spec'd 200K IOPS of the NVMe device. There seems to be something
> wrong with yours, currently.

I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see. 
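
(One way to rule that out entirely, as an illustrative sketch: point fio
at the block device itself via its "filename" option, rather than letting
it create files in the current directory. The device path here is an
assumption taken from the disk stats above:

[reader]
# reading the raw device leaves no doubt about which drive is under test
filename=/dev/nvme0n1
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

Reads are non-destructive, but double-check the path before adapting this
for any write workload.)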

> 
> Then of course, the result with the patched get_user_pages, and
> compare whichever of IOPS or CPU% changes, and how much.
> 
> If these are within a few percent, I agree it's good to go. If it's
> roughly 25% like the result just above, that's a rocky road.
> 
> I can try this after the holiday on some basic hardware and might
> be able to scrounge up better. Can you post that github link?
> 

Here:

   g...@github.com:johnhubbard/linux (branch: gup_dma_testing)
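
(For anyone reproducing: assuming the redacted address above is the
standard "git@" SSH form, fetching that branch would look like:

  $ git clone -b gup_dma_testing git@github.com:johnhubbard/linux.git
  $ cd linux

or substitute the https://github.com/johnhubbard/linux URL mentioned
elsewhere in the thread.)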


-- 
thanks,
John Hubbard
NVIDIA


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-21 Thread Tom Talpey

On 11/21/2018 1:09 AM, John Hubbard wrote:

On 11/19/18 10:57 AM, Tom Talpey wrote:

~14000 4KB read IOPS is really, really low for an NVMe disk.


Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance, as you can see by the numjobs and direct IO parameters:

cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!


So I'm thinking that this is not a "tainted" test, but rather, we're 
constraining
things a lot with these choices. It's hard to find a good test config to run 
that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.


I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...


Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.


Yes, it's a nice new system; I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
 (Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB


The Samsung Evo 970 250GB is spec'd to yield 200,000 random read IOPS
with a 4KB QD32 workload:


https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the i7-7800X is a 6-core processor (12 hyperthreads).


So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:

-- IOPS are similar, around 60k.
-- BW gets worse, dropping from 290 to 220 MB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20


Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:



Patched:

 Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 13 (f=8): 
[_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0
 IOPS][eta 00m:02s]
reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
...
Thoughts?


Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the spec'd 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.

Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?

Tom.


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-20 Thread John Hubbard
On 11/19/18 10:57 AM, Tom Talpey wrote:
> John, thanks for the discussion at LPC. One of the concerns we
> raised however was the performance test. The numbers below are
> rather obviously tainted. I think we need to get a better baseline
> before concluding anything...
> 
> Here's my main concern:
> 

Hi Tom,

Thanks again for looking at this!


> On 11/10/2018 3:50 AM, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
>> ...
>> --
>> WITHOUT the patch:
>> --
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 
>> 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
>>     read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
> 
> ~14000 4KB read IOPS is really, really low for an NVMe disk.

Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance, as you can see by the numjobs and direct IO parameters:

cat fio.conf 
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


So I'm thinking that this is not a "tainted" test, but rather, we're 
constraining
things a lot with these choices. It's hard to find a good test config to run 
that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.
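
For readers who joined the thread at this point, here is a minimal sketch
of the usage pattern the patchset is concerned with. This is illustrative,
not code from the series: the get_user_pages() signature is the 4.20-era
one, do_dma_to_pages() is a made-up placeholder for a driver's DMA work,
and put_user_page() is the replacement the series proposes for put_page()
at call sites like this:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>

static int pin_pages_for_dma(unsigned long start, unsigned long nr_pages)
{
	struct page **pages;
	long pinned, i;
	int ret = 0;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* Take a reference on each user page so the device can DMA into it. */
	down_read(&current->mm->mmap_sem);
	pinned = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
	up_read(&current->mm->mmap_sem);
	if (pinned < 0) {
		ret = pinned;
		goto out;
	}

	do_dma_to_pages(pages, pinned);	/* placeholder: device writes the pages */

	/*
	 * Today these references are dropped with put_page(); the series
	 * converts call sites like this one to put_user_page(), so that the
	 * mm layer can tell dma-pinned pages apart from ordinary references.
	 */
	for (i = 0; i < pinned; i++)
		put_page(pages[i]);
out:
	kfree(pages);
	return ret;
}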

After talking with you and reading this email, I did a bunch more test runs, 
varying the following fio parameters:

-- direct
-- numjobs
-- iodepth

...with both the baseline 4.20-rc3 kernel, and with my patches applied. (btw, if
anyone cares, I'll post a github link that has a complete, testable 
patchset--not
ready for submission as such, but it works cleanly and will allow others to 
attempt to reproduce my results).

What I'm seeing is that I can get 10x or better improvements in IOPS and BW,
just by going to 10 threads and turning off direct IO--as expected. So in the 
end,
I increased the number of threads, and also increased iodepth a bit. 
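
As an illustration of that kind of variation (a reconstructed example,
not the exact job file used for those runs), a buffered, 10-thread
variant of Jan's config would look like:

[reader]
direct=0
ioengine=libaio
blocksize=4096
size=1g
numjobs=10
group_reporting
rw=read
iodepth=64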


Test results below...


> 
>>    cpu  : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
> 
> CPU is obviously the limiting factor. At these IOPS, it should be far
> less.
>> --
>> OR, here's a better run WITH the patch applied, and you can see that this is 
>> nearly as good
>> as the "without" case:
>> --
>>
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 
>> 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
>>     read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
> 
> Similar low IOPS.
> 
>>    cpu  : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
> 
> Similar CPU saturation.
> 
>>
> 
> I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
> i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
> and fio version 3.1). Even then, the CPU saturates, so it's not
> necessarily a perfect test. I'd like to see your runs both get to
> "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
> give the best comparison for making a decision.

I can get to CPU < 100% by increasing to 10 or 20 threads, although it
makes latency ever so much worse.

> 
> Can you confirm what type of hardware you're running this test on?
> CPU, memory speed and capacity, and NVMe device especially?
> 
> Tom.

Yes, it's a nice new system; I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
(Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB

So, here's a comparison using 20 threads, direct IO, for the baseline vs. 
patched kernel (below). Highlights:

-- IOPS are similar, around 60k. 
-- BW gets worse, dropping from 290 to 220 MB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.
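
(As a rough sanity check on that last point, Little's Law relates the
numbers: mean latency ~= I/Os outstanding / IOPS. With 20 jobs each
keeping up to 256 I/Os in flight, that is up to 5120 outstanding I/Os,
and at the measured ~60-74k IOPS that works out to roughly 70-85 ms of
average completion latency; very long latencies are exactly what this
configuration should produce.)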

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
 Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 4 (f=4): 
[_(8),R(2),_(2),R(1),_(1),R(1),_(5)][95.9%][r=244MiB/s,w=0KiB/s][r=62.5k,w=0 
IOPS][eta 00m:03s]
reader: (groupid=0, jobs=20): err= 0: pid=14499: Tue Nov 20 16:20:35 2018
   read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec)
slat (usec): min=26, 

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-19 Thread Tom Talpey

John, thanks for the discussion at LPC. One of the concerns we
raised however was the performance test. The numbers below are
rather obviously tainted. I think we need to get a better baseline
before concluding anything...

Here's my main concern:

On 11/10/2018 3:50 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 
...
--
WITHOUT the patch:
--
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 
00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)


~14000 4KB read IOPS is really, really low for an NVMe disk.


   cpu  : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72


CPU is obviously the limiting factor. At these IOPS, it should be far
less.

--
OR, here's a better run WITH the patch applied, and you can see that this is 
nearly as good
as the "without" case:
--

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 
00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)


Similar low IOPS.


   cpu  : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73


Similar CPU saturation.





I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
and fio version 3.1). Even then, the CPU saturates, so it's not
necessarily a perfect test. I'd like to see your runs both get to
"max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
give the best comparison for making a decision.

Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.