Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/29/2018 10:00 PM, John Hubbard wrote:
> On 11/29/18 6:30 PM, Tom Talpey wrote:
>> On 11/29/2018 9:21 PM, John Hubbard wrote:
>>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>>>>>> [...]
>>>> Excerpting from below:
>>>>
>>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>>    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>>    cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> vs
>>>>
>>>>> With patches applied:
>>>>>    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>>    cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>>
>>>> Perfect results, not CPU limited, and full IOPS.
>>>>
>>>> Curiously identical, so I trust you've checked that you measured
>>>> both targets, but if so, I say it's good.
>>>>
>>>
>>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>>> better, at 194K IOPS and 759 MB/s:
>>
>> Definitely better - note the system CPU is lower, which is probably the
>> reason for the increased IOPS.
>>
>>>    cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>>
>> Good result - a correct implementation, and faster.
>>
>
> Thanks, Tom, I really appreciate your experience and help on what performance
> should look like here. (I'm sure you can guess that this is the first time
> I've worked with fio, heh.)

No problem, happy to chip in. Feel free to add my

Tested-By: Tom Talpey

I know, that's not the personal email I'm posting from, but it's me.

I'll be hopefully trying the code with the Linux SMB client (cifs.ko)
next week, Long Li is implementing direct io in that and we'll see how
it helps.

Mainly, I'm looking forward to seeing this enable RDMA-to-DAX.

Tom.

> I'll send out a new, non-RFC patchset soon, then.
>
> thanks,
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/29/18 6:30 PM, Tom Talpey wrote:
> On 11/29/2018 9:21 PM, John Hubbard wrote:
>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>>>>> [...]
>>> Excerpting from below:
>>>
>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>    cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> vs
>>>
>>>> With patches applied:
>>>>    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>    cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> Perfect results, not CPU limited, and full IOPS.
>>>
>>> Curiously identical, so I trust you've checked that you measured
>>> both targets, but if so, I say it's good.
>>>
>>
>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>> better, at 194K IOPS and 759 MB/s:
>
> Definitely better - note the system CPU is lower, which is probably the
> reason for the increased IOPS.
>
>>    cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
>
> Good result - a correct implementation, and faster.
>

Thanks, Tom, I really appreciate your experience and help on what performance
should look like here. (I'm sure you can guess that this is the first time
I've worked with fio, heh.)

I'll send out a new, non-RFC patchset soon, then.

thanks,
--
John Hubbard
NVIDIA
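
For a rough sense of how far apart the two single-job runs are, the headline numbers can be turned into percentages. This is only an illustrative sketch: the helper names are made up, and the IOPS values are recomputed from the I/O count (total=262144) and the runtimes (1350 msec for the corrected baseline, 1360 msec with patches) that fio reports elsewhere in this thread, rather than taken from fio's rounded 194k and 193k.

```python
# Illustrative only: helper names are hypothetical; the inputs are the
# figures reported by fio elsewhere in this thread.

TOTAL_IOS = 262144  # "issued rwts: total=262144" (1 GiB of 4 KiB reads)


def iops_from_runtime(total_ios: int, runtime_ms: int) -> float:
    """Average I/Os per second over the whole run."""
    return total_ios / (runtime_ms / 1000.0)


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


baseline_iops = iops_from_runtime(TOTAL_IOS, 1350)  # corrected "before" run
patched_iops = iops_from_runtime(TOTAL_IOS, 1360)   # run with patches applied

iops_delta_pct = pct_change(baseline_iops, patched_iops)
sys_cpu_delta_points = 48.05 - 44.77  # sys% patched minus sys% baseline

print(f"baseline: {baseline_iops:,.0f} IOPS, patched: {patched_iops:,.0f} IOPS")
print(f"IOPS change: {iops_delta_pct:+.2f}%")
print(f"sys CPU change: {sys_cpu_delta_points:+.2f} percentage points")
```

With these inputs the IOPS difference comes out below one percent, with system CPU about 3.3 percentage points higher in the patched run. Each figure comes from a single one-second run, so run-to-run noise could plausibly be of the same order.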
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/29/2018 9:21 PM, John Hubbard wrote: On 11/29/18 6:18 PM, Tom Talpey wrote: On 11/29/2018 8:39 PM, John Hubbard wrote: On 11/28/18 5:59 AM, Tom Talpey wrote: On 11/27/2018 9:52 PM, John Hubbard wrote: On 11/27/18 5:21 PM, Tom Talpey wrote: On 11/21/2018 5:06 PM, John Hubbard wrote: On 11/21/18 8:49 AM, Tom Talpey wrote: On 11/21/2018 1:09 AM, John Hubbard wrote: On 11/19/18 10:57 AM, Tom Talpey wrote: [...] I'm super-limited here this week hardware-wise and have not been able to try testing with the patched kernel. I was able to compare my earlier quick test with a Bionic 4.15 kernel (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick test, and without your change. So just to double check (again): you are running fio with these parameters, right? [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 Correct, I copy/pasted these directly. I also ran with size=10g because the 1g provides a really small sample set. There was one other difference, your results indicated fio 3.3 was used. My Bionic install has fio 3.1. I don't find that relevant because our goal is to compare before/after, which I haven't done yet. OK, the 50 MB/s was due to my particular .config. I had some expensive debug options set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated speed of the Samsung NVMe device, so now we should have a clearer picture of the performance that real users will see. Oh, good! I'm especially glad because I was having a heck of a time reconfiguring the one machine I have available for this. 
Continuing on, then: running a before and after test, I don't see any significant difference in the fio results: Excerpting from below: Baseline 4.20.0-rc3 (commit f2ce1065e767), as before: read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 vs With patches applied: read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 Perfect results, not CPU limited, and full IOPS. Curiously identical, so I trust you've checked that you measured both targets, but if so, I say it's good. Argh, copy-paste error in the email. The real "before" is ever so slightly better, at 194K IOPS and 759 MB/s: Definitely better - note the system CPU is lower, which is probably the reason for the increased IOPS. >cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73 Good result - a correct implementation, and faster. Tom. $ fio ./experimental-fio.conf reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1) reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018 read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec) slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61 clat (usec): min=148, max=755, avg=326.85, stdev=18.13 lat (usec): min=150, max=3483, avg=328.41, stdev=19.53 clat percentiles (usec): | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326], | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326], | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326], | 99.00th=[ 355], 99.50th=[ 537], 99.90th=[ 553], 99.95th=[ 553], | 99.99th=[ 619] bw ( KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, stdev=10804.59, samples=2 iops: min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2 lat (usec) : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01% cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73 IO depths: 
1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB (1074MB), run=1350-1350msec Disk stats (read/write): nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00% thanks,
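
One way to guard against the kind of copy-paste mix-up discussed above is to re-derive the headline figures from the raw totals in the same output. The sketch below is an assumption-laden illustration, not part of fio: the regular expression only handles the exact `IOPS=...k, BW=...MiB/s (...)(...MiB/...msec)` shape seen in the single-job runs here (it would not match the `20.0GiB` line from the 20-job run), and the function names are hypothetical.

```python
import re

# Matches the single-job summary lines quoted in this thread only.
SUMMARY_RE = re.compile(
    r"IOPS=(?P<iops_k>[\d.]+)k, BW=(?P<bw>[\d.]+)MiB/s "
    r"\([\d.]+MB/s\)\((?P<mib>[\d.]+)MiB/(?P<ms>\d+)msec\)"
)


def parse_summary(line: str) -> dict:
    """Pull the reported IOPS, bandwidth, data volume and runtime out of a line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised fio summary line: {line!r}")
    return {
        "reported_iops": float(match["iops_k"]) * 1000.0,
        "reported_bw_mib_s": float(match["bw"]),
        "mib": float(match["mib"]),
        "runtime_ms": int(match["ms"]),
    }


def derive(parsed: dict, block_bytes: int = 4096) -> tuple:
    """Recompute (MiB/s, IOPS) from the data volume and runtime alone."""
    seconds = parsed["runtime_ms"] / 1000.0
    total_ios = parsed["mib"] * 1024 * 1024 / block_bytes
    return parsed["mib"] / seconds, total_ios / seconds


line = "read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)"
parsed = parse_summary(line)
bw_mib_s, iops = derive(parsed)
print(f"derived: {bw_mib_s:.1f} MiB/s, {iops:,.0f} IOPS")
print(f"reported: {parsed['reported_bw_mib_s']:.0f} MiB/s, "
      f"{parsed['reported_iops']:,.0f} IOPS")
```

The derived values round to the same 759 MiB/s and 194k that fio prints, which is what one would expect if the summary line and the runtime belong to the same run.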
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/29/2018 8:39 PM, John Hubbard wrote: On 11/28/18 5:59 AM, Tom Talpey wrote: On 11/27/2018 9:52 PM, John Hubbard wrote: On 11/27/18 5:21 PM, Tom Talpey wrote: On 11/21/2018 5:06 PM, John Hubbard wrote: On 11/21/18 8:49 AM, Tom Talpey wrote: On 11/21/2018 1:09 AM, John Hubbard wrote: On 11/19/18 10:57 AM, Tom Talpey wrote: [...] I'm super-limited here this week hardware-wise and have not been able to try testing with the patched kernel. I was able to compare my earlier quick test with a Bionic 4.15 kernel (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick test, and without your change. So just to double check (again): you are running fio with these parameters, right? [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 Correct, I copy/pasted these directly. I also ran with size=10g because the 1g provides a really small sample set. There was one other difference, your results indicated fio 3.3 was used. My Bionic install has fio 3.1. I don't find that relevant because our goal is to compare before/after, which I haven't done yet. OK, the 50 MB/s was due to my particular .config. I had some expensive debug options set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated speed of the Samsung NVMe device, so now we should have a clearer picture of the performance that real users will see. Oh, good! I'm especially glad because I was having a heck of a time reconfiguring the one machine I have available for this. 
Continuing on, then: running a before and after test, I don't see any significant difference in the fio results: Excerpting from below: > Baseline 4.20.0-rc3 (commit f2ce1065e767), as before: > read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) >cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 vs > With patches applied: > read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) >cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 Perfect results, not CPU limited, and full IOPS. Curiously identical, so I trust you've checked that you measured both targets, but if so, I say it's good. Tom. fio.conf: [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 - Baseline 4.20.0-rc3 (commit f2ce1065e767), as before: $ fio ./experimental-fio.conf reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1) reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018 read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69 clat percentiles (usec): | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326], | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326], | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326], | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611], | 99.99th=[12125] bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2 iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2 lat (usec) : 250=0.08%, 500=99.30%, 750=0.59% lat (msec) : 20=0.02% cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 
0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec Disk stats (read/write): nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00% - With patches applied: fast_256GB $ fio ./experimental-fio.conf reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1) reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018 read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69 clat percentiles (usec): | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326],
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/29/18 6:18 PM, Tom Talpey wrote:
> On 11/29/2018 8:39 PM, John Hubbard wrote:
>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>>>> [...]
>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>> to try testing with the patched kernel.
>>>>>
>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>> test, and without your change.
>>>>>
>>>>
>>>> So just to double check (again): you are running fio with these parameters,
>>>> right?
>>>>
>>>> [reader]
>>>> direct=1
>>>> ioengine=libaio
>>>> blocksize=4096
>>>> size=1g
>>>> numjobs=1
>>>> rw=read
>>>> iodepth=64
>>>
>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>> the 1g provides a really small sample set.
>>>
>>> There was one other difference, your results indicated fio 3.3 was used.
>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>> goal is to compare before/after, which I haven't done yet.
>>>
>>
>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug options
>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated
>> speed of the Samsung NVMe device, so now we should have a clearer picture of the
>> performance that real users will see.
>
> Oh, good! I'm especially glad because I was having a heck of a time
> reconfiguring the one machine I have available for this.
>
>> Continuing on, then: running a before and after test, I don't see any significant
>> difference in the fio results:
>
> Excerpting from below:
>
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>
> vs
>
>> With patches applied:
>>    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>
> Perfect results, not CPU limited, and full IOPS.
>
> Curiously identical, so I trust you've checked that you measured
> both targets, but if so, I say it's good.
>

Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:

$ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
   read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
    slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
    clat (usec): min=148, max=755, avg=326.85, stdev=18.13
     lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  355], 99.50th=[  537], 99.90th=[  553], 99.95th=[  553],
     | 99.99th=[  619]
   bw (  KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, stdev=10804.59, samples=2
   iops: min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
  lat (usec) : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
  cpu : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB (1074MB), run=1350-1350msec

Disk stats (read/write):
  nvme0n1: ios=222853/0, merge=0/0, ticks=71410/0, in_queue=71935, util=100.00%

thanks,
--
John Hubbard
NVIDIA

>
>>
>> fio.conf:
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>>
>> -
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>> [deleted with prejudice. See the correction above, instead] --jhubbard
>>
>> -
>> With patches applied:
>>
>> fast_256GB $ fio ./experimental-fio.conf
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
>> 40
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/28/18 5:59 AM, Tom Talpey wrote: > On 11/27/2018 9:52 PM, John Hubbard wrote: >> On 11/27/18 5:21 PM, Tom Talpey wrote: >>> On 11/21/2018 5:06 PM, John Hubbard wrote: On 11/21/18 8:49 AM, Tom Talpey wrote: > On 11/21/2018 1:09 AM, John Hubbard wrote: >> On 11/19/18 10:57 AM, Tom Talpey wrote: >> [...] >>> I'm super-limited here this week hardware-wise and have not been able >>> to try testing with the patched kernel. >>> >>> I was able to compare my earlier quick test with a Bionic 4.15 kernel >>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to >>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick >>> test, and without your change. >>> >> >> So just to double check (again): you are running fio with these parameters, >> right? >> >> [reader] >> direct=1 >> ioengine=libaio >> blocksize=4096 >> size=1g >> numjobs=1 >> rw=read >> iodepth=64 > > Correct, I copy/pasted these directly. I also ran with size=10g because > the 1g provides a really small sample set. > > There was one other difference, your results indicated fio 3.3 was used. > My Bionic install has fio 3.1. I don't find that relevant because our > goal is to compare before/after, which I haven't done yet. > OK, the 50 MB/s was due to my particular .config. I had some expensive debug options set in mm, fs and locking subsystems. Turning those off, I'm back up to the rated speed of the Samsung NVMe device, so now we should have a clearer picture of the performance that real users will see. 
Continuing on, then: running a before and after test, I don't see any significant difference in the fio results: fio.conf: [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 - Baseline 4.20.0-rc3 (commit f2ce1065e767), as before: $ fio ./experimental-fio.conf reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1) reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018 read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69 clat percentiles (usec): | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326], | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326], | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326], | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611], | 99.99th=[12125] bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2 iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2 lat (usec) : 250=0.08%, 500=99.30%, 750=0.59% lat (msec) : 20=0.02% cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1360-1360msec Disk stats (read/write): nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00% - With patches applied: fast_256GB $ fio ./experimental-fio.conf reader: (g=0): 
rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1) reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018 read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec) slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46 clat (usec): min=162, max=12247, avg=330.00, stdev=185.55 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69 clat percentiles (usec): | 1.00th=[ 322], 5.00th=[ 326], 10.00th=[ 326], 20.00th=[ 326], | 30.00th=[ 326], 40.00th=[ 326], 50.00th=[ 326], 60.00th=[ 326], | 70.00th=[ 326], 80.00th=[ 326], 90.00th=[ 326], 95.00th=[ 326], | 99.00th=[ 379], 99.50th=[ 594], 99.90th=[ 603], 99.95th=[ 611], | 99.99th=[12125] bw ( KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, stdev=22112.64, samples=2 iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2 lat (usec) : 250=0.08%, 500=99.30%, 750=0.59% lat (msec) : 20=0.02% cpu : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/27/2018 9:52 PM, John Hubbard wrote: On 11/27/18 5:21 PM, Tom Talpey wrote: On 11/21/2018 5:06 PM, John Hubbard wrote: On 11/21/18 8:49 AM, Tom Talpey wrote: On 11/21/2018 1:09 AM, John Hubbard wrote: On 11/19/18 10:57 AM, Tom Talpey wrote: [...] What I'd really like to see is to go back to the original fio parameters (1 thread, 64 iodepth) and try to get a result that gets at least close to the speced 200K IOPS of the NVMe device. There seems to be something wrong with yours, currently. I'll dig into what has gone wrong with the test. I see fio putting data files in the right place, so the obvious "using the wrong drive" is (probably) not it. Even though it really feels like that sort of thing. We'll see. Then of course, the result with the patched get_user_pages, and compare whichever of IOPS or CPU% changes, and how much. If these are within a few percent, I agree it's good to go. If it's roughly 25% like the result just above, that's a rocky road. I can try this after the holiday on some basic hardware and might be able to scrounge up better. Can you post that github link? Here: g...@github.com:johnhubbard/linux (branch: gup_dma_testing) I'm super-limited here this week hardware-wise and have not been able to try testing with the patched kernel. I was able to compare my earlier quick test with a Bionic 4.15 kernel (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick test, and without your change. So just to double check (again): you are running fio with these parameters, right? [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 Correct, I copy/pasted these directly. I also ran with size=10g because the 1g provides a really small sample set. There was one other difference, your results indicated fio 3.3 was used. My Bionic install has fio 3.1. 
I don't find that relevant because our goal is to compare before/after, which I haven't done yet. Tom. Say, that branch reports it has not had a commit since June 30. Is that the right one? What about gup_dma_for_lpc_2018? That's the right branch, but the AuthorDate for the head commit (only) somehow got stuck in the past. I just now amended that patch with a new date and pushed it, so the head commit now shows Nov 27: https://github.com/johnhubbard/linux/commits/gup_dma_testing The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767 commit.) thanks,
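
The acceptance criterion in this exchange ("within a few percent" is fine, "roughly 25%" is a rocky road) can be written down as a small check. This is a sketch under stated assumptions: the function names are invented, and the 5% cut-off is only a stand-in for "a few percent", not a threshold anyone in the thread proposed. The IOPS figures are the ones quoted in this thread.

```python
# Hypothetical helpers. FEW_PERCENT is an assumed stand-in for
# "within a few percent"; nobody in the thread named an exact number.
FEW_PERCENT = 5.0
SPEC_IOPS = 200_000  # device spec figure quoted in the thread


def drop_pct(before: float, after: float) -> float:
    """How far `after` falls below `before`, in percent of `before`."""
    return (before - after) / before * 100.0


def within_few_percent(before: float, after: float,
                       limit: float = FEW_PERCENT) -> bool:
    """True when the change, in either direction, stays inside the limit."""
    return abs(drop_pct(before, after)) <= limit


comparisons = {
    "20 jobs, iodepth 256: baseline vs patched": (74_200, 56_800),
    "Tom's quick test: 4.15 vs 4.20-rc3": (400_000, 375_000),
}

for label, (before, after) in comparisons.items():
    print(f"{label}: {drop_pct(before, after):.2f}% drop, "
          f"within limit: {within_few_percent(before, after)}")

print(f"56.8K is {56_800 / SPEC_IOPS:.1%} of the 200K spec")
```

The first comparison works out to a drop of about 23%, which is consistent with the "roughly 25%" wording above. Whether the second counts as acceptable depends entirely on the assumed cut-off.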
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/27/18 5:21 PM, Tom Talpey wrote:
> On 11/21/2018 5:06 PM, John Hubbard wrote:
>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> [...]
>>>
>>> What I'd really like to see is to go back to the original fio parameters
>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>> to the speced 200K IOPS of the NVMe device. There seems to be something
>>> wrong with yours, currently.
>>
>> I'll dig into what has gone wrong with the test. I see fio putting data files
>> in the right place, so the obvious "using the wrong drive" is (probably)
>> not it. Even though it really feels like that sort of thing. We'll see.
>>
>>>
>>> Then of course, the result with the patched get_user_pages, and
>>> compare whichever of IOPS or CPU% changes, and how much.
>>>
>>> If these are within a few percent, I agree it's good to go. If it's
>>> roughly 25% like the result just above, that's a rocky road.
>>>
>>> I can try this after the holiday on some basic hardware and might
>>> be able to scrounge up better. Can you post that github link?
>>>
>>
>> Here:
>>
>> g...@github.com:johnhubbard/linux (branch: gup_dma_testing)
>
> I'm super-limited here this week hardware-wise and have not been able
> to try testing with the patched kernel.
>
> I was able to compare my earlier quick test with a Bionic 4.15 kernel
> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
> test, and without your change.
>

So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

> Say, that branch reports it has not had a commit since June 30. Is that
> the right one? What about gup_dma_for_lpc_2018?
>

That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed
it, so the head commit now shows Nov 27:

https://github.com/johnhubbard/linux/commits/gup_dma_testing

The actual code is the same, though. (It is still based on Nov 19th's f2ce1065e767
commit.)

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/21/2018 5:06 PM, John Hubbard wrote: On 11/21/18 8:49 AM, Tom Talpey wrote: On 11/21/2018 1:09 AM, John Hubbard wrote: On 11/19/18 10:57 AM, Tom Talpey wrote: ~14000 4KB read IOPS is really, really low for an NVMe disk. Yes, but Jan Kara's original config file for fio is *intended* to highlight the get_user_pages/put_user_pages changes. It was *not* intended to get max performance, as you can see by the numjobs and direct IO parameters: cat fio.conf [reader] direct=1 ioengine=libaio blocksize=4096 size=1g numjobs=1 rw=read iodepth=64 To be clear - I used those identical parameters, on my lower-spec machine, and got 400,000 4KB read IOPS. Those results are nearly 30x higher than yours! OK, then something really is wrong here... So I'm thinking that this is not a "tainted" test, but rather, we're constraining things a lot with these choices. It's hard to find a good test config to run that allows decisions, but so far, I'm not really seeing anything that says "this is so bad that we can't afford to fix the brokenness." I think. I'm not suggesting we tune the benchmark, I'm suggesting the results on your system are not meaningful since they are orders of magnitude low. And without meaningful data it's impossible to see the performance impact of the change... Can you confirm what type of hardware you're running this test on? CPU, memory speed and capacity, and NVMe device especially? Tom. Yes, it's a nice new system, I don't expect any strange perf problems: CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz (Intel X299 chipset) Block device: nvme-Samsung_SSD_970_EVO_250GB DRAM: 32 GB The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS with a 4KB QD32 workload: https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs And the I7-7800X is a 6-core processor (12 hyperthreads). So, here's a comparison using 20 threads, direct IO, for the baseline vs. patched kernel (below). 
Highlights: -- IOPS are similar, around 60k. -- BW gets worse, dropping from 290 to 220 MB/s. -- CPU is well under 100%. -- latency is incredibly long, but...20 threads. Baseline: $ ./run.sh fio configuration: [reader] ioengine=libaio blocksize=4096 size=1g rw=read group_reporting iodepth=256 direct=1 numjobs=20 Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets. That's going to cause tremendous queuing, and context switching, far outside of the get_user_pages() change. But even so, it only brings IOPS to 74.2K, which is still far short of the device's 200K spec. Comparing anyway: Patched: Running fio: reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256 ... fio-3.3 Starting 20 processes Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s] reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018 read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec) ... Thoughts? Concern - the 74.2K IOPS unpatched drops to 56.8K patched! ACK. :) What I'd really like to see is to go back to the original fio parameters (1 thread, 64 iodepth) and try to get a result that gets at least close to the speced 200K IOPS of the NVMe device. There seems to be something wrong with yours, currently. I'll dig into what has gone wrong with the test. I see fio putting data files in the right place, so the obvious "using the wrong drive" is (probably) not it. Even though it really feels like that sort of thing. We'll see. Then of course, the result with the patched get_user_pages, and compare whichever of IOPS or CPU% changes, and how much. If these are within a few percent, I agree it's good to go. If it's roughly 25% like the result just above, that's a rocky road. I can try this after the holiday on some basic hardware and might be able to scrounge up better. Can you post that github link? 
Here: g...@github.com:johnhubbard/linux (branch: gup_dma_testing) I'm super-limited here this week hardware-wise and have not been able to try testing with the patched kernel. I was able to compare my earlier quick test with a Bionic 4.15 kernel (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick test, and without your change. Say, that branch reports it has not had a commit since June 30. Is that the right one? What about gup_dma_for_lpc_2018? Tom.
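
The remark above that 20 threads issuing 256 I/Os each will make latency skyrocket can be illustrated with Little's law (mean latency is roughly the number of requests in flight divided by throughput). The sketch below is a back-of-the-envelope estimate, not a measurement: the function names are hypothetical, and it assumes every job keeps its queue full for the whole run, which the `Jobs: 13 (f=8)` status line suggests is not strictly true near the end of the 20-job run.

```python
# Sketch only: function names are hypothetical, and Little's law assumes a
# steady state in which every job keeps its queue full for the whole run.


def in_flight(numjobs: int, iodepth: int) -> int:
    """Upper bound on I/Os outstanding at once across all jobs."""
    return numjobs * iodepth


def mean_latency_s(outstanding: int, iops: float) -> float:
    """Little's law: mean time in system = items in system / throughput."""
    return outstanding / iops


heavy = in_flight(numjobs=20, iodepth=256)
light = in_flight(numjobs=1, iodepth=64)

heavy_ms = mean_latency_s(heavy, 56_800) * 1e3   # patched 20-job run
light_us = mean_latency_s(light, 193_000) * 1e6  # single-job run

print(f"20 jobs x QD256: {heavy} in flight, ~{heavy_ms:.0f} ms mean latency")
print(f" 1 job  x QD64 : {light} in flight, ~{light_us:.0f} us mean latency")
```

The single-job estimate lands close to the average `lat (usec)` of about 330 that fio reports for that configuration elsewhere in this thread, which suggests the model is reasonable there. The 20-job estimate is roughly 270 times larger, and that gap comes from queueing rather than from anything in get_user_pages().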
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/21/18 8:49 AM, Tom Talpey wrote:
> On 11/21/2018 1:09 AM, John Hubbard wrote:
>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>
>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>> performance, as you can see by the numjobs and direct IO parameters:
>>
>> cat fio.conf
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
>
> To be clear - I used those identical parameters, on my lower-spec
> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
> higher than yours!

OK, then something really is wrong here...

>
>> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
>> things a lot with these choices. It's hard to find a good test config to run that
>> allows decisions, but so far, I'm not really seeing anything that says "this
>> is so bad that we can't afford to fix the brokenness." I think.
>
> I'm not suggesting we tune the benchmark, I'm suggesting the results
> on your system are not meaningful since they are orders of magnitude
> low. And without meaningful data it's impossible to see the performance
> impact of the change...
>
>>> Can you confirm what type of hardware you're running this test on?
>>> CPU, memory speed and capacity, and NVMe device especially?
>>>
>>> Tom.
>>
>> Yes, it's a nice new system, I don't expect any strange perf problems:
>>
>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>> (Intel X299 chipset)
>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>> DRAM: 32 GB
>
> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
> with a 4KB QD32 workload:
>
>
> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
>
> And the I7-7800X is a 6-core processor (12 hyperthreads).
>
>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>> patched kernel (below). Highlights:
>>
>> -- IOPS are similar, around 60k.
>> -- BW gets worse, dropping from 290 to 220 MB/s.
>> -- CPU is well under 100%.
>> -- latency is incredibly long, but...20 threads.
>>
>> Baseline:
>>
>> $ ./run.sh
>> fio configuration:
>> [reader]
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> rw=read
>> group_reporting
>> iodepth=256
>> direct=1
>> numjobs=20
>
> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
> That's going to cause tremendous queuing, and context switching, far
> outside of the get_user_pages() change.
>
> But even so, it only brings IOPS to 74.2K, which is still far short of
> the device's 200K spec.
>
> Comparing anyway:
>
>
>> Patched:
>>
>> Running fio:
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
>> ...
>> fio-3.3
>> Starting 20 processes
>> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>> read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>> ...
>> Thoughts?
>
> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

ACK. :)

>
> What I'd really like to see is to go back to the original fio parameters
> (1 thread, 64 iodepth) and try to get a result that gets at least close
> to the speced 200K IOPS of the NVMe device. There seems to be something
> wrong with yours, currently.

I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see.

>
> Then of course, the result with the patched get_user_pages, and
> compare whichever of IOPS or CPU% changes, and how much.
>
> If these are within a few percent, I agree it's good to go. If it's
> roughly 25% like the result just above, that's a rocky road.
>
> I can try this after the holiday on some basic hardware and might
> be able to scrounge up better. Can you post that github link?
>

Here:

g...@github.com:johnhubbard/linux (branch: gup_dma_testing)

--
thanks,
John Hubbard
NVIDIA
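[Editor's sketch] The acceptance criterion discussed above ("within a few percent" vs. "roughly 25%") can be checked mechanically against the two fio summary lines quoted in this message. The helper names and the 5% threshold below are assumptions made for illustration; the thread itself never fixes a number for "a few percent".

```python
import re

# Multipliers for the suffixes fio prints after an IOPS value (e.g. "56.8k").
_SUFFIX = {"": 1.0, "k": 1e3, "M": 1e6}


def parse_iops(line):
    """Extract the IOPS figure from a fio summary line.

    Hypothetical helper: it only understands the 'IOPS=<number>[k|M]'
    form that appears in the output quoted in this thread.
    """
    m = re.search(r"IOPS=([0-9.]+)([kM]?)", line)
    if m is None:
        raise ValueError("no IOPS field in: " + line)
    return float(m.group(1)) * _SUFFIX[m.group(2)]


def relative_change(baseline, patched):
    """Signed change of patched relative to baseline (-0.10 means 10% lower)."""
    return (patched - baseline) / baseline


# Summary lines as quoted in the thread (20 jobs, iodepth=256, direct=1).
baseline_line = "read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec)"
patched_line = "read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)"

base = parse_iops(baseline_line)
patched = parse_iops(patched_line)
change = relative_change(base, patched)

# Assumption: "a few percent" is read here as 5%.
THRESHOLD = 0.05
within = abs(change) <= THRESHOLD

print(f"baseline={base:.0f} patched={patched:.0f} "
      f"change={change:+.1%} within_threshold={within}")
```

The same two functions could be pointed at the single-job, iodepth=64 runs once those produce credible numbers, which is the comparison actually being asked for.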
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/21/2018 1:09 AM, John Hubbard wrote:
> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>
> Yes, but Jan Kara's original config file for fio is *intended* to highlight
> the get_user_pages/put_user_pages changes. It was *not* intended to get max
> performance, as you can see by the numjobs and direct IO parameters:
>
> cat fio.conf
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64

To be clear - I used those identical parameters, on my lower-spec
machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
higher than yours!

> So I'm thinking that this is not a "tainted" test, but rather, we're constraining
> things a lot with these choices. It's hard to find a good test config to run that
> allows decisions, but so far, I'm not really seeing anything that says "this
> is so bad that we can't afford to fix the brokenness." I think.

I'm not suggesting we tune the benchmark, I'm suggesting the results
on your system are not meaningful since they are orders of magnitude
low. And without meaningful data it's impossible to see the performance
impact of the change...

>> Can you confirm what type of hardware you're running this test on?
>> CPU, memory speed and capacity, and NVMe device especially?
>>
>> Tom.
>
> Yes, it's a nice new system, I don't expect any strange perf problems:
>
> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
> (Intel X299 chipset)
> Block device: nvme-Samsung_SSD_970_EVO_250GB
> DRAM: 32 GB

The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
with a 4KB QD32 workload:

https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs

And the I7-7800X is a 6-core processor (12 hyperthreads).

> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
> patched kernel (below). Highlights:
>
> -- IOPS are similar, around 60k.
> -- BW gets worse, dropping from 290 to 220 MB/s.
> -- CPU is well under 100%.
> -- latency is incredibly long, but...20 threads.
>
> Baseline:
>
> $ ./run.sh
> fio configuration:
> [reader]
> ioengine=libaio
> blocksize=4096
> size=1g
> rw=read
> group_reporting
> iodepth=256
> direct=1
> numjobs=20

Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
That's going to cause tremendous queuing, and context switching, far
outside of the get_user_pages() change.

But even so, it only brings IOPS to 74.2K, which is still far short of
the device's 200K spec.

Comparing anyway:

> Patched:
>
> Running fio:
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
> ...
> fio-3.3
> Starting 20 processes
> Jobs: 13 (f=8): [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0 IOPS][eta 00m:02s]
> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
> read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
> ...
> Thoughts?

Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

What I'd really like to see is to go back to the original fio parameters
(1 thread, 64 iodepth) and try to get a result that gets at least close
to the speced 200K IOPS of the NVMe device. There seems to be something
wrong with yours, currently.

Then of course, the result with the patched get_user_pages, and
compare whichever of IOPS or CPU% changes, and how much.

If these are within a few percent, I agree it's good to go. If it's
roughly 25% like the result just above, that's a rocky road.

I can try this after the holiday on some basic hardware and might
be able to scrounge up better. Can you post that github link?

Tom.
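[Editor's sketch] The remark that 20 threads issuing 256 I/Os each must inflate latency follows from Little's law (requests in flight = throughput x mean time in system). The function below is an illustration only: it assumes every job keeps its queue completely full for the whole run, which is an upper bound rather than something the fio output in this thread confirms, and its name is made up for this example.

```python
def mean_latency_ms(numjobs, iodepth, iops):
    """Estimate mean per-I/O latency from Little's law.

    Assumption: all numjobs * iodepth requests are outstanding at all
    times, so this is an upper-bound estimate, not a measured value.
    """
    in_flight = numjobs * iodepth
    return in_flight / iops * 1000.0


# Configurations and IOPS figures as quoted in the thread.
deep = mean_latency_ms(numjobs=20, iodepth=256, iops=74_200)    # 20-job baseline
shallow = mean_latency_ms(numjobs=1, iodepth=64, iops=13_900)   # original 1-job run

print(f"20 jobs x 256 deep: ~{deep:.1f} ms per I/O")
print(f" 1 job  x  64 deep: ~{shallow:.1f} ms per I/O")
```

With 5120 requests potentially queued against a device completing about 74K of them per second, each request waits on the order of tens of milliseconds regardless of what get_user_pages() costs, which is why the deep-queue run says little about the patch itself.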
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
On 11/19/18 10:57 AM, Tom Talpey wrote:
> John, thanks for the discussion at LPC. One of the concerns we
> raised however was the performance test. The numbers below are
> rather obviously tainted. I think we need to get a better baseline
> before concluding anything...
>
> Here's my main concern:
>

Hi Tom,

Thanks again for looking at this!

> On 11/10/2018 3:50 AM, john.hubb...@gmail.com wrote:
>> From: John Hubbard
>> ...
>> --
>> WITHOUT the patch:
>> --
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
>> read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
>
> ~14000 4KB read IOPS is really, really low for an NVMe disk.

Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance, as you can see by the numjobs and direct IO parameters:

cat fio.conf
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

So I'm thinking that this is not a "tainted" test, but rather, we're constraining
things a lot with these choices. It's hard to find a good test config to run that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.

After talking with you and reading this email, I did a bunch more test runs,
varying the following fio parameters:

-- direct
-- numjobs
-- iodepth

...with both the baseline 4.20-rc3 kernel, and with my patches applied. (btw, if
anyone cares, I'll post a github link that has a complete, testable
patchset--not ready for submission as such, but it works cleanly and will allow
others to attempt to reproduce my results).

What I'm seeing is that I can get 10x or better improvements in IOPS and BW,
just by going to 10 threads and turning off direct IO--as expected. So in the end,
I increased the number of threads, and also increased iodepth a bit.

Test results below...

>
>> cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
>
> CPU is obviously the limiting factor. At these IOPS, it should be far
> less.
>> --
>> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
>> as the "without" case:
>> --
>>
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
>> read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
>
> Similar low IOPS.
>
>> cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
>
> Similar CPU saturation.
>
>>
>
> I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
> i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
> and fio version 3.1). Even then, the CPU saturates, so it's not
> necessarily a perfect test. I'd like to see your runs both get to
> "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
> give the best comparison for making a decision.

I can get to CPU < 100% by increasing to 10 or 20 threads, although it
makes latency ever so much worse.

>
> Can you confirm what type of hardware you're running this test on?
> CPU, memory speed and capacity, and NVMe device especially?
>
> Tom.

Yes, it's a nice new system, I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
(Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB

So, here's a comparison using 20 threads, direct IO, for the baseline vs.
patched kernel (below). Highlights:

-- IOPS are similar, around 60k.
-- BW gets worse, dropping from 290 to 220 MB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20

Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.3
Starting 20 processes
Jobs: 4 (f=4): [_(8),R(2),_(2),R(1),_(1),R(1),_(5)][95.9%][r=244MiB/s,w=0KiB/s][r=62.5k,w=0 IOPS][eta 00m:03s]
reader: (groupid=0, jobs=20): err= 0: pid=14499: Tue Nov 20 16:20:35 2018
read: IOPS=74.2k, BW=290MiB/s (304MB/s)(20.0GiB/70644msec)
slat (usec): min=26,
Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages
John, thanks for the discussion at LPC. One of the concerns we
raised however was the performance test. The numbers below are
rather obviously tainted. I think we need to get a better baseline
before concluding anything...

Here's my main concern:

On 11/10/2018 3:50 AM, john.hubb...@gmail.com wrote:
> From: John Hubbard
> ...
> --
> WITHOUT the patch:
> --
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018
> read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)

~14000 4KB read IOPS is really, really low for an NVMe disk.

> cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72

CPU is obviously the limiting factor. At these IOPS, it should be far
less.

> --
> OR, here's a better run WITH the patch applied, and you can see that this is nearly as good
> as the "without" case:
> --
>
> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
> fio-3.3
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s]
> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018
> read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)

Similar low IOPS.

> cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73

Similar CPU saturation.

I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
and fio version 3.1). Even then, the CPU saturates, so it's not
necessarily a perfect test. I'd like to see your runs both get to
"max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
give the best comparison for making a decision.

Can you confirm what type of hardware you're running this test on?
CPU, memory speed and capacity, and NVMe device especially?

Tom.
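[Editor's sketch] The IOPS and bandwidth figures in the quoted fio summaries are two views of the same measurement: both follow from the bytes transferred, the runtime, and the 4096-byte block size. The small check below recomputes them from the "(1024MiB/18826msec)" and "(1024MiB/19499msec)" fields; the function name is hypothetical and not part of fio.

```python
def fio_rates(total_bytes, runtime_ms, block_size):
    """Recompute IOPS and bandwidth from total bytes moved and runtime.

    Illustrative helper only; it assumes every I/O is exactly
    block_size bytes, as in the blocksize=4096 jobs quoted above.
    """
    seconds = runtime_ms / 1000.0
    bytes_per_s = total_bytes / seconds
    return {
        "iops": bytes_per_s / block_size,
        "mib_per_s": bytes_per_s / 2**20,   # binary units, fio's "MiB/s"
        "mb_per_s": bytes_per_s / 1e6,      # decimal units, fio's "MB/s"
    }


ONE_GIB = 1024 * 2**20  # size=1g in the job file

without_patch = fio_rates(ONE_GIB, runtime_ms=18826, block_size=4096)
with_patch = fio_rates(ONE_GIB, runtime_ms=19499, block_size=4096)

for label, r in (("without patch", without_patch), ("with patch", with_patch)):
    print(f"{label}: {r['iops']:.0f} IOPS, "
          f"{r['mib_per_s']:.1f} MiB/s ({r['mb_per_s']:.1f} MB/s)")
```

Because the two runs move the same 1 GiB, the runtime ratio alone gives the relative slowdown, so either IOPS or bandwidth can be used for the before/after comparison.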