Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-05 Thread John Hubbard
On 2/5/19 5:38 AM, Tom Talpey wrote:
> 
> Ok, I'm satisfied the four-9's latency spike is not in your code. :-)
> Results look good relative to baseline. Thanks for double-checking!
> 
> Tom.


Great, in that case, I'll put the new before-and-after results in the next 
version. Appreciate your help here, as always!

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-05 Thread Christopher Lameter
On Mon, 4 Feb 2019, Ira Weiny wrote:

> On Mon, Feb 04, 2019 at 05:14:19PM +, Christopher Lameter wrote:
> > Frankly I still think this does not solve anything.
> >
> > Concurrent write access from two sources to a single page is simply wrong.
> > You cannot make this right by allowing long term RDMA pins in a filesystem
> > and thus the filesystem can never update part of its files on disk.
> >
> > Can we just disable RDMA to regular filesystems? Regular filesystems
> > should have full control of the write back and dirty status of their
> > pages.
>
> That may be a solution to the corruption/crashes but it is not a solution
> which users want to see.  RDMA directly to file systems (specifically DAX)
> is a use case we have seen customers ask for.

DAX is a special file system that does not use writeback for the DAX
mappings. Thus it could be an exception. And the pages are already pinned.





Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-05 Thread Tom Talpey

On 2/5/2019 3:22 AM, John Hubbard wrote:

On 2/4/19 5:41 PM, Tom Talpey wrote:

On 2/4/2019 12:21 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 


Performance: here is an fio run on an NVMe drive, using this for the fio
configuration file:

 [reader]
 direct=1
 ioengine=libaio
 blocksize=4096
 size=1g
 numjobs=1
 rw=read
 iodepth=64

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=7011: Sun Feb  3 20:36:51 2019
    read: IOPS=190k, BW=741MiB/s (778MB/s)(1024MiB/1381msec)
 slat (nsec): min=2716, max=57255, avg=4048.14, stdev=1084.10
 clat (usec): min=20, max=12485, avg=332.63, stdev=191.77
  lat (usec): min=22, max=12498, avg=336.72, stdev=192.07
 clat percentiles (usec):
      |  1.00th=[  322],  5.00th=[  322], 10.00th=[  322], 20.00th=[  326],
      | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
      | 70.00th=[  326], 80.00th=[  330], 90.00th=[  330], 95.00th=[  330],
      | 99.00th=[  478], 99.50th=[  717], 99.90th=[ 1074], 99.95th=[ 1090],
      | 99.99th=[12256]


These latencies are concerning. The best results we saw at the end of
November (previous approach) were MUCH flatter. These really start
spiking at three 9's, and are sky-high at four 9's. The "stdev" values
for clat and lat are about 10 times the previous. There's some kind
of serious queuing contention here, that wasn't there in November.


Hi Tom,

I think this latency problem is also there in the baseline kernel, but...



    bw (  KiB/s): min=730152, max=776512, per=99.22%, avg=753332.00, stdev=32781.47, samples=2
    iops        : min=182538, max=194128, avg=188333.00, stdev=8195.37, samples=2

   lat (usec)   : 50=0.01%, 100=0.01%, 250=0.07%, 500=99.26%, 750=0.38%
   lat (usec)   : 1000=0.02%
   lat (msec)   : 2=0.24%, 20=0.02%
   cpu  : usr=15.07%, sys=84.13%, ctx=10, majf=0, minf=74


System CPU 84% is roughly double the November results of 45%. Ouch.


That's my fault. First of all, I had a few extra, supposedly minor debug
settings in the .config, which I'm removing now--I'm doing a proper run
with the original .config file from November, below. Second, I'm not
sure I controlled the run carefully enough.



Did you re-run the baseline on the new unpatched base kernel and can
we see the before/after?


Doing that now, I see:

-- No significant perf difference between before and after, but
-- Still high clat in the 99.99th

===
Before: using commit 8834f5600cf3 ("Linux 5.0-rc5")
===
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1829: Tue Feb  5 00:08:08 2019
    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
     slat (nsec): min=1269, max=40309, avg=1493.66, stdev=534.83
     clat (usec): min=127, max=12249, avg=329.83, stdev=184.92
  lat (usec): min=129, max=12256, avg=331.35, stdev=185.06
     clat percentiles (usec):
  |  1.00th=[  326],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
  | 99.00th=[  347], 99.50th=[  519], 99.90th=[  529], 99.95th=[  537],
  | 99.99th=[12125]
    bw (  KiB/s): min=755032, max=781472, per=99.57%, avg=768252.00, stdev=18695.90, samples=2
    iops        : min=188758, max=195368, avg=192063.00, stdev=4673.98, samples=2

   lat (usec)   : 250=0.08%, 500=99.18%, 750=0.72%
   lat (msec)   : 20=0.02%
   cpu  : usr=12.30%, sys=46.83%, ctx=253554, majf=0, minf=74
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%

  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
    READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1359-1359msec


Disk stats (read/write):
   nvme0n1: ios=221246/0, merge=0/0, ticks=71556/0, in_queue=704, util=91.35%


===
After:
===
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1803: Mon Feb  4 23:58:07 2019
    read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
     slat (nsec): 

Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-05 Thread John Hubbard

On 2/4/19 5:41 PM, Tom Talpey wrote:

On 2/4/2019 12:21 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 


Performance: here is an fio run on an NVMe drive, using this for the fio
configuration file:

 [reader]
 direct=1
 ioengine=libaio
 blocksize=4096
 size=1g
 numjobs=1
 rw=read
 iodepth=64

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=7011: Sun Feb  3 20:36:51 2019
    read: IOPS=190k, BW=741MiB/s (778MB/s)(1024MiB/1381msec)
 slat (nsec): min=2716, max=57255, avg=4048.14, stdev=1084.10
 clat (usec): min=20, max=12485, avg=332.63, stdev=191.77
  lat (usec): min=22, max=12498, avg=336.72, stdev=192.07
 clat percentiles (usec):
      |  1.00th=[  322],  5.00th=[  322], 10.00th=[  322], 20.00th=[  326],
      | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
      | 70.00th=[  326], 80.00th=[  330], 90.00th=[  330], 95.00th=[  330],
      | 99.00th=[  478], 99.50th=[  717], 99.90th=[ 1074], 99.95th=[ 1090],
      | 99.99th=[12256]


These latencies are concerning. The best results we saw at the end of
November (previous approach) were MUCH flatter. These really start
spiking at three 9's, and are sky-high at four 9's. The "stdev" values
for clat and lat are about 10 times the previous. There's some kind
of serious queuing contention here, that wasn't there in November.


Hi Tom,

I think this latency problem is also there in the baseline kernel, but...



    bw (  KiB/s): min=730152, max=776512, per=99.22%, avg=753332.00, stdev=32781.47, samples=2
    iops        : min=182538, max=194128, avg=188333.00, stdev=8195.37, samples=2

   lat (usec)   : 50=0.01%, 100=0.01%, 250=0.07%, 500=99.26%, 750=0.38%
   lat (usec)   : 1000=0.02%
   lat (msec)   : 2=0.24%, 20=0.02%
   cpu  : usr=15.07%, sys=84.13%, ctx=10, majf=0, minf=74


System CPU 84% is roughly double the November results of 45%. Ouch.


That's my fault. First of all, I had a few extra, supposedly minor debug
settings in the .config, which I'm removing now--I'm doing a proper run
with the original .config file from November, below. Second, I'm not
sure I controlled the run carefully enough.



Did you re-run the baseline on the new unpatched base kernel and can
we see the before/after?


Doing that now, I see:

-- No significant perf difference between before and after, but
-- Still high clat in the 99.99th

===
Before: using commit 8834f5600cf3 ("Linux 5.0-rc5")
===
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1829: Tue Feb  5 00:08:08 2019
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
slat (nsec): min=1269, max=40309, avg=1493.66, stdev=534.83
clat (usec): min=127, max=12249, avg=329.83, stdev=184.92
 lat (usec): min=129, max=12256, avg=331.35, stdev=185.06
clat percentiles (usec):
 |  1.00th=[  326],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  347], 99.50th=[  519], 99.90th=[  529], 99.95th=[  537],
 | 99.99th=[12125]
    bw (  KiB/s): min=755032, max=781472, per=99.57%, avg=768252.00, stdev=18695.90, samples=2
    iops        : min=188758, max=195368, avg=192063.00, stdev=4673.98, samples=2

  lat (usec)   : 250=0.08%, 500=99.18%, 750=0.72%
  lat (msec)   : 20=0.02%
  cpu  : usr=12.30%, sys=46.83%, ctx=253554, majf=0, minf=74
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%

 issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1359-1359msec


Disk stats (read/write):
  nvme0n1: ios=221246/0, merge=0/0, ticks=71556/0, in_queue=704, util=91.35%


===
After:
===
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64

fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1803: Mon Feb  4 23:58:07 2019
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
slat (nsec): min=1276, max=41900, avg=1505.36, stdev=565.26
clat (usec): 

Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Tom Talpey

On 2/4/2019 12:21 AM, john.hubb...@gmail.com wrote:

From: John Hubbard 


Performance: here is an fio run on an NVMe drive, using this for the fio
configuration file:

 [reader]
 direct=1
 ioengine=libaio
 blocksize=4096
 size=1g
 numjobs=1
 rw=read
 iodepth=64

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=7011: Sun Feb  3 20:36:51 2019
read: IOPS=190k, BW=741MiB/s (778MB/s)(1024MiB/1381msec)
 slat (nsec): min=2716, max=57255, avg=4048.14, stdev=1084.10
 clat (usec): min=20, max=12485, avg=332.63, stdev=191.77
  lat (usec): min=22, max=12498, avg=336.72, stdev=192.07
 clat percentiles (usec):
  |  1.00th=[  322],  5.00th=[  322], 10.00th=[  322], 20.00th=[  326],
  | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
  | 70.00th=[  326], 80.00th=[  330], 90.00th=[  330], 95.00th=[  330],
  | 99.00th=[  478], 99.50th=[  717], 99.90th=[ 1074], 99.95th=[ 1090],
  | 99.99th=[12256]


These latencies are concerning. The best results we saw at the end of
November (previous approach) were MUCH flatter. These really start
spiking at three 9's, and are sky-high at four 9's. The "stdev" values
for clat and lat are about 10 times the previous. There's some kind
of serious queuing contention here, that wasn't there in November.


bw (  KiB/s): min=730152, max=776512, per=99.22%, avg=753332.00, stdev=32781.47, samples=2
iops        : min=182538, max=194128, avg=188333.00, stdev=8195.37, samples=2
   lat (usec)   : 50=0.01%, 100=0.01%, 250=0.07%, 500=99.26%, 750=0.38%
   lat (usec)   : 1000=0.02%
   lat (msec)   : 2=0.24%, 20=0.02%
   cpu  : usr=15.07%, sys=84.13%, ctx=10, majf=0, minf=74


System CPU 84% is roughly double the November results of 45%. Ouch.

Did you re-run the baseline on the new unpatched base kernel and can
we see the before/after?

Tom.


   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=741MiB/s (778MB/s), 741MiB/s-741MiB/s (778MB/s-778MB/s), io=1024MiB (1074MB), run=1381-1381msec

Disk stats (read/write):
   nvme0n1: ios=216966/0, merge=0/0, ticks=6112/0, in_queue=704, util=91.34%


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Ira Weiny
On Mon, Feb 04, 2019 at 05:14:19PM +, Christopher Lameter wrote:
> Frankly I still think this does not solve anything.
> 
> Concurrent write access from two sources to a single page is simply wrong.
> You cannot make this right by allowing long term RDMA pins in a filesystem
> and thus the filesystem can never update part of its files on disk.
> 
> Can we just disable RDMA to regular filesystems? Regular filesystems
> should have full control of the write back and dirty status of their
> pages.

That may be a solution to the corruption/crashes but it is not a solution which
users want to see.  RDMA directly to file systems (specifically DAX) is a use
case we have seen customers ask for.

I think this is the correct path toward supporting this use case.

Ira



Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Matthew Wilcox
On Mon, Feb 04, 2019 at 06:21:39PM +, Christopher Lameter wrote:
> On Mon, 4 Feb 2019, Jason Gunthorpe wrote:
> 
> > On Mon, Feb 04, 2019 at 05:14:19PM +, Christopher Lameter wrote:
> > > Frankly I still think this does not solve anything.
> > >
> > > Concurrent write access from two sources to a single page is simply wrong.
> > > You cannot make this right by allowing long term RDMA pins in a filesystem
> > > and thus the filesystem can never update part of its files on disk.
> >
> > Fundamentally this patch series is fixing O_DIRECT to not crash the
> > kernel in extreme cases. RDMA has the same problem, but it is much
> > easier to hit.
> 
> O_DIRECT is the same issue. O_DIRECT addresses have always been in
> anonymous memory or special file systems.

That's never been a constraint that's existed.


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Christopher Lameter
On Mon, 4 Feb 2019, Jason Gunthorpe wrote:

> On Mon, Feb 04, 2019 at 05:14:19PM +, Christopher Lameter wrote:
> > Frankly I still think this does not solve anything.
> >
> > Concurrent write access from two sources to a single page is simply wrong.
> > You cannot make this right by allowing long term RDMA pins in a filesystem
> > and thus the filesystem can never update part of its files on disk.
>
> Fundamentally this patch series is fixing O_DIRECT to not crash the
> kernel in extreme cases. RDMA has the same problem, but it is much
> easier to hit.

O_DIRECT is the same issue. O_DIRECT addresses have always been in
anonymous memory or special file systems.


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Jason Gunthorpe
On Mon, Feb 04, 2019 at 05:14:19PM +, Christopher Lameter wrote:
> Frankly I still think this does not solve anything.
> 
> Concurrent write access from two sources to a single page is simply wrong.
> You cannot make this right by allowing long term RDMA pins in a filesystem
> and thus the filesystem can never update part of its files on disk.

Fundamentally this patch series is fixing O_DIRECT to not crash the
kernel in extreme cases. RDMA has the same problem, but it is much
easier to hit.

I think questions related to RDMA are somewhat separate, and maybe it
should be blocked, or not, but either way O_DIRECT has to be fixed.

Jason
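
For context, the release side of the pattern under discussion (dirty the page
and drop the reference once the DMA or direct IO completes) looks roughly like
the sketch below. The helper name is invented for illustration; this is not
code from the patch series or from the actual block layer, it only shows where
the problematic set_page_dirty step sits, using ~5.0-era kernel APIs.

/*
 * Hypothetical sketch of a GUP release path (not from the series).
 * Error handling and locking omitted.
 */
#include <linux/mm.h>

static void release_gup_pages(struct page **pages, unsigned long nr_pages,
			      bool dirty)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		/*
		 * For a page obtained via get_user_pages() from a regular,
		 * writeback-capable filesystem, this dirtying can happen
		 * after writeback has already cleaned the page and the
		 * filesystem has torn down its bookkeeping for it; that is
		 * the kind of "extreme case" referred to above.
		 */
		if (dirty)
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
}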


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Christopher Lameter
Frankly I still think this does not solve anything.

Concurrent write access from two sources to a single page is simply wrong.
You cannot make this right by allowing long term RDMA pins in a filesystem
and thus the filesystem can never update part of its files on disk.

Can we just disable RDMA to regular filesystems? Regular filesystems
should have full control of the write back and dirty status of their
pages.

Special filesystems that do not actually do write back (like hugetlbfs),
mmapped raw device files, and anonymous allocations are fine.



Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Christopher Lameter
On Mon, 4 Feb 2019, Christoph Hellwig wrote:

> On Mon, Feb 04, 2019 at 04:08:02PM +, Christopher Lameter wrote:
> > It may be worth noting a couple of times in this text that this was
> > designed for anonymous memory and that such use is/was ok. We are talking
> > about a use case here using mmapped access with a regular filesystem that
> > was not initially intended. The mmapping from the hugepages filesystem
> > is special in that it is not a device that is actually writing things
> > back.
> >
> > Any use with a filesystem that actually writes data back to a medium
> > is something that is broken.
>
> Saying it was not intended seems rather odd, as it was supported
> since day 0 and people made use of it.

Well, until last year I never thought there was a problem because I
considered it separate from regular filesystem I/O.






Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Christoph Hellwig
On Mon, Feb 04, 2019 at 04:08:02PM +, Christopher Lameter wrote:
> It may be worth noting a couple of times in this text that this was
> designed for anonymous memory and that such use is/was ok. We are talking
> about a use case here using mmapped access with a regular filesystem that
> was not initially intended. The mmapping from the hugepages filesystem
> is special in that it is not a device that is actually writing things
> back.
> 
> Any use with a filesystem that actually writes data back to a medium
> is something that is broken.

Saying it was not intended seems rather odd, as it was supported
since day 0 and people made use of it.


Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

2019-02-04 Thread Christopher Lameter
On Sun, 3 Feb 2019, john.hubb...@gmail.com wrote:

> Some kernel components (file systems, device drivers) need to access
> memory that is specified via process virtual address. For a long time, the
> API to achieve that was get_user_pages ("GUP") and its variations. However,
> GUP has critical limitations that have been overlooked; in particular, GUP
> does not interact correctly with filesystems in all situations. That means
> that file-backed memory + GUP is a recipe for potential problems, some of
> which have already occurred in the field.

It may be worth noting a couple of times in this text that this was
designed for anonymous memory and that such use is/was ok. We are talking
about a use case here using mmapped access with a regular filesystem that
was not initially intended. The mmapping from the hugepages filesystem
is special in that it is not a device that is actually writing things
back.

Any use with a filesystem that actually writes data back to a medium
is something that is broken.
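
For readers who want to see what the get_user_pages() pattern described in
the cover-letter excerpt above looks like in practice, here is a minimal,
hypothetical sketch (the helper is invented for illustration and uses the
~5.0-era GUP API; it is not code from the patch series). It shows the pin
side whose long-term variant, for example an RDMA memory registration
against a file-backed mapping, is what this thread is arguing about.

/*
 * Hypothetical sketch: pin a user buffer for device DMA (not from the
 * series). The caller is assumed to hold current->mm->mmap_sem for read,
 * as get_user_pages() required in this era.
 */
#include <linux/err.h>
#include <linux/mm.h>
#include <linux/slab.h>

static struct page **pin_user_buffer(unsigned long uaddr,
				     unsigned long nr_pages)
{
	struct page **pages;
	long got;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return ERR_PTR(-ENOMEM);

	/*
	 * FOLL_WRITE because the device will write into the buffer. The
	 * page references returned here may then be held for a long time
	 * (for the lifetime of an RDMA registration, say). If uaddr is
	 * backed by a regular filesystem, that is exactly the long-term
	 * pin situation debated in this thread.
	 */
	got = get_user_pages(uaddr, nr_pages, FOLL_WRITE, pages, NULL);
	if (got != nr_pages) {
		while (got > 0)
			put_page(pages[--got]);
		kfree(pages);
		return ERR_PTR(-EFAULT);
	}
	return pages;
}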