Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Alex Deucher
On Fri, Nov 25, 2016 at 2:34 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not have a CPU address at all - only a GPU one.
>
> But you don't expect RDMA to work in that case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
>

Blame 32 bit systems and GPUs with tons of vram :)

I think resizable BARs are finally coming in a useful way so this
should go away soon.

Alex
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-25 Thread Linus Torvalds
On Fri, Nov 25, 2016 at 1:48 PM, Theodore Ts'o  wrote:
>
> There is a reason why people want to be able to do that, and that's
> because kprobes doesn't give you access to the arguments and return
> codes to the functions.

Honestly, that's simply not a good reason.

What if everybody did this? Do we pollute the whole kernel with this crap? No.

And if not, then what's so special about something like afs that it
would make sense there?

The thing is, with function tracing, you *can* get the return value
and arguments. Sure, you'll probably need to write eBPF and just
attach it to that fentry call point, and yes, if something is inlined
you're just screwed, but Christ, if you do debugging that way you
shouldn't be writing kernel code in the first place.

If you cannot do filesystem debugging without tracing every single
function entry, you are doing something seriously wrong. Add a couple
of relevant and valid trace points to get the initial arguments etc
(and perhaps to turn on the function tracing going down the stack).

> After all, we need *some* way of saying this can never be considered
> stable.

Oh, if you pollute the kernel with random idiotic trace points, not
only are they not going to be considered stable, after a while people
should stop pulling from you.

I do think we should probably add a few generic VFS level breakpoints
to make it easier for people to catch the arguments they get from the
VFS layer (not every system call - if you're a filesystem person, you
_shouldn't_ care about all the stuff that the VFS layer caches for you
so that you never even have to see it). I do think that Al's "no trace
points what-so-ever" is too strict.

But I think a lot of people add complete crap with the "maybe it's
needed some day" kind of mentality.

The tracepoints should have a good _specific_ reason, and they should
make sense. Not be randomly sprinkled "just because".

 Linus


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-25 Thread Theodore Ts'o
On Fri, Nov 25, 2016 at 11:51:26AM -0800, Linus Torvalds wrote:
> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function. If you want that, use the function tracer. That seems
> to be just debugging code that has been left around for others to
> stumble over. I do *not* believe that we should encourage that kind of
> "machine gun spray" use of tracepoints.

There is a reason why people want to be able to do that, and that's
because kprobes doesn't give you access to the arguments and return
codes to the functions.  Maybe there could be a way to do this more
easily using DWARF information and EBPF magic, perhaps?  It won't help
for inlined functions, of course, but most of the functions where
people want to do this aren't generally functions which are going to
be inlined, but rather things like write_begin, writepages, which are
called via a struct ops table and so will never be inlined to begin
with.

And it *is* handy to be able to do this when you don't know ahead of
time that you might need to debug a production system that is
malfunctioning for some reason.  This is the "S" in RAS (Reliability,
Availability, Serviceability).  This is why it's nice if there were a
way to be clear that it is intended for debugging purposes only ---
and maybe kprobes with EBPF and DWARF would be the answer.

After all, we need *some* way of saying this can never be considered
stable --- what would we do if some userspace program like powertop
started depending on a function name via ktrace and that function
disappeared --- would the userspace application really be intended to
demand that we revert the refactoring, because eliminating a function
name that they were depending on via ktrace point broke them?

- Ted


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Jason Gunthorpe
On Fri, Nov 25, 2016 at 09:40:10PM +0100, Christian König wrote:

> We call this "userptr" and it's just a combination of get_user_pages() on
> command submission and making sure the returned list of pages stays valid
> using a MMU notifier.

Doesn't that still pin the page?

> The "big" problem with this approach is that it is horribly slow. I mean
> seriously horribly slow, so that we actually can't use it for some of the
> purposes we wanted to use it for.
> 
> >The code moving the page will move it and the next GPU command that
> >needs it will refault it in the usual way, just like the CPU would.
> 
> And here comes the problem. CPUs do this on a page-by-page basis, so they
> fault only what is needed and everything else gets filled in on demand.
> As a result, faulting a page is a relatively lightweight operation.
>
> But for GPU command submission we don't know beforehand which pages might
> be accessed, so what we do is walk all possible pages and make sure all
> of them are present.

Little confused why this is slow? So you fault the entire user MM into
your page tables at start of day and keep track of it with mmu
notifiers?

> >This might be much more efficient since it optimizes for the common
> >case of unchanging translation tables.
> 
> Yeah, completely agree. It works perfectly fine as long as you don't have
> two drivers trying to mess with the same page.

Well, the idea would be to not have the GPU block the other driver
beyond hinting that the page shouldn't be swapped out.

> >This assumes the commands are fairly short lived of course, the
> >expectation of the mmu notifiers is that a flush is reasonably prompt
> 
> Correct, this is another problem. GFX command submissions usually don't take
> longer than a few milliseconds, but compute command submission can easily
> take multiple hours.

So, that won't work - you have the same issue as RDMA with work loads
like that.

If you can't somehow fence the hardware then pinning is the only
solution. Felix has the right kind of suggestion for what is needed -
globally stop the GPU, fence the DMA, fix the page tables, and start
it up again. :\

> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)

Right. The advantage of pinning is it tells the other stuff not to
touch the page and doesn't block it, MMU notifiers have to be able to
block quickly.

> I've been thinking about this problem for about a year now and have been
> going in circles for quite a while. So if you have ideas on this, even if
> they sound totally crazy, feel free to share them.

Well, it isn't a software problem. From what I've seen in this thread
the GPU application requires coherent page table mirroring, so the
only full & complete solution is going to be to actually implement
that somehow in GPU hardware.

Everything else is going to be deeply flawed somehow. Linux just
doesn't have the support for this kind of stuff - and I'm honestly not
sure something better is even possible considering the hardware
constraints

This doesn't have to be faulting, but really anything that lets you
pause the GPU DMA and reload the page tables.

You might look at trying to use the IOMMU and/or PCI ATS in very new
hardware. IIRC the physical IOMMU hardware can do the fault and fence
and block stuff, but I'm not sure about software support for using the
IOMMU to create coherent user page table mirrors - that is something
Linux doesn't do today. But there is demand for this kind of capability..

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Felix Kuehling

On 16-11-25 03:40 PM, Christian König wrote:
> On 25.11.2016 at 20:32, Jason Gunthorpe wrote:
>> This assumes the commands are fairly short lived of course, the
>> expectation of the mmu notifiers is that a flush is reasonably prompt
>
> Correct, this is another problem. GFX command submissions usually
> don't take longer than a few milliseconds, but compute command
> submission can easily take multiple hours.
>
> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)
>
> I've been thinking about this problem for about a year now and have been
> going in circles for quite a while. So if you have ideas on this, even if
> they sound totally crazy, feel free to share them.

Our GPUs (at least starting with VI) support compute-wave-save-restore
and can swap out compute queues with fairly low latency. Yes, there is
some overhead (both memory usage and time), but it's a fairly regular
thing with our hardware scheduler (firmware, actually) when we need to
preempt running compute queues to update runlists or we overcommit the
hardware queue resources.

Regards,
  Felix



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Christian König

On 25.11.2016 at 20:32, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
>
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.


Yeah, we have a function to "import" anonymous pages from a CPU pointer
which works exactly that way as well.

We call this "userptr" and it's just a combination of get_user_pages()
on command submission and making sure the returned list of pages stays
valid using an MMU notifier.

The "big" problem with this approach is that it is horribly slow. I mean
seriously horribly slow, so that we actually can't use it for some of the
purposes we wanted to use it for.

> The code moving the page will move it and the next GPU command that
> needs it will refault it in the usual way, just like the CPU would.

And here comes the problem. CPUs do this on a page-by-page basis, so they
fault only what is needed and everything else gets filled in on demand.
As a result, faulting a page is a relatively lightweight operation.

But for GPU command submission we don't know beforehand which pages might
be accessed, so what we do is walk all possible pages and make sure all
of them are present.

Now as far as I understand it, the I/O subsystem for example assumes that
it can easily change the CPU page tables without much overhead. So for
example when a page can't be modified it is temporarily marked as
read-only AFAIK (you are probably way deeper into this than me, so please
confirm).

That absolutely kills any performance for GPU command submissions. We
have use cases where we practically ended up playing ping-pong between
the GPU driver trying to grab the page with get_user_pages() and somebody
else in the kernel marking it read-only.

> This might be much more efficient since it optimizes for the common
> case of unchanging translation tables.

Yeah, completely agree. It works perfectly fine as long as you don't
have two drivers trying to mess with the same page.

> This assumes the commands are fairly short lived of course, the
> expectation of the mmu notifiers is that a flush is reasonably prompt

Correct, this is another problem. GFX command submissions usually don't
take longer than a few milliseconds, but compute command submissions can
easily take multiple hours.

I can easily imagine what would happen when kswapd is blocked by a GPU
command submission for an hour or so while the system is under memory
pressure :)

I've been thinking about this problem for about a year now and have been
going in circles for quite a while. So if you have ideas on this, even if
they sound totally crazy, feel free to share them.


Cheers,
Christian.



Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-25 Thread Mike Marshall
> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function.

Hmmm... we have "gossip" statements in Orangefs which can be triggered with
a debugfs knob... lots of functions have a gossip statement at the
top... is that disgusting?

-Mike

On Fri, Nov 25, 2016 at 2:51 PM, Linus Torvalds wrote:
> On Thu, Nov 24, 2016 at 11:37 PM, Al Viro  wrote:
>>
>> My impression is that nobody (at least kernel-side) wants them to be
>> a stable ABI, so long as nobody in userland screams about their code
>> being broken, everything is fine.  As usual, if nobody notices an ABI
>> change, it hasn't happened.  The question is what happens when somebody
>> does.
>
> Right. There is basically _no_ "stable API" for the kernel anywhere,
> it's just an issue of "you can't break workflow for normal people".
>
> And if somebody writes his own trace scripts, and some random trace
> point goes away (or changes semantics), that's easy: he can just fix
> his script. Tracepoints aren't ever going to be stable in that sense.
>
> But when then somebody writes a trace script that is so useful that
> distros pick it up, and people start using it and depending on it,
> then _that_ trace point may well have become effectively locked in
> stone.
>
> That's happened once already with the whole powertop thing. It didn't
> get all that widely spread, and the fix was largely to improve on
> powertop to the point where it wasn't a problem any more, but we've
> clearly had one case of this happening.
>
> But I suspect that something like powertop is fairly unusual. There is
> certainly room for similar things in the VFS layer (think "better
> vmstat that uses tracepoints"), but I suspect the bulk of tracepoints
> are going to be for random debugging (so that developers can say
> "please run this script") rather than for an actual user application
> kind of situation.
>
> So I don't think we should be _too_ afraid of tracepoints either. When
> being too anti-tracepoint is a bigger practical problem than the
> possible problems down the line, the balance is wrong.
>
> As long as tracepoints make sense from a higher standpoint (ie not
> just random implementation detail of the day), and they aren't
> everywhere, they are unlikely to cause much problem.
>
> We do have filesystem code that is just disgusting. As an example:
> fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> single function. If you want that, use the function tracer. That seems
> to be just debugging code that has been left around for others to
> stumble over. I do *not* believe that we should encourage that kind of
> "machine gun spray" use of tracepoints.
>
> But tracing actual high-level things like IO and faults? I think that
> makes perfect sense, as long as the data that is collected is also the
> actual event data, and not so much a random implementation issue of
> the day.
>
>  Linus


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Felix Kuehling
On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>
>> A white list may end up being rather complicated if it has to cover
>> different CPU generations and system architectures. I feel this is a
>> decision user space could easily make.
>>
>> Logan
> I agree that it is better to leave it up to user space to check what
> works and what does not. I found that writes practically always work, but
> reads very often do not. Also, sometimes a system BIOS update can fix the
> issue.
>
But is user mode always aware that P2P is going on or even possible? For
example you may have a library reading a buffer from a file, but it
doesn't necessarily know where that buffer is located (system memory,
VRAM, ...) and it may not know what kind of device the file is on
(SATA drive, NVMe SSD, ...). The library will never know if all it gets
is a pointer and a file descriptor.

The library ends up calling a read system call. Then it would be up to
the kernel to figure out the most efficient way to read the buffer from
the file. If supported, it could use P2P between a GPU and NVMe where
the NVMe device performs a DMA write to VRAM.

If you put the burden of figuring out the P2P details on user mode code,
I think it will severely limit the use cases that actually take
advantage of it. You also risk a bunch of different implementations that
get it wrong half the time on half the systems out there.

Regards,
  Felix




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Jason Gunthorpe
On Fri, Nov 25, 2016 at 02:49:50PM -0500, Serguei Sagalovitch wrote:

> GPU could perfectly access all VRAM.  It is only issue for p2p without
> special interconnect and CPU access. Strictly speaking as long as we
> have "bus address"  we could have RDMA but  I agreed that for
> RDMA we could/should(?) always "request"  CPU address (I hope that we
> could forget about 32-bit application :-)).

At least on x86 if you have a bus address you have a CPU address. All
RDMAable VRAM has to be visible in the BAR.

> BTW/FYI: About CPU access: Some user-level API is mainly handle based
> so there is no need for CPU access by default.

You mean no need for the memory to be virtually mapped into the
process?

Do you expect to RDMA from this kind of API? How will that work?

Jason


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-25 Thread Linus Torvalds
On Thu, Nov 24, 2016 at 11:37 PM, Al Viro  wrote:
>
> My impression is that nobody (at least kernel-side) wants them to be
> a stable ABI, so long as nobody in userland screams about their code
> being broken, everything is fine.  As usual, if nobody notices an ABI
> change, it hasn't happened.  The question is what happens when somebody
> does.

Right. There is basically _no_ "stable API" for the kernel anywhere,
it's just an issue of "you can't break workflow for normal people".

And if somebody writes his own trace scripts, and some random trace
point goes away (or changes semantics), that's easy: he can just fix
his script. Tracepoints aren't ever going to be stable in that sense.

But when then somebody writes a trace script that is so useful that
distros pick it up, and people start using it and depending on it,
then _that_ trace point may well have become effectively locked in
stone.

That's happened once already with the whole powertop thing. It didn't
get all that widely spread, and the fix was largely to improve on
powertop to the point where it wasn't a problem any more, but we've
clearly had one case of this happening.

But I suspect that something like powertop is fairly unusual. There is
certainly room for similar things in the VFS layer (think "better
vmstat that uses tracepoints"), but I suspect the bulk of tracepoints
are going to be for random debugging (so that developers can say
"please run this script") rather than for an actual user application
kind of situation.

So I don't think we should be _too_ afraid of tracepoints either. When
being too anti-tracepoint is a bigger practical problem than the
possible problems down the line, the balance is wrong.

As long as tracepoints make sense from a higher standpoint (ie not
just random implementation detail of the day), and they aren't
everywhere, they are unlikely to cause much problem.

We do have filesystem code that is just disgusting. As an example:
fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
single function. If you want that, use the function tracer. That seems
to be just debugging code that has been left around for others to
stumble over. I do *not* believe that we should encourage that kind of
"machine gun spray" use of tracepoints.

But tracing actual high-level things like IO and faults? I think that
makes perfect sense, as long as the data that is collected is also the
actual event data, and not so much a random implementation issue of
the day.

 Linus


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Jason Gunthorpe
On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:

> b) Allocation may not have a CPU address at all - only a GPU one.

But you don't expect RDMA to work in that case, right?

GPU people need to stop doing this windowed memory stuff :)

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Jason Gunthorpe
On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:

> >Like you say below we have to handle short lived in the usual way, and
> >that covers basically every device except IB MRs, including the
> >command queue on a NVMe drive.
> 
> Well a problem which wasn't mentioned so far is that while GPUs do have a
> page table to mirror the CPU page table, they usually can't recover from
> page faults.

> So what we do is making sure that all memory accessed by the GPU Jobs stays
> in place while those jobs run (pretty much the same pinning you do for the
> DMA).

Yes, it is DMA, so this is a valid approach.

But, you don't need page faults from the GPU to do proper coherent
page table mirroring. Basically when the driver submits the work to
the GPU it 'faults' the pages into the CPU and mirror translation
table (instead of pinning).

Like in ODP, MMU notifiers/HMM are used to monitor for translation
changes. If a change comes in the GPU driver checks if an executing
command is touching those pages and blocks the MMU notifier until the
command flushes, then unfaults the page (blocking future commands) and
unblocks the mmu notifier.

The code moving the page will move it and the next GPU command that
needs it will refault it in the usual way, just like the CPU would.

This might be much more efficient since it optimizes for the common
case of unchanging translation tables.

This assumes the commands are fairly short lived of course, the
expectation of the mmu notifiers is that a flush is reasonably prompt
..
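
To make the flow described above concrete, here is a toy Python model. It is
only a sketch, not kernel code: MirroredDevice, submit(), invalidate() and
migrate() are invented names for illustration, and "blocking the notifier
until the command flushes" is modelled by simply retiring the in-flight
command synchronously.

```python
class MirroredDevice:
    """Toy model of a device page table kept coherent via invalidation callbacks."""

    def __init__(self, cpu_pages):
        self.cpu_pages = cpu_pages   # virtual page -> physical frame ("CPU page table")
        self.mirror = {}             # translations the device has faulted in
        self.in_flight = {}          # command id -> set of virtual pages it touches
        self.flushes = 0             # how often an invalidation had to wait for a command

    def submit(self, cmd_id, pages):
        # "Fault" every page the command needs into the mirror instead of pinning it.
        for vpage in pages:
            if vpage not in self.mirror:
                self.mirror[vpage] = self.cpu_pages[vpage]
        self.in_flight[cmd_id] = set(pages)

    def complete(self, cmd_id):
        self.in_flight.pop(cmd_id, None)

    def invalidate(self, vpage):
        # Called before the CPU translation changes (the "MMU notifier").
        # Wait for commands touching the page (modelled as retiring them) ...
        touching = [c for c, pages in self.in_flight.items() if vpage in pages]
        for cmd_id in touching:
            self.flushes += 1
            self.complete(cmd_id)
        # ... then drop the stale translation so later commands refault it.
        self.mirror.pop(vpage, None)


def migrate(dev, vpage, new_frame):
    # The code moving the page: invalidate the mirror first, then move the page.
    dev.invalidate(vpage)
    dev.cpu_pages[vpage] = new_frame
```

In this model the common case (no translation changes) costs nothing beyond
the initial submit(); only a migrate() of a page in use by a running command
has to wait, which is why long-running commands are the problem case.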

> >Serguei, what is your plan in GPU land for migration? Ie if I have a
> >CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> >- do you still allow the CPU to access it? Or do you swap it back to
> >cachable memory if the CPU touches it?
> 
> Depends on the policy in command, but currently it's the other way around
> most of the time.
> 
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids reading
> because that is slow, the GPU in turn can access it with full speed.
> 
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
> 
> So that move is transparent for both userspace as well as shaders running on
> the GPU.

That makes sense to me, but the objection that came back for
non-cachable CPU mappings is that it basically breaks too much stuff
subtly, eg atomics, unaligned accesses, the CPU threading memory
model, all change on various architectures and break when caching is
disabled.

IMHO that is OK for specialty things like the GPU where the mmap comes
in via drm or something and apps know to handle that buffer specially.

But it is certainly not OK for DAX where the application is coded for
normal file open()/mmap() and is not prepared for a mmap where (eg)
unaligned read accesses or atomics don't work depending on how the
filesystem is setup.

Which is why I think iopmem is still problematic..

At the very least I think a mmap flag or open flag should be needed to
opt into this behavior and by default non-cacheable DAX mmaps should
be paged into system ram when the CPU accesses them.

I'm hearing most people say ZONE_DEVICE is the way to handle this,
which means the missing remaining piece for RDMA is some kind of DMA
core support for p2p address translation..

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch



> Well, I guess there's some consensus building to do. The existing
> options are:
>
> * Device DAX: which could work but the problem I see with it is that it
> only allows one application to do these transfers. Or there would have
> to be some user-space coordination to figure which application gets what
> memory.

About the one-application restriction: is it per memory mapping? I assume
it should not be a problem for one application to do transfers to several
devices simultaneously? Am I right?

Maybe we should follow the RDMA MR design and register memory for p2p
transfer from user space?

What about the following:

a) Device DAX is created.
b) A "normal" (movable, etc.) allocation will be done for PCIe memory and
   a CPU pointer/access will be requested.
c) p2p_mr_register() will be called and a CPU pointer (mmap() on the DAX
   device) will be returned. Accordingly, such memory will be marked as
   "unmovable" by e.g. the graphics driver.
d) When p2p is not needed, p2p_mr_unregister() will be called.

What do you think? Will it work?
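
The steps above could be sketched roughly as follows. This is a user-space
toy in Python, not a proposed kernel interface: p2p_mr_register() and
p2p_mr_unregister() are only the names from the proposal, while the
Allocation class, the registration count and the returned handle string are
assumptions made up for illustration.

```python
class Allocation:
    """Stand-in for a PCIe memory allocation owned by e.g. a graphics driver."""

    def __init__(self, name):
        self.name = name
        self.movable = True       # step b): a "normal", movable allocation
        self.registrations = 0    # outstanding p2p registrations (hypothetical)


def p2p_mr_register(alloc):
    # Step c): keep the allocation in place and hand back a CPU-visible handle.
    alloc.registrations += 1
    alloc.movable = False
    return "cpu-ptr:" + alloc.name   # placeholder for the mmap()ed pointer


def p2p_mr_unregister(alloc):
    # Step d): the allocation may be moved again once nobody needs p2p.
    if alloc.registrations == 0:
        raise ValueError("allocation is not registered")
    alloc.registrations -= 1
    if alloc.registrations == 0:
        alloc.movable = True
```

Counting registrations is one possible way to let several transfers share the
same buffer; the proposal itself does not say whether nested registration
would be allowed.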




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch



> A white list may end up being rather complicated if it has to cover
> different CPU generations and system architectures. I feel this is a
> decision user space could easily make.
>
> Logan

I agree that it is better to leave it up to user space to check what
works and what does not. I found that writes practically always work, but
reads very often do not. Also, sometimes a system BIOS update can fix the
issue.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch

On 2016-11-25 08:22 AM, Christian König wrote:



>> Serguei, what is your plan in GPU land for migration? Ie if I have a
>> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
>> - do you still allow the CPU to access it? Or do you swap it back to
>> cachable memory if the CPU touches it?
>
> Depends on the policy in command, but currently it's the other way
> around most of the time.
>
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids
> reading because that is slow, the GPU in turn can access it with full
> speed.
>
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
>
> So that move is transparent for both userspace as well as shaders
> running on the GPU.

I would like to add more in relation to CPU access:

a) We could have a CPU-accessible part of VRAM ("inside" the PCIe BAR)
and a non-CPU-accessible part. As a result, if the user needs CPU
access, then the memory should be located in the CPU-accessible part
of VRAM or in system memory.

The application/user mode driver could specify preferences/hints for
locations based on its assumptions/knowledge about access pattern
requirements, game resolution, the size of VRAM, etc. So if CPU access
performance is critical, then such memory should be allocated in system
memory as the first (and maybe only) choice.

b) An allocation may not have a CPU address at all - only a GPU one.
Also, we may not be able to have CPU addresses/access for all of VRAM,
but memory may still be migrated regardless of whether it has a CPU
address or not.

c) "VRAM, it becomes non-cachable"
Strictly speaking, VRAM is configured as WC (write-combined memory) to
provide fast CPU write access. Also, it was found that sometimes, if CPU
access is not critical from a performance perspective, it may be useful
to allocate/program system memory as WC too, to avoid the need for
extra "snooping" to synchronize with CPU caches during GPU access.
So potentially system memory could be WC too.
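
As a rough illustration of the placement hints in (a) and (b), a policy
function might look like the Python sketch below. The function name, its
parameters and the three placement labels are hypothetical, and preferring
the non-CPU-accessible part of VRAM for buffers without a CPU address is an
assumption, not something stated above.

```python
def choose_placement(needs_cpu_access, cpu_perf_critical, visible_vram_free, size):
    """Return 'system', 'visible_vram' or 'invisible_vram' for a new buffer."""
    if cpu_perf_critical:
        # CPU access performance matters most: system memory is the first choice.
        return "system"
    if needs_cpu_access:
        # Must be reachable through the BAR, otherwise fall back to system memory.
        if size <= visible_vram_free:
            return "visible_vram"
        return "system"
    # No CPU address needed; assumption: keep the CPU-visible window free.
    return "invisible_vram"
```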




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Christian König

On 24.11.2016 at 17:42, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
>> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>>>> works at the page table level. We do not need special handling for
>>>>> RDMA.
>>>> I am aware of ODP but, noted by others, it doesn't provide a general
>>>> solution to the points above.
>>> How do you mean?
>> I was only saying it wasn't general in that it wouldn't work for IB
>> hardware that doesn't support ODP or other hardware  that doesn't do
>> similar things (like an NVMe drive).
>
> There are three cases to worry about:
>   - Coherent long lived page table mirroring (RDMA ODP MR)
>   - Non-coherent long lived page table mirroring (RDMA MR)
>   - Short lived DMA mapping (everything else)
>
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.


Well, a problem which hasn't been mentioned so far is that while GPUs do
have a page table to mirror the CPU page table, they usually can't
recover from page faults.

So what we do is make sure that all memory accessed by the GPU jobs
stays in place while those jobs run (pretty much the same pinning you do
for DMA).

But since this can lock down huge amounts of memory, the whole command
submission to GPUs is bound to the memory management. So when too much
memory would get blocked by the GPU, we block further command submissions
until the situation resolves.
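
To illustrate the coupling between command submission and pinning described above, here is a minimal sketch. It is a toy model only and not taken from any driver: PinBudget, try_submit, retire and the byte limits are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PinBudget:
    """Tracks memory pinned by in-flight GPU jobs (hypothetical model)."""
    limit_bytes: int
    pinned_bytes: int = 0
    running: dict = field(default_factory=dict)  # job id -> bytes pinned

    def try_submit(self, job_id: str, working_set_bytes: int) -> bool:
        """Pin the job's working set and admit it, or refuse if over budget."""
        if self.pinned_bytes + working_set_bytes > self.limit_bytes:
            return False  # caller has to wait until running jobs retire
        self.running[job_id] = working_set_bytes
        self.pinned_bytes += working_set_bytes
        return True

    def retire(self, job_id: str) -> None:
        """Job finished: its memory is no longer pinned and may move again."""
        self.pinned_bytes -= self.running.pop(job_id)


budget = PinBudget(limit_bytes=1024)
first = budget.try_submit("job-a", 768)
second = budget.try_submit("job-b", 512)  # refused: would exceed the limit
budget.retire("job-a")
third = budget.try_submit("job-b", 512)   # admitted once job-a has retired
print(first, second, third)
```

A real implementation would block or queue the submitter instead of returning False, but the accounting idea is the same: nothing a running job touches may move, so admission depends on how much is already pinned.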



any complex allocators (GPU or otherwise) should respect that. And that
seems like it should be the default way most of this works -- and I
think it wouldn't actually take too much effort to make it all work now
as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page-pinning-like locking is a hard
requirement.


Yeah, we've had RDMA and O_DIRECT transfers to PCIe-backed ZONE_DEVICE
memory working for some time. I'd say it's a good fit. The main question
we've had is how to expose PCIe BARs to userspace to be used as MRs and
such.

Is there any progress on that?

I still don't quite get what iopmem was about. I thought the
objection to uncacheable ZONE_DEVICE & DAX made sense, so running DAX
over iopmem and still ending up with uncacheable mmaps still seems
like a non-starter to me...

Serguei, what is your plan in GPU land for migration? I.e. if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?


Depends on the policy in effect, but currently it's the other way
around most of the time.

E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids
reading because that is slow, and the GPU in turn can access it at full
speed.

When we run out of VRAM we move those allocations to system memory and
update both the CPU and the GPU page tables.

So that move is transparent to both userspace and shaders running on
the GPU.
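
The eviction flow above can be sketched roughly as follows. This is an illustrative toy model under simple assumptions (oldest-first eviction, one dictionary per page table); Device, Buffer and their fields are hypothetical and do not correspond to any real driver structure.

```python
VRAM, SYSTEM = "vram", "system"


class Buffer:
    def __init__(self, name, size):
        self.name, self.size, self.domain = name, size, VRAM


class Device:
    """Toy model; real drivers also track fences, caching attributes, etc."""

    def __init__(self, vram_size):
        self.vram_free = vram_size
        self.resident = []  # buffers currently in VRAM, oldest first
        self.cpu_pt = {}    # buffer name -> backing domain, as the CPU sees it
        self.gpu_pt = {}    # buffer name -> backing domain, as the GPU sees it

    def _map(self, buf):
        # Both page tables are updated together, so neither userspace nor
        # shaders ever observe a stale location for the buffer.
        self.cpu_pt[buf.name] = buf.domain
        self.gpu_pt[buf.name] = buf.domain

    def alloc(self, name, size):
        # Out of VRAM: move the oldest allocations to system memory.
        while self.vram_free < size and self.resident:
            victim = self.resident.pop(0)
            victim.domain = SYSTEM
            self.vram_free += victim.size
            self._map(victim)
        if self.vram_free < size:
            raise MemoryError("allocation larger than VRAM")
        buf = Buffer(name, size)
        self.vram_free -= size
        self.resident.append(buf)
        self._map(buf)
        return buf


dev = Device(vram_size=100)
a = dev.alloc("a", 60)
b = dev.alloc("b", 60)  # does not fit next to "a", so "a" is evicted
print(a.domain, b.domain, dev.cpu_pt, dev.gpu_pt)
```

The buffer name stays valid across the move; only the backing domain behind it changes, which is what makes the migration transparent.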



One approach might be to mmap the uncachable ZONE_DEVICE memory and
mark it inaccessible to the CPU - DMA could still translate. If the
CPU needs it then the kernel migrates it to system memory so it
becomes cachable. ??


The whole purpose of this effort is that we can do I/O on VRAM directly
without migrating everything back to system memory.

Allowing this, but then doing the migration on the first touch by the
CPU, is clearly not a good idea.


Regards,
Christian.



Jason





Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Christian König

On 24.11.2016 at 18:55, Logan Gunthorpe wrote:

Hey,

On 24/11/16 02:45 AM, Christian König wrote:

E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
Now PCI device B (a SATA device) can directly read/write to it because
it is on the same bus segment, but PCI device C (a network card for
example) can't because it is on a different bus segment and the bridge
can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow but I understand
Intel has improved the bottleneck in newer CPUs than the ones we tested.)


Well, Serguei sent me a couple of documents about QPI when we started to
discuss this internally as well, and that's exactly one of the cases I
had in mind when writing this.

If I understood it correctly, for such systems P2P is technically
possible, but not necessarily a good idea. Usually it is faster to just
use a bounce buffer when the peers are a bit "farther" apart.

That this problem is solved on newer hardware is good, but it doesn't
help us at all if we want to support at least systems from the last
five years or so.



It may just be older hardware that has this issue. As long as a failed
transfer can be handled gracefully by the initiator, I don't see a need
to predetermine whether a device can see another device's memory.


I don't want to predetermine whether a device can see another device's
memory at get_user_pages() time.

My thinking was going more in the direction of a whitelist, consulted at
dma_map_single()/dma_map_sg() time, to figure out whether we should use
a bounce buffer or not.
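
A rough sketch of what such a mapping-time decision could look like. Everything here is an assumption made for illustration: the whitelist contents, the PciDev fields and choose_dma_path() are hypothetical and are not part of the DMA API or any existing kernel interface.

```python
from dataclasses import dataclass

# Hypothetical whitelist: upstream bridges / root complexes where routing
# P2P traffic across bus segments is known to work and to be worthwhile.
P2P_WHITELIST = {"bridge-known-good"}


@dataclass(frozen=True)
class PciDev:
    name: str
    segment: int   # bus segment the device sits on
    upstream: str  # identifier of the bridge / root complex above it


def choose_dma_path(initiator: PciDev, target: PciDev) -> str:
    """Decide at mapping time whether to go peer-to-peer or bounce."""
    if initiator.segment == target.segment:
        return "p2p"  # same bus segment: direct access works
    if (initiator.upstream in P2P_WHITELIST
            and target.upstream in P2P_WHITELIST):
        return "p2p"  # crossing segments, but both sides are whitelisted
    return "bounce"   # unknown or slow path: copy through system memory


gpu = PciDev("gpu", segment=0, upstream="bridge-known-good")
sata = PciDev("sata", segment=0, upstream="bridge-known-good")
nic = PciDev("nic", segment=1, upstream="bridge-old")

print(choose_dma_path(sata, gpu))  # same segment
print(choose_dma_path(nic, gpu))   # different segment, not whitelisted
```

The point of deciding this late is that get_user_pages() stays unaware of the topology; only the mapping step needs to know which initiator is talking to which target.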


Christian.




Logan


