Re: [Nouveau] [PATCH 00/12] drm/nouveau: support for GK20A, cont'd

2014-03-26 Thread Alexandre Courbot
On Wed, Mar 26, 2014 at 7:33 PM, Lucas Stach wrote:
>> > It does so by doing the necessary manual cache flushes/invalidates on
>> > buffer access, so it costs some performance. To avoid this you really want
>> > to get writecombined mappings into the kernel<->userspace interface.
>> > Simply mapping the pushbuf as WC/US has brought a 7% performance
>> > increase in OpenArena when I last tested this. This test was done with
>> > only one PCIe lane, so the perf increase may be even better with a more
>> > adequate interconnect.
>>
>> Interestingly if I allow writecombined mappings in the kernel I get
>> faults when attempting to read the mapped area:
>>
> This is most likely because your handling of those buffers produces
> conflicting mappings (if my understanding of what you are doing is
> right).
>
> At first you allocate memory from CMA without changing the pgprot flags.
> This yields pages which are mapped uncached or cached (when moveable
> pages are purged from CMA to make space for your buffer) into the
> kernel's linear space.
>
> Later you regard this memory as iomem (it isn't!) and let TTM remap
> those pages into the vmalloc area with pgprot set to writecombined.
>
> I don't know exactly why this is causing havoc, but having two
> conflicting virtual mappings of the same physical memory is documented
> to at least produce undefined behavior on ARMv7.

IIUC this is not exactly what happens with GK20A, so let me explain
how VRAM is currently accessed to make sure we are in sync.

VRAM pages are allocated by nvea_ram_get(), which allocates chunks of
contiguous memory using dma_alloc_from_contiguous(). At that time I
don't think the pages are mapped anywhere for the CPU to see (contrary
to dma_alloc_coherent() for instance). Nouveau will then map the
memory into the GPU context's address space, but it is only when
nouveau_ttm_io_mem_reserve() is called that a BAR mapping is created,
making the memory accessible to the CPU through the BAR window (which
I consider as I/O memory).

The area of the BAR window pointing to the VRAM is then mapped to the
kernel (using ioremap_wc() or ioremap_nocache()) or user-space (where
ttm_io_prot() is called to get the pgprot_t to use). It is when this
mapping is writecombined that I get the faults.

So as far as I can tell, only at most one CPU mapping exists at any
time for VRAM memory, which goes through the BAR to access the actual
physical memory. It would probably be faster and more logical to map
the RAM directly so the CPU can address it, but going through the BAR
reduces CPU/GPU synchronization issues and there are a few cases where
we would need to map through the BAR anyway (e.g. tiled memory to be
made linear for the CPU).

I don't know if that helps in understanding what the issue might be - I
just wanted to make sure we are talking about the same thing. :)

Thanks,
Alex.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH 00/12] drm/nouveau: support for GK20A, cont'd

2014-03-26 Thread Lucas Stach
Hi Alexandre,

On Wednesday, 26.03.2014, at 15:33 +0900, Alexandre Courbot wrote:
> Hi Lucas,
> 
> On Mon, Mar 24, 2014 at 10:19 PM, Lucas Stach wrote:
> > Hi Alexandre,
> >
> > On Monday, 24.03.2014, at 17:42 +0900, Alexandre Courbot wrote:
> >> Hi everyone,
> > [...]
> >>
> >> A few lines of hacks (not included here) are still needed to deal with cached
> >> mappings triggering external aborts and CPU/GPU memory coherency issues, but I
> >> hope to understand and address these issues next.
> >
> > For the coherency issue part you may want to look at my Nouveau on ARM
> > series. Most of it never made it upstream, as I lacked the time to work
> > further on this, but it solves the coherency issue from the kernel.
> 
> Oh, thanks for pointing this out, it will probably be most useful.
> Shall I assume the patches at
> https://www.mail-archive.com/nouveau@lists.freedesktop.org/msg13557.html
> are up-to-date? Would you mind if I include the relevant patches of
> yours in the next iteration of this series?
> 
> >
> > It does so by doing the necessary manual cache flushes/invalidates on
> > buffer access, so it costs some performance. To avoid this you really want
> > to get writecombined mappings into the kernel<->userspace interface.
> > Simply mapping the pushbuf as WC/US has brought a 7% performance
> > increase in OpenArena when I last tested this. This test was done with
> > only one PCIe lane, so the perf increase may be even better with a more
> > adequate interconnect.
> 
> Interestingly if I allow writecombined mappings in the kernel I get
> faults when attempting to read the mapped area:
> 
This is most likely because your handling of those buffers produces
conflicting mappings (if my understanding of what you are doing is
right).

At first you allocate memory from CMA without changing the pgprot flags.
This yields pages which are mapped uncached or cached (when moveable
pages are purged from CMA to make space for your buffer) into the
kernel's linear space.

Later you regard this memory as iomem (it isn't!) and let TTM remap
those pages into the vmalloc area with pgprot set to writecombined.

I don't know exactly why this is causing havoc, but having two
conflicting virtual mappings of the same physical memory is documented
to at least produce undefined behavior on ARMv7.
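
To make the notion of conflicting mappings concrete, here is a toy
user-space C sketch. It is not kernel code, and every name in it is
hypothetical. It records the memory attribute of the live mappings of
each page frame and rejects a further mapping whose attribute differs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memory attributes of a CPU mapping. */
enum mem_attr { ATTR_NONE = 0, ATTR_CACHED, ATTR_UNCACHED, ATTR_WRITECOMBINE };

#define MAX_PAGES 16

/* Per page frame: attribute shared by its live mappings, and their count. */
struct alias_tracker {
    enum mem_attr attr[MAX_PAGES];
    unsigned int  count[MAX_PAGES];
};

/* Record a new mapping of frame pfn. Returns false when the frame is
 * already mapped with a different attribute (a conflicting alias). */
bool tracker_map(struct alias_tracker *t, size_t pfn, enum mem_attr attr)
{
    assert(t != NULL && pfn < MAX_PAGES && attr != ATTR_NONE);
    if (t->count[pfn] > 0 && t->attr[pfn] != attr)
        return false;
    t->attr[pfn] = attr;
    t->count[pfn]++;
    return true;
}

/* Drop one mapping of frame pfn; the last unmap clears the attribute. */
void tracker_unmap(struct alias_tracker *t, size_t pfn)
{
    assert(t != NULL && pfn < MAX_PAGES && t->count[pfn] > 0);
    if (--t->count[pfn] == 0)
        t->attr[pfn] = ATTR_NONE;
}
```

In this model, mapping frame 3 as cached and then as writecombined is
refused, which mirrors the linear-space plus vmalloc remap scenario
described above. Real hardware refuses nothing; it simply stops
behaving predictably.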

Regards,
Lucas

> [   78.074854] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf003e010
> ...
> [   78.337862] [] (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
> [   78.352536] [] (nouveau_fence_update) from [] (nouveau_fence_done+0x18/0x28)
> [   78.367531] [] (nouveau_fence_done) from [] (ttm_bo_wait+0x104/0x184)
> [   78.381915] [] (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x40/0xe8)
> [   78.396849] [] (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x404/0x4b8)
> [   78.411790] [] (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
> [   78.425805] [] (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3f0/0x5bc)
> [   78.440277] [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
> [   78.453918] [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30)
> 
> To avoid these I need to set the VRAM default_caching to
> TTM_PL_FLAG_UNCACHED. It is not clear to me why this is needed. Since
> the BOs are accessed through the BAR, they are correctly treated as I/O
> memory and mapped using ttm_bo_ioremap(), so the fault really seems to
> come from the WC mapping itself.
> 
> Note that if I go ahead and force the use of pgprot_writecombine() in
> ttm_io_prot() to get writecombined user-space mappings, pure DRM
> programs that map a buffer and try to read it fail similarly, while
> Mesa's glReadPixels() seems to be happy. I'm not sure what it does
> differently here.
> 
> Cheers,
> Alex.

-- 
Pengutronix e.K.   | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



Re: [Nouveau] [PATCH 00/12] drm/nouveau: support for GK20A, cont'd

2014-03-25 Thread Alexandre Courbot
Hi Lucas,

On Mon, Mar 24, 2014 at 10:19 PM, Lucas Stach wrote:
> Hi Alexandre,
>
> On Monday, 24.03.2014, at 17:42 +0900, Alexandre Courbot wrote:
>> Hi everyone,
> [...]
>>
>> A few lines of hacks (not included here) are still needed to deal with cached
>> mappings triggering external aborts and CPU/GPU memory coherency issues, but I
>> hope to understand and address these issues next.
>
> For the coherency issue part you may want to look at my Nouveau on ARM
> series. Most of it never made it upstream, as I lacked the time to work
> further on this, but it solves the coherency issue from the kernel.

Oh, thanks for pointing this out, it will probably be most useful.
Shall I assume the patches at
https://www.mail-archive.com/nouveau@lists.freedesktop.org/msg13557.html
are up-to-date? Would you mind if I include the relevant patches of
yours in the next iteration of this series?

>
> It does so by doing the necessary manual cache flushes/invalidates on
> buffer access, so it costs some performance. To avoid this you really want
> to get writecombined mappings into the kernel<->userspace interface.
> Simply mapping the pushbuf as WC/US has brought a 7% performance
> increase in OpenArena when I last tested this. This test was done with
> only one PCIe lane, so the perf increase may be even better with a more
> adequate interconnect.

Interestingly if I allow writecombined mappings in the kernel I get
faults when attempting to read the mapped area:

[   78.074854] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf003e010
...
[   78.337862] [] (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
[   78.352536] [] (nouveau_fence_update) from [] (nouveau_fence_done+0x18/0x28)
[   78.367531] [] (nouveau_fence_done) from [] (ttm_bo_wait+0x104/0x184)
[   78.381915] [] (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x40/0xe8)
[   78.396849] [] (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x404/0x4b8)
[   78.411790] [] (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
[   78.425805] [] (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3f0/0x5bc)
[   78.440277] [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
[   78.453918] [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30)

To avoid these I need to set the VRAM default_caching to
TTM_PL_FLAG_UNCACHED. It is not clear to me why this is needed. Since
the BOs are accessed through the BAR, they are correctly treated as I/O
memory and mapped using ttm_bo_ioremap(), so the fault really seems to
come from the WC mapping itself.
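
As a rough illustration of that workaround, here is a minimal
user-space C sketch of how a default caching mode could steer the
choice between a writecombined and an uncached I/O mapping (in the
kernel, ioremap_wc() versus ioremap_nocache()). This is a simplified
model, not TTM or Nouveau code: the types, names and flag values are
all made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative caching flags; names and values are hypothetical. */
enum sketch_caching {
    SKETCH_UNCACHED = 1u << 0,
    SKETCH_WC       = 1u << 1
};

/* Which ioremap flavour the model would end up calling. */
enum sketch_ioremap {
    SKETCH_IOREMAP_NOCACHE,
    SKETCH_IOREMAP_WC
};

/* Stand-in for a memory-type manager holding the default caching mode. */
struct sketch_mem_type {
    unsigned int default_caching;
};

/* Pick the mapping flavour: writecombined only if the memory type asks
 * for it, otherwise fall back to an uncached mapping. */
enum sketch_ioremap sketch_pick_ioremap(const struct sketch_mem_type *man)
{
    assert(man != NULL);
    if (man->default_caching & SKETCH_WC)
        return SKETCH_IOREMAP_WC;
    return SKETCH_IOREMAP_NOCACHE;
}
```

In this model, switching default_caching from SKETCH_WC to
SKETCH_UNCACHED changes the selected mapping from writecombined to
uncached, which is the effect the workaround relies on.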

Note that if I go ahead and force the use of pgprot_writecombine() in
ttm_io_prot() to get writecombined user-space mappings, pure DRM
programs that map a buffer and try to read it fail similarly, while
Mesa's glReadPixels() seems to be happy. I'm not sure what it does
differently here.

Cheers,
Alex.


Re: [Nouveau] [PATCH 00/12] drm/nouveau: support for GK20A, cont'd

2014-03-25 Thread Lucas Stach
Hi Alexandre,

On Monday, 24.03.2014, at 17:42 +0900, Alexandre Courbot wrote:
> Hi everyone,
[...]
> 
> A few lines of hacks (not included here) are still needed to deal with cached
> mappings triggering external aborts and CPU/GPU memory coherency issues, but I
> hope to understand and address these issues next.

For the coherency issue part you may want to look at my Nouveau on ARM
series. Most of it never made it upstream, as I lacked the time to work
further on this, but it solves the coherency issue from the kernel.

It does so by doing the necessary manual cache flushes/invalidates on
buffer access, so it costs some performance. To avoid this you really want
to get writecombined mappings into the kernel<->userspace interface.
Simply mapping the pushbuf as WC/US has brought a 7% performance
increase in OpenArena when I last tested this. This test was done with
only one PCIe lane, so the perf increase may be even better with a more
adequate interconnect.
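
The cost of manual cache maintenance can be pictured with a toy model.
The C sketch below is only an illustration under made-up assumptions:
a single word of RAM sits behind a one-entry write-back cache, all
names are hypothetical, and none of it is kernel API. A CPU write
stays invisible to the device until an explicit flush, and a device
write stays invisible to the CPU until an explicit invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One word of RAM shadowed by a one-entry write-back CPU cache. */
struct toy_mem {
    uint32_t ram;    /* what the device (GPU) sees */
    uint32_t cache;  /* CPU-side cached copy */
    bool     valid;  /* cache currently holds a copy */
    bool     dirty;  /* cached copy is newer than RAM */
};

/* CPU write lands in the cache only. */
void cpu_write(struct toy_mem *m, uint32_t v)
{
    assert(m != NULL);
    m->cache = v;
    m->valid = true;
    m->dirty = true;
}

/* CPU read is served from the cache, filling it from RAM on a miss. */
uint32_t cpu_read(struct toy_mem *m)
{
    assert(m != NULL);
    if (!m->valid) {
        m->cache = m->ram;
        m->valid = true;
        m->dirty = false;
    }
    return m->cache;
}

/* Device accesses bypass the CPU cache entirely. */
void dev_write(struct toy_mem *m, uint32_t v) { assert(m != NULL); m->ram = v; }
uint32_t dev_read(const struct toy_mem *m) { assert(m != NULL); return m->ram; }

/* Flush: write a dirty cached copy back to RAM. */
void cache_flush(struct toy_mem *m)
{
    assert(m != NULL);
    if (m->valid && m->dirty) {
        m->ram = m->cache;
        m->dirty = false;
    }
}

/* Invalidate: discard the cached copy so the next read refetches RAM. */
void cache_invalidate(struct toy_mem *m)
{
    assert(m != NULL);
    m->valid = false;
    m->dirty = false;
}
```

Every buffer hand-over between CPU and device needs one of these two
steps in the cached case. A writecombined mapping avoids them because
CPU accesses do not go through the data cache in the first place.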

Regards,
Lucas
-- 
Pengutronix e.K.   | Lucas Stach |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



[Nouveau] [PATCH 00/12] drm/nouveau: support for GK20A, cont'd

2014-03-24 Thread Alexandre Courbot
Hi everyone,

Here is the second batch of patches to add GK20A support to Nouveau. This time
we are adding the actual chip support, and this series brings the driver to a
point where a slightly-tweaked Mesa successfully runs shaders and renders
triangles on GBM! Many thanks to Thierry Reding and the people on the
#nouveau IRC channel for their help without which we would not have reached
this milestone.

A few lines of hacks (not included here) are still needed to deal with cached
mappings triggering external aborts and CPU/GPU memory coherency issues, but I
hope to understand and address these issues next.

Most of the changes below have already been seen (and sometimes reviewed) in an
earlier patchset. What has been added is proper PGRAPH support (still needing
an external firmware and mostly reusing NVE4's code) as well as a better RAM
implementation.

How to represent and manage VRAM has been the hardest part to deal with, since
GK20A shares the system memory with the CPU without any kind of partition. I
have tried various approaches (including some in which no RAM object ever gets
instantiated) and finally decided to go for one using DMA-contiguous memory
allocations and relying on BAR mappings for kernel access and exposure to
user-space, as it fits better with existing code and keeps us safe from most of
the CPU/GPU memory coherency issues (at the cost of some performance).

Looking forward to your review of these few patches! :)

Cheers,
Alex.

Alexandre Courbot (12):
  drm/nouveau: fix missing newline
  drm/nouveau/timer: skip calibration on GK20A
  drm/nouveau/bar: only ioremap BAR3 if it exists
  drm/nouveau/bar/nvc0: support chips without BAR3
  drm/nouveau/fifo: add GK20A support
  drm/nouveau/ibus: add GK20A support
  drm/nouveau/fb: add GK20A support
  drm/nouveau/graph: enable when using external firmware
  drm/nouveau/graph: pad firmware code at load time
  drm/nouveau/graph: add GK20A support
  drm/nouveau: support GK20A in nouveau_accel_init()
  drm/nouveau: support for probing GK20A

 drivers/gpu/drm/nouveau/Makefile                   |   5 +
 drivers/gpu/drm/nouveau/core/engine/device/nve0.c  |  20 +++
 drivers/gpu/drm/nouveau/core/engine/fifo/nve0.h    |   1 +
 drivers/gpu/drm/nouveau/core/engine/fifo/nvea.c    |  35 +
 .../gpu/drm/nouveau/core/engine/graph/ctxnve4.c    |   4 +-
 drivers/gpu/drm/nouveau/core/engine/graph/nvc0.c   |  12 +-
 drivers/gpu/drm/nouveau/core/engine/graph/nvc0.h   |   9 ++
 drivers/gpu/drm/nouveau/core/engine/graph/nve4.c   |   2 +-
 drivers/gpu/drm/nouveau/core/engine/graph/nvea.c   |  75 +
 drivers/gpu/drm/nouveau/core/include/engine/fifo.h |   1 +
 .../gpu/drm/nouveau/core/include/engine/graph.h    |   1 +
 drivers/gpu/drm/nouveau/core/include/subdev/fb.h   |   1 +
 drivers/gpu/drm/nouveau/core/include/subdev/ibus.h |   1 +
 drivers/gpu/drm/nouveau/core/subdev/bar/base.c     |   7 +-
 drivers/gpu/drm/nouveau/core/subdev/bar/nvc0.c     | 101 +++--
 drivers/gpu/drm/nouveau/core/subdev/fb/nvea.c      |  56 +++
 drivers/gpu/drm/nouveau/core/subdev/fb/priv.h      |   1 +
 drivers/gpu/drm/nouveau/core/subdev/fb/ramnvea.c   | 168 +
 drivers/gpu/drm/nouveau/core/subdev/ibus/nvea.c    | 110 ++
 drivers/gpu/drm/nouveau/core/subdev/timer/nv04.c   |  19 ++-
 drivers/gpu/drm/nouveau/nouveau_drm.c              |  12 +-
 21 files changed, 578 insertions(+), 63 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/core/engine/fifo/nvea.c
 create mode 100644 drivers/gpu/drm/nouveau/core/engine/graph/nvea.c
 create mode 100644 drivers/gpu/drm/nouveau/core/subdev/fb/nvea.c
 create mode 100644 drivers/gpu/drm/nouveau/core/subdev/fb/ramnvea.c
 create mode 100644 drivers/gpu/drm/nouveau/core/subdev/ibus/nvea.c

-- 
1.9.1
