Re: KMSAN: kernel-infoleak in compat_drm_wait_vblank

2021-03-03 Thread Michel Dänzer

On 2021-02-22 10:15 a.m., syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit:    29ad81a1 arch/x86: add missing include to sparsemem.h
git tree:   https://github.com/google/kmsan.git master
console output: https://syzkaller.appspot.com/x/log.txt?x=111e6312d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8e3b38ca92283e
dashboard link: https://syzkaller.appspot.com/bug?extid=620cf21140fc7e772a5d
compiler:   Debian clang version 11.0.1-2
userspace arch: i386

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+620cf21140fc7e772...@syzkaller.appspotmail.com

==================================================================
BUG: KMSAN: kernel-infoleak in kmsan_copy_to_user+0x9c/0xb0 
mm/kmsan/kmsan_hooks.c:249
CPU: 1 PID: 26999 Comm: syz-executor.2 Not tainted 5.11.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack+0x21c/0x280 lib/dump_stack.c:120
  kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
  kmsan_internal_check_memory+0x484/0x520 mm/kmsan/kmsan.c:437
  kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
  instrument_copy_to_user include/linux/instrumented.h:121 [inline]
  _copy_to_user+0x1ac/0x270 lib/usercopy.c:33
  copy_to_user include/linux/uaccess.h:209 [inline]
  compat_drm_wait_vblank+0x36f/0x450 drivers/gpu/drm/drm_ioc32.c:866
  drm_compat_ioctl+0x3f6/0x590 drivers/gpu/drm/drm_ioc32.c:995
  __do_compat_sys_ioctl fs/ioctl.c:842 [inline]
  __se_compat_sys_ioctl+0x53d/0x1100 fs/ioctl.c:793
  __ia32_compat_sys_ioctl+0x4a/0x70 fs/ioctl.c:793
  do_syscall_32_irqs_on arch/x86/entry/common.c:79 [inline]
  __do_fast_syscall_32+0x102/0x160 arch/x86/entry/common.c:141
  do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:166
  do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:209
  entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
RIP: 0023:0xf7f47549
Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 
d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 
26 00 00 00 00 8d b4 26 00 00 00 00
RSP: 002b:f55415fc EFLAGS: 0296 ORIG_RAX: 0036
RAX: ffda RBX: 0003 RCX: c018643a
RDX: 2100 RSI:  RDI: 
RBP:  R08:  R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 

Uninit was stored to memory at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
  kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
  __msan_chain_origin+0x57/0xa0 mm/kmsan/kmsan_instr.c:147
  compat_drm_wait_vblank+0x43c/0x450 drivers/gpu/drm/drm_ioc32.c:865
  drm_compat_ioctl+0x3f6/0x590 drivers/gpu/drm/drm_ioc32.c:995
  __do_compat_sys_ioctl fs/ioctl.c:842 [inline]
  __se_compat_sys_ioctl+0x53d/0x1100 fs/ioctl.c:793
  __ia32_compat_sys_ioctl+0x4a/0x70 fs/ioctl.c:793
  do_syscall_32_irqs_on arch/x86/entry/common.c:79 [inline]
  __do_fast_syscall_32+0x102/0x160 arch/x86/entry/common.c:141
  do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:166
  do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:209
  entry_SYSENTER_compat_after_hwframe+0x4d/0x5c

Local variable req@compat_drm_wait_vblank created at:
  compat_drm_wait_vblank+0x7b/0x450 drivers/gpu/drm/drm_ioc32.c:849
  compat_drm_wait_vblank+0x7b/0x450 drivers/gpu/drm/drm_ioc32.c:849

Bytes 12-15 of 16 are uninitialized
Memory access of size 16 starts at 88814ffe3c98
Data copied to user address 2100
==================================================================


compat_drm_wait_vblank would need to initialize

req.reply.tval_usec = req32.reply.tval_usec;

before calling drm_ioctl_kernel, since it's not aliased by any req.request.* 
member, and drm_wait_vblank_ioctl doesn't always write to it.
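For illustration, a minimal sketch of what that could look like in
compat_drm_wait_vblank() (the surrounding lines are reconstructed from
context, not quoted verbatim from the tree):

    union drm_wait_vblank req;

    req.request.type = req32.request.type;
    req.request.sequence = req32.request.sequence;
    req.request.signal = req32.request.signal;
    /* tval_usec isn't aliased by any req.request.* member and isn't
     * always written by drm_wait_vblank_ioctl, so carry the user value
     * through to avoid copying uninitialized stack back to userspace: */
    req.reply.tval_usec = req32.reply.tval_usec;

    err = drm_ioctl_kernel(file, drm_wait_vblank_ioctl, &req, DRM_UNLOCKED);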


--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v3] kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE

2021-02-12 Thread Michel Dänzer

On 2021-02-12 1:57 p.m., Emil Velikov wrote:

On Fri, 5 Feb 2021 at 22:01, Chris Wilson  wrote:


Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fds (e.g. device
or dmabuf)


As you rightfully point out, SYS_kcmp is a bit of a double-edged sword.
While you mention the CONFIG issue, there is also a portability aspect
(Mesa runs on more than just Linux) as well as sandbox filtering
of the extra syscall.

Last time I looked, the latter was still an issue and mesa was using
SYS_kcmp to compare device node fds.
A far shorter and more portable solution is possible, so let me
prepare a Mesa patch.


Make sure to read my comments on 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6881 first. :)



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] kernel: Expose SYS_kcmp by default

2021-02-08 Thread Michel Dänzer

On 2021-02-08 2:34 p.m., Daniel Vetter wrote:

On Mon, Feb 8, 2021 at 12:49 PM Michel Dänzer  wrote:


On 2021-02-05 9:53 p.m., Daniel Vetter wrote:

On Fri, Feb 5, 2021 at 7:37 PM Kees Cook  wrote:


On Fri, Feb 05, 2021 at 04:37:52PM +, Chris Wilson wrote:

Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fds (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Signed-off-by: Chris Wilson 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
Cc: Andrew Morton 
Cc: Dave Airlie 
Cc: Daniel Vetter 
Cc: Lucas Stach 
---
   init/Kconfig  | 11 +++
   kernel/Makefile   |  2 +-
   tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
   3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index b77c60f8b963..f62fca13ac5b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1194,6 +1194,7 @@ endif # NAMESPACES
   config CHECKPOINT_RESTORE
bool "Checkpoint/restore support"
select PROC_CHILDREN
+ select KCMP
default n
help
  Enables additional kernel features in a sake of checkpoint/restore.
@@ -1737,6 +1738,16 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
   config ARCH_HAS_MEMBARRIER_SYNC_CORE
bool

+config KCMP
+ bool "Enable kcmp() system call" if EXPERT
+ default y


I would expect this to be not default-y, especially if
CHECKPOINT_RESTORE does a "select" on it.

This is a really powerful syscall, but it is bounded by ptrace access
controls, and uses pointer address obfuscation, so it may be okay to
expose this. As it is, at least Ubuntu already has
CONFIG_CHECKPOINT_RESTORE, so really, there's probably not much
difference on exposure.

So, if you drop the "default y", I'm fine with this.


It was maybe stupid, but our userspace started relying on fd
comparison through sys_kcmp. So for better or worse, if you want to
run the mesa3d gl/vk stacks, you need this.


That's overstating things somewhat. The vast majority of applications
will work fine regardless (as they did before Mesa started using this
functionality). Only some special ones will run into issues, because the
user-space drivers incorrectly assume two file descriptors reference
different descriptions.



Was maybe not the brightest idea, but since enough distros had this
enabled by default,


Right, that (and the above) is why I considered it fair game to use.
What should I have done instead? (TBH I was surprised that this
functionality isn't generally available)


Yeah that one is fine, but I thought we've discussed (irc or
something) more uses for de-duping dma-buf and stuff like that. But
quick grep says that hasn't landed yet, so I got a bit confused (or
just dreamt). Looking at this again I'm kinda surprised the drmfd
de-duping blows up on normal linux distros, but I guess it can all
happen.


One example: GEM handle name-spaces are per file description. If 
user-space incorrectly assumes two DRM fds are independent, when they 
actually reference the same file description, closing a GEM handle with 
one file descriptor will make it unusable with the other file descriptor 
as well.
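For reference, the check Mesa's os_same_file_description() performs
boils down to something like this (a minimal sketch, using the raw
syscall interface since glibc provides no kcmp() wrapper):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/kcmp.h>

    /* Returns 0 if fd1 and fd2 refer to the same struct file (open file
     * description), 1 or 2 if they differ, -1 on error (e.g. when the
     * kernel was built without kcmp support). */
    static long same_file_description(int fd1, int fd2)
    {
            pid_t pid = getpid();

            return syscall(SYS_kcmp, pid, pid, KCMP_FILE, fd1, fd2);
    }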




Ofc we can leave the default n, but the select if CONFIG_DRM is
unfortunately needed I think.


Per above, not sure this is really true.


We seem to be going boom on linux distros now, maybe userspace got
more creative in abusing stuff?


I don't know what you're referring to. I've only seen maybe two or three 
reports from people who didn't enable CHECKPOINT_RESTORE in their 
self-built kernels.




The entire thing is small enough that imo we don't really have to care,
e.g. we also unconditionally select dma-buf, despite that on most
systems there's only 1 gpu, and you're never going to end up with a
buffer sharing case that needs any of that code (aside from the
"here's an fd" part).

But I guess we can limit to just KCMP_FILE like you suggest in another
reply. Just feels a bit like overkill.


Making KCMP_FILE gated by DRM makes as little sense to me as by 
CHECKPOINT_RESTORE.



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] kernel: Expose SYS_kcmp by default

2021-02-08 Thread Michel Dänzer

On 2021-02-08 12:49 p.m., Michel Dänzer wrote:

On 2021-02-05 9:53 p.m., Daniel Vetter wrote:

On Fri, Feb 5, 2021 at 7:37 PM Kees Cook  wrote:


On Fri, Feb 05, 2021 at 04:37:52PM +, Chris Wilson wrote:

Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fds (e.g. 
device

or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Signed-off-by: Chris Wilson 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
Cc: Andrew Morton 
Cc: Dave Airlie 
Cc: Daniel Vetter 
Cc: Lucas Stach 
---
  init/Kconfig  | 11 +++
  kernel/Makefile   |  2 +-
  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
  3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index b77c60f8b963..f62fca13ac5b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1194,6 +1194,7 @@ endif # NAMESPACES
  config CHECKPOINT_RESTORE
   bool "Checkpoint/restore support"
   select PROC_CHILDREN
+ select KCMP
   default n
   help
 Enables additional kernel features in a sake of 
checkpoint/restore.

@@ -1737,6 +1738,16 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
  config ARCH_HAS_MEMBARRIER_SYNC_CORE
   bool

+config KCMP
+ bool "Enable kcmp() system call" if EXPERT
+ default y


I would expect this to be not default-y, especially if
CHECKPOINT_RESTORE does a "select" on it.

This is a really powerful syscall, but it is bounded by ptrace access
controls, and uses pointer address obfuscation, so it may be okay to
expose this. As it is, at least Ubuntu already has
CONFIG_CHECKPOINT_RESTORE, so really, there's probably not much
difference on exposure.

So, if you drop the "default y", I'm fine with this.


It was maybe stupid, but our userspace started relying on fd
comparison through sys_kcmp. So for better or worse, if you want to
run the mesa3d gl/vk stacks, you need this.


That's overstating things somewhat. The vast majority of applications 
will work fine regardless (as they did before Mesa started using this 
functionality). Only some special ones will run into issues, because the 
user-space drivers incorrectly assume two file descriptors reference 
different descriptions.




Was maybe not the brightest idea, but since enough distros had this
enabled by default,


Right, that (and the above) is why I considered it fair game to use. 
What should I have done instead? (TBH I was surprised that this 
functionality isn't generally available)


In that spirit, an alternative might be to make KCMP_FILE available 
unconditionally, and the rest of SYS_kcmp only with CHECKPOINT_RESTORE 
as before. (Or maybe other parts of SYS_kcmp are generally useful as well?)



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] kernel: Expose SYS_kcmp by default

2021-02-08 Thread Michel Dänzer

On 2021-02-05 9:53 p.m., Daniel Vetter wrote:

On Fri, Feb 5, 2021 at 7:37 PM Kees Cook  wrote:


On Fri, Feb 05, 2021 at 04:37:52PM +, Chris Wilson wrote:

Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fds (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.

Signed-off-by: Chris Wilson 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Will Drewry 
Cc: Andrew Morton 
Cc: Dave Airlie 
Cc: Daniel Vetter 
Cc: Lucas Stach 
---
  init/Kconfig  | 11 +++
  kernel/Makefile   |  2 +-
  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
  3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index b77c60f8b963..f62fca13ac5b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1194,6 +1194,7 @@ endif # NAMESPACES
  config CHECKPOINT_RESTORE
   bool "Checkpoint/restore support"
   select PROC_CHILDREN
+ select KCMP
   default n
   help
 Enables additional kernel features in a sake of checkpoint/restore.
@@ -1737,6 +1738,16 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
  config ARCH_HAS_MEMBARRIER_SYNC_CORE
   bool

+config KCMP
+ bool "Enable kcmp() system call" if EXPERT
+ default y


I would expect this to be not default-y, especially if
CHECKPOINT_RESTORE does a "select" on it.

This is a really powerful syscall, but it is bounded by ptrace access
controls, and uses pointer address obfuscation, so it may be okay to
expose this. As it is, at least Ubuntu already has
CONFIG_CHECKPOINT_RESTORE, so really, there's probably not much
difference on exposure.

So, if you drop the "default y", I'm fine with this.


It was maybe stupid, but our userspace started relying on fd
comparison through sys_kcmp. So for better or worse, if you want to
run the mesa3d gl/vk stacks, you need this.


That's overstating things somewhat. The vast majority of applications 
will work fine regardless (as they did before Mesa started using this 
functionality). Only some special ones will run into issues, because the 
user-space drivers incorrectly assume two file descriptors reference 
different descriptions.




Was maybe not the brightest idea, but since enough distros had this
enabled by default,


Right, that (and the above) is why I considered it fair game to use. 
What should I have done instead? (TBH I was surprised that this 
functionality isn't generally available)



it wasn't really discovered, and now we're
shipping this everywhere.


You're making it sound like this snuck in secretly somehow, which is not 
true of course.




Ofc we can leave the default n, but the select if CONFIG_DRM is
unfortunately needed I think.


Per above, not sure this is really true.


--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


[PATCH] drm/ttm: Use __GFP_NOWARN for huge pages in ttm_pool_alloc_page

2021-01-28 Thread Michel Dänzer
From: Michel Dänzer 

Without __GFP_NOWARN, attempts at allocating huge pages can trigger
dmesg splats like below (which are essentially noise, since TTM falls
back to normal pages if it can't get a huge one).

[ 9556.710241] clinfo: page allocation failure: order:9, 
mode:0x194dc2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_NORETRY|__GFP_ZERO|__GFP_NOMEMALLOC),
 nodemask=(null),cpuset=user.slice,mems_allowed=0
[ 9556.710259] CPU: 1 PID: 470821 Comm: clinfo Tainted: GE 
5.10.10+ #4
[ 9556.710264] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 
TOMAHAWK (MS-7A34), BIOS 1.OR 11/29/2019
[ 9556.710268] Call Trace:
[ 9556.710281]  dump_stack+0x6b/0x83
[ 9556.710288]  warn_alloc.cold+0x7b/0xdf
[ 9556.710297]  ? __alloc_pages_direct_compact+0x137/0x150
[ 9556.710303]  __alloc_pages_slowpath.constprop.0+0xc1b/0xc50
[ 9556.710312]  __alloc_pages_nodemask+0x2ec/0x320
[ 9556.710325]  ttm_pool_alloc+0x2e4/0x5e0 [ttm]
[ 9556.710332]  ? kvmalloc_node+0x46/0x80
[ 9556.710341]  ttm_tt_populate+0x37/0xe0 [ttm]
[ 9556.710350]  ttm_bo_handle_move_mem+0x142/0x180 [ttm]
[ 9556.710359]  ttm_bo_validate+0x11d/0x190 [ttm]
[ 9556.710391]  ? drm_vma_offset_add+0x2f/0x60 [drm]
[ 9556.710399]  ttm_bo_init_reserved+0x2a7/0x320 [ttm]
[ 9556.710529]  amdgpu_bo_do_create+0x1b8/0x500 [amdgpu]
[ 9556.710657]  ? amdgpu_bo_subtract_pin_size+0x60/0x60 [amdgpu]
[ 9556.710663]  ? get_page_from_freelist+0x11f9/0x1450
[ 9556.710789]  amdgpu_bo_create+0x40/0x270 [amdgpu]
[ 9556.710797]  ? _raw_spin_unlock+0x16/0x30
[ 9556.710927]  amdgpu_gem_create_ioctl+0x123/0x310 [amdgpu]
[ 9556.711062]  ? amdgpu_gem_force_release+0x150/0x150 [amdgpu]
[ 9556.711098]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 9556.711133]  drm_ioctl+0x20f/0x3a0 [drm]
[ 9556.711267]  ? amdgpu_gem_force_release+0x150/0x150 [amdgpu]
[ 9556.711276]  ? preempt_count_sub+0x9b/0xd0
[ 9556.711404]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 9556.711411]  __x64_sys_ioctl+0x83/0xb0
[ 9556.711417]  do_syscall_64+0x33/0x80
[ 9556.711421]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: bf9eee249ac2 ("drm/ttm: stop using GFP_TRANSHUGE_LIGHT")
Signed-off-by: Michel Dänzer 
---
 drivers/gpu/drm/ttm/ttm_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 8c762a03ad8a..a264760cb2cd 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -84,7 +84,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool 
*pool, gfp_t gfp_flags,
 * put_page() on a TTM allocated page is illegal.
 */
if (order)
-   gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY |
+   gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
__GFP_KSWAPD_RECLAIM;
 
if (!pool->use_dma_alloc) {
-- 
2.30.0



Re: amdgpu crashes on OOM

2020-10-26 Thread Michel Dänzer

On 2020-10-26 5:29 a.m., Alex Xu (Hello71) wrote:

Hi,

I frequently encounter OOM on my system, mostly due to my own fault.
Recently, I noticed that not only does a swap storm happen and OOM
killer gets invoked, but the graphics output freezes permanently.
Checking the kernel messages, I see:

kworker/u24:4: page allocation failure: order:5, 
mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
CPU: 6 PID: 279469 Comm: kworker/u24:4 Tainted: GW 
5.9.0-14732-g20b1adb60cf6 #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS 
P4.20 06/18/2020
Workqueue: events_unbound commit_work
Call Trace:
  ? dump_stack+0x57/0x6a
  ? warn_alloc.cold+0x69/0xcd
  ? __alloc_pages_direct_compact+0xfb/0x116
  ? __alloc_pages_slowpath.constprop.0+0x9c2/0xc14
  ? __alloc_pages_nodemask+0x143/0x167
  ? kmalloc_order+0x24/0x64
  ? dc_create_state+0x1a/0x4d
  ? amdgpu_dm_atomic_commit_tail+0x1b19/0x227d


Looks like dc_create_state should use kvzalloc instead of kzalloc 
(dc_state_free already uses kvfree).


order:5 means it's trying to allocate 32 physically contiguous pages,
which can be hard to fulfill even when memory pressure is relatively low.
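A minimal sketch of the suggested change (hypothetical, based only on
the call site visible in the trace):

    /* kvzalloc() falls back to vmalloc() when enough physically
     * contiguous pages aren't available, so an order-5 failure like the
     * above becomes a plain virtually contiguous allocation instead;
     * dc_state_free() already uses kvfree(), so the free path matches. */
    struct dc_state *state = kvzalloc(sizeof(*state), GFP_KERNEL);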



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 3/3] RFC: dma-buf: Add an API for importing and exporting sync files (v5)

2020-09-30 Thread Michel Dänzer

On 2020-03-17 10:21 p.m., Jason Ekstrand wrote:

Explicit synchronization is the future.  At least, that seems to be what
most userspace APIs are agreeing on at this point.  However, most of our
Linux APIs (both userspace and kernel UAPI) are currently built around
implicit synchronization with dma-buf.  While work is ongoing to change
many of the userspace APIs and protocols to an explicit synchronization
model, switching over piecemeal is difficult due to the number of
potential components involved.  On the kernel side, many drivers use
dma-buf including GPU (3D/compute), display, v4l, and others.  In
userspace, we have X11, several Wayland compositors, 3D drivers, compute
drivers (OpenCL etc.), media encode/decode, and the list goes on.

This patch provides a path forward by allowing userspace to manually
manage the fences attached to a dma-buf.  Alternatively, one can think
of this as making dma-buf's implicit synchronization simply a carrier
for an explicit fence.  This is accomplished by adding two IOCTLs to
dma-buf for importing and exporting a sync file to/from the dma-buf.
This way a userspace component which uses explicit synchronization,
such as a Vulkan driver, can manually set the write fence on a buffer
before handing it off to an implicitly synchronized component such as a
Wayland compositor or video encoder.  In this way, each of the different
components can be upgraded to an explicit synchronization model one at a
time as long as the userspace pieces connecting them are aware of it and
import/export fences at the right times.

There is a potential race condition with this API if userspace is not
careful.  A typical use case for implicit synchronization is to wait for
the dma-buf to be ready, use it, and then signal it for some other
component.  Because a sync_file cannot be created until it is guaranteed
to complete in finite time, userspace can only signal the dma-buf after
it has already submitted the work which uses it to the kernel and has
received a sync_file back.  There is no way to atomically submit a
wait-use-signal operation.  This is not, however, really a problem with
this API so much as it is a problem with explicit synchronization
itself.  The way this is typically handled is to have very explicit
ownership transfer points in the API or protocol which ensure that only
one component is using it at any given time.  Both X11 (via the PRESENT
extension) and Wayland provide such ownership transfer points via
explicit present and idle messages.

The decision was intentionally made in this patch to make the import and
export operations IOCTLs on the dma-buf itself rather than as a DRM
IOCTL.  This makes it the import/export operation universal across all
components which use dma-buf including GPU, display, v4l, and others.
It also means that a userspace component can do the import/export
without access to the DRM fd which may be tricky to get in cases where
the client communicates with DRM via a userspace API such as OpenGL or
Vulkan.  At a future date we may choose to add direct import/export APIs
to components such as drm_syncobj to avoid allocating a file descriptor
and going through two ioctls.  However, that seems to be something of a
micro-optimization as import/export operations are likely to happen at a
rate of a few per frame of rendered or decoded video.

v2 (Jason Ekstrand):
  - Use a wrapper dma_fence_array of all fences including the new one
when importing an exclusive fence.

v3 (Jason Ekstrand):
  - Lock around setting shared fences as well as exclusive
  - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
  - Initialize ret to 0 in dma_buf_wait_sync_file

v4 (Jason Ekstrand):
  - Use the new dma_resv_get_singleton helper

v5 (Jason Ekstrand):
  - Rename the IOCTLs to import/export rather than wait/signal
  - Drop the WRITE flag and always get/set the exclusive fence

Signed-off-by: Jason Ekstrand 


What's the status of this? DMA_BUF_IOCTL_EXPORT_SYNC_FILE would be 
useful for Wayland compositors to wait for client buffers to become 
ready without being prone to getting delayed by later HW access to them, 
so it would be nice to merge that at least (if 
DMA_BUF_IOCTL_IMPORT_SYNC_FILE is still controversial).
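For illustration, compositor-side usage could look roughly like this
(a sketch based on the UAPI shape proposed here; the names assume the
definitions from <linux/dma-buf.h> as eventually merged upstream):

    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Extract a sync_file fd that signals once all pending writes to the
     * client buffer complete; the compositor can poll() that fd instead
     * of risking delays from later HW access to the buffer. */
    int export_readiness_fence(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file args = {
                    .flags = DMA_BUF_SYNC_READ,
                    .fd = -1,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                    return -1;

            return args.fd;
    }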



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH AUTOSEL 5.4 13/15] drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is

2020-09-21 Thread Michel Dänzer

On 2020-09-21 4:40 p.m., Sasha Levin wrote:

From: Michel Dänzer 

[ Upstream commit 2f228aab21bbc74e90e267a721215ec8be51daf7 ]

Don't check drm_crtc_state::active for this either, per its
documentation in include/drm/drm_crtc.h:

  * Hence drivers must not consult @active in their various
  * _mode_config_funcs.atomic_check callback to reject an atomic
  * commit.

atomic_remove_fb disables the CRTC as needed for disabling the primary
plane.

This prevents at least the following problems if the primary plane gets
disabled (e.g. due to destroying the FB assigned to the primary plane,
as happens e.g. with mutter in Wayland mode):

* The legacy cursor ioctl returned EINVAL for a non-0 cursor FB ID
   (which enables the cursor plane).
* If the cursor plane was enabled, changing the legacy DPMS property
   value from off to on returned EINVAL.

v2:
* Minor changes to code comment and commit log, per review feedback.

GitLab: https://gitlab.gnome.org/GNOME/mutter/-/issues/1108
GitLab: https://gitlab.gnome.org/GNOME/mutter/-/issues/1165
GitLab: https://gitlab.gnome.org/GNOME/mutter/-/issues/1344
Suggested-by: Daniel Vetter 
Acked-by: Daniel Vetter 
Reviewed-by: Nicholas Kazlauskas 
Signed-off-by: Michel Dänzer 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 


I'm a bit nervous about this getting backported so far back so quickly. 
I'd prefer waiting for 5.9 final first at least.



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v1 0/2] video: fbdev: radeonfb: PCI PM framework upgrade and fix-ups.

2020-09-07 Thread Michel Dänzer

On 2020-09-07 9:55 a.m., Daniel Vetter wrote:

On Thu, Aug 06, 2020 at 12:52:54PM +0530, Vaibhav Gupta wrote:

Linux Kernel Mentee: Remove Legacy Power Management.

The original goal of the patch series is to upgrade the power management
framework of radeonfb fbdev driver. This has been done by upgrading .suspend()
and .resume() callbacks.

The upgrade makes sure that the involvement of PCI Core does not change the
order of operations executed in a driver. Thus, does not change its behavior.

During this process, it was found that "#if defined(CONFIG_PM)" at line 1434 is
redundant. This was introduced in the commit
42ddb453a0cd ("radeon: Conditionally compile PM code").


I do wonder whether it wouldn't be better to just outright delete these,
we have the drm radeon driver for pretty much all the same hardware ...


In contrast to radeonfb, the radeon driver doesn't support 
suspend-to-RAM on Apple PowerPC notebooks.



--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 00/49] DRM driver for Hikey 970

2020-08-20 Thread Michel Dänzer
On 2020-08-19 10:48 p.m., Sam Ravnborg wrote:
> Hi Mauro.
> 
> It seems my review comments failed to reach dri-devel - likely due to
> the size of the mail.

Right, some e-mails in this thread went through the dri-devel moderation
queue due to their sizes. This mail of yours did as well, because you
didn't trim the quoted text (hint, hint).


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] gpu/drm: Remove TTM_PL_FLAG_WC of VRAM to fix writecombine issue for Loongson64

2020-08-10 Thread Michel Dänzer
On 2020-08-09 2:13 p.m., Christian König wrote:
> On 2020-08-08 3:50 p.m., Jiaxun Yang wrote:
>> On 2020-08-08 9:41 p.m., Thomas Bogendoerfer wrote:
>>> On Sat, Aug 08, 2020 at 03:25:02PM +0800, Tiezhu Yang wrote:
>>>> Loongson processors have a writecombine issue that may fail to
>>>> write back the framebuffer used with ATI Radeon or AMD GPUs at times.
>>>> After commit 8a08e50cee66 ("drm: Permit video-buffers writecombine
>>>> mapping for MIPS"), there have been errors such as blurred screens,
>>>> lockups, and so on.
>>>>
>>>> Remove the flag TTM_PL_FLAG_WC from VRAM to fix the writecombine
>>>> issue so that Loongson64 works well with ATI Radeon or AMD GPUs; it
>>>> has no influence on the other platforms.
>>> well it's not my call to take or reject this patch, but I already
>>> indicated it might be better to disable writecombine on the CPU
>>> detection side (or do you have other devices where writecombining
>>> works?). Something like below will disable it for all loongson64 CPUs.
>>> If you now find out where it works and where it doesn't, you can even
>>> reduce it to the required minimum of affected CPUs.
>> Hi Tiezhu, Thomas,
>>
>> Yes, writecombine works well on LS7A's internal GPU
>> And even works well with some AMD GPUs (in my case, RX550).
> 
> In this case the patch is a clear NAK since you haven't root caused the
> issue and are just working around it in a very questionable manner.

To be fair though, amdgpu & radeon are already disabling write-combining
for system memory pages in 32-bit x86 kernels for similar reasons.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH AUTOSEL 5.6 33/50] drm/amdgpu: bump version for invalidate L2 before SDMA IBs

2020-05-07 Thread Michel Dänzer
On 2020-05-07 4:27 p.m., Sasha Levin wrote:
> From: Marek Olšák 
> 
> [ Upstream commit 9017a4897a20658f010bebea825262963c10afa6 ]
> 
> This fixes GPU hangs due to cache coherency issues.
> Bump the driver version. Split out from the original patch.
> 
> Signed-off-by: Marek Olšák 
> Reviewed-by: Christian König 
> Tested-by: Pierre-Eric Pelloux-Prayer 
> Signed-off-by: Alex Deucher 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 42f4febe24c6d..8d45a2b662aeb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -85,9 +85,10 @@
>   * - 3.34.0 - Non-DC can flip correctly between buffers with different 
> pitches
>   * - 3.35.0 - Add drm_amdgpu_info_device::tcc_disabled_mask
>   * - 3.36.0 - Allow reading more status registers on si/cik
> + * - 3.37.0 - L2 is invalidated before SDMA IBs, needed for correctness
>   */
>  #define KMS_DRIVER_MAJOR 3
> -#define KMS_DRIVER_MINOR 36
> +#define KMS_DRIVER_MINOR 37
>  #define KMS_DRIVER_PATCHLEVEL0
>  
>  int amdgpu_vram_limit = 0;
> 

This requires the parent commit fdf83646c0542ecfb9adc4db8f741a1f43dca058
"drm/amdgpu: invalidate L2 before SDMA IBs (v2)". KMS_DRIVER_MINOR is
bumped to signal to userspace the fix in that commit is present.
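User space can then gate on the bump with something like this libdrm
sketch (hypothetical variable names):

    drmVersionPtr ver = drmGetVersion(drm_fd);
    bool has_sdma_l2_invalidation = false;

    if (ver) {
            has_sdma_l2_invalidation = ver->version_minor >= 37;
            drmFreeVersion(ver);
    }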


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH AUTOSEL 4.19 044/167] drm/amdgpu: validate user pitch alignment

2019-09-04 Thread Michel Dänzer
On 2019-09-04 2:08 p.m., Sasha Levin wrote:
> 
> FWIW, I've added another test to my scripts to try and catch these cases
> (Revert "%s"). It'll slow down the scripts a bit but it's better to get
> it right rather than to be done quickly :)

Indeed, thanks! And again sorry for the brouhaha, I just honestly didn't
realize before how tricky this case was for the scripts.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH AUTOSEL 4.19 044/167] drm/amdgpu: validate user pitch alignment

2019-09-04 Thread Michel Dänzer
On 2019-09-03 10:16 p.m., Daniel Vetter wrote:
> On Tue, Sep 3, 2019 at 10:01 PM Sasha Levin  wrote:
>> On Tue, Sep 03, 2019 at 07:03:47PM +0200, Greg KH wrote:
>>> On Tue, Sep 03, 2019 at 06:40:43PM +0200, Michel Dänzer wrote:
>>>> On 2019-09-03 6:23 p.m., Sasha Levin wrote:
>>>>> From: Yu Zhao 
>>>>>
>>>>> [ Upstream commit 89f23b6efef554766177bf51aa754bce14c3e7da ]
>>>>
>>>> Hold your horses!
>>>>
>>>> This commit and c4a32b266da7bb702e60381ca0c35eaddbc89a6c had to be
>>>> reverted, as they caused regressions. See commits
>>>> 25ec429e86bb790e40387a550f0501d0ac55a47c &
>>>> 92b0730eaf2d549fdfb10ecc8b71f34b9f472c12 .
>>>>
>>>>
>>>> This isn't bolstering confidence in how these patches are selected...
>>>
>>> The patch _itself_ said to be backported to the stable trees from 4.2
>>> and newer.  Why wouldn't we be confident in doing this?
>>>
>>> If the patch doesn't want to be backported, then do not add the cc:
>>> stable line to it...
>>
>> This patch was picked because it has a stable tag, which you presumably
>> saw as your Reviewed-by tag is in the patch. This is why it was
>> backported; it doesn't take AI to backport patches tagged for stable...

The patches did point to gaps in validation of ioctl parameters passed
in by userspace. Unfortunately, they turned out to be too strict,
causing valid parameters to spuriously fail. If that wasn't the case,
and the patches didn't have stable tags, maybe we'd be having a
discussion about why they didn't have the tags now...


>> The revert of this patch, however:
>>
>>  1. Didn't have a stable tag.

I guess it didn't occur to me that was necessary, as the patches got
reverted within days.


>>  2. Didn't have a "Fixes:" tag.
>>  3. Didn't have the usual "the reverts commit ..." string added by git
>>  when one does a revert.

I suspect that's because there were no stable commit hashes to
reference, see below.


>> Which is why we still kick patches for review, even though they had a
>> stable tag, just so people could take a look and confirm we're not
>> missing anything - like we did here.
>>
>> I'm not sure what you expected me to do differently here.

Yeah, sorry, I didn't realize it's tricky for scripts to recognize that
the patches have been reverted in this case.


> Yeah this looks like fail on the revert side, they need to reference
> the reverted commit somehow ...
> 
> Alex, why got this dropped? Is this more fallout from the back
> shuffling you're doing between your internal branches behind the
> firewall, and the public history?

I do suspect that was at least a contributing factor.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH AUTOSEL 4.19 044/167] drm/amdgpu: validate user pitch alignment

2019-09-03 Thread Michel Dänzer
On 2019-09-03 6:23 p.m., Sasha Levin wrote:
> From: Yu Zhao 
> 
> [ Upstream commit 89f23b6efef554766177bf51aa754bce14c3e7da ]

Hold your horses!

This commit and c4a32b266da7bb702e60381ca0c35eaddbc89a6c had to be
reverted, as they caused regressions. See commits
25ec429e86bb790e40387a550f0501d0ac55a47c &
92b0730eaf2d549fdfb10ecc8b71f34b9f472c12 .


This isn't bolstering confidence in how these patches are selected...
I'm also a little nervous about others which change values by an order
of magnitude. There were cases before where such patches were backported
to branches which didn't have other corresponding changes, so they ended
up breaking stuff instead of fixing anything.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: Support for 2D engines/blitters in V4L2 and DRM

2019-04-24 Thread Michel Dänzer
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
> On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
>> Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
>>>> It would be cool if both could be used concurrently and not just return
>>>> -EBUSY when the device is used with the other subsystem.
>>>
>>> We live in this world already :-) I think there's even patches (or merged
>>> already) to add fences to v4l, for Android.
>>
>> This work is currently suspended. It will require some feature on DRM
>> display to really make this useful, but there is also a lot of
>> challenges in V4L2. In GFX space, most of the use cases are about
>> rendering as soon as possible. In multimedia, though, we have two
>> problems: we need to synchronize the frame rendering with the audio,
>> and output buffers may come out of order due to how video CODECs are
>> made.
> 
> Definitely, it feels like the DRM display side is currently a good fit
> for render use cases, but not so much for precise display cases where
> we want to try and display a buffer at a given vblank target instead of
> "as soon as possible".
> 
> I have a userspace project where I've implemented a page flip queue,
> which only schedules the next flip when relevant and keeps ready
> buffers in the queue until then. This requires explicit vblank
> synchronisation (which DRM offers, but pretty much all other,
> higher-level display APIs don't, so I'm just using a refresh-rate
> timer for them) and flip done notification.
> 
> I haven't looked too much at how to flip with a target vblank with DRM
> directly but maybe the atomic API already has the bits in for that (but
> I haven't heard of such a thing as a buffer queue, so that makes me
> doubt it).

Not directly. What's available is that if userspace waits for vblank n
and then submits a flip, the flip will complete in vblank n+1 (or a
later vblank, depending on when the flip is submitted and when the
fences the flip depends on signal).

There is reluctance allowing more than one flip to be queued in the
kernel, as it would considerably increase complexity in the kernel. It
would probably only be considered if there was a compelling use-case
which was outright impossible otherwise.
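For illustration, the mechanism described above looks roughly like this
with libdrm (a sketch; for simplicity it targets the first CRTC, other
CRTCs need the DRM_VBLANK_HIGH_CRTC encoding):

    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    void flip_on_next_vblank(int drm_fd, uint32_t crtc_id, uint32_t fb_id)
    {
            drmVBlank vbl = {
                    .request = {
                            .type = DRM_VBLANK_RELATIVE,
                            .sequence = 1, /* wait for the next vblank (n) */
                    },
            };

            drmWaitVBlank(drm_fd, &vbl); /* returns after vblank n */

            /* This flip completes at vblank n+1, or later if the fences
             * it depends on signal late. */
            drmModePageFlip(drm_fd, crtc_id, fb_id,
                            DRM_MODE_PAGE_FLIP_EVENT, NULL);
    }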


> Well, I need to handle stuff like SDL in my userspace project, so I have
> to have all that queuing stuff in software anyway, but it would be good
> if each project didn't have to implement that. Worst case, it could be
> in libdrm too.

Usually, this kind of queuing will be handled in a display server such
as Xorg or a Wayland compositor, not by the application such as a video
player itself, or any library in the latter's address space. I'm not
sure there's much potential for sharing code between display servers for
this.


>> In the first, we'd need a mechanism where we can schedule a render at a
>> specific time or vblank. We can of course already implement this in
>> software, but with fences, the scheduling would need to be done in the
>> driver. Then if the fence is signalled earlier, the driver should hold
>> on until the delay is met. If the fence got signalled late, we also
>> need to think of a workflow. As we can't schedule more than one render
>> in DRM at one time, I don't really see yet how to make that work.
> 
> Indeed, that's also one of the main issues I've spotted. Before using
> an implicit fence, we basically have to make sure the frame is due for
> display at the next vblank. Otherwise, we need to refrain from using
> the fence and schedule the flip later, which is kind of counter-
> productive.

Fences are about signalling that the contents of a frame are "done" and
ready to be presented. They're not about specifying which frame is to be
presented when.


> I feel like specifying a target vblank would be a good unit for that,

The mechanism described above works for that.

> since it's our native granularity after all (while a timestamp is not).

Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync)
changes things in this regard. It makes the vblank length variable, and
if you wait for multiple vblanks between flips, you get the maximum
vblank length corresponding to the minimum refresh rate / timing
granularity. Thus, it would be useful to allow userspace to specify a
timestamp corresponding to the earliest time when the flip is to
complete. The kernel could then try to hit that as closely as possible.


-- 
Earthling Michel Dänzer   |  https://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v2] gpu: radeon: fix a potential NULL-pointer dereference

2019-03-26 Thread Michel Dänzer
On 2019-03-25 10:05 p.m., Kangjie Lu wrote:
> In case alloc_workqueue fails, the fix frees memory and
> returns -ENOMEM to avoid potential NULL pointer dereference.
> 
> Signed-off-by: Kangjie Lu 
> ---
> v2: use radeon_crtc_destroy to properly clean up resources as
> suggested by Michel Dänzer 
> 
>  [...]
> 
> @@ -671,13 +671,18 @@ static void radeon_crtc_init(struct drm_device *dev, 
> int index)
>  
>   radeon_crtc = kzalloc(sizeof(struct radeon_crtc) + (RADEONFB_CONN_LIMIT 
> * sizeof(struct drm_connector *)), GFP_KERNEL);
>   if (radeon_crtc == NULL)
> - return;
> + return -ENOMEM;
>  
>   drm_crtc_init(dev, &radeon_crtc->base, &radeon_crtc_funcs);
>  
>   drm_mode_crtc_set_gamma_size(&radeon_crtc->base, 256);
>   radeon_crtc->crtc_id = index;
>   radeon_crtc->flip_queue = alloc_workqueue("radeon-crtc", WQ_HIGHPRI, 0);
> + if (!radeon_crtc->flip_queue) {
> + DRM_ERROR("failed to allocate the flip queue\n");
> + radeon_crtc_destroy(&radeon_crtc->base);
> + return -ENOMEM;
> + }

radeon_crtc_destroy currently unconditionally calls:

destroy_workqueue(radeon_crtc->flip_queue);

AFAICT destroy_workqueue will blow up if NULL is passed to it, so
radeon_crtc_destroy needs to check for that.
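Something like this (a sketch; the rest of radeon_crtc_destroy
unchanged):

    if (radeon_crtc->flip_queue)
            destroy_workqueue(radeon_crtc->flip_queue);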


-- 
Earthling Michel Dänzer   |  https://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-24 Thread Michel Dänzer
On 2019-01-24 12:45 p.m., Ard Biesheuvel wrote:
> On Thu, 24 Jan 2019 at 12:37, Koenig, Christian
>  wrote:
>> On 2019-01-24 12:26 p.m., Ard Biesheuvel wrote:
>>> On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
>>>  wrote:
>>>> On 2019-01-24 10:59 a.m., Ard Biesheuvel wrote:
>>>>> [SNIP]
>>>>> This is *exactly* my point the whole time.
>>>>>
>>>>> The current code has
>>>>>
>>>>> static inline bool drm_arch_can_wc_memory(void)
>>>>> {
>>>>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>>>>>  return false;
>>>>>
>>>>> which means the optimization is disabled *unless the system is
>>>>> non-cache coherent*
>>>>>
>>>>> So if you have reports that the optimization works on some PowerPC, it
>>>>> must be non-cache coherent PowerPC, because that is the only place
>>>>> where it is enabled in the first place.
>>>>>
>>>>>> The only problematic here actually seems to be ARM, so you should
>>>>>> probably just add an "#ifdef .._ARM return false;".
>>>>>>
>>>>> ARM/arm64 does not have a Kconfig symbol like
>>>>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
>>>>> there are non-coherent ARM systems that are currently working in the
>>>>> same way as those non-coherent PowerPC systems, we will break them by
>>>>> doing this.
>>>> Summing up the things I've read so far for ARM, I actually think it
>>>> depends on a runtime configuration and not a compile-time one.
>>>>
>>>> So the whole idea of providing the device to the drm_*_can_wc_memory()
>>>> function isn't so far fetched.
>>>>
>>> Thank you.
>>>
>>>> But for now I do prefer a working but slightly slower system over a
>>>> broken one, so I think we should just disable this on ARM for now.
>>>>
>>> Again, this is not about non-cache coherent being slower without the
>>> optimization, it is about non-cache coherent likely not working *at
>>> all* unless the optimization is enabled.
>>
>> As Michel tried to explain this CAN'T happen. The optimization is a
>> purely optional request from userspace.
>>
> 
> Right.
> 
> So in that case, we can assume that the following test
> 
> static inline bool drm_arch_can_wc_memory(void)
> {
> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> return false;
> 
> is bogus, and it was just unnecessary caution on the part of the
> author to disregard non-cache coherent devices.

This is driver-independent core code, meaning "non-snooped PCIe
transfers don't work on cache coherent PPC". It doesn't imply anything
about whether or not amdgpu or any other driver works on
non-cache-coherent PPC in general.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-24 Thread Michel Dänzer
On 2019-01-23 5:52 p.m., Ard Biesheuvel wrote:
> On Wed, 23 Jan 2019 at 17:44, Christoph Hellwig  wrote:
>>
>> I think we just want a driver-local check for those combinations
>> where we know this hack actually works, which really just seems
>> to be x86-64 with PAT. Something like the patch below, but maybe with
>> even more strong warnings to not do something like this elsewhere:
> 
> I agree that your patch seems like the right way to ensure that the WC
> optimization hack is only used where we know it works.
> 
> But my concern is that it seems likely that non-cache coherent
> implementations are relying on this hack as well.

I've been trying to tell you they can't rely on that, because the amdgpu
driver doesn't use this functionality for fundamentals such as ring
buffers used for feeding the hardware with commands. Instead, for those
it relies on snooped PCIe transfers being coherent with the CPU caches.


>> -#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
>> -   /* Don't try to enable write-combining when it can't work, or things
>> -* may be slow
>> -* See https://bugs.freedesktop.org/show_bug.cgi?id=88758
>> -*/
>> -
>> -#ifndef CONFIG_COMPILE_TEST
>> -#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better 
>> performance \
>> -thanks to write-combining
>> -#endif
>> -
>> -   if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
>> -   DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT 
>> for "
>> - "better performance thanks to 
>> write-combining\n");

FWIW, please don't drop these compile and build time warnings where we
continue to take advantage of PAT.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-21 Thread Michel Dänzer
On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 19:24, Michel Dänzer  wrote:
>> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer  wrote:
>>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
>>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer  wrote:
>>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
>>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig  
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Until that happens we should just change the driver ifdefs to default
>>>>>>>> the hacks to off and only enable them on setups where we 100%
>>>>>>>> positively know that they actually work.  And document that fact
>>>>>>>> in big fat comments.
>>>>>>>
>>>>>>> Well, as I mentioned in my commit log as well, if we default to off
>>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
>>>>>>> the device is in fact non-cache coherent, and relies on this
>>>>>>> 'optimization' to get things working.
>>>>>>
>>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
>>>>>> correct basic operation (the scenario Christian brought up is a very
>>>>>> specialized use-case), so that shouldn't be an issue.
>>>>>
>>>>> The point is that this is only true for x86.
>>>>>
>>>>> On other architectures, the use of non-cached mappings on the CPU side
>>>>> means that you /do/ rely on non-snooped transfers, since if those
>>>>> transfers turn out not to snoop inadvertently, the accesses are
>>>>> incoherent with the CPU's view of memory.
>>>>
>>>> The driver generally only uses non-cached mappings if
>>>> drm_arch/device_can_wc_memory returns true.
>>>
>>> Indeed. And so we should take care to only return 'true' from that
>>> function if it is guaranteed that non-cached CPU mappings are coherent
>>> with the mappings used by the GPU, either because that is always the
>>> case (like on x86), or because we know that the platform in question
>>> implements NoSnoop correctly throughout the interconnect.
>>>
>>> What seems to be complicating matters is that in some cases, the
>>> device is non-cache coherent to begin with, so regardless of whether
>>> the NoSnoop attribute is used or not, those accesses will not snoop in
>>> the caches and be coherent with the non-cached mappings used by the
>>> CPU. So if we restrict this optimization [on non-X86] to platforms
>>> that are known to implement NoSnoop correctly, we may break platforms
>>> that are implicitly NoSnoop all the time.
>>
>> Since the driver generally doesn't rely on non-snooped accesses for
>> correctness, that couldn't "break" anything that hasn't always been broken.
> 
> Again, that is only true on x86.
> 
> On other architectures, DMA writes from the device may allocate in the
> caches, and be invisible to the CPU when it uses non-cached mappings.

Let me try one last time:

If drm_arch_can_wc_memory returns false, the driver falls back to the
normal mode of operation, using a cacheable CPU mapping and snooped GPU
transfers, even if userspace asks (as a performance optimization) for a
write-combined CPU mapping and non-snooped GPU transfers via
AMDGPU_GEM_CREATE_CPU_GTT_USWC. This normal mode of operation is also
used for the ring buffers at the heart of the driver's operation. If
there is a platform where this normal mode of operation doesn't work,
the driver could never have worked reliably on that platform, since
before AMDGPU_GEM_CREATE_CPU_GTT_USWC or drm_arch_can_wc_memory even
existed.
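For reference, the optimization is opted into from user space roughly
like this (a sketch against the amdgpu UAPI, error handling omitted):

    #include <sys/ioctl.h>
    #include <amdgpu_drm.h>

    union drm_amdgpu_gem_create args = {
            .in = {
                    .bo_size = size,
                    .alignment = 4096,
                    .domains = AMDGPU_GEM_DOMAIN_GTT,
                    /* Purely a performance hint: if
                     * drm_arch_can_wc_memory() returns false, the driver
                     * falls back to the cached + snooped mode of
                     * operation described above. */
                    .domain_flags = AMDGPU_GEM_CREATE_CPU_GTT_USWC,
            },
    };

    ioctl(drm_fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args);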


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-21 Thread Michel Dänzer
On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer  wrote:
>>
>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer  wrote:
>>>>
>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig  
>>>>> wrote:
>>>>>
>>>>>> Until that happens we should just change the driver ifdefs to default
>>>>>> the hacks to off and only enable them on setups where we 100%
>>>>>> positively know that they actually work.  And document that fact
>>>>>> in big fat comments.
>>>>>
>>>>> Well, as I mentioned in my commit log as well, if we default to off
>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
>>>>> the device is in fact non-cache coherent, and relies on this
>>>>> 'optimization' to get things working.
>>>>
>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
>>>> correct basic operation (the scenario Christian brought up is a very
>>>> specialized use-case), so that shouldn't be an issue.
>>>>
>>>
>>> The point is that this is only true for x86.
>>>
>>> On other architectures, the use of non-cached mappings on the CPU side
>>> means that you /do/ rely on non-snooped transfers, since if those
>>> transfers turn out not to snoop inadvertently, the accesses are
>>> incoherent with the CPU's view of memory.
>>
>> The driver generally only uses non-cached mappings if
>> drm_arch/device_can_wc_memory returns true.
>>
> 
> Indeed. And so we should take care to only return 'true' from that
> function if it is guaranteed that non-cached CPU mappings are coherent
> with the mappings used by the GPU, either because that is always the
> case (like on x86), or because we know that the platform in question
> implements NoSnoop correctly throughout the interconnect.
> 
> What seems to be complicating matters is that in some cases, the
> device is non-cache coherent to begin with, so regardless of whether
> the NoSnoop attribute is used or not, those accesses will not snoop in
> the caches and be coherent with the non-cached mappings used by the
> CPU. So if we restrict this optimization [on non-X86] to platforms
> that are known to implement NoSnoop correctly, we may break platforms
> that are implicitly NoSnoop all the time.

Since the driver generally doesn't rely on non-snooped accesses for
correctness, that couldn't "break" anything that hasn't always been broken.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-21 Thread Michel Dänzer
On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer  wrote:
>>
>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig  wrote:
>>>
>>>> Until that happens we should just change the driver ifdefs to default
>>>> the hacks to off and only enable them on setups where we 100%
>>>> positively know that they actually work.  And document that fact
>>>> in big fat comments.
>>>
>>> Well, as I mentioned in my commit log as well, if we default to off
>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
>>> the device is in fact non-cache coherent, and relies on this
>>> 'optimization' to get things working.
>>
>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
>> correct basic operation (the scenario Christian brought up is a very
>> specialized use-case), so that shouldn't be an issue.
>>
> 
> The point is that this is only true for x86.
> 
> On other architectures, the use of non-cached mappings on the CPU side
> means that you /do/ rely on non-snooped transfers, since if those
> transfers turn out not to snoop inadvertently, the accesses are
> incoherent with the CPU's view of memory.

The driver generally only uses non-cached mappings if
drm_arch/device_can_wc_memory returns true.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm: disable WC optimization for cache coherent devices on non-x86

2019-01-21 Thread Michel Dänzer
On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig  wrote:
> 
>> Until that happens we should just change the driver ifdefs to default
>> the hacks to off and only enable them on setups where we 100%
>> positively know that they actually work.  And document that fact
>> in big fat comments.
> 
> Well, as I mentioned in my commit log as well, if we default to off
> unless CONFIG_X86, we may break working setups on MIPS and Power where
> the device is in fact non-cache coherent, and relies on this
> 'optimization' to get things working.

FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
correct basic operation (the scenario Christian brought up is a very
specialized use-case), so that shouldn't be an issue.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-10 Thread Michel Dänzer
On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote:
> ARM systems do not permit the use of anything other than cached
> mappings for system memory, since that memory may be mapped in the
> linear region as well, and the architecture does not permit aliases
> with mismatched attributes.
> 
> So short-circuit the evaluation in ttm_io_prot() if the flags include
> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> attributes immediately.
> 
> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> Without this change, amdgpu does not start at all, and radeon only
> produces corrupt display output.
> 
> Cc: Christian Koenig 
> Cc: Huang Rui 
> Cc: Junwei Zhang 
> Cc: David Airlie 
> Reported-by: Carsten Haitzler 
> Signed-off-by: Ard Biesheuvel 
> ---
>  drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
> b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 046a6dda690a..0c1eef5f7ae3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t 
> tmp)
>   if (caching_flags & TTM_PL_FLAG_CACHED)
>   return tmp;
>  
> +#if defined(__arm__) || defined(__aarch64__)
> + /* ARM only permits cached mappings of system memory */
> + if (caching_flags & TTM_PL_SYSTEM)
> + return tmp;
> +#endif
>  #if defined(__i386__) || defined(__x86_64__)
>   if (caching_flags & TTM_PL_FLAG_WC)
>   tmp = pgprot_writecombine(tmp);
> 

Apart from Christian's concerns, I think this is the wrong place for
this, because other TTM / driver code will still consider the memory
uncacheable. E.g. the amdgpu driver will program the GPU to treat the
memory as uncacheable, so it won't participate in cache coherency
protocol for it, which is unlikely to work as expected in general if the
CPU treats the memory as cacheable.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH AUTOSEL 4.20 034/117] drm/amdgpu: Correct get_crtc_scanoutpos behavior when vpos >= vtotal

2019-01-09 Thread Michel Dänzer
On 2019-01-08 8:25 p.m., Sasha Levin wrote:
> From: Nicholas Kazlauskas 
> 
> [ Upstream commit 520f08df45fbe300ed650da786a74093d658b7e1 ]
> 
> When variable refresh rate is active [...]

Variable refresh rate (FreeSync) support is only landing in 5.0,
therefore this fix isn't needed in older kernels.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v6 1/2] drm/amd: validate user pitch alignment

2019-01-08 Thread Michel Dänzer
On 2019-01-07 11:51 p.m., Yu Zhao wrote:
> Userspace may request a pitch alignment that is not supported by the
> GPU. Some userspace requests 32, but the GPU ignores it and uses the
> default 64 when cpp is 4. If the GEM object is allocated based on the
> smaller alignment, GPU DMA will go out of bounds.
> 
> Cc: sta...@vger.kernel.org # v4.2+
> Signed-off-by: Yu Zhao 

Both patches applied to amd-staging-drm-next (should land in 5.0), thanks!


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v5 1/2] drm/amd: validate user pitch alignment

2019-01-07 Thread Michel Dänzer
On 2019-01-07 5:00 a.m., Yu Zhao wrote:
> On Thu, Jan 03, 2019 at 05:33:19PM +0100, Michel Dänzer wrote:
>> On 2018-12-30 2:00 a.m., Yu Zhao wrote:
>>> Userspace may request a pitch alignment that is not supported by the
>>> GPU. Some userspace requests 32, but the GPU ignores it and uses the
>>> default 64 when cpp is 4. If the GEM object is allocated based on the
>>> smaller alignment, GPU DMA will go out of bounds.
>>>
>>> Cc: sta...@vger.kernel.org # v4.2+
>>> Signed-off-by: Yu Zhao 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 16 ++++++++++++++++
>>>  1 file changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> index 15ce7e681d67..16af80ccd0a0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> @@ -527,6 +527,22 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>>> struct drm_gem_object *obj;
>>> struct amdgpu_framebuffer *amdgpu_fb;
>>> int ret;
>>> +   struct amdgpu_device *adev = dev->dev_private;
>>> +   int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
>>> +   int pitch = mode_cmd->pitches[0] / cpp;
>>> +
>>> +   if (pitch < mode_cmd->width) {
>>> +   DRM_DEBUG_KMS("Expecting pitch(%d)/cpp(%d) >= width(%d)\n",
>>> + mode_cmd->pitches[0], cpp, mode_cmd->width);
>>> +   return ERR_PTR(-EINVAL);
>>> +   }
>>
>> This should possibly be tested in drm_internal_framebuffer_create instead.
> 
> It seems to me drm_format_info_min_pitch() in framebuffer_check()
> already does it. We could just remove the check here?

Yeah, looks like that should be fine.
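
For reference, the core check in drm_internal_framebuffer_create() /
framebuffer_check() looks roughly like this (paraphrased from
drm_framebuffer.c; helper names may differ between kernel versions):

	/* Simplified: each plane's pitch must cover at least the
	 * minimum pitch the format requires for the requested width. */
	for (i = 0; i < info->num_planes; i++) {
		unsigned int width = fb_plane_width(r->width, info, i);
		u64 min_pitch = drm_format_info_min_pitch(info, i, width);

		if (r->pitches[i] < min_pitch) {
			DRM_DEBUG_KMS("bad pitch %u for plane %d\n",
				      r->pitches[i], i);
			return -EINVAL;
		}
	}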


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v5 1/2] drm/amd: validate user pitch alignment

2019-01-03 Thread Michel Dänzer
On 2018-12-30 2:00 a.m., Yu Zhao wrote:
> Userspace may request a pitch alignment that is not supported by the
> GPU. Some userspace requests 32, but the GPU ignores it and uses the
> default 64 when cpp is 4. If the GEM object is allocated based on the
> smaller alignment, GPU DMA will go out of bounds.
> 
> Cc: sta...@vger.kernel.org # v4.2+
> Signed-off-by: Yu Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> index 15ce7e681d67..16af80ccd0a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> @@ -527,6 +527,22 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>   struct drm_gem_object *obj;
>   struct amdgpu_framebuffer *amdgpu_fb;
>   int ret;
> + struct amdgpu_device *adev = dev->dev_private;
> + int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
> + int pitch = mode_cmd->pitches[0] / cpp;
> +
> + if (pitch < mode_cmd->width) {
> + DRM_DEBUG_KMS("Expecting pitch(%d)/cpp(%d) >= width(%d)\n",
> +   mode_cmd->pitches[0], cpp, mode_cmd->width);
> + return ERR_PTR(-EINVAL);
> + }

This should possibly be tested in drm_internal_framebuffer_create instead.


> + pitch = amdgpu_align_pitch(adev, pitch, cpp, false);
> + if (mode_cmd->pitches[0] != pitch) {
> + DRM_DEBUG_KMS("Invalid pitch: expecting %d but got %d\n",
> +   pitch, mode_cmd->pitches[0]);
> + return ERR_PTR(-EINVAL);
> + }
>  
>   obj = drm_gem_object_lookup(file_priv, mode_cmd->handles[0]);
>   if (obj ==  NULL) {
> 

Apart from the above, the v5 series looks good to me.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 2/3] drm/amd: validate user pitch alignment

2018-12-27 Thread Michel Dänzer
On 2018-12-23 10:44 p.m., Yu Zhao wrote:
> On Fri, Dec 21, 2018 at 10:07:26AM +0100, Michel Dänzer wrote:
>> On 2018-12-21 4:10 a.m., Yu Zhao wrote:
>>> Userspace may request a pitch alignment that is not supported by the
>>> GPU. Some userspace requests 32, but the GPU ignores it and uses the
>>> default 64 when cpp is 4. If the GEM object is allocated based on the
>>> smaller alignment, GPU DMA will go out of bounds.
>>>
>>> For GPUs that do frame buffer compression, DMA writes to out-of-bounds
>>> memory will cause memory corruption.
>>>
>>> Signed-off-by: Yu Zhao 
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> index e309d26170db..755daa332f8a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
>>> @@ -527,6 +527,15 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>>> struct drm_gem_object *obj;
>>> struct amdgpu_framebuffer *amdgpu_fb;
>>> int ret;
>>> +   struct amdgpu_device *adev = dev->dev_private;
>>> +   int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
>>> +   int pitch = amdgpu_align_pitch(adev, mode_cmd->width, cpp, false);
>>
>> Also, this needs to use mode_cmd->pitches[0] instead of mode_cmd->width,
>> otherwise it'll spuriously fail for larger but well-aligned pitches.
> 
> Actually mode_cmd->pitches[0] is the aligned mode_cmd->width multiplied
> by cpp. So we can't just use mode_cmd->pitches[0].

Use mode_cmd->pitches[0] / cpp then?


> And I'm not sure if the hardware works with larger alignment

It does.

> (it certainly ignores smaller alignment).

Yeah, pitch must be >= width. Maybe this patch could check that as well.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] drm: add capability DRM_CAP_ASYNC_UPDATE

2018-12-21 Thread Michel Dänzer
On 2018-12-21 2:26 p.m., Daniel Vetter wrote:
> On Fri, Dec 21, 2018 at 10:47:27AM +0100, Michel Dänzer wrote:
>> On 2018-12-20 6:16 p.m., Michel Dänzer wrote:
>>> On 2018-12-20 6:09 p.m., Daniel Vetter wrote:
>>>> On Thu, Dec 20, 2018 at 6:03 PM Alex Deucher  wrote:
>>>>> On Thu, Dec 20, 2018 at 11:54 AM Daniel Vetter  wrote:
>>>>
>>>>>> Not sure about the gamma thing since we had opposite bugs on i915
>>>>>> about gamma not being vsynced and tearing terribly. Cursor is special
>>>>>> since it tends to be too small to notice tearing.
>>>>>
>>>>> Our cursor hw (and possibly gamma as well Nicholas?  Harry?) is double
>>>>> buffered, so we can update it any time for the most part and the
>>>>> changes won't take effect until the next vupdate period.
>>>>
>>>> Hm, I guess we could make the gamma update optionally async, and let
>>>> drivers deal. The issue is that the current async helper stuff won't
>>>> cope with gamma updates (it's aimed at plane updates only, not at crtc
>>>> property updates). Or we get userspace to do proper atomic updates. Or
>>>> we do some faking in the kernel, e.g. waiting with the gamma update
>>>> until the next atomic update happens. But that kinda breaks
>>>> ->atomic_check.
>>>
>>> Would it be possible to merge gamma changes into a pending commit, if
>>> there is one, and create a new commit otherwise?
>>>
>>> Otherwise the atomic API just moves this same problem to be solved by
>>> userspace.
>>
>> Actually, I'm not sure this is even solvable in a Xorg driver. When it
>> gets called to update the gamma LUT:
>>
>> 1. If there's a pending atomic commit, it cannot amend that, can it?
> 
> Yup, but on the kernel side we kinda have the same problem.

It's probably easier to solve in the kernel though; any solution
allowing userspace to amend commits would likely need similar changes in
the kernel anyway.


>> 2. It has no way of knowing when the next atomic commit would be
>> submitted e.g. for a page flip, so it kind of has to create its own
>> commit for the gamma update.
> 
> Block handler is what I had in mind for the fallback commit, if no other
> commit happened meanwhile (which would need to include it).

The BlockHandler runs more or less immediately after the gamma update,
after any other X11 requests submitted later by the same client and
flushed to the X server together. So this would still result in a commit
with only the gamma update most of the time. There might be a small
chance of combining with a flip with something like Night Light which is
done by the compositor itself (but still only if the compositor defers
the gamma update until just before it calls SwapBuffers), and basically
zero chance with something separate like RedShift.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] drm: add capability DRM_CAP_ASYNC_UPDATE

2018-12-21 Thread Michel Dänzer
On 2018-12-20 6:38 p.m., Kazlauskas, Nicholas wrote:
> On 12/20/18 12:09 PM, Daniel Vetter wrote:
>> On Thu, Dec 20, 2018 at 6:03 PM Alex Deucher  wrote:
>>> On Thu, Dec 20, 2018 at 11:54 AM Daniel Vetter  wrote:
>>>>
>>>> Not sure about the gamma thing since we had opposite bugs on i915
>>>> about gamma not being vsynced and tearing terribly. Cursor is special
>>>> since it tends to be too small to notice tearing.
>>>
>>> Our cursor hw (and possibly gamma as well Nicholas?  Harry?) is double
>>> buffered, so we can update it any time for the most part and the
>>> changes won't take effect until the next vupdate period.
> 
> I haven't really investigated too much into the gamma stuttering issue, 
> but I think it's similar to the cursor update - a high volume of atomic 
> updates that ends up skipping over a vblank or two.

FWIW, I don't think the use-cases described in
https://bugs.freedesktop.org/108917 (Night Light / RedShift) involve a
particularly high volume of gamma updates. I was able to reproduce the
stuttering with ~10 gamma changes per second, but I suspect even a
single one could cause a frame drop. I assume the issue is that gamma
updates are done as separate atomic commits, which can delay other
commits for page flips.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] drm: add capability DRM_CAP_ASYNC_UPDATE

2018-12-21 Thread Michel Dänzer
On 2018-12-20 6:16 p.m., Michel Dänzer wrote:
> On 2018-12-20 6:09 p.m., Daniel Vetter wrote:
>> On Thu, Dec 20, 2018 at 6:03 PM Alex Deucher  wrote:
>>> On Thu, Dec 20, 2018 at 11:54 AM Daniel Vetter  wrote:
>>
>>>> Not sure about the gamma thing since we had opposite bugs on i915
>>>> about gamma not being vsynced and tearing terribly. Cursor is special
>>>> since it tends to be too small to notice tearing.
>>>
>>> Our cursor hw (and possibly gamma as well Nicholas?  Harry?) is double
>>> buffered, so we can update it any time for the most part and the
>>> changes won't take effect until the next vupdate period.
>>
>> Hm, I guess we could make the gamma update optionally async, and let
>> drivers deal. The issue is that the current async helper stuff won't
>> cope with gamma updates (it's aimed at plane updates only, not at crtc
>> property updates). Or we get userspace to do proper atomic updates. Or
>> we do some faking in the kernel, e.g. waiting with the gamma update
>> until the next atomic update happens. But that kinda breaks
>> ->atomic_check.
> 
> Would it be possible to merge gamma changes into a pending commit, if
> there is one, and create a new commit otherwise?
> 
> Otherwise the atomic API just moves this same problem to be solved by
> userspace.

Actually, I'm not sure this is even solvable in a Xorg driver. When it
gets called to update the gamma LUT:

1. If there's a pending atomic commit, it cannot amend that, can it?

2. It has no way of knowing when the next atomic commit would be
submitted e.g. for a page flip, so it kind of has to create its own
commit for the gamma update.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 3/3] drm/amd: validate user GEM object size

2018-12-21 Thread Michel Dänzer
On 2018-12-21 4:10 a.m., Yu Zhao wrote:
> When creating a frame buffer, userspace may request to attach to a
> previously allocated GEM object that is smaller than what the GPU
> requires. Validation must be done to prevent out-of-bounds DMA,
> which could not only corrupt memory but also reveal sensitive data.
> 
> This fix is not done in a common code path because individual
> drivers might have different requirements.
> 
> Signed-off-by: Yu Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> index 755daa332f8a..bb48b016cc68 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> @@ -527,6 +527,7 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>   struct drm_gem_object *obj;
>   struct amdgpu_framebuffer *amdgpu_fb;
>   int ret;
> + int height;
>   struct amdgpu_device *adev = dev->dev_private;
>   int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
>   int pitch = amdgpu_align_pitch(adev, mode_cmd->width, cpp, false);
> @@ -550,6 +551,13 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>   return ERR_PTR(-EINVAL);
>   }
>  
> + height = ALIGN(mode_cmd->height, 8);
> + if (obj->size < pitch * height) {
> + dev_err(&adev->pdev->dev, "Invalid GEM size: expecting %d but got %d\n",
> +     pitch * height, obj->size);

Same comment as patch 2, and maybe write "expecting >= %d but got %d"
for clarity.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 2/3] drm/amd: validate user pitch alignment

2018-12-21 Thread Michel Dänzer
On 2018-12-21 4:10 a.m., Yu Zhao wrote:
> Userspace may request a pitch alignment that is not supported by the
> GPU. Some userspace requests 32, but the GPU ignores it and uses the
> default 64 when cpp is 4. If the GEM object is allocated based on the
> smaller alignment, GPU DMA will go out of bounds.
> 
> For GPUs that do frame buffer compression, DMA writes to out-of-bounds
> memory will cause memory corruption.
> 
> Signed-off-by: Yu Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> index e309d26170db..755daa332f8a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> @@ -527,6 +527,15 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>   struct drm_gem_object *obj;
>   struct amdgpu_framebuffer *amdgpu_fb;
>   int ret;
> + struct amdgpu_device *adev = dev->dev_private;
> + int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
> + int pitch = amdgpu_align_pitch(adev, mode_cmd->width, cpp, false);

Also, this needs to use mode_cmd->pitches[0] instead of mode_cmd->width,
otherwise it'll spuriously fail for larger but well-aligned pitches.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 2/3] drm/amd: validate user pitch alignment

2018-12-21 Thread Michel Dänzer
On 2018-12-21 4:10 a.m., Yu Zhao wrote:
> Userspace may request a pitch alignment that is not supported by the
> GPU. Some userspace requests 32, but the GPU ignores it and uses the
> default 64 when cpp is 4. If the GEM object is allocated based on the
> smaller alignment, GPU DMA will go out of bounds.
> 
> For GPUs that do frame buffer compression, DMA writes to out-of-bounds
> memory will cause memory corruption.
> 
> Signed-off-by: Yu Zhao 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> index e309d26170db..755daa332f8a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
> @@ -527,6 +527,15 @@ amdgpu_display_user_framebuffer_create(struct drm_device *dev,
>   struct drm_gem_object *obj;
>   struct amdgpu_framebuffer *amdgpu_fb;
>   int ret;
> + struct amdgpu_device *adev = dev->dev_private;
> + int cpp = drm_format_plane_cpp(mode_cmd->pixel_format, 0);
> + int pitch = amdgpu_align_pitch(adev, mode_cmd->width, cpp, false);
> +
> + if (mode_cmd->pitches[0] != pitch) {
> + dev_err(&adev->pdev->dev, "Invalid pitch: expecting %d but got %d\n",
> + pitch, mode_cmd->pitches[0]);

This message shouldn't be printed by default, or buggy / malicious
userspace can spam dmesg. I suggest using DRM_DEBUG_KMS or DRM_DEBUG_DRIVER.


> + return ERR_PTR(-EINVAL);
> + }
>  
>   obj = drm_gem_object_lookup(file_priv, mode_cmd->handles[0]);
>   if (obj ==  NULL) {
> 

Other than that, looks good to me.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 1/3] drm/amd: fix race in page flip job

2018-12-21 Thread Michel Dänzer
On 2018-12-21 4:10 a.m., Yu Zhao wrote:
> Fix a race between page flip job submission and completion. We invoke
> the page_flip callback to submit the page flip job to the GPU first and
> then set pflip_status. If the GPU fires the page flip done irq in
> between, its handler won't see the correct pflip_status and thus will
> refuse to signal job completion. The job will eventually time out.
> 
> Reverse the order of calling page_flip and setting pflip_status to
> fix the race.

There is no race, amdgpu_crtc->pflip_status is protected by dev->event_lock.
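
A sketch of why (simplified; based on how amdgpu takes dev->event_lock
around pflip_status on both sides, details vary by kernel version):

	/* Submission path: publish pflip_status under the same lock
	 * the pflip IRQ handler takes, so the handler either sees the
	 * fully submitted flip or none at all. */
	spin_lock_irqsave(&crtc->dev->event_lock, flags);
	amdgpu_crtc->pflip_status = AMDGPU_FLIP_PENDING;
	spin_unlock_irqrestore(&crtc->dev->event_lock, flags);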


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] drm: add capability DRM_CAP_ASYNC_UPDATE

2018-12-20 Thread Michel Dänzer
On 2018-12-20 6:09 p.m., Daniel Vetter wrote:
> On Thu, Dec 20, 2018 at 6:03 PM Alex Deucher  wrote:
>> On Thu, Dec 20, 2018 at 11:54 AM Daniel Vetter  wrote:
> 
>>> Not sure about the gamma thing since we had opposite bugs on i915
>>> about gamma not being vsynced and tearing terribly. Cursor is special
>>> since it tends to be too small to notice tearing.
>>
>> Our cursor hw (and possibly gamma as well Nicholas?  Harry?) is double
>> buffered, so we can update it any time for the most part and the
>> changes won't take effect until the next vupdate period.
> 
> Hm, I guess we could make the gamma update optionally async, and let
> drivers deal. The issue is that the current async helper stuff won't
> cope with gamma updates (it's aimed at plane updates only, not at crtc
> property updates). Or we get userspace to do proper atomic updates. Or
> we do some faking in the kernel, e.g. waiting with the gamma update
> until the next atomic update happens. But that kinda breaks
> ->atomic_check.

Would it be possible to merge gamma changes into a pending commit, if
there is one, and create a new commit otherwise?

Otherwise the atomic API just moves this same problem to be solved by
userspace.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


[PATCH] drm/ttm: Use drm_debug_printer for all ttm_bo_mem_space_debug output

2018-12-20 Thread Michel Dänzer
From: Michel Dänzer 

No need for pr_err here, the pr_err message in ttm_bo_evict is enough
to draw attention to something not going as planned.

Signed-off-by: Michel Dänzer 
---

Similar to a previous patch, but this one leaves the pr_err messages in
ttm_bo_evict untouched.

 drivers/gpu/drm/ttm/ttm_bo.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index d87935bf8e30..0ec08394e17a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -77,38 +77,39 @@ static inline int ttm_mem_type_from_place(const struct ttm_place *place,
 	return 0;
 }
 
-static void ttm_mem_type_debug(struct ttm_bo_device *bdev, int mem_type)
+static void ttm_mem_type_debug(struct ttm_bo_device *bdev, struct drm_printer *p,
+			       int mem_type)
 {
 	struct ttm_mem_type_manager *man = &bdev->man[mem_type];
-	struct drm_printer p = drm_debug_printer(TTM_PFX);
 
-	pr_err("has_type: %d\n", man->has_type);
-	pr_err("use_type: %d\n", man->use_type);
-	pr_err("flags: 0x%08X\n", man->flags);
-	pr_err("gpu_offset: 0x%08llX\n", man->gpu_offset);
-	pr_err("size: %llu\n", man->size);
-	pr_err("available_caching: 0x%08X\n", man->available_caching);
-	pr_err("default_caching: 0x%08X\n", man->default_caching);
+	drm_printf(p, "has_type: %d\n", man->has_type);
+	drm_printf(p, "use_type: %d\n", man->use_type);
+	drm_printf(p, "flags: 0x%08X\n", man->flags);
+	drm_printf(p, "gpu_offset: 0x%08llX\n", man->gpu_offset);
+	drm_printf(p, "size: %llu\n", man->size);
+	drm_printf(p, "available_caching: 0x%08X\n", man->available_caching);
+	drm_printf(p, "default_caching: 0x%08X\n", man->default_caching);
 	if (mem_type != TTM_PL_SYSTEM)
-		(*man->func->debug)(man, &p);
+		(*man->func->debug)(man, p);
 }
 
 static void ttm_bo_mem_space_debug(struct ttm_buffer_object *bo,
 				   struct ttm_placement *placement)
 {
+	struct drm_printer p = drm_debug_printer(TTM_PFX);
 	int i, ret, mem_type;
 
-	pr_err("No space for %p (%lu pages, %luK, %luM)\n",
-	       bo, bo->mem.num_pages, bo->mem.size >> 10,
-	       bo->mem.size >> 20);
+	drm_printf(&p, "No space for %p (%lu pages, %luK, %luM)\n",
+		   bo, bo->mem.num_pages, bo->mem.size >> 10,
+		   bo->mem.size >> 20);
 	for (i = 0; i < placement->num_placement; i++) {
 		ret = ttm_mem_type_from_place(&placement->placement[i],
 					      &mem_type);
 		if (ret)
 			return;
-		pr_err("  placement[%d]=0x%08X (%d)\n",
-		       i, placement->placement[i].flags, mem_type);
-		ttm_mem_type_debug(bo->bdev, mem_type);
+		drm_printf(&p, "  placement[%d]=0x%08X (%d)\n",
+			   i, placement->placement[i].flags, mem_type);
+		ttm_mem_type_debug(bo->bdev, &p, mem_type);
 	}
 }
 
-- 
2.20.1



Re: [PATCH 2/2] drm/ttm: Use pr_debug for all output from ttm_bo_evict

2018-12-06 Thread Michel Dänzer
On 2018-12-06 5:46 p.m., Joe Perches wrote:
> On Thu, 2018-12-06 at 17:39 +0800, Zhang, Jerry(Junwei) wrote:
>> On 12/6/18 5:33 PM, Koenig, Christian wrote:
>>> On 06.12.18 at 10:09, Michel Dänzer wrote:
>>>> On 2018-12-06 3:43 a.m., Zhang, Jerry(Junwei) wrote:
>>>>> On 12/6/18 12:56 AM, Michel Dänzer wrote:
>>>>>> From: Michel Dänzer 
>>>>>>
>>>>>> All the output is related, so it should all be printed the same way.
>>>>>> Some of it was using pr_debug, but some of it appeared in dmesg by
>>>>>> default. The caller should handle failure, so there's no need to spam
>>>>>> dmesg with potentially quite a lot of output by default.
>>>>>>
>>>>>> Signed-off-by: Michel Dänzer 
>>>>> Sounds reasonable, but personally prefer to show error when some
>>>>> vital incident happens, e.g. no memory on eviction.
>>>> The amdgpu driver still prints these in that case:
>>>>
>>>>[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* amdgpu_cs_list_validate(validated) failed.
>>>>[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
> 
> Aren't dump_stack()s already done on all these allocation failures?
> I don't notice any use of __GFP_NOWARN on generic allocations in drm.

Most of the time, these messages are due to being unable to allocate a
TTM BO or move it where it needs to go, not due to the kernel failing to
allocate memory in general.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 1/2] drm: Only #define DEBUG if CONFIG_DYNAMIC_DEBUG is disabled

2018-12-06 Thread Michel Dänzer
On 2018-12-06 5:10 p.m., Daniel Thompson wrote:
> On Thu, Dec 06, 2018 at 03:41:16PM +0100, Michel Dänzer wrote:
>> On 2018-12-06 1:23 p.m., Joe Perches wrote:
>>> On Thu, 2018-12-06 at 12:52 +0100, Michel Dänzer wrote:
>>>> In contrast to the 2b case, the pr_debug output isn't visible by default
>>>> with 1b, so the latter doesn't fit "always produce output" either.
>>>
>>> I think you are mistaken here.
>>
>> Still puzzled as to what you're hoping to achieve with that kind of
>> language. None of the confusion about this patch has been on my part. :)
>>
>>
>>> Adding #define DEBUG as Chris did enables pr_debug output
>>> and is your 1b.
>>>
>>> Perhaps your default console logging level is set to a
>>> non-default value.
>>
>> I have CONFIG_DYNAMIC_DEBUG enabled in my kernels. The problem addressed
>> by this patch is that messages from drm_debug_printer are visible by
>> default (case 2b), whereas they shouldn't be (case 2a, like 1b).
> 
> When enabled (either dynamically or statically) pr_debug() will emit
> output at KERN_DEBUG level regardless of whether CONFIG_DYNAMIC_DEBUG
> is defined or not.
> 
> Thus unless you change additional settings (either dynamically or
> statically) then debug messages should not be shown on the console
> because the default settings filter KERN_DEBUG messages. However they
> are available via dmesg and system loggers (syslogd, journal, etc).
> 
> The patch proposed will change the behaviour of the debug messages
> w.r.t. system loggers based on whether the user has enabled
> CONFIG_DYNAMIC_DEBUG or not, violating the principle of least surprise.

Ah, that makes sense now, thanks.

I'm withdrawing this patch.
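
For reference, the three cases discussed above, simplified from
include/linux/printk.h:

#if defined(CONFIG_DYNAMIC_DEBUG)
/* compiled in, disabled by default, toggled at runtime */
#define pr_debug(fmt, ...) dynamic_pr_debug(fmt, ##__VA_ARGS__)
#elif defined(DEBUG)
/* always emitted, at KERN_DEBUG level */
#define pr_debug(fmt, ...) printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
#else
/* compiled out entirely */
#define pr_debug(fmt, ...) no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
#endif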


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 1/2] drm: Only #define DEBUG if CONFIG_DYNAMIC_DEBUG is disabled

2018-12-06 Thread Michel Dänzer
On 2018-12-06 1:23 p.m., Joe Perches wrote:
> On Thu, 2018-12-06 at 12:52 +0100, Michel Dänzer wrote:
>> In contrast to the 2b case, the pr_debug output isn't visible by default
>> with 1b, so the latter doesn't fit "always produce output" either.
> 
> I think you are mistaken here.

Still puzzled as to what you're hoping to achieve with that kind of
language. None of the confusion about this patch has been on my part. :)


> Adding #define DEBUG as Chris did enables pr_debug output
> and is your 1b.
> 
> Perhaps your default console logging level is set to a
> non-default value.

I have CONFIG_DYNAMIC_DEBUG enabled in my kernels. The problem addressed
by this patch is that messages from drm_debug_printer are visible by
default (case 2b), whereas they shouldn't be (case 2a, like 1b).


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 3/5] drm: fix drm_mode_addfb() on big endian machines.

2018-09-04 Thread Michel Dänzer
On 2018-09-03 7:07 p.m., Ilia Mirkin wrote:
> On Mon, Sep 3, 2018 at 12:45 PM, Daniel Vetter  wrote:
>> On Mon, Sep 03, 2018 at 12:57:54PM +0200, Gerd Hoffmann wrote:
>>> Userspace on big endian machines typically expects that the ADDFB ioctl
>>> returns a big endian framebuffer.  drm_mode_addfb() will call
>>> drm_mode_addfb2() unconditionally with little endian DRM_FORMAT_*
>>> values though, which is wrong.  This patch fixes that.
>>>
>>> Drivers (both kernel and xorg) have quirks in place to deal with the
>>> broken drm_mode_addfb() behavior.  Because of this we can't just change
>>> drm_mode_addfb() behavior for everybody without breaking things.  So add
>>> a new driver feature flag DRIVER_PREFER_HOST_BYTE_ORDER, so drivers can
>>> opt-in.
>>>
>>> Signed-off-by: Gerd Hoffmann 
>>> ---
>>>  include/drm/drm_drv.h |  1 +
>>>  drivers/gpu/drm/drm_framebuffer.c | 11 +++++++++++
>>>  2 files changed, 12 insertions(+)
>>>
>>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>>> index 46a8009784..9cf12596cd 100644
>>> --- a/include/drm/drm_drv.h
>>> +++ b/include/drm/drm_drv.h
>>> @@ -57,6 +57,7 @@ struct drm_printer;
>>>  #define DRIVER_KMS_LEGACY_CONTEXT       0x2
>>>  #define DRIVER_SYNCOBJ                  0x4
>>>  #define DRIVER_PREFER_XBGR_30BPP        0x8
>>> +#define DRIVER_PREFER_HOST_BYTE_ORDER   0x10
>>
>> Hm, not a huge fan of using driver_flags for random little quirks. I think
>> a boolean in struct drm_mode_config would be much better. Bonus if you
>> also move the 30bpp hack over to that. Something like
>> mode_config.quirk_addfb_host_byte_order and
>> mode_config.quirk_addfb_prefer_xbgr_30bpp or whatever. That has the upside
>> of giving us a really nice place to put a huge comment about what this is
>> supposed to do.
>>
>> I think otherwise this looks overall rather reasonable. I think the only
>> other driver that ever cared about big endian was radeon/amdgpu. Would be
>> good to get at least an ack from amd folks, or a "meh, stopped caring".
> 
> and nouveau.
> 
> I do believe that ADDFB should be made to always prefer host byte
> order -- this is how all of the existing implementations work in
> practice. However e.g. nouveau wants DRM_FORMAT_XRGB. But it still
> treats it as host byte order. This will become more important in a
> world where ADDFB2 is more common.
> 
> So, I think that this change should be applied, drivers (I suspect
> just nouveau and radeon) fixed up to consume the "new" formats, [...]

As explained before, that would break radeon userspace on big endian hosts.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 3/5] drm: fix drm_mode_addfb() on big endian machines.

2018-09-03 Thread Michel Dänzer
On 2018-09-03 6:45 p.m., Daniel Vetter wrote:
> On Mon, Sep 03, 2018 at 12:57:54PM +0200, Gerd Hoffmann wrote:
>> Userspace on big endian machines typically expects that the ADDFB ioctl
>> returns a big endian framebuffer.  drm_mode_addfb() will call
>> drm_mode_addfb2() unconditionally with little endian DRM_FORMAT_*
>> values though, which is wrong.  This patch fixes that.
>>
>> Drivers (both kernel and xorg) have quirks in place to deal with the
>> broken drm_mode_addfb() behavior.  Because of this we can't just change
>> drm_mode_addfb() behavior for everybody without breaking things.  So add
>> a new driver feature flag DRIVER_PREFER_HOST_BYTE_ORDER, so drivers can
>> opt-in.

Since the changes are opt-in now, they shouldn't affect drivers which
don't opt in; they should work as well (or as badly :) after these
changes as they did before. So no concerns from my side anymore.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] swiotlb: Fix inversed DMA_ATTR_NO_WARN test

2018-05-25 Thread Michel Dänzer
On 2018-05-25 10:41 AM, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 03:13:58PM +0200, Christian König wrote:
>> On 02.05.2018 at 18:59, Michel Dänzer wrote:
>>> On 2018-05-02 06:21 PM, Christoph Hellwig wrote:
>>>> On Wed, May 02, 2018 at 04:31:09PM +0200, Michel Dänzer wrote:
>>>>>> No.  __GFP_NOWARN (and gfp_t flags in general) are the wrong interface
>>>>>> for dma allocations and just cause problems.  I actually plan to
>>>>>> get rid of the gfp_t argument in dma_alloc_attrs sooner, and only
>>>>>> allow either GFP_KERNEL or GFP_DMA passed in dma_alloc_coherent.
>>>>> How about GFP_TRANSHUGE_LIGHT? TTM uses that to opportunistically
>>>>> allocate huge pages (GFP_TRANSHUGE can result in unacceptably long
>>>>> delays with memory pressure).
>>>> Well, that is exactly what I don't want drivers to do - same for
>>>> __GFP_COMP in some drm code.  This very much assumes the page allocator
>>>> is used to back dma allocations, which very often it actually isn't, and
>>>> any use of magic gfp flags creates a tight coupling of consumers with a
>>>> specific implementation.
>>>>
>>>> In general I can't think of a good reason not to actually use
>>>> GFP_TRANSHUGE_LIGHT by default in the dma allocator unless
>>>> DMA_ATTR_ALLOC_SINGLE_PAGES is set.  Can you prepare a patch for that?
>>> I'm afraid I'll have to leave that to somebody else.
>>
>> Coming back to this topic once more, sorry for the delay but busy as usual 
>> :)
>>
>> What exactly do you mean with "dma allocator" here? The TTM allocator using 
>> the dma_alloc_coherent calls? Or the swiotlb implementation of the calls?
> 
> dma allocator in this case: backends for dma_alloc_coherent/
> dma_alloc_attrs.  Most importantly dma_direct_alloc.
> 
> But while we're at it I can't actually see any GFP_TRANSHUGE_LIGHT
> usage in TTM, just plain old GFP_TRANSHUGE.

See
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da291320baec914f0bb4e65a9dccb86bd6c728f2
.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] swiotlb: Fix inversed DMA_ATTR_NO_WARN test

2018-05-02 Thread Michel Dänzer
On 2018-05-02 06:21 PM, Christoph Hellwig wrote:
> On Wed, May 02, 2018 at 04:31:09PM +0200, Michel Dänzer wrote:
>>> No.  __GFP_NOWARN (and gfp_t flags in general) are the wrong interface
>>> for dma allocations and just cause problems.  I actually plan to
>>> get rid of the gfp_t argument in dma_alloc_attrs sooner, and only
>>> allow either GFP_KERNEL or GFP_DMA passed in dma_alloc_coherent.
>>
>> How about GFP_TRANSHUGE_LIGHT? TTM uses that to opportunistically
>> allocate huge pages (GFP_TRANSHUGE can result in unacceptably long
>> delays with memory pressure).
> 
> Well, that is exactly what I don't want drivers to do - same for
> __GFP_COMP in some drm code.  This very much assumes the page allocator
> is used to back dma allocations, which very often it actually isn't, and
> any use of magic gfp flags creates a tight coupling of consumers with a
> specific implementation.
> 
> In general I can't think of a good reason not to actually use
> GFP_TRANSHUGE_LIGHT by default in the dma allocator unless
> DMA_ATTR_ALLOC_SINGLE_PAGES is set.  Can you prepare a patch for that?

I'm afraid I'll have to leave that to somebody else.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] swiotlb: Fix inversed DMA_ATTR_NO_WARN test

2018-05-02 Thread Michel Dänzer
On 2018-05-02 02:41 PM, Christoph Hellwig wrote:
> On Wed, May 02, 2018 at 02:18:56PM +0200, Daniel Vetter wrote:
>> Other dma-api backends like cma just shut up when __GFP_NOWARN is
>> passed. And afaiui Christoph Hellwig has plans to nuke the DMA_ATTR
>> stuff (or at least clean it up) - should we just remove
>> DMA_ATTR_NO_WARN and instead only look at __GFP_NOWARN?
> 
> No.  __GFP_NOWARN (and gfp_t flags in general) are the wrong interface
> for dma allocations and just cause problems.  I actually plan to
> get rid of the gfp_t argument in dma_alloc_attrs sooner, and only
> allow either GFP_KERNEL or GFP_DMA passed in dma_alloc_coherent.

How about GFP_TRANSHUGE_LIGHT? TTM uses that to opportunistically
allocate huge pages (GFP_TRANSHUGE can result in unacceptably long
delays with memory pressure).
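
For context: per include/linux/gfp.h, GFP_TRANSHUGE_LIGHT is
GFP_TRANSHUGE minus __GFP_DIRECT_RECLAIM, so the allocation never
stalls in direct reclaim/compaction. A sketch (the exact companion
flags TTM combines with it have varied between kernel versions):

	/* Opportunistic huge page allocation: succeed only if a huge
	 * page is readily available, never stall to compact memory. */
	gfp_t huge_flags = GFP_TRANSHUGE_LIGHT | __GFP_NORETRY | __GFP_NOWARN;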


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE

2018-05-02 Thread Michel Dänzer
On 2018-04-29 01:56 AM, Ilia Mirkin wrote:
> On Sat, Apr 28, 2018 at 7:02 PM, Michel Dänzer <mic...@daenzer.net> wrote:
>>
>> Unfortunately, there was an swiotlb regression (not directly related to
>> Christian's work) shortly after this fix, also in 4.16-rc1, which is now
>> fixed in 4.17-rc1 and will be backported to 4.16.y.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.16.5&id=2c9dacf5bfe1e45d96dfe97cb71d2b717786a7b9
> 
> This guy? Didn't help. I'm running 4.16.4 right now.

Try https://patchwork.freedesktop.org/patch/219765/ .


>>> We now have *two* broken releases, v4.15 and v4.16 (anything that
>>> spews error messages and stack traces ad-infinitum in dmesg is, by
>>> definition, broken).
>>
>> I haven't seen any evidence that there's still an issue in 4.15, is
>> there any?
> 
> Well, I did have a late 4.15 rc kernel in addition to the 'suppress
> warning' patch. Now I'm questioning my memory of whether the issue was
> resolved there or not. I'm pretty sure that 'not', but no longer 100%.

It seems pretty clear it is in fact resolved in 4.15. Even if it indeed
wasn't for you, did you expect us to find out by monitoring you on IRC 24/7?


> As a user who is not in a position to debug this directly due to lack
> of time and knowledge, [...]

I have plenty of other things to do other than looking into this and
coming up with the fix above as well, and I'm no more knowledgeable
about the SWIOTLB code than you.

Anyway, nobody can track down every problem they run into. But let me
kindly ask you to more carefully look at the information available
before deciding which tree to bark up in the future.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE

2018-05-01 Thread Michel Dänzer
On 2018-05-01 01:15 AM, Dave Airlie wrote:
>>
>>
>> Yes, I fixed the original false positive messages myself with the swiotlb
>> maintainer and I was CCed in fixing the recent fallout from Chris changes as
>> well.
> 
> So do we have a good summary of where this at now?
> 
> I'm getting reports on 4.16.4 still displaying these, what hammer do I
> need to hit things with to get 4.16.x+1 to not do this?
> 
> Are there still outstanding issues upstream?

There are, https://patchwork.freedesktop.org/patch/219765/ should
hopefully fix the last of it.


> [...] I've no idea if the swiotlb things people report are the false
> positive, or some new thing.

The issues I've seen reported with 4.16 are false positives from TTM's
perspective, which uses DMA_ATTR_NO_WARN to suppress these warnings, due
to multiple regressions introduced by commit
0176adb004065d6815a8e67946752df4cd947c5b "swiotlb: refactor coherent
buffer allocation" in 4.16-rc1.
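
For reference, an allocation that opts out of the warning looks roughly
like this (a sketch; TTM's actual call sites differ in detail):

	/* The caller has a fallback path, so ask both the page
	 * allocator and the DMA layer to stay quiet about failures. */
	vaddr = dma_alloc_attrs(dev, size, &dma_handle,
				GFP_KERNEL | __GFP_NOWARN, DMA_ATTR_NO_WARN);
	if (!vaddr)
		/* fall back, e.g. to a smaller or differently backed pool */
		return NULL;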


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


[PATCH] swiotlb: Fix inversed DMA_ATTR_NO_WARN test

2018-05-01 Thread Michel Dänzer
From: Michel Dänzer <michel.daen...@amd.com>

The result was printing the warning only when we were explicitly asked
not to.

Cc: sta...@vger.kernel.org
Fixes: 0176adb004065d6815a8e67946752df4cd947c5b "swiotlb: refactor coherent buffer allocation"
Signed-off-by: Michel Dänzer <michel.daen...@amd.com>
---
 lib/swiotlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index c43ec2271469..e9ac21540628 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -750,7 +750,7 @@ swiotlb_alloc_buffer(struct device *dev, size_t size, dma_addr_t *dma_handle,
swiotlb_tbl_unmap_single(dev, phys_addr, size, DMA_TO_DEVICE,
DMA_ATTR_SKIP_CPU_SYNC);
 out_warn:
-   if ((attrs & DMA_ATTR_NO_WARN) && printk_ratelimit()) {
+   if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit()) {
dev_warn(dev,
"swiotlb: coherent allocation failed, size=%zu\n",
size);
-- 
2.17.0



Re: [PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE

2018-04-30 Thread Michel Dänzer
On 2018-04-29 09:02 AM, Christian König wrote:
> On 29.04.2018 at 01:02, Michel Dänzer wrote:
>> On 2018-04-28 06:30 PM, Ilia Mirkin wrote:
>>> On Fri, Apr 27, 2018 at 9:08 AM, Michel Dänzer <mic...@daenzer.net>
>>> wrote:
>>>> From: Michel Dänzer <michel.daen...@amd.com>
>>>>
>>>> Previously, TTM would always (with CONFIG_TRANSPARENT_HUGEPAGE enabled)
>>>> try to allocate huge pages. However, not all drivers can take advantage
>>>> of huge pages, but they would incur the overhead for allocating and
>>>> freeing them anyway.
>>>>
>>>> Now, drivers which can take advantage of huge pages need to set the new
>>>> flag TTM_PAGE_FLAG_TRANSHUGE to get them. Drivers not setting this flag
>>>> no longer incur any overhead for allocating or freeing huge pages.
>>>>
>>>> v2:
>>>> * Also guard swapping of consecutive pages in ttm_get_pages
>>>> * Reword commit log, hopefully clearer now
>>>>
>>>> Cc: sta...@vger.kernel.org
>>>> Signed-off-by: Michel Dänzer <michel.daen...@amd.com>
>>> Both I and lots of other people, based on reports, are still seeing
>>> plenty of issues with this as late as 4.16.4.
>> "lots of other people", "plenty of issues" sounds a bit exaggerated from
>> what I've seen. FWIW, while I did see the original messages myself, I
>> haven't seen any since Christian's original fix (see below), neither
>> with amdgpu nor radeon, even before this patch you followed up to.
>>
>>
>>> Admittedly I'm on nouveau, but others have reported issues with
>>> radeon/amdgpu as well. It's been going on since the feature was merged
>>> in v4.15, with what seems like little investigation from the authors
>>> introducing the feature.
>> That's not a fair assessment. See
>> https://bugs.freedesktop.org/show_bug.cgi?id=104082#c40 and following
>> comments.
>>
>> Christian fixed the original issue in
>> d0bc0c2a31c95002d37c3cc511ffdcab851b3256 "swiotlb: suppress warning when
>> __GFP_NOWARN is set". Christian did his best to try and get the fix in
>> before 4.15 final, but for reasons beyond his control, it was delayed
>> until 4.16-rc1 and then backported to 4.15.5.
>>
>> Unfortunately, there was an swiotlb regression (not directly related to
>> Christian's work) shortly after this fix, also in 4.16-rc1, which is now
>> fixed in 4.17-rc1 and will be backported to 4.16.y.
> 
> And that's exactly the reason why I intentionally kept this enabled for
> all users of the TTM DMA page pool and not put it behind a flag.
> 
> This change has surfaced quite a number of bugs in the swiotlb code
> which could have caused issues before. It's just that those code paths
> were never exercised massively before.
> 
> In addition, using huge pages is beneficial for the MM and CPU TLB
> (not implemented yet) even when the GPU driver can't make much use of it.

Do I understand correctly that you're against this patch?

AFAIU the only benefit of huge pages with a driver which doesn't take
advantage of them directly is "for the MM". Can you describe a bit more
what that benefit is exactly? Is it expected to outweigh the cost of
allocating / freeing huge pages?


>> It looks like there's at least one more bug left, but it's not clear yet
>> when that was introduced, whether it's directly related to Christian's
>> work, or indeed what the impact is. Let's not get ahead of ourselves.
> 
> Well my patches surfaced the problems, but the underlying issues were
> present even before those changes and I'm very well involved in fixing
> the underlying issues.
> 
> I even considered to just revert the huge page path for the DMA pool
> allocator, but it's just that the TTM patches seem to work exactly as
> they are intended. So that doesn't feel like doing the right thing here.

I agree. Has anyone reported this to the DMA/SWIOTLB developers?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE

2018-04-28 Thread Michel Dänzer
On 2018-04-28 06:30 PM, Ilia Mirkin wrote:
> On Fri, Apr 27, 2018 at 9:08 AM, Michel Dänzer <mic...@daenzer.net> wrote:
>> From: Michel Dänzer <michel.daen...@amd.com>
>>
>> Previously, TTM would always (with CONFIG_TRANSPARENT_HUGEPAGE enabled)
>> try to allocate huge pages. However, not all drivers can take advantage
>> of huge pages, but they would incur the overhead for allocating and
>> freeing them anyway.
>>
>> Now, drivers which can take advantage of huge pages need to set the new
>> flag TTM_PAGE_FLAG_TRANSHUGE to get them. Drivers not setting this flag
>> no longer incur any overhead for allocating or freeing huge pages.
>>
>> v2:
>> * Also guard swapping of consecutive pages in ttm_get_pages
>> * Reword commit log, hopefully clearer now
>>
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Michel Dänzer <michel.daen...@amd.com>
> 
> Both I and lots of other people, based on reports, are still seeing
> plenty of issues with this as late as 4.16.4.

"lots of other people", "plenty of issues" sounds a bit exaggerated from
what I've seen. FWIW, while I did see the original messages myself, I
haven't seen any since Christian's original fix (see below), neither
with amdgpu nor radeon, even before this patch you followed up to.


> Admittedly I'm on nouveau, but others have reported issues with
> radeon/amdgpu as well. It's been going on since the feature was merged
> in v4.15, with what seems like little investigation from the authors
> introducing the feature.

That's not a fair assessment. See
https://bugs.freedesktop.org/show_bug.cgi?id=104082#c40 and following
comments.

Christian fixed the original issue in
d0bc0c2a31c95002d37c3cc511ffdcab851b3256 "swiotlb: suppress warning when
__GFP_NOWARN is set". Christian did his best to try and get the fix in
before 4.15 final, but for reasons beyond his control, it was delayed
until 4.16-rc1 and then backported to 4.15.5.

Unfortunately, there was a swiotlb regression (not directly related to
Christian's work) shortly after this fix, also in 4.16-rc1, which is now
fixed in 4.17-rc1 and will be backported to 4.16.y.

It looks like there's at least one more bug left, but it's not clear yet
when that was introduced, whether it's directly related to Christian's
work, or indeed what the impact is. Let's not get ahead of ourselves.


> We now have *two* broken releases, v4.15 and v4.16 (anything that
> spews error messages and stack traces ad infinitum in dmesg is, by
> definition, broken).

I haven't seen any evidence that there's still an issue in 4.15; is
there any?


> You're putting this behind a flag now (finally),

I wrote this patch because a remark I happened to see you make on IRC
this week made me realize that the huge page support in TTM was enabled
for all drivers. Instead of making that kind of remark on IRC, it would
have been more constructive, and more conducive to a quick fix, to
suggest on the mailing list that the feature be made inactive for
drivers which don't need it.


At least, please do more research before making this kind of negative
post.

P.S. You might also want to look into whether nouveau really should be
hitting swiotlb in these cases.

-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


[PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE

2018-04-27 Thread Michel Dänzer
From: Michel Dänzer <michel.daen...@amd.com>

Previously, TTM would always (with CONFIG_TRANSPARENT_HUGEPAGE enabled)
try to allocate huge pages. However, not all drivers can take advantage
of huge pages, but they would incur the overhead for allocating and
freeing them anyway.

Now, drivers which can take advantage of huge pages need to set the new
flag TTM_PAGE_FLAG_TRANSHUGE to get them. Drivers not setting this flag
no longer incur any overhead for allocating or freeing huge pages.

v2:
* Also guard swapping of consecutive pages in ttm_get_pages
* Reword commit log, hopefully clearer now

Cc: sta...@vger.kernel.org
Signed-off-by: Michel Dänzer <michel.daen...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c  |  2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c     | 35 +---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  8 --
 include/drm/ttm/ttm_tt.h                 |  1 +
 4 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dfd22db13fb1..e03e9e361e2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -988,7 +988,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
 		return NULL;
 	}
 	gtt->ttm.ttm.func = &amdgpu_backend_func;
-	if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags)) {
+	if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags | TTM_PAGE_FLAG_TRANSHUGE)) {
 		kfree(gtt);
 		return NULL;
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index f0481b7b60c5..476d668e1cbd 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -760,7 +760,7 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 {
 	struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+	struct ttm_page_pool *huge = NULL;
 #endif
 	unsigned long irq_flags;
 	unsigned i;
@@ -780,7 +780,8 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 		}
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (!(flags & TTM_PAGE_FLAG_DMA32)) {
+		if ((flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE)) ==
+		    TTM_PAGE_FLAG_TRANSHUGE) {
 			for (j = 0; j < HPAGE_PMD_NR; ++j)
 				if (p++ != pages[i + j])
 					break;
@@ -805,6 +806,8 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 
 	i = 0;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (flags & TTM_PAGE_FLAG_TRANSHUGE)
+		huge = ttm_get_pool(flags, true, cstate);
 	if (huge) {
 		unsigned max_size, n2free;
 
@@ -877,7 +880,7 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 {
 	struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+	struct ttm_page_pool *huge = NULL;
 #endif
 	struct list_head plist;
 	struct page *p = NULL;
@@ -906,7 +909,8 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 
 	i = 0;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (!(gfp_flags & GFP_DMA32)) {
+	if ((flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE)) ==
+	    TTM_PAGE_FLAG_TRANSHUGE) {
 		while (npages >= HPAGE_PMD_NR) {
 			gfp_t huge_flags = gfp_flags;
 
@@ -933,9 +937,13 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 			return -ENOMEM;
 		}
 
-		/* Swap the pages if we detect consecutive order */
-		if (i > first && pages[i - 1] == p - 1)
-			swap(p, pages[i - 1]);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (flags & TTM_PAGE_FLAG_TRANSHUGE) {
+			/* Swap the pages if we detect consecutive order */
+			if (i > first && pages[i - 1] == p - 1)
+				swap(p, pages[i - 1]);
+		}
+#endif
 
 		pages[i++] = p;
 		--npages;
@@ -946,6 +954,8 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 	count = 0;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (flags & TTM_PAGE_FLAG_TRANSHUGE)
+		huge = ttm_get_pool(flags, true, cstate);

Re: [PATCH 1/2] drm/ttm: Add TTM_PAGE_FLAG_TRANSHUGE

2018-04-27 Thread Michel Dänzer

[ Dropping Roger He, his e-mail address seems to bounce ]

On 2018-04-27 04:51 AM, zhoucm1 wrote:
On 2018-04-26 23:06, Michel Dänzer wrote:
>> From: Michel Dänzer <michel.daen...@amd.com>
>>
>> When it's set, TTM tries to allocate huge pages if possible.
> Do you mean the original driver doesn't do this?
> From the code, the driver always tries huge pages if
> CONFIG_TRANSPARENT_HUGEPAGE is enabled.

Right, before this change, TTM would do this unconditionally for all
drivers. The point of this change is not to incur any huge page related
overhead for drivers which can't take advantage of huge pages anyway.
I'll try changing the commit log to make this clearer in v2.
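
Concretely, opting in is a one-line change where a driver initializes
its ttm_tt. A minimal sketch, based on the amdgpu hunk in this series
(gtt and bo are the driver's usual locals):

	/* Drivers which benefit from huge pages OR the new flag into
	 * page_flags; drivers which don't set it keep the old
	 * small-page-only behavior and skip the huge page overhead. */
	if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags | TTM_PAGE_FLAG_TRANSHUGE)) {
		kfree(gtt);
		return NULL;
	}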


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


[PATCH 2/2] drm/ttm: Use GFP_TRANSHUGE_LIGHT for allocating huge pages

2018-04-26 Thread Michel Dänzer
From: Michel Dänzer <michel.daen...@amd.com>

GFP_TRANSHUGE tries very hard to allocate huge pages, which can result
in long delays with high memory pressure. I have observed firefox
freezing for up to around a minute due to this while restic was taking
a full system backup.

Since we don't really need huge pages, use GFP_TRANSHUGE_LIGHT |
__GFP_NORETRY instead, in order to fail quickly when there are no huge
pages available.

Set __GFP_KSWAPD_RECLAIM as well, in order for huge pages to be freed
up in the background if necessary.

With these changes, I'm no longer seeing freezes during a restic backup.

Cc: sta...@vger.kernel.org
Signed-off-by: Michel Dänzer <michel.daen...@amd.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c     | 11 ---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  3 ++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index 2ce91272b111..6794f15461d9 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -914,7 +914,8 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 		while (npages >= HPAGE_PMD_NR) {
 			gfp_t huge_flags = gfp_flags;
 
-			huge_flags |= GFP_TRANSHUGE;
+			huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+				__GFP_KSWAPD_RECLAIM;
 			huge_flags &= ~__GFP_MOVABLE;
 			huge_flags &= ~__GFP_COMP;
 			p = alloc_pages(huge_flags, HPAGE_PMD_ORDER);
@@ -1033,11 +1034,15 @@ int ttm_page_alloc_init(struct ttm_mem_global *glob, unsigned max_pages)
 				  GFP_USER | GFP_DMA32, "uc dma", 0);
 
 	ttm_page_pool_init_locked(&_manager->wc_pool_huge,
-				  GFP_TRANSHUGE & ~(__GFP_MOVABLE | __GFP_COMP),
+				  (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+				   __GFP_KSWAPD_RECLAIM) &
+				  ~(__GFP_MOVABLE | __GFP_COMP),
 				  "wc huge", order);
 
 	ttm_page_pool_init_locked(&_manager->uc_pool_huge,
-				  GFP_TRANSHUGE & ~(__GFP_MOVABLE | __GFP_COMP)
+				  (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+				   __GFP_KSWAPD_RECLAIM) &
+				  ~(__GFP_MOVABLE | __GFP_COMP)
 				  , "uc huge", order);
 
 	_manager->options.max_size = max_pages;
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 291b04213ec5..5a551158c289 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -910,7 +910,8 @@ static gfp_t ttm_dma_pool_gfp_flags(struct ttm_dma_tt *ttm_dma, bool huge)
 		gfp_flags |= __GFP_ZERO;
 
 	if (huge) {
-		gfp_flags |= GFP_TRANSHUGE;
+		gfp_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+			__GFP_KSWAPD_RECLAIM;
 		gfp_flags &= ~__GFP_MOVABLE;
 		gfp_flags &= ~__GFP_COMP;
 	}
-- 
2.17.0
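
The net effect on the huge page allocation attempt, condensed into a few
lines (a sketch using the same flags as the patch above, not a literal
excerpt): GFP_TRANSHUGE_LIGHT avoids direct reclaim/compaction,
__GFP_NORETRY makes the attempt fail fast, and __GFP_KSWAPD_RECLAIM
still wakes kswapd so huge pages can be freed up in the background.

	gfp_t huge_flags = (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
			    __GFP_KSWAPD_RECLAIM) &
			   ~(__GFP_MOVABLE | __GFP_COMP);
	/* Fails quickly under memory pressure instead of stalling in
	 * reclaim; on failure TTM simply falls back to order-0 pages. */
	struct page *p = alloc_pages(huge_flags, HPAGE_PMD_ORDER);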



[PATCH 1/2] drm/ttm: Add TTM_PAGE_FLAG_TRANSHUGE

2018-04-26 Thread Michel Dänzer
From: Michel Dänzer <michel.daen...@amd.com>

When it's set, TTM tries to allocate huge pages if possible. Drivers
which can take advantage of huge pages should set it.

Drivers not setting this flag no longer incur any overhead related to
allocating or freeing huge pages.

Cc: sta...@vger.kernel.org
Signed-off-by: Michel Dänzer <michel.daen...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c  |  2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c     | 14 ++
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |  8 +---
 include/drm/ttm/ttm_tt.h                 |  1 +
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dfd22db13fb1..e03e9e361e2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -988,7 +988,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
 		return NULL;
 	}
 	gtt->ttm.ttm.func = &amdgpu_backend_func;
-	if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags)) {
+	if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags | TTM_PAGE_FLAG_TRANSHUGE)) {
 		kfree(gtt);
 		return NULL;
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index f0481b7b60c5..2ce91272b111 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -760,7 +760,7 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 {
 	struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+	struct ttm_page_pool *huge = NULL;
 #endif
 	unsigned long irq_flags;
 	unsigned i;
@@ -780,7 +780,8 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 		}
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		if (!(flags & TTM_PAGE_FLAG_DMA32)) {
+		if ((flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE)) ==
+		    TTM_PAGE_FLAG_TRANSHUGE) {
 			for (j = 0; j < HPAGE_PMD_NR; ++j)
 				if (p++ != pages[i + j])
 					break;
@@ -805,6 +806,8 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 
 	i = 0;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (flags & TTM_PAGE_FLAG_TRANSHUGE)
+		huge = ttm_get_pool(flags, true, cstate);
 	if (huge) {
 		unsigned max_size, n2free;
 
@@ -877,7 +880,7 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 {
 	struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+	struct ttm_page_pool *huge = NULL;
 #endif
 	struct list_head plist;
 	struct page *p = NULL;
@@ -906,7 +909,8 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 
 	i = 0;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (!(gfp_flags & GFP_DMA32)) {
+	if ((flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE)) ==
+	    TTM_PAGE_FLAG_TRANSHUGE) {
 		while (npages >= HPAGE_PMD_NR) {
 			gfp_t huge_flags = gfp_flags;
 
@@ -946,6 +950,8 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 	count = 0;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (flags & TTM_PAGE_FLAG_TRANSHUGE)
+		huge = ttm_get_pool(flags, true, cstate);
 	if (huge && npages >= HPAGE_PMD_NR) {
 		INIT_LIST_HEAD(&plist);
 		ttm_page_pool_get_pages(huge, &plist, flags, cstate,
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 8a25d1974385..291b04213ec5 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -949,7 +949,8 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev,
 	type = ttm_to_type(ttm->page_flags, ttm->caching_state);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (ttm->page_flags & TTM_PAGE_FLAG_DMA32)
+	if ((ttm->page_flags & (TTM_PAGE_FLAG_DMA32 | TTM_PAGE_FLAG_TRANSHUGE))
+	    != TTM_PAGE_FLAG_TRANSHUGE)
 		goto skip_huge;
 
 	pool = ttm_dma_find_pool(dev, type | IS_HUGE);
@@ -1035,7 +1036,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 {
 	struct ttm_tt *ttm = &ttm_dma->ttm;
 	struct ttm_mem_global *mem_glob = ttm->bdev->glob->mem_glob;
-	struct dma_pool *pool;
+	struct dma_pool *pool = NULL;
 	struct dma_page *d_page, *next;

Re: [PATCH] drm/radeon: Change the default to PCI on PowerPC

2018-04-25 Thread Michel Dänzer
On 2018-04-24 09:55 PM, Mathieu Malaterre wrote:
> AGP mode is unstable on PowerPC. Symptoms are generally of the form:
> 
> [ 1228.795711] radeon 0000:00:10.0: ring 0 stalled for more than 10240msec
> 
> Reviewed-by: Christian König <christian.koe...@amd.com>
> Signed-off-by: Mathieu Malaterre <ma...@debian.org>
> ---
>  drivers/gpu/drm/radeon/radeon_drv.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
> index b28288a781ef..2a7977a23b31 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -168,7 +168,12 @@ int radeon_no_wb;
>  int radeon_modeset = -1;
>  int radeon_dynclks = -1;
>  int radeon_r4xx_atom = 0;
> +#ifdef __powerpc__
> +/* Default to PCI on PowerPC (fdo #95017) */
> +int radeon_agpmode = -1;
> +#else
>  int radeon_agpmode = 0;
> +#endif
>  int radeon_vram_limit = 0;
>  int radeon_gart_size = -1; /* auto */
>  int radeon_benchmarking = 0;
> 

Pushed to amd-staging-drm-next, thanks!
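
For reference, radeon_agpmode is already exposed as the radeon module
parameter "agpmode", so users on kernels without this patch can select
the same behavior at load time. A sketch of the existing wiring in
radeon_drv.c (unchanged by this patch, which only flips the built-in
default on PowerPC):

	/* -1 selects the PCI GART instead of AGP. */
	MODULE_PARM_DESC(agpmode, "AGP Mode (-1 == PCI)");
	module_param_named(agpmode, radeon_agpmode, int, 0444);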


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 2/3] drm/scheduler: Don't call wait_event_killable for signaled process.

2018-04-24 Thread Michel Dänzer

Adding the dri-devel list, since this is driver independent code.


On 2018-04-24 05:30 PM, Andrey Grodzovsky wrote:
> Avoid calling wait_event_killable when you are possibly being called
> from get_signal routine since in that case you end up in a deadlock
> where you are alreay blocked in singla processing any trying to wait

Multiple typos here, "[...] already blocked in signal processing and [...]"?


> on a new signal.
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> ---
>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 088ff2b..09fd258 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -227,9 +227,10 @@ void drm_sched_entity_do_release(struct drm_gpu_scheduler *sched,
>  		return;
>  	/**
>  	 * The client will not queue more IBs during this fini, consume existing
> -	 * queued IBs or discard them on SIGKILL
> +	 * queued IBs or discard them when in death signal state since
> +	 * wait_event_killable can't receive signals in that state.
>  	 */
> -	if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
> +	if (current->flags & PF_SIGNALED)
>  		entity->fini_status = -ERESTARTSYS;
>  	else
>  		entity->fini_status = wait_event_killable(sched->job_scheduled,
> 


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: AMD graphics performance regression in 4.15 and later

2018-04-23 Thread Michel Dänzer
On 2018-04-20 09:40 PM, Felix Kuehling wrote:
> On 2018-04-20 10:47 AM, Michel Dänzer wrote:
>> On 2018-04-11 11:37 AM, Christian König wrote:
>>> On 2018-04-11 at 06:00, Gabriel C wrote:
>>>> 2018-04-09 11:42 GMT+02:00 Christian König
>>>> <ckoenig.leichtzumer...@gmail.com>:
>>>>> On 2018-04-07 at 00:00, Jean-Marc Valin wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Thanks for the info. FYI, I've also opened a Firefox bug for that at:
>>>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
>>>>>> Feel free to comment since you have a better understanding of what's
>>>>>> going on.
>>>>>>
>>>>>> One last question: right now I'm running 4.15.0 with the "offending"
>>>>>> patch reverted. Is that safe to run or are there possible bad
>>>>>> interactions with other changes?
>>>>> That should work without problems.
>>>>>
>>>>> But I just had another idea as well: if you want, you could still
>>>>> test the new code path which will be used in 4.17.
>>>>>
>>>> While Firefox may do some strange things, this is not only about Firefox.
>>>>
>>>> With your patches my EPYC box is unusable with 4.15++ kernels.
>>>> The whole desktop is acting weird. This one is using
>>>> a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>>>
>>>> The box is 2 * EPYC 7281 with 128 GB ECC RAM.
>>>>
>>>> Also, a 14C Xeon box with an HD7700 is broken the same way.
>>> The hardware is irrelevant for this. We need to know what software stack
>>> you use on top of it.
>>>
>>> E.g. desktop environment/Mesa and DDX version etc...
>>>
>>>> Everything breaks in X: scrolling, moving windows, flickering, etc.
>>>>
>>>>
>>>> reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and
>>>> 648bc3574716400acc06f99915815f80d9563783
>>>> from an 4.15 kernel makes things work again.
>>>>
>>>>
>>>>> Backporting all the detection logic is too invasive, but you could
>>>>> just go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcefully
>>>>> use the other code path.
>>>>>
>>>>> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
>>>>>
>>>> Well, you really can't be serious about these suggestions? Are you?
>>>>
>>>> Telling people to #if 0 random code is not a solution.
>>> That is for testing and not a permanent solution.
>>>
>>>> You broke existing working userland with your patches; at the least,
>>>> please fix that for 4.16.
>>>>
>>>> I can help testing code for 4.17/++ if you wish, but that is a
>>>> *different* story.
>>> Please test Alex's amd-staging-drm-next branch from
>>> git://people.freedesktop.org/~agd5f/linux.
>> I think we're still missing something here.
>>
>> I'm currently running 4.16.2 + the DRM subsystem changes which are going
>> into 4.17 (so I have the changes Christian is referring to) with a
>> Kaveri APU, and I'm seeing similar symptoms as described by Jean-Marc.
>> Some observations:
>>
>> Firefox, Thunderbird, or, worst of all, gnome-shell can freeze for up
>> to about a minute, during which the kernel is spending most of one
>> core's cycles inside alloc_pages (__alloc_pages_nodemask to be more
>> precise), called from ttm_alloc_new_pages.
> Philip debugged a similar problem with a KFD memory stress test about
> two weeks ago, where the kernel was seemingly stuck in an infinite loop
> trying to allocate huge pages. I'm pasting his analysis for the record:
> 
>> [...] it uses huge_flags GFP_TRANSHUGE to call alloc_pages(), this
>> seems a corner case inside __alloc_pages_slowpath(), it never exits
>> but goes to retry path every time. It can reclaim pages and
>> did_some_progress (as a result, no_progress_loops is reset to 0 every
>> loop, never reach MAX_RECLAIM_RETRIES) but cannot finish huge page
>> allocations under this specific memory pressure.  
> As a workaround to unblock our release branch testing, we removed
> transparent huge page allocation from ttm_get_pages. We're seeing this
> as far back as 4.13 on our release branch.

Thanks for sharing this. In the future, please raise issues like this on
the public mailing lists from the beginning.


> If we're really talking about the same problem, I don't think it's
> caused by recent page allocator changes, but rather exposed by recent
> TTM changes.

It sounds related, but probably not exactly the same problem. I already
had the TTM code using GFP_TRANSHUGE before I ran into the issue. Also,
__alloc_pages_slowpath eventually succeeds for me; it can just take up
to about a minute.

I'm currently testing using (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY)
instead of GFP_TRANSHUGE in TTM.
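
To make Philip's analysis concrete, the retry logic in
__alloc_pages_slowpath() behaves roughly like this (a heavily condensed
sketch of mm/page_alloc.c, not the literal code):

	/* For a costly (order-9) huge page allocation that keeps
	 * reclaiming *something*, did_some_progress stays nonzero, so
	 * no_progress_loops is reset on every iteration and
	 * MAX_RECLAIM_RETRIES is never reached -- the loop can spin for
	 * a very long time without either succeeding or failing. */
	retry:
		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
						    ac, &did_some_progress);
		if (page)
			goto got_pg;
		if (did_some_progress)
			no_progress_loops = 0;
		else
			no_progress_loops++;
		if (no_progress_loops < MAX_RECLAIM_RETRIES)
			goto retry;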


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: AMD graphics performance regression in 4.15 and later

2018-04-20 Thread Michel Dänzer
On 2018-04-11 11:37 AM, Christian König wrote:
> On 2018-04-11 at 06:00, Gabriel C wrote:
>> 2018-04-09 11:42 GMT+02:00 Christian König
>> <ckoenig.leichtzumer...@gmail.com>:
>>> On 2018-04-07 at 00:00, Jean-Marc Valin wrote:
>>>> Hi Christian,
>>>>
>>>> Thanks for the info. FYI, I've also opened a Firefox bug for that at:
>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
>>>> Feel free to comment since you have a better understanding of what's
>>>> going on.
>>>>
>>>> One last question: right now I'm running 4.15.0 with the "offending"
>>>> patch reverted. Is that safe to run or are there possible bad
>>>> interactions with other changes?
>>>
>>> That should work without problems.
>>>
>>> But I just had another idea as well: if you want, you could still
>>> test the new code path which will be used in 4.17.
>>>
>> While Firefox may do some strange things, this is not only about Firefox.
>>
>> With your patches my EPYC box is unusable with 4.15++ kernels.
>> The whole desktop is acting weird. This one is using
>> a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>
>> The box is 2 * EPYC 7281 with 128 GB ECC RAM.
>>
>> Also, a 14C Xeon box with an HD7700 is broken the same way.
> 
> The hardware is irrelevant for this. We need to know what software stack
> you use on top of it.
> 
> E.g. desktop environment/Mesa and DDX version etc...
> 
>>
>> Everything breaks in X: scrolling, moving windows, flickering, etc.
>>
>>
>> reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and
>> 648bc3574716400acc06f99915815f80d9563783
>> from a 4.15 kernel makes things work again.
>>
>>
>>> Backporting all the detection logic is too invasive, but you could
>>> just go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcefully
>>> use the other code path.
>>>
>>> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
>>>
>> Well, you really can't be serious about these suggestions? Are you?
>>
>> Telling people to #if 0 random code is not a solution.
> 
> That is for testing and not a permanent solution.
> 
>> You broke existing working userland with your patches; at the least,
>> please fix that for 4.16.
>>
>> I can help testing code for 4.17/++ if you wish, but that is a
>> *different* story.
> 
> Please test Alex's amd-staging-drm-next branch from
> git://people.freedesktop.org/~agd5f/linux.

I think we're still missing something here.

I'm currently running 4.16.2 + the DRM subsystem changes which are going
into 4.17 (so I have the changes Christian is referring to) with a
Kaveri APU, and I'm seeing similar symptoms as described by Jean-Marc.
Some observations:

Firefox, Thunderbird, or, worst of all, gnome-shell can freeze for up
to about a minute, during which the kernel is spending most of one
core's cycles inside alloc_pages (__alloc_pages_nodemask to be more
precise), called from ttm_alloc_new_pages.

At least in the case of Firefox, this happens due to Mesa internal BO
allocations for glTex(Sub)Image, so it's not obvious that Firefox is
doing something wrong.

I never noticed this before this week. Before, I was running 4.15.y +
DRM subsystem changes from 4.16. Maybe something has changed in core
code, trying harder to allocate huge pages.


Maybe TTM should only try to use any huge pages that happen to be
available, not spend any (/ "too much"?) additional effort trying to
free up huge pages?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-04-04 Thread Michel Dänzer
On 2018-04-04 11:36 AM, Lucas Stach wrote:
> On Wednesday, 2018-04-04 at 11:09 +0200, Michel Dänzer wrote:
>> On 2018-03-26 04:36 PM, Lucas Stach wrote:
>>> On Tuesday, 2018-01-30 at 11:28 +0100, Michal Hocko wrote:
>>>> On Tue 30-01-18 10:29:10, Michel Dänzer wrote:
>>>>> On 2018-01-24 12:50 PM, Michal Hocko wrote:
>>>>>> On Wed 24-01-18 12:23:10, Michel Dänzer wrote:
>>>>>>> On 2018-01-24 12:01 PM, Michal Hocko wrote:
>>>>>>>> On Wed 24-01-18 11:27:15, Michel Dänzer wrote:
>>>>>>
>>>>>> [...]
>>>>>>>>> 2. If the OOM killer kills a process which is sharing BOs
>>>>>>>>> with another
>>>>>>>>> process, this should result in the other process dropping
>>>>>>>>> its references
>>>>>>>>> to the BOs as well, at which point the memory is released.
>>>>>>>>
>>>>>>>> OK. How exactly are those BOs mapped to the userspace?
>>>>>>>
>>>>>>> I'm not sure what you're asking. Userspace mostly uses a GEM
>>>>>>> handle to
>>>>>>> refer to a BO. There can also be userspace CPU mappings of the
>>>>>>> BO's
>>>>>>> memory, but userspace doesn't need CPU mappings for all BOs and
>>>>>>> only
>>>>>>> creates them as needed.
>>>>>>
>>>>>> OK, I guess you have to bear with me some more. This whole stack
>>>>>> is a
>>>>>> complete unknown. I am mostly after finding a boundary where you
>>>>>> can
>>>>>> charge the allocated memory to the process so that the oom killer
>>>>>> can
>>>>>> consider it. Is there anything like that? Except for the proposed
>>>>>> file
>>>>>> handle hack?
>>>>>
>>>>> How about the other way around: what APIs can we use to charge /
>>>>> "uncharge" memory to a process? If we have those, we can experiment
>>>>> with
>>>>> different places to call them.
>>>>
>>>> add_mm_counter() and I would add a new counter e.g. MM_KERNEL_PAGES.
>>>
>>> So is anyone still working on this? This is hurting us bad enough that
>>> I don't want to keep this topic rotting for another year.
>>>
>>> If no one is currently working on this I would volunteer to give the
>>> simple "just account private, non-shared buffers in process RSS" a
>>> spin.
>>
>> Sounds good. FWIW, I think shared buffers can also be easily handled by
>> accounting them in each process which has a reference. But that's more
>> of a detail, shouldn't make a big difference overall either way.
> 
> Yes, both options, to either never account shared buffers or to always
> account them into every process having a reference, should be pretty
> easy. Where it gets hard is when trying to account the buffer only in
> the last process holding a reference or something like this.

FWIW, I don't think that would make sense anyway. A shared buffer is
actually used by all processes which have a reference to it, so it
should be accounted the same in all of them.
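
Going back to Michal's add_mm_counter() suggestion quoted above, a
minimal sketch of the accounting primitive (MM_KERNEL_PAGES is the
proposed counter, not an existing one):

#include <linux/mm.h>

/*
 * Charge nr_pages of BO memory to a process, or uncharge by passing a
 * negative value. The OOM killer could then include this via
 * get_mm_counter(mm, MM_KERNEL_PAGES) when computing badness.
 */
static void bo_account_pages(struct mm_struct *mm, long nr_pages)
{
        add_mm_counter(mm, MM_KERNEL_PAGES, nr_pages);
}

With per-process accounting as discussed above, each process sharing a
buffer would simply call this with the same page count.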


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-04-04 Thread Michel Dänzer
On 2018-03-26 04:36 PM, Lucas Stach wrote:
> Am Dienstag, den 30.01.2018, 11:28 +0100 schrieb Michal Hocko:
>> On Tue 30-01-18 10:29:10, Michel Dänzer wrote:
>>> On 2018-01-24 12:50 PM, Michal Hocko wrote:
>>>> On Wed 24-01-18 12:23:10, Michel Dänzer wrote:
>>>>> On 2018-01-24 12:01 PM, Michal Hocko wrote:
>>>>>> On Wed 24-01-18 11:27:15, Michel Dänzer wrote:
>>>>
>>>> [...]
>>>>>>> 2. If the OOM killer kills a process which is sharing BOs
>>>>>>> with another
>>>>>>> process, this should result in the other process dropping
>>>>>>> its references
>>>>>>> to the BOs as well, at which point the memory is released.
>>>>>>
>>>>>> OK. How exactly are those BOs mapped to the userspace?
>>>>>
>>>>> I'm not sure what you're asking. Userspace mostly uses a GEM
>>>>> handle to
>>>>> refer to a BO. There can also be userspace CPU mappings of the
>>>>> BO's
>>>>> memory, but userspace doesn't need CPU mappings for all BOs and
>>>>> only
>>>>> creates them as needed.
>>>>
>>>> OK, I guess you have to bear with me some more. This whole stack
>>>> is a
>>>> complete unknown. I am mostly after finding a boundary where you
>>>> can
>>>> charge the allocated memory to the process so that the oom killer
>>>> can
>>>> consider it. Is there anything like that? Except for the proposed
>>>> file
>>>> handle hack?
>>>
>>> How about the other way around: what APIs can we use to charge /
>>> "uncharge" memory to a process? If we have those, we can experiment
>>> with
>>> different places to call them.
>>
>> add_mm_counter() and I would add a new counter e.g. MM_KERNEL_PAGES.
> 
> So is anyone still working on this? This is hurting us bad enough that
> I don't want to keep this topic rotting for another year.
> 
> If no one is currently working on this I would volunteer to give the
> simple "just account private, non-shared buffers in process RSS" a
> spin.

Sounds good. FWIW, I think shared buffers can also be easily handled by
accounting them in each process which has a reference. But that's more
of a detail, shouldn't make a big difference overall either way.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH] radeon: hide pointless #warning when compile testing

2018-02-21 Thread Michel Dänzer
On 2018-02-16 04:26 PM, Arnd Bergmann wrote:
> In randconfig testing, we sometimes get this warning:
> 
> drivers/gpu/drm/radeon/radeon_object.c: In function 'radeon_bo_create':
> drivers/gpu/drm/radeon/radeon_object.c:242:2: error: #warning Please enable 
> CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to 
> write-combining [-Werror=cpp]
>  #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance 
> \
> 
> This is rather annoying since almost all other code produces no build-time
> output unless we have found a real bug. We already fixed this in the
> amdgpu driver in commit 31bb90f1cd08 ("drm/amdgpu: shut up #warning for
> compile testing") by adding a CONFIG_COMPILE_TEST check last year and
> agreed to do the same here, but both Michel and I then forgot about it
> until I came across the issue again now.
> 
> This is also tagged for stable kernels, as it is one of the very few
> remaining randconfig warnings in 4.14.
> 
> Cc: sta...@vger.kernel.org
> Link: https://patchwork.kernel.org/patch/9550009/
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  drivers/gpu/drm/radeon/radeon_object.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/radeon/radeon_object.c 
> b/drivers/gpu/drm/radeon/radeon_object.c
> index 093594976126..54c2b4fc5ead 100644
> --- a/drivers/gpu/drm/radeon/radeon_object.c
> +++ b/drivers/gpu/drm/radeon/radeon_object.c
> @@ -238,9 +238,10 @@ int radeon_bo_create(struct radeon_device *rdev,
>* may be slow
>* See https://bugs.freedesktop.org/show_bug.cgi?id=88758
>*/
> -
> +#ifndef CONFIG_COMPILE_TEST
>  #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance 
> \
>thanks to write-combining
> +#endif
>  
>   if (bo->flags & RADEON_GEM_GTT_WC)
>   DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for 
> "
> 

Merged on the internal amd-staging-drm-next branch, thanks!


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-30 12:56 PM, Christian König wrote:
> Am 30.01.2018 um 12:42 schrieb Michel Dänzer:
>> On 2018-01-30 12:36 PM, Nicolai Hähnle wrote:
>>> On 30.01.2018 12:34, Michel Dänzer wrote:
>>>> On 2018-01-30 12:28 PM, Christian König wrote:
>>>>> Am 30.01.2018 um 12:02 schrieb Michel Dänzer:
>>>>>> On 2018-01-30 11:40 AM, Christian König wrote:
>>>>>>> Am 30.01.2018 um 10:43 schrieb Michel Dänzer:
>>>>>>>> [SNIP]
>>>>>>>>> Would it be ok to hang onto potentially arbitrary mmget references
>>>>>>>>> essentially forever? If that's ok I think we can do your process
>>>>>>>>> based
>>>>>>>>> account (minus a few minor inaccuracies for shared stuff perhaps,
>>>>>>>>> but no
>>>>>>>>> one cares about that).
>>>>>>>> Honestly, I think you and Christian are overthinking this. Let's
>>>>>>>> try
>>>>>>>> charging the memory to every process which shares a buffer, and go
>>>>>>>> from
>>>>>>>> there.
>>>>>>> My problem is that this needs to be bullet proof.
>>>>>>>
>>>>>>> For example, imagine an application which allocates a lot of BOs,
>>>>>>> then
>>>>>>> calls fork() and lets the parent process die. The file descriptor
>>>>>>> lives
>>>>>>> on in the child process, but the memory is not accounted against the
>>>>>>> child.
>>>>>> What exactly are you referring to by "the file descriptor" here?
>>>>> The file descriptor used to identify the connection to the driver. In
>>>>> other words our drm_file structure in the kernel.
>>>>>
>>>>>> What happens to BO handles in general in this case? If both parent
>>>>>> and
>>>>>> child process keep the same handle for the same BO, one of them
>>>>>> destroying the handle will result in the other one not being able to
>>>>>> use
>>>>>> it anymore either, won't it?
>>>>> Correct.
>>>>>
>>>>> That usage is actually not useful at all, but we already had
>>>>> applications which did exactly that by accident.
>>>>>
>>>>> Not to mention that somebody could do it on purpose.
>>>> Can we just prevent child processes from using their parent's DRM file
>>>> descriptors altogether? Allowing it seems like a bad idea all around.
>>> Existing protocols pass DRM fds between processes though, don't they?
>>>
>>> Not child processes perhaps, but special-casing that seems like awful
>>> design.
>> Fair enough.
>>
>> Can we disallow passing DRM file descriptors which have any buffers
>> allocated? :)
> 
> Hehe good point, but I'm sorry I have to ruin that.
> 
> The root VM page table is allocated when the DRM file descriptor is
> created and we want to account those to whoever uses the file descriptor
> as well.

Alternatively, since the file descriptor is closed in the sending
process in this case, maybe we can "uncharge" the buffer memory from the
sending process and charge it to the receiving one during the transfer?


> Looking into the fs layer, there actually only seem to be two functions
> which are involved when a file descriptor is installed/removed from a
> process. So we just need to add some callbacks there.

That could work for file descriptor passing, but I'm not sure it really
helps for the fork case. Let's say we charge the buffer memory to the
child process as well. If either process later destroys a buffer handle,
the buffer becomes inaccessible to the other process as well, however
its memory remains charged to it (even though it may already be freed).

I think using a DRM file descriptor in both parent and child processes
is a pathological case that we really want to prevent rather than
worrying about how to make it work well. It doesn't seem to be working
well in general already anyway.


Maybe we could keep track of which process "owns" a DRM file descriptor,
and return an error from any relevant system calls for it from another
process. When passing an fd, its ownership would transfer to the
receiving process. When forking, the ownership would remain with the
parent process.
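
A rough sketch of that ownership idea, purely illustrative (the
owner_mm field and the check below do not exist in the DRM core):

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>

struct drm_file_owner {         /* hypothetical addition to struct drm_file */
        struct mm_struct *owner_mm;     /* grabbed with mmgrab() at open */
};

/*
 * Called at the start of relevant system calls: reject use of the fd
 * by any process other than its current owner.
 */
static int drm_file_check_owner(struct drm_file_owner *owner)
{
        if (owner->owner_mm != current->mm)
                return -EACCES;
        return 0;
}

On fd passing, owner_mm would be switched to the receiving process's
mm; after fork() it would still point at the parent's mm, so the
child's calls would fail.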


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-30 12:36 PM, Nicolai Hähnle wrote:
> On 30.01.2018 12:34, Michel Dänzer wrote:
>> On 2018-01-30 12:28 PM, Christian König wrote:
>>> Am 30.01.2018 um 12:02 schrieb Michel Dänzer:
>>>> On 2018-01-30 11:40 AM, Christian König wrote:
>>>>> Am 30.01.2018 um 10:43 schrieb Michel Dänzer:
>>>>>> [SNIP]
>>>>>>> Would it be ok to hang onto potentially arbitrary mmget references
>>>>>>> essentially forever? If that's ok I think we can do your process
>>>>>>> based
>>>>>>> account (minus a few minor inaccuracies for shared stuff perhaps,
>>>>>>> but no
>>>>>>> one cares about that).
>>>>>> Honestly, I think you and Christian are overthinking this. Let's try
>>>>>> charging the memory to every process which shares a buffer, and go
>>>>>> from
>>>>>> there.
>>>>> My problem is that this needs to be bullet proof.
>>>>>
>>>>> For example, imagine an application which allocates a lot of BOs, then
>>>>> calls fork() and lets the parent process die. The file descriptor lives
>>>>> on in the child process, but the memory is not accounted against the
>>>>> child.
>>>> What exactly are you referring to by "the file descriptor" here?
>>>
>>> The file descriptor used to identify the connection to the driver. In
>>> other words our drm_file structure in the kernel.
>>>
>>>> What happens to BO handles in general in this case? If both parent and
>>>> child process keep the same handle for the same BO, one of them
>>>> destroying the handle will result in the other one not being able to
>>>> use
>>>> it anymore either, won't it?
>>> Correct.
>>>
>>> That usage is actually not useful at all, but we already had
>>> applications which did exactly that by accident.
>>>
>>> Not to mention that somebody could do it on purpose.
>>
>> Can we just prevent child processes from using their parent's DRM file
>> descriptors altogether? Allowing it seems like a bad idea all around.
> 
> Existing protocols pass DRM fds between processes though, don't they?
> 
> Not child processes perhaps, but special-casing that seems like awful
> design.

Fair enough.

Can we disallow passing DRM file descriptors which have any buffers
allocated? :)


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-30 12:28 PM, Christian König wrote:
> Am 30.01.2018 um 12:02 schrieb Michel Dänzer:
>> On 2018-01-30 11:40 AM, Christian König wrote:
>>> Am 30.01.2018 um 10:43 schrieb Michel Dänzer:
>>>> [SNIP]
>>>>> Would it be ok to hang onto potentially arbitrary mmget references
>>>>> essentially forever? If that's ok I think we can do your process based
>>>>> account (minus a few minor inaccuracies for shared stuff perhaps,
>>>>> but no
>>>>> one cares about that).
>>>> Honestly, I think you and Christian are overthinking this. Let's try
>>>> charging the memory to every process which shares a buffer, and go from
>>>> there.
>>> My problem is that this needs to be bullet proof.
>>>
>>> For example, imagine an application which allocates a lot of BOs, then
>>> calls fork() and lets the parent process die. The file descriptor lives
>>> on in the child process, but the memory is not accounted against the
>>> child.
>> What exactly are you referring to by "the file descriptor" here?
> 
> The file descriptor used to identify the connection to the driver. In
> other words our drm_file structure in the kernel.
> 
>> What happens to BO handles in general in this case? If both parent and
>> child process keep the same handle for the same BO, one of them
>> destroying the handle will result in the other one not being able to use
>> it anymore either, won't it?
> Correct.
> 
> That usage is actually not useful at all, but we already had
> applications which did exactly that by accident.
> 
> Not to mention that somebody could do it on purpose.

Can we just prevent child processes from using their parent's DRM file
descriptors altogether? Allowing it seems like a bad idea all around.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-30 11:40 AM, Christian König wrote:
> Am 30.01.2018 um 10:43 schrieb Michel Dänzer:
>> [SNIP]
>>> Would it be ok to hang onto potentially arbitrary mmget references
>>> essentially forever? If that's ok I think we can do your process based
>>> account (minus a few minor inaccuracies for shared stuff perhaps, but no
>>> one cares about that).
>> Honestly, I think you and Christian are overthinking this. Let's try
>> charging the memory to every process which shares a buffer, and go from
>> there.
> 
> My problem is that this needs to be bullet proof.
> 
> For example, imagine an application which allocates a lot of BOs, then
> calls fork() and lets the parent process die. The file descriptor lives
> on in the child process, but the memory is not accounted against the child.

What exactly are you referring to by "the file descriptor" here?


What happens to BO handles in general in this case? If both parent and
child process keep the same handle for the same BO, one of them
destroying the handle will result in the other one not being able to use
it anymore either, won't it?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-30 11:42 AM, Daniel Vetter wrote:
> On Tue, Jan 30, 2018 at 10:43:10AM +0100, Michel Dänzer wrote:
>> On 2018-01-30 10:31 AM, Daniel Vetter wrote:
>>
>>> I guess a good first order approximation would be if we simply charge any
>>> newly allocated buffers to the process that created them, but that means
>>> hanging onto lots of mm_struct pointers since we want to make sure we then
>>> release those pages to the right mm again (since the process that drops
>>> the last ref might be a totally different one, depending upon how the
>>> buffers or DRM fd have been shared).
>>>
>>> Would it be ok to hang onto potentially arbitrary mmget references
>>> essentially forever? If that's ok I think we can do your process based
>>> account (minus a few minor inaccuracies for shared stuff perhaps, but no
>>> one cares about that).
>>
>> Honestly, I think you and Christian are overthinking this. Let's try
>> charging the memory to every process which shares a buffer, and go from
>> there.
> 
> I'm not concerned about wrongly accounting shared buffers (they don't
> matter), but imbalanced accounting. I.e. allocate a buffer in the client,
> share it, but then the compositor drops the last reference.

I don't think the order matters. The memory is "uncharged" in each
process when it drops its reference.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-30 10:31 AM, Daniel Vetter wrote:
> On Wed, Jan 24, 2018 at 01:11:09PM +0100, Christian König wrote:
>> Am 24.01.2018 um 12:50 schrieb Michal Hocko:
>>> On Wed 24-01-18 12:23:10, Michel Dänzer wrote:
>>>> On 2018-01-24 12:01 PM, Michal Hocko wrote:
>>>>> On Wed 24-01-18 11:27:15, Michel Dänzer wrote:
>>> [...]
>>>>>> 2. If the OOM killer kills a process which is sharing BOs with another
>>>>>> process, this should result in the other process dropping its references
>>>>>> to the BOs as well, at which point the memory is released.
>>>>> OK. How exactly are those BOs mapped to the userspace?
>>>> I'm not sure what you're asking. Userspace mostly uses a GEM handle to
>>>> refer to a BO. There can also be userspace CPU mappings of the BO's
>>>> memory, but userspace doesn't need CPU mappings for all BOs and only
>>>> creates them as needed.
>>> OK, I guess you have to bear with me some more. This whole stack is a
>>> complete unknown. I am mostly after finding a boundary where you can
>>> charge the allocated memory to the process so that the oom killer can
>>> consider it. Is there anything like that? Except for the proposed file
>>> handle hack?
>>
>> Not that I knew of.
>>
>> As I said before, we need some kind of callback for when a process starts
>> to use a file descriptor, even without anything from that file descriptor
>> mapped into its address space.
> 
> For more context: With DRI3 and wayland the compositor opens the DRM fd
> and then passes it to the client, which then starts allocating stuff. That
> makes book-keeping rather annoying.

Actually, what you're describing is only true for the buffers shared by
an X server with an X11 compositor. For the actual applications, the
buffers are created on the client side and then shared with the X server
/ Wayland compositor.

Anyway, it doesn't really matter. In all cases, the buffers are actually
used by all parties that are sharing them, so charging the memory to all
of them is perfectly appropriate.


> I guess a good first order approximation would be if we simply charge any
> newly allocated buffers to the process that created them, but that means
> hanging onto lots of mm_struct pointers since we want to make sure we then
> release those pages to the right mm again (since the process that drops
> the last ref might be a totally different one, depending upon how the
> buffers or DRM fd have been shared).
> 
> Would it be ok to hang onto potentially arbitrary mmget references
> essentially forever? If that's ok I think we can do your process based
> account (minus a few minor inaccuracies for shared stuff perhaps, but no
> one cares about that).

Honestly, I think you and Christian are overthinking this. Let's try
charging the memory to every process which shares a buffer, and go from
there.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-30 Thread Michel Dänzer
On 2018-01-24 12:50 PM, Michal Hocko wrote:
> On Wed 24-01-18 12:23:10, Michel Dänzer wrote:
>> On 2018-01-24 12:01 PM, Michal Hocko wrote:
>>> On Wed 24-01-18 11:27:15, Michel Dänzer wrote:
> [...]
>>>> 2. If the OOM killer kills a process which is sharing BOs with another
>>>> process, this should result in the other process dropping its references
>>>> to the BOs as well, at which point the memory is released.
>>>
>>> OK. How exactly are those BOs mapped to the userspace?
>>
>> I'm not sure what you're asking. Userspace mostly uses a GEM handle to
>> refer to a BO. There can also be userspace CPU mappings of the BO's
>> memory, but userspace doesn't need CPU mappings for all BOs and only
>> creates them as needed.
> 
> OK, I guess you have to bear with me some more. This whole stack is a
> complete unknown. I am mostly after finding a boundary where you can
> charge the allocated memory to the process so that the oom killer can
> consider it. Is there anything like that? Except for the proposed file
> handle hack?

How about the other way around: what APIs can we use to charge /
"uncharge" memory to a process? If we have those, we can experiment with
different places to call them.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [RFC] Per file OOM badness

2018-01-24 Thread Michel Dänzer
On 2018-01-24 12:50 PM, Michal Hocko wrote:
> On Wed 24-01-18 12:23:10, Michel Dänzer wrote:
>> On 2018-01-24 12:01 PM, Michal Hocko wrote:
>>> On Wed 24-01-18 11:27:15, Michel Dänzer wrote:
> [...]
>>>> 2. If the OOM killer kills a process which is sharing BOs with another
>>>> process, this should result in the other process dropping its references
>>>> to the BOs as well, at which point the memory is released.
>>>
>>> OK. How exactly are those BOs mapped to the userspace?
>>
>> I'm not sure what you're asking. Userspace mostly uses a GEM handle to
>> refer to a BO. There can also be userspace CPU mappings of the BO's
>> memory, but userspace doesn't need CPU mappings for all BOs and only
>> creates them as needed.
> 
> OK, I guess you have to bear with me some more. This whole stack is a
> complete unknown. I am mostly after finding a boundary where you can
> charge the allocated memory to the process so that the oom killer can
> consider it. Is there anything like that?

I think something like charging the memory of a BO to the process when a
userspace handle is created for it, and "uncharging" when a handle is
destroyed, could be a good start.
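
As a sketch of that starting point (using the hypothetical
MM_KERNEL_PAGES counter discussed earlier in the thread; real drivers
would hook this into their GEM open/close callbacks):

#include <drm/drm_gem.h>
#include <linux/mm.h>
#include <linux/sched.h>

static int example_gem_open(struct drm_gem_object *obj,
                            struct drm_file *file)
{
        /* Charge the BO's backing pages to the process creating the handle. */
        add_mm_counter(current->mm, MM_KERNEL_PAGES,
                       obj->size >> PAGE_SHIFT);
        return 0;
}

static void example_gem_close(struct drm_gem_object *obj,
                              struct drm_file *file)
{
        /* Uncharge when the handle goes away. */
        add_mm_counter(current->mm, MM_KERNEL_PAGES,
                       -((long)(obj->size >> PAGE_SHIFT)));
}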


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer

