Re: [PATCH v7 1/2] drm/buddy: Add start address support to trim function

2024-07-23 Thread Marek Olšák
The reason is that our DCC requires 768K alignment in some cases. I haven't
read this patch series, but one way to achieve that is to align to 256K,
overallocate by 512K, and then skip the first 0, 256K, or 512K of the
block, whichever gets you to 768K alignment.
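A minimal sketch of that arithmetic (illustrative only; the function and
constant names are not from the series): given a 256K-aligned block that was
overallocated by 512K, the payload offset is whatever distance reaches the
next 768K boundary, so the aligned payload always fits.

#define SZ_768K (768ULL << 10)

/* Sketch: pick the 768K-aligned offset inside a 256K-aligned block
 * that was overallocated by 512K. */
static u64 dcc_768k_offset(u64 block_start)
{
        u64 misalign = block_start % SZ_768K; /* 0, 256K or 512K */

        return misalign ? SZ_768K - misalign : 0; /* skip 0, 512K or 256K */
}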

Marek

On Tue, Jul 23, 2024, 11:04 Matthew Auld  wrote:

> On 23/07/2024 14:43, Paneer Selvam, Arunpravin wrote:
> > Hi Matthew,
> >
> > Can we push this version for now as we need to mainline the DCC changes
> > ASAP,
> > while we continue our discussion and proceed to implement the permanent
> > solution
> > for address alignment?
>
> Yeah, we can always merge now and circle back around later, if this for
> sure helps your usecase and is needed asap. I just didn't fully get the
> idea for needing this interface, but likely I am missing something.
>
> >
> > Thanks,
> > Arun.
> >
> > On 7/23/2024 6:55 PM, Arunpravin Paneer Selvam wrote:
> >> - Add a new start parameter to the trim function to specify the
> >>   exact address from which to start trimming. This helps in
> >>   situations where drivers want to perform address alignment for
> >>   specific requirements.
> >>
> >> - Add a new flag DRM_BUDDY_TRIM_DISABLE. Drivers can use this
> >>   flag to disable the allocator's trimming step. This gives
> >>   drivers control over trimming, so they can do it themselves
> >>   based on application requirements.
> >>
> >> v1:(Matthew)
> >>- check new_start alignment with min chunk_size
> >>- use range_overflows()
> >>
> >> Signed-off-by: Arunpravin Paneer Selvam <arunpravin.paneersel...@amd.com>
> >> Acked-by: Alex Deucher 
> >> Acked-by: Christian König 
> >> ---
> >>   drivers/gpu/drm/drm_buddy.c  | 25 +++--
> >>   drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  2 +-
> >>   include/drm/drm_buddy.h  |  2 ++
> >>   3 files changed, 26 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
> >> index 6a8e45e9d0ec..103c185bb1c8 100644
> >> --- a/drivers/gpu/drm/drm_buddy.c
> >> +++ b/drivers/gpu/drm/drm_buddy.c
> >> @@ -851,6 +851,7 @@ static int __alloc_contig_try_harder(struct drm_buddy *mm,
> >>* drm_buddy_block_trim - free unused pages
> >>*
> >>* @mm: DRM buddy manager
> >> + * @start: start address to begin the trimming.
> >>* @new_size: original size requested
> >>* @blocks: Input and output list of allocated blocks.
> >>* MUST contain single block as input to be trimmed.
> >> @@ -866,11 +867,13 @@ static int __alloc_contig_try_harder(struct drm_buddy *mm,
> >>* 0 on success, error code on failure.
> >>*/
> >>   int drm_buddy_block_trim(struct drm_buddy *mm,
> >> + u64 *start,
> >>u64 new_size,
> >>struct list_head *blocks)
> >>   {
> >>   struct drm_buddy_block *parent;
> >>   struct drm_buddy_block *block;
> >> +u64 block_start, block_end;
> >>   LIST_HEAD(dfs);
> >>   u64 new_start;
> >>   int err;
> >> @@ -882,6 +885,9 @@ int drm_buddy_block_trim(struct drm_buddy *mm,
> >>struct drm_buddy_block,
> >>link);
> >> +block_start = drm_buddy_block_offset(block);
> >> +block_end = block_start + drm_buddy_block_size(mm, block);
> >> +
> >>   if (WARN_ON(!drm_buddy_block_is_allocated(block)))
> >>   return -EINVAL;
> >> @@ -894,6 +900,20 @@ int drm_buddy_block_trim(struct drm_buddy *mm,
> >>   if (new_size == drm_buddy_block_size(mm, block))
> >>   return 0;
> >> +new_start = block_start;
> >> +if (start) {
> >> +new_start = *start;
> >> +
> >> +if (new_start < block_start)
> >> +return -EINVAL;
> >> +
> >> +if (!IS_ALIGNED(new_start, mm->chunk_size))
> >> +return -EINVAL;
> >> +
> >> +if (range_overflows(new_start, new_size, block_end))
> >> +return -EINVAL;
> >> +}
> >> +
> >>   list_del(&block->link);
> >>   mark_free(mm, block);
> >>   mm->avail += drm_buddy_block_size(mm, block);
> >> @@ -904,7 +924,6 @@ int drm_buddy_block_trim(struct drm_buddy *mm,
> >>   parent = block->parent;
> >>   block->parent = NULL;
> >> -new_start = drm_buddy_block_offset(block);
> >>   list_add(&block->tmp_link, &dfs);
> >>   err =  __alloc_range(mm, &dfs, new_start, new_size, blocks, NULL);
> >>   if (err) {
> >> @@ -1066,7 +1085,8 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
> >>   } while (1);
> >>   /* Trim the allocated block to the required size */
> >> -if (original_size != size) {
> >> +if (!(flags & DRM_BUDDY_TRIM_DISABLE) &&
> >> +original_size != size) {
> >>   struct list_head *trim_list;
> >>   LIST_HEAD(temp);
> >>   u64 trim_size;
> >> @@ -1083,6 +1103,7 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
> >>   }
> >>   drm_buddy_block_trim(mm,
> >> + NULL,
> >>
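The hunk is cut short above. For context, a hedged sketch (not code from the
series) of the driver-side usage the two new pieces enable together: allocate
with DRM_BUDDY_TRIM_DISABLE, then trim from a chosen, chunk-aligned offset
inside the block, matching the validation added in the patch.

/* Sketch only: trim an allocated block at a driver-chosen offset.
 * trim_start must be >= the block offset, aligned to mm->chunk_size,
 * and trim_start + new_size must not pass the block end. */
static int driver_trim_at(struct drm_buddy *mm, struct list_head *blocks,
                          u64 trim_start, u64 new_size)
{
        return drm_buddy_block_trim(mm, &trim_start, new_size, blocks);
}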

Re: [PATCH v5 2/2] drm/amdgpu: Add address alignment support to DCC buffers

2024-07-16 Thread Marek Olšák
AMDGPU_GEM_CREATE_GFX12_DCC is set on 90% of all memory allocations, and
almost all of them are not displayable. Shouldn't we use a different way to
indicate that we need a non-power-of-two alignment, such as looking at the
alignment field directly?

Marek

On Tue, Jul 16, 2024, 11:45 Arunpravin Paneer Selvam <arunpravin.paneersel...@amd.com> wrote:

> Add address alignment support to the DCC VRAM buffers.
>
> v2:
>   - adjust the size based on the max_texture_channel_caches value
> only for GFX12 DCC buffers.
>   - use the AMDGPU_GEM_CREATE_GFX12_DCC flag to apply the change
> only to DCC buffers.
>   - round up the adjusted size of non-power-of-two DCC buffers to
> the nearest power of two, as the buddy allocator does not support
> non-power-of-two alignments. This applies only to contiguous
> DCC buffers.
>
> v3:(Alex)
>   - rewrite the max texture channel caches comparison code in an
> algorithmic way to determine the alignment size.
>
> v4:(Alex)
>   - Move the logic from amdgpu_vram_mgr_dcc_alignment() to gmc_v12_0.c
> and add a new gmc func callback for dcc alignment. If the callback
> is non-NULL, call it to get the alignment, otherwise, use the default.
>
> v5:(Alex)
>   - Set the alignment to a default value if the callback doesn't exist.
>   - Add the callback to amdgpu_gmc_funcs.
>
> v6:
>   - Fix checkpatch error reported by Intel CI.
>
> Signed-off-by: Arunpravin Paneer Selvam 
> Acked-by: Alex Deucher 
> Acked-by: Christian König 
> Reviewed-by: Frank Min 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  6 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 36 ++--
>  drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c   | 15 
>  3 files changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> index febca3130497..654d0548a3f8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -156,6 +156,8 @@ struct amdgpu_gmc_funcs {
>   uint64_t addr, uint64_t *flags);
> /* get the amount of memory used by the vbios for pre-OS console */
> unsigned int (*get_vbios_fb_size)(struct amdgpu_device *adev);
> +   /* get the DCC buffer alignment */
> +   u64 (*get_dcc_alignment)(struct amdgpu_device *adev);
>
> enum amdgpu_memory_partition (*query_mem_partition_mode)(
> struct amdgpu_device *adev);
> @@ -363,6 +365,10 @@ struct amdgpu_gmc {
> (adev)->gmc.gmc_funcs->override_vm_pte_flags\
> ((adev), (vm), (addr), (pte_flags))
>  #define amdgpu_gmc_get_vbios_fb_size(adev)
> (adev)->gmc.gmc_funcs->get_vbios_fb_size((adev))
> +#define amdgpu_gmc_get_dcc_alignment(_adev) ({ \
> +   typeof(_adev) (adev) = (_adev); \
> +   ((adev)->gmc.gmc_funcs->get_dcc_alignment((adev))); \
> +})
>
>  /**
>   * amdgpu_gmc_vram_full_visible - Check if full VRAM is visible through
> the BAR
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index f91cc149d06c..aa9dca12371c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -512,6 +512,16 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
> vres->flags |= DRM_BUDDY_RANGE_ALLOCATION;
>
> remaining_size = (u64)vres->base.size;
> +   if (bo->flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS &&
> +   bo->flags & AMDGPU_GEM_CREATE_GFX12_DCC) {
> +   u64 adjust_size;
> +
> +   if (adev->gmc.gmc_funcs->get_dcc_alignment) {
> +   adjust_size = amdgpu_gmc_get_dcc_alignment(adev);
> +   remaining_size = roundup_pow_of_two(remaining_size + adjust_size);
> +   vres->flags |= DRM_BUDDY_TRIM_DISABLE;
> +   }
> +   }
>
> mutex_lock(&mgr->lock);
> while (remaining_size) {
> @@ -521,8 +531,12 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
> min_block_size = mgr->default_page_size;
>
> size = remaining_size;
> -   if ((size >= (u64)pages_per_block << PAGE_SHIFT) &&
> -   !(size & (((u64)pages_per_block << PAGE_SHIFT) - 1)))
> +
> +   if (bo->flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS &&
> +   bo->flags & AMDGPU_GEM_CREATE_GFX12_DCC)
> +   min_block_size = size;
> +   else if ((size >= (u64)pages_per_block << PAGE_SHIFT) &&
> +!(size & (((u64)pages_per_block << PAGE_SHIFT) - 1)))
> min_block_size = (u64)pages_per_block << PAGE_SHIFT;
>
> BUG_ON(min_block_size < mm->chunk_size);
> @@ -553,6 +567,24 @@ static int amdgpu_vram_mgr_new(struct
> t
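The gmc_v12_0.c hunk is cut off above. As a rough sketch of the callback
shape described in the v4/v5 notes — the real implementation derives the
alignment from max_texture_channel_caches; the formula below is a placeholder
assumption, not the actual patch:

/* Illustrative only: stand-in formula, not the real gmc_v12_0.c code. */
static u64 gmc_v12_0_get_dcc_alignment(struct amdgpu_device *adev)
{
        unsigned int max_tcc = adev->gfx.config.max_texture_channel_caches;

        /* Placeholder: scale the alignment with the channel-cache count. */
        return (u64)max_tcc * SZ_1M;
}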

Re: [PATCH] firmware: sysfb: Fix reference count of sysfb parent device

2024-06-28 Thread Marek Olšák
Hi Thomas,

FYI, this doesn't fix the issue of lightdm not being able to start for me.

Marek

On Tue, Jun 25, 2024 at 4:18 AM Thomas Zimmermann  wrote:
>
> Retrieving the system framebuffer's parent device in sysfb_init()
> increments the parent device's reference count. Hence release the
> reference before leaving the init function.
>
> Adding the sysfb platform device acquires an additional reference
> for the parent. This keeps the parent device around while the system
> framebuffer is in use.
>
> Signed-off-by: Thomas Zimmermann 
> Fixes: 9eac534db001 ("firmware/sysfb: Set firmware-framebuffer parent device")
> Cc: Thomas Zimmermann 
> Cc: Javier Martinez Canillas 
> Cc: Helge Deller 
> Cc: Jani Nikula 
> Cc: Dan Carpenter 
> Cc: Arnd Bergmann 
> Cc: Sui Jingfeng 
> Cc:  # v6.9+
> ---
>  drivers/firmware/sysfb.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/firmware/sysfb.c b/drivers/firmware/sysfb.c
> index 880ffcb50088..dd274563deeb 100644
> --- a/drivers/firmware/sysfb.c
> +++ b/drivers/firmware/sysfb.c
> @@ -101,8 +101,10 @@ static __init struct device *sysfb_parent_dev(const struct screen_info *si)
> if (IS_ERR(pdev)) {
> return ERR_CAST(pdev);
> } else if (pdev) {
> -   if (!sysfb_pci_dev_is_enabled(pdev))
> +   if (!sysfb_pci_dev_is_enabled(pdev)) {
> +   pci_dev_put(pdev);
> return ERR_PTR(-ENODEV);
> +   }
> return &pdev->dev;
> }
>
> @@ -137,7 +139,7 @@ static __init int sysfb_init(void)
> if (compatible) {
> pd = sysfb_create_simplefb(si, &mode, parent);
> if (!IS_ERR(pd))
> -   goto unlock_mutex;
> +   goto put_device;
> }
>
> /* if the FB is incompatible, create a legacy framebuffer device */
> @@ -155,7 +157,7 @@ static __init int sysfb_init(void)
> pd = platform_device_alloc(name, 0);
> if (!pd) {
> ret = -ENOMEM;
> -   goto unlock_mutex;
> +   goto put_device;
> }
>
> pd->dev.parent = parent;
> @@ -170,9 +172,12 @@ static __init int sysfb_init(void)
> if (ret)
> goto err;
>
> -   goto unlock_mutex;
> +
> +   goto put_device;
>  err:
> platform_device_put(pd);
> +put_device:
> +   put_device(parent);
>  unlock_mutex:
> mutex_unlock(&disable_lock);
> return ret;
> --
> 2.45.2
>


Re: "firmware/sysfb: Set firmware-framebuffer parent device" breaks lightdm on Ubuntu 22.04 using amdgpu

2024-06-19 Thread Marek Olšák
Attached is the revert commit that works for me. Tested with Radeon
6800 and Radeon 7900XTX.

Marek

On Wed, Jun 19, 2024 at 9:50 AM Thomas Zimmermann  wrote:
>
> Hi
>
> On 13.06.24 at 07:59, Marek Olšák wrote:
> > Hi Thomas,
> >
> > Commit 9eac534db0013aff9b9124985dab114600df9081 as per the title
> > breaks (crashes?) lightdm (login screen) such that all I get is the
> > terminal. It's also reproducible with tag v6.9 where the commit is
> > present.
>
> I was able to reproduce the problem with Ubuntu 22.04 and later under
> qemu plus qxl, sort of. I log in via gdm3 and then the guest machine
> switches off entirely.
>
> >
> > Reverting the commit fixes lightdm. A workaround is to bypass lightdm
> > by triggering auto-login. This is a bug report.
>
> The problem is that reverting the commit doesn't fix the issue for me.
> I'll try to do my own bisecting.
>
> Best regards
> Thomas
>
> >
> > (For AMD folks: It's also reproducible with amd-staging-drm-next.)
> >
> > Marek
>
> --
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Frankenstrasse 146, 90461 Nuernberg, Germany
> GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
> HRB 36809 (AG Nuernberg)
>
From 2431978beb04ea4f3befe8d6e0aa89e7207f1b5c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Marek=20Ol=C5=A1=C3=A1k?= 
Date: Thu, 13 Jun 2024 01:32:35 -0400
Subject: [PATCH] Revert "firmware/sysfb: Set firmware-framebuffer parent
 device"
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This reverts commit 9eac534db0013aff9b9124985dab114600df9081.

Signed-off-by: Marek Olšák 
---
 drivers/firmware/sysfb.c  | 51 +--
 drivers/firmware/sysfb_simplefb.c |  5 +--
 include/linux/sysfb.h |  6 ++--
 3 files changed, 4 insertions(+), 58 deletions(-)

diff --git a/drivers/firmware/sysfb.c b/drivers/firmware/sysfb.c
index 880ffcb50088..defd7a36cb08 100644
--- a/drivers/firmware/sysfb.c
+++ b/drivers/firmware/sysfb.c
@@ -29,7 +29,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -70,49 +69,9 @@ void sysfb_disable(void)
 }
 EXPORT_SYMBOL_GPL(sysfb_disable);
 
-#if defined(CONFIG_PCI)
-static __init bool sysfb_pci_dev_is_enabled(struct pci_dev *pdev)
-{
-	/*
-	 * TODO: Try to integrate this code into the PCI subsystem
-	 */
-	int ret;
-	u16 command;
-
-	ret = pci_read_config_word(pdev, PCI_COMMAND, &command);
-	if (ret != PCIBIOS_SUCCESSFUL)
-		return false;
-	if (!(command & PCI_COMMAND_MEMORY))
-		return false;
-	return true;
-}
-#else
-static __init bool sysfb_pci_dev_is_enabled(struct pci_dev *pdev)
-{
-	return false;
-}
-#endif
-
-static __init struct device *sysfb_parent_dev(const struct screen_info *si)
-{
-	struct pci_dev *pdev;
-
-	pdev = screen_info_pci_dev(si);
-	if (IS_ERR(pdev)) {
-		return ERR_CAST(pdev);
-	} else if (pdev) {
-		if (!sysfb_pci_dev_is_enabled(pdev))
-			return ERR_PTR(-ENODEV);
-		return &pdev->dev;
-	}
-
-	return NULL;
-}
-
 static __init int sysfb_init(void)
 {
 	struct screen_info *si = &screen_info;
-	struct device *parent;
 	struct simplefb_platform_data mode;
 	const char *name;
 	bool compatible;
@@ -126,16 +85,10 @@ static __init int sysfb_init(void)
 
 	sysfb_apply_efi_quirks();
 
-	parent = sysfb_parent_dev(si);
-	if (IS_ERR(parent)) {
-		ret = PTR_ERR(parent);
-		goto unlock_mutex;
-	}
-
 	/* try to create a simple-framebuffer device */
 	compatible = sysfb_parse_mode(si, &mode);
 	if (compatible) {
-		pd = sysfb_create_simplefb(si, &mode, parent);
+		pd = sysfb_create_simplefb(si, &mode);
 		if (!IS_ERR(pd))
 			goto unlock_mutex;
 	}
@@ -158,8 +111,6 @@ static __init int sysfb_init(void)
 		goto unlock_mutex;
 	}
 
-	pd->dev.parent = parent;
-
 	sysfb_set_efifb_fwnode(pd);
 
 	ret = platform_device_add_data(pd, si, sizeof(*si));
diff --git a/drivers/firmware/sysfb_simplefb.c b/drivers/firmware/sysfb_simplefb.c
index 75a186bf8f8e..74363ed7501f 100644
--- a/drivers/firmware/sysfb_simplefb.c
+++ b/drivers/firmware/sysfb_simplefb.c
@@ -91,8 +91,7 @@ __init bool sysfb_parse_mode(const struct screen_info *si,
 }
 
 __init struct platform_device *sysfb_create_simplefb(const struct screen_info *si,
-		 const struct simplefb_platform_data *mode,
-		 struct device *parent)
+		 const struct simplefb_platform_data *mode)
 {
 	struct platform_device *pd;
 	struct resource res;
@@ -144,8 +143,6 @@ __init struct platform_device *sysfb_create_simplefb(const struct screen_info *s
 	if (!pd)
 		return ERR_PTR(-ENOMEM);
 
-	pd->dev.parent = parent;
-
 	sysfb_set_efifb_fwnode(pd);
 
 	ret = platform_device_add_resources(pd, &res, 1);
diff --git a/include/linux/sysfb.h b/include/linux/sysfb.h
index c9cb657dad08..19cb8

Re: "firmware/sysfb: Set firmware-framebuffer parent device" breaks lightdm on Ubuntu 22.04 using amdgpu

2024-06-13 Thread Marek Olšák
On Thu, Jun 13, 2024 at 3:23 AM Thomas Zimmermann  wrote:
>
> Hi
>
> Am 13.06.24 um 08:00 schrieb Marek Olšák:
> > +amd-gfx
> >
> > On Thu, Jun 13, 2024 at 1:59 AM Marek Olšák  wrote:
> >> Hi Thomas,
> >>
> >> Commit 9eac534db0013aff9b9124985dab114600df9081 as per the title
> >> breaks (crashes?) lightdm (login screen) such that all I get is the
> >> terminal. It's also reproducible with tag v6.9 where the commit is
> >> present.
> >>
> >> Reverting the commit fixes lightdm. A workaround is to bypass lightdm
> >> by triggering auto-login. This is a bug report.
>
> I see. Do you know why it crashes? Or have any logs.

How to debug this? I only know it's run through systemctl somehow.

Marek


Re: "firmware/sysfb: Set firmware-framebuffer parent device" breaks lightdm on Ubuntu 22.04 using amdgpu

2024-06-12 Thread Marek Olšák
+amd-gfx

On Thu, Jun 13, 2024 at 1:59 AM Marek Olšák  wrote:
>
> Hi Thomas,
>
> Commit 9eac534db0013aff9b9124985dab114600df9081 as per the title
> breaks (crashes?) lightdm (login screen) such that all I get is the
> terminal. It's also reproducible with tag v6.9 where the commit is
> present.
>
> Reverting the commit fixes lightdm. A workaround is to bypass lightdm
> by triggering auto-login. This is a bug report.
>
> (For AMD folks: It's also reproducible with amd-staging-drm-next.)
>
> Marek


"firmware/sysfb: Set firmware-framebuffer parent device" breaks lightdm on Ubuntu 22.04 using amdgpu

2024-06-12 Thread Marek Olšák
Hi Thomas,

Commit 9eac534db0013aff9b9124985dab114600df9081 as per the title
breaks (crashes?) lightdm (login screen) such that all I get is the
terminal. It's also reproducible with tag v6.9 where the commit is
present.

Reverting the commit fixes lightdm. A workaround is to bypass lightdm
by triggering auto-login. This is a bug report.

(For AMD folks: It's also reproducible with amd-staging-drm-next.)

Marek


[ANNOUNCE] libdrm 2.4.121

2024-06-01 Thread Marek Olšák
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


Adrián Larumbe (1):
  meson: make build system happy by replacing deprecated feature

David Heidelberg (1):
  include poll.h instead of sys/poll.h

David Rosca (1):
  amdgpu: Make amdgpu_device_deinitialize thread-safe

Dylan Baker (2):
  Revert "xf86drm: ignore symlinks in process_device()"
  xf86drm: Don't consider node names longer than the maximum allowed

Flora Cui (3):
  tests/amdgpu: fix compile warning with the guard enum value
  tests/amdgpu: fix compile error with gcc7.5
  tests/amdgpu: fix compile error with gcc14

Francesco Valla (1):
  tests/util: add tidss driver

Joaquim Monteiro (2):
  meson: Replace usages of deprecated ExternalProgram.path()
  meson: Fix broken str.format usage

Jonathan Gray (6):
  amdgpu: add marketing names from Adrenalin 23.11.1
  amdgpu: add marketing names from PRO Edition for W7700
  amdgpu: add marketing names from Windows Steam Deck OLED APU driver
  amdgpu: add marketing names from amd-6.0
  amdgpu: add marketing name for Radeon RX 6550M
  amdgpu: add marketing names from amd-6.0.1

José Expósito (1):
  amdgpu: Make amdgpu_cs_signal_semaphore() thread-safe

Marek Olšák (2):
  amdgpu: sync amdgpu_drm.h
  Bump version to 2.4.121

Matt Turner (2):
  symbols-check: Add _GLOBAL_OFFSET_TABLE_
  symbols-check: Add _fbss, _fdata, _ftext

Pierre-Eric Pelloux-Prayer (5):
  amdgpu: add amdgpu_va_manager
  amdgpu: expose amdgpu_va_manager publicly
  amdgpu: add amdgpu_va_range_alloc2
  amdgpu: add amdgpu_device_initialize2
  amdgpu: fix deinit logic

Simon Ser (3):
  ci: build with meson --fatal-meson-warnings
  ci: use "meson setup" sub-command
  xf86drm: document drmDevicesEqual()

Tobias Jakobi (1):
  xf86drm: ignore symlinks in process_device()

git tag: libdrm-2.4.121

https://dri.freedesktop.org/libdrm/libdrm-2.4.121.tar.xz
SHA256: 909084a505d7638887f590b70791b3bbd9069c710c948f5d1f1ce6d080cdfcab
 libdrm-2.4.121.tar.xz
SHA512: 
cc8816d61884caa0e404348d1caeb0b2952fb50e1dc401716adfe08121096e2a67826db0bda0d8b163d67c5ee048870177670d5eac28a5abe5792d09ba77ab2e
 libdrm-2.4.121.tar.xz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.121.tar.xz.sig

-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE86UtuOzp654zvEjGkXo+6XoPzRoFAmZbXJwACgkQkXo+6XoP
zRqHVwv+KRbbqQP2ahamI8S/7dedztW5SWX7BF1UqDzRik2YZfMBffhCzfGMW21U
ABSge4zDYyOtbL3DTod6BADaFsdpVnGDlhbAT9fpZi7RDtfQGfPl20+ZwYPCIAhP
9b91Yr8nJLFP6unUvgPX0IYQJdv7TD6Y3oqXrK/IsYOTXSiIEzqA0YJOc70AQU18
sgqArrctak1g67aI1XeFpdRjca2ZUqZwShigG+jQGeR0dsHC/A1HV7ilzF6MW2Mw
A9YO/i3SFrCoIzZC0zaaAO8MjGMPFgU+MIp/pkHBXNpkKa2rN7Yb/fEvuGhcDPy8
Ir/RFs53Gja3O4P4oBWYcHSifF/y+FOZddGCwrsRGkFgUEW7yBc++9fR242ChAOA
UhTJmUnxoxjpQ8JF5sfChu5fW3+rAzpeQOctDUskwHdSMyZj8BoThUWxq96p1S+w
CCRBDPNcw0rPwsnreijVFD/vJh2Kycq6Q9w8/uvFBkSM0m7hPgsH+RxmOJzOLga6
3TRyxV6a
=9Via
-END PGP SIGNATURE-

If gmail messed up the message, the original signed message for
signature verification is attached.

Marek

Re: [RFC PATCH 00/18] TTM interface for managing VRAM oversubscription

2024-04-25 Thread Marek Olšák
The most extreme ping-ponging is mitigated by throttling buffer moves
in the kernel, but it only works without VM_ALWAYS_VALID and you can
set BO priorities in the BO list. A better approach that works with
VM_ALWAYS_VALID would be nice.

Marek

On Wed, Apr 24, 2024 at 1:12 PM Friedrich Vock  wrote:
>
> Hi everyone,
>
> recently I've been looking into remedies for apps (in particular, newer
> games) that experience significant performance loss when they start to
> hit VRAM limits, especially on older or lower-end cards that struggle
> to fit both desktop apps and all the game data into VRAM at once.
>
> The root of the problem lies in the fact that from userspace's POV,
> buffer eviction is very opaque: Userspace applications/drivers cannot
> tell how oversubscribed VRAM is, nor do they have fine-grained control
> over which buffers get evicted.  At the same time, with GPU APIs becoming
> increasingly lower-level and GPU-driven, only the application itself
> can know which buffers are used within a particular submission, and
> how important each buffer is. For this, GPU APIs include interfaces
> to query oversubscription and specify memory priorities: In Vulkan,
> oversubscription can be queried through the VK_EXT_memory_budget
> extension. Different buffers can also be assigned priorities via the
> VK_EXT_pageable_device_local_memory extension. Modern games, especially
> D3D12 games via vkd3d-proton, rely on oversubscription being reported and
> priorities being respected in order to perform their memory management.
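As an illustration of the userspace side (a sketch, not part of this series;
assumes a VkPhysicalDevice named physical_device and that VK_EXT_memory_budget
is enabled), the oversubscription query looks roughly like this:

VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT,
};
VkPhysicalDeviceMemoryProperties2 props = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2,
        .pNext = &budget,
};

/* For each heap i, comparing budget.heapUsage[i] against
 * budget.heapBudget[i] tells the app how close that heap is to being
 * oversubscribed. */
vkGetPhysicalDeviceMemoryProperties2(physical_device, &props);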
>
> However, relaying this information to the kernel via the current KMD uAPIs
> is not possible. On AMDGPU for example, all work submissions include a
> "bo list" that contains any buffer object that is accessed during the
> course of the submission. If VRAM is oversubscribed and a buffer in the
> list was evicted to system memory, that buffer is moved back to VRAM
> (potentially evicting other unused buffers).
>
> Since the usermode driver doesn't know what buffers are used by the
> application, its only choice is to submit a bo list that contains every
> buffer the application has allocated. In case of VRAM oversubscription,
> it is highly likely that some of the application's buffers were evicted,
> which almost guarantees that some buffers will get moved around. Since
> the bo list is only known at submit time, this also means the buffers
> will get moved right before submitting application work, which is the
> worst possible time to move buffers from a latency perspective. Another
> consequence of the large bo list is that nearly all memory from other
> applications will be evicted, too. When different applications (e.g. game
> and compositor) submit work one after the other, this causes a ping-pong
> effect where each app's submission evicts the other app's memory,
> resulting in a large amount of unnecessary moves.
>
> This overly aggressive eviction behavior led to RADV adopting a change
> that effectively allows all VRAM applications to reside in system memory
> [1].  This worked around the ping-ponging/excessive buffer moving problem,
> but also meant that any memory evicted to system memory would forever
> stay there, regardless of how VRAM is used.
>
> My proposal aims at providing a middle ground between these extremes.
> The goals I want to meet are:
> - Userspace is accurately informed about VRAM oversubscription/how much
>   VRAM has been evicted
> - Buffer eviction respects priorities set by userspace
> - Wasteful ping-ponging is avoided to the extent possible
>
> I have been testing out some prototypes, and came up with this rough
> sketch of an API:
>
> - For each ttm_resource_manager, the amount of evicted memory is tracked
>   (similarly to how "usage" tracks the memory usage). When memory is
>   evicted via ttm_bo_evict, the size of the evicted memory is added, when
>   memory is un-evicted (see below), its size is subtracted. The amount of
>   evicted memory for e.g. VRAM can be queried by userspace via an ioctl.
>
> - Each ttm_resource_manager maintains a list of evicted buffer objects.
>
> - ttm_mem_unevict walks the list of evicted bos for a given
>   ttm_resource_manager and tries moving evicted resources back. When a
>   buffer is freed, this function is called to immediately restore some
>   evicted memory.
>
> - Each ttm_buffer_object independently tracks the mem_type it wants
>   to reside in.
>
> - ttm_bo_try_unevict is added as a helper function which attempts to
>   move the buffer to its preferred mem_type. If no space is available
>   there, it fails with -ENOSPC/-ENOMEM.
>
> - Similar to how ttm_bo_evict works, each driver can implement
>   uneviction_valuable/unevict_flags callbacks to control buffer
>   un-eviction.
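A minimal sketch of the un-eviction walk under these assumptions (the
manager's `evicted` list, the `evicted_link` member, and `ttm_bo_try_unevict()`
are the names proposed above, not existing TTM API; locking is omitted):

static void ttm_mem_unevict(struct ttm_resource_manager *man)
{
        struct ttm_buffer_object *bo, *tmp;

        /* Move evicted BOs back toward their preferred mem_type until
         * the preferred placement runs out of space (-ENOSPC/-ENOMEM). */
        list_for_each_entry_safe(bo, tmp, &man->evicted, evicted_link) {
                if (ttm_bo_try_unevict(bo))
                        break;
        }
}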
>
> This is what patches 1-10 accomplish (together with an amdgpu
> implementation utilizing the new API).
>
> Userspace priorities could then be implemented as follows:
>
> - TTM already manages priorities for each buffer object. These pri

Re: Re: Re: Re: Re: [PATCH libdrm 1/2] amdgpu: fix parameter of amdgpu_cs_ctx_create2

2024-01-09 Thread Marek Olšák
int p = -1;
unsigned u = p;
int p2 = u;

p2 is -1.
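Spelled out against the value in question (a self-contained sketch;
AMDGPU_CTX_PRIORITY_LOW is -512):

#include <assert.h>
#include <stdint.h>

int main(void)
{
        int32_t prio = -512;            /* AMDGPU_CTX_PRIORITY_LOW */
        uint32_t wire = (uint32_t)prio; /* passed through a uint32_t arg */
        int32_t back = (int32_t)wire;   /* kernel reads it back as signed */

        assert(back == -512);           /* the value survives the round-trip */
        return 0;
}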

Marek

On Tue, Jan 9, 2024, 03:26 Christian König  wrote:

> On 09.01.24 at 09:09, 李真能 wrote:
>
> Thanks!
>
> What about the second patch?
>
> The second patch: amdgpu: change priority value to be consistent with
> the kernel.
>
> As I want to pass AMDGPU_CTX_PRIORITY_LOW to the kernel's drm scheduler,
> if these two patches are not applied,
>
> it cannot pass the LOW priority to the drm scheduler.
>
> Do you have any other suggestion?
>
>
> Well what exactly is the problem? Just use AMD_PRIORITY=-512.
>
> As far as I can see that is how it is supposed to be used.
>
> Regards,
> Christian.
>
> *Subject:* Re: Re: Re: [PATCH libdrm 1/2] amdgpu: fix parameter of
> amdgpu_cs_ctx_create2
> *Date:* 2024-01-09 15:15
> *From:* Christian König
> *To:* 李真能; Marek Olsak; Pierre-Eric Pelloux-Prayer; dri-devel; amd-gfx
>
On 09.01.24 at 02:50, 李真能 wrote:
>
> When the priority value is passed to the kernel, the kernel compares it
> with the following values:
>
> #define AMDGPU_CTX_PRIORITY_VERY_LOW-1023
> #define AMDGPU_CTX_PRIORITY_LOW -512
> #define AMDGPU_CTX_PRIORITY_NORMAL  0
> #define AMDGPU_CTX_PRIORITY_HIGH512
> #define AMDGPU_CTX_PRIORITY_VERY_HIGH   1023
>
> If priority is uint32_t, we can't set the LOW and VERY_LOW values for the
> kernel context priority.
>
> Well that's nonsense.
>
> How the kernel handles the values and how userspace handles them are two
> separate things. You just need to make sure that it's always 32 bits.
>
> In other words, whether you have a signed or unsigned data type in
> userspace is irrelevant to the kernel.
>
> You can refer to the kernel function amdgpu_ctx_priority_permit: if
> priority is greater
>
> than 0, and this process has neither the CAP_SYS_NICE capability nor
> DRM_MASTER permission,
>
> this process will be exited.
>
> Correct, that's intentional.
>
> Regards,
> Christian.
>
> *Subject:* Re: [PATCH libdrm 1/2] amdgpu: fix parameter of
> amdgpu_cs_ctx_create2
> *Date:* 2024-01-09 00:28
> *From:* Christian König
> *To:* 李真能; Marek Olsak; Pierre-Eric Pelloux-Prayer; dri-devel; amd-gfx
>
On 08.01.24 at 10:40, Zhenneng Li wrote:
> > In order to pass the correct priority parameter to the kernel,
> > we must change priority type from uint32_t to int32_t.
>
> Hui what? Why should it matter if the parameter is signed or not?
>
> That doesn't seem to make sense.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: Zhenneng Li
> > ---
> > amdgpu/amdgpu.h | 2 +-
> > amdgpu/amdgpu_cs.c | 2 +-
> > 2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/amdgpu/amdgpu.h b/amdgpu/amdgpu.h
> > index 9bdbf366..f46753f3 100644
> > --- a/amdgpu/amdgpu.h
> > +++ b/amdgpu/amdgpu.h
> > @@ -896,7 +896,7 @@ int amdgpu_bo_list_update(amdgpu_bo_list_handle
> handle,
> > *
> > */
> > int amdgpu_cs_ctx_create2(amdgpu_device_handle dev,
> > - uint32_t priority,
> > + int32_t priority,
> > amdgpu_context_handle *context);
> > /**
> > * Create GPU execution Context
> > diff --git a/amdgpu/amdgpu_cs.c b/amdgpu/amdgpu_cs.c
> > index 49fc16c3..eb72c638 100644
> > --- a/amdgpu/amdgpu_cs.c
> > +++ b/amdgpu/amdgpu_cs.c
> > @@ -49,7 +49,7 @@ static int amdgpu_cs_reset_sem(amdgpu_semaphore_handle
> sem);
> > * \return 0 on success otherwise POSIX Error code
> > */
> > drm_public int amdgpu_cs_ctx_create2(amdgpu_device_handle dev,
> > - uint32_t priority,
> > + int32_t priority,
> > amdgpu_context_handle *context)
> > {
> > struct amdgpu_context *gpu_context;
>
>
>


[ANNOUNCE] libdrm 2.4.119

2023-12-21 Thread Marek Olšák
New libdrm has been released.

Marek Olšák (2):
  amdgpu: add amdgpu_va_get_start_addr
  meson: bump libdrm version to 2.4.119

git tag: libdrm-2.4.119

https://dri.freedesktop.org/libdrm/libdrm-2.4.119.tar.xz
SHA256: 0a49f12f09b5b6e68eaaaff3f02ca7cff9aa926939b212d343161d3e8ac56291
 libdrm-2.4.119.tar.xz
SHA512: 
c8dd7665e85c01a67fcce1c1c614bc05a3ec311f31cae7de5fb1cd27d0f11f1801be63de3fa3e33b2f505544fd4b1bc292965c5e8de46a3beaaedb10334945ca
 libdrm-2.4.119.tar.xz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.119.tar.xz.sig

See the attachment for the signed message.

Marek
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


New libdrm has been released.

Marek Olšák (2):
  amdgpu: add amdgpu_va_get_start_addr
  meson: bump libdrm version to 2.4.119

git tag: libdrm-2.4.119

https://dri.freedesktop.org/libdrm/libdrm-2.4.119.tar.xz
SHA256: 0a49f12f09b5b6e68eaaaff3f02ca7cff9aa926939b212d343161d3e8ac56291  
libdrm-2.4.119.tar.xz
SHA512: 
c8dd7665e85c01a67fcce1c1c614bc05a3ec311f31cae7de5fb1cd27d0f11f1801be63de3fa3e33b2f505544fd4b1bc292965c5e8de46a3beaaedb10334945ca
  libdrm-2.4.119.tar.xz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.119.tar.xz.sig

-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE86UtuOzp654zvEjGkXo+6XoPzRoFAmWEKx8ACgkQkXo+6XoP
zRqmZQwAn7zcB2GyPFGu+jJOGpM4xN07WFNnV/SsDL/kIPsXuQvOfqzyh+3itDvu
nJzldeNyWa9EDEtj40y1hLmYXgmSBcbPqsj4gmi190UVMEYyYKvKKqH++SyWa2KE
0DLjOibka2AjTAWYDf3JA86eezKn8xXa7l8/RaUIm/8DYXfL8mk0MdjrZhySnMZn
zrZ5QT8rsNEaVIOHHRlYbkRQs+WZXS9W7FfXq+BGrPZjP+C4dalt5GJoaV/Ng3gH
C900SRF7eSkseRwNKyE1l86aWFa8PwxoU1B0f+g1vlYAvive7BIr7WXJu+7shGHI
yVhu+RlWF1AVUDscHjCOsnSxS1f4PHASMXdVreN4MpfpCHYYlzFt0kuwXJa4FOHF
/w90D+HSARF+vvSyGwnFRybIJ16uwQZSSGo9FL2LhhDlGqOfd6cwyl1vLUk2gV0q
tBtrahC1m0EevPxOGTowtfEIkuFkxrmHOg3HH3Wj0Te7AHA10OuOg1aJpnyDIUpr
WMUmHONt
=wqGI
-----END PGP SIGNATURE-----


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-09 Thread Marek Olšák
On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer  wrote:
>
> On 8/8/23 19:03, Marek Olšák wrote:
> > It's the same situation as SIGSEGV. A process can catch the signal,
> > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> > to catch the GPU error and prevent the process termination. If you
> > don't use the API, you'll get undefined behavior, which means anything
> > can happen, including process termination.
>
> Got a spec reference for that?
>
> I know the spec allows process termination in response to e.g. out of bounds 
> buffer access by the application (which corresponds to SIGSEGV). There are 
> other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec 
> says:
>
> If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
> then the implementation will never deliver notification of reset
> events, and GetGraphicsResetStatusARB will always return
> NO_ERROR[fn1].
>[fn1: In this case it is recommended that implementations should
> not allow loss of context state no matter what events occur.
> However, this is only a recommendation, and cannot be relied
> upon by applications.]
>
> No mention of process termination, that rather sounds to me like the GL 
> implementation should do its best to keep the application running.

It basically says that we can do anything.

A frozen window or flipping between 2 random frames can't be described
as "keeping the application running". That's the worst user
experience. I will not accept it.

A window system can force-enable robustness for its non-robust apps
and control that. That's the best possible user experience and it's
achievable everywhere. Everything else doesn't matter.

Marek


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-08-08 Thread Marek Olšák
It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.
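For reference, the catch-the-error path on the GL side looks roughly like
this (a sketch using GL_ARB_robustness; the context must have been created
with reset notification enabled, and recreate_context() is a hypothetical
application helper):

GLenum status = glGetGraphicsResetStatusARB();

if (status != GL_NO_ERROR) {
        /* GUILTY/INNOCENT/UNKNOWN_CONTEXT_RESET_ARB: the context is lost.
         * Destroy it, recreate it, and restore the application's GPU state. */
        recreate_context();
}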



Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick  wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter  wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen 
> > > Signed-off-by: André Almeida 
> > > ---
> > >
> > > v4: 
> > > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
> > > third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware 
> > > bugs,
> > > +faulty applications and everything in between the many layers. Some 
> > > errors
> > > +require resetting the device in order to make the device usable again. This
> > > +section describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +--
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to 
> > > perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for a specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the 
> > > device has
> > > +been reset, and this can be checked more often if the UMD requires it. 
> > > After
> > > +detecting a reset, UMD will then proceed to report it to the application 
> > > using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +--
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is 
> > > using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. 
> > > However,
> > > +there is no guarantee that the app will use such features correctly, and 
> > > the
> > > +UMD can implement policies to close the app if it is a repeating 
> > > offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop). Also maybe tune it down further to something like "the
> > kernel driver may implemnent ..."
> >
> > In my opinion the umd shouldn't implement these kind of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep 
> > > blocking
> > > +the user interface from being correctly displayed. This should be done 
> > > even if
> > > +the app is correct but happens to trigger some bug in the 
> > > hardware/driver.
> > > +
> > > +OpenGL
> > > +~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` 

Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-08-02 Thread Marek Olšák
A screen that doesn't update isn't usable. Killing the window system
and returning to the login screen is one option. Killing the window
system manually from a terminal or over ssh and then returning to the
login screen is another option, but 99% of users either hard-reset the
machine or do sysrq+REISUB anyway because it's faster that way. Those
are all your options. If we don't do the kill, users might decide to
do a hard reset with an unsync'd file system, which can cause more
damage.

The precedent from the CPU land is pretty strong here. There is
SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
instructions, yet we do nothing for invalid GPU memory access and
invalid GPU instructions. Sending a terminating signal from the kernel
would be the most natural thing to do. Instead, we just keep a frozen
GUI to keep users helpless, or we continue command submission and then
the hanging app can cause an infinite cycle of GPU hangs and resets,
making the GPU unusable until somebody kills the app over ssh.

That's why GL/Vulkan robustness is required - either robust apps, or a
robust compositor that greys out lost windows and pops up a diagnostic
message with a list of actions to choose from. That's the direction we
should be taking. Non-robust apps under a non-robust compositor should
just be killed if they crash the GPU.


Marek

On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer
 wrote:
>
> On 7/25/23 15:02, André Almeida wrote:
> > Em 25/07/2023 05:03, Michel Dänzer escreveu:
> >> On 7/25/23 04:55, André Almeida wrote:
> >>> Hi everyone,
> >>>
> >>> It's not clear what we should do about non-robust OpenGL apps after GPU 
> >>> resets, so I'll try to summarize the topic, show some options and my 
> >>> proposal to move forward on that.
> >>>
> >>> Em 27/06/2023 10:23, André Almeida escreveu:
>  +Robustness
>  +--
>  +
>  +The only way to try to keep an application working after a reset is if 
>  it
>  +complies with the robustness aspects of the graphical API that it is 
>  using.
>  +
>  +Graphical APIs provide ways to applications to deal with device resets. 
>  However,
>  +there is no guarantee that the app will use such features correctly, 
>  and the
>  +UMD can implement policies to close the app if it is a repeating 
>  offender,
>  +likely in a broken loop. This is done to ensure that it does not keep 
>  blocking
>  +the user interface from being correctly displayed. This should be done 
>  even if
>  +the app is correct but happens to trigger some bug in the 
>  hardware/driver.
>  +
> >>> Depending on the OpenGL version, there are different robustness API 
> >>> available:
> >>>
> >>> - OpenGL ABR extension [0]
> >>> - OpenGL KHR extension [1]
> >>> - OpenGL ES extension  [2]
> >>>
> >>> Apps written in OpenGL should use whatever version is available for them 
> >>> to make the app robust for GPU resets. That usually means calling 
> >>> GetGraphicsResetStatusARB(), checking the status, and if it encounters
> >>> something different from NO_ERROR, that means that a reset has happened,
> >>> the context is considered lost and should be recreated. If an app follows
> >>> this, it will likely succeed in recovering from a reset.
> >>>
> >>> What should non-robust apps do then? They certainly will not be 
> >>> notified if a reset happens, and thus can't recover if their context is 
> >>> lost. OpenGL specification does not explicitly define what should be done 
> >>> in such situations[3], and I believe that usually when the spec mandates 
> >>> to close the app, it would explicitly note it.
> >>>
> >>> However, in reality there are different types of device resets, causing 
> >>> different results. A reset can be precise enough to damage only the 
> >>> guilty context, and keep others alive.
> >>>
> >>> Given that, I believe drivers have the following options:
> >>>
> >>> a) Kill all non-robust apps after a reset. This may lead to losing work
> >>> from innocent applications.
> >>>
> >>> b) Ignore all non-robust apps' OpenGL calls. That means that applications
> >>> would still be alive, but the user interface would freeze. The user
> >>> would need to close it manually anyway, but in some corner cases, the app 
> >>> could autosave some work or the user might be able to interact with it 
> >>> using some alternative method (command line?).
> >>>
> >>> c) Kill just the affected non-robust applications. To do that, the driver 
> >>> needs to be 100% sure of the impact of its resets.
> >>>
> >>> RadeonSI currently implements a), as can be seen at [4], while Iris 
> >>> implements what I think is c) [5].
> >>>
> >>> From the user-experience point of view, c) is clearly the best option, but
> >>> it's the hardest to achieve. There's not much gain in having b) over a);
> >>> perhaps it could be an optional env var for such corner-case applications.
> >>
> >> I disagree on these conclusions.
> >>
> >> c)

Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

2023-07-25 Thread Marek Olšák
On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
 wrote:
>
> On 7/25/23 04:55, André Almeida wrote:
> > Hi everyone,
> >
> > It's not clear what we should do about non-robust OpenGL apps after GPU 
> > resets, so I'll try to summarize the topic, show some options and my 
> > proposal to move forward on that.
> >
> > Em 27/06/2023 10:23, André Almeida escreveu:
> >> +Robustness
> >> +--
> >> +
> >> +The only way to try to keep an application working after a reset is if it
> >> +complies with the robustness aspects of the graphical API that it is 
> >> using.
> >> +
> >> +Graphical APIs provide ways to applications to deal with device resets. 
> >> However,
> >> +there is no guarantee that the app will use such features correctly, and 
> >> the
> >> +UMD can implement policies to close the app if it is a repeating offender,
> >> +likely in a broken loop. This is done to ensure that it does not keep 
> >> blocking
> >> +the user interface from being correctly displayed. This should be done 
> >> even if
> >> +the app is correct but happens to trigger some bug in the hardware/driver.
> >> +
> > Depending on the OpenGL version, there are different robustness API 
> > available:
> >
> > - OpenGL ABR extension [0]
> > - OpenGL KHR extension [1]
> > - OpenGL ES extension  [2]
> >
> > Apps written in OpenGL should use whatever version is available for them to 
> > make the app robust for GPU resets. That usually means calling 
> > GetGraphicsResetStatusARB(), checking the status, and if it encounters
> > something different from NO_ERROR, that means that a reset has happened,
> > the context is considered lost and should be recreated. If an app follows
> > this, it will likely succeed in recovering from a reset.
> >
> > What should non-robust apps do then? They certainly will not be 
> > notified if a reset happens, and thus can't recover if their context is 
> > lost. OpenGL specification does not explicitly define what should be done 
> > in such situations[3], and I believe that usually when the spec mandates to 
> > close the app, it would explicitly note it.
> >
> > However, in reality there are different types of device resets, causing 
> > different results. A reset can be precise enough to damage only the guilty 
> > context, and keep others alive.
> >
> > Given that, I believe drivers have the following options:
> >
> > a) Kill all non-robust apps after a reset. This may lead to losing work from
> > innocent applications.
> >
> > b) Ignore all non-robust apps' OpenGL calls. That means that applications
> > would still be alive, but the user interface would freeze. The user
> > would need to close it manually anyway, but in some corner cases, the app 
> > could autosave some work or the user might be able to interact with it 
> > using some alternative method (command line?).
> >
> > c) Kill just the affected non-robust applications. To do that, the driver 
> > needs to be 100% sure of the impact of its resets.
> >
> > RadeonSI currently implements a), as can be seen at [4], while Iris 
> > implements what I think is c) [5].
> >
> > From the user-experience point of view, c) is clearly the best option, but
> > it's the hardest to achieve. There's not much gain in having b) over a);
> > perhaps it could be an optional env var for such corner-case applications.
>
> I disagree on these conclusions.
>
> c) is certainly better than a), but it's not "clearly the best" in all cases. 
> The OpenGL UMD is not a privileged/special component and is in no position to 
> decide whether or not the process as a whole (only some thread(s) of which 
> may use OpenGL at all) gets to continue running or not.

That's not true. I recommend that you enable b) with your driver and
then hang the GPU under different scenarios and see the result. Then
enable a) and do the same and compare.

Options a) and c) can be merged into one because they are not separate
options to choose from.

If Wayland wanted to grey out lost apps, they would appear as robust
contexts in gallium, but the reset status would be piped through the
Wayland protocol instead of the GL API.

Marek


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-05 Thread Marek Olšák
On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer  wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer  wrote:
> > On 7/4/23 04:34, Marek Olšák wrote:
> > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer wrote:
> > > On 6/30/23 22:32, Marek Olšák wrote:
> > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer wrote:
> > > >> On 6/30/23 16:59, Alex Deucher wrote:
> > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > > >>> <sebastian.w...@redhat.com> wrote:
> > > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida
> > > >>>> <andrealm...@igalia.com> wrote:
> > > >>>>>
> > > >>>>> +Robustness
> > > >>>>> +--
> > > >>>>> +
> > > >>>>> +The only way to try to keep an application working after a 
> > reset is if it
> > > >>>>> +complies with the robustness aspects of the graphical API 
> > that it is using.
> > > >>>>> +
> > > >>>>> +Graphical APIs provide ways to applications to deal with 
> > device resets. However,
> > > >>>>> +there is no guarantee that the app will use such features 
> > correctly, and the
> > > >>>>> +UMD can implement policies to close the app if it is a 
> > repeating offender,
> > > >>>>> +likely in a broken loop. This is done to ensure that it 
> > does not keep blocking
> > > >>>>> +the user interface from being correctly displayed. This 
> > should be done even if
> > > >>>>> +the app is correct but happens to trigger some bug in the 
> > hardware/driver.
> > > >>>>
> > > >>>> I still don't think it's good to let the kernel arbitrarily 
> > kill
> > > >>>> processes that it thinks are not well-behaved based on some 
> > heuristics
> > > >>>> and policy.
> > > >>>>
> > > >>>> Can't this be outsourced to user space? Expose the 
> > information about
> > > >>>> processes causing a device and let e.g. systemd deal with 
> > coming up
> > > >>>> with a policy and with killing stuff.
> > > >>>
> > > >>> I don't think it's the kernel doing the killing, it would be 
> > the UMD.
> > > >>> E.g., if the app is guilty and doesn't support robustness the 
> > UMD can
> > > >>> just call exit().
> > > >>
> > > >> It would be safer to just ignore API calls[0], similarly to 
> > what is done until the application destroys the context with robustness. 
> > Calling exit() likely results in losing any unsaved work, whereas at least 
> > some applications might otherwise allow saving the work by other means.
> > > >
> > > > That's a terrible idea. Ignoring API calls would be identical 
> > to a freeze. You might as well disable GPU recovery because the result 
> > would be the same.
> > >
> > > No GPU recovery would affect everything using the GPU, whereas 
> > this affects only non-robust applications.
> > >
> > > which is currently the majority.
> >
> > Not sure where you're going with this. Applications need to use 
> > robustness to be able to recover from a GPU hang, and the GPU needs to be 
> > reset for that. So disabling GPU reset is not the same as what we're 
> > discussing here.
> >
> >
> > > > - non-robust contexts: call exit(1) immediately, which is the 
> > best way to recover
> > >
> > > That's not the UMD's call to make.
> > >
> > > That's absolutely the UMD's call to make because that's mandated by 
> > the hw and API design
> >
> > Can you point us to a spec which mandates that the process must be 
> > killed in this case?
> >
> >
> > > and only driver devs know this, which this thread is a proof of. The 
> > default behavior is to skip all command submission if a non-robust context 
> > is l

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-04 Thread Marek Olšák
On Tue, Jul 4, 2023, 03:55 Michel Dänzer  wrote:

> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daen...@mailbox.org> wrote:
> > On 6/30/23 22:32, Marek Olšák wrote:
> > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer
> > > <michel.daen...@mailbox.org> wrote:
> > >> On 6/30/23 16:59, Alex Deucher wrote:
> > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > >>> <sebastian.w...@redhat.com> wrote:
> > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida
> > >>>> <andrealm...@igalia.com> wrote:
> > >>>>>
> > >>>>> +Robustness
> > >>>>> +--
> > >>>>> +
> > >>>>> +The only way to try to keep an application working after a
> reset is if it
> > >>>>> +complies with the robustness aspects of the graphical API
> that it is using.
> > >>>>> +
> > >>>>> +Graphical APIs provide ways to applications to deal with
> device resets. However,
> > >>>>> +there is no guarantee that the app will use such features
> correctly, and the
> > >>>>> +UMD can implement policies to close the app if it is a
> repeating offender,
> > >>>>> +likely in a broken loop. This is done to ensure that it does
> not keep blocking
> > >>>>> +the user interface from being correctly displayed. This
> should be done even if
> > >>>>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> > >>>>
> > >>>> I still don't think it's good to let the kernel arbitrarily kill
> > >>>> processes that it thinks are not well-behaved based on some
> heuristics
> > >>>> and policy.
> > >>>>
> > >>>> Can't this be outsourced to user space? Expose the information
> about
> > >>>> processes causing a device and let e.g. systemd deal with
> coming up
> > >>>> with a policy and with killing stuff.
> > >>>
> > >>> I don't think it's the kernel doing the killing, it would be the
> UMD.
> > >>> E.g., if the app is guilty and doesn't support robustness the
> UMD can
> > >>> just call exit().
> > >>
> > >> It would be safer to just ignore API calls[0], similarly to what
> is done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> > >
> > > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
> >
> > No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
> >
> > which is currently the majority.
>
> Not sure where you're going with this. Applications need to use robustness
> to be able to recover from a GPU hang, and the GPU needs to be reset for
> that. So disabling GPU reset is not the same as what we're discussing here.
>
>
> > > - non-robust contexts: call exit(1) immediately, which is the best
> way to recover
> >
> > That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the
> hw and API design
>
> Can you point us to a spec which mandates that the process must be killed
> in this case?
>
>
> > and only driver devs know this, which this thread is a proof of. The
> default behavior is to skip all command submission if a non-robust context
> is lost, which looks like a freeze. That's required to prevent infinite
> hangs from the same context and can be caused by the side effects of the
> GPU reset itself, not by the cause of the previous hang. The only way out
> of that is killing the process.
>
> The UMD killing the process is not the only way out of that, and doing so
> is overreach

Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Marek Olšák
On Mon, Jul 3, 2023, 22:38 Randy Dunlap  wrote:

>
>
> On 7/3/23 19:34, Marek Olšák wrote:
> >
> >
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer  wrote:
> >
>
> Marek,
> Please stop sending html emails to the mailing lists.
> The mailing list software drops them.
>
> Please set your email interface to use plain text mode instead.
> Thanks.
>

The mobile Gmail app doesn't support plain text, which I use frequently.

Marek


> --
> ~Randy
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-07-03 Thread Marek Olšák
On Mon, Jul 3, 2023, 03:12 Michel Dänzer  wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer  wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick  wrote:
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida  wrote:
> >>>>>
> >>>>> +Robustness
> >>>>> +--
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is
> if it
> >>>>> +complies with the robustness aspects of the graphical API that it
> is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways for applications to deal with device
> resets. However,
> >>>>> +there is no guarantee that the app will use such features
> correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not
> keep blocking
> >>>>> +the user interface from being correctly displayed. This should be
> done even if
> >>>>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device reset and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what is
> done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
>
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
>

which is currently the majority.


>
> > - non-robust contexts: call exit(1) immediately, which is the best way
> to recover
>
> That's not the UMD's call to make.
>

That's absolutely the UMD's call to make because that's mandated by the hw
and API design and only driver devs know this, which this thread is proof
of. The default behavior is to skip all command submission if a non-robust
context is lost, which looks like a freeze. That's required to prevent
infinite hangs from the same context and can be caused by the side effects
of the GPU reset itself, not by the cause of the previous hang. The only
way out of that is killing the process.

Marek


>
> >> [0] Possibly accompanied by a one-time message to stderr along the
> lines of "GPU reset detected but robustness not enabled in context,
> ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-30 Thread Marek Olšák
That's a terrible idea. Ignoring API calls would be identical to a freeze.
You might as well disable GPU recovery because the result would be the same.

There are 2 scenarios:
- robust contexts: report the GPU reset status and skip API calls; let the
app recreate the context to recover
- non-robust contexts: call exit(1) immediately, which is the best way to
recover
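
For illustration, that policy could look roughly like this in a UMD's
submission path (a minimal sketch in C; struct ctx, query_reset_status()
and ctx_is_robust() are hypothetical stand-ins, not real Mesa or libdrm
API):

#include <stdbool.h>
#include <stdlib.h>

static bool umd_pre_submit(struct ctx *ctx)
{
        /* Hypothetical wrapper around the driver's reset-status query. */
        enum reset_status status = query_reset_status(ctx);

        if (status == RESET_NONE)
                return true;   /* no reset observed, submit normally */

        if (ctx_is_robust(ctx)) {
                /* Robust context: report the loss through the API, skip
                 * further submissions, let the app recreate the context. */
                ctx->device_lost = true;
                return false;
        }

        /* Non-robust context: the app cannot observe the loss, so further
         * submission would only look like a freeze. Exit instead. */
        exit(1);
}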

Marek

On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer 
wrote:

> On 6/30/23 16:59, Alex Deucher wrote:
> > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >  wrote:
> >> On Tue, Jun 27, 2023 at 3:23 PM André Almeida 
> wrote:
> >>>
> >>> +Robustness
> >>> +--
> >>> +
> >>> +The only way to try to keep an application working after a reset is
> if it
> >>> +complies with the robustness aspects of the graphical API that it is
> using.
> >>> +
> >>> +Graphical APIs provide ways for applications to deal with device
> resets. However,
> >>> +there is no guarantee that the app will use such features correctly,
> and the
> >>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> >>> +the user interface from being correctly displayed. This should be
> done even if
> >>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>
> >> I still don't think it's good to let the kernel arbitrarily kill
> >> processes that it thinks are not well-behaved based on some heuristics
> >> and policy.
> >>
> >> Can't this be outsourced to user space? Expose the information about
> >> processes causing a device reset and let e.g. systemd deal with coming up
> >> with a policy and with killing stuff.
> >
> > I don't think it's the kernel doing the killing, it would be the UMD.
> > E.g., if the app is guilty and doesn't support robustness the UMD can
> > just call exit().
>
> It would be safer to just ignore API calls[0], similarly to what is done
> until the application destroys the context with robustness. Calling exit()
> likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
>
>
> [0] Possibly accompanied by a one-time message to stderr along the lines
> of "GPU reset detected but robustness not enabled in context, ignoring
> OpenGL API calls".
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread Marek Olšák
On Tue, Jun 27, 2023 at 5:31 PM André Almeida 
wrote:

> Hi Marek,
>
> Em 27/06/2023 15:57, Marek Olšák escreveu:
> > On Tue, Jun 27, 2023, 09:23 André Almeida  wrote:
> >
> > +User Mode Driver
> > +
> > +
> > +The UMD should check before submitting new commands to the KMD if
> > the device has
> > +been reset, and this can be checked more often if the UMD requires
> > it. After
> > +detecting a reset, UMD will then proceed to report it to the
> > application using
> > +the appropriate API error code, as explained in the section below
> about
> > +robustness.
> >
> >
> > The UMD won't check the device status before every command submission
> > due to ioctl overhead. Instead, the KMD should skip command submission
> > and return an error that it was skipped.
>
> I wrote like this because when reading the source code for
> vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under
> the impression that UMD checks the reset status before every
> submission/flush.
>

It only does that before every command submission when the context is
robust. When it's not robust, radeonsi doesn't do anything.


>
> Is your comment about of how things are currently implemented, or how
> they would ideally work? Either way I can apply your suggestion, I just
> want to make it clear.
>

Yes. Ideally, the CS ioctl itself would tell us whether the context is lost.
This is not currently implemented.
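
For illustration, such a reply could look like this from the UMD side (a
hedged sketch, not current amdgpu uAPI; the -ECANCELED choice and the
report_device_lost() helper are illustrative assumptions):

#include <xf86drm.h>
#include <amdgpu_drm.h>

static int submit_ib(int fd, union drm_amdgpu_cs *cs, struct ctx *ctx)
{
        int r = drmCommandWriteRead(fd, DRM_AMDGPU_CS, cs, sizeof(*cs));

        if (r == -ECANCELED) {
                /* The kernel skipped the submission because the context
                 * was lost; no separate query ioctl was needed. */
                report_device_lost(ctx);  /* hypothetical helper */
        }
        return r;
}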

Marek


>
> [0]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
> [1]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83
>
> >
> > The only case where that won't be applicable is user queues where
> > drivers don't call into the kernel to submit work, but they do call into
> > the kernel to create a dma_fence. In that case, the call to create a
> > dma_fence can fail with an error.
> >
> > Marek
>
>


Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

2023-06-27 Thread Marek Olšák
On Tue, Jun 27, 2023, 09:23 André Almeida  wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen 
> Signed-off-by: André Almeida 
> ---
>
> v4:
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealm...@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst
> b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +
> +
> +The GPU stack is really complex and is prone to errors, from hardware
> bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again.
> This
> +section describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +--
> +
> +The KMD is responsible for checking if the device needs a reset, and for
> performing
> +it as needed. Usually a hang is detected when a job gets stuck executing.
> KMD
> +should keep track of resets, because userspace can query any time about
> the
> +reset stats for an specific context. This is needed to propagate to the
> rest of
> +the stack that a reset has happened. Currently, this is implemented by
> each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +
> +
> +The UMD should check before submitting new commands to the KMD if the
> device has
> +been reset, and this can be checked more often if the UMD requires it.
> After
> +detecting a reset, UMD will then proceed to report it to the application
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
>

The UMD won't check the device status before every command submission due
to ioctl overhead. Instead, the KMD should skip command submission and
return an error that it was skipped.

The only case where that won't be applicable is user queues where drivers
don't call into the kernel to submit work, but they do call into the kernel
to create a dma_fence. In that case, the call to create a dma_fence can
fail with an error.

Marek

+
> +Robustness
> +--
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is
> using.
> +
> +Graphical APIs provide ways for applications to deal with device resets.
> However,
> +there is no guarantee that the app will use such features correctly, and
> the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> +the user interface from being correctly displayed. This should be done
> even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES).
> This
> +interface tells if a reset has happened, and if so, all the context state
> is
> +considered lost and the app proceeds by creating new ones. If it is
> possible to
> +determine that robustness is not in use, the UMD will terminate the app
> when a
> reset is detected, given that the contexts are lost and the app won't be
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for
> submissions.
> +This error code means, among other things, that a device reset has
> happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--
> +
> +Apart from propagating the reset through the stack so apps can recover,
> it's
> +really useful for driver developers to learn more about what caused the
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>
>
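
To illustrate the robustness interfaces named in the quoted patch above, a
hedged sketch of the application-side checks (the recreate_* helpers are
hypothetical; the GL context must have been created with reset notification
enabled, e.g. via the create-context-robustness EGL/GLX extensions):

static void check_gl_reset(void)
{
        /* GL_ARB_robustness: poll the reset status. */
        GLenum status = glGetGraphicsResetStatusARB();

        if (status != GL_NO_ERROR) {
                /* GUILTY/INNOCENT/UNKNOWN_CONTEXT_RESET: all context
                 * state is lost; destroy it and build a new context. */
                recreate_gl_context();   /* hypothetical app helper */
        }
}

static void submit_vk(VkQueue queue, const VkSubmitInfo *info, VkFence fence)
{
        /* Vulkan: a device reset surfaces as VK_ERROR_DEVICE_LOST. */
        if (vkQueueSubmit(queue, 1, info, fence) == VK_ERROR_DEVICE_LOST)
                recreate_vk_device();    /* hypothetical app helper */
}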


Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
On Wed, May 3, 2023, 14:53 André Almeida  wrote:

> Em 03/05/2023 14:08, Marek Olšák escreveu:
> > GPU hangs are pretty common post-bringup. They are not common per user,
> > but if we gather all hangs from all users, we can have lots and lots of
> > them.
> >
> > GPU hangs are indeed not very debuggable. There are however some things
> > we can do:
> > - Identify the hanging IB by its VA (the kernel should know it)
>
> How can the kernel tell which VA range is being executed? I only found
> that information at mmCP_IB1_BASE_ regs, but as stated in this thread by
> Christian this is not reliable to be read.
>

The kernel receives the VA and the size via the CS ioctl. When user queues
are enabled, the kernel will no longer receive them.


> > - Read and parse the IB to detect memory corruption.
> > - Print active waves with shader disassembly if SQ isn't hung (often
> > it's not).
> >
> > Determining which packet the CP is stuck on is tricky. The CP has 2
> > engines (one frontend and one backend) that work on the same command
> > buffer. The frontend engine runs ahead, executes some packets and
> > forwards others to the backend engine. Only the frontend engine has the
> > command buffer VA somewhere. The backend engine only receives packets
> > from the frontend engine via a FIFO, so it might not be possible to tell
> > where it's stuck if it's stuck.
>
> Do they run asynchronously, or does the front end wait for the
> back end to execute?
>

They run asynchronously and should run asynchronously for performance, but
they can be synchronized using a special packet (PFP_SYNC_ME).

Marek


> >
> > When the gfx pipeline hangs outside of shaders, making a scandump seems
> > to be the only way to have a chance at finding out what's going wrong,
> > and only AMD-internal versions of hw can be scanned.
> >
> > Marek
> >
> > On Wed, May 3, 2023 at 11:23 AM Christian König  wrote:
> >
> > Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> >  > Am 2023-05-03 um 03:59 schrieb Christian König:
> >  >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
> >  >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> >  >>> mailto:timur.kris...@gmail.com>>
> wrote:
> >  >>>> [SNIP]
> >  >>>>>>>> In my opinion, the correct solution to those problems
> would be
> >  >>>>>>>> if
> >  >>>>>>>> the kernel could give userspace the necessary information
> > about
> >  >>>>>>>> a
> >  >>>>>>>> GPU hang before a GPU reset.
> >  >>>>>>>>
> >  >>>>>>>   The fundamental problem here is that the kernel doesn't
> have
> >  >>>>>>> that
> >  >>>>>>> information either. We know which IB timed out and can
> >  >>>>>>> potentially do
> >  >>>>>>> a devcoredump when that happens, but that's it.
> >  >>>>>>
> >  >>>>>> Is it really not possible to know such a fundamental thing
> > as what
> >  >>>>>> the
> >  >>>>>> GPU was doing when it hung? How are we supposed to do any
> > kind of
> >  >>>>>> debugging without knowing that?
> >  >>
> >  >> Yes, that's indeed something at least I try to figure out for
> years
> >  >> as well.
> >  >>
> >  >> Basically there are two major problems:
> >  >> 1. When the ASIC is hung you can't talk to the firmware engines
> any
> >  >> more and most state is not exposed directly, but just through
> some
> >  >> fw/hw interface.
> >  >> Just take a look at how umr reads the shader state from the
> SQ.
> >  >> When that block is hung you can't do that any more and basically
> > have
> >  >> no chance at all to figure out why it's hung.
> >  >>
> >  >> Same for other engines, I remember once spending a week
> > figuring
> >  >> out why the UVD block is hung during suspend. Turned out to be a
> >  >> debugging nightmare because any time you touch any register of
> that
> >  >> block the whole system would hang.

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
WRITE_DATA with ENGINE=PFP will execute the packet on the frontend engine,
while ENGINE=ME will execute the packet on the backend engine.
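
For context, a hedged sketch of how that engine selection looks when the
packet is emitted, in older radeonsi style (macro names follow Mesa's sid.h
conventions; exact encodings vary by GPU generation, so treat this as
illustrative):

/* WRITE_DATA: header, control dword, 64-bit destination, one data dword. */
radeon_emit(cs, PKT3(PKT3_WRITE_DATA, 3, 0));
radeon_emit(cs, S_370_DST_SEL(dst_sel) |
                S_370_WR_CONFIRM(1) |
                S_370_ENGINE_SEL(V_370_PFP)); /* frontend; V_370_ME = backend */
radeon_emit(cs, va);                          /* destination VA, low 32 bits */
radeon_emit(cs, va >> 32);                    /* destination VA, high 32 bits */
radeon_emit(cs, value);                       /* the data to write */

/* PFP_SYNC_ME, mentioned earlier in the thread, stalls the frontend until
 * the backend has caught up, synchronizing the two engines. */
radeon_emit(cs, PKT3(PKT3_PFP_SYNC_ME, 0, 0));
radeon_emit(cs, 0);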

Marek

On Wed, May 3, 2023 at 1:08 PM Marek Olšák  wrote:

> GPU hangs are pretty common post-bringup. They are not common per user,
> but if we gather all hangs from all users, we can have lots and lots of
> them.
>
> GPU hangs are indeed not very debuggable. There are however some things we
> can do:
> - Identify the hanging IB by its VA (the kernel should know it)
> - Read and parse the IB to detect memory corruption.
> - Print active waves with shader disassembly if SQ isn't hung (often it's
> not).
>
> Determining which packet the CP is stuck on is tricky. The CP has 2
> engines (one frontend and one backend) that work on the same command
> buffer. The frontend engine runs ahead, executes some packets and forwards
> others to the backend engine. Only the frontend engine has the command
> buffer VA somewhere. The backend engine only receives packets from the
> frontend engine via a FIFO, so it might not be possible to tell where it's
> stuck if it's stuck.
>
> When the gfx pipeline hangs outside of shaders, making a scandump seems to
> be the only way to have a chance at finding out what's going wrong, and
> only AMD-internal versions of hw can be scanned.
>
> Marek
>
> On Wed, May 3, 2023 at 11:23 AM Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Am 03.05.23 um 17:08 schrieb Felix Kuehling:
>> > Am 2023-05-03 um 03:59 schrieb Christian König:
>> >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
>> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
>> >>>  wrote:
>> >>>> [SNIP]
>> >>>>>>>> In my opinion, the correct solution to those problems would be
>> >>>>>>>> if
>> >>>>>>>> the kernel could give userspace the necessary information about
>> >>>>>>>> a
>> >>>>>>>> GPU hang before a GPU reset.
>> >>>>>>>>
>> >>>>>>>   The fundamental problem here is that the kernel doesn't have
>> >>>>>>> that
>> >>>>>>> information either. We know which IB timed out and can
>> >>>>>>> potentially do
>> >>>>>>> a devcoredump when that happens, but that's it.
>> >>>>>>
>> >>>>>> Is it really not possible to know such a fundamental thing as what
>> >>>>>> the
>> >>>>>> GPU was doing when it hung? How are we supposed to do any kind of
>> >>>>>> debugging without knowing that?
>> >>
>> >> Yes, that's indeed something at least I try to figure out for years
>> >> as well.
>> >>
>> >> Basically there are two major problems:
>> >> 1. When the ASIC is hung you can't talk to the firmware engines any
>> >> more and most state is not exposed directly, but just through some
>> >> fw/hw interface.
>> >> Just take a look at how umr reads the shader state from the SQ.
>> >> When that block is hung you can't do that any more and basically have
>> >> no chance at all to figure out why it's hung.
>> >>
>> >> Same for other engines, I remember once spending a week figuring
>> >> out why the UVD block is hung during suspend. Turned out to be a
>> >> debugging nightmare because any time you touch any register of that
>> >> block the whole system would hang.
>> >>
>> >> 2. There are tons of things going on in a pipeline fashion or even
>> >> completely in parallel. For example the CP is just the beginning of a
>> >> rather long pipeline which at the end produces a bunch of pixels.
>> >> In almost all cases I've seen you ran into a problem somewhere
>> >> deep in the pipeline and only very rarely at the beginning.
>> >>
>> >>>>>>
>> >>>>>> I wonder what AMD's Windows driver team is doing with this problem,
>> >>>>>> surely they must have better tools to deal with GPU hangs?
>> >>>>> For better or worse, most teams internally rely on scan dumps via
>> >>>>> JTAG
>> >>>>> which sort of limits the usefulness outside of AMD, but also gives
>> >>>>> you
>> >>>>> the exact state of the hardware when it's hung so the hardware teams
>> >>>>> prefer it.

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
GPU hangs are pretty common post-bringup. They are not common per user, but
if we gather all hangs from all users, we can have lots and lots of them.

GPU hangs are indeed not very debuggable. There are however some things we
can do:
- Identify the hanging IB by its VA (the kernel should know it)
- Read and parse the IB to detect memory corruption.
- Print active waves with shader disassembly if SQ isn't hung (often it's
not).

Determining which packet the CP is stuck on is tricky. The CP has 2 engines
(one frontend and one backend) that work on the same command buffer. The
frontend engine runs ahead, executes some packets and forwards others to
the backend engine. Only the frontend engine has the command buffer VA
somewhere. The backend engine only receives packets from the frontend
engine via a FIFO, so it might not be possible to tell where it's stuck if
it's stuck.

When the gfx pipeline hangs outside of shaders, making a scandump seems to
be the only way to have a chance at finding out what's going wrong, and
only AMD-internal versions of hw can be scanned.

Marek

On Wed, May 3, 2023 at 11:23 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> > Am 2023-05-03 um 03:59 schrieb Christian König:
> >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
> >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> >>>  wrote:
>  [SNIP]
>  In my opinion, the correct solution to those problems would be
>  if
>  the kernel could give userspace the necessary information about
>  a
>  GPU hang before a GPU reset.
> 
> >>>   The fundamental problem here is that the kernel doesn't have
> >>> that
> >>> information either. We know which IB timed out and can
> >>> potentially do
> >>> a devcoredump when that happens, but that's it.
> >>
> >> Is it really not possible to know such a fundamental thing as what
> >> the
> >> GPU was doing when it hung? How are we supposed to do any kind of
> >> debugging without knowing that?
> >>
> >> Yes, that's indeed something at least I try to figure out for years
> >> as well.
> >>
> >> Basically there are two major problems:
> >> 1. When the ASIC is hung you can't talk to the firmware engines any
> >> more and most state is not exposed directly, but just through some
> >> fw/hw interface.
> >> Just take a look at how umr reads the shader state from the SQ.
> >> When that block is hung you can't do that any more and basically have
> >> no chance at all to figure out why it's hung.
> >>
> >> Same for other engines, I remember once spending a week figuring
> >> out why the UVD block is hung during suspend. Turned out to be a
> >> debugging nightmare because any time you touch any register of that
> >> block the whole system would hang.
> >>
> >> 2. There are tons of things going on in a pipeline fashion or even
> >> completely in parallel. For example the CP is just the beginning of a
> >> rather long pipeline which at the end produces a bunch of pixels.
> >> In almost all cases I've seen you ran into a problem somewhere
> >> deep in the pipeline and only very rarely at the beginning.
> >>
> >>
> >> I wonder what AMD's Windows driver team is doing with this problem,
> >> surely they must have better tools to deal with GPU hangs?
> > For better or worse, most teams internally rely on scan dumps via
> > JTAG
> > which sort of limits the usefulness outside of AMD, but also gives
> > you
> > the exact state of the hardware when it's hung so the hardware teams
> > prefer it.
> >
>  How does this approach scale? It's not something we can ask users to
>  do, and even if all of us in the radv team had a JTAG device, we
>  wouldn't be able to play every game that users experience random hangs
>  with.
> >>> It doesn't scale or lend itself particularly well to external
> >>> development, but that's the current state of affairs.
> >>
> >> The usual approach seems to be to reproduce a problem in a lab and
> >> have a JTAG attached to give the hw guys a scan dump and they can
> >> then tell you why something didn't worked as expected.
> >
> > That's the worst-case scenario where you're debugging HW or FW issues.
> > Those should be pretty rare post-bringup. But are there hangs caused
> > by user mode driver or application bugs that are easier to debug and
> > probably don't even require a GPU reset? For example most VM faults
> > can be handled without hanging the GPU. Similarly, a shader in an
> > endless loop should not require a full GPU reset. In the KFD compute
> > case, that's still preemptible and the offending process can be killed
> > with Ctrl-C or debugged with rocm-gdb.
>
> We also have infinite loop in shader abort for gfx and page faults are
> pretty rare with OpenGL (a bit more often with Vulkan) and can be
> handled gracefully on modern hw (they just spam the logs).
>
> The majo

Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type

2023-04-26 Thread Marek Olšák
Perhaps I should clarify this. There are GL and Vulkan features that if any
app uses them and its shaders are killed, the next IB will hang. One of
them is Draw Indirect - if a shader is killed before storing the vertex
count and instance count in memory, the next draw will hang with a high
probability. No such app can be allowed to continue executing after a reset.

Marek

On Wed, Apr 26, 2023 at 5:51 AM Michel Dänzer 
wrote:

> On 4/25/23 21:11, Marek Olšák wrote:
> > The last 3 comments in this thread contain arguments that are false and
> were specifically pointed out as false 6 comments ago: Soft resets are just
> as fatal as hard resets. There is nothing better about soft resets. If the
> VRAM is lost completely, that's a different story, and if the hard reset is
> 100% unreliable, that's also a different story, but other than those two
> outliers, there is no difference between the two from the user's point of view.
> Both can repeatedly hang if you don't prevent the app that caused the hang
> from using the GPU even if the app is not robust. The robustness context
> type doesn't matter here. By definition, no guilty app can continue after a
> reset, and no innocent apps affected by a reset can continue either because
> those can now hang too. That's how destructive all resets are. Personal
> anecdotes that the soft reset is better are just that, anecdotes.
>
> You're trying to frame the situation as black or white, but reality is
> shades of grey.
>
>
> There's a similar situation with kernel Oopsen: In principle it's not safe
> to continue executing the kernel after it hits an Oops, since it might be
> in an inconsistent state, which could result in any kind of misbehaviour.
> Still, the default behaviour is to continue executing, and in most cases it
> turns out fine. Users which cannot accept the residual risk can choose to
> make the kernel panic when it hits an Oops (either via CONFIG_PANIC_ON_OOPS
> at build time, or via oops=panic on the kernel command line). A kernel
> panic means that the machine basically freezes from a user PoV, which would
> be worse as the default behaviour for most users (because it would e.g.
> incur a higher risk of losing filesystem data).
>
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer
>
>


Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type

2023-04-25 Thread Marek Olšák
The last 3 comments in this thread contain arguments that are false and
were specifically pointed out as false 6 comments ago: Soft resets are just
as fatal as hard resets. There is nothing better about soft resets. If the
VRAM is lost completely, that's a different story, and if the hard reset is
100% unreliable, that's also a different story, but other than those two
outliers, there is no difference between the two from the user's point of view.
Both can repeatedly hang if you don't prevent the app that caused the hang
from using the GPU even if the app is not robust. The robustness context
type doesn't matter here. By definition, no guilty app can continue after a
reset, and no innocent apps affected by a reset can continue either because
those can now hang too. That's how destructive all resets are. Personal
anecdotes that the soft reset is better are just that, anecdotes.

Marek

On Tue, Apr 25, 2023, 08:44 Christian König 
wrote:

> Am 25.04.23 um 14:14 schrieb Michel Dänzer:
> > On 4/25/23 14:08, Christian König wrote:
> >> Well signaling that something happened is not the question. We do this
> for both soft as well as hard resets.
> >>
> >> The question is if errors result in blocking further submissions with
> the same context or not.
> >>
> >> In case of a hard reset and potential loss of state we have to kill the
> context, otherwise a follow-up submission would just lock up the hardware
> once more.
> >>
> >> In case of a soft reset I think we can keep the context alive, this way
> even applications without robustness handling can keep working.
> >>
> >> You potentially still get some corruption, but at least not your
> compositor killed.
> > Right, and if there is corruption, the user can restart the session.
> >
> >
> > Maybe a possible compromise could be making soft resets fatal if user
> space enabled robustness for the context, and non-fatal if not.
>
> Well that should already be mostly the case. If an application has
> enabled robustness it should notice that something went wrong and act
> appropriately.
>
> The only thing we need to handle is for applications without robustness
> in case of a hard reset or otherwise it will trigger an reset over and
> over again.
>
> Christian.
>
> >
> >
> >> Am 25.04.23 um 13:07 schrieb Marek Olšák:
> >>> That supposedly depends on the compositor. There may be compositors
> for very specific cases (e.g. Steam Deck) that handle resets very well, and
> those would like to be properly notified of all resets because that's how
> they get the best outcome, e.g. no corruption. A soft reset that is
> unhandled by userspace may result in persistent corruption.
> >
>
>


Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type

2023-04-25 Thread Marek Olšák
That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck) that handle resets very well, and
those would like to be properly notified of all resets because that's how
they get the best outcome, e.g. no corruption. A soft reset that is
unhandled by userspace may result in persistent corruption.

Marek

On Tue, Apr 25, 2023 at 6:27 AM Michel Dänzer 
wrote:

> On 4/24/23 18:45, Marek Olšák wrote:
> > Soft resets are fatal just as hard resets, but no reset is "always
> fatal". There are cases when apps keep working depending on which features
> are being used. It's still unsafe.
>
> Agreed, in theory.
>
> In practice, from a user PoV, right now there's pretty much 0 chance of
> the user session surviving if the GPU context in certain critical processes
> (e.g. the Wayland compositor or Xwayland) hits a fatal reset. There's a > 0
> chance of it surviving after a soft reset. There's ongoing work towards
> making user-space components more robust against fatal resets, but it's
> taking time. Meanwhile, I suspect most users would take the > 0 chance.
>
>
> --
> Earthling Michel Dänzer|  https://redhat.com
> Libre software enthusiast  | Mesa and Xwayland developer
>
>


Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type

2023-04-24 Thread Marek Olšák
Soft resets are fatal just as hard resets, but no reset is "always fatal".
There are cases when apps keep working depending on which features are
being used. It's still unsafe.

Marek

On Mon, Apr 24, 2023, 03:03 Christian König 
wrote:

> Am 24.04.23 um 03:43 schrieb André Almeida:
> > When a DRM job times out, the GPU has probably hung and amdgpu has some
> > ways to deal with that, ranging from soft recoveries to full device
> > reset. Anyway, when userspace asks the kernel for the state of the context
> > (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
> > reset, regardless if a full reset happened or not.
> >
> > However, amdgpu only marks a context guilty in the ASIC reset path. This
> > makes the userspace report incomplete, given that on soft recovery path
> > the guilty context is not told that it's the guilty one.
> >
> > Fix this by marking the context guilty for every type of reset when a
> > job times out.
>
> The guilty handling is pretty much broken by design and only works
> because we go through multiple hops of validating the entity after the
> job has already been pushed to the hw.
>
> I think we should probably just remove that completely and use an
> approach where we check the in flight submissions in the query state
> IOCTL. See my other patch on the mailing list regarding that.
>
> Additionally, I currently don't consider soft-recovered
> submissions as fatal and continue accepting submissions from that
> context, but I already wanted to talk with Marek about that behavior.
>
> Regards,
> Christian.
>
> >
> > Signed-off-by: André Almeida 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 +++-
> >   2 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index ac78caa7cba8..ea169d1689e2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct
> amdgpu_device *adev,
> >
> >   amdgpu_fence_driver_isr_toggle(adev, false);
> >
> > - if (job && job->vm)
> > - drm_sched_increase_karma(&job->base);
> > -
> >   r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
> >   /* If reset handler not implemented, continue; otherwise return */
> >   if (r == -ENOSYS)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c3d9d75143f4..097ed8f06865 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >   memset(&ti, 0, sizeof(struct amdgpu_task_info));
> >   adev->job_hang = true;
> >
> > + amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
> > +
> > + if (job && job->vm) {
> > + DRM_INFO("marking %s context as guilty", ti.process_name);
> > + drm_sched_increase_karma(&job->base);
> > + }
> > +
> >   if (amdgpu_gpu_recovery &&
> >   amdgpu_ring_soft_recovery(ring, job->vmid,
> s_job->s_fence->parent)) {
> >   DRM_ERROR("ring %s timeout, but soft recovered\n",
> > @@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >   goto exit;
> >   }
> >
> > - amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
> >   DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> > job->base.sched->name,
> atomic_read(&ring->fence_drv.last_seq),
> > ring->fence_drv.sync_seq);
>
>


Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-29 Thread Marek Olšák
I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPU resets very fragile.

Marek
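
As a pseudocode sketch of the shader-store hazard described in the quoted
mail below (WAIT_REG_MEM is the PM4 wait packet; the rest is illustrative,
not a real command stream):

/* Producer shader, killed by the soft reset before its final store: */
        ...
        /* store(done_va, 1);  <- never executes once the shader is killed */

/* Non-shader consumer in the command stream: */
        WAIT_REG_MEM(mem = done_va, op = ">=", value = 1);
        /* polls forever -> hard hang, now requiring a hard reset */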

On Tue., Mar. 29, 2022, 08:14 Christian König, 
wrote:

> My main question is what does the iris driver do better than radeonsi when
> the client doesn't support the robustness extension?
>
> From Daniel's description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree to what you said below.
>
> Christian.
>
> Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish
> but not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> --
> *From:* Koenig, Christian
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter; Daniel Stone; Olsak, Marek; Grodzovsky, Andrey
> *Cc:* Rob Clark; Sharma, Shashank; Christian König; Somalapuram, Amaranath;
> Abhinav Kumar; dri-devel; amd-gfx list; Deucher, Alexander; Shashank Sharma
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König  wrote:
>  Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserable.
>
> In other words currently a GPU reset has something like a 99% chance to
> take down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Marek Olšák
I don't think fork() would work with userspace where all buffers are
shared. It certainly doesn't work now. The driver needs to be notified that
a buffer or texture is shared to ensure data coherency between processes,
and the driver must execute decompression and other render passes when a
buffer or texture is being shared for the first time. Those aren't called
when fork() is called.

Marek

On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling 
wrote:

> Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> > Am 17.01.22 um 15:17 schrieb Felix Kuehling:
> >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
> >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>  Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> > Am 14.01.22 um 17:44 schrieb Daniel Vetter:
> >> Top post because I tried to catch up on the entire discussion here.
> >>
> >> So fundamentally I'm not opposed to just close this fork() hole
> >> once and
> >> for all. The thing that worries me from a upstream/platform pov is
> >> really
> >> only if we don't do it consistently across all drivers.
> >>
> >> So maybe as an idea:
> >> - Do the original patch, but not just for ttm but all gem rendernode
> >>  drivers at least (or maybe even all gem drivers, no idea), with
> >> the
> >>  below discussion cleaned up as justification.
> > I know of at least one use case which this will break.
> >
> > A couple of years back we had a discussion on the Mesa mailing list
> > because (IIRC) Marek introduced a background thread to push command
> > submissions to the kernel.
> >
> > That broke because some compositor used to initialize OpenGL and then
> > do a fork(). This indeed worked previously (no GPUVM at that time),
> > but with the addition of the backround thread obviously broke.
> >
> > The conclusion back then was that the compositor is broken and needs
> > fixing, but it still essentially means that there could be people out
> > there with really old userspace where this setting would just break
> > the desktop.
> >
> > I'm not really against that change either, but at least in theory we
> > could make fork() work perfectly fine even with VMs and background
> > threads.
>  You may regret this if you ever try to build a shared virtual address
>  space between GPU and CPU. Then you have two processes (parent and
>  child) sharing the same render context and GPU VM address space.
>  But the
>  CPU address spaces are different. You can't maintain consistent shared
>  virtual address spaces for both processes when the GPU address
>  space is
>  shared between them.
> >>> That's actually not much of a problem.
> >>>
> >>> All you need to do is to use pthread_atfork() and do the appropriate
> >>> action in parent/child to clean up your context:
> >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
> >> Thunk already does that. However, it's not foolproof. pthread_atfork
> >> handlers aren't called when the process is forked with a clone call.
> >
> > Yeah, but that's perfectly intentional. clone() is usually used to
> > create threads.
>
> Clone can be used to create new processes. Maybe not the common use today.
>
>
> >
> >>> The rest is just to make sure that all shared and all private data are
> >>> kept separate all the time. Sharing virtual memory is already done for
> >>> decades this way, it's just that nobody ever did it with a statefull
> >>> device like GPUs.
> >> My concern is not with sharing or not sharing data. It's with sharing
> >> the address space itself. If you share the render node, you share GPU
> >> virtual address space. However CPU address space is not shared between
> >> parent and child. That's a fundamental mismatch between the CPU world
> >> and current GPU driver implementation.
> >
> > Correct, but even that is easily solvable. As I said before you can
> > hang this state on a VMA and let it be cloned together with the CPU
> > address space.
>
> I'm not following. The address space I'm talking about is struct
> amdgpu_vm. It's associated with the render node file descriptor.
> Inheriting and using that file descriptor in the child inherits the
> amdgpu_vm. I don't see how you can hang that state on any one VMA.
>
> To be consistent with the CPU, you'd need to clone the GPU address space
> (struct amdgpu_vm) in the child process. That means you need a new
> render node file descriptor that imports all the BOs from the parent
> address space. It's a bunch of extra work to fork a process, that you're
> proposing to immediately undo with an atfork handler. So I really don't
> see the point.
>
> Regards,
>   Felix
>
>
> >
> > Since VMAs are informed about their cloning (in contrast to file
> > descriptors) it's trivial to even just clone kernel data on first access.
> >
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >>Felix
> >>
> >>
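
A minimal sketch of the pthread_atfork() approach referenced above, with
Felix's caveat that such handlers are not run when a new process is created
through a raw clone() (the gpu_ctx_invalidate_in_child() hook is
hypothetical):

#include <pthread.h>

/* Hypothetical driver hook: drop GPU context state inherited by the child. */
extern void gpu_ctx_invalidate_in_child(void);

static void child_after_fork(void)
{
        /* Runs in the child right after fork(); the parent's render
         * context and GPU VM must not be reused from here. */
        gpu_ctx_invalidate_in_child();
}

__attribute__((constructor))
static void install_fork_handlers(void)
{
        /* No prepare/parent work needed in this sketch. */
        pthread_atfork(NULL, NULL, child_after_fork);
}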

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Marek Olšák
Timeline semaphore waits (polling on memory) will be unmonitored and as
fast as the roundtrip to memory. Semaphore writes will be slower because
the copy of those write requests will also be forwarded to the kernel.
Arbitrary writes are not protected by the hw but the kernel will take
action against such behavior because it will receive them too.
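
For illustration, an unmonitored wait of that kind can be as simple as
polling the semaphore's memory (a sketch; a real implementation would add
backoff or a futex-style sleep):

#include <stdatomic.h>
#include <stdint.h>

/* Wait until the timeline semaphore's 64-bit payload reaches 'point'. */
static void wait_timeline_point(const _Atomic uint64_t *sem, uint64_t point)
{
        while (atomic_load_explicit(sem, memory_order_acquire) < point)
                ; /* spin: the cost is just the round trip to memory */
}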

I don't know if that would work with dma_fence.

Marek


On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter  wrote:

> On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> > The kernel will know who should touch the implicit-sync semaphore next,
> and
> > at the same time, the copy of all write requests to the implicit-sync
> > semaphore will be forwarded to the kernel for monitoring and bo_wait.
> >
> > Syncobjs could either use the same monitored access as implicit sync or
> be
> > completely unmonitored. We haven't decided yet.
> >
> > Syncfiles could either use one of the above or wait for a syncobj to go
> > idle before converting to a syncfile.
>
> Hm this sounds all like you're planning to completely rewrap everything
> ... I'm assuming the plan is still that this is going to be largely
> wrapped in dma_fence? Maybe with timeline objects being a bit more
> optimized, but I'm not sure how much you can optimize without breaking the
> interfaces.
> -Daniel
>
> >
> > Marek
> >
> >
> >
> > On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:
> >
> > > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > > As long as we can figure out who touched to a certain sync object
> last
> > > that
> > > > would indeed work, yes.
> > >
> > > Don't you need to know who will touch it next, i.e. who is holding up
> your
> > > fence? Or maybe I'm just again totally confused.
> > > -Daniel
> > >
> > > >
> > > > Christian.
> > > >
> > > > Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > > > > The call to the hw scheduler has a limitation on the size of all
> > > > > parameters combined. I think we can only pass a 32-bit sequence
> number
> > > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not
> much
> > > > > else.
> > > > >
> > > > > The syncobj handle can be an element index in a global (per-GPU)
> > > syncobj
> > > > > table and it's read only for all processes with the exception of
> the
> > > > > signal command. Syncobjs can either have per VMID write access
> flags
> > > for
> > > > > the signal command (slow), or any process can write to any
> syncobjs and
> > > > > only rely on the kernel checking the write log (fast).
> > > > >
> > > > > In any case, we can execute the memory write in the queue engine
> and
> > > > > only use the hw scheduler for logging, which would be perfect.
> > > > >
> > > > > Marek
> > > > >
> > > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König  wrote:
> > > > >
> > > > > Hi guys,
> > > > >
> > > > > maybe soften that a bit. Reading from the shared memory of the
> > > > > user fence is ok for everybody. What we need to take more care
> of
> > > > > is the writing side.
> > > > >
> > > > > So my current thinking is that we allow read only access, but
> > > > > writing a new sequence value needs to go through the
> > > scheduler/kernel.
> > > > >
> > > > > So when the CPU wants to signal a timeline fence it needs to
> call
> > > > > an IOCTL. When the GPU wants to signal the timeline fence it
> needs
> > > > > to hand that off to the hardware scheduler.
> > > > >
> > > > > If we lock up, the kernel can check with the hardware who did the
> > > > >     last write and what value was written.
> > > > >
> > > > > That together with an IOCTL to give out sequence number for
> > > > > implicit sync to applications should be sufficient for the
> kernel
> > > > > to track who is responsible if something bad happens.
> > > > >
> > > > > In other words when the hardware says that the shader wrote
> stuff

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Marek Olšák
The kernel will know who should touch the implicit-sync semaphore next, and
at the same time, the copy of all write requests to the implicit-sync
semaphore will be forwarded to the kernel for monitoring and bo_wait.

Syncobjs could either use the same monitored access as implicit sync or be
completely unmonitored. We haven't decided yet.

Syncfiles could either use one of the above or wait for a syncobj to go
idle before converting to a syncfile.

Marek



On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:

> On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > As long as we can figure out who touched a certain sync object last
> that
> > would indeed work, yes.
>
> Don't you need to know who will touch it next, i.e. who is holding up your
> fence? Or maybe I'm just again totally confused.
> -Daniel
>
> >
> > Christian.
> >
> > Am 14.06.21 um 19:10 schrieb Marek Olšák:
> > > The call to the hw scheduler has a limitation on the size of all
> > > parameters combined. I think we can only pass a 32-bit sequence number
> > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > else.
> > >
> > > The syncobj handle can be an element index in a global (per-GPU)
> syncobj
> > > table and it's read only for all processes with the exception of the
> > > signal command. Syncobjs can either have per VMID write access flags
> for
> > > the signal command (slow), or any process can write to any syncobjs and
> > > only rely on the kernel checking the write log (fast).
> > >
> > > In any case, we can execute the memory write in the queue engine and
> > > only use the hw scheduler for logging, which would be perfect.
> > >
> > > Marek
> > >
> > > On Thu, Jun 10, 2021 at 12:33 PM Christian König  wrote:
> > >
> > > Hi guys,
> > >
> > > maybe soften that a bit. Reading from the shared memory of the
> > > user fence is ok for everybody. What we need to take more care of
> > > is the writing side.
> > >
> > > So my current thinking is that we allow read only access, but
> > > writing a new sequence value needs to go through the
> scheduler/kernel.
> > >
> > > So when the CPU wants to signal a timeline fence it needs to call
> > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > to hand that off to the hardware scheduler.
> > >
> > > If we lock up, the kernel can check with the hardware who did the
> > > last write and what value was written.
> > >
> > > That together with an IOCTL to give out sequence number for
> > > implicit sync to applications should be sufficient for the kernel
> > > to track who is responsible if something bad happens.
> > >
> > > In other words when the hardware says that the shader wrote stuff
> > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > who did that.
> > >
> > > If the hardware says that seq - 1 was written fine, but seq is
> > > missing then the kernel blames whoever was supposed to write seq.
> > >
> > > Just piping the write through a privileged instance should be
> > > fine to make sure that we don't run into issues.
> > >
> > > Christian.
> > >
> > > Am 10.06.21 um 17:59 schrieb Marek Olšák:
> > > > Hi Daniel,
> > > >
> > > > We just talked about this whole topic internally and we came
> > > > to the conclusion that the hardware needs to understand sync
> > > > object handles and have high-level wait and signal operations in
> > > > the command stream. Sync objects will be backed by memory, but
> > > > they won't be readable or writable by processes directly. The
> > > > hardware will log all accesses to sync objects and will send the
> > > > log to the kernel periodically. The kernel will identify
> > > > malicious behavior.
> > > >
> > > > Example of a hardware command stream:
> > > > ...
> > > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > > number is assigned by the kernel
> > > > Draw();
> > > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > ...
> > > >

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-14 Thread Marek Olšák
The call to the hw scheduler has a limitation on the size of all parameters
combined. I think we can only pass a 32-bit sequence number and a ~16-bit
global (per-GPU) syncobj handle in one call and not much else.

The syncobj handle can be an element index in a global (per-GPU) syncobj
table and it's read only for all processes with the exception of the signal
command. Syncobjs can either have per VMID write access flags for the
signal command (slow), or any process can write to any syncobjs and only
rely on the kernel checking the write log (fast).

In any case, we can execute the memory write in the queue engine and only
use the hw scheduler for logging, which would be perfect.

Marek

On Thu, Jun 10, 2021 at 12:33 PM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Hi guys,
>
> maybe soften that a bit. Reading from the shared memory of the user fence
> is ok for everybody. What we need to take more care of is the writing side.
>
> So my current thinking is that we allow read only access, but writing a
> new sequence value needs to go through the scheduler/kernel.
>
> So when the CPU wants to signal a timeline fence it needs to call an
> IOCTL. When the GPU wants to signal the timeline fence it needs to hand
> that off to the hardware scheduler.
>
> If we lock up, the kernel can check with the hardware who did the last write
> and what value was written.
>
> That together with an IOCTL to give out sequence number for implicit sync
> to applications should be sufficient for the kernel to track who is
> responsible if something bad happens.
>
> In other words when the hardware says that the shader wrote stuff like
> 0xdeadbeef 0x0 or 0x into memory we kill the process who did that.
>
> If the hardware says that seq - 1 was written fine, but seq is missing
> then the kernel blames whoever was supposed to write seq.
>
> Just piping the write through a privileged instance should be fine to
> make sure that we don't run into issues.
>
> Christian.
>
> Am 10.06.21 um 17:59 schrieb Marek Olšák:
>
> Hi Daniel,
>
> We just talked about this whole topic internally and we came to the
> conclusion that the hardware needs to understand sync object handles and
> have high-level wait and signal operations in the command stream. Sync
> objects will be backed by memory, but they won't be readable or writable by
> processes directly. The hardware will log all accesses to sync objects and
> will send the log to the kernel periodically. The kernel will identify
> malicious behavior.
>
> Example of a hardware command stream:
> ...
> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
> assigned by the kernel
> Draw();
> ImplicitSyncSignalWhenDone(syncObjHandle);
> ...
>
> I'm afraid we have no other choice because of the TLB invalidation
> overhead.
>
> Marek
>
>
> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  wrote:
>
>> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>> > > [SNIP]
>> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
>> > > >
>> > > > The lightweight can be used when you are sure that you don't have
>> any of the
>> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
>> > > > invalidate the TLB.
>> > > >
>> > > > The heavyweight must be used when you need to invalidate the TLB
>> *AND* make
>> > > > sure that no concurrent operation moves new stuff into the TLB.
>> > > >
>> > > > The problem is for this use case we have to use the heavyweight one.
>> > > Just for my own curiosity: So the lightweight flush is only for
>> in-between
>> > > CS when you know access is idle? Or does that also not work if
>> userspace
>> > > has a CS on a dma engine going at the same time because the TLBs aren't
>> > > isolated enough between engines?
>> >
>> > More or less correct, yes.
>> >
>> > The problem is a lightweight flush only invalidates the TLB, but doesn't
>> > take care of entries which have been handed out to the different
>> engines.
>> >
>> > In other words what can happen is the following:
>> >
>> > 1. Shader asks TLB to resolve address X.
>> > 2. TLB looks into its cache and can't find address X so it asks the
>> walker
>> > to resolve.
>> > 3. Walker comes back with result for address X and TLB puts that into
>> its
>> > cache and gives it to Shader.
>> >

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-10 Thread Marek Olšák
Hi Daniel,

We just talked about this whole topic internally and we came to the
conclusion that the hardware needs to understand sync object handles and
have high-level wait and signal operations in the command stream. Sync
objects will be backed by memory, but they won't be readable or writable by
processes directly. The hardware will log all accesses to sync objects and
will send the log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation overhead.

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  wrote:

> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > > [SNIP]
> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > >
> > > > The lightweight can be used when you are sure that you don't have any
> of the
> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > > invalidate the TLB.
> > > >
> > > > The heavyweight must be used when you need to invalidate the TLB
> *AND* make
> > > > sure that no concurrent operation moves new stuff into the TLB.
> > > >
> > > > The problem is for this use case we have to use the heavyweight one.
> > > Just for my own curiosity: So the lightweight flush is only for
> in-between
> > > CS when you know access is idle? Or does that also not work if
> userspace
> > > has a CS on a dma engine going at the same time because the TLBs aren't
> > > isolated enough between engines?
> >
> > More or less correct, yes.
> >
> > The problem is a lightweight flush only invalidates the TLB, but doesn't
> > take care of entries which have been handed out to the different engines.
> >
> > In other words what can happen is the following:
> >
> > 1. Shader asks TLB to resolve address X.
> > 2. TLB looks into its cache and can't find address X so it asks the
> walker
> > to resolve.
> > 3. Walker comes back with result for address X and TLB puts that into its
> > cache and gives it to Shader.
> > 4. Shader starts doing some operation using result for address X.
> > 5. You send lightweight TLB invalidate and TLB throws away cached values
> for
> > address X.
> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> > access address X
> >
> > See it like the shader has its own 1-entry L0 TLB cache which is not
> > affected by the lightweight flush.
> >
> > The heavyweight flush on the other hand sends out a broadcast signal to
> > everybody and only comes back when we are sure that an address is not in
> use
> > any more.
>
> Ah makes sense. On intel the shaders only operate in VA, everything goes
> around as explicit async messages to IO blocks. So we don't have this, the
> only difference in tlb flushes is between tlb flush in the IB and an mmio
> one which is independent for anything currently being executed on an
> engine.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>
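
To make the lightweight/heavyweight distinction above concrete, here is a
rough sketch (the function names are invented for illustration, not actual
driver code):

/* Pick the flush type per the discussion above. */
void invalidate_range(struct vm *vm, u64 start, u64 size)
{
        if (vm_engines_idle(vm)) {
                /* No PTEs in flight: dropping cached translations suffices. */
                tlb_flush_lightweight(vm, start, size);
        } else {
                /* Translations may already be handed out to engines (the
                 * per-shader "L0" case), so broadcast and wait until nothing
                 * uses the range any more. */
                tlb_flush_heavyweight(vm, start, size);
        }
}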


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
On Thu., Jun. 3, 2021, 15:18 Daniel Vetter,  wrote:

> On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák  wrote:
> >
> > Daniel, I think what you are suggesting is that we need to enable user
> queues with the drm scheduler and dma_fence first, and once that works, we
> can investigate how much of that kernel logic can be moved to the hw. Would
> that work? In theory it shouldn't matter whether the kernel does it or the
> hw does it. It's the same code, just in a different place.
>
> Yeah I guess that's another way to look at it. Maybe in practice
> you'll just move it from the kernel to userspace, which then programs
> the hw waits directly into its IB. That's at least how I'd do it on
> i915, assuming I'd have such hw. So these fences that userspace
> programs directly (to sync within itself) won't even show up as
> dependencies in the kernel.
>
> And then yes on the other side you can lift work from the
> drm/scheduler wrt dependencies you get in the kernel (whether explicit
> sync with sync_file, or implicit sync fished out of dma_resv) and
> program the hw directly that way. That would mean that userspace won't
> fill the ringbuffer directly, but the kernel would do that, so that
> you have space to stuff in the additional waits. Again assuming i915
> hw model, maybe works differently on amd. Iirc we have some of that
> already in the i915 scheduler, but I'd need to recheck how much it
> really uses the hw semaphores.
>

I was thinking we would pass per process syncobj handles and buffer handles
into commands in the user queue, or something equivalent. We do have a
large degree of programmability in the hw that we can do something like
that. The question is whether this high level user->hw interface would have
any advantage over trivial polling on memory, etc. My impression is no
because the kernel would be robust enough that it wouldn't matter what
userspace does, but I don't know. Anyway, all we need is user queues and
what you proposed seems totally sufficient.

Marek

-Daniel
>
> > Thanks,
> > Marek
> >
> > On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter  wrote:
> >>
> >> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
> >> >
> >> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
> >> >>
> >> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter 
> wrote:
> >> >> >
> >> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter 
> wrote:
> >> >> > > >
> >> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <
> mar...@gmail.com> wrote:
> >> >> > > > > >
> >> >> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> >> >> > > things
> >> >> > > > > > > for us. It's pretty much all NAK'd already. We are
> trying to gather
> >> >> > > > > more
> >> >> > > > > > > knowledge and then make better decisions.
> >> >> > > > > > >
> >> >> > > > > > > The idea we are considering is that we'll expose
> memory-based sync
> >> >> > > > > objects
> >> >> > > > > > > to userspace for read only, and the kernel or hw will
> strictly
> >> >> > > control
> >> >> > > > > the
> >> >> > > > > > > memory writes to those sync objects. The hole in that
> idea is that
> >> >> > > > > > > userspace can decide not to signal a job, so even if
> userspace
> >> >> > > can't
> >> >> > > > > > > overwrite memory-based sync object states arbitrarily,
> it can still
> >> >> > > > > decide
> >> >> > > > > > > not to signal them, and then a future fence is born.
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > > This would actually be treated as a GPU hang caused by
> that context,
> >> >> > > so
> >> >> > > > > it
> >> >> > > > > > should be fine

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
Daniel, I think what you are suggesting is that we need to enable user
queues with the drm scheduler and dma_fence first, and once that works, we
can investigate how much of that kernel logic can be moved to the hw. Would
that work? In theory it shouldn't matter whether the kernel does it or the
hw does it. It's the same code, just in a different place.

Thanks,
Marek

On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter  wrote:

> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
> >
> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
> >>
> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
> >> >
> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter 
> wrote:
> >> > > >
> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák 
> wrote:
> >> > > > > >
> >> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> >> > > things
> >> > > > > > > for us. It's pretty much all NAK'd already. We are trying
> to gather
> >> > > > > more
> >> > > > > > > knowledge and then make better decisions.
> >> > > > > > >
> >> > > > > > > The idea we are considering is that we'll expose
> memory-based sync
> >> > > > > objects
> >> > > > > > > to userspace for read only, and the kernel or hw will
> strictly
> >> > > control
> >> > > > > the
> >> > > > > > > memory writes to those sync objects. The hole in that idea
> is that
> >> > > > > > > userspace can decide not to signal a job, so even if
> userspace
> >> > > can't
> >> > > > > > > overwrite memory-based sync object states arbitrarily, it
> can still
> >> > > > > decide
> >> > > > > > > not to signal them, and then a future fence is born.
> >> > > > > > >
> >> > > > > >
> >> > > > > > This would actually be treated as a GPU hang caused by that
> context,
> >> > > so
> >> > > > > it
> >> > > > > > should be fine.
> >> > > > >
> >> > > > > This is practically what I proposed already, except you're not
> doing it
> >> > > with
> >> > > > > dma_fence. And on the memory fence side this also doesn't
> actually give
> >> > > > > what you want for that compute model.
> >> > > > >
> >> > > > > This seems like a bit of a worst-of-both-worlds approach to me?
> Tons of
> >> > > work
> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and still
> pain to
> >> > > > > actually drive the hardware like it should be for compute or
> direct
> >> > > > > display.
> >> > > > >
> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
> >> > > suggestion
> >> > > > > how to fake the entire dma_fence stuff on top of new hw. Would
> be
> >> > > > > interesting to know what doesn't work there instead of amd
> folks going
> >> > > off
> >> > > > > into internal again and then coming back with another rfc from
> out of
> >> > > > > nowhere :-)
> >> > > > >
> >> > > >
> >> > > > Going internal again is probably a good idea to spare you the long
> >> > > > discussions and not waste your time, but we haven't talked about
> the
> >> > > > dma_fence stuff internally other than acknowledging that it can be
> >> > > solved.
> >> > > >
> >> > > > The compute use case already uses the hw as-is with no
> inter-process
> >> > > > sharing, which mostly keeps the kernel out of the picture. It uses
> >> > > glFinish
> >> > > > to sync with GL.
> >> > > >
> >> > > > The gfx use case needs new hardware logic to support implicit

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:

> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
> >
> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter 
> wrote:
> > > >
> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák 
> wrote:
> > > > > >
> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> > > things
> > > > > > > for us. It's pretty much all NAK'd already. We are trying to
> gather
> > > > > more
> > > > > > > knowledge and then make better decisions.
> > > > > > >
> > > > > > > The idea we are considering is that we'll expose memory-based
> sync
> > > > > objects
> > > > > > > to userspace for read only, and the kernel or hw will strictly
> > > control
> > > > > the
> > > > > > > memory writes to those sync objects. The hole in that idea is
> that
> > > > > > > userspace can decide not to signal a job, so even if userspace
> > > can't
> > > > > > > overwrite memory-based sync object states arbitrarily, it can
> still
> > > > > decide
> > > > > > > not to signal them, and then a future fence is born.
> > > > > > >
> > > > > >
> > > > > > This would actually be treated as a GPU hang caused by that
> context,
> > > so
> > > > > it
> > > > > > should be fine.
> > > > >
> > > > > This is practically what I proposed already, except you're not doing
> it
> > > with
> > > > > dma_fence. And on the memory fence side this also doesn't actually
> give
> > > > > what you want for that compute model.
> > > > >
> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons
> of
> > > work
> > > > > in the kernel to hide these not-dma_fence-but-almost, and still
> pain to
> > > > > actually drive the hardware like it should be for compute or direct
> > > > > display.
> > > > >
> > > > > Also maybe I've missed it, but I didn't see any replies to my
> > > suggestion
> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > interesting to know what doesn't work there instead of amd folks
> going
> > > off
> > > > > into internal again and then coming back with another rfc from out
> of
> > > > > nowhere :-)
> > > > >
> > > >
> > > > Going internal again is probably a good idea to spare you the long
> > > > discussions and not waste your time, but we haven't talked about the
> > > > dma_fence stuff internally other than acknowledging that it can be
> > > solved.
> > > >
> > > > The compute use case already uses the hw as-is with no inter-process
> > > > sharing, which mostly keeps the kernel out of the picture. It uses
> > > glFinish
> > > > to sync with GL.
> > > >
> > > > The gfx use case needs new hardware logic to support implicit and
> > > explicit
> > > > sync. When we propose a solution, it's usually torn apart the next
> day by
> > > > ourselves.
> > > >
> > > > Since we are talking about next hw or next next hw, preemption
> should be
> > > > better.
> > > >
> > > > user queue = user-mapped ring buffer
> > > >
> > > > For implicit sync, we will only let userspace lock access to a buffer
> > > via a
> > > > user queue, which waits for the per-buffer sequence counter in
> memory to
> > > be
> > > > >= the number assigned by the kernel, and later unlock the access
> with
> > > > another command, which increments the per-buffer sequence counter in
> > > memory
> > > > with atomic_inc regardless of the number assigned by the kernel. The
> > > kernel
> > > > counter and the counter in memory can be out-of-sync, and I'll
> explain
> > > why
> > > > it's OK. If a process incre

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:

> On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
> >
> > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > >
> > > > > Yes, we can't break anything because we don't want to complicate
> things
> > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > more
> > > > > knowledge and then make better decisions.
> > > > >
> > > > > The idea we are considering is that we'll expose memory-based sync
> > > objects
> > > > > to userspace for read only, and the kernel or hw will strictly
> control
> > > the
> > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > userspace can decide not to signal a job, so even if userspace
> can't
> > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > decide
> > > > > not to signal them, and then a future fence is born.
> > > > >
> > > >
> > > > This would actually be treated as a GPU hang caused by that context,
> so
> > > it
> > > > should be fine.
> > >
> > > This is practically what I proposed already, except you're not doing it
> with
> > > dma_fence. And on the memory fence side this also doesn't actually give
> > > what you want for that compute model.
> > >
> > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of
> work
> > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > actually drive the hardware like it should be for compute or direct
> > > display.
> > >
> > > Also maybe I've missed it, but I didn't see any replies to my
> suggestion
> > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > interesting to know what doesn't work there instead of amd folks going
> off
> > > into internal again and then coming back with another rfc from out of
> > > nowhere :-)
> > >
> >
> > Going internal again is probably a good idea to spare you the long
> > discussions and not waste your time, but we haven't talked about the
> > dma_fence stuff internally other than acknowledging that it can be
> solved.
> >
> > The compute use case already uses the hw as-is with no inter-process
> > sharing, which mostly keeps the kernel out of the picture. It uses
> glFinish
> > to sync with GL.
> >
> > The gfx use case needs new hardware logic to support implicit and
> explicit
> > sync. When we propose a solution, it's usually torn apart the next day by
> > ourselves.
> >
> > Since we are talking about next hw or next next hw, preemption should be
> > better.
> >
> > user queue = user-mapped ring buffer
> >
> > For implicit sync, we will only let userspace lock access to a buffer
> via a
> > user queue, which waits for the per-buffer sequence counter in memory to
> be
> > >= the number assigned by the kernel, and later unlock the access with
> > another command, which increments the per-buffer sequence counter in
> memory
> > with atomic_inc regardless of the number assigned by the kernel. The
> kernel
> > counter and the counter in memory can be out-of-sync, and I'll explain
> why
> > it's OK. If a process increments the kernel counter but not the memory
> > counter, that's its problem and it's the same as a GPU hang caused by
> that
> > process. If a process increments the memory counter but not the kernel
> > counter, the ">=" condition alongside atomic_inc guarantees that
> signaling n
> > will signal n+1, so it will never deadlock but also it will effectively
> > disable synchronization. This method of disabling synchronization is
> > similar to a process corrupting the buffer, which should be fine. Can you
> > find any flaw in it? I can't find any.
>
> Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
> of "we let userspace free-wheel whatever it wants, kernel ensures
> correctness of the resulting chain of dma_fence with reset the entire
> context" is what I proposed too.
>
> Like you say, userspace is allowed to render garbage already.
>
> > The explicit submit can be done by userspace (if there is no
> >

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:

> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> >
> > > Yes, we can't break anything because we don't want to complicate things
> > > for us. It's pretty much all NAK'd already. We are trying to gather
> more
> > > knowledge and then make better decisions.
> > >
> > > The idea we are considering is that we'll expose memory-based sync
> objects
> > > to userspace for read only, and the kernel or hw will strictly control
> the
> > > memory writes to those sync objects. The hole in that idea is that
> > > userspace can decide not to signal a job, so even if userspace can't
> > > overwrite memory-based sync object states arbitrarily, it can still
> decide
> > > not to signal them, and then a future fence is born.
> > >
> >
> > This would actually be treated as a GPU hang caused by that context, so
> it
> > should be fine.
>
> This is practically what I proposed already, except you're not doing it with
> dma_fence. And on the memory fence side this also doesn't actually give
> what you want for that compute model.
>
> This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> in the kernel to hide these not-dma_fence-but-almost, and still pain to
> actually drive the hardware like it should be for compute or direct
> display.
>
> Also maybe I've missed it, but I didn't see any replies to my suggestion
> how to fake the entire dma_fence stuff on top of new hw. Would be
> interesting to know what doesn't work there instead of amd folks going off
> into internal again and then coming back with another rfc from out of
> nowhere :-)
>

Going internal again is probably a good idea to spare you the long
discussions and not waste your time, but we haven't talked about the
dma_fence stuff internally other than acknowledging that it can be solved.

The compute use case already uses the hw as-is with no inter-process
sharing, which mostly keeps the kernel out of the picture. It uses glFinish
to sync with GL.

The gfx use case needs new hardware logic to support implicit and explicit
sync. When we propose a solution, it's usually torn apart the next day by
ourselves.

Since we are talking about next hw or next next hw, preemption should be
better.

user queue = user-mapped ring buffer

For implicit sync, we will only let userspace lock access to a buffer via a
user queue, which waits for the per-buffer sequence counter in memory to be
>= the number assigned by the kernel, and later unlock the access with
another command, which increments the per-buffer sequence counter in memory
with atomic_inc regardless of the number assigned by the kernel. The kernel
counter and the counter in memory can be out-of-sync, and I'll explain why
it's OK. If a process increments the kernel counter but not the memory
counter, that's its problem and it's the same as a GPU hang caused by that
process. If a process increments the memory counter but not the kernel
counter, the ">=" condition alongside atomic_inc guarantees that signaling n
will signal n+1, so it will never deadlock but also it will effectively
disable synchronization. This method of disabling synchronization is
similar to a process corrupting the buffer, which should be fine. Can you
find any flaw in it? I can't find any.
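
As a sketch, the lock/unlock around buffer access in the user queue could
look like this (the command names are made up; the essential parts are the
">=" wait and the unconditional atomic increment):

WAIT_MEM_GEQ(buf->seq_addr, kernel_assigned_seq); /* acquire: wait our turn */
/* ... commands that access the buffer ... */
ATOMIC_INC_MEM(buf->seq_addr); /* release: increments regardless of the
                                * kernel counter, so a skipped signal degrades
                                * to "no sync" instead of a deadlock */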

The explicit submit can be done by userspace (if there is no
synchronization), but we plan to use the kernel to do it for implicit sync.
Essentially, the kernel will receive a buffer list and addresses of wait
commands in the user queue. It will assign new sequence numbers to all
buffers and write those numbers into the wait commands, and ring the hw
doorbell to start execution of that queue.
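
In kernel pseudo-code, that implicit-sync submit might look like this
(hypothetical names, in the same spirit as the description above):

for_each_buffer(bo, buffer_list) {
        seq = ++bo->kernel_seq;                          /* assign a new number */
        patch_user_queue(queue, bo->wait_cmd_addr, seq); /* fill in the wait */
}
ring_doorbell(queue);                                    /* start execution */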

Marek


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
On Wed, Jun 2, 2021 at 5:44 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 02.06.21 um 10:57 schrieb Daniel Stone:
> > Hi Christian,
> >
> > On Tue, 1 Jun 2021 at 13:51, Christian König
> >  wrote:
> >> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> >>> If you want to enable this use-case with driver magic and without the
> >>> compositor being aware of what's going on, the solution is EGLStreams.
> >>> Not sure we want to go there, but it's definitely a lot more feasible
> >>> than trying to stuff eglstreams semantics into dma-buf implicit
> >>> fencing support in a desperate attempt to not change compositors.
> >> Well not changing compositors is certainly not something I would try
> >> with this use case.
> >>
> >> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> >> to support that with the newest hardware generation.
> > Serious question, have you talked to Canonical?
> >
> > I mean there's a hell of a lot of effort being expended here, but it
> > seems to all be predicated on the assumption that Ubuntu's LTS
> > HWE/backport policy is totally immutable, and that we might need to
> > make the kernel do backflips to work around that. But ... is it? Has
> > anyone actually asked them how they feel about this?
>
> This was merely an example. What I wanted to say is that we need to
> support systems already deployed.
>
> In other words our customers won't accept that they need to replace the
> compositor just because they switch to a new hardware generation.
>
> > I mean, my answer to the first email is 'no, absolutely not' from the
> > technical perspective (the initial proposal totally breaks current and
> > future userspace), from a design perspective (it breaks a lot of
> > usecases which aren't single-vendor GPU+display+codec, or aren't just
> > a simple desktop), and from a sustainability perspective (cutting
> > Android adrift again isn't acceptable collateral damage to make it
> > easier to backport things to last year's Ubuntu release).
> >
> > But then again, I don't even know what I'm NAKing here ... ? The
> > original email just lists a proposal to break a ton of things, with
> > proposed replacements which aren't technically viable, and it's not
> > clear why? Can we please have some more details and some reasoning
> > behind them?
> >
> > I don't mind that userspace (compositor, protocols, clients like Mesa
> > as well as codec APIs) need to do a lot of work to support the new
> > model. I do really care though that the hard-binary-switch model works
> > fine enough for AMD but totally breaks heterogeneous systems and makes
> > it impossible for userspace to do the right thing.
>
> Well how the handling for new Android, distributions etc... is going to
> look like is a completely different story.
>
> And I completely agree with both Daniel Vetter and you that we need to
> keep this in mind when designing the compatibility with older software.
>
> For Android I'm really not sure what to do. In general Android is
> already trying to do the right thing by using explicit sync, the problem
> is that this is built around the idea that this explicit sync is
> syncfile kernel based.
>
> Either we need to change Android and come up with something that works
> with user fences as well or we somehow invent a compatibility layer for
> syncfile as well.
>

What's the issue with syncfiles that syncobjs don't suffer from?

Marek


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:

> Yes, we can't break anything because we don't want to complicate things
> for us. It's pretty much all NAK'd already. We are trying to gather more
> knowledge and then make better decisions.
>
> The idea we are considering is that we'll expose memory-based sync objects
> to userspace for read only, and the kernel or hw will strictly control the
> memory writes to those sync objects. The hole in that idea is that
> userspace can decide not to signal a job, so even if userspace can't
> overwrite memory-based sync object states arbitrarily, it can still decide
> not to signal them, and then a future fence is born.
>

This would actually be treated as a GPU hang caused by that context, so it
should be fine.

Marek


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
Yes, we can't break anything because we don't want to complicate things for
us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.

Marek

On Wed, Jun 2, 2021 at 4:57 AM Daniel Stone  wrote:

> Hi Christian,
>
> On Tue, 1 Jun 2021 at 13:51, Christian König
>  wrote:
> > Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > > If you want to enable this use-case with driver magic and without the
> > > compositor being aware of what's going on, the solution is EGLStreams.
> > > Not sure we want to go there, but it's definitely a lot more feasible
> > > than trying to stuff eglstreams semantics into dma-buf implicit
> > > fencing support in a desperate attempt to not change compositors.
> >
> > Well not changing compositors is certainly not something I would try
> > with this use case.
> >
> > Not changing compositors is more like ok we have Ubuntu 20.04 and need
> > to support that with the newest hardware generation.
>
> Serious question, have you talked to Canonical?
>
> I mean there's a hell of a lot of effort being expended here, but it
> seems to all be predicated on the assumption that Ubuntu's LTS
> HWE/backport policy is totally immutable, and that we might need to
> make the kernel do backflips to work around that. But ... is it? Has
> anyone actually asked them how they feel about this?
>
> I mean, my answer to the first email is 'no, absolutely not' from the
> technical perspective (the initial proposal totally breaks current and
> future userspace), from a design perspective (it breaks a lot of
> usecases which aren't single-vendor GPU+display+codec, or aren't just
> a simple desktop), and from a sustainability perspective (cutting
> Android adrift again isn't acceptable collateral damage to make it
> easier to backport things to last year's Ubuntu release).
>
> But then again, I don't even know what I'm NAKing here ... ? The
> original email just lists a proposal to break a ton of things, with
> proposed replacements which aren't technically viable, and it's not
> clear why? Can we please have some more details and some reasoning
> behind them?
>
> I don't mind that userspace (compositor, protocols, clients like Mesa
> as well as codec APIs) need to do a lot of work to support the new
> model. I do really care though that the hard-binary-switch model works
> fine enough for AMD but totally breaks heterogeneous systems and makes
> it impossible for userspace to do the right thing.
>
> Cheers,
> Daniel
>


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Marek Olšák
On Tue., Jun. 1, 2021, 08:51 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > On Tue, Jun 1, 2021 at 2:10 PM Christian König
> >  wrote:
> >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >>> On 2021-06-01 12:21 p.m., Christian König wrote:
> >>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>>>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >>>>>> 3) Compositors (and other privileged processes, and display
> flipping) can't trust imported/exported fences. They need a timeout
> recovery mechanism from the beginning, and the following are some possible
> solutions to timeouts:
> >>>>>>
> >>>>>> a) use a CPU wait with a small absolute timeout, and display the
> previous content on timeout
> >>>>>> b) use a GPU wait with a small absolute timeout, and conditional
> rendering will choose between the latest content (if signalled) and
> previous content (if timed out)
> >>>>>>
> >>>>>> The result would be that the desktop can run close to 60 fps even
> if an app runs at 1 fps.
> >>>>> FWIW, this is working with
> >>>>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even
> with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to
> provide the same dma-buf poll semantics as other drivers and high priority
> GFX contexts via EGL_IMG_context_priority which can preempt lower priority
> ones).
> >>>> Yeah, that is really nice to have.
> >>>>
> >>>> One question is if you wait on the CPU or the GPU for the new surface
> to become available?
> >>> It's based on polling dma-buf fds, i.e. CPU.
> >>>
> >>>> The former is a bit bad for latency and power management.
> >>> There isn't a choice for Wayland compositors in general, since there
> can be arbitrary other state which needs to be applied atomically together
> with the new buffer. (Though in theory, a compositor might get fancy and
> special-case surface commits which can be handled by waiting on the GPU)
> >>>
> >>> Latency is largely a matter of scheduling in the compositor. The
> latency incurred by the compositor shouldn't have to be more than
> single-digit milliseconds. (I've seen total latency from when the client
> starts processing a (static) frame to when it starts being scanned out as
> low as ~6 ms with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
> typical with Xorg)
> >> Well let me describe it like this:
> >>
> >> We have use cases for a 144 Hz guaranteed refresh rate. That
> >> essentially means that the client application needs to be able to spit
> >> out one frame/window content every ~6.9ms. That's tough, but doable.
> >>
> >> When you now add 6ms latency in the compositor that means the client
> >> application has only .9ms left for its frame, which is basically
> >> impossible to do.
> >>
> >> See for the user fences handling the display engine will learn to read
> >> sequence numbers from memory and decide on its own whether the old frame or
> >> the new one is scanned out. To get the latency there as low as possible.
> > This won't work with implicit sync at all.
> >
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.
>
> Well not changing compositors is certainly not something I would try
> with this use case.
>
> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> to support that with the newest hardware generation.
>
> > I still think the most reasonable approach here is that we wrap a
> > dma_fence compat layer/mode over new hw for existing
> > userspace/compositors. And then enable userspace memory fences and the
> > fancy new features those allow with a new model that's built for them.
>
> Yeah, that's basically the same direction I'm heading. Question is how
> to fix all those details.
>
> > Also even with dma_fence we could implement your model of staying with
> > the previous buffer (or an older buffer that's already rendered),
> > but it needs explicit involvement of the 

Re: Linux Graphics Next: Userspace submission update

2021-05-28 Thread Marek Olšák
My first email can be ignored except for the sync files. Oh well.

I think I see what you mean, Christian. If we assume that an imported fence
is always read only (the buffer with the sequence number is read only),
only the process that created and exported the fence can signal it. If the
fence is not signaled, the exporting process is guilty. The only thing the
importing process must do when it's about to use the fence as a dependency
is to notify the kernel about it. Thus, the kernel will always know the
dependency graph. Then if the importing process times out, the kernel will
blame any of the processes that passed it a fence that is still unsignaled.
The kernel will blame the process that timed out only if all imported
fences have been signaled. It seems pretty robust.
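
A sketch of that blame rule in a kernel timeout handler (illustrative only;
none of these names exist):

struct process *blame_on_timeout(struct job *job)
{
        struct fence *f;

        list_for_each_entry(f, &job->imported_fences, link)
                if (!fence_signaled(f))
                        return f->exporter; /* an unsignaled import is guilty */
        return job->owner; /* all imports signaled: blame the job itself */
}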

It's the same with implicit sync except that the buffer with the sequence
number is writable. Any process that has an implicitly-sync'd buffer can
set the sequence number to 0 or UINT64_MAX. 0 will cause a timeout for the
next job, while UINT64_MAX might cause a timeout a little later. The
timeout can be mitigated by the kernel because the kernel knows the
greatest number that should be there, but it's not possible to know which
process is guilty (all processes holding the buffer handle would be
suspects).

Marek

On Fri, May 28, 2021 at 6:25 PM Marek Olšák  wrote:

> If both implicit and explicit synchronization are handled the same, then
> the kernel won't be able to identify the process that caused an implicit
> sync deadlock. The process that is stuck waiting for a fence can be
> innocent, and the kernel can't punish it. Likewise, the GPU reset query
> that reports which process is guilty and innocent will only be able to
> report unknown. Is that OK?
>
> Marek
>
> On Fri, May 28, 2021 at 10:41 AM Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Hi Marek,
>>
>> well I don't think that implicit and explicit synchronization need to be
>> mutually exclusive.
>>
>> What we should do is to have the ability to transport a synchronization
>> object with each BO.
>>
>> Implicit and explicit synchronization then basically become the same,
>> they just transport the synchronization object differently.
>>
>> The biggest problem are the sync_files for Android, since they are really
>> not easy to support at all. If Android wants to support user queues we
>> would probably have to do some changes there.
>>
>> Regards,
>> Christian.
>>
>> Am 27.05.21 um 23:51 schrieb Marek Olšák:
>>
>> Hi,
>>
>> Since Christian believes that we can't deadlock the kernel with some
>> changes there, we just need to make everything nice for userspace too.
>> Instead of explaining how it will work, I will explain the cases where
>> future hardware (and its kernel driver) will break existing userspace in
>> order to protect everybody from deadlocks. Anything that uses implicit sync
>> will be spared, so X and Wayland will be fine, assuming they don't
>> import/export fences. Those use cases that do import/export fences might or
>> might not work, depending on how the fences are used.
>>
>> One of the necessities is that all fences will become future fences. The
>> semantics of imported/exported fences will change completely and will have
>> new restrictions on the usage. The restrictions are:
>>
>>
>> 1) Android sync files will be impossible to support, so won't be
>> supported. (they don't allow future fences)
>>
>>
>> 2) Implicit sync and explicit sync will be mutually exclusive between
>> processes. A process can either use one or the other, but not both. This is
>> meant to prevent a deadlock condition with future fences where any process
>> can malevolently deadlock execution of any other process, even execution of
>> a higher-privileged process. The kernel will impose the following
>> restrictions to protect against the deadlock:
>>
>> a) a process with an implicitly-sync'd imported/exported buffer can't
>> import/export a fence from/to another process
>> b) a process with an imported/exported fence can't import/export an
>> implicitly-sync'd buffer from/to another process
>>
>> Alternative: A higher-privileged process could enforce both restrictions
>> instead of the kernel to protect itself from the deadlock, but this would
>> be a can of worms for existing userspace. It would be better if the kernel
>> just broke unsafe userspace on future hw, just like sync files.
>>
>> If both implicit and explicit sync are allowed to occur simultaneously,
>> sending a future fence that will

Re: Linux Graphics Next: Userspace submission update

2021-05-28 Thread Marek Olšák
If both implicit and explicit synchronization are handled the same, then
the kernel won't be able to identify the process that caused an implicit
sync deadlock. The process that is stuck waiting for a fence can be
innocent, and the kernel can't punish it. Likewise, the GPU reset query
that reports which process is guilty and innocent will only be able to
report unknown. Is that OK?

Marek

On Fri, May 28, 2021 at 10:41 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Hi Marek,
>
> well I don't think that implicit and explicit synchronization need to be
> mutually exclusive.
>
> What we should do is to have the ability to transport a synchronization
> object with each BO.
>
> Implicit and explicit synchronization then basically become the same, they
> just transport the synchronization object differently.
>
> The biggest problem are the sync_files for Android, since they are really
> not easy to support at all. If Android wants to support user queues we
> would probably have to do some changes there.
>
> Regards,
> Christian.
>
> Am 27.05.21 um 23:51 schrieb Marek Olšák:
>
> Hi,
>
> Since Christian believes that we can't deadlock the kernel with some
> changes there, we just need to make everything nice for userspace too.
> Instead of explaining how it will work, I will explain the cases where
> future hardware (and its kernel driver) will break existing userspace in
> order to protect everybody from deadlocks. Anything that uses implicit sync
> will be spared, so X and Wayland will be fine, assuming they don't
> import/export fences. Those use cases that do import/export fences might or
> might not work, depending on how the fences are used.
>
> One of the necessities is that all fences will become future fences. The
> semantics of imported/exported fences will change completely and will have
> new restrictions on the usage. The restrictions are:
>
>
> 1) Android sync files will be impossible to support, so won't be
> supported. (they don't allow future fences)
>
>
> 2) Implicit sync and explicit sync will be mutually exclusive between
> processes. A process can either use one or the other, but not both. This is
> meant to prevent a deadlock condition with future fences where any process
> can malevolently deadlock execution of any other process, even execution of
> a higher-privileged process. The kernel will impose the following
> restrictions to protect against the deadlock:
>
> a) a process with an implicitly-sync'd imported/exported buffer can't
> import/export a fence from/to another process
> b) a process with an imported/exported fence can't import/export an
> implicitly-sync'd buffer from/to another process
>
> Alternative: A higher-privileged process could enforce both restrictions
> instead of the kernel to protect itself from the deadlock, but this would
> be a can of worms for existing userspace. It would be better if the kernel
> just broke unsafe userspace on future hw, just like sync files.
>
> If both implicit and explicit sync are allowed to occur simultaneously,
> sending a future fence that will never signal to any process will deadlock
> that process after it acquires the implicit sync lock, which is a sequence
> number that the process is required to write to memory and send an
> interrupt from the GPU in a finite time. This is how the deadlock can
> happen:
>
> * The process gets sequence number N from the kernel for an
> implicitly-sync'd buffer.
> * The process inserts (into the GPU user-mapped queue) a wait for sequence
> number N-1.
> * The process inserts a wait for a fence, but it doesn't know that it will
> never signal ==> deadlock.
> ...
> * The process inserts a command to write sequence number N to a
> predetermined memory location. (which will make the buffer idle and send an
> interrupt to the kernel)
> ...
> * The kernel will terminate the process because it has never received the
> interrupt. (i.e. a less-privileged process just killed a more-privileged
> process)
>
> It's the interrupt for implicit sync that never arrived that caused the
> termination, and the only way another process can cause it is by sending a
> fence that will never signal. Thus, importing/exporting fences from/to
> other processes can't be allowed simultaneously with implicit sync.
>
>
> 3) Compositors (and other privileged processes, and display flipping)
> can't trust imported/exported fences. They need a timeout recovery
> mechanism from the beginning, and the following are some possible solutions
> to timeouts:
>
> a) use a CPU wait with a small absolute timeout, and display the previous
> content on timeout
> b) use

Linux Graphics Next: Userspace submission update

2021-05-27 Thread Marek Olšák
Hi,

Since Christian believes that we can't deadlock the kernel with some
changes there, we just need to make everything nice for userspace too.
Instead of explaining how it will work, I will explain the cases where
future hardware (and its kernel driver) will break existing userspace in
order to protect everybody from deadlocks. Anything that uses implicit sync
will be spared, so X and Wayland will be fine, assuming they don't
import/export fences. Those use cases that do import/export fences might or
might not work, depending on how the fences are used.

One of the necessities is that all fences will become future fences. The
semantics of imported/exported fences will change completely and will have
new restrictions on the usage. The restrictions are:


1) Android sync files will be impossible to support, so won't be supported.
(they don't allow future fences)


2) Implicit sync and explicit sync will be mutually exclusive between
processes. A process can either use one or the other, but not both. This is
meant to prevent a deadlock condition with future fences where any process
can malevolently deadlock execution of any other process, even execution of
a higher-privileged process. The kernel will impose the following
restrictions to protect against the deadlock:

a) a process with an implicitly-sync'd imported/exported buffer can't
import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an
implicitly-sync'd buffer from/to another process

Alternative: A higher-privileged process could enforce both restrictions
instead of the kernel to protect itself from the deadlock, but this would
be a can of worms for existing userspace. It would be better if the kernel
just broke unsafe userspace on future hw, just like sync files.

If both implicit and explicit sync are allowed to occur simultaneously,
sending a future fence that will never signal to any process will deadlock
that process after it acquires the implicit sync lock, which is a sequence
number that the process is required to write to memory and send an
interrupt from the GPU in a finite time. This is how the deadlock can
happen:

* The process gets sequence number N from the kernel for an
implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for sequence
number N-1.
* The process inserts a wait for a fence, but it doesn't know that it will
never signal ==> deadlock.
...
* The process inserts a command to write sequence number N to a
predetermined memory location. (which will make the buffer idle and send an
interrupt to the kernel)
...
* The kernel will terminate the process because it has never received the
interrupt. (i.e. a less-privileged process just killed a more-privileged
process)

It's the interrupt for implicit sync that never arrived that caused the
termination, and the only way another process can cause it is by sending a
fence that will never signal. Thus, importing/exporting fences from/to
other processes can't be allowed simultaneously with implicit sync.
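
Restrictions 2a/2b could reduce to a simple per-process state check in the
kernel, roughly like this (a sketch, not a real interface):

int on_fence_import_export(struct process *p)
{
        if (p->uses_implicit_sync)
                return -EPERM;           /* rule (a) */
        p->uses_external_fences = true;
        return 0;
}

int on_implicit_buffer_import_export(struct process *p)
{
        if (p->uses_external_fences)
                return -EPERM;           /* rule (b) */
        p->uses_implicit_sync = true;
        return 0;
}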


3) Compositors (and other privileged processes, and display flipping) can't
trust imported/exported fences. They need a timeout recovery mechanism from
the beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering
will choose between the latest content (if signalled) and previous content
(if timed out)

The result would be that the desktop can run close to 60 fps even if an app
runs at 1 fps.
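
Option (a) is essentially the following in the compositor's flip path (the
timeout value and helper names are invented for illustration):

if (cpu_wait_fence(new_frame_fence, 5 /* ms, small absolute timeout */) < 0)
        scanout(previous_buffer); /* timed out: keep the last good frame */
else
        scanout(new_buffer);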

*Redefining imported/exported fences and breaking some users/OSs is the
only way to have userspace GPU command submission, and the deadlock example
here is the counterexample proving that there is no other way.*

So, what are the chances this is going to fly with the ecosystem?

Thanks,
Marek


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-04 Thread Marek Olšák
I see some mentions of XNACK and recoverable page faults. Note that all
gaming AMD hw that has userspace queues doesn't have XNACK, so there is no
overhead in compute units. My understanding is that recoverable page faults
are still supported without XNACK, but instead of the compute unit
replaying the faulting instruction, the L1 cache does that. Anyway, the
point is that XNACK is totally irrelevant here.

Marek

On Tue., May 4, 2021, 08:48 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> >  wrote:
> >> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> >>> [SNIP]
 Yeah, it just takes too long for the preemption to complete to be
> really
>  useful for the feature we are discussing here.
> 
>  As I said when the kernel requests to preempt a queue we can easily
> expect a
>  timeout of ~100ms until that comes back. For compute that is even in
> the
>  multiple seconds range.
> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
> >>> course preemting something that actually runs takes a while, that's
> >>> nothing new. But it's also not the thing we're talking about here. Is
> this
> >>> 100ms actual numbers from hw for an actual idle ringbuffer?
> >> Well 100ms is just an example of the scheduler granularity. Let me
> >> explain in a wider context.
> >>
> >> The hardware can have X queues mapped at the same time and every Y time
> >> interval the hardware scheduler checks if those queues have changed and
> >> only if they have changed the necessary steps to reload them are
> started.
> >>
> >> Multiple queues can be rendering at the same time, so you can have X as
> >> a high priority queue active and just waiting for a signal to start and
> >> the client rendering one frame after another and a third background
> >> compute task mining bitcoins for you.
> >>
> >> As long as everything is static this is perfectly performant. Adding a
> >> queue to the list of active queues is also relatively simple, but taking
> >> one down requires you to wait until we are sure the hardware has seen
> >> the change and reloaded the queues.
> >>
> >> Think of it as an RCU grace period. This is simply not something which
> >> is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
>
> Well I wouldn't call it broken. It's just not made for the use case we
> are trying to abuse it for.
>
> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
>
> Yeah, exactly that's why it isn't much of a problem for process
> termination or freeing memory.
>
> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys won't be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> >
>  The "preemption" feature is really called suspend and made just for
> the case
>  when we want to put a process to sleep or need to forcefully kill it
> for
>  misbehavior or stuff like that. It is not meant to be used in normal
>  operation.
> 
>  If we only attach it on ->move then yeah maybe a last resort
> possibility to
>  do it this way, but I think in that case we could rather stick with
> kernel
>  submissions.
> >>> Well this is a hybrid userspace ring + kernel augmented submit mode, so
> you
> >>> can keep dma-fences working. Because the dma-fence stuff won't work with
> >>> pure userspace submit, I think that conclusion is rather solid. Once
> more
> >>> even after this long thread here.
> >> When assisted with unload fences, then yes. Problem is that I can't see
> >> how we could implement those performantly at the moment.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
>
> As I said adding the queue is unproblematic and teardown just results in
> a bit more waiting to free things up.
>
> More problematic are overcommit, swapping and OOM situations, which need to
> wait for the hw scheduler to come back and tell us that the queue is now
> unmapped.
>
> > Also, if userspace lies to us and keeps pushing crap into the ring
> > after it's supposed to be idle: Userspace is already allowed to waste
> > gpu time. If you're too worried about this set a fairly aggressive
> > preempt timeout on the unload fence, and kill the context if it takes
> > longer than what preempting an idle ring should take (because that
> > would indicate broken/evil userspace).
>  I think you have the wrong expectation here. It is perfectly valid and
>  expected for userspace to keep writing commands into the ring buffer.
> 
>  After all when one frame is completed they want to immediately start
>  rendering the next one.
> >>> Sure, for the true userspace direct su

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Marek Olšák
Proposal for a new CS ioctl, kernel pseudo code:

lock(&global_lock);
serial = get_next_serial(dev);       /* kernel-assigned, monotonically increasing */
add_wait_command(ring, serial - 1);  /* wait for the previous submission */
add_exec_cmdbuf(ring, user_cmdbuf);  /* execute the user's command buffer */
add_signal_command(ring, serial);    /* signal completion of this submission */
*ring->doorbell = FIRE;              /* kick the hardware */
unlock(&global_lock);

See? Just like userspace submit, but in the kernel without
concurrency/preemption. Is this now safe enough for dma_fence?

Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák  wrote:

> What about direct submit from the kernel where the process still has write
> access to the GPU ring buffer but doesn't use it? I think that solves your
> preemption example, but leaves a potential backdoor for a process to
> overwrite the signal commands, which shouldn't be a problem since we are OK
> with timeouts.
>
> Marek
>
> On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand 
> wrote:
>
>> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>>  wrote:
>> >
>> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand 
>> wrote:
>> > >
>> > > Sorry for the top-post but there's no good thing to reply to here...
>> > >
>> > > One of the things pointed out to me recently by Daniel Vetter that I
>> > > didn't fully understand before is that dma_buf has a very subtle
>> > > second requirement beyond finite time completion:  Nothing required
>> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
>> > > of allocating memory may wait on your dma-fence.  This, as it turns
>> > > out, is a massively more strict requirement than finite time
>> > > completion and, I think, throws out all of the proposals we have so
>> > > far.
>> > >
>> > > Take, for instance, Marek's proposal for userspace involvement with
>> > > dma-fence by asking the kernel for a next serial and the kernel
>> > > trusting userspace to signal it.  That doesn't work at all if
>> > > allocating memory to trigger a dma-fence can blow up.  There's simply
>> > > no way for the kernel to trust userspace to not do ANYTHING which
>> > > might allocate memory.  I don't even think there's a way userspace can
>> > > trust itself there.  It also blows up my plan of moving the fences to
>> > > transition boundaries.
>> > >
>> > > Not sure where that leaves us.
>> >
>> > Honestly the more I look at things I think userspace-signalable fences
>> > with a timeout sound like they are a valid solution for these issues.
>> > Especially since (as has been mentioned countless times in this email
> thread) userspace already has a lot of ways to cause timeouts and/or
> GPU hangs through GPU work.
>> >
>> > Adding a timeout on the signaling side of a dma_fence would ensure:
>> >
>> > - The dma_fence signals in finite time
>> > -  If the timeout case does not allocate memory then memory allocation
>> > is not a blocker for signaling.
>> >
>> > Of course you lose the full dependency graph and we need to make sure
>> > garbage collection of fences works correctly when we have cycles.
>> > However, the latter sounds very doable and the first sounds like it is
>> > to some extent inevitable.
>> >
>> > I feel like I'm missing some requirement here given that we
>> > immediately went to much more complicated things but can't find it.
>> > Thoughts?
>>
>> Timeouts are sufficient to protect the kernel but they make the fences
>> unpredictable and unreliable from a userspace PoV.  One of the big
>> problems we face is that, once we expose a dma_fence to userspace,
>> we've allowed for some pretty crazy potential dependencies that
>> neither userspace nor the kernel can sort out.  Say you have Marek's
>> "next serial, please" proposal and a multi-threaded application.
>> Between the time you ask the kernel for a serial and get a dma_fence
>> and submit the work to signal that serial, your process may get
>> preempted, something else shoved in which allocates memory, and then
>> we end up blocking on that dma_fence.  There's no way userspace can
>> predict and defend itself from that.
>>
>> So I think where that leaves us is that there is no safe place to
>> create a dma_fence except for inside the ioctl which submits the work
>> and only after any necessary memory has been allocated.  That's a
>> pretty stiff requirement.  We may still be able to interact with
>> userspace a bit more explicitly but I think it throws any notion of
>> userspace direct submit out the window.
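
A toy C sketch of the signaling-side timeout Bas floats above: a fence that
is force-signaled when a deadline expires, with nothing on the timeout path
that allocates memory. Plain pthreads and invented names, not the kernel
dma_fence API:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct timeout_fence {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool signaled;
    bool timed_out;   /* true if we had to force-signal */
};

static void fence_signal(struct timeout_fence *f, bool timeout)
{
    pthread_mutex_lock(&f->lock);
    if (!f->signaled) {
        f->signaled = true;          /* no allocation on this path */
        f->timed_out = timeout;
        pthread_cond_broadcast(&f->cond);
    }
    pthread_mutex_unlock(&f->lock);
}

/* Returns true on a real signal, false if the deadline force-signaled it.
 * Either way the wait finishes in finite time. */
static bool fence_wait(struct timeout_fence *f, const struct timespec *deadline)
{
    pthread_mutex_lock(&f->lock);
    while (!f->signaled) {
        if (pthread_cond_timedwait(&f->cond, &f->lock, deadline) == ETIMEDOUT) {
            pthread_mutex_unlock(&f->lock);
            fence_signal(f, true);   /* deadline hit: force-signal */
            pthread_mutex_lock(&f->lock);
        }
    }
    bool ok = !f->timed_out;
    pthread_mutex_unlock(&f->lock);
    return ok;
}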

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-03 Thread Marek Olšák
What about direct submit from the kernel where the process still has write
access to the GPU ring buffer but doesn't use it? I think that solves your
preemption example, but leaves a potential backdoor for a process to
overwrite the signal commands, which shouldn't be a problem since we are OK
with timeouts.

Marek

On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand  wrote:

> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>  wrote:
> >
> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand 
> wrote:
> > >
> > > Sorry for the top-post but there's no good thing to reply to here...
> > >
> > > One of the things pointed out to me recently by Daniel Vetter that I
> > > didn't fully understand before is that dma_buf has a very subtle
> > > second requirement beyond finite time completion:  Nothing required
> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > > of allocating memory may wait on your dma-fence.  This, as it turns
> > > out, is a massively more strict requirement than finite time
> > > completion and, I think, throws out all of the proposals we have so
> > > far.
> > >
> > > Take, for instance, Marek's proposal for userspace involvement with
> > > dma-fence by asking the kernel for a next serial and the kernel
> > > trusting userspace to signal it.  That doesn't work at all if
> > > allocating memory to trigger a dma-fence can blow up.  There's simply
> > > no way for the kernel to trust userspace to not do ANYTHING which
> > > might allocate memory.  I don't even think there's a way userspace can
> > > trust itself there.  It also blows up my plan of moving the fences to
> > > transition boundaries.
> > >
> > > Not sure where that leaves us.
> >
> > Honestly the more I look at things I think userspace-signalable fences
> > with a timeout sound like they are a valid solution for these issues.
> > Especially since (as has been mentioned countless times in this email
> > thread) userspace already has a lot of ways to cause timeouts and/or
> > GPU hangs through GPU work.
> >
> > Adding a timeout on the signaling side of a dma_fence would ensure:
> >
> > - The dma_fence signals in finite time
> > -  If the timeout case does not allocate memory then memory allocation
> > is not a blocker for signaling.
> >
> > Of course you lose the full dependency graph and we need to make sure
> > garbage collection of fences works correctly when we have cycles.
> > However, the latter sounds very doable and the first sounds like it is
> > to some extent inevitable.
> >
> > I feel like I'm missing some requirement here given that we
> > immediately went to much more complicated things but can't find it.
> > Thoughts?
>
> Timeouts are sufficient to protect the kernel but they make the fences
> unpredictable and unreliable from a userspace PoV.  One of the big
> problems we face is that, once we expose a dma_fence to userspace,
> we've allowed for some pretty crazy potential dependencies that
> neither userspace nor the kernel can sort out.  Say you have Marek's
> "next serial, please" proposal and a multi-threaded application.
> Between the time you ask the kernel for a serial and get a dma_fence
> and submit the work to signal that serial, your process may get
> preempted, something else shoved in which allocates memory, and then
> we end up blocking on that dma_fence.  There's no way userspace can
> predict and defend itself from that.
>
> So I think where that leaves us is that there is no safe place to
> create a dma_fence except for inside the ioctl which submits the work
> and only after any necessary memory has been allocated.  That's a
> pretty stiff requirement.  We may still be able to interact with
> userspace a bit more explicitly but I think it throws any notion of
> userspace direct submit out the window.
>
> --Jason
>
>
> > - Bas
> > >
> > > --Jason
> > >
> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher 
> wrote:
> > > >
> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák  wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer 
> wrote:
> > > > >>
> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > > >> > Hi Dave,
> > > > >> >
> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > > >> >> Supporting interop with any device is always possible.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-05-01 Thread Marek Olšák
On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer  wrote:

> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> >
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on
> >> which drivers we need to interoperate with and update them. We've already
> >> found the path forward for amdgpu. We just need to find out how many other
> >> drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airl...@gmail.com> wrote:
> >>
> >> On Tue, 27 Apr 2021 at 22:06, Christian König
> >> <ckoenig.leichtzumer...@gmail.com> wrote:
> >> >
> >> > Correct, we wouldn't have synchronization between devices with and
> >> > without user queues any more.
> >> >
> >> > That could only be a problem for A+I Laptops.
> >>
> >> Since I think you mentioned you'd only be enabling this on newer
> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> behind the other?
> >>
> >
> > Crap, that is a good point as well.
> >
> >>
> >> I'm not really liking where this is going btw, seems like an
> >> ill-thought-out concept, if AMD is really going down the road of designing
> >> hw that is currently Linux incompatible, you are going to have to
> >> accept a big part of the burden in bringing this support into more
> >> than just amd drivers for upcoming generations of gpu.
> >>
> >
> > Well we don't really like that either, but we have no other option as
> > far as I can see.
>
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the queues,
> or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
>
> Surely there are resource limits for the per-context queues, so the kernel
> driver needs to do some kind of virtualization / multiplexing anyway, or
> we'll get sad user faces when there's no queue available for <insert game
> here>.
>
> I'm probably missing something though, awaiting enlightenment. :)
>

The hw interface for userspace is that the ring buffer is mapped to the
process address space alongside a doorbell aperture (4K page) that isn't
real memory, but when the CPU writes into it, it tells the hw scheduler
that there are new GPU commands in the ring buffer. Userspace inserts all
the wait, draw, and signal commands into the ring buffer and then "rings"
the doorbell. It's my understanding that the ring buffer and the doorbell
are always mapped in the same GPU address space as the process, which makes
it very difficult to emulate the current protected ring buffers in the
kernel. The VMID of the ring buffer is also not changeable.

The hw scheduler doesn't do any synchronization and it doesn't see any
dependencies. It only chooses which queue to execute, so it's really just a
simple queue manager handling the virtualization aspect and not much else.
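
A hedged sketch of that userspace-visible interface: a write-mapped ring
plus the doorbell page. The packet encoding and names are invented; the
real hardware format differs:

#include <stdint.h>

struct user_queue {
    uint64_t *ring;              /* command ring mapped into the process */
    uint32_t  ring_mask;         /* ring size in entries minus 1 */
    uint32_t  wptr;              /* software write pointer */
    volatile uint32_t *doorbell; /* 4K doorbell aperture, not real memory */
};

static void ring_emit(struct user_queue *q, uint64_t pkt)
{
    q->ring[q->wptr++ & q->ring_mask] = pkt;
}

/* Userspace writes wait/draw/signal packets itself, then "rings" the
 * doorbell; the write traps to the hw scheduler, which only picks which
 * queue to run next and does no synchronization of its own. */
static void user_submit(struct user_queue *q, uint64_t wait_pkt,
                        uint64_t draw_pkt, uint64_t signal_pkt)
{
    ring_emit(q, wait_pkt);
    ring_emit(q, draw_pkt);
    ring_emit(q, signal_pkt);
    *q->doorbell = q->wptr;      /* tell the scheduler there is new work */
}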

Marek


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
On Wed., Apr. 28, 2021, 00:01 Jason Ekstrand,  wrote:

> On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák  wrote:
> >
> > Jason, both memory-based signalling as well as interrupt-based
> signalling to the CPU would be supported by amdgpu. External devices don't
> need to support memory-based sync objects. The only limitation is that they
> can't convert amdgpu sync objects to dma_fence.
>
> Sure.  I'm not worried about the mechanism.  We just need a word that
> means "the new fence thing" and I've been throwing "memory fence"
> around for that.  Other mechanisms may work as well.
>
> > The sad thing is that "external -> amdgpu" dependencies are really
> "external <-> amdgpu" dependencies due to mutually-exclusive access
> required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
> only interop that would initially work with those buffers. Explicitly
> sync'd buffers also won't work if other drivers convert explicit fences to
> dma_fence. Thus, both implicit sync and explicit sync might not work with
> other drivers at all. The only interop that would initially work is
> explicit fences with memory-based waiting and signalling on the external
> device to keep the kernel out of the picture.
>
> Yup.  This is where things get hard.  That said, I'm not quite ready
> to give up on memory/interrupt fences just yet.
>
> One thought that came to mind which might help would be if we added an
> extremely strict concept of memory ownership.  The idea would be that
> any given BO would be in one of two states at any given time:
>
>  1. legacy: dma_fences and implicit sync works as normal but it cannot
> be resident in any "modern" (direct submission, ULLS, whatever you
> want to call it) context
>
>  2. modern: In this mode they should not be used by any legacy
> context.  We can't strictly prevent this, unfortunately, but maybe we
> can say reading produces garbage and writes may be discarded.  In this
> mode, they can be bound to modern contexts.
>
> In theory, when in "modern" mode, you could bind the same buffer in
> multiple modern contexts at a time.  However, when that's the case, it
> makes ownership really tricky to track.  Therefore, we might want some
> sort of dma-buf create flag for "always modern" vs. "switchable" and
> only allow binding to one modern context at a time when it's
> switchable.
>
> If we did this, we may be able to move any dma_fence shenanigans to
> the ownership transition points.  We'd still need some sort of "wait
> for fence and transition" which has a timeout.  However, then we'd be
> fairly well guaranteed that the application (not just Mesa!) has
> really and truly decided it's done with the buffer and we wouldn't (I
> hope!) end up with the accidental edges in the dependency graph.
>
> Of course, I've not yet proven any of this correct so feel free to
> tell me why it won't work. :-)  It was just one of those "about to go
> to bed and had a thunk" type thoughts.
>

We'd like to keep userspace outside of Mesa drivers intact and working
except for interop where we don't have much choice. At the same time,
future hw may remove support for kernel queues, so we might not have much
choice there either, depending on what the hw interface will look like.

The idea is to have an ioctl for querying a timeline semaphore buffer
associated with a shared BO, and an ioctl for querying the next wait and
signal number (e.g. n and n+1) for that semaphore. Waiting for n would be
like mutex lock and signaling would be like mutex unlock. The next process
would use the same ioctl and get n+1 and n+2, etc. There is a deadlock
condition because one process can do lock A, lock B, and another can do
lock B, lock A; this can be prevented by having the ioctl that returns the
numbers return them for multiple buffers at once. This solution needs
no changes to userspace outside of Mesa drivers, and we'll also keep the BO
wait ioctl for GPU-CPU sync.
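
A sketch of that flow with invented ioctl wrappers (no such uAPI exists
today); the key point is that one call reserves the wait/signal pairs for
every shared BO of a submission, which keeps the lock order global and
avoids the A/B vs B/A deadlock:

#include <stdint.h>

struct bo_sync_point {
    uint32_t bo_handle;    /* shared BO touched by this submission */
    uint64_t wait_point;   /* kernel fills in: wait for n ... */
    uint64_t signal_point; /* ... and signal n+1 when done */
};

/* Stand-in for the proposed ioctl; this stub just simulates one shared
 * timeline so the example is self-contained. */
static int acquire_sync_points(int drm_fd, struct bo_sync_point *pts,
                               unsigned count)
{
    static uint64_t next = 1;
    (void)drm_fd;
    for (unsigned i = 0; i < count; i++) {
        pts[i].wait_point = next - 1;
        pts[i].signal_point = next;
        next++;
    }
    return 0;
}

static void submit_with_shared_bos(int drm_fd, struct bo_sync_point *pts,
                                   unsigned count)
{
    acquire_sync_points(drm_fd, pts, count);  /* all BOs at once */
    for (unsigned i = 0; i < count; i++) {
        /* emit: semaphore_wait(bo[i] timeline, pts[i].wait_point)     */
        /* emit: the draw/dispatch commands that use the BO            */
        /* emit: semaphore_signal(bo[i] timeline, pts[i].signal_point) */
    }
    /* ring the doorbell as usual */
}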

Marek


> --Jason
>
> P.S.  Daniel was 100% right when he said this discussion needs a glossary.
>
>
> > Marek
> >
> >
> > On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand 
> wrote:
> >>
> >> Trying to figure out which e-mail in this mess is the right one to
> reply to
> >>
> >> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> >> > > Ok. So that would only make the following use cases broken for now:
> >> > 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Jason, both memory-based signalling as well as interrupt-based signalling
to the CPU would be supported by amdgpu. External devices don't need to
support memory-based sync objects. The only limitation is that they can't
convert amdgpu sync objects to dma_fence.

The sad thing is that "external -> amdgpu" dependencies are really
"external <-> amdgpu" dependencies due to mutually-exclusive access
required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
only interop that would initially work with those buffers. Explicitly
sync'd buffers also won't work if other drivers convert explicit fences to
dma_fence. Thus, both implicit sync and explicit sync might not work with
other drivers at all. The only interop that would initially work is
explicit fences with memory-based waiting and signalling on the external
device to keep the kernel out of the picture.

Marek


On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand  wrote:

> Trying to figure out which e-mail in this mess is the right one to reply
> to
>
> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach 
> wrote:
> >
> > Hi,
> >
> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > > Ok. So that would only make the following use cases broken for now:
> > > - amd render -> external gpu
>
> Assuming said external GPU doesn't support memory fences.  If we do
> amdgpu and i915 at the same time, that covers basically most of the
> external GPU use-cases.  Of course, we'd want to convert nouveau as
> well for the rest.
>
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy, as we have some cases where we are combining an AMD APU with a
> > FPGA based graphics card. I can't go into the specifics of this use-
> > case too much but basically the AMD graphics is rendering content that
> > gets composited on top of a live video pipeline running through the
> > FPGA.
>
> I think it's worth taking a step back and asking what's being proposed here
> before we freak out too much.  If we do go this route, it doesn't mean
> that your FPGA use-case can't work, it just means it won't work
> out-of-the box anymore.  You'll have to separate execution and memory
> dependencies inside your FPGA driver.  That's still not great but it's
> not as bad as you maybe made it sound.
>
> > > What about the case when we get a buffer from an external device and
> > > we're supposed to make it "busy" when we are using it, and the
> > > external device wants to wait until we stop using it? Is it something
> > > that can happen, thus turning "external -> amd" into "external <->
> > > amd"?
> >
> > Zero-copy texture sampling from a video input certainly appreciates
> > this very much. Trying to pass the render fence through the various
> > layers of userspace to be able to tell when the video input can reuse a
> > buffer is a great experience in yak shaving. Allowing the video input
> > to reuse the buffer as soon as the read dma_fence from the GPU is
> > signaled is much more straight forward.
>
> Oh, it's definitely worse than that.  Every window system interaction
> is bi-directional.  The X server has to wait on the client before
> compositing from it and the client has to wait on X before re-using
> that back-buffer.  Of course, we can break that latter dependency by
> doing a full CPU wait but that's going to mean either more latency or
> reserving more back buffers.  There's no good clean way to claim that
> any of this is one-directional.
>
> --Jason
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Supporting interop with any device is always possible. It depends on which
drivers we need to interoperate with and update them. We've already found
the path forward for amdgpu. We just need to find out how many other
drivers need to be updated and evaluate the cost/benefit aspect.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie  wrote:

> On Tue, 27 Apr 2021 at 22:06, Christian König
>  wrote:
> >
> > Correct, we wouldn't have synchronization between devices with and
> > without user queues any more.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like an
> ill-thought-out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support into more
> than just amd drivers for upcoming generations of gpu.
>
> Dave.
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and we're
supposed to make it "busy" when we are using it, and the external device
wants to wait until we stop using it? Is it something that can happen, thus
turning "external -> amd" into "external <-> amd"?

Marek

On Tue., Apr. 27, 2021, 08:50 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Only amd -> external.
>
> We can easily install something in a user queue which waits for a
> dma_fence in the kernel.
>
> But we can't easily wait for a user queue as a dependency of a dma_fence.
>
> The good thing is we have this wait before signal case on Vulkan timeline
> semaphores which have the same problem in the kernel.
>
> The good news is I think we can relatively easily convert i915 and older
> amdgpu devices to something which is compatible with user fences.
>
> So yes, getting that fixed case by case should work.
>
> Christian
>
> Am 27.04.21 um 14:46 schrieb Marek Olšák:
>
> I'll defer to Christian and Alex to decide whether dropping sync with
> non-amd devices (GPUs, cameras etc.) is acceptable.
>
> Rewriting those drivers to this new sync model could be done on a case by
> case basis.
>
> For now, would we only lose the "amd -> external" dependency? Or the
> "external -> amd" dependency too?
>
> Marek
>
> On Tue., Apr. 27, 2021, 08:15 Daniel Vetter,  wrote:
>
>> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
>> > Ok. I'll interpret this as "yes, it will work, let's do it".
>>
>> It works if all you care about is drm/amdgpu. I'm not sure that's a
>> reasonable approach for upstream, but it definitely is an approach :-)
>>
>> We've already gone somewhat through the pain of drm/amdgpu redefining
>> how implicit sync works without sufficiently talking with other
>> people, maybe we should avoid a repeat of this ...
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue., Apr. 27, 2021, 08:06 Christian König, <
>> ckoenig.leichtzumer...@gmail.com> wrote:
>> >>
>> >> Correct, we wouldn't have synchronization between devices with and
>> >> without user queues any more.
>> >>
>> >> That could only be a problem for A+I Laptops.
>> >>
>> >> Memory management will just work with preemption fences which pause
>> the user queues of a process before evicting something. That will be a
>> dma_fence, but also a well known approach.
>> >>
>> >> Christian.
>> >>
>> >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>> >>
>> >> If we don't use future fences for DMA fences at all, e.g. we don't use
>> them for memory management, it can work, right? Memory management can
>> suspend user queues anytime. It doesn't need to use DMA fences. There might
>> be something that I'm missing here.
>> >>
>> >> What would we lose without DMA fences? Just inter-device
>> synchronization? I think that might be acceptable.
>> >>
>> >> The only case when the kernel will wait on a future fence is before a
>> page flip. Everything today already depends on userspace not hanging the
>> gpu, which makes everything a future fence.
>> >>
>> >> Marek
>> >>
>> >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>> >>>
>> >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> >>> > Thanks everybody. The initial proposal is dead. Here are some
>> >>> > thoughts on how to do it differently.
>> >>> >
>> >>> > I think we can have direct command submission from userspace via
>> >>> > memory-mapped queues ("user queues") without changing window systems.
>> >>> >
>> >>> > The memory management doesn't have to use GPU page faults like HMM.
>> >>> > Instead, it can wait for user queues of a specific process to go
>> >>> > idle and then unmap the queues, so that userspace can't submit
>> >>> > anything. Buffer evictions, pinning, etc. can be executed when all
>> >>> > queues are unmapped (suspended). Thus, no BO fences and page faults
>> >>> > are needed.
>> >>> >
>> >>> > Inter-

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
I'll defer to Christian and Alex to decide whether dropping sync with
non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a case by
case basis.

For now, would we only lose the "amd -> external" dependency? Or the
"external -> amd" dependency too?

Marek

On Tue., Apr. 27, 2021, 08:15 Daniel Vetter,  wrote:

> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák  wrote:
> > Ok. I'll interpret this as "yes, it will work, let's do it".
>
> It works if all you care about is drm/amdgpu. I'm not sure that's a
> reasonable approach for upstream, but it definitely is an approach :-)
>
> We've already gone somewhat through the pain of drm/amdgpu redefining
> how implicit sync works without sufficiently talking with other
> people, maybe we should avoid a repeat of this ...
> -Daniel
>
> >
> > Marek
> >
> > On Tue., Apr. 27, 2021, 08:06 Christian König, <
> ckoenig.leichtzumer...@gmail.com> wrote:
> >>
> >> Correct, we wouldn't have synchronization between devices with and
> >> without user queues any more.
> >>
> >> That could only be a problem for A+I Laptops.
> >>
> >> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
> >>
> >> Christian.
> >>
> >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
> >>
> >> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
> >>
> >> What would we lose without DMA fences? Just inter-device
> synchronization? I think that might be acceptable.
> >>
> >> The only case when the kernel will wait on a future fence is before a
> page flip. Everything today already depends on userspace not hanging the
> gpu, which makes everything a future fence.
> >>
> >> Marek
> >>
> >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
> >>>
> >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> >>> > Thanks everybody. The initial proposal is dead. Here are some
> >>> > thoughts on how to do it differently.
> >>> >
> >>> > I think we can have direct command submission from userspace via
> >>> > memory-mapped queues ("user queues") without changing window systems.
> >>> >
> >>> > The memory management doesn't have to use GPU page faults like HMM.
> >>> > Instead, it can wait for user queues of a specific process to go
> >>> > idle and then unmap the queues, so that userspace can't submit
> >>> > anything. Buffer evictions, pinning, etc. can be executed when all
> >>> > queues are unmapped (suspended). Thus, no BO fences and page faults
> >>> > are needed.
> >>> >
> >>> > Inter-process synchronization can use timeline semaphores. Userspace
> >>> > will query the wait and signal value for a shared buffer from the
> >>> > kernel. The kernel will keep a history of those queries to know which
> >>> > process is responsible for signalling which buffer. There is only the
> >>> > wait-timeout issue and how to identify the culprit. One of the
> >>> > solutions is to have the GPU send all GPU signal commands and all
> >>> > timed out wait commands via an interrupt to the kernel driver to
> >>> > monitor and validate userspace behavior. With that, it can be
> >>> > identified whether the culprit is the waiting process or the
> >>> > signalling process and which one. Invalid signal/wait parameters can
> >>> > also be detected. The kernel can force-signal only the semaphores
> >>> > that time out, and punish the processes which caused the timeout or
> >>> > used invalid signal/wait parameters.
> >>> >
> >>> > The question is whether this synchronization solution is robust
> >>> > enough for dma_fence and whatever the kernel and window systems need.
> >>>
> >>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> >>> (without page faults). That means dma_fence for synchronization is
> >>> doa, at least as-is, and we're back to figuring out the winsys problem.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
Ok. I'll interpret this as "yes, it will work, let's do it".

Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Correct, we wouldn't have synchronization between devices with and without
> user queues any more.
>
> That could only be a problem for A+I Laptops.
>
> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
>
> Christian.
>
> Am 27.04.21 um 13:49 schrieb Marek Olšák:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
>
> What would we lose without DMA fences? Just inter-device synchronization?
> I think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page
> flip. Everything today already depends on userspace not hanging the gpu,
> which makes everything a future fence.
>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:
>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts
>> > on how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle
>> > and then unmap the queues, so that userspace can't submit anything.
>> > Buffer evictions, pinning, etc. can be executed when all queues are
>> > unmapped (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace
>> > will query the wait and signal value for a shared buffer from the
>> > kernel. The kernel will keep a history of those queries to know which
>> > process is responsible for signalling which buffer. There is only the
>> > wait-timeout issue and how to identify the culprit. One of the
>> > solutions is to have the GPU send all GPU signal commands and all timed
>> > out wait commands via an interrupt to the kernel driver to monitor and
>> > validate userspace behavior. With that, it can be identified whether
>> > the culprit is the waiting process or the signalling process and which
>> > one. Invalid signal/wait parameters can also be detected. The kernel
>> > can force-signal only the semaphores that time out, and punish the
>> > processes which caused the timeout or used invalid signal/wait
>> > parameters.
>> >
>> > The question is whether this synchronization solution is robust enough
>> > for dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync),
>> you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-cases. We've discussed that one ad nauseam last summer:
>>
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagram at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-27 Thread Marek Olšák
If we don't use future fences for DMA fences at all, e.g. we don't use them
for memory management, it can work, right? Memory management can suspend
user queues anytime. It doesn't need to use DMA fences. There might be
something that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page
flip. Everything today already depends on userspace not hanging the gpu,
which makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,  wrote:

> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
> > how to do it differently.
> >
> > I think we can have direct command submission from userspace via
> > memory-mapped queues ("user queues") without changing window systems.
> >
> > The memory management doesn't have to use GPU page faults like HMM.
> > Instead, it can wait for user queues of a specific process to go idle and
> > then unmap the queues, so that userspace can't submit anything. Buffer
> > evictions, pinning, etc. can be executed when all queues are unmapped
> > (suspended). Thus, no BO fences and page faults are needed.
> >
> > Inter-process synchronization can use timeline semaphores. Userspace will
> > query the wait and signal value for a shared buffer from the kernel. The
> > kernel will keep a history of those queries to know which process is
> > responsible for signalling which buffer. There is only the wait-timeout
> > issue and how to identify the culprit. One of the solutions is to have
> > the GPU send all GPU signal commands and all timed out wait commands via
> > an interrupt to the kernel driver to monitor and validate userspace
> > behavior. With that, it can be identified whether the culprit is the
> > waiting process or the signalling process and which one. Invalid
> > signal/wait parameters can also be detected. The kernel can force-signal
> > only the semaphores that time out, and punish the processes which caused
> > the timeout or used invalid signal/wait parameters.
> >
> > The question is whether this synchronization solution is robust enough
> > for dma_fence and whatever the kernel and window systems need.
>
> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> (without page faults). That means dma_fence for synchronization is doa, at
> least as-is, and we're back to figuring out the winsys problem.
>
> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
> akin to saying that we're solving deadlock issues in a locking design by
> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
> avoids having to reach the reset button, but that's about it.
>
> And the fundamental problem is that once you throw in userspace command
> submission (and syncing, at least within the userspace driver, otherwise
> there's kinda no point if you still need the kernel for cross-engine sync),
> you get deadlocks if you still use dma_fence for sync under
> perfectly legit use-cases. We've discussed that one ad nauseam last summer:
>
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>
> See silly diagram at the bottom.
>
> Now I think all isn't lost, because imo the first step to getting to this
> brave new world is rebuilding the driver on top of userspace fences, and
> with the adjusted cmd submit model. You probably don't want to use amdkfd,
> but port that as a context flag or similar to render nodes for gl/vk. Of
> course that means you can only use this mode in headless, without
> glx/wayland winsys support, but it's a start.
> -Daniel
>
> >
> > Marek
> >
> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone 
> wrote:
> >
> > > Hi,
> > >
> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
> > >
> > >> The thing is, you can't do this in drm/scheduler. At least not without
> > >> splitting up the dma_fence in the kernel into separate memory fences
> > >> and sync fences
> > >
> > >
> > > I'm starting to think this thread needs its own glossary ...
> > >
> > > I propose we use 'residency fence' for execution fences which enact
> > > memory-residency operations, e.g. faulting in a page ultimately
> > > depending on GPU work retiring.
> > >
> > >

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-26 Thread Marek Olšák
Thanks everybody. The initial proposal is dead. Here are some thoughts on
how to do it differently.

I think we can have direct command submission from userspace via
memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM.
Instead, it can wait for user queues of a specific process to go idle and
then unmap the queues, so that userspace can't submit anything. Buffer
evictions, pinning, etc. can be executed when all queues are unmapped
(suspended). Thus, no BO fences and page faults are needed.

Inter-process synchronization can use timeline semaphores. Userspace will
query the wait and signal value for a shared buffer from the kernel. The
kernel will keep a history of those queries to know which process is
responsible for signalling which buffer. There is only the wait-timeout
issue and how to identify the culprit. One of the solutions is to have the
GPU send all GPU signal commands and all timed out wait commands via an
interrupt to the kernel driver to monitor and validate userspace behavior.
With that, it can be identified whether the culprit is the waiting process
or the signalling process and which one. Invalid signal/wait parameters can
also be detected. The kernel can force-signal only the semaphores that time
out, and punish the processes which caused the timeout or used invalid
signal/wait parameters.

The question is whether this synchronization solution is robust enough for
dma_fence and whatever the kernel and window systems need.

Marek

On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone  wrote:

> Hi,
>
> On Tue, 20 Apr 2021 at 20:30, Daniel Vetter  wrote:
>
>> The thing is, you can't do this in drm/scheduler. At least not without
>> splitting up the dma_fence in the kernel into separate memory fences
>> and sync fences
>
>
> I'm starting to think this thread needs its own glossary ...
>
> I propose we use 'residency fence' for execution fences which enact
> memory-residency operations, e.g. faulting in a page ultimately depending
> on GPU work retiring.
>
> And 'value fence' for the pure-userspace model suggested by timeline
> semaphores, i.e. fences being (*addr == val) rather than being able to look
> at ctx seqno.
>
> Cheers,
> Daniel


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Marek Olšák
On Tue, Apr 20, 2021 at 2:39 PM Daniel Vetter  wrote:

> On Tue, Apr 20, 2021 at 6:25 PM Marek Olšák  wrote:
> >
> > Daniel, imagine hardware that can only do what Windows does: future
> fences signalled by userspace whenever userspace wants, and no kernel
> queues like we have today.
> >
> > The only reason why current AMD GPUs work is because they have a ring
> buffer per queue with pointers to userspace command buffers followed by
> fences. What will we do if that ring buffer is removed?
>
> Well this is an entirely different problem than what you set out to
> describe. This is essentially the problem where hw does not have any
> support for privileged commands and a separate privileged command
> buffer, and direct userspace submit is the only thing that is
> available.
>
> I think if this is your problem, then you get to implement some very
> interesting compat shim. But that's an entirely different problem from
> what you've described in your mail. This pretty much assumes at the hw
> level the only thing that works is ATS/pasid, and vram is managed with
> HMM exclusively. Once you have that pure driver stack you get to fake
> it in the kernel for compat with everything that exists already. How
> exactly that will look and how exactly you best construct your
> dma_fences for compat will depend highly upon how much is still there
> in this hw (e.g. wrt interrupt generation). A lot of the
> infrastructure was also done as part of drm_syncobj. I mean we have
> entirely fake kernel drivers like vgem/vkms that create dma_fence, so
> a hw ringbuffer is really not required.
>
> So ... is this your problem underneath it all, or was that more a wild
> strawman for the discussion?
>

Yes, that's the problem.

Marek


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Marek Olšák
Daniel, imagine hardware that can only do what Windows does: future fences
signalled by userspace whenever userspace wants, and no kernel queues like
we have today.

The only reason why current AMD GPUs work is because they have a ring
buffer per queue with pointers to userspace command buffers followed by
fences. What will we do if that ring buffer is removed?

Marek

On Tue, Apr 20, 2021 at 11:50 AM Daniel Stone  wrote:

> Hi,
>
> On Tue, 20 Apr 2021 at 16:16, Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Am 20.04.21 um 17:07 schrieb Daniel Stone:
>>
>> If the compositor no longer has a guarantee that the buffer will be ready
>> for composition in a reasonable amount of time (which dma_fence gives us,
>> and this proposal does not appear to give us), then the compositor isn't
>> trying to use the buffer for compositing, it's waiting asynchronously on a
>> notification that the fence has signaled before it attempts to use the
>> buffer.
>>
>> Marek's initial suggestion is that the kernel signal the fence, which
>> would unblock composition (and presumably show garbage on screen, or at
>> best jump back to old content).
>>
>> My position is that the compositor will know the process has crashed
>> anyway - because its socket has been closed - at which point we destroy all
>> the client's resources including its windows and buffers regardless.
>> Signaling the fence doesn't give us any value here, _unless_ the compositor
>> is just blindly waiting for the fence to signal ... which it can't do
>> because there's no guarantee the fence will ever signal.
>>
>>
>> Yeah, but that assumes that the compositor has changed to not blindly wait
>> for the client to finish rendering, and as Daniel explained that is rather
>> unrealistic.
>>
>> What we need is a fallback mechanism which signals the fence after a
>> timeout and gives a penalty to the one causing the timeout.
>>
>> That gives us the same functionality we have today with the in software
>> scheduler inside the kernel.
>>
>
> OK, if that's the case then I think I'm really missing something which
> isn't explained in this thread, because I don't understand what the
> additional complexity and API change gains us (see my first reply in this
> thread).
>
> By way of example - say I have a blind-but-explicit compositor that takes
> a drm_syncobj along with a dmabuf with each client presentation request,
> but doesn't check syncobj completion, it just imports that into a
> VkSemaphore + VkImage and schedules work for the next frame.
>
> Currently, that generates an execbuf ioctl for the composition (ignore KMS
> for now) with a sync point to wait on, and the kernel+GPU scheduling
> guarantees that the composition work will not begin until the client
> rendering work has retired. We have a further guarantee that this work will
> complete in reasonable time, for some value of 'reasonable'.
>
> My understanding of this current proposal is that:
> * userspace creates a 'present fence' with this new ioctl
> * the fence becomes signaled when a value is written to a location in
> memory, which is visible through both CPU and GPU mappings of that page
> * this 'present fence' is imported as a VkSemaphore (?) and the userspace
> Vulkan driver will somehow wait on this value, either before submitting
> work or as a possibly-hardware-assisted GPU-side wait (?)
> * the kernel's scheduler is thus eliminated from the equation, and every
> execbuf is submitted directly to hardware, because either userspace knows
> that the fence has already been signaled, or it will issue a GPU-side wait
> (?)
> * but the kernel is still required to monitor completion of every fence
> itself, so it can forcibly complete, or penalise the client (?)
>
> Lastly, let's say we stop ignoring KMS: what happens for the
> render-with-GPU-display-on-KMS case? Do we need to do the equivalent of
> glFinish() in userspace and only submit the KMS atomic request when the GPU
> work has fully retired?
>
> Clarifying those points would be really helpful so this is less of a
> strawman. I have some further opinions, but I'm going to wait until I
> understand what I'm actually arguing against before I go too far. :) The
> last point is very salient though.
>
> Cheers,
> Daniel
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-20 Thread Marek Olšák
Daniel, are you suggesting that we should skip any deadlock prevention in
the kernel, and just let userspace wait for and signal any fence it has
access to?

Do you have any concern with the deprecation/removal of BO fences in the
kernel assuming userspace is only using explicit fences? Any concern with
the submit and return fences for modesetting and other producer<->consumer
scenarios?

Thanks,
Marek

On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter  wrote:

> On Tue, Apr 20, 2021 at 12:15 PM Christian König
>  wrote:
> >
> > Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > Not going to comment on everything on the first pass...
> > >
> > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> > >> Hi,
> > >>
> > >> This is our initial proposal for explicit fences everywhere and new
> memory management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> > >>
> > >>
> > >> 1. Introduction
> > >> (skip this if you are already sold on explicit fences)
> > >>
> > >> The current Linux graphics architecture was initially designed for
> GPUs with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> > >>
> > >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.
> > >>
> > >>
> > >> 2. Explicit synchronization for window systems and modesetting
> > >>
> > >> The producer is an application and the consumer is a compositor or a
> modesetting driver.
> > >>
> > >> 2.1. The Present request
> > >>
> > >> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> > >> - The submit fence: Initially unsignalled, it will be signalled when
> the producer has finished drawing into the presented buffer.
> > >> - The return fence: Initially unsignalled, it will be signalled when
> the consumer has finished using the presented buffer.
> > > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > > to go even further to something we're calling "userspace fences" which
> > > are a timeline implemented as a single 64-bit value in some
> > > CPU-mappable BO.  The client writes a higher value into the BO to
> > > signal the timeline.
> >
> > Well that is exactly what our Windows guys have suggested as well, but
> > it strongly looks like this isn't sufficient.
> >
> > First of all you run into security problems when any application can
> > just write any value to that memory location. Just imagine an
> > application sets the counter to zero and X waits forever for some
> > rendering to finish.
>
> The thing is, with userspace fences the security boundary issue
> moves into userspace entirely. And it really doesn't matter whether
> the event you're waiting on doesn't complete because the other app
> crashed or was stupid or intentionally gave you a wrong fence point:
> You have to somehow handle that, e.g. perhaps with conditional
> rendering and just using the old frame in compositing if the new one
> doesn't show up in time. Or something like that. So trying to get the
> kernel involved but also not so much involved sounds like a bad design
> to me.
>
> > In addition to that, in such a model you can't determine which queue is
> > the guilty one in case of a hang, and you can't reset the synchronization
> > primitives in case of an error.
> >
> > Apart from that this is rather inefficient, e.g. we don't have any way
> > to prevent priority inversion

Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-19 Thread Marek Olšák
We already don't have accurate BO fences in some cases. Instead, BOs can
have fences which are equal to the last seen command buffer for each queue.
It's practically the same as if the kernel had no visibility into command
submissions and just added a fence into all queues when it needed to wait
for idle. That's already one alternative to BO fences that would work
today. The only BOs that need accurate BO fences are shared buffers, and
those use cases can be converted to explicit fences.

Removing memory management from all command buffer submission logic would
be one of the benefits that is quite appealing.

You don't need to depend on apps for budgeting and placement determination.
You can sort buffers according to driver usage, e.g. scratch/spill buffers,
shader IO rings, MSAA images, other images, and buffers. Alternatively, you
can have just internal buffers vs app buffers. Then you assign VRAM from
left to right until you reach the quota. This is optional, so this part can
be ignored.

>> - A GPU hang signals all fences. Other deadlocks will be handled like
>> GPU hangs.
>
> What do you mean by "all"?  All fences that were supposed to be
> signaled by the hung context?

Yes, that's one of the possibilities. Any GPU hang followed by a GPU reset
can clear VRAM, so all processes should recreate their contexts and
reinitialize resources. A deadlock caused by userspace could be handled
similarly.

I don't know how timeline fences would work across processes and how
resilient they would be to segfaults.

Marek

On Mon, Apr 19, 2021 at 11:48 AM Jason Ekstrand 
wrote:

> Not going to comment on everything on the first pass...
>
> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> >
> > Hi,
> >
> > This is our initial proposal for explicit fences everywhere and new
> memory management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> >
> >
> > 1. Introduction
> > (skip this if you are already sold on explicit fences)
> >
> > The current Linux graphics architecture was initially designed for GPUs
> with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> >
> > The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.
> >
> >
> > 2. Explicit synchronization for window systems and modesetting
> >
> > The producer is an application and the consumer is a compositor or a
> modesetting driver.
> >
> > 2.1. The Present request
> >
> > As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> > - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
> > - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.
>
> I'm not sure syncobj is what we want.  In the Intel world we're trying
> to go even further to something we're calling "userspace fences" which
> are a timeline implemented as a single 64-bit value in some
> CPU-mappable BO.  The client writes a higher value into the BO to
> signal the timeline.  The kernel then provides some helpers for
> waiting on them reliably and without spinning.  I don't expect
> everyone to support these right away but, if we're going to re-plumb
> userspace for explicit synchronization, I'd like to make sure we take
> this into account so we only have to do it once.
>
>
> > Deadlock mitigation to recover from segfaults:
> > - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
>
> This isn't clear to me.  Yes, if we'
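
A minimal sketch of the "userspace fences" Jason describes above: a
monotonically increasing 64-bit timeline value in a CPU-mappable BO.
Signaling is a plain store; this toy wait just spins and yields, whereas
the reliable no-spin kernel helpers he mentions are not shown:

#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

struct user_fence {
    _Atomic uint64_t *value;   /* one 64-bit slot in a CPU-mappable BO */
};

/* Signal timeline point n: publish all prior writes, then store n. */
static void ufence_signal(struct user_fence *f, uint64_t point)
{
    atomic_store_explicit(f->value, point, memory_order_release);
}

/* Wait until the timeline reaches 'point'. Nothing stops a buggy or
 * malicious signaler from never getting there, which is exactly the
 * problem discussed in this thread. */
static void ufence_wait(struct user_fence *f, uint64_t point)
{
    while (atomic_load_explicit(f->value, memory_order_acquire) < point)
        sched_yield();
}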

[RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-19 Thread Marek Olšák
Hi,

This is our initial proposal for explicit fences everywhere and new memory
management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


*1. Introduction*
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs
with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.


*2. Explicit synchronization for window systems and modesetting*

The producer is an application and the consumer is a compositor or a
modesetting driver.

*2.1. The Present request*

As part of the Present request, the producer will pass 2 fences (sync
objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the
producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the
consumer has finished using the presented buffer.

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This
information is part of the Present request and supplied by userspace.
- If the producer crashes, the kernel signals the submit fence, so that the
consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the
producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like GPU
hangs.
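
As a rough illustration of the consumer's side, here is a minimal sketch
using today's syncobj helpers from libdrm (illustration only; the final
interface may differ):

  #include <stdint.h>
  #include <xf86drm.h>

  /* 'submit' and 'ret' are syncobj handles received with the DMABUF. */
  static int consume_present(int drm_fd, uint32_t submit, uint32_t ret)
  {
      /* Wait until the producer has finished drawing. If the producer
       * crashes, the kernel signals this fence instead. */
      int r = drmSyncobjWait(drm_fd, &submit, 1, INT64_MAX,
                             DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);
      if (r < 0)
          return r;

      /* ... composite or scan out the buffer ... */

      /* Tell the producer it can reuse the buffer. */
      return drmSyncobjSignal(drm_fd, &ret, 1);
  }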

Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be
supported. A merged fence is signalled only when all of its fences are signalled.
The consumer will have the option to redefine the unsignalled return fence
to a merged fence.
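
For sync_files, the kernel already provides exactly this operation via
SYNC_IOC_MERGE; a minimal sketch:

  #include <linux/sync_file.h>
  #include <string.h>
  #include <sys/ioctl.h>

  /* Returns a new fd whose fence signals only when both inputs have
   * signalled, or -1 on error. */
  static int merge_fences(int fd_a, int fd_b)
  {
      struct sync_merge_data data;

      memset(&data, 0, sizeof(data));
      strcpy(data.name, "merged-return-fence");
      data.fd2 = fd_b;

      if (ioctl(fd_a, SYNC_IOC_MERGE, &data) < 0)
          return -1;
      return data.fence;
  }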

*2.2. Modesetting*

Since a modesetting driver can also be the consumer, the present ioctl will
contain a submit fence and a return fence too. One small problem with this
is that userspace can hang the modesetting driver, but in theory, any later
present ioctl can override the previous one, so the unsignalled
presentation is never used.


*3. New memory management*

The per-BO fences will be removed and the kernel will not know which
buffers are busy. This will reduce CPU overhead and latency. The kernel
will not need per-BO fences with explicit synchronization, so we just need
to remove their last user: buffer evictions. It also resolves the current
OOM deadlock.

*3.1. Evictions*

If the kernel wants to move a buffer, it will have to wait for everything
to go idle, halt all userspace command submissions, move the buffer, and
resume everything. This is not expected to happen when memory is not
exhausted. Other more efficient ways of synchronization are also possible
(e.g. sync only one process), but are not discussed here.

*3.2. Per-process VRAM usage quota*

Each process can optionally and periodically query its VRAM usage quota and
change domains of its buffers to obey that quota. For example, a process
allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
GB. The process can change the domains of the least important buffers to
GTT to get the best outcome for itself. If the process doesn't do it, the
kernel will choose which buffers to evict at random. (thanks to Christian
Koenig for this idea)
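
A sketch of what that loop could look like in userspace; every name below
is hypothetical, since no such uAPI exists today:

  #include <stdint.h>

  struct bo { uint64_t size; };

  uint64_t query_vram_quota(int fd);                        /* hypothetical */
  uint64_t query_vram_usage(int fd);                        /* hypothetical */
  struct bo *next_least_important(int fd, struct bo *prev); /* hypothetical */
  void move_bo_to_gtt(int fd, struct bo *bo);               /* hypothetical */

  static void obey_vram_quota(int fd)
  {
      uint64_t quota = query_vram_quota(fd);
      uint64_t usage = query_vram_usage(fd);
      struct bo *bo = next_least_important(fd, NULL);

      /* Demote the least important buffers to GTT until we fit. */
      while (usage > quota && bo) {
          move_bo_to_gtt(fd, bo);
          usage -= bo->size;
          bo = next_least_important(fd, bo);
      }
  }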

*3.3. Buffer destruction without per-BO fences*

When the buffer destroy ioctl is called, an optional fence list can be
passed to the kernel to indicate when it's safe to deallocate the buffer.
If the fence list is empty, the buffer will be deallocated immediately.
Shared buffers will be handled by merging fence lists from all processes
that destroy them. Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a
timeout period and then will proceed to deallocate the buffer anyway.
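
A sketch of what the ioctl arguments could look like; the struct and its
fields are hypothetical, purely for illustration:

  #include <stdint.h>

  /* Hypothetical destroy-with-fences arguments: the kernel frees the BO
   * once every listed fence has signalled, or after a timeout if some
   * fence never does. An empty list frees the BO immediately. */
  struct drm_bo_destroy_fenced {
      uint32_t bo_handle;
      uint32_t num_fences;   /* 0 = deallocate immediately */
      uint64_t fences_ptr;   /* userspace array of fence handles */
      uint64_t timeout_ns;   /* upper bound before forced deallocation */
  };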

*3.4. Other notes on MM*

Overcommitment of GPU-accessible memory will cause an allocation failure or
invoke the OOM killer.

Re: Is LLVM 13 (git) really ready for testing/development? libclc didn't compile

2021-03-05 Thread Marek Olšák
Hi,

I can't answer this because our Mesa team doesn't work on LLVM and we don't
build libclc.

Marek

On Thu, Mar 4, 2021 at 10:20 PM Dieter Nützel  wrote:

> Hello Marek,
>
> can't compile anything, here.
> Poor Intel Nehalem X3470.
>
> Trying LLVM 12-rc2 later.
>
> Greetings,
> Dieter
>
> llvm-project/libclc> cd build && cmake ../
> -- The CXX compiler identification is GNU 10.2.1
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: /usr/bin/c++ - skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> LLVM version: 13.0.0git
> LLVM system libs:
> LLVM libs: -lLLVM-13git
> LLVM libdir: /usr/local/lib
> LLVM bindir: /usr/local/bin
> LLVM ld flags: -L/usr/local/lib
> LLVM cxx flags:
>
> -I/usr/local/include;-std=c++14;;;-fno-exceptions;-D_GNU_SOURCE;-D__STDC_CONSTANT_MACROS;-D__STDC_FORMAT_MACROS;-D__STDC_LIMIT_MACROS;-fno-rtti;-fno-exceptions
>
> clang: /usr/local/bin/clang
> llvm-as: /usr/local/bin/llvm-as
> llvm-link: /usr/local/bin/llvm-link
> opt: /usr/local/bin/opt
> llvm-spirv: /usr/local/bin/llvm-spirv
>
> -- Check for working CLC compiler: /usr/local/bin/clang
> -- Check for working CLC compiler: /usr/local/bin/clang -- works
> -- Check for working LLAsm compiler: /usr/local/bin/llvm-as
> -- Check for working LLAsm compiler: /usr/local/bin/llvm-as -- broken
> CMake Error at cmake/CMakeTestLLAsmCompiler.cmake:40 (message):
>The LLAsm compiler "/usr/local/bin/llvm-as" is not able to compile a
> simple
>test program.
>
>It fails with the following output:
>
> Change Dir: /opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp
>
>
>
>Run Build Command(s):/usr/bin/gmake cmTC_87af9/fast && /usr/bin/gmake
> -f
>CMakeFiles/cmTC_87af9.dir/build.make CMakeFiles/cmTC_87af9.dir/build
>
>gmake[1]: Entering directory
>'/opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp'
>
>Building LLAsm object CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc
>
>/usr/local/bin/clang -E -P -x cl
>
> /opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp/testLLAsmCompiler.ll
> -o
>CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc.temp
>
>/usr/local/bin/llvm-as -o
> CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc
>CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc.temp
>
>/usr/local/bin/llvm-as:
>CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc.temp:1:1: error:
> expected
>top-level entity
>
>typedef unsigned char uchar;
>
>^
>
>gmake[1]: *** [CMakeFiles/cmTC_87af9.dir/build.make:86:
>CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc] Error 1
>
>gmake[1]: Leaving directory
>'/opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp'
>
>gmake: *** [Makefile:140: cmTC_87af9/fast] Error 2
>
>
>
>
>
>
>
>CMake will not be able to correctly generate this project.
> Call Stack (most recent call first):
>CMakeLists.txt:127 (enable_language)
>
>
> -- Configuring incomplete, errors occurred!
> See also "/opt/llvm-project/libclc/build/CMakeFiles/CMakeOutput.log".
> See also "/opt/llvm-project/libclc/build/CMakeFiles/CMakeError.log".
>
>
> CMakeError.log
> Determining if the LLAsm compiler works failed with the following
> output:
> Change Dir: /opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp
>
> Run Build Command(s):/usr/bin/gmake cmTC_87af9/fast && /usr/bin/gmake
> -f CMakeFiles/cmTC_87af9.dir/build.make CMakeFiles/cmTC_87af9.dir/build
> gmake[1]: Entering directory
> '/opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp'
> Building LLAsm object CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc
> /usr/local/bin/clang -E -P -x cl
> /opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp/testLLAsmCompiler.ll
> -o CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc.temp
> /usr/local/bin/llvm-as -o CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc
> CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc.temp
> /usr/local/bin/llvm-as:
> CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc.temp:1:1: error: expected
> top-level entity
> typedef unsigned char uchar;
> ^
> gmake[1]: *** [CMakeFiles/cmTC_87af9.dir/build.make:86:
> CMakeFiles/cmTC_87af9.dir/testLLAsmCompiler.bc] Error 1
> gmake[1]: Leaving directory
> '/opt/llvm-project/libclc/build/CMakeFiles/CMakeTmp'
> gmake: *** [Makefile:140: cmTC_87af9/fast] Error 2
>


Re: [PATCH 8/8] drm/amd/display: Expose modifiers.

2020-09-02 Thread Marek Olšák
OK. Reviewed-by: Marek Olšák 

Marek

On Wed, Sep 2, 2020 at 6:31 AM Bas Nieuwenhuizen 
wrote:

> On Fri, Aug 7, 2020 at 9:43 PM Marek Olšák  wrote:
> >
> > On Tue, Aug 4, 2020 at 5:32 PM Bas Nieuwenhuizen <
> b...@basnieuwenhuizen.nl> wrote:
> >>
> >> This exposes modifier support on GFX9+.
> >>
> >> Only modifiers that can be rendered on the current GPU are
> >> added. This is to reduce the number of modifiers exposed.
> >>
> >> The HW could expose more, but the best mechanism to decide
> >> what to expose without an explosion in modifiers is still
> >> to be decided, and in the meantime this should not regress
> >> things from pre-modifiers and does not risk regressions as
> >> we make up our mind in the future.
> >>
> >> Signed-off-by: Bas Nieuwenhuizen 
> >> ---
> >>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 343 +-
> >>  1 file changed, 342 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> index c38257081868..6594cbe625f9 100644
> >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> @@ -3891,6 +3891,340 @@ fill_gfx9_tiling_info_from_modifier(const
> struct amdgpu_device *adev,
> >> }
> >>  }
> >>
> >> +enum dm_micro_swizzle {
> >> +   MICRO_SWIZZLE_Z = 0,
> >> +   MICRO_SWIZZLE_S = 1,
> >> +   MICRO_SWIZZLE_D = 2,
> >> +   MICRO_SWIZZLE_R = 3
> >> +};
> >> +
> >> +static bool dm_plane_format_mod_supported(struct drm_plane *plane,
> >> + uint32_t format,
> >> + uint64_t modifier)
> >> +{
> >> +   struct amdgpu_device *adev = plane->dev->dev_private;
> >> +   const struct drm_format_info *info = drm_format_info(format);
> >> +
> >> +   enum dm_micro_swizzle microtile =
> modifier_gfx9_swizzle_mode(modifier) & 3;
> >> +
> >> +   if (!info)
> >> +   return false;
> >> +
> >> +   /*
> >> +* We always have to allow this modifier, because core DRM still
> >> +* checks LINEAR support if userspace does not provide modifiers.
> >> +*/
> >> +   if (modifier == DRM_FORMAT_MOD_LINEAR)
> >> +   return true;
> >> +
> >> +   /*
> >> +* The arbitrary tiling support for multiplane formats has not
> been hooked
> >> +* up.
> >> +*/
> >> +   if (info->num_planes > 1)
> >> +   return false;
> >> +
> >> +   /*
> >> +* For D swizzle the canonical modifier depends on the bpp, so
> check
> >> +* it here.
> >> +*/
> >> +   if (AMD_FMT_MOD_GET(TILE_VERSION, modifier) ==
> AMD_FMT_MOD_TILE_VER_GFX9 &&
> >> +   adev->family >= AMDGPU_FAMILY_NV) {
> >> +   if (microtile == MICRO_SWIZZLE_D && info->cpp[0] == 4)
> >> +   return false;
> >> +   }
> >> +
> >> +   if (adev->family >= AMDGPU_FAMILY_RV && microtile ==
> MICRO_SWIZZLE_D &&
> >> +   info->cpp[0] < 8)
> >> +   return false;
> >> +
> >> +   if (modifier_has_dcc(modifier)) {
> >> +   /* Per radeonsi comments 16/64 bpp are more
> complicated. */
> >> +   if (info->cpp[0] != 4)
> >> +   return false;
> >> +   }
> >> +
> >> +   return true;
> >> +}
> >> +
> >> +static void
> >> +add_modifier(uint64_t **mods, uint64_t *size, uint64_t *cap, uint64_t
> mod)
> >> +{
> >> +   if (!*mods)
> >> +   return;
> >> +
> >> +   if (*cap - *size < 1) {
> >> +   uint64_t new_cap = *cap * 2;
> >> +   uint64_t *new_mods = kmalloc(new_cap *
> sizeof(uint64_t), GFP_KERNEL);
> >> +
> >> +   if (!new_mods) {
> >> +   kfree(*mods);
> >> +   *mods = NULL;
> >> +   return;
> >> +   }
> >> +
> >> +   m

Re: [PATCH 3/7] drm/amd/display: Avoid using unvalidated tiling_flags and tmz_surface in prepare_planes

2020-08-16 Thread Marek Olšák
On Wed, Aug 12, 2020 at 9:54 AM Daniel Vetter  wrote:

> On Tue, Aug 11, 2020 at 09:42:11AM -0400, Marek Olšák wrote:
> > There are a few cases when the flags can change, for example DCC can be
> > disabled due to a hw limitation in the 3d engine. Modifiers give the
> > misleading impression that they help with that, but they don't. They
> don't
> > really help with anything.
>
> But if that happens, how do you tell the other side that it needs to
> sample new flags? Does that just happen all the time?
>
> Also, do the DCC state changes happen for shared buffers too?
>

I thought we were only talking about shared buffers.

If the other side is only a consumer and the producer must disable DCC, the
producer decompresses DCC and then disables it and updates the BO flags.
The consumer doesn't need the new flags, because even if DCC stays enabled
in the consumer, it's in a decompressed state (it has no effect). Only the
producer knows it's disabled, and any new consumer will also know it when
it queries the latest BO flags.

It doesn't work if both sides use writes, because it's not communicated
that DCC is disabled (BO flags are queried only once). This hasn't been a
problem so far.

Is there a way to disable DCC correctly and safely across processes? Yes.
So why don't we do it? Because it would add more GPU overhead.

Marek


Re: [PATCH 3/7] drm/amd/display: Avoid using unvalidated tiling_flags and tmz_surface in prepare_planes

2020-08-11 Thread Marek Olšák
There are a few cases when the flags can change, for example DCC can be
disabled due to a hw limitation in the 3d engine. Modifiers give the
misleading impression that they help with that, but they don't. They don't
really help with anything.

Marek

On Mon., Aug. 10, 2020, 08:30 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 10.08.20 um 14:25 schrieb Daniel Vetter:
> > On Fri, Aug 07, 2020 at 10:29:09AM -0400, Kazlauskas, Nicholas wrote:
> >> On 2020-08-07 4:30 a.m., dan...@ffwll.ch wrote:
> >>> On Thu, Jul 30, 2020 at 04:36:38PM -0400, Nicholas Kazlauskas wrote:
>  [Why]
>  We're racing with userspace as the flags could potentially change
>  from when we acquired and validated them in commit_check.
> >>> Uh ... I didn't know these could change. I think my comments on Bas'
> >>> series are even more relevant now. I think long term would be best to
> bake
> >>> these flags in at addfb time when modifiers aren't set. And otherwise
> >>> always use the modifiers flag, and completely ignore the legacy flags
> >>> here.
> >>> -Daniel
> >>>
> >> There's a set tiling/mod flags IOCTL that can be called after addfb
> happens,
> >> so unless there's some sort of driver magic preventing this from working
> >> when it's already been pinned for scanout then I don't see anything
> stopping
> >> this from happening.
> >>
> >> I still need to review the modifiers series in a little more detail but
> that
> >> looks like a good approach to fixing these kind of issues.
> > Yeah we had the same model for i915, but it's awkward and can surprise
> > compositors (since the client could change the tiling mode from
> underneath
> > the compositor). So freezing the tiling mode at addfb time is the right
> > thing to do, and anyway how things work with modifiers.
> >
> > Ofc maybe good to audit the -amd driver, but hopefully it doesn't do
> > anything silly with changed tiling. If it does, it's kinda sad day.
>
> Marek should know this right away, but I think we only set the tiling
> flags once while exporting the BO and then never change them.
>
> Regards,
> Christian.
>
> >
> > Btw for i915 we even went a step further, and made the set_tiling ioctl
> > return an error if a framebuffer for that gem_bo existed. Just to make
> > sure we don't ever accidentally break this.
> >
> > Cheers, Daniel
> >
> >> Regards,
> >> Nicholas Kazlauskas
> >>
>  [How]
>  We unfortunately can't drop this function in its entirety from
>  prepare_planes since we don't know the afb->address at commit_check
>  time yet.
> 
>  So instead of querying new tiling_flags and tmz_surface use the ones
>  from the plane_state directly.
> 
>  While we're at it, also update the force_disable_dcc option based
>  on the state from atomic check.
> 
>  Cc: Bhawanpreet Lakha 
>  Cc: Rodrigo Siqueira 
>  Signed-off-by: Nicholas Kazlauskas 
>  ---
> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36
> ++-
> 1 file changed, 19 insertions(+), 17 deletions(-)
> 
>  diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>  index bf1881bd492c..f78c09c9585e 100644
>  --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>  +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>  @@ -5794,14 +5794,8 @@ static int dm_plane_helper_prepare_fb(struct
> drm_plane *plane,
> struct list_head list;
> struct ttm_validate_buffer tv;
> struct ww_acquire_ctx ticket;
>  -  uint64_t tiling_flags;
> uint32_t domain;
> int r;
>  -  bool tmz_surface = false;
>  -  bool force_disable_dcc = false;
>  -
>  -  dm_plane_state_old = to_dm_plane_state(plane->state);
>  -  dm_plane_state_new = to_dm_plane_state(new_state);
> if (!new_state->fb) {
> DRM_DEBUG_DRIVER("No FB bound\n");
>  @@ -5845,27 +5839,35 @@ static int dm_plane_helper_prepare_fb(struct
> drm_plane *plane,
> return r;
> }
>  -  amdgpu_bo_get_tiling_flags(rbo, &tiling_flags);
>  -
>  -  tmz_surface = amdgpu_bo_encrypted(rbo);
>  -
> ttm_eu_backoff_reservation(&ticket, &list);
> afb->address = amdgpu_bo_gpu_offset(rbo);
> amdgpu_bo_ref(rbo);
>  +  /**
>  +   * We don't do surface updates on planes that have been newly
> created,
>  +   * but we also don't have the afb->address during atomic check.
>  +   *
>  +   * Fill in buffer attributes depending on the address here, but
> only on
>  +   * newly created planes since they're not being used by DC yet and
> this
>  +   * won't modify global state.
>  +   */
>  +  dm_plane_state_old = to_dm_plane_state(plane->state);
>  +  dm_plane_state_new = to_dm_plane_state(new_state);
>  +
> 

Re: [PATCH 8/8] drm/amd/display: Expose modifiers.

2020-08-07 Thread Marek Olšák
On Tue, Aug 4, 2020 at 5:32 PM Bas Nieuwenhuizen 
wrote:

> This exposes modifier support on GFX9+.
>
> Only modifiers that can be rendered on the current GPU are
> added. This is to reduce the number of modifiers exposed.
>
> The HW could expose more, but the best mechanism to decide
> what to expose without an explosion in modifiers is still
> to be decided, and in the meantime this should not regress
> things from pre-modifiers and does not risk regressions as
> we make up our mind in the future.
>
> Signed-off-by: Bas Nieuwenhuizen 
> ---
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 343 +-
>  1 file changed, 342 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index c38257081868..6594cbe625f9 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -3891,6 +3891,340 @@ fill_gfx9_tiling_info_from_modifier(const struct
> amdgpu_device *adev,
> }
>  }
>
> +enum dm_micro_swizzle {
> +   MICRO_SWIZZLE_Z = 0,
> +   MICRO_SWIZZLE_S = 1,
> +   MICRO_SWIZZLE_D = 2,
> +   MICRO_SWIZZLE_R = 3
> +};
> +
> +static bool dm_plane_format_mod_supported(struct drm_plane *plane,
> + uint32_t format,
> + uint64_t modifier)
> +{
> +   struct amdgpu_device *adev = plane->dev->dev_private;
> +   const struct drm_format_info *info = drm_format_info(format);
> +
> +   enum dm_micro_swizzle microtile =
> modifier_gfx9_swizzle_mode(modifier) & 3;
> +
> +   if (!info)
> +   return false;
> +
> +   /*
> +* We always have to allow this modifier, because core DRM still
> +* checks LINEAR support if userspace does not provide modifiers.
> +*/
> +   if (modifier == DRM_FORMAT_MOD_LINEAR)
> +   return true;
> +
> +   /*
> +* The arbitrary tiling support for multiplane formats has not
> been hooked
> +* up.
> +*/
> +   if (info->num_planes > 1)
> +   return false;
> +
> +   /*
> +* For D swizzle the canonical modifier depends on the bpp, so
> check
> +* it here.
> +*/
> +   if (AMD_FMT_MOD_GET(TILE_VERSION, modifier) ==
> AMD_FMT_MOD_TILE_VER_GFX9 &&
> +   adev->family >= AMDGPU_FAMILY_NV) {
> +   if (microtile == MICRO_SWIZZLE_D && info->cpp[0] == 4)
> +   return false;
> +   }
> +
> +   if (adev->family >= AMDGPU_FAMILY_RV && microtile ==
> MICRO_SWIZZLE_D &&
> +   info->cpp[0] < 8)
> +   return false;
> +
> +   if (modifier_has_dcc(modifier)) {
> +   /* Per radeonsi comments 16/64 bpp are more complicated. */
> +   if (info->cpp[0] != 4)
> +   return false;
> +   }
> +
> +   return true;
> +}
> +
> +static void
> +add_modifier(uint64_t **mods, uint64_t *size, uint64_t *cap, uint64_t mod)
> +{
> +   if (!*mods)
> +   return;
> +
> +   if (*cap - *size < 1) {
> +   uint64_t new_cap = *cap * 2;
> +   uint64_t *new_mods = kmalloc(new_cap * sizeof(uint64_t),
> GFP_KERNEL);
> +
> +   if (!new_mods) {
> +   kfree(*mods);
> +   *mods = NULL;
> +   return;
> +   }
> +
> +   memcpy(new_mods, *mods, sizeof(uint64_t) * *size);
> +   kfree(*mods);
> +   *mods = new_mods;
> +   *cap = new_cap;
> +   }
> +
> +   (*mods)[*size] = mod;
> +   *size += 1;
> +}
> +
> +static void
> +add_gfx9_modifiers(const struct amdgpu_device *adev,
> + uint64_t **mods, uint64_t *size, uint64_t *capacity)
> +{
> +   int pipes =
> ilog2(adev->gfx.config.gb_addr_config_fields.num_pipes);
> +   int pipe_xor_bits = min(8, pipes +
> +
>  ilog2(adev->gfx.config.gb_addr_config_fields.num_se));
> +   int bank_xor_bits = min(8 - pipe_xor_bits,
> +
>  ilog2(adev->gfx.config.gb_addr_config_fields.num_banks));
> +   int rb = ilog2(adev->gfx.config.gb_addr_config_fields.num_se) +
> +
> ilog2(adev->gfx.config.gb_addr_config_fields.num_rb_per_se);
> +
> +
> +   if (adev->family == AMDGPU_FAMILY_RV) {
> +   /*
> +* No _D DCC swizzles yet because we only allow 32bpp,
> which
> +* doesn't support _D on DCN
> +*/
> +
> +   /*
> +* Always enable constant encoding, because the only unit
> that
> +* didn't support it was CB. But on texture/display we can
> +* always interpret it.
> +*/
> +   add_modifier(mods, size, capacity, AMD_FMT_MOD |
> +   AMD_FMT_MOD_SET(TILE,
> AMD_FMT_MOD_TILE_GFX9_64K_S_X) |
> +   AMD_FMT_MOD

Re: [PATCH v3] drm/fourcc: document modifier uniqueness requirements

2020-06-03 Thread Marek Olšák
TMZ is more complicated. If there is a TMZ buffer used by a command buffer,
then all other used buffers must also be TMZ or read only. If no TMZ
buffers are used by a command buffer, then TMZ is disabled. If a context is
not secure, TMZ is also disabled. A context can switch between secure and
non-secure based on the buffers being used.

So mixing secure and non-secure memory writes in one command buffer won't
work. This is not fixable in the driver - apps must be aware of this.
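
The rule can be summarised as a submission-time check; a sketch (not
actual amdgpu code, types invented for illustration):

  #include <stdbool.h>

  struct bo_ref {
      bool is_tmz;      /* buffer lives in TMZ (encrypted) memory */
      bool is_written;  /* the command buffer writes to it */
  };

  /* A command buffer that uses any TMZ buffer may only touch other
   * buffers that are TMZ too, or that are read-only. */
  static bool tmz_submission_valid(const struct bo_ref *refs, unsigned n)
  {
      bool any_tmz = false;
      unsigned i;

      for (i = 0; i < n; i++)
          any_tmz |= refs[i].is_tmz;

      if (!any_tmz)
          return true; /* TMZ stays disabled for this submission */

      for (i = 0; i < n; i++)
          if (!refs[i].is_tmz && refs[i].is_written)
              return false;
      return true;
  }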

Marek

On Wed, Jun 3, 2020 at 5:50 AM Daniel Stone  wrote:

> Hi Alex,
>
> On Mon, 1 Jun 2020 at 15:25, Alex Deucher  wrote:
> > On Fri, May 29, 2020 at 11:03 AM Daniel Stone 
> wrote:
> > > What Weston _does_ know, however, is that display controller can work
> > > with modifier set A, and the GPU can work with modifier set B, and if
> > > the client can pick something from modifier set A, then there is a
> > > much greater probability that Weston can leave the GPU alone so it can
> > > be entirely used by the client. It also knows that if the surface
> > > can't be directly scanned out for whatever reason, then there's no
> > > point in the client optimising for direct scanout, and it can tell the
> > > client to select based on optimality purely for the GPU.
> >
> > Just so I understand this correctly, the main reason for this is to
> > deal with display hardware and render hardware from different vendors
> > which may or may not support any common formats other than linear.
>
> It handles pretty much everything other than a single-context,
> single-GPU, single-device tunnel.
>
> When sharing between subsystems and device categories, it lets us talk
> about capabilities in a more global way. For example, GBM lets you
> talk about 'scanout' and 'texture' and 'render', but what about media
> codecs? We could add the concept of decode/encode to something like
> GBM, and all the protocols like Wayland/X11 as well, then hope it
> actually works, but ...
>
> When sharing between heterogeneous vendors, it lets us talk about
> capabilities in a neutral way. For example, if you look at most modern
> Arm SoCs, your GPU, display controller, and media codec, will very
> likely all be from three totally different vendors. A GPU like
> Mali-T8xx can be shipped in tens of different vendor SoCs in several
> different revisions each. Just saying 'scanout' is totally meaningless
> for the Panfrost driver. Putting awareness for every different KMS
> platform and every different codec down into the Mesa driver is a
> synchronisation nightmare, and all those drivers would also need
> specific awareness about the Mesa driver. So modifiers allow us to
> explicitly describe that we want a particular revision of Arm
> Framebuffer Compression, and all the components can understand that
> without having to be specifically aware of 15 different KMS drivers.
> But even if you have the same vendor ...
>
> When sharing between multiple devices of the same class from the same
> vendor, it lets us surface and transit that information in a generic
> way, without AMD having to figure out ways to tunnel back-channel
> information between different instances of drivers potentially
> targeting different revisions. The alternatives seem to be deeply
> pessimal hacks, and we think we can do better. And when we get
> pessimal ...
>
> In every case, modifiers are about surfacing and sharing information.
> One of the reasons Collabora have been putting so much time and energy
> into this work is exactly _because_ solving those problems on a
> case-by-case basis was a pretty lucrative source of revenue for us.
> Debugging these kinds of issues before has usually involved specific
> driver knowledge, hacking into the driver to insert your own tracing,
> etc.
>
> If you (as someone who's trying to use a device optimally) are
> fortunate enough that you can get the attention of a vendor and have
> them solve the problem for you, then that's lucky for everyone apart
> from the AMD engineers who have to go solve it. If you're not, and you
> can't figure it out yourself, then you have to go pay a consultancy.
> On the face of it, that's good for us, except that we don't want to be
> doing that kind of repetitive boring work. But it's bad for the
> ecosystem that this knowledge is hidden away and that you have to pay
> specialists to extract it. So we're really keen to surface as much
> mechanism and information as possible, to give people the tools to
> either solve their own problems or at least make well-informed
> reports, burn down a toxic source of revenue, waste less engineering
> time extracting hidden information, and empower users as much as
> possible.
>
> > It
> > provides a way to tunnel device capabilities between the different
> > drivers.  In the case of a device with display and rendering on the
> > same device or multiple devices from the same vendor, it's not really
> > that useful.
>
> Oh no, it's still super useful. There are a ton of corner cases where
> 'i

Re: [PATCH v3] drm/fourcc: document modifier uniqueness requirements

2020-05-28 Thread Marek Olšák
On most hardware, there is a minimum pitch alignment for linear images, and any
greater multiple of the alignment is fine.

On Navi, the pitch in bytes for linear must be align(width * bpp / 8, 256).
That's because the hw computes the pitch from the width and doesn't allow
setting a custom pitch. For that reason, multi-GPU sharing might not be
possible if the other GPU doesn't align the pitch in exactly the same way.
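
In code, the only linear pitch Navi10-14 can represent is (a sketch of
the rule above):

  #include <stdint.h>

  /* The hw derives the pitch from the width, so this is the only valid
   * linear pitch on Navi10-14 for a given width and bpp. */
  static uint32_t navi_linear_pitch_bytes(uint32_t width, uint32_t bpp)
  {
      uint32_t pitch = width * bpp / 8;

      return (pitch + 255) & ~255u; /* align up to 256 bytes */
  }

Two GPUs can only share such a buffer if both compute exactly this value.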

Marek

On Thu, May 28, 2020 at 10:38 AM Simon Ser  wrote:

> There have been suggestions to bake pitch alignment, address alignment,
> contiguous memory or other placement (hidden VRAM, GTT/BAR, etc)
> constraints into modifiers. Last time this was brought up it seemed
> like the consensus was to not allow this. Document this in drm_fourcc.h.
>
> There are several reasons for this.
>
> - Encoding all of these constraints in the modifiers would explode the
>   search space pretty quickly (we only have 64 bits to work with).
> - Modifiers need to be unambiguous: a buffer can only have a single
>   modifier.
> - Modifier users aren't expected to parse modifiers.
>
> v2: add paragraph about aliases (Daniel)
>
> v3: fix unrelated changes sent with the patch
>
> Signed-off-by: Simon Ser 
> Reviewed-by: Daniel Vetter 
> Cc: Daniel Stone 
> Cc: Bas Nieuwenhuizen 
> Cc: Dave Airlie 
> Cc: Marek Olšák 
> ---
>  include/uapi/drm/drm_fourcc.h | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
> index 490143500a50..f41fcb1ed63d 100644
> --- a/include/uapi/drm/drm_fourcc.h
> +++ b/include/uapi/drm/drm_fourcc.h
> @@ -58,6 +58,21 @@ extern "C" {
>   * may preserve meaning - such as number of planes - from the fourcc code,
>   * whereas others may not.
>   *
> + * Modifiers must uniquely encode buffer layout. In other words, a buffer
> must
> + * match only a single modifier. A modifier must not be a subset of
> layouts of
> + * another modifier. For instance, it's incorrect to encode pitch
> alignment in
> + * a modifier: a buffer may match a 64-pixel aligned modifier and a
> 32-pixel
> + * aligned modifier. That said, modifiers can have implicit minimal
> + * requirements.
> + *
> + * For modifiers where the combination of fourcc code and modifier can
> alias,
> + * a canonical pair needs to be defined and used by all drivers. An
> example
> + * is AFBC, where both ARGB and ABGR have the exact same compressed
> layout.
> + *
> + * Users see modifiers as opaque tokens they can check for equality and
> + * intersect. Users mustn't need to reason about the modifier value
> + * (i.e. users are not expected to extract information out of the
> modifier).
> + *
>   * Vendors should document their modifier usage in as much detail as
>   * possible, to ensure maximum compatibility across devices, drivers and
>   * applications.
> --
> 2.26.2
>
>
>


Re: [PATCH AUTOSEL 5.6 33/50] drm/amdgpu: bump version for invalidate L2 before SDMA IBs

2020-05-18 Thread Marek Olšák
Hi Sasha,

I disagree with this. Bumping the driver version will have implications for
other new features, because it's like an ABI barrier exposing new
functionality.

Marek

On Thu, May 7, 2020 at 10:28 AM Sasha Levin  wrote:

> From: Marek Olšák 
>
> [ Upstream commit 9017a4897a20658f010bebea825262963c10afa6 ]
>
> This fixes GPU hangs due to cache coherency issues.
> Bump the driver version. Split out from the original patch.
>
> Signed-off-by: Marek Olšák 
> Reviewed-by: Christian König 
> Tested-by: Pierre-Eric Pelloux-Prayer 
> Signed-off-by: Alex Deucher 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 42f4febe24c6d..8d45a2b662aeb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -85,9 +85,10 @@
>   * - 3.34.0 - Non-DC can flip correctly between buffers with different
> pitches
>   * - 3.35.0 - Add drm_amdgpu_info_device::tcc_disabled_mask
>   * - 3.36.0 - Allow reading more status registers on si/cik
> + * - 3.37.0 - L2 is invalidated before SDMA IBs, needed for correctness
>   */
>  #define KMS_DRIVER_MAJOR   3
> -#define KMS_DRIVER_MINOR   36
> +#define KMS_DRIVER_MINOR   37
>  #define KMS_DRIVER_PATCHLEVEL  0
>
>  int amdgpu_vram_limit = 0;
> --
> 2.20.1
>


Re: Plumbing explicit synchronization through the Linux ecosystem

2020-03-19 Thread Marek Olšák
On Thu., Mar. 19, 2020, 06:51 Daniel Vetter,  wrote:

> On Tue, Mar 17, 2020 at 11:01:57AM +0100, Michel Dänzer wrote:
> > On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> > > On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer 
> wrote:
> > >> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> > >>> The synchronization works because the Mesa driver waits for idle
> (drains
> > >>> the GFX pipeline) at the end of command buffers and there is only 1
> > >>> graphics queue, so everything is ordered.
> > >>>
> > >>> The GFX pipeline runs asynchronously to the command buffer, meaning
> the
> > >>> command buffer only starts draws and doesn't wait for completion. If
> the
> > >>> Mesa driver didn't wait at the end of the command buffer, the command
> > >>> buffer would finish and a different process could start execution of
> its
> > >>> own command buffer while shaders of the previous process are still
> > >> running.
> > >>>
> > >>> If the Mesa driver submits a command buffer internally (because it's
> > >> full),
> > >>> it doesn't wait, so the GFX pipeline doesn't notice that a command
> buffer
> > >>> ended and a new one started.
> > >>>
> > >>> The waiting at the end of command buffers happens only when the
> flush is
> > >>> external (Swap buffers, glFlush).
> > >>>
> > >>> It's a performance problem, because the GFX queue is blocked until
> the
> > >> GFX
> > >>> pipeline is drained at the end of every frame at least.
> > >>>
> > >>> So explicit fences for SwapBuffers would help.
> > >>
> > >> Not sure what difference it would make, since the same thing needs to
> be
> > >> done for explicit fences as well, doesn't it?
> > >
> > > No. Explicit fences don't require userspace to wait for idle in the
> command
> > > buffer. Fences are signalled when the last draw is complete and caches
> are
> > > flushed. Before that happens, any command buffer that is not dependent
> on
> > > the fence can start execution. There is never a need for the GPU to be
> idle
> > > if there is enough independent work to do.
> >
> > I don't think explicit fences in the context of this discussion imply
> > using that different fence signalling mechanism though. My understanding
> > is that the API proposed by Jason allows implicit fences to be used as
> > explicit ones and vice versa, so presumably they have to use the same
> > signalling mechanism.
> >
> >
> > Anyway, maybe the different fence signalling mechanism you describe
> > could be used by the amdgpu kernel driver in general, then Mesa could
> > drop the waits for idle and get the benefits with implicit sync as well?
>
> Yeah, this is entirely about the programming model visible to userspace.
> There shouldn't be any impact on the driver's choice of a top vs. bottom
> of the gpu pipeline used for synchronization; that's entirely up to what
> your hw/driver/scheduler can pull off.
>
> Doing a full gfx pipeline flush for shared buffers, when your hw can do
> better, sounds like an issue to me that's not related to this here at all. It
> might be intertwined with amdgpu's special interpretation of dma_resv
> fences though, no idea. We might need to revamp all that. But for a
> userspace client that does nothing fancy (no multiple render buffer
> targets in one bo, or vk style "I write to everything all the time,
> perhaps" stuff) there should be 0 perf difference between implicit sync
> through dma_resv and explicit sync through sync_file/syncobj/dma_fence
> directly.
>
> If there is I'd consider that a bit a driver bug.
>

Last time I checked, there was no fence sync in gnome shell and compiz
after an app passes a buffer to it. So drivers have to invent hacks to work
around it, which decreases performance. It's not a driver bug.

Implicit sync really means that apps and compositors don't sync, so the
driver has to guess when it should sync.

Marek


-Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>


Re: Plumbing explicit synchronization through the Linux ecosystem

2020-03-17 Thread Marek Olšák
On Tue., Mar. 17, 2020, 06:02 Michel Dänzer,  wrote:

> On 2020-03-16 7:33 p.m., Marek Olšák wrote:
> > On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer 
> wrote:
> >> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> >>> The synchronization works because the Mesa driver waits for idle
> (drains
> >>> the GFX pipeline) at the end of command buffers and there is only 1
> >>> graphics queue, so everything is ordered.
> >>>
> >>> The GFX pipeline runs asynchronously to the command buffer, meaning the
> >>> command buffer only starts draws and doesn't wait for completion. If
> the
> >>> Mesa driver didn't wait at the end of the command buffer, the command
> >>> buffer would finish and a different process could start execution of
> its
> >>> own command buffer while shaders of the previous process are still
> >> running.
> >>>
> >>> If the Mesa driver submits a command buffer internally (because it's
> >> full),
> >>> it doesn't wait, so the GFX pipeline doesn't notice that a command
> buffer
> >>> ended and a new one started.
> >>>
> >>> The waiting at the end of command buffers happens only when the flush
> is
> >>> external (Swap buffers, glFlush).
> >>>
> >>> It's a performance problem, because the GFX queue is blocked until the
> >> GFX
> >>> pipeline is drained at the end of every frame at least.
> >>>
> >>> So explicit fences for SwapBuffers would help.
> >>
> >> Not sure what difference it would make, since the same thing needs to be
> >> done for explicit fences as well, doesn't it?
> >
> > No. Explicit fences don't require userspace to wait for idle in the
> command
> > buffer. Fences are signalled when the last draw is complete and caches
> are
> > flushed. Before that happens, any command buffer that is not dependent on
> > the fence can start execution. There is never a need for the GPU to be
> idle
> > if there is enough independent work to do.
>
> I don't think explicit fences in the context of this discussion imply
> using that different fence signalling mechanism though. My understanding
> is that the API proposed by Jason allows implicit fences to be used as
> explicit ones and vice versa, so presumably they have to use the same
> signalling mechanism.
>
>
> Anyway, maybe the different fence signalling mechanism you describe
> could be used by the amdgpu kernel driver in general, then Mesa could
> drop the waits for idle and get the benefits with implicit sync as well?
>

Yes. If there is any waiting, it should be done in the GPU scheduler, not
in the command buffer, so that independent command buffers can use the GFX
queue.

Marek


>
> --
> Earthling Michel Dänzer   |   https://redhat.com
> Libre software enthusiast | Mesa and X developer
>


Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-16 Thread Marek Olšák
On Mon, Mar 16, 2020 at 5:57 AM Michel Dänzer  wrote:

> On 2020-03-16 4:50 a.m., Marek Olšák wrote:
> > The synchronization works because the Mesa driver waits for idle (drains
> > the GFX pipeline) at the end of command buffers and there is only 1
> > graphics queue, so everything is ordered.
> >
> > The GFX pipeline runs asynchronously to the command buffer, meaning the
> > command buffer only starts draws and doesn't wait for completion. If the
> > Mesa driver didn't wait at the end of the command buffer, the command
> > buffer would finish and a different process could start execution of its
> > own command buffer while shaders of the previous process are still
> running.
> >
> > If the Mesa driver submits a command buffer internally (because it's
> full),
> > it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
> > ended and a new one started.
> >
> > The waiting at the end of command buffers happens only when the flush is
> > external (Swap buffers, glFlush).
> >
> > It's a performance problem, because the GFX queue is blocked until the
> GFX
> > pipeline is drained at the end of every frame at least.
> >
> > So explicit fences for SwapBuffers would help.
>
> Not sure what difference it would make, since the same thing needs to be
> done for explicit fences as well, doesn't it?
>

No. Explicit fences don't require userspace to wait for idle in the command
buffer. Fences are signalled when the last draw is complete and caches are
flushed. Before that happens, any command buffer that is not dependent on
the fence can start execution. There is never a need for the GPU to be idle
if there is enough independent work to do.

Marek


Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-15 Thread Marek Olšák
The synchronization works because the Mesa driver waits for idle (drains
the GFX pipeline) at the end of command buffers and there is only 1
graphics queue, so everything is ordered.

The GFX pipeline runs asynchronously to the command buffer, meaning the
command buffer only starts draws and doesn't wait for completion. If the
Mesa driver didn't wait at the end of the command buffer, the command
buffer would finish and a different process could start execution of its
own command buffer while shaders of the previous process are still running.

If the Mesa driver submits a command buffer internally (because it's full),
it doesn't wait, so the GFX pipeline doesn't notice that a command buffer
ended and a new one started.

The waiting at the end of command buffers happens only when the flush is
external (Swap buffers, glFlush).

It's a performance problem, because the GFX queue is blocked until the GFX
pipeline is drained at the end of every frame at least.

So explicit fences for SwapBuffers would help.
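
A sketch of what that could look like on the producer side, using the
existing EGL_ANDROID_native_fence_sync extension purely as an
illustration:

  #define EGL_EGLEXT_PROTOTYPES
  #include <EGL/egl.h>
  #include <EGL/eglext.h>
  #include <GLES2/gl2.h>

  /* Returns a fence fd tracking the pending rendering; the compositor
   * waits on it instead of the driver draining the whole GFX pipeline.
   * Real code would resolve the entry points via eglGetProcAddress and
   * check for extension support first. */
  static int export_frame_fence(EGLDisplay dpy)
  {
      EGLSyncKHR sync =
          eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);

      glFlush(); /* submit the work the fence will track */
      return eglDupNativeFenceFDANDROID(dpy, sync);
  }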

Marek

On Sun., Mar. 15, 2020, 22:49 Jason Ekstrand,  wrote:

> Could you elaborate? If there's something missing from my mental model of
> how implicit sync works, I'd like to have it corrected. People continue
> claiming that AMD is somehow special but I have yet to grasp what makes it
> so.  (Not that anyone has bothered to try all that hard to explain it.)
>
>
> --Jason
>
> On March 13, 2020 21:03:21 Marek Olšák  wrote:
>
>> There is no synchronization between processes (e.g. 3D app and
>> compositor) within X on AMD hw. It works because of some hacks in Mesa.
>>
>> Marek
>>
>> On Wed, Mar 11, 2020 at 1:31 PM Jason Ekstrand 
>> wrote:
>>
>>> All,
>>>
>>> Sorry for casting such a broad net with this one. I'm sure most people
>>> who reply will get at least one mailing list rejection.  However, this
>>> is an issue that affects a LOT of components and that's why it's
>>> thorny to begin with.  Please pardon the length of this e-mail as
>>> well; I promise there's a concrete point/proposal at the end.
>>>
>>>
>>> Explicit synchronization is the future of graphics and media.  At
>>> least, that seems to be the consensus among all the graphics people
>>> I've talked to.  I had a chat with one of the lead Android graphics
>>> engineers recently who told me that doing explicit sync from the start
>>> was one of the best engineering decisions Android ever made.  It's
>>> also the direction being taken by more modern APIs such as Vulkan.
>>>
>>>
>>> ## What are implicit and explicit synchronization?
>>>
>>> For those that aren't familiar with this space, GPUs, media encoders,
>>> etc. are massively parallel and synchronization of some form is
>>> required to ensure that everything happens in the right order and
>>> avoid data races.  Implicit synchronization is when bits of work (3D,
>>> compute, video encode, etc.) are implicitly based on the absolute
>>> CPU-time order in which API calls occur.  Explicit synchronization is
>>> when the client (whatever that means in any given context) provides
>>> the dependency graph explicitly via some sort of synchronization
>>> primitives.  If you're still confused, consider the following
>>> examples:
>>>
>>> With OpenGL and EGL, almost everything is implicit sync.  Say you have
>>> two OpenGL contexts sharing an image where one writes to it and the
>>> other textures from it.  The way the OpenGL spec works, the client has
>>> to make the API calls to render to the image before (in CPU time) it
>>> makes the API calls which texture from the image.  As long as it does
>>> this (and maybe inserts a glFlush?), the driver will ensure that the
>>> rendering completes before the texturing happens and you get correct
>>> contents.
>>>
>>> Implicit synchronization can also happen across processes.  Wayland,
>>> for instance, is currently built on implicit sync where the client
>>> does their rendering and then does a hand-off (via wl_surface::commit)
>>> to tell the compositor it's done at which point the compositor can now
>>> texture from the surface.  The hand-off ensures that the client's
>>> OpenGL API calls happen before the server's OpenGL API calls.
>>>
>>> A good example of explicit synchronization is the Vulkan API.  There,
>>> a client (or multiple clients) can simultaneously build command
>>> buffers in different threads where one of those command buffers
>>>

Re: [Mesa-dev] Plumbing explicit synchronization through the Linux ecosystem

2020-03-13 Thread Marek Olšák
There is no synchronization between processes (e.g. 3D app and compositor)
within X on AMD hw. It works because of some hacks in Mesa.

Marek

On Wed, Mar 11, 2020 at 1:31 PM Jason Ekstrand  wrote:

> All,
>
> Sorry for casting such a broad net with this one. I'm sure most people
> who reply will get at least one mailing list rejection.  However, this
> is an issue that affects a LOT of components and that's why it's
> thorny to begin with.  Please pardon the length of this e-mail as
> well; I promise there's a concrete point/proposal at the end.
>
>
> Explicit synchronization is the future of graphics and media.  At
> least, that seems to be the consensus among all the graphics people
> I've talked to.  I had a chat with one of the lead Android graphics
> engineers recently who told me that doing explicit sync from the start
> was one of the best engineering decisions Android ever made.  It's
> also the direction being taken by more modern APIs such as Vulkan.
>
>
> ## What are implicit and explicit synchronization?
>
> For those that aren't familiar with this space, GPUs, media encoders,
> etc. are massively parallel and synchronization of some form is
> required to ensure that everything happens in the right order and
> avoid data races.  Implicit synchronization is when bits of work (3D,
> compute, video encode, etc.) are implicitly based on the absolute
> CPU-time order in which API calls occur.  Explicit synchronization is
> when the client (whatever that means in any given context) provides
> the dependency graph explicitly via some sort of synchronization
> primitives.  If you're still confused, consider the following
> examples:
>
> With OpenGL and EGL, almost everything is implicit sync.  Say you have
> two OpenGL contexts sharing an image where one writes to it and the
> other textures from it.  The way the OpenGL spec works, the client has
> to make the API calls to render to the image before (in CPU time) it
> makes the API calls which texture from the image.  As long as it does
> this (and maybe inserts a glFlush?), the driver will ensure that the
> rendering completes before the texturing happens and you get correct
> contents.
>
> Implicit synchronization can also happen across processes.  Wayland,
> for instance, is currently built on implicit sync where the client
> does their rendering and then does a hand-off (via wl_surface::commit)
> to tell the compositor it's done at which point the compositor can now
> texture from the surface.  The hand-off ensures that the client's
> OpenGL API calls happen before the server's OpenGL API calls.
>
> A good example of explicit synchronization is the Vulkan API.  There,
> a client (or multiple clients) can simultaneously build command
> buffers in different threads where one of those command buffers
> renders to an image and the other textures from it and then submit
> both of them at the same time with instructions to the driver for
> which order to execute them in.  The execution order is described via
> the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> extension, you can even submit the work which does the texturing
> BEFORE the work which does the rendering and the driver will sort it
> out.
>
> The #1 problem with implicit synchronization (which explicit solves)
> is that it leads to a lot of over-synchronization both in client space
> and in driver/device space.  The client has to synchronize a lot more
> because it has to ensure that the API calls happen in a particular
> order.  The driver/device have to synchronize a lot more because they
> never know what is going to end up being a synchronization point as an
> API call on another thread/process may occur at any time.  As we move
> to more and more multi-threaded programming this synchronization (on
> the client-side especially) becomes more and more painful.
>
>
> ## Current status in Linux
>
> Implicit synchronization in Linux works via a the kernel's internal
> dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> which represents the "done" status for some bit of work.  Typically,
> dma_fences are created as a by-product of someone submitting some bit
> of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> set of dma_fences on it representing shared (read) and exclusive
> (write) access to the object.  When work is submitted which, for
> instance renders to the dma_buf, it's queued waiting on all the fences
> on the dma_buf and and a dma_fence is created representing the end of
> said rendering work and it's installed as the dma_buf's exclusive
> fence.  This way, the kernel can manage all its internal queues (3D
> rendering, display, video encode, etc.) and know which things to
> submit in what order.
>
> For the last few years, we've had sync_file in the kernel and it's
> plumbed into some drivers.  A sync_file is just a wrapper around a
> single dma_fence.  A sync_file is typically created as a by-product of
> submitting work (3D,

Re: [Mesa-dev] [Intel-gfx] gitlab.fd.o financial situation and impact on services

2020-02-29 Thread Marek Olšák
For Mesa, we could run CI only when Marge pushes, so that it's a strictly
pre-merge CI.

Marek

On Sat., Feb. 29, 2020, 17:20 Nicolas Dufresne, 
wrote:

> Le samedi 29 février 2020 à 15:54 -0600, Jason Ekstrand a écrit :
> > On Sat, Feb 29, 2020 at 3:47 PM Timur Kristóf 
> wrote:
> > > On Sat, 2020-02-29 at 14:46 -0500, Nicolas Dufresne wrote:
> > > > > 1. I think we should completely disable running the CI on MRs which
> > > > > are
> > > > > marked WIP. Speaking from personal experience, I usually make a lot
> > > > > of
> > > > > changes to my MRs before they are merged, so it is a waste of CI
> > > > > resources.
> > > >
> > > > In the mean time, you can help by taking the habit to use:
> > > >
> > > >   git push -o ci.skip
> > >
> > > Thanks for the advice, I wasn't aware such an option exists. Does this
> > > also work on the mesa gitlab or is this a GStreamer only thing?
> >
> > Mesa is already set up so that it only runs on MRs and branches named
> > ci-* (or maybe it's ci/*; I can't remember).
> >
> > > How hard would it be to make this the default?
> >
> > I strongly suggest looking at how Mesa does it and doing that in
> > GStreamer if you can.  It seems to work pretty well in Mesa.
>
> You are right, they added CI_MERGE_REQUEST_SOURCE_BRANCH_NAME in 11.6
> (we started our CI a while ago). But there is something even better now,
> you can do:
>
>   only:
> refs:
>   - merge_requests
>
> Thanks for the hint, I'll suggest that. I've looked at some of Mesa's CI
> backend; I think it's really nice, though there are a lot of concepts
> that won't work in a multi-repo CI. Again, I need to refresh my memory on
> what was moved from the enterprise to the community version in this regard.
>
> >
> > --Jason
> >
> >
> > > > That's a much more difficult goal than it looks like. Let each
> > > > project
> > > > manage its CI graph and content, as each case is unique. Running
> > > > more
> > > > tests, or building more code, isn't the main issue, as the CPU time is
> > > > mostly sponsored. The data transfers between the GitLab cloud and
> > > > the runners (which are external), along with sending OS images to LAVA
> > > > labs, are likely the most expensive part.
> > > >
> > > > As it was already mention in the thread, what we are missing now, and
> > > > being worked on, is per group/project statistics that give us the
> > > > hotspot so we can better target the optimization work.
> > >
> > > Yes, would be nice to know what the hotspot is, indeed.
> > >
> > > As far as I understand, the problem is not CI itself, but the bandwidth
> > > needed by the build artifacts, right? Would it be possible to not host
> > > the build artifacts on GitLab, but rather only at the place where the
> > > build actually happened? Or at least, only transfer the build artifacts
> > > on-demand?
> > >
> > > I'm not exactly familiar with how the system works, so sorry if this is
> > > a silly question.
> > >


Re: [RFC] drm: Add AMD GFX9+ format modifiers.

2019-10-17 Thread Marek Olšák
On Wed, Oct 16, 2019 at 9:48 AM Bas Nieuwenhuizen 
wrote:

> This adds initial format modifiers for AMD GFX9 and newer GPUs.
>
> This is particularly useful to determine if we can use DCC, and whether
> we need an extra display compatible DCC metadata plane.
>
> Design decisions:
>   - Always expose a single plane
>This way everything works correctly with images with multiple
> planes.
>
>   - Do not add an extra memory region in DCC for putting a bit on whether
> we are in compressed state.
>A decompress on import is cheap enough if already decompressed, and
>I do think in most cases we can avoid it in advance during modifier
>negotiation. The remainder is probably not common enough to worry
>about.
>
>   - Explicitly define the sizes as part of the modifier description instead
> of using whatever the current version of radeonsi does.
>This way we can avoid dedicated buffers and we can make sure we keep
>compatibility across mesa versions. I'd like to put some tests on
>this on ac_surface.c so we can learn early in the process if things
>need to be changed. Furthermore, the lack of configurable strides on
>GFX10 means things already go wrong if we do not agree, making a
>custom stride somewhat less useful.
>

The custom stride will be back for 2D images (not for 3D/Array), so
Navi10-14 will be the only hw not supporting the custom stride for 2D. It
might not be worth adding the width and height into the modifier just
because of Navi10-14, though I don't feel strongly about it.

This patch doesn't add the sizes into the description anyway.

The rest looks good.

Marek


>
>   - No usage of BO metadata at all for modifier usecases.
>To avoid the requirement of dedicated dma bufs per image. For
>non-modifier based interop we still use the BO metadata, since we
>need to keep compatibility with old mesa and this is used for
>depth/msaa/3d/CL etc. API interop.
>
>   - A single FD for all planes.
>Easier in Vulkan / bindless and radeonsi is already transitioning.
>
>   - Make a single modifier for DCN1
>   It defines things uniquely given bpp, which we can assume, so adding
>   more modifier values do not add clarity.
>
>   - Not exposing the 4K and 256B tiling modes.
>   These are largely only better for something like a cursor or very long
>   and/or tall images. Are they worth the added complexity to save memory?
>   For context, at 32bpp, tiles are 128x128 pixels.
>
>   - For multiplane images, every plane uses the same tiling.
>   On GFX9/GFX10 we can, so no need to make it complicated.
>
>   - We use family_id + external_rev to distinguish between incompatible GPUs.
>   PCI ID is not enough, as RAVEN and RAVEN2 have the same PCI device id,
>   but different tiling. We might be able to find bigger equivalence
>   groups for _X, but especially for DCC I would be uncomfortable making it
>   shared between GPUs.
>
>   - For DCN1 DCC, radeonsi currently uses another texelbuffer with
>   indices to reorder. This is not shared.
>   Specific to the current implementation and does not need to be shared.
>   To pave the way to a shader-based solution, let's keep this internal to
>   each driver. This should reduce the modifier churn if any of the driver
>   implementations change. (Especially as you'd want to support the old
>   implementation for a while to stay compatible with old kernels not
>   supporting a new modifier yet.)
>
>   - No support for rotated swizzling.
>   Can be added easily later and nothing in the stack would generate it
>   currently.
>
>   - Add extra enum values in the definitions.
>   This way we can easily switch on modifier without having to pass around
>   the current GPU everywhere, assuming the modifier has been validated.
> ---
>
>  Since my previous attempt for modifiers got bogged down on details for
>  the GFX6-GFX8 modifiers in previous discussions, this only attempts to
>  define modifiers for GFX9+, which is significantly simpler.
>
>  For a final version I'd like to wait until I have written most of the
>  userspace + kernelspace so we can actually test it. However, I'd
>  appreciate any early feedback people are willing to give.
>
>  Initial Mesa amd/common support + tests are available at
>  https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/tree/modifiers
>
>  I tested the HW to actually behave as described in the descriptions
>  on Raven and plan to test on a subset of the others.
>
>  include/uapi/drm/drm_fourcc.h | 118 ++
>  1 file changed, 118 insertions(+)
>
> diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
> index 3feeaa3f987a..9bd286ab2bee 100644
> --- a/include/uapi/drm/drm_fourcc.h
> +++ b/include/uapi/drm/drm_fourcc.h
> @@ -756,6 +756,124 @@ extern "C" {
>   */
>  #define DRM_FORMAT_MOD
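
The quoted patch is truncated by the archive at this point. For orientation,
the scheme proposed here packs everything needed to interpret a buffer into
the 64-bit modifier value itself. Below is a minimal sketch of that idea in
C; the macro names, field widths, and vendor tag are invented for
illustration and are not the layout from drm_fourcc.h:

#include <stdint.h>

/* Invented example layout: a vendor tag plus the swizzle mode, a
 * DCC-present bit, and a chip identifier (family_id + external_rev,
 * as the cover letter describes). */
#define EX_AMD_MOD_VENDOR       ((uint64_t)0x02 << 56)        /* assumed tag */
#define EX_AMD_MOD_SWIZZLE(x)   ((uint64_t)(x) & 0x1f)        /* bits 0-4   */
#define EX_AMD_MOD_DCC(x)       (((uint64_t)(x) & 0x1) << 5)  /* bit 5      */
#define EX_AMD_MOD_CHIP(x)      (((uint64_t)(x) & 0xff) << 6) /* bits 6-13  */

static inline uint64_t ex_amd_modifier(unsigned swizzle, unsigned dcc,
                                       unsigned chip)
{
        return EX_AMD_MOD_VENDOR | EX_AMD_MOD_SWIZZLE(swizzle) |
               EX_AMD_MOD_DCC(dcc) | EX_AMD_MOD_CHIP(chip);
}

Because the chip identifier is part of the value, an importer can reject a
modifier minted for different tiling hardware outright, which is exactly the
property the RAVEN/RAVEN2 point above is after.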

[ANNOUNCE] libdrm 2.4.100

2019-10-16 Thread Marek Olšák
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


Anusha Srivatsa (1):
  intel: sync i915_pciids.h with kernel

Emil Velikov (1):
  *-symbols-check: use normal shell over bash

Eric Engestrom (7):
  xf86drm: dedupe `#define`s
  xf86drm: use max size of drm node name instead of arbitrary size
  xf86drm: dedupe drmGetDeviceName() logic
  meson: fix sys/mkdev.h detection on Solaris
  *-symbols-check: let meson figure out how to execute the scripts
  RELEASING: update instructions to use meson instead of autotools
  libdrm: remove autotools support

Flora Cui (3):
  tests/amdgpu: fix for dispatch/draw test
  tests/amdgpu: add gpu reset test
  tests/amdgpu: disable reset test for now

Guchun Chen (7):
  amdgpu: add gfx ras inject configuration file
  tests/amdgpu/ras: refine ras inject test
  amdgpu: add umc ras inject test configuration
  amdgpu: remove json package dependence
  amdgpu: delete test configuration file
  amdgpu: add ras inject unit test
  amdgpu: add ras feature capability check in inject test

Ilia Mirkin (1):
  tests/util: fix incorrect memset argument order

Jonathan Gray (2):
  xf86drm: test for render nodes before primary nodes
  xf86drm: open correct render node on non-linux

Le Ma (2):
  tests/amdgpu: divide dispatch test into compute and gfx
  tests/amdgpu: add the missing deactivation case for dispatch test

Lucas De Marchi (1):
  intel: sync i915_pciids.h with kernel

Marek Olšák (5):
  include: update amdgpu_drm.h
  amdgpu: add amdgpu_cs_query_reset_state2 for AMDGPU_CTX_OP_QUERY_STATE2
  Bump the version to 2.4.100
  Revert "libdrm: remove autotools support"
  Bump the version to 2.4.100 for autotools

Niclas Zeising (2):
  meson.build: Fix typo
  meson.build: Fix header detection on FreeBSD

Nirmoy Das (1):
  test/amdgpu: don't free unused bo handle

Rodrigo Vivi (2):
  intel: add the TGL 12 PCI IDs and macros
  intel: Add support for EHL

git tag: libdrm-2.4.100

https://dri.freedesktop.org/libdrm/libdrm-2.4.100.tar.bz2
MD5:  f47bc87e28198ba527e6b44ffdd62f65  libdrm-2.4.100.tar.bz2
SHA1: 9f526909aba08b5658cfba3f7fde9385cad6f3b5  libdrm-2.4.100.tar.bz2
SHA256: c77cc828186c9ceec3e56ae202b43ee99eb932b4a87255038a80e8a1060d0a5d  
libdrm-2.4.100.tar.bz2
SHA512: 
4d3a5556e650872944af52f49de395e0ce8ac9ac58530e39a34413e94dc56c231ee71b8b8de9fb944263515a922b3ebbf7ddfebeaaa91543c2604f9bcf561247
  libdrm-2.4.100.tar.bz2
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.100.tar.bz2.sig

https://dri.freedesktop.org/libdrm/libdrm-2.4.100.tar.gz
MD5:  c47b1718734cc661734ed63f94bc27c1  libdrm-2.4.100.tar.gz
SHA1: 2097f0b98deaff16b8f3b93cedcb5cd35291a3c1  libdrm-2.4.100.tar.gz
SHA256: 6a5337c054c0c47bc16607a21efa2b622e08030be4101ef4a241c5eb05b6619b  
libdrm-2.4.100.tar.gz
SHA512: 
b61835473c77691c4a8e67b32b9df420661e8bf8700507334b58bde5e6a402dee4aea2bec1e5b83343dd28fcb6cf9fd084064d437332f178df81c4780552595b
  libdrm-2.4.100.tar.gz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.100.tar.gz.sig

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEEzUfFNBo3XzO+97r6/dFdWs7w8rEFAl2njhEACgkQ/dFdWs7w
8rHl9Af+ODvdiUlbe20uOd8vBDVFgIR5Z4J8aFr/bUJ7ZxXSAytWglfY6Th9U89H
sN6UyXes9tr3OhAotBgZ2LYh1BDM18XSBIdteqQz9uiaZfw+L8OnZq3eikJ8Axlw
bCbJZtWa17KJQjFR7Cv3WozsaopKJm7A6cXHmuhk1lq9ukUBDPCQIqaqf9K+zTBQ
sIV8wliAmEK3s9tRhT3vsk12DmDfO0kUN504vOXdOjhAClDN03M0w4RXqDENG7gz
qm7eHgE7ugqkCxBPGdNfTSIoamFVOaNR/PCqzt5/6VMaW9oFKSTTgyglrVUjulE6
AZRDHPdQ9D6P84WpPtAWOug4EJhVuQ==
=NkL8
-----END PGP SIGNATURE-----

Re: [PATCH] drm: add drm device name

2019-09-18 Thread Marek Olšák
Let's drop this patch. Mesa will use family_id.

Marek

On Wed, Sep 18, 2019 at 4:10 PM Marek Olšák  wrote:

> On Wed, Sep 18, 2019 at 10:03 AM Michel Dänzer  wrote:
>
>> On 2019-09-18 1:41 a.m., Marek Olšák wrote:
>> > drmVersion::name = amdgpu, radeon, intel, etc.
>> > drmVersion::desc = vega10, vega12, vega20, ...
>> >
>> > The common Mesa code will use name and desc to select the driver.
>>
>> Like the Xorg modesetting driver, that code doesn't need this kernel
>> functionality or new PCI IDs. It can just select the current driver for
>> all devices which aren't supported by older drivers (which is a fixed
>> set at this point).
>>
>>
>> > The AMD-specific Mesa code will use desc to identify the chip.
>>
>> Doesn't libdrm_amdgpu's struct amdgpu_gpu_info::family_id provide the
>> same information?
>>
>
> Not for the common code, though I guess common Mesa code could use the
> INFO ioctl. Is that what you mean?
>
> Marek
>

Re: [PATCH] drm: add drm device name

2019-09-18 Thread Marek Olšák
On Wed, Sep 18, 2019 at 10:03 AM Michel Dänzer  wrote:

> On 2019-09-18 1:41 a.m., Marek Olšák wrote:
> > drmVersion::name = amdgpu, radeon, intel, etc.
> > drmVersion::desc = vega10, vega12, vega20, ...
> >
> > The common Mesa code will use name and desc to select the driver.
>
> Like the Xorg modesetting driver, that code doesn't need this kernel
> functionality or new PCI IDs. It can just select the current driver for
> all devices which aren't supported by older drivers (which is a fixed
> set at this point).
>
>
> > The AMD-specific Mesa code will use desc to identify the chip.
>
> Doesn't libdrm_amdgpu's struct amdgpu_gpu_info::family_id provide the
> same information?
>

Not for the common code, though I guess common Mesa code could use the INFO
ioctl. Is that what you mean?

Marek

Re: [PATCH] drm: add drm device name

2019-09-17 Thread Marek Olšák
drmVersion::name = amdgpu, radeon, intel, etc.
drmVersion::desc = vega10, vega12, vega20, ...

The common Mesa code will use name and desc to select the driver.

The AMD-specific Mesa code will use desc to identify the chip.

Mesa won't receive any PCI IDs for future chips.

Marek


On Tue, Sep 17, 2019 at 10:33 AM Michel Dänzer  wrote:

> On 2019-09-17 1:20 p.m., Christian König wrote:
> > Am 17.09.19 um 11:27 schrieb Michel Dänzer:
> >> On 2019-09-17 11:23 a.m., Michel Dänzer wrote:
> >>> On 2019-09-17 10:23 a.m., Koenig, Christian wrote:
> >>>> Am 17.09.19 um 10:17 schrieb Daniel Vetter:
> >>>>> On Tue, Sep 17, 2019 at 10:12 AM Christian König
> >>>>>  wrote:
> >>>>>> Am 17.09.19 um 07:47 schrieb Jani Nikula:
> >>>>>>> On Mon, 16 Sep 2019, Marek Olšák  wrote:
> >>>>>>>> The purpose is to get rid of all PCI ID tables for all drivers in
> >>>>>>>> userspace. (or at least stop updating them)
> >>>>>>>>
> >>>>>>>> Mesa common code and modesetting will use this.
> >>>>>>> I'd think this would warrant a high level description of what you
> >>>>>>> want
> >>>>>>> to achieve in the commit message.
> >>>>>> And maybe explicitly call it uapi_name or even uapi_driver_name.
> >>>>> If it's uapi_name, then why do we need a new one for every
> generation?
> >>>>> Userspace drivers tend to span a lot more than just 1 generation. And
> >>>>> if you want to have per-generation data from the kernel to userspace,
> >>>>> then imo that's much better suited in some amdgpu ioctl, instead of
> >>>>> trying to encode that into the driver name.
> >>>> Well we already have an IOCTL for that, but I thought the intention
> >>>> here
> >>>> was to get rid of the PCI-ID tables in userspace to figure out which
> >>>> driver to load.
> >>> That's just unrealistic in general, I'm afraid. See e.g. the ongoing
> >>> transition from i965 to iris for recent Intel hardware. How is the
> >>> kernel supposed to know which driver is to be used?
> >
> > Well how is userspace currently handling that? The kernel should NOT say
> > which driver to use in userspace, but rather which one is used in the
> > kernel.
>
> Would that really help though? E.g. the radeon kernel driver supports
> radeon/r200/r300/r600/radeonsi DRI drivers, the i915 one i915/i965/iris
> (and the amdgpu one radeonsi/amdgpu).
>
> The HW generation identifier proposed in these patches might be useful,
> but I suspect there'll always be cases where userspace needs to know
> more precisely.
>
>
> > Mapping that information to an userspace driver still needs to be done
> > somewhere else, but the difference is that you don't need to add all
> > PCI-IDs twice.
>
> It should only really be necessary in Mesa.
>
>
> On 2019-09-17 1:32 p.m., Daniel Vetter wrote:
> > How are other compositors solving this? I don't expect they have a
> > pciid table like modesetting copied to all of them ...
>
> They don't need any of this. The Xorg modesetting driver only did for
> determining the client driver name to advertise via the DRI2 extension.
>
>
> --
> Earthling Michel Dänzer   |   https://redhat.com
> Libre software enthusiast | Mesa and X developer
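
For reference, the two strings being debated are the ones userspace already
reads through the existing drmGetVersion() call; a minimal sketch of the
consumer side (error handling trimmed):

#include <stdio.h>
#include <xf86drm.h>

/* Print the kernel driver name plus the desc string this patch proposes
 * to repurpose (e.g. "amdgpu" / "vega10"). fd must be an already-open
 * DRM device file descriptor. */
static void print_drm_ids(int fd)
{
        drmVersionPtr v = drmGetVersion(fd);

        if (!v)
                return;
        printf("name: %s, desc: %s\n", v->name, v->desc);
        drmFreeVersion(v);
}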

Re: [PATCH] drm: add drm device name

2019-09-16 Thread Marek Olšák
The purpose is to get rid of all PCI ID tables for all drivers in
userspace. (or at least stop updating them)

Mesa common code and modesetting will use this.

Marek

On Sat, Sep 7, 2019 at 3:48 PM Daniel Vetter  wrote:

> On Sat, Sep 7, 2019 at 3:18 AM Rob Clark  wrote:
> >
> > On Fri, Sep 6, 2019 at 3:16 PM Marek Olšák  wrote:
> > >
> > > + dri-devel
> > >
> > > On Tue, Sep 3, 2019 at 5:41 PM Jiang, Sonny 
> wrote:
> > >>
> > >> Add DRM device name and use DRM_IOCTL_VERSION ioctl drmVersion::desc
> passing it to user space
> > >> instead of unused DRM driver name descriptor.
> > >>
> > >> Change-Id: I809f6d3e057111417efbe8fa7cab8f0113ba4b21
> > >> Signed-off-by: Sonny Jiang 
> > >> ---
> > >>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
> > >>  drivers/gpu/drm/drm_drv.c  | 17 +
> > >>  drivers/gpu/drm/drm_ioctl.c|  2 +-
> > >>  include/drm/drm_device.h   |  3 +++
> > >>  include/drm/drm_drv.h  |  1 +
> > >>  5 files changed, 24 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >> index 67b09cb2a9e2..8f0971cea363 100644
> > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >> @@ -2809,6 +2809,8 @@ int amdgpu_device_init(struct amdgpu_device
> *adev,
> > >> /* init the mode config */
> > >> drm_mode_config_init(adev->ddev);
> > >>
> > >> +   drm_dev_set_name(adev->ddev,
> amdgpu_asic_name[adev->asic_type]);
> > >> +
> > >> r = amdgpu_device_ip_init(adev);
> > >> if (r) {
> > >> /* failed in exclusive mode due to timeout */
> > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > >> index 862621494a93..6c33879bb538 100644
> > >> --- a/drivers/gpu/drm/drm_drv.c
> > >> +++ b/drivers/gpu/drm/drm_drv.c
> > >> @@ -802,6 +802,7 @@ void drm_dev_fini(struct drm_device *dev)
> > >> mutex_destroy(&dev->struct_mutex);
> > >> drm_legacy_destroy_members(dev);
> > >> kfree(dev->unique);
> > >> +   kfree(dev->name);
> > >>  }
> > >>  EXPORT_SYMBOL(drm_dev_fini);
> > >>
> > >> @@ -1078,6 +1079,22 @@ int drm_dev_set_unique(struct drm_device *dev,
> const char *name)
> > >>  }
> > >>  EXPORT_SYMBOL(drm_dev_set_unique);
> > >>
> > >> +/**
> > >> + * drm_dev_set_name - Set the name of a DRM device
> > >> + * @dev: device of which to set the name
> > >> + * @name: name to be set
> > >> + *
> > >> + * Return: 0 on success or a negative error code on failure.
> > >> + */
> > >> +int drm_dev_set_name(struct drm_device *dev, const char *name)
> > >> +{
> > >> +   kfree(dev->name);
> > >> +   dev->name = kstrdup(name, GFP_KERNEL);
> > >> +
> > >> +   return dev->name ? 0 : -ENOMEM;
> > >> +}
> > >> +EXPORT_SYMBOL(drm_dev_set_name);
> > >> +
> > >>  /*
> > >>   * DRM Core
> > >>   * The DRM core module initializes all global DRM objects and makes
> them
> > >> diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> > >> index 2263e3ddd822..61f02965106b 100644
> > >> --- a/drivers/gpu/drm/drm_ioctl.c
> > >> +++ b/drivers/gpu/drm/drm_ioctl.c
> > >> @@ -506,7 +506,7 @@ int drm_version(struct drm_device *dev, void
> *data,
> > >> dev->driver->date);
> > >> if (!err)
> > >> err = drm_copy_field(version->desc,
> &version->desc_len,
> > >> -   dev->driver->desc);
> > >> +   dev->name);
> >
> > I suspect this needs to be something like dev->name ? dev->name :
> > dev->driver->desc
> >
> > Or somewhere something needs to arrange for dev->name to default to
> > dev->driver->desc
> >
> > And maybe this should be dev->desc instead of dev->name.. that at
> > least seems less c
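
Spelled out, the fallback suggested above would look roughly like this
inside drm_version() (a sketch of the suggestion, not a tested patch):

        /* Report the per-device name when a driver has set one; otherwise
         * keep today's behaviour and fall back to the driver-wide desc. */
        err = drm_copy_field(version->desc, &version->desc_len,
                             dev->name ? dev->name : dev->driver->desc);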

Re: [PATCH] drm: add drm device name

2019-09-06 Thread Marek Olšák
+ dri-devel

On Tue, Sep 3, 2019 at 5:41 PM Jiang, Sonny  wrote:

> Add DRM device name and use DRM_IOCTL_VERSION ioctl drmVersion::desc
> passing it to user space
> instead of unused DRM driver name descriptor.
>
> Change-Id: I809f6d3e057111417efbe8fa7cab8f0113ba4b21
> Signed-off-by: Sonny Jiang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
>  drivers/gpu/drm/drm_drv.c  | 17 +
>  drivers/gpu/drm/drm_ioctl.c|  2 +-
>  include/drm/drm_device.h   |  3 +++
>  include/drm/drm_drv.h  |  1 +
>  5 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 67b09cb2a9e2..8f0971cea363 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2809,6 +2809,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> /* init the mode config */
> drm_mode_config_init(adev->ddev);
>
> +   drm_dev_set_name(adev->ddev, amdgpu_asic_name[adev->asic_type]);
> +
> r = amdgpu_device_ip_init(adev);
> if (r) {
> /* failed in exclusive mode due to timeout */
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 862621494a93..6c33879bb538 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -802,6 +802,7 @@ void drm_dev_fini(struct drm_device *dev)
> mutex_destroy(&dev->struct_mutex);
> drm_legacy_destroy_members(dev);
> kfree(dev->unique);
> +   kfree(dev->name);
>  }
>  EXPORT_SYMBOL(drm_dev_fini);
>
> @@ -1078,6 +1079,22 @@ int drm_dev_set_unique(struct drm_device *dev,
> const char *name)
>  }
>  EXPORT_SYMBOL(drm_dev_set_unique);
>
> +/**
> + * drm_dev_set_name - Set the name of a DRM device
> + * @dev: device of which to set the name
> + * @name: name to be set
> + *
> + * Return: 0 on success or a negative error code on failure.
> + */
> +int drm_dev_set_name(struct drm_device *dev, const char *name)
> +{
> +   kfree(dev->name);
> +   dev->name = kstrdup(name, GFP_KERNEL);
> +
> +   return dev->name ? 0 : -ENOMEM;
> +}
> +EXPORT_SYMBOL(drm_dev_set_name);
> +
>  /*
>   * DRM Core
>   * The DRM core module initializes all global DRM objects and makes them
> diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
> index 2263e3ddd822..61f02965106b 100644
> --- a/drivers/gpu/drm/drm_ioctl.c
> +++ b/drivers/gpu/drm/drm_ioctl.c
> @@ -506,7 +506,7 @@ int drm_version(struct drm_device *dev, void *data,
> dev->driver->date);
> if (!err)
> err = drm_copy_field(version->desc, &version->desc_len,
> -   dev->driver->desc);
> +   dev->name);
>
> return err;
>  }
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index 7f9ef709b2b6..e29912c484e4 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -123,6 +123,9 @@ struct drm_device {
> /** @unique: Unique name of the device */
> char *unique;
>
> +   /** @name: device name */
> +   char *name;
> +
> /**
>  * @struct_mutex:
>  *
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 68ca736c548d..f742e2bde467 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -798,6 +798,7 @@ static inline bool drm_drv_uses_atomic_modeset(struct
> drm_device *dev)
>
>
>  int drm_dev_set_unique(struct drm_device *dev, const char *name);
> +int drm_dev_set_name(struct drm_device *dev, const char *name);
>
>
>  #endif
> --
> 2.17.1
>

Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?

2019-07-05 Thread Marek Olšák
On Fri, Jul 5, 2019 at 5:27 AM Timur Kristóf 
wrote:

> On Wed, 2019-07-03 at 14:44 -0400, Marek Olšák wrote:
> > You can run:
> > AMD_DEBUG=testdmaperf glxgears
> >
> > It tests transfer sizes of up to 128 MB, and it tests ~60 slightly
> > different methods of transferring data.
> >
> > Marek
>
>
> Thanks Marek, I didn't know about that option.
> Tried it, here is the output: https://pastebin.com/raw/9SAAbbAA
>
> I'm not quite sure how to interpret the numbers, they are inconsistent
> with the results from both pcie_bw and amdgpu.benchmark, for example
> GTT->VRAM at 128 KB is around 1400 MB/s (I assume that is megabytes /
> sec, right?).
>

Based on the SDMA results, you have 2.4 GB/s. For 128KB, it's 2.2 GB/s for
GTT->VRAM copies.

Marek

Re: [PATCH 1/2] mesa: Fix clang build error w/ util_blitter_get_color_format_for_zs()

2019-07-03 Thread Marek Olšák
Thanks for the notice. I already had a fix for this, but forgot to push
it. It's pushed now.

Marek

On Wed, Jul 3, 2019 at 7:10 PM John Stultz  wrote:

> Building with clang, I'm seeing
>  u_blitter.h:627:1: error: control may reach end of non-void function
> [-Werror,-Wreturn-type]
>
> The util_blitter_get_color_format_for_zs() asserts for any
> unhandled types, so we do not expect to reach the end of the
> function here.
>
> But provide a dummy return with an explicit assert above
> to ensure we don't hit it with any future changes to the logic.
>
> Cc: Rob Clark 
> Cc: Emil Velikov 
> Cc: Amit Pundir 
> Cc: Sumit Semwal 
> Cc: Alistair Strachan 
> Cc: Greg Hartman 
> Cc: Tapani Pälli 
> Cc: Marek Olšák 
> Signed-off-by: John Stultz 
> ---
>  src/gallium/auxiliary/util/u_blitter.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/src/gallium/auxiliary/util/u_blitter.h
> b/src/gallium/auxiliary/util/u_blitter.h
> index 9e3fa55e648..7d6c3db64da 100644
> --- a/src/gallium/auxiliary/util/u_blitter.h
> +++ b/src/gallium/auxiliary/util/u_blitter.h
> @@ -624,6 +624,9 @@ util_blitter_get_color_format_for_zs(enum pipe_format
> format)
> default:
>assert(0);
> }
> +   assert(0);
> +   /*XXX NEVER GET HERE*/
> +   return PIPE_FORMAT_R32G32_UINT;
>  }
>
>  #ifdef __cplusplus
> --
> 2.17.1
>

Re: Why is Thunderbolt 3 limited to 2.5 GT/s on Linux?

2019-07-03 Thread Marek Olšák
You can run:
AMD_DEBUG=testdmaperf glxgears

It tests transfer sizes of up to 128 MB, and it tests ~60 slightly
different methods of transferring data.

Marek

On Wed, Jul 3, 2019 at 4:07 AM Michel Dänzer  wrote:

> On 2019-07-02 11:49 a.m., Timur Kristóf wrote:
> > On Tue, 2019-07-02 at 10:09 +0200, Michel Dänzer wrote:
> >> On 2019-07-01 6:01 p.m., Timur Kristóf wrote:
> >>> On Mon, 2019-07-01 at 16:54 +0200, Michel Dänzer wrote:
>  On 2019-06-28 2:21 p.m., Timur Kristóf wrote:
> > I haven't found a good way to measure the maximum PCIe
> > throughput
> > between the CPU and GPU,
> 
>  amdgpu.benchmark=3
> 
>  on the kernel command line will measure throughput for various
>  transfer
>  sizes during driver initialization.
> >>>
> >>> Thanks, I will definitely try that.
> >>> Is this the only way to do this, or is there a way to benchmark it
> >>> after it already booted?
> >>
> >> The former. At least in theory, it's possible to unload the amdgpu
> >> module while nothing is using it, then load it again.
> >
> > Okay, so I booted my system with amdgpu.benchmark=3
> > You can find the full dmesg log here: https://pastebin.com/zN9FYGw4
> >
> > The result is between 1-5 Gbit / sec depending on the transfer size
> > (the higher the better), which corresponds to neither the 8 Gbit / sec
> > that the kernel thinks it is limited to, nor the 20 Gbit / sec which I
> > measured earlier with pcie_bw.
>
> 5 Gbit/s throughput could be consistent with 8 Gbit/s theoretical
> bandwidth, due to various overhead.
>
>
> > Since pcie_bw only shows the maximum PCIe packet size (and not the
> > actual size), could it be that it's so inaccurate that the 20 Gbit /
> > sec is a fluke?
>
> Seems likely or at least plausible.
>
>
> > but I did take a look at AMD's sysfs interface at
> > /sys/class/drm/card1/device/pcie_bw which while running the
> > bottlenecked
> > game. The highest throughput I saw there was only 2.43 Gbit
> > /sec.
> 
>  PCIe bandwidth generally isn't a bottleneck for games, since they
>  don't
>  constantly transfer large data volumes across PCIe, but store
>  them in
>  the GPU's local VRAM, which is connected at much higher
>  bandwidth.
> >>>
> >>> There are reasons why I think the problem is the bandwidth:
> >>> 1. The same issues don't happen when the GPU is not used with a TB3
> >>> enclosure.
> >>> 2. In case of radeonsi, the problem was mitigated once Marek's SDMA
> >>> patch was merged, which hugely reduces the PCIe bandwidth use.
> >>> 3. In less optimized cases (for example D9VK), the problem is still
> >>> very noticable.
> >>
> >> However, since you saw as much as ~20 Gbit/s under different
> >> circumstances, the 2.43 Gbit/s used by this game clearly isn't a hard
> >> limit; there must be other limiting factors.
> >
> > There may be other factors, yes. I can't offer a good explanation on
> > what exactly is happening, but it's pretty clear that amdgpu can't take
> > full advantage of the TB3 link, so it seemed like a good idea to start
> > investigating this first.
>
> Yeah, actually it would be consistent with ~16-32 KB granularity
> transfers based on your measurements above, which is plausible. So
> making sure that the driver doesn't artificially limit the PCIe
> bandwidth might indeed help.
>
> OTOH this also indicates a similar potential for improvement by using
> larger transfers in Mesa and/or the kernel.
>
>
> --
> Earthling Michel Dänzer   |  https://www.amd.com
> Libre software enthusiast | Mesa and X developer
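
Michel's granularity estimate can be sanity-checked with a simple
latency-plus-bandwidth model. A toy calculation follows; the per-transfer
overhead is an assumed figure for illustration, not a measured one:

#include <stdio.h>

/* Toy model: effective_bw = size / (overhead + size / link_bw).
 * Small copies are dominated by the fixed overhead, which is why
 * 16-32 KB granularity can land far below the 8 Gbit/s link rate. */
int main(void)
{
        const double link_bw = 8e9 / 8;  /* 8 Gbit/s link, in bytes/s */
        const double overhead = 20e-6;   /* assumed 20 us per transfer */
        const double sizes[] = { 16e3, 32e3, 128e3, 1e6 };

        for (int i = 0; i < 4; i++) {
                double t = overhead + sizes[i] / link_bw;
                printf("%8.0f bytes -> %5.2f Gbit/s\n",
                       sizes[i], sizes[i] * 8 / t / 1e9);
        }
        return 0;
}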

[ANNOUNCE] libdrm 2.4.99

2019-07-02 Thread Marek Olšák
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


Adrian Salido (1):
  libdrm: reduce number of reallocations in drmModeAtomicAddProperty

Chunming Zhou (9):
  add cs chunk for syncobj timeline
  add timeline wait/query ioctl v2
  wrap syncobj timeline query/wait APIs for amdgpu v3
  add timeline signal/transfer ioctls v2
  expose timeline signal/export/import interfaces v2
  wrap transfer interfaces
  add syncobj timeline tests v3
  update drm.h
  enable syncobj test depending on capability

Hawking Zhang (1):
  libdrm/amdgpu: add new member in drm_amdgpu_device_info for navi10

Hemant Hariyani (1):
  libdrm: omap: Add DRM_RDWR flag to dmabuf export

Huang Rui (1):
  amdgpu: add navi family id

Ilia Mirkin (11):
  util: add C8 format, support it with SMPTE pattern
  util: fix MAKE_RGBA macro for 10bpp modes
  util: add gradient pattern
  util: add fp16 format support
  util: add cairo drawing for 30bpp formats when available
  modetest: don't pretend that atomic mode includes a format
  modetest: add an add_property_optional variant that does not print errors
  modetest: add C8 support to generate SMPTE pattern
  modetest: add the ability to specify fill patterns on the commandline
  modetest: add FP16 format support
  util: fix include path for drm_mode.h

John Stultz (2):
  libdrm: Android.mk: Add minimal Android platform check
  libdrm: amdgpu: Initialize unions with memset rather than "= {0}"

Leo Liu (1):
  tests/amdgpu/vcn: add VCN2.0 decode support

Lucas Stach (1):
  etnaviv: drop etna_bo_from_handle symbol

Marek Olšák (1):
  Bump version to 2.4.99

Marek Vasut (1):
  etnaviv: Fix double-free in etna_bo_cache_free()

Michel Dänzer (6):
  amdgpu: Add amdgpu_cs_syncobj_transfer to amdgpu-symbol-check
  amdgpu: Move union declaration to top of amdgpu_cs_ctx_override_priority
  amdgpu: Update amdgpu_bo_handle_type_kms_noimport documentation
  amdgpu: Pass file descriptor directly to amdgpu_close_kms_handle
  amdgpu: Add BO handle to table in amdgpu_bo_create
  amdgpu: Rename fd_mutex/list to dev_mutex/list

Prabhanjan Kandula (1):
  libdrm: Avoid additional drm open close

Sean Paul (1):
  libdrm: Use mmap64 instead of __mmap2

Seung-Woo Kim (2):
  tests/libkms-test-plane: fix possbile memory leak
  xf86drm: Fix possible memory leak with drmModeGetPropertyPtr()

Tao Zhou (1):
  libdrm/amdgpu: add new vram type (GDDR6) for navi10

git tag: libdrm-2.4.99

https://dri.freedesktop.org/libdrm/libdrm-2.4.99.tar.bz2
MD5:  72539626815b35159a63d45bc4c14ee6  libdrm-2.4.99.tar.bz2
SHA1: e15a3fcc2d321b03d233a245a8593abde7feefd4  libdrm-2.4.99.tar.bz2
SHA256: 4dbf539c7ed25dbb2055090b77ab87508fc46be39a9379d15fed4b5517e1da5e  
libdrm-2.4.99.tar.bz2
SHA512: 
04702eebe8dca97fac61653623804fdcb0b8b3714bdc6f5e72f0dfdce9c9524cf16f69d37aa9feac79ddc1c11939be44a216484563a612414668ea5eaeadf191
  libdrm-2.4.99.tar.bz2
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.99.tar.bz2.sig

https://dri.freedesktop.org/libdrm/libdrm-2.4.99.tar.gz
MD5:  4c6951cfe4094805fe1f1cb39f5dbfc2  libdrm-2.4.99.tar.gz
SHA1: 402b6b1c2db1a6b754a4ecb5775ecc74d02541e8  libdrm-2.4.99.tar.gz
SHA256: 597fb879e2f45193431a0d352d10cd79ef61a24ab31f44320168583e10cb6302  
libdrm-2.4.99.tar.gz
SHA512: 
f0f90b8d115897600d0882a82b35f825558f40189f798dcbbdfd7b5f1b9096b38aa44ebafc9bc439c5e968b0bd56a886b5367706785f980983f79138c0943644
  libdrm-2.4.99.tar.gz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.99.tar.gz.sig

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEEzUfFNBo3XzO+97r6/dFdWs7w8rEFAl0bpQYACgkQ/dFdWs7w
8rE7Zgf/R6BzBoY9gvGMwKeCmgRogH8CiBrNcyLrdGzahu2UcfCvc5j2nmMY7aUR
VN1/JPag5t3VWs/V2Oufd0YroYzj4swC8hkj29XXZU1wg7VJNBc0uDoF22jE9mpN
X8+34YUStTrBWjAHZ/SAVuBh152ppczP7isfAlEm+xZd2PcbV20Efmr8JVWjmpJV
2DeAes38E8uL4T/meeWOEEZVQjcA7CaTJnQzv0qnSWUI7PfjtFzlcubRbRj+BOmb
pjVrvjFFbv0B4gzUsJ8r13thWewDNNJGsMVlv7cVLRrgSkJbxEQ351h/XJVWunTb
tcaCfu0ODS+ADO63lDMllNqswrXe8A==
=TU5/
-----END PGP SIGNATURE-----

Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS

2019-05-14 Thread Marek Olšák
This series fixes the OOM errors. However, if I torture the kernel driver
more, I can get it to deadlock and end up with unkillable processes. I can
also get an OOM error. I just ran the test 5 times:

AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears &
AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears &
AMD_DEBUG=testgdsmm glxgears

Marek

On Tue, May 14, 2019 at 8:31 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> This avoids OOM situations when we have lots of threads
> submitting at the same time.
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index fff558cf385b..f9240a94217b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct
> amdgpu_cs_parser *p,
> }
>
> r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
> -  &duplicates, true);
> +  &duplicates, false);
> if (unlikely(r != 0)) {
> if (r != -ERESTARTSYS)
> DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
> --
> 2.17.1
>
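
For readers skimming the diff: the flipped boolean is the del_lru argument
of ttm_eu_reserve_buffers() in this series, so the one-liner keeps validated
BOs on the LRU during command submission. Annotated, the call reads:

        /* del_lru = false: leave the reserved BOs on the LRU so that
         * concurrent submissions can still evict them, avoiding the OOM
         * situations the commit message describes. */
        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
                                   &duplicates, false);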

Re: [PATCH 4/4] drm/amdgpu: stop removing BOs from the LRU during CS

2019-05-10 Thread Marek Olšák
Hi,

This patch series doesn't help with the OOM errors due to GDS. Reproducible
with:

AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears

Marek


On Fri, May 10, 2019 at 10:13 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> This avoids OOM situations when we have lots of threads
> submitting at the same time.
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index a1d6a0721e53..8828d30cd409 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct
> amdgpu_cs_parser *p,
> }
>
> r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
> -  &duplicates, true);
> +  &duplicates, false);
> if (unlikely(r != 0)) {
> if (r != -ERESTARTSYS)
> DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
> --
> 2.17.1
>

[ANNOUNCE] libdrm 2.4.97

2019-01-22 Thread Marek Olšák
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


Alex Deucher (1):
  amdgpu: update to latest marketing names from 18.50

Andrey Grodzovsky (3):
  amdgpu/test: Add illegal register and memory access test v2
  amdgpu/test: Disable deadlock tests for all non gfx8/9 ASICs.
  amdgpu/test: Enable deadlock test for CI family (gfx7)

Christian König (1):
  amdgpu: add VM test to exercise max/min address space

Daniel Vetter (1):
  doc: Rename README&CONTRIBUTING to .rst

Eric Anholt (2):
  Avoid hardcoded strlens in drmParseSubsystemType().
  drm: Attempt to parse SPI devices as platform bus devices.

Eric Engestrom (6):
  xf86drmHash: remove unused loop variable
  meson: fix typo in compiler flag
  tests: skip drmdevice test if the machine doesn't have any drm device
  freedreno: remove always-defined #ifdef
  xf86atomic: #undef internal define
  README: reflow the project description to improve readability

François Tigeot (2):
  xf86drm: implement drmParseSubsystemType for DragonFly
  libdrm: Use DRM_IOCTL_GET_PCIINFO on DragonFly

Leo Liu (1):
  tests/amdgpu/vcn: fix the nop command in IBs

Lucas De Marchi (2):
  gitignore: sort file
  gitignore: add _build

Marek Olšák (3):
  amdgpu: update amdgpu_drm.h
  amdgpu: add a faster BO list API
  Bump the version to 2.4.97

Mauro Rossi (1):
  android: Fix 32-bit app crashing in 64-bit Android

git tag: libdrm-2.4.97

https://dri.freedesktop.org/libdrm/libdrm-2.4.97.tar.bz2
MD5:  acef22d0c62c89692348c2dd5591393e  libdrm-2.4.97.tar.bz2
SHA1: 7635bec769a17edd140282fa2c46838c4a44bc91  libdrm-2.4.97.tar.bz2
SHA256: 77d0ccda3e10d6593398edb70b1566bfe1a23a39bd3da98ace2147692eadd123  
libdrm-2.4.97.tar.bz2
SHA512: 
3e08ee9d6c9ce265d783a59b51e22449905ea73aa27f25a082a1e9e1532f7c99e1c9f7cb966eb0970be2a08e2e5993dc9aa55093b1bff548689fdb465e7145ed
  libdrm-2.4.97.tar.bz2
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.97.tar.bz2.sig

https://dri.freedesktop.org/libdrm/libdrm-2.4.97.tar.gz
MD5:  a8bb09d6f4ed28191ba6e86e788dc3a4  libdrm-2.4.97.tar.gz
SHA1: af778f72d716589e9eacec9336bafc81b447cc42  libdrm-2.4.97.tar.gz
SHA256: 8c6f4d0934f5e005cc61bc05a917463b0c867403de176499256965f6797092f1  
libdrm-2.4.97.tar.gz
SHA512: 
9a7130ab5534555d7cf5ff95ac761d2cd2fe2c44eb9b63c7ad3f9b912d0f13f1e3ff099487d8e90b08514329c61adb4e73fe25404e7c2f4c26b205c64be8d114
  libdrm-2.4.97.tar.gz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.97.tar.gz.sig

-----BEGIN PGP SIGNATURE-----

iQEyBAEBCgAdFiEEzUfFNBo3XzO+97r6/dFdWs7w8rEFAlxHR3sACgkQ/dFdWs7w
8rEg/Af3d1I0DnABd0j3GUTxUAfHn7/yYkyFunFmqD9tmhdZZ8rl+PzXMocEhtDz
dn+lrG3JHhj4O0istZBe0B8oZIyCuSk+36j5t3XEgR1SfF5YlDhXnlEMaPuJQerr
ZrdXsggmQyv1BjeaLcseHM4wdnbkcClSoHXCNqbKQLPOyS0r0xEj0Ft6QvtDfPxh
rpCMdNjIPSFhBiJqyFuaHw6dWbX1elzSIjtXpdBOYrf7mfF/laE6OX7p+P7LtwC4
PkoeuzdHqt77iGASBoQI28XfVGpfQvBrTzDI4xRGI9IGXyc5oeJuk0uTnVjT3A9I
zHocD5j8r4pQJdfb49RQzyPOaxvF
=Lxi8
-----END PGP SIGNATURE-----


Re: [PATCH libdrm v2 2/2] amdgpu/test: Fix deadlock tests for AI and RV v2

2018-10-03 Thread Marek Olšák
Yes, Andrey has commit rights.

Marek

On Wed, Oct 3, 2018 at 10:34 AM Christian König
 wrote:
>
> Thanks for keeping working on this.
>
> Series is Reviewed-by: Christian König  as well.
>
> Do you now have commit rights?
>
> Christian.
>
> Am 02.10.2018 um 22:47 schrieb Marek Olšák:
> > For the series:
> >
> > Reviewed-by: Marek Olšák 
> >
> > Marek
> > On Fri, Sep 28, 2018 at 10:46 AM Andrey Grodzovsky
> >  wrote:
> >> Seems like AI and RV require uncached memory mapping to be able
> >> to pick up a value written to memory by the CPU after the WAIT_REG_MEM
> >> command was already launched.
> >> Enable the test for AI and RV.
> >>
> >> v2:
> >> Update commit description.
> >>
> >> Signed-off-by: Andrey Grodzovsky 
> >> ---
> >>   tests/amdgpu/deadlock_tests.c | 13 -
> >>   1 file changed, 8 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/tests/amdgpu/deadlock_tests.c b/tests/amdgpu/deadlock_tests.c
> >> index 304482d..292ec4e 100644
> >> --- a/tests/amdgpu/deadlock_tests.c
> >> +++ b/tests/amdgpu/deadlock_tests.c
> >> @@ -80,6 +80,8 @@ static  uint32_t  minor_version;
> >>   static pthread_t stress_thread;
> >>   static uint32_t *ptr;
> >>
> >> +int use_uc_mtype = 0;
> >> +
> >>   static void amdgpu_deadlock_helper(unsigned ip_type);
> >>   static void amdgpu_deadlock_gfx(void);
> >>   static void amdgpu_deadlock_compute(void);
> >> @@ -92,13 +94,14 @@ CU_BOOL suite_deadlock_tests_enable(void)
> >>   &minor_version, 
> >> &device_handle))
> >>  return CU_FALSE;
> >>
> >> -   if (device_handle->info.family_id == AMDGPU_FAMILY_AI ||
> >> -   device_handle->info.family_id == AMDGPU_FAMILY_SI ||
> >> -   device_handle->info.family_id == AMDGPU_FAMILY_RV) {
> >> +   if (device_handle->info.family_id == AMDGPU_FAMILY_SI) {
> >>  printf("\n\nCurrently hangs the CP on this ASIC, deadlock 
> >> suite disabled\n");
> >>  enable = CU_FALSE;
> >>  }
> >>
> >> +   if (device_handle->info.family_id >= AMDGPU_FAMILY_AI)
> >> +   use_uc_mtype = 1;
> >> +
> >>  if (amdgpu_device_deinitialize(device_handle))
> >>  return CU_FALSE;
> >>
> >> @@ -183,8 +186,8 @@ static void amdgpu_deadlock_helper(unsigned ip_type)
> >>  r = amdgpu_cs_ctx_create(device_handle, &context_handle);
> >>  CU_ASSERT_EQUAL(r, 0);
> >>
> >> -   r = amdgpu_bo_alloc_and_map(device_handle, 4096, 4096,
> >> -   AMDGPU_GEM_DOMAIN_GTT, 0,
> >> +   r = amdgpu_bo_alloc_and_map_raw(device_handle, 4096, 4096,
> >> +   AMDGPU_GEM_DOMAIN_GTT, 0, use_uc_mtype ? 
> >> AMDGPU_VM_MTYPE_UC : 0,
> >>  &ib_result_handle, 
> >> &ib_result_cpu,
> >>  
> >> &ib_result_mc_address, &va_handle);
> >>  CU_ASSERT_EQUAL(r, 0);
> >> --
> >> 2.7.4
> >>


Re: [PATCH libdrm v2 2/2] amdgpu/test: Fix deadlock tests for AI and RV v2

2018-10-02 Thread Marek Olšák
For the series:

Reviewed-by: Marek Olšák 

Marek
On Fri, Sep 28, 2018 at 10:46 AM Andrey Grodzovsky
 wrote:
>
> Seems like AI and RV require uncached memory mapping to be able
> to pick up a value written to memory by the CPU after the WAIT_REG_MEM
> command was already launched.
> Enable the test for AI and RV.
>
> v2:
> Update commit description.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>  tests/amdgpu/deadlock_tests.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/tests/amdgpu/deadlock_tests.c b/tests/amdgpu/deadlock_tests.c
> index 304482d..292ec4e 100644
> --- a/tests/amdgpu/deadlock_tests.c
> +++ b/tests/amdgpu/deadlock_tests.c
> @@ -80,6 +80,8 @@ static  uint32_t  minor_version;
>  static pthread_t stress_thread;
>  static uint32_t *ptr;
>
> +int use_uc_mtype = 0;
> +
>  static void amdgpu_deadlock_helper(unsigned ip_type);
>  static void amdgpu_deadlock_gfx(void);
>  static void amdgpu_deadlock_compute(void);
> @@ -92,13 +94,14 @@ CU_BOOL suite_deadlock_tests_enable(void)
>  &minor_version, &device_handle))
> return CU_FALSE;
>
> -   if (device_handle->info.family_id == AMDGPU_FAMILY_AI ||
> -   device_handle->info.family_id == AMDGPU_FAMILY_SI ||
> -   device_handle->info.family_id == AMDGPU_FAMILY_RV) {
> +   if (device_handle->info.family_id == AMDGPU_FAMILY_SI) {
> printf("\n\nCurrently hangs the CP on this ASIC, deadlock 
> suite disabled\n");
> enable = CU_FALSE;
> }
>
> +   if (device_handle->info.family_id >= AMDGPU_FAMILY_AI)
> +   use_uc_mtype = 1;
> +
> if (amdgpu_device_deinitialize(device_handle))
> return CU_FALSE;
>
> @@ -183,8 +186,8 @@ static void amdgpu_deadlock_helper(unsigned ip_type)
> r = amdgpu_cs_ctx_create(device_handle, &context_handle);
> CU_ASSERT_EQUAL(r, 0);
>
> -   r = amdgpu_bo_alloc_and_map(device_handle, 4096, 4096,
> -   AMDGPU_GEM_DOMAIN_GTT, 0,
> +   r = amdgpu_bo_alloc_and_map_raw(device_handle, 4096, 4096,
> +   AMDGPU_GEM_DOMAIN_GTT, 0, use_uc_mtype ? 
> AMDGPU_VM_MTYPE_UC : 0,
> &ib_result_handle, 
> &ib_result_cpu,
> &ib_result_mc_address, 
> &va_handle);
> CU_ASSERT_EQUAL(r, 0);
> --
> 2.7.4
>
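
The uncached mapping matters because of the shape of the test: a
WAIT_REG_MEM packet is already busy-waiting on a memory word when the CPU
stores the release value. A simplified sketch of that pattern, reusing the
variables from the quoted test; the helper name and poll value are made up:

        volatile uint32_t *poll_word = ib_result_cpu; /* CPU view of the BO */

        *poll_word = 0;                  /* armed: the GPU busy-waits on this */
        submit_wait_reg_mem_ib(context_handle); /* hypothetical helper */
        usleep(1000);
        *poll_word = 0xdeadbeef;         /* release: with a cached (non-UC)
                                          * mapping on AI/RV the GPU may
                                          * never observe this store */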


Re: [PATCH libdrm 1/3] amdgpu: Propogate user flags to amdgpu_bo_va_op_raw

2018-09-27 Thread Marek Olšák
This will break old UMDs that didn't set the flags correctly. Instead,
UMDs should stop using amdgpu_bo_va_op if they want to set the flags.

Marek
On Thu, Sep 27, 2018 at 3:05 PM Andrey Grodzovsky
 wrote:
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>  amdgpu/amdgpu_bo.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/amdgpu/amdgpu_bo.c b/amdgpu/amdgpu_bo.c
> index c0f42e8..1892345 100644
> --- a/amdgpu/amdgpu_bo.c
> +++ b/amdgpu/amdgpu_bo.c
> @@ -736,7 +736,7 @@ drm_public int amdgpu_bo_va_op(amdgpu_bo_handle bo,
>uint64_t offset,
>uint64_t size,
>uint64_t addr,
> -  uint64_t flags,
> +  uint64_t extra_flags,
>uint32_t ops)
>  {
> amdgpu_device_handle dev = bo->dev;
> @@ -746,7 +746,8 @@ drm_public int amdgpu_bo_va_op(amdgpu_bo_handle bo,
> return amdgpu_bo_va_op_raw(dev, bo, offset, size, addr,
>AMDGPU_VM_PAGE_READABLE |
>AMDGPU_VM_PAGE_WRITEABLE |
> -  AMDGPU_VM_PAGE_EXECUTABLE, ops);
> +  AMDGPU_VM_PAGE_EXECUTABLE |
> +  extra_flags, ops);
>  }
>
>  drm_public int amdgpu_bo_va_op_raw(amdgpu_device_handle dev,
> --
> 2.7.4
>
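
Concretely, that means calling the raw variant, whose signature is visible
in the patch context, with explicit per-mapping flags. A usage sketch in the
same setting; the flag combination is only an example:

        /* Map a BO with explicit flags instead of relying on
         * amdgpu_bo_va_op()'s hardcoded RWX defaults. */
        r = amdgpu_bo_va_op_raw(dev, bo, 0 /* offset */, size, gpu_addr,
                                AMDGPU_VM_PAGE_READABLE |
                                AMDGPU_VM_PAGE_WRITEABLE, /* no EXECUTABLE */
                                AMDGPU_VA_OP_MAP);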


Re: [RFC] drm/amdgpu: Add macros and documentation for format modifiers.

2018-09-07 Thread Marek Olšák
On Fri, Sep 7, 2018 at 5:55 AM, Bas Nieuwenhuizen
 wrote:
> On Fri, Sep 7, 2018 at 6:51 AM Marek Olšák  wrote:
>>
>> Hopefully this answers some questions.
>>
>> Other parameters that affect tiling layouts are GB_ADDR_CONFIG (all
>> chips) and MC_ARB_RAMCFG (GFX6-8 only), and those vary with each chip.
>
> For GFX6-GFX8:
> From GB_ADDR_CONFIG addrlib only uses the pipe interleave bytes which
> are 0 (=256 bytes) for all AMDGPU HW (and on GFX9 addrlib even asserts
> on that).  From MC_ARB_RAMCFG addrlib reads the number of banks and
> ranks, calculates the number of logical banks from it, but then does
> not use it. (Presumably because it is the same number as the number of
> banks in the tiling table entry?) Some bits gets used by the kernel
> (memory row size), but those get encoded in the tile split of the
> tiling table, i.e. we do not need the separate bits.
>
> for GFX9, only the DCC meta surface seems to depend on GB_ADDR_CONFIG
> (except the aforementioned pipe interleave bytes) which are constant.

On GFX9, addrlib in Mesa uses most fields from GB_ADDR_CONFIG.
GB_ADDR_CONFIG defines the tiling formats.

On older chips, addrlib reads some fields from GB_ADDR_CONFIG and uses
the chip identification for others like the number of pipes, even
though GB_ADDR_CONFIG has the information too.

Marek


Re: [RFC] drm/amdgpu: Add macros and documentation for format modifiers.

2018-09-06 Thread Marek Olšák
Hopefully this answers some questions.

Other parameters that affect tiling layouts are GB_ADDR_CONFIG (all
chips) and MC_ARB_RAMCFG (GFX6-8 only), and those vary with each chip.

Some 32bpp 1D tiling layouts are compatible across all chips (1D
display tiling is the same as SW_256B_D if Bpp == 4).

On GFX9, swizzle modes <= 11 are the same on all GFX9 chips. The
remaining modes depend on GB_ADDR_CONFIG and are also more efficient.
Bpp, number of samples, and resource type (2D/3D) affect the layout
too, e.g. 3D textures silently use thick tiling on GFX9.

Harvesting doesn't affect tiling layouts.

The layout changes between layers/slices a little. Always use the base
address of the whole image when programming the hardware. Don't assume
that the 2nd layer has the same layout.

> + * TODO: Can scanout really not support fastclear data?

It can, but only those encoded in the DCC buffer (0/1). There is no
DAL support for DCC though.


> + * TODO: Do some generations share DCC format?

DCC mirrors the tiling layout, so the same tiling mode means the same
DCC. Take the absolute pixel address, shift it to the right, and
you'll get the DCC element address.

I would generally consider DCC as non-shareable because of different
meanings of TILING_INDEX between chips except maybe for common GFX9
layouts.
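
A toy version of that address rule, for illustration (the shift amount is an
assumption standing in for the data-to-metadata ratio, not a spec constant):

/* DCC metadata mirrors the tiling layout, so the metadata element for a
 * pixel is its absolute byte address shifted right; e.g. a 256:1
 * data-to-metadata ratio would correspond to shift = 8. */
static inline uint64_t dcc_element_addr(uint64_t pixel_byte_addr,
                                        unsigned shift)
{
        return pixel_byte_addr >> shift;
}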


> [comments about number of bits]

We could certainly represent all formats as a list of enums, but then
we would need to convert the enums to the full description in drivers.
GFX6-8 can use TILING_INDEX (except for stencil, let's ignore
stencil). The tiling tables shouldn't change anymore because they are
optimized for the hardware, and later hw doesn't have any tables.

Marek


[ANNOUNCE] libdrm 2.4.93

2018-07-31 Thread Marek Olšák
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


Christian König (1):
  amdgpu: make sure to set CLOEXEC on duplicated FDs

Emil Velikov (10):
  xf86drm: drmGetDevice2: error out if the fd has unknown subsys
  xf86drm: introduce drm_device_has_rdev() helper
  xf86drm: Fold drmDevice processing into process_device() helper
  xf86drm: Allocate drmDevicePtr's on stack
  xf86drm: introduce a get_real_pci_path() helper
  xf86drm: Add drmDevice support for virtio_gpu
  tests/drmdevices: install alongside other utilities
  tests/drmdevice: add a couple of printf headers
  drmdevice: convert the tabbed output into a tree
  drmdevice: print the correct host1x information

Jan Vesely (3):
  amdgpu: Take a lock before removing devices from fd_tab hash table.
  amdgpu/util_hash_table: Add helper function to count the number of 
entries in hash table
  amdgpu: Destroy fd_hash table when the last device is removed.

José Roberto de Souza (2):
  intel: Introducing Whiskey Lake platform
  intel: Introducing Amber Lake platform

Kevin Strasser (1):
  xf86drm: Be sure to closedir before return

Marek Olšák (3):
  amdgpu: don't call add_handle_to_table for KMS BO exports
  amdgpu: add amdgpu_bo_handle_type_kms_noimport
  configure.ac: bump version to 2.4.93

Mariusz Ceier (1):
  xf86drm: Fix error path in drmGetDevice2

Michel Dänzer (2):
  Always pass O_CLOEXEC when opening DRM file descriptors
  Revert "amdgpu: don't call add_handle_to_table for KMS BO exports"

Rob Clark (5):
  freedreno: add user ptr to fd_ringbuffer
  freedreno: add fd_ringbuffer_new_object()
  freedreno: small cleanup
  freedreno: slight reordering
  freedreno/msm: "stateobj" support

git tag: libdrm-2.4.93

https://dri.freedesktop.org/libdrm/libdrm-2.4.93.tar.bz2
MD5:  0ba45ad1551b2c1b6df0797a3e65f827  libdrm-2.4.93.tar.bz2
SHA1: 550ba4bb50236fc2e9138cbeadcb4942ce09410e  libdrm-2.4.93.tar.bz2
SHA256: 6e84d1dc9548a76f20b59a85cf80a0b230cd8196084f5243469d9e65354fcd3c  
libdrm-2.4.93.tar.bz2
SHA512: 
ba4221e8d6a3a9872fb6d30a0ea391e30ea0e17f249c66f067bed9c2161ed1ad8083959cb2c212834c6566c3e025f4daae31e9533d77aae19b9de6c2ab3d
  libdrm-2.4.93.tar.bz2
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.93.tar.bz2.sig

https://dri.freedesktop.org/libdrm/libdrm-2.4.93.tar.gz
MD5:  2df4729a0e9829d77a7a0a7a8dda2f66  libdrm-2.4.93.tar.gz
SHA1: 3947aeecb6ecc271c657638f97e8d21753cedf6e  libdrm-2.4.93.tar.gz
SHA256: bc67b2503106155c239c4e455b6718ef1b31675ea51f544c785c0e3295712861  
libdrm-2.4.93.tar.gz
SHA512: 
3ca334ee46fe50103e146463d2dab85c7a075559192a85bfe73ca2f80cb8a8847b19775b1a16271b537656e9d7a48f0e209ea7227a3e1ebd9fa3a5caf38047f2
  libdrm-2.4.93.tar.gz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.93.tar.gz.sig

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEEzUfFNBo3XzO+97r6/dFdWs7w8rEFAlthEUoACgkQ/dFdWs7w
8rGVWAf/bV1GNvp0Aakm95UhIHC61CvZk7PxnSADd3SlZC6BE9K5WNK2jumtXQln
o1EmXc1WS2b02jPQVX4+qv/F8gxVlHdKDbi52Rk1RnK0ii7gP2oGf+4q0EUrdwoq
HYE6XHppUFgBVRBXTa0vCqVBo/KYWRgPlSPNlEsxigPzRk00Qt0vWEjQiN8vTmH2
+YReSIvOQxmZ/CSU8/+JaV395S+7nc59HG18xHvjcC6F4AelWBFdAA+P782yTeJ5
nr952bkr7+Z5/n/XWUkzdTr11YSHev5N24JYxopvXgbmDr+Dz/hdIYyy8MGJSKp0
q+2RVWP6WlNlXh2KUvwPM9SZydxkDA==
=DmZ7
-----END PGP SIGNATURE-----


Re: [PATCH libdrm 2/2] amdgpu: add AMDGPU_VA_RANGE_HIGH

2018-02-26 Thread Marek Olšák
For the series:

Reviewed-by: Marek Olšák 

Marek

On Mon, Feb 26, 2018 at 2:16 PM, Christian König
 wrote:
> Return high addresses if requested and available.
>
> Signed-off-by: Christian König 
> ---
>  amdgpu/amdgpu.h  |  1 +
>  amdgpu/amdgpu_device.c   |  6 --
>  amdgpu/amdgpu_internal.h |  1 -
>  amdgpu/amdgpu_vamgr.c| 24 +++-
>  4 files changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/amdgpu/amdgpu.h b/amdgpu/amdgpu.h
> index 928b2a68..36f91058 100644
> --- a/amdgpu/amdgpu.h
> +++ b/amdgpu/amdgpu.h
> @@ -1162,6 +1162,7 @@ int amdgpu_read_mm_registers(amdgpu_device_handle dev, 
> unsigned dword_offset,
>   * Flag to request VA address range in the 32bit address space
>  */
>  #define AMDGPU_VA_RANGE_32_BIT 0x1
> +#define AMDGPU_VA_RANGE_HIGH   0x2
>
>  /**
>   * Allocate virtual address range
> diff --git a/amdgpu/amdgpu_device.c b/amdgpu/amdgpu_device.c
> index ca0c7987..9ff6ad16 100644
> --- a/amdgpu/amdgpu_device.c
> +++ b/amdgpu/amdgpu_device.c
> @@ -268,7 +268,6 @@ int amdgpu_device_initialize(int fd,
> max = MIN2(dev->dev_info.virtual_address_max, 0x1ULL);
> amdgpu_vamgr_init(&dev->vamgr_32, start, max,
>   dev->dev_info.virtual_address_alignment);
> -   dev->address32_hi = start >> 32;
>
> start = max;
> max = MAX2(dev->dev_info.virtual_address_max, 0x1ULL);
> @@ -323,7 +322,10 @@ int amdgpu_query_sw_info(amdgpu_device_handle dev, enum 
> amdgpu_sw_info info,
>
> switch (info) {
> case amdgpu_sw_info_address32_hi:
> -   *val32 = dev->address32_hi;
> +   if (dev->vamgr_high_32.va_max)
> +   *val32 = dev->vamgr_high_32.va_max >> 32;
> +   else
> +   *val32 = dev->vamgr_32.va_max >> 32;
> return 0;
> }
> return -EINVAL;
> diff --git a/amdgpu/amdgpu_internal.h b/amdgpu/amdgpu_internal.h
> index 423880ed..aeb5d651 100644
> --- a/amdgpu/amdgpu_internal.h
> +++ b/amdgpu/amdgpu_internal.h
> @@ -73,7 +73,6 @@ struct amdgpu_device {
> int flink_fd;
> unsigned major_version;
> unsigned minor_version;
> -   uint32_t address32_hi;
>
> char *marketing_name;
> /** List of buffer handles. Protected by bo_table_mutex. */
> diff --git a/amdgpu/amdgpu_vamgr.c b/amdgpu/amdgpu_vamgr.c
> index 58400428..ac1202de 100644
> --- a/amdgpu/amdgpu_vamgr.c
> +++ b/amdgpu/amdgpu_vamgr.c
> @@ -201,10 +201,21 @@ int amdgpu_va_range_alloc(amdgpu_device_handle dev,
>  {
> struct amdgpu_bo_va_mgr *vamgr;
>
> -   if (flags & AMDGPU_VA_RANGE_32_BIT)
> -   vamgr = &dev->vamgr_32;
> -   else
> -   vamgr = &dev->vamgr;
> +   /* Clear the flag when the high VA manager is not initialized */
> +   if (flags & AMDGPU_VA_RANGE_HIGH && !dev->vamgr_high_32.va_max)
> +   flags &= ~AMDGPU_VA_RANGE_HIGH;
> +
> +   if (flags & AMDGPU_VA_RANGE_HIGH) {
> +   if (flags & AMDGPU_VA_RANGE_32_BIT)
> +   vamgr = &dev->vamgr_high_32;
> +   else
> +   vamgr = &dev->vamgr_high;
> +   } else {
> +   if (flags & AMDGPU_VA_RANGE_32_BIT)
> +   vamgr = &dev->vamgr_32;
> +   else
> +   vamgr = &dev->vamgr;
> +   }
>
> va_base_alignment = MAX2(va_base_alignment, vamgr->va_alignment);
> size = ALIGN(size, vamgr->va_alignment);
> @@ -215,7 +226,10 @@ int amdgpu_va_range_alloc(amdgpu_device_handle dev,
> if (!(flags & AMDGPU_VA_RANGE_32_BIT) &&
> (*va_base_allocated == AMDGPU_INVALID_VA_ADDRESS)) {
> /* fallback to 32bit address */
> -   vamgr = &dev->vamgr_32;
> +   if (flags & AMDGPU_VA_RANGE_HIGH)
> +   vamgr = &dev->vamgr_high_32;
> +   else
> +   vamgr = &dev->vamgr_32;
> *va_base_allocated = amdgpu_vamgr_find_va(vamgr, size,
> va_base_alignment, va_base_required);
> }
> --
> 2.14.1
>
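
Callers opt in per allocation. A usage sketch against the existing libdrm
entry point; every argument except the flag is an illustrative value:

        uint64_t va;
        amdgpu_va_handle va_handle;

        /* Request GPU VA from the high range when the kernel exposes one;
         * per the patch, the flag is silently dropped (normal range used)
         * when the high VA manager is not initialized. */
        int r = amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general,
                                      bo_size, 4096 /* alignment */,
                                      0 /* no required base */,
                                      &va, &va_handle,
                                      AMDGPU_VA_RANGE_HIGH);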


[ANNOUNCE] libdrm 2.4.90

2018-02-16 Thread Marek Olšák
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Andrey Grodzovsky (2):
  amdgpu: Update deadlock test to not assert on ECANCELED
  amdgpu: Fix segfault in deadlock test.

Anuj Phogat (1):
  intel: Add more Coffeelake PCI IDs

Bas Nieuwenhuizen (1):
  drm: Fix 32-bit drmSyncobjWait.

Christian König (5):
  amdgpu: fix 32bit VA manager max address
  headers: sync up amdgpu_drm.h with drm-next
  amdgpu: use the high VA range if possible v2
  test/amdgpu: fix compiler warnings
  amdgpu: fix high VA mask

Christoph Haag (1):
  meson: fix the install path of amdgpu.ids

Chunming Zhou (5):
  fix return value for syncobj wait
  amdgpu: fix inefficient vamgr algorithm
  amdgpu: clean up non list code path for vamgr
  tests/amdgpu: add bo eviction test
  amdgpu: clean up non list code path for vamgr v2

Dylan Baker (7):
  Add meson build system
  autotools: Include meson.build files in tarball
  README: Add note about meson
  meson: set proper pkg-config version for libdrm_freedreno
  meson: set the minimum version correctly
  meson: fix libdrm_nouveau pkgconfig include directories
  meson: include headers in root directory in ext_libdrm

Emil Velikov (1):
  tests/amdgpu: add missing config.h include

Eric Engestrom (25):
  remove unnecessary double-semicolon
  tests/amdgpu: add parentheses to make operation priority explicit
  tests/amdgpu: drop unused variables
  tests/util: fix signed/unsigned comparisons
  tests/util: drop unused parameters
  tests/etnaviv: drop unused `return 0`
  meson: add missing HAVE_RADEON
  configure: remove unused HAVE_CUNIT define
  configure: remove unused HAVE_INSTALL_TESTS define
  meson,configure: remove unused HAVE_OMAP define
  meson,configure: remove unused HAVE_TEGRA define
  meson,configure: remove unused HAVE_FREEDRENO define
  meson,configure: remove unused HAVE_ETNAVIV define
  meson,configure: always define 
HAVE_{INTEL,VMWGFX,NOUVEAU,EXYNOS,VC4,RADEON}
  always define HAVE_FREEDRENO_KGSL
  always define HAVE_CAIRO
  always define HAVE_VALGRIND
  meson: sort HAVE_* defines
  xf86atomic: fix -Wundef warning
  meson: cleanup whitespace
  meson,configure: add warning when using undefined preprocessor tokens
  xf86drmHash: remove always-false #if guards
  configure: always define HAVE_LIBDRM_ATOMIC_PRIMITIVES and 
HAVE_LIB_ATOMIC_OPS
  exynos/tests: use #ifdef for never-defined token
  meson,configure: turn undefined preprocessor tokens warnings into errors

Hawking Zhang (3):
  tests/amdgpu: execute write linear on all the available rings
  tests/amdgpu: execute const fill on all the available rings
  tests/amdgpu: execute copy linear on all the available rings

Marek Olšák (2):
  amdgpu: add amdgpu_query_sw_info for querying high bits of 32-bit address 
space
  configure.ac: bump version to 2.4.90

Michel Dänzer (7):
  amdgpu: Don't print error message if parse_one_line returned -EAGAIN
  amdgpu: Don't dereference device_handle after amdgpu_device_deinitialize
  amdgpu: Symlink .editorconfig to tests/amdgpu
  amdgpu: Disable deadlock test suite by default for SI ASICs
  amdgpu: Disable VM test suite by default for SI ASICs
  Revert "amdgpu: clean up non list code path for vamgr"
  amdgpu: Add amdgpu_query_sw_info to amdgpu-symbol-check

Rob Clark (1):
  freedreno: clamp priority based on # of rings

Robert Foss (5):
  android: Move gralloc handle struct to libdrm
  android: Add version variable to gralloc_handle_t
  android: Mark gralloc_handle_t magic variable as const
  android: Remove member name from gralloc_handle_t
  android: Change gralloc_handle_t members to be fixed width

Seung-Woo Kim (2):
  amdgpu: fix not to add amdgpu.ids when building without amdgpu
  modetest: Fix to check return value of asprintf()

git tag: libdrm-2.4.90

https://dri.freedesktop.org/libdrm/libdrm-2.4.90.tar.bz2
MD5:  61dcb4989c728f566e3c15c236585a17  libdrm-2.4.90.tar.bz2
SHA1: 7630ba36c65433251a0494b47086fbd0b32ff7a8  libdrm-2.4.90.tar.bz2
SHA256: db37ec8f1dbaa2c192ad9903c8d0988b858ae88031e96f169bf76aaf705db68b  
libdrm-2.4.90.tar.bz2
SHA512: 
3d32d60c44ffdcb58667d0926e6af8d375332add1f243d8b2d37567aeef4e4b26d786294aeecf46c3dea94fc002fb73756567c457300703acfc21e32ffbd458c
  libdrm-2.4.90.tar.bz2
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.90.tar.bz2.sig

https://dri.freedesktop.org/libdrm/libdrm-2.4.90.tar.gz
MD5:  f417488bc6450849b9d571bc2938d194  libdrm-2.4.90.tar.gz
SHA1: 82b0b70ff2356abff95e86d17563a192e52ba98c  libdrm-2.4.90.tar.gz
SHA256: 750a3355fb6cdcee6dcfb366efbbe85d0efe4d9eb02c1f296d854470a1e12c99  
libdrm-2.4.90.tar.gz
SHA512: 
c2eeafbfcdf1a71537ee640a8b1b3c4865dfd272eb12140992904e7213e2a27ddc40be2c2c41a002d4f83ecd6a98cd473c2d0ce4fe3c6d833be41242e3b4
  libdrm-2.4.9
