Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-17 Thread Tvrtko Ursulin



On 17/03/2022 07:08, Kasireddy, Vivek wrote:

Hi Tvrtko,



On 16/03/2022 07:37, Kasireddy, Vivek wrote:

Hi Tvrtko,



On 15/03/2022 07:28, Kasireddy, Vivek wrote:

Hi Tvrtko, Daniel,



On 11/03/2022 09:39, Daniel Vetter wrote:

On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy 

wrote:


On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one that is mappable/
fenceable. Therefore, pageflipping between these 2 FBs where only one
is mappable/fenceable creates latencies large enough to miss alternate
vblanks thereby producing less optimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called which unbinds and
evicts it. This misplaced vma gets subseqently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at-least two objects of a given size and return early if there isn't. This
would ensure that we do not try with PIN_MAPPABLE for any objects that
are too big to map thereby preventing unncessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
it misses the vblank every other frame.

Here is the ftrace snippet that shows the source of the ~10ms latency:
  i915_gem_object_pin_to_display_plane() {
0.102 us   |i915_gem_object_set_cache_level();
i915_gem_object_ggtt_pin_ww() {
0.390 us   |  i915_vma_instance();
0.178 us   |  i915_vma_misplaced();
  i915_vma_unbind() {
  __i915_active_wait() {
0.082 us   |i915_active_acquire_if_busy();
0.475 us   |  }
  intel_runtime_pm_get() {
0.087 us   |intel_runtime_pm_acquire();
0.259 us   |  }
  __i915_active_wait() {
0.085 us   |i915_active_acquire_if_busy();
0.240 us   |  }
  __i915_vma_evict() {
ggtt_unbind_vma() {
  gen8_ggtt_clear_range() {
10507.255 us |}
10507.689 us |  }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
buffer is too big by checking to see if it is possible to map
two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
  instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at-least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
  map into the aperture.
- Account for guard pages while calculating the total size required
  for the object.
- Do not subject all objects to the heuristic check and instead
  consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
  preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
  size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
  as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä 
Cc: Maarten Lankhorst 
Cc: Tvrtko Ursulin 
Cc: Manasi Navare 
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Vivek Kasireddy 
---
 drivers/gpu/drm/i915/i915_gem.c | 128 +++

-

 1 file changed, 94 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
 #include "gem/i915_gem_pm.h"
 #include "gem/i915_gem_region.h"
 #include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
 #include "gt/intel_engine_user.h"
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
spin_unlock(>vma.lock);
 }

+static int
+i915_gem_object_fits_in_aperture(struct 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-17 Thread Daniel Vetter
On Thu, Mar 17, 2022 at 10:04:36AM +, Tvrtko Ursulin wrote:
> 
> On 17/03/2022 09:47, Daniel Vetter wrote:
> > On Tue, Mar 15, 2022 at 09:45:20AM +, Tvrtko Ursulin wrote:
> > > 
> > > On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > > > Hi Tvrtko, Daniel,
> > > > 
> > > > > 
> > > > > On 11/03/2022 09:39, Daniel Vetter wrote:
> > > > > > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy 
> > > > > >  wrote:
> > > > > > > 
> > > > > > > On platforms capable of allowing 8K (7680 x 4320) modes, pinning 
> > > > > > > 2 or
> > > > > > > more framebuffers/scanout buffers results in only one that is 
> > > > > > > mappable/
> > > > > > > fenceable. Therefore, pageflipping between these 2 FBs where only 
> > > > > > > one
> > > > > > > is mappable/fenceable creates latencies large enough to miss 
> > > > > > > alternate
> > > > > > > vblanks thereby producing less optimal framerate.
> > > > > > > 
> > > > > > > This mainly happens because when 
> > > > > > > i915_gem_object_pin_to_display_plane()
> > > > > > > is called to pin one of the FB objs, the associated vma is 
> > > > > > > identified
> > > > > > > as misplaced and therefore i915_vma_unbind() is called which 
> > > > > > > unbinds and
> > > > > > > evicts it. This misplaced vma gets subseqently pinned only when
> > > > > > > i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> > > > > > > results in a latency of ~10ms and happens every other 
> > > > > > > vblank/repaint cycle.
> > > > > > > Therefore, to fix this issue, we try to see if there is space to 
> > > > > > > map
> > > > > > > at-least two objects of a given size and return early if there 
> > > > > > > isn't. This
> > > > > > > would ensure that we do not try with PIN_MAPPABLE for any objects 
> > > > > > > that
> > > > > > > are too big to map thereby preventing unncessary unbind.
> > > > > > > 
> > > > > > > Testcase:
> > > > > > > Running Weston and weston-simple-egl on an Alderlake_S (ADLS) 
> > > > > > > platform
> > > > > > > with a 8K@60 mode results in only ~40 FPS. Since upstream Weston 
> > > > > > > submits
> > > > > > > a frame ~7ms before the next vblank, the latencies seen between 
> > > > > > > atomic
> > > > > > > commit and flip event are 7, 24 (7 + 16.66), 7, 24. 
> > > > > > > suggesting that
> > > > > > > it misses the vblank every other frame.
> > > > > > > 
> > > > > > > Here is the ftrace snippet that shows the source of the ~10ms 
> > > > > > > latency:
> > > > > > >  i915_gem_object_pin_to_display_plane() {
> > > > > > > 0.102 us   |i915_gem_object_set_cache_level();
> > > > > > >i915_gem_object_ggtt_pin_ww() {
> > > > > > > 0.390 us   |  i915_vma_instance();
> > > > > > > 0.178 us   |  i915_vma_misplaced();
> > > > > > >  i915_vma_unbind() {
> > > > > > >  __i915_active_wait() {
> > > > > > > 0.082 us   |i915_active_acquire_if_busy();
> > > > > > > 0.475 us   |  }
> > > > > > >  intel_runtime_pm_get() {
> > > > > > > 0.087 us   |intel_runtime_pm_acquire();
> > > > > > > 0.259 us   |  }
> > > > > > >  __i915_active_wait() {
> > > > > > > 0.085 us   |i915_active_acquire_if_busy();
> > > > > > > 0.240 us   |  }
> > > > > > >  __i915_vma_evict() {
> > > > > > >ggtt_unbind_vma() {
> > > > > > >  gen8_ggtt_clear_range() {
> > > > > > > 10507.255 us |}
> > > > > > > 10507.689 us |  }
> > > > > > > 10508.516 us |   }
> > > > > > > 
> > > > > > > v2: Instead of using bigjoiner checks, determine whether a scanout
> > > > > > >buffer is too big by checking to see if it is possible to 
> > > > > > > map
> > > > > > >two of them into the ggtt.
> > > > > > > 
> > > > > > > v3 (Ville):
> > > > > > > - Count how many fb objects can be fit into the available holes
> > > > > > >  instead of checking for a hole twice the object size.
> > > > > > > - Take alignment constraints into account.
> > > > > > > - Limit this large scanout buffer check to >= Gen 11 platforms.
> > > > > > > 
> > > > > > > v4:
> > > > > > > - Remove existing heuristic that checks just for size. (Ville)
> > > > > > > - Return early if we find space to map at-least two objects. 
> > > > > > > (Tvrtko)
> > > > > > > - Slightly update the commit message.
> > > > > > > 
> > > > > > > v5: (Tvrtko)
> > > > > > > - Rename the function to indicate that the object may be too big 
> > > > > > > to
> > > > > > >  map into the aperture.
> > > > > > > - Account for guard pages while calculating the total size 
> > > > > > > required
> > > > > > >  for the object.
> > > > > > > - Do not subject all objects to the heuristic check and instead
> > > > > > >  consider objects only of a certain size.
> > > > > > > - Do the hole walk using the rbtree.
> > > > > > > - Preserve the existing PIN_NONBLOCK logic.
> > > > > > > - Drop the 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-17 Thread Tvrtko Ursulin



On 17/03/2022 09:47, Daniel Vetter wrote:

On Tue, Mar 15, 2022 at 09:45:20AM +, Tvrtko Ursulin wrote:


On 15/03/2022 07:28, Kasireddy, Vivek wrote:

Hi Tvrtko, Daniel,



On 11/03/2022 09:39, Daniel Vetter wrote:

On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  wrote:


On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one that is mappable/
fenceable. Therefore, pageflipping between these 2 FBs where only one
is mappable/fenceable creates latencies large enough to miss alternate
vblanks thereby producing less optimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called which unbinds and
evicts it. This misplaced vma gets subseqently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at-least two objects of a given size and return early if there isn't. This
would ensure that we do not try with PIN_MAPPABLE for any objects that
are too big to map thereby preventing unncessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
it misses the vblank every other frame.

Here is the ftrace snippet that shows the source of the ~10ms latency:
 i915_gem_object_pin_to_display_plane() {
0.102 us   |i915_gem_object_set_cache_level();
   i915_gem_object_ggtt_pin_ww() {
0.390 us   |  i915_vma_instance();
0.178 us   |  i915_vma_misplaced();
 i915_vma_unbind() {
 __i915_active_wait() {
0.082 us   |i915_active_acquire_if_busy();
0.475 us   |  }
 intel_runtime_pm_get() {
0.087 us   |intel_runtime_pm_acquire();
0.259 us   |  }
 __i915_active_wait() {
0.085 us   |i915_active_acquire_if_busy();
0.240 us   |  }
 __i915_vma_evict() {
   ggtt_unbind_vma() {
 gen8_ggtt_clear_range() {
10507.255 us |}
10507.689 us |  }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
   buffer is too big by checking to see if it is possible to map
   two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
 instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at-least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
 map into the aperture.
- Account for guard pages while calculating the total size required
 for the object.
- Do not subject all objects to the heuristic check and instead
 consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
 preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
 size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
 as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä 
Cc: Maarten Lankhorst 
Cc: Tvrtko Ursulin 
Cc: Manasi Navare 
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Vivek Kasireddy 
---
drivers/gpu/drm/i915/i915_gem.c | 128 +++-
1 file changed, 94 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
#include "gem/i915_gem_pm.h"
#include "gem/i915_gem_region.h"
#include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
#include "gt/intel_engine_user.h"
#include "gt/intel_gt.h"
#include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
   spin_unlock(>vma.lock);
}

+static int
+i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
+ 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-17 Thread Daniel Vetter
On Tue, Mar 15, 2022 at 09:45:20AM +, Tvrtko Ursulin wrote:
> 
> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > Hi Tvrtko, Daniel,
> > 
> > > 
> > > On 11/03/2022 09:39, Daniel Vetter wrote:
> > > > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy 
> > > >  wrote:
> > > > > 
> > > > > On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> > > > > more framebuffers/scanout buffers results in only one that is 
> > > > > mappable/
> > > > > fenceable. Therefore, pageflipping between these 2 FBs where only one
> > > > > is mappable/fenceable creates latencies large enough to miss alternate
> > > > > vblanks thereby producing less optimal framerate.
> > > > > 
> > > > > This mainly happens because when 
> > > > > i915_gem_object_pin_to_display_plane()
> > > > > is called to pin one of the FB objs, the associated vma is identified
> > > > > as misplaced and therefore i915_vma_unbind() is called which unbinds 
> > > > > and
> > > > > evicts it. This misplaced vma gets subseqently pinned only when
> > > > > i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> > > > > results in a latency of ~10ms and happens every other vblank/repaint 
> > > > > cycle.
> > > > > Therefore, to fix this issue, we try to see if there is space to map
> > > > > at-least two objects of a given size and return early if there isn't. 
> > > > > This
> > > > > would ensure that we do not try with PIN_MAPPABLE for any objects that
> > > > > are too big to map thereby preventing unncessary unbind.
> > > > > 
> > > > > Testcase:
> > > > > Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> > > > > with a 8K@60 mode results in only ~40 FPS. Since upstream Weston 
> > > > > submits
> > > > > a frame ~7ms before the next vblank, the latencies seen between atomic
> > > > > commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting 
> > > > > that
> > > > > it misses the vblank every other frame.
> > > > > 
> > > > > Here is the ftrace snippet that shows the source of the ~10ms latency:
> > > > > i915_gem_object_pin_to_display_plane() {
> > > > > 0.102 us   |i915_gem_object_set_cache_level();
> > > > >   i915_gem_object_ggtt_pin_ww() {
> > > > > 0.390 us   |  i915_vma_instance();
> > > > > 0.178 us   |  i915_vma_misplaced();
> > > > > i915_vma_unbind() {
> > > > > __i915_active_wait() {
> > > > > 0.082 us   |i915_active_acquire_if_busy();
> > > > > 0.475 us   |  }
> > > > > intel_runtime_pm_get() {
> > > > > 0.087 us   |intel_runtime_pm_acquire();
> > > > > 0.259 us   |  }
> > > > > __i915_active_wait() {
> > > > > 0.085 us   |i915_active_acquire_if_busy();
> > > > > 0.240 us   |  }
> > > > > __i915_vma_evict() {
> > > > >   ggtt_unbind_vma() {
> > > > > gen8_ggtt_clear_range() {
> > > > > 10507.255 us |}
> > > > > 10507.689 us |  }
> > > > > 10508.516 us |   }
> > > > > 
> > > > > v2: Instead of using bigjoiner checks, determine whether a scanout
> > > > >   buffer is too big by checking to see if it is possible to map
> > > > >   two of them into the ggtt.
> > > > > 
> > > > > v3 (Ville):
> > > > > - Count how many fb objects can be fit into the available holes
> > > > > instead of checking for a hole twice the object size.
> > > > > - Take alignment constraints into account.
> > > > > - Limit this large scanout buffer check to >= Gen 11 platforms.
> > > > > 
> > > > > v4:
> > > > > - Remove existing heuristic that checks just for size. (Ville)
> > > > > - Return early if we find space to map at-least two objects. (Tvrtko)
> > > > > - Slightly update the commit message.
> > > > > 
> > > > > v5: (Tvrtko)
> > > > > - Rename the function to indicate that the object may be too big to
> > > > > map into the aperture.
> > > > > - Account for guard pages while calculating the total size required
> > > > > for the object.
> > > > > - Do not subject all objects to the heuristic check and instead
> > > > > consider objects only of a certain size.
> > > > > - Do the hole walk using the rbtree.
> > > > > - Preserve the existing PIN_NONBLOCK logic.
> > > > > - Drop the PIN_MAPPABLE check while pinning the VMA.
> > > > > 
> > > > > v6: (Tvrtko)
> > > > > - Return 0 on success and the specific error code on failure to
> > > > > preserve the existing behavior.
> > > > > 
> > > > > v7: (Ville)
> > > > > - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> > > > > size < ggtt->mappable_end / 4 checks.
> > > > > - Drop the redundant check that is based on previous heuristic.
> > > > > 
> > > > > v8:
> > > > > - Make sure that we are holding the mutex associated with ggtt vm
> > > > > as we traverse the hole nodes.
> > > > > 
> > > > > v9: (Tvrtko)
> > > > > - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> > > > > 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-17 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> On 16/03/2022 07:37, Kasireddy, Vivek wrote:
> > Hi Tvrtko,
> >
> >>
> >> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> >>> Hi Tvrtko, Daniel,
> >>>
> 
>  On 11/03/2022 09:39, Daniel Vetter wrote:
> > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy 
> wrote:
> >>
> >> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >> more framebuffers/scanout buffers results in only one that is mappable/
> >> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >> is mappable/fenceable creates latencies large enough to miss alternate
> >> vblanks thereby producing less optimal framerate.
> >>
> >> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >> is called to pin one of the FB objs, the associated vma is identified
> >> as misplaced and therefore i915_vma_unbind() is called which unbinds 
> >> and
> >> evicts it. This misplaced vma gets subseqently pinned only when
> >> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >> results in a latency of ~10ms and happens every other vblank/repaint 
> >> cycle.
> >> Therefore, to fix this issue, we try to see if there is space to map
> >> at-least two objects of a given size and return early if there isn't. 
> >> This
> >> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >> are too big to map thereby preventing unncessary unbind.
> >>
> >> Testcase:
> >> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston 
> >> submits
> >> a frame ~7ms before the next vblank, the latencies seen between atomic
> >> commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
> >> it misses the vblank every other frame.
> >>
> >> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>  i915_gem_object_pin_to_display_plane() {
> >> 0.102 us   |i915_gem_object_set_cache_level();
> >>i915_gem_object_ggtt_pin_ww() {
> >> 0.390 us   |  i915_vma_instance();
> >> 0.178 us   |  i915_vma_misplaced();
> >>  i915_vma_unbind() {
> >>  __i915_active_wait() {
> >> 0.082 us   |i915_active_acquire_if_busy();
> >> 0.475 us   |  }
> >>  intel_runtime_pm_get() {
> >> 0.087 us   |intel_runtime_pm_acquire();
> >> 0.259 us   |  }
> >>  __i915_active_wait() {
> >> 0.085 us   |i915_active_acquire_if_busy();
> >> 0.240 us   |  }
> >>  __i915_vma_evict() {
> >>ggtt_unbind_vma() {
> >>  gen8_ggtt_clear_range() {
> >> 10507.255 us |}
> >> 10507.689 us |  }
> >> 10508.516 us |   }
> >>
> >> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>buffer is too big by checking to see if it is possible to map
> >>two of them into the ggtt.
> >>
> >> v3 (Ville):
> >> - Count how many fb objects can be fit into the available holes
> >>  instead of checking for a hole twice the object size.
> >> - Take alignment constraints into account.
> >> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>
> >> v4:
> >> - Remove existing heuristic that checks just for size. (Ville)
> >> - Return early if we find space to map at-least two objects. (Tvrtko)
> >> - Slightly update the commit message.
> >>
> >> v5: (Tvrtko)
> >> - Rename the function to indicate that the object may be too big to
> >>  map into the aperture.
> >> - Account for guard pages while calculating the total size required
> >>  for the object.
> >> - Do not subject all objects to the heuristic check and instead
> >>  consider objects only of a certain size.
> >> - Do the hole walk using the rbtree.
> >> - Preserve the existing PIN_NONBLOCK logic.
> >> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>
> >> v6: (Tvrtko)
> >> - Return 0 on success and the specific error code on failure to
> >>  preserve the existing behavior.
> >>
> >> v7: (Ville)
> >> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>  size < ggtt->mappable_end / 4 checks.
> >> - Drop the redundant check that is based on previous heuristic.
> >>
> >> v8:
> >> - Make sure that we are holding the mutex associated with ggtt vm
> >>  as we traverse the hole nodes.
> >>
> >> v9: (Tvrtko)
> >> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>
> >> Cc: Ville Syrjälä 
> >> Cc: Maarten Lankhorst 
> >> Cc: Tvrtko Ursulin 
> >> Cc: 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-16 Thread Tvrtko Ursulin



On 16/03/2022 07:37, Kasireddy, Vivek wrote:

Hi Tvrtko,



On 15/03/2022 07:28, Kasireddy, Vivek wrote:

Hi Tvrtko, Daniel,



On 11/03/2022 09:39, Daniel Vetter wrote:

On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  wrote:


On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one that is mappable/
fenceable. Therefore, pageflipping between these 2 FBs where only one
is mappable/fenceable creates latencies large enough to miss alternate
vblanks thereby producing less optimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called which unbinds and
evicts it. This misplaced vma gets subseqently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at-least two objects of a given size and return early if there isn't. This
would ensure that we do not try with PIN_MAPPABLE for any objects that
are too big to map thereby preventing unncessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
it misses the vblank every other frame.

Here is the ftrace snippet that shows the source of the ~10ms latency:
 i915_gem_object_pin_to_display_plane() {
0.102 us   |i915_gem_object_set_cache_level();
   i915_gem_object_ggtt_pin_ww() {
0.390 us   |  i915_vma_instance();
0.178 us   |  i915_vma_misplaced();
 i915_vma_unbind() {
 __i915_active_wait() {
0.082 us   |i915_active_acquire_if_busy();
0.475 us   |  }
 intel_runtime_pm_get() {
0.087 us   |intel_runtime_pm_acquire();
0.259 us   |  }
 __i915_active_wait() {
0.085 us   |i915_active_acquire_if_busy();
0.240 us   |  }
 __i915_vma_evict() {
   ggtt_unbind_vma() {
 gen8_ggtt_clear_range() {
10507.255 us |}
10507.689 us |  }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
   buffer is too big by checking to see if it is possible to map
   two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
 instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at-least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
 map into the aperture.
- Account for guard pages while calculating the total size required
 for the object.
- Do not subject all objects to the heuristic check and instead
 consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
 preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
 size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
 as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä 
Cc: Maarten Lankhorst 
Cc: Tvrtko Ursulin 
Cc: Manasi Navare 
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Vivek Kasireddy 
---
drivers/gpu/drm/i915/i915_gem.c | 128 +++-
1 file changed, 94 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
#include "gem/i915_gem_pm.h"
#include "gem/i915_gem_region.h"
#include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
#include "gt/intel_engine_user.h"
#include "gt/intel_gt.h"
#include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
   spin_unlock(>vma.lock);
}

+static int
+i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
+u64 alignment, u64 flags)


Tvrtko asked 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-16 Thread Kasireddy, Vivek
Hi Tvrtko,

> 
> On 15/03/2022 07:28, Kasireddy, Vivek wrote:
> > Hi Tvrtko, Daniel,
> >
> >>
> >> On 11/03/2022 09:39, Daniel Vetter wrote:
> >>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  
> >>> wrote:
> 
>  On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
>  more framebuffers/scanout buffers results in only one that is mappable/
>  fenceable. Therefore, pageflipping between these 2 FBs where only one
>  is mappable/fenceable creates latencies large enough to miss alternate
>  vblanks thereby producing less optimal framerate.
> 
>  This mainly happens because when i915_gem_object_pin_to_display_plane()
>  is called to pin one of the FB objs, the associated vma is identified
>  as misplaced and therefore i915_vma_unbind() is called which unbinds and
>  evicts it. This misplaced vma gets subseqently pinned only when
>  i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
>  results in a latency of ~10ms and happens every other vblank/repaint 
>  cycle.
>  Therefore, to fix this issue, we try to see if there is space to map
>  at-least two objects of a given size and return early if there isn't. 
>  This
>  would ensure that we do not try with PIN_MAPPABLE for any objects that
>  are too big to map thereby preventing unncessary unbind.
> 
>  Testcase:
>  Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
>  with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
>  a frame ~7ms before the next vblank, the latencies seen between atomic
>  commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
>  it misses the vblank every other frame.
> 
>  Here is the ftrace snippet that shows the source of the ~10ms latency:
>  i915_gem_object_pin_to_display_plane() {
>  0.102 us   |i915_gem_object_set_cache_level();
>    i915_gem_object_ggtt_pin_ww() {
>  0.390 us   |  i915_vma_instance();
>  0.178 us   |  i915_vma_misplaced();
>  i915_vma_unbind() {
>  __i915_active_wait() {
>  0.082 us   |i915_active_acquire_if_busy();
>  0.475 us   |  }
>  intel_runtime_pm_get() {
>  0.087 us   |intel_runtime_pm_acquire();
>  0.259 us   |  }
>  __i915_active_wait() {
>  0.085 us   |i915_active_acquire_if_busy();
>  0.240 us   |  }
>  __i915_vma_evict() {
>    ggtt_unbind_vma() {
>  gen8_ggtt_clear_range() {
>  10507.255 us |}
>  10507.689 us |  }
>  10508.516 us |   }
> 
>  v2: Instead of using bigjoiner checks, determine whether a scanout
>    buffer is too big by checking to see if it is possible to map
>    two of them into the ggtt.
> 
>  v3 (Ville):
>  - Count how many fb objects can be fit into the available holes
>  instead of checking for a hole twice the object size.
>  - Take alignment constraints into account.
>  - Limit this large scanout buffer check to >= Gen 11 platforms.
> 
>  v4:
>  - Remove existing heuristic that checks just for size. (Ville)
>  - Return early if we find space to map at-least two objects. (Tvrtko)
>  - Slightly update the commit message.
> 
>  v5: (Tvrtko)
>  - Rename the function to indicate that the object may be too big to
>  map into the aperture.
>  - Account for guard pages while calculating the total size required
>  for the object.
>  - Do not subject all objects to the heuristic check and instead
>  consider objects only of a certain size.
>  - Do the hole walk using the rbtree.
>  - Preserve the existing PIN_NONBLOCK logic.
>  - Drop the PIN_MAPPABLE check while pinning the VMA.
> 
>  v6: (Tvrtko)
>  - Return 0 on success and the specific error code on failure to
>  preserve the existing behavior.
> 
>  v7: (Ville)
>  - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>  size < ggtt->mappable_end / 4 checks.
>  - Drop the redundant check that is based on previous heuristic.
> 
>  v8:
>  - Make sure that we are holding the mutex associated with ggtt vm
>  as we traverse the hole nodes.
> 
>  v9: (Tvrtko)
>  - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> 
>  Cc: Ville Syrjälä 
>  Cc: Maarten Lankhorst 
>  Cc: Tvrtko Ursulin 
>  Cc: Manasi Navare 
>  Reviewed-by: Tvrtko Ursulin 
>  Signed-off-by: Vivek Kasireddy 
>  ---
> drivers/gpu/drm/i915/i915_gem.c | 128 +++-
> 1 file changed, 94 insertions(+), 34 deletions(-)
> 
>  diff --git a/drivers/gpu/drm/i915/i915_gem.c 
>  

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-15 Thread Tvrtko Ursulin



On 15/03/2022 07:28, Kasireddy, Vivek wrote:

Hi Tvrtko, Daniel,



On 11/03/2022 09:39, Daniel Vetter wrote:

On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  wrote:


On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one that is mappable/
fenceable. Therefore, pageflipping between these 2 FBs where only one
is mappable/fenceable creates latencies large enough to miss alternate
vblanks thereby producing less optimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called which unbinds and
evicts it. This misplaced vma gets subseqently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at-least two objects of a given size and return early if there isn't. This
would ensure that we do not try with PIN_MAPPABLE for any objects that
are too big to map thereby preventing unncessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
it misses the vblank every other frame.

Here is the ftrace snippet that shows the source of the ~10ms latency:
i915_gem_object_pin_to_display_plane() {
0.102 us   |i915_gem_object_set_cache_level();
  i915_gem_object_ggtt_pin_ww() {
0.390 us   |  i915_vma_instance();
0.178 us   |  i915_vma_misplaced();
i915_vma_unbind() {
__i915_active_wait() {
0.082 us   |i915_active_acquire_if_busy();
0.475 us   |  }
intel_runtime_pm_get() {
0.087 us   |intel_runtime_pm_acquire();
0.259 us   |  }
__i915_active_wait() {
0.085 us   |i915_active_acquire_if_busy();
0.240 us   |  }
__i915_vma_evict() {
  ggtt_unbind_vma() {
gen8_ggtt_clear_range() {
10507.255 us |}
10507.689 us |  }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
  buffer is too big by checking to see if it is possible to map
  two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at-least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
map into the aperture.
- Account for guard pages while calculating the total size required
for the object.
- Do not subject all objects to the heuristic check and instead
consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä 
Cc: Maarten Lankhorst 
Cc: Tvrtko Ursulin 
Cc: Manasi Navare 
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Vivek Kasireddy 
---
   drivers/gpu/drm/i915/i915_gem.c | 128 +++-
   1 file changed, 94 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
   #include "gem/i915_gem_pm.h"
   #include "gem/i915_gem_region.h"
   #include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
   #include "gt/intel_engine_user.h"
   #include "gt/intel_gt.h"
   #include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
  spin_unlock(>vma.lock);
   }

+static int
+i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
+u64 alignment, u64 flags)


Tvrtko asked me to ack the first patch, but then I looked at this and
started wondering.


Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-15 Thread Kasireddy, Vivek
Hi Tvrtko, Daniel,

> 
> On 11/03/2022 09:39, Daniel Vetter wrote:
> > On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  
> > wrote:
> >>
> >> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> >> more framebuffers/scanout buffers results in only one that is mappable/
> >> fenceable. Therefore, pageflipping between these 2 FBs where only one
> >> is mappable/fenceable creates latencies large enough to miss alternate
> >> vblanks thereby producing less optimal framerate.
> >>
> >> This mainly happens because when i915_gem_object_pin_to_display_plane()
> >> is called to pin one of the FB objs, the associated vma is identified
> >> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> >> evicts it. This misplaced vma gets subseqently pinned only when
> >> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> >> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> >> Therefore, to fix this issue, we try to see if there is space to map
> >> at-least two objects of a given size and return early if there isn't. This
> >> would ensure that we do not try with PIN_MAPPABLE for any objects that
> >> are too big to map thereby preventing unncessary unbind.
> >>
> >> Testcase:
> >> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> >> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> >> a frame ~7ms before the next vblank, the latencies seen between atomic
> >> commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
> >> it misses the vblank every other frame.
> >>
> >> Here is the ftrace snippet that shows the source of the ~10ms latency:
> >>i915_gem_object_pin_to_display_plane() {
> >> 0.102 us   |i915_gem_object_set_cache_level();
> >>  i915_gem_object_ggtt_pin_ww() {
> >> 0.390 us   |  i915_vma_instance();
> >> 0.178 us   |  i915_vma_misplaced();
> >>i915_vma_unbind() {
> >>__i915_active_wait() {
> >> 0.082 us   |i915_active_acquire_if_busy();
> >> 0.475 us   |  }
> >>intel_runtime_pm_get() {
> >> 0.087 us   |intel_runtime_pm_acquire();
> >> 0.259 us   |  }
> >>__i915_active_wait() {
> >> 0.085 us   |i915_active_acquire_if_busy();
> >> 0.240 us   |  }
> >>__i915_vma_evict() {
> >>  ggtt_unbind_vma() {
> >>gen8_ggtt_clear_range() {
> >> 10507.255 us |}
> >> 10507.689 us |  }
> >> 10508.516 us |   }
> >>
> >> v2: Instead of using bigjoiner checks, determine whether a scanout
> >>  buffer is too big by checking to see if it is possible to map
> >>  two of them into the ggtt.
> >>
> >> v3 (Ville):
> >> - Count how many fb objects can be fit into the available holes
> >>instead of checking for a hole twice the object size.
> >> - Take alignment constraints into account.
> >> - Limit this large scanout buffer check to >= Gen 11 platforms.
> >>
> >> v4:
> >> - Remove existing heuristic that checks just for size. (Ville)
> >> - Return early if we find space to map at-least two objects. (Tvrtko)
> >> - Slightly update the commit message.
> >>
> >> v5: (Tvrtko)
> >> - Rename the function to indicate that the object may be too big to
> >>map into the aperture.
> >> - Account for guard pages while calculating the total size required
> >>for the object.
> >> - Do not subject all objects to the heuristic check and instead
> >>consider objects only of a certain size.
> >> - Do the hole walk using the rbtree.
> >> - Preserve the existing PIN_NONBLOCK logic.
> >> - Drop the PIN_MAPPABLE check while pinning the VMA.
> >>
> >> v6: (Tvrtko)
> >> - Return 0 on success and the specific error code on failure to
> >>preserve the existing behavior.
> >>
> >> v7: (Ville)
> >> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
> >>size < ggtt->mappable_end / 4 checks.
> >> - Drop the redundant check that is based on previous heuristic.
> >>
> >> v8:
> >> - Make sure that we are holding the mutex associated with ggtt vm
> >>as we traverse the hole nodes.
> >>
> >> v9: (Tvrtko)
> >> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
> >>
> >> Cc: Ville Syrjälä 
> >> Cc: Maarten Lankhorst 
> >> Cc: Tvrtko Ursulin 
> >> Cc: Manasi Navare 
> >> Reviewed-by: Tvrtko Ursulin 
> >> Signed-off-by: Vivek Kasireddy 
> >> ---
> >>   drivers/gpu/drm/i915/i915_gem.c | 128 +++-
> >>   1 file changed, 94 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c 
> >> b/drivers/gpu/drm/i915/i915_gem.c
> >> index 9747924cc57b..e0d731b3f215 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -49,6 +49,7 @@
> >>   #include "gem/i915_gem_pm.h"
> >>   #include "gem/i915_gem_region.h"
> >>   #include "gem/i915_gem_userptr.h"
> >> 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-14 Thread Tvrtko Ursulin



On 11/03/2022 09:39, Daniel Vetter wrote:

On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  wrote:


On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
more framebuffers/scanout buffers results in only one that is mappable/
fenceable. Therefore, pageflipping between these 2 FBs where only one
is mappable/fenceable creates latencies large enough to miss alternate
vblanks thereby producing less optimal framerate.

This mainly happens because when i915_gem_object_pin_to_display_plane()
is called to pin one of the FB objs, the associated vma is identified
as misplaced and therefore i915_vma_unbind() is called which unbinds and
evicts it. This misplaced vma gets subseqently pinned only when
i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
results in a latency of ~10ms and happens every other vblank/repaint cycle.
Therefore, to fix this issue, we try to see if there is space to map
at-least two objects of a given size and return early if there isn't. This
would ensure that we do not try with PIN_MAPPABLE for any objects that
are too big to map thereby preventing unncessary unbind.

Testcase:
Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
a frame ~7ms before the next vblank, the latencies seen between atomic
commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
it misses the vblank every other frame.

Here is the ftrace snippet that shows the source of the ~10ms latency:
   i915_gem_object_pin_to_display_plane() {
0.102 us   |i915_gem_object_set_cache_level();
 i915_gem_object_ggtt_pin_ww() {
0.390 us   |  i915_vma_instance();
0.178 us   |  i915_vma_misplaced();
   i915_vma_unbind() {
   __i915_active_wait() {
0.082 us   |i915_active_acquire_if_busy();
0.475 us   |  }
   intel_runtime_pm_get() {
0.087 us   |intel_runtime_pm_acquire();
0.259 us   |  }
   __i915_active_wait() {
0.085 us   |i915_active_acquire_if_busy();
0.240 us   |  }
   __i915_vma_evict() {
 ggtt_unbind_vma() {
   gen8_ggtt_clear_range() {
10507.255 us |}
10507.689 us |  }
10508.516 us |   }

v2: Instead of using bigjoiner checks, determine whether a scanout
 buffer is too big by checking to see if it is possible to map
 two of them into the ggtt.

v3 (Ville):
- Count how many fb objects can be fit into the available holes
   instead of checking for a hole twice the object size.
- Take alignment constraints into account.
- Limit this large scanout buffer check to >= Gen 11 platforms.

v4:
- Remove existing heuristic that checks just for size. (Ville)
- Return early if we find space to map at-least two objects. (Tvrtko)
- Slightly update the commit message.

v5: (Tvrtko)
- Rename the function to indicate that the object may be too big to
   map into the aperture.
- Account for guard pages while calculating the total size required
   for the object.
- Do not subject all objects to the heuristic check and instead
   consider objects only of a certain size.
- Do the hole walk using the rbtree.
- Preserve the existing PIN_NONBLOCK logic.
- Drop the PIN_MAPPABLE check while pinning the VMA.

v6: (Tvrtko)
- Return 0 on success and the specific error code on failure to
   preserve the existing behavior.

v7: (Ville)
- Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
   size < ggtt->mappable_end / 4 checks.
- Drop the redundant check that is based on previous heuristic.

v8:
- Make sure that we are holding the mutex associated with ggtt vm
   as we traverse the hole nodes.

v9: (Tvrtko)
- Use mutex_lock_interruptible_nested() instead of mutex_lock().

Cc: Ville Syrjälä 
Cc: Maarten Lankhorst 
Cc: Tvrtko Ursulin 
Cc: Manasi Navare 
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Vivek Kasireddy 
---
  drivers/gpu/drm/i915/i915_gem.c | 128 +++-
  1 file changed, 94 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9747924cc57b..e0d731b3f215 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -49,6 +49,7 @@
  #include "gem/i915_gem_pm.h"
  #include "gem/i915_gem_region.h"
  #include "gem/i915_gem_userptr.h"
+#include "gem/i915_gem_tiling.h"
  #include "gt/intel_engine_user.h"
  #include "gt/intel_gt.h"
  #include "gt/intel_gt_pm.h"
@@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
 spin_unlock(>vma.lock);
  }

+static int
+i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
+u64 alignment, u64 flags)


Tvrtko asked me to ack the first patch, but then I looked at this and
started wondering.

Conceptually this doesn't pass the smell test. What if we have
multiple per-crtc buffers? Multiple planes 

Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/gem: Don't try to map and fence large scanout buffers (v9)

2022-03-11 Thread Daniel Vetter
On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy  wrote:
>
> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> more framebuffers/scanout buffers results in only one that is mappable/
> fenceable. Therefore, pageflipping between these 2 FBs where only one
> is mappable/fenceable creates latencies large enough to miss alternate
> vblanks thereby producing less optimal framerate.
>
> This mainly happens because when i915_gem_object_pin_to_display_plane()
> is called to pin one of the FB objs, the associated vma is identified
> as misplaced and therefore i915_vma_unbind() is called which unbinds and
> evicts it. This misplaced vma gets subseqently pinned only when
> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This
> results in a latency of ~10ms and happens every other vblank/repaint cycle.
> Therefore, to fix this issue, we try to see if there is space to map
> at-least two objects of a given size and return early if there isn't. This
> would ensure that we do not try with PIN_MAPPABLE for any objects that
> are too big to map thereby preventing unncessary unbind.
>
> Testcase:
> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> a frame ~7ms before the next vblank, the latencies seen between atomic
> commit and flip event are 7, 24 (7 + 16.66), 7, 24. suggesting that
> it misses the vblank every other frame.
>
> Here is the ftrace snippet that shows the source of the ~10ms latency:
>   i915_gem_object_pin_to_display_plane() {
> 0.102 us   |i915_gem_object_set_cache_level();
> i915_gem_object_ggtt_pin_ww() {
> 0.390 us   |  i915_vma_instance();
> 0.178 us   |  i915_vma_misplaced();
>   i915_vma_unbind() {
>   __i915_active_wait() {
> 0.082 us   |i915_active_acquire_if_busy();
> 0.475 us   |  }
>   intel_runtime_pm_get() {
> 0.087 us   |intel_runtime_pm_acquire();
> 0.259 us   |  }
>   __i915_active_wait() {
> 0.085 us   |i915_active_acquire_if_busy();
> 0.240 us   |  }
>   __i915_vma_evict() {
> ggtt_unbind_vma() {
>   gen8_ggtt_clear_range() {
> 10507.255 us |}
> 10507.689 us |  }
> 10508.516 us |   }
>
> v2: Instead of using bigjoiner checks, determine whether a scanout
> buffer is too big by checking to see if it is possible to map
> two of them into the ggtt.
>
> v3 (Ville):
> - Count how many fb objects can be fit into the available holes
>   instead of checking for a hole twice the object size.
> - Take alignment constraints into account.
> - Limit this large scanout buffer check to >= Gen 11 platforms.
>
> v4:
> - Remove existing heuristic that checks just for size. (Ville)
> - Return early if we find space to map at-least two objects. (Tvrtko)
> - Slightly update the commit message.
>
> v5: (Tvrtko)
> - Rename the function to indicate that the object may be too big to
>   map into the aperture.
> - Account for guard pages while calculating the total size required
>   for the object.
> - Do not subject all objects to the heuristic check and instead
>   consider objects only of a certain size.
> - Do the hole walk using the rbtree.
> - Preserve the existing PIN_NONBLOCK logic.
> - Drop the PIN_MAPPABLE check while pinning the VMA.
>
> v6: (Tvrtko)
> - Return 0 on success and the specific error code on failure to
>   preserve the existing behavior.
>
> v7: (Ville)
> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and
>   size < ggtt->mappable_end / 4 checks.
> - Drop the redundant check that is based on previous heuristic.
>
> v8:
> - Make sure that we are holding the mutex associated with ggtt vm
>   as we traverse the hole nodes.
>
> v9: (Tvrtko)
> - Use mutex_lock_interruptible_nested() instead of mutex_lock().
>
> Cc: Ville Syrjälä 
> Cc: Maarten Lankhorst 
> Cc: Tvrtko Ursulin 
> Cc: Manasi Navare 
> Reviewed-by: Tvrtko Ursulin 
> Signed-off-by: Vivek Kasireddy 
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 128 +++-
>  1 file changed, 94 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 9747924cc57b..e0d731b3f215 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -49,6 +49,7 @@
>  #include "gem/i915_gem_pm.h"
>  #include "gem/i915_gem_region.h"
>  #include "gem/i915_gem_userptr.h"
> +#include "gem/i915_gem_tiling.h"
>  #include "gt/intel_engine_user.h"
>  #include "gt/intel_gt.h"
>  #include "gt/intel_gt_pm.h"
> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> spin_unlock(>vma.lock);
>  }
>
> +static int
> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj,
> +u64 alignment, u64 flags)

Tvrtko asked me to ack the first patch, but then I