Hi Alex,

On Mon, 21 Aug 2023 at 16:44, Alexander Graf <ag...@csgraf.de> wrote:
>
>
> On 22.08.23 00:10, Simon Glass wrote:
> > Hi Alex,
> >
> > On Mon, 21 Aug 2023 at 13:59, Alexander Graf <ag...@csgraf.de> wrote:
> >>
> >> On 21.08.23 21:11, Simon Glass wrote:
> >>> Hi Alper,
> >>>
> >>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiya...@gmail.com>
> >>> wrote:
> >>>> From: Alexander Graf <ag...@csgraf.de>
> >>>>
> >>>> Now that we have a damage area that tells us which parts of the frame
> >>>> buffer actually need updating, let's only dcache flush those on
> >>>> video_sync() calls. With this optimization in place, frame buffer
> >>>> updates - especially on large screens such as 4k displays - speed up
> >>>> significantly.
> >>>>
> >>>> Signed-off-by: Alexander Graf <ag...@csgraf.de>
> >>>> Reported-by: Da Xue <da@libre.computer>
> >>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
> >>>> Co-developed-by: Alper Nebi Yasak <alpernebiya...@gmail.com>
> >>>> Signed-off-by: Alper Nebi Yasak <alpernebiya...@gmail.com>
> >>>> ---
> >>>>
> >>>> Changes in v5:
> >>>> - Use xstart, ystart, xend, yend as names for damage region
> >>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
> >>>>
> >>>> Changes in v2:
> >>>> - Fix dcache range; we were flushing too much before
> >>>> - Remove ifdefs
> >>>>
> >>>>  drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
> >>>>  1 file changed, 36 insertions(+), 5 deletions(-)
> >>>
> >>> This is a little strange, since flushing the whole cache will only
> >>> actually write out data that was actually written (to the display). Is
> >>> there a benefit to this patch, in terms of performance?
> >>
> >> I'm happy to see you go through the same thought process I went through
> >> when writing these: "This surely can't be the problem, can it?". The
> >> answer is "simple" in hindsight:
> >>
> >> Have a look at the ARMv8 cache flush function.
> >> It does the only "safe" thing you can expect it to do: Clean+Invalidate
> >> to PoC, because we use it for multiple things, clearing modified code
> >> among others:
> >>
> >> ENTRY(__asm_flush_dcache_range)
> >>         mrs     x3, ctr_el0
> >>         ubfx    x3, x3, #16, #4
> >>         mov     x2, #4
> >>         lsl     x2, x2, x3      /* cache line size */
> >>
> >>         /* x2 <- minimal cache line size in cache system */
> >>         sub     x3, x2, #1
> >>         bic     x0, x0, x3
> >> 1:      dc      civac, x0       /* clean & invalidate data or unified cache */
> >>         add     x0, x0, x2
> >>         cmp     x0, x1
> >>         b.lo    1b
> >>         dsb     sy
> >>         ret
> >> ENDPROC(__asm_flush_dcache_range)
> >>
> >> Looking at the "dc civac" call, we find this documentation page from
> >> ARM:
> >> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
> >>
> >> This says we're writing any dirtiness of this cache line back to the
> >> PoC and then invalidating (removing the cache line), also to the PoC.
> >> That means when you look at a typical SBC, this will either be the L2
> >> or the system level cache. Every read afterwards needs to go and pull
> >> it all the way back into L1 to modify it (or not) on the next character
> >> write, and then flush it again.
> >>
> >> Even worse: because of the invalidate, we may even evict it from caches
> >> that the display controller uses to read the frame buffer. So depending
> >> on the SoC's cache topology and implementation, we may force the
> >> display controller to refetch the full FB content on its next screen
> >> refresh cycle.
> >>
> >> I faintly remember that I tried to experiment with CVAC instead, to
> >> only flush without invalidating. I don't fully remember the results
> >> anymore, though. I believe CVAC just behaved identically to CIVAC on
> >> the A53 platform I was working on.
> >> And then I looked at Cortex-A53 errata like [1] and just accepted that
> >> doing anything but restricting the flushing range is a waste of time :)
> >
> > Yuck, I didn't know it was invalidating too. That is horrible. Is there
> > no way to fix it?
>
> Before building all of this damage logic, I tried, but failed. I'd
> welcome anyone else to try again :). I'm not even convinced yet that it
> is actually fixable: depending on the SoC's internal cache logic, it may
> opt to always invalidate, I think.
Wow, that is crazy! How is anyone supposed to make the system run well
with logic like that?!

> That said, this patch set really also makes sense outside of the
> particular invalidate problem. It creates a generic abstraction between
> the copy and non-copy code paths and allows us to reduce the amount of
> work spent in both, generically, for any video sync operation.

Sure... my question was really why it helps so much, given what I
understood the caches to be doing. If they are invalidating, then it is
amazing anything gets done...

Regards,
Simon