On Sat, 16 Jan 2021, Reimar Döffinger wrote:

On 15 Jan 2021, at 23:55, Martin Storsjö <mar...@martin.st> wrote:

On Tue, 12 Jan 2021, reimar.doeffin...@gmx.de wrote:

create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S
create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c

This patch fails checkasm

Fixed, one mis-translated coefficient index...

Ok, good.


+.macro fixsqrshrn d, dt, n, m
+  .ifc \dt, .8H
+        sqrshrn2        \d\dt, \n\().4S, \m
+  .else
+        sqrshrn         \n\().4H, \n\().4S, \m
+        mov             \d\().D[0], \n\().D[0]
+  .endif
+.endm

Is this to set the lower half of the dest register, without wiping the upper 
part? It looks a bit clumsy (and stall-prone on in-order cores), but I guess 
it's not easy to do things differently without rewriting the higher level 
structure?

Did not have a good idea, but I was also aiming for keeping the structure of 
the 32-bit and 64-bit code similar.
In particular since I did not know about checkasm and expected some further 
painful debug.


+.macro transpose_8x8 r0, r1, r2, r3, r4, r5, r6, r7
+        transpose8_4x4  \r0, \r1, \r2, \r3
+        transpose8_4x4  \r4, \r5, \r6, \r7
+.endm
+

There's a bunch of existing transposes in libavcodec/aarch64/neon.S - can they 
be used? Not that they're rocket science, though... And I see that the existing 
arm version also has got its own transpose macros.

If it's inconvenient to use shared macros, this is fine.

They are different and seem to not be documented, so it would take some
time to figure out how to replace them.
There’s also a bit of a question if I’d want to give up alignment
with the 32-bit code just yet.

Sure - one doesn't have to give up on matching the overall structure, but it doesn't necessarily need to stay 1:1 on the individual instruction level, if things are done more efficiently in one way or another on the other architecture. Not saying that these are more or less efficient; a transpose usually is the same anyway, but the aarch64 macros take a number of more temp regs as arguments.


+        // Transpose each 4x4 block, and swap how d4-d7 and d8-d11 are used.
+        // Layout before:
+        // d0  d1
+        // d2  d3
+        // d4  d5
+        // d6  d7
+        // d8  d9
+        // d10 d11
+        // d12 d13
+        // d14 d15

These layouts don't look like they're up to date for the aarch64 version?

Removed in new version as it seems not that useful.


+        vst1.s32        {\in7}, [r3, :128]
+.endm

This is left behind untranslated (and thus unused)

Removed. I am not sure if that means it’s also unused in the 32-bit version.

Possibly - a patch for removing that, if it is, would be good too.

+
+        movrel          x1, trans + 16

The movrel macro has got a separate third optional argument for the offset, so write this 
"movrel x1, trans, 16". (Older versions of llvm are picky with the symbol 
offset syntax and break with this, as the macro right now adds its own implicit +(0) 
here. If you pass the offset in the macro parameter, all the offsets get passed within 
the parentheses.

Changed.

+.macro idct_16x16 bitdepth
+function ff_hevc_idct_16x16_\bitdepth\()_neon, export=1
+//r0 - coeffs
+        mov             x15, lr
+

Binutils doesn't recognize "lr" as alias for x30

It didn’t have an issue in the Debian unstable VM?
That seems like the kind of workaround where it would be
better to leave a comment with more info, if you know
what exactly is affected.

Binutils 2.28 doesn't recognize "lr" while 2.30 does, it seems.

FWIW, all the existing aarch64 assembly just uses "x30" to refer to this register, none of it uses "lr".

// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".

Reply via email to