Adds NEON unscaled paths for {yuv420p, yuv422p, yuva420p, nv12, nv21}
-> {rgb565le, bgr565le, rgb555le, bgr555le}, extending the 24/32bpp
NEON conversions from 7fab0becab.

Speedup vs C at width=1920 on Apple M1, --bench:

  | input    | rgb565le | bgr565le | rgb555le | bgr555le |
  |----------|----------|----------|----------|----------|
  | yuv420p  | 3.69x    | 3.68x    | 3.28x    | 3.31x    |
  | yuv422p  | 4.70x    | 4.70x    | 4.32x    | 4.35x    |
  | yuva420p | 3.67x    | 3.66x    | 3.32x    | 3.27x    |
  | nv12     | bench    | bench    | bench    | bench    |
  | nv21     | bench    | bench    | bench    | bench    |

NEON cycles are ~48 for planar and ~50.5 for semi-planar across all
four outputs. yuv422p shows the biggest speedup because its C
reference is the most expensive. 555 ratios trail 565 because the C
reference is faster for 555 (one fewer mask bit); NEON cycles are
the same.
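
For reference, the bit layouts behind the four outputs can be sketched
in scalar C. This is an illustration only: the helper names are
hypothetical and not part of the patch, and the sketch truncates each
channel without any dithering, so it makes no claim to be bit-exact
with the C reference or the NEON code.

```c
#include <stdint.h>

/* Hypothetical scalar helpers, not taken from the patch.
 * rgb565: (msb) 5R 6G 5B (lsb) */
static inline uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* rgb555: (msb) 1X 5R 5G 5B (lsb); the top bit is left at zero. */
static inline uint16_t pack_rgb555(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}

/* The bgr variants use the same layout with R and B exchanged. */
static inline uint16_t pack_bgr565(uint8_t r, uint8_t g, uint8_t b)
{
    return pack_rgb565(b, g, r);
}

static inline uint16_t pack_bgr555(uint8_t r, uint8_t g, uint8_t b)
{
    return pack_rgb555(b, g, r);
}
```

The only difference between 565 and 555 is the width of the green
field and the resulting shift of the red field.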

The 16bpp packing uses v8/v9 as accumulators, which clobbers d8/d9.
AAPCS64 requires the callee to preserve the low 64 bits of v8-v15
(d8-d15), so declare_func now wraps an stp/ldp of d8, d9 around the
16bpp paths only, gated by .ifc on the output format. Other paths are
untouched. This is the same pattern as libswscale/aarch64/hscale.S.

Only the little-endian output formats are covered. Apple Silicon is
always LE; a BE follow-up would need one rev16 before the store.
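
As a scalar illustration of what that byte swap amounts to, here is a
minimal C sketch. The function names are hypothetical and do not
appear in the patch; rev16 performs the same exchange on every 16-bit
lane of a vector register at once.

```c
#include <stdint.h>

/* Hypothetical helper: exchange the two bytes of one 16-bit pixel,
 * which is what rev16 does per halfword lane. */
static inline uint16_t swap16(uint16_t px)
{
    return (uint16_t)((px << 8) | (px >> 8));
}

/* Store a pixel as little-endian bytes regardless of host order. */
static inline void store_le16(uint8_t *dst, uint16_t px)
{
    dst[0] = (uint8_t)(px & 0xff);
    dst[1] = (uint8_t)(px >> 8);
}

/* Store a pixel as big-endian bytes regardless of host order. */
static inline void store_be16(uint8_t *dst, uint16_t px)
{
    store_le16(dst, swap16(px));
}
```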

nv12/nv21 are bench-only in checkasm because ff_get_unscaled_swscale
wires the C yuv2rgb fast path only for {YUV420P, YUV422P, YUVA420P}.
The NEON wrappers run (clobber detection + cycle counts) but have no
C reference to compare against. They share pack_rgb16 and the
compute_rgb macro with the verified planar paths, and FATE exercises
them end-to-end.
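
To illustrate why sharing the conversion and packing code is
plausible, note that nv12 and nv21 differ from each other only in the
byte order of the interleaved chroma plane (U first for nv12, V first
for nv21). The scalar C sketch below uses a hypothetical helper that
is not part of the patch; it shows the chroma fetch only and assumes
everything after it is common.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fetch the U and V samples for chroma index i
 * from an interleaved plane. swap_uv is 0 for nv12 (U, V order) and
 * 1 for nv21 (V, U order). */
static inline void load_chroma_semiplanar(const uint8_t *uv, size_t i,
                                          int swap_uv,
                                          uint8_t *u, uint8_t *v)
{
    uint8_t first  = uv[2 * i];
    uint8_t second = uv[2 * i + 1];

    *u = swap_uv ? second : first;
    *v = swap_uv ? first  : second;
}
```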

Tests:
  - checkasm --test=sw_yuv2rgb: 110/110 (was 44/44; +66 from the new
    16bpp outputs across yuv420p/yuv422p/yuva420p plus the new nv12
    and nv21 suites)
  - full checkasm: 7657/7657 (baseline 7589)
  - make fate: clean

DROOdotFOO (1):
  swscale/aarch64: add NEON yuv->rgb16 fast paths

 libswscale/aarch64/swscale_unscaled.c |  47 ++++++++
 libswscale/aarch64/yuv2rgb_neon.S     | 147 ++++++++++++++++++++++++++
 tests/checkasm/sw_yuv2rgb.c           |  13 ++-
 3 files changed, 205 insertions(+), 2 deletions(-)

--
2.50.1 (Apple Git-155)
