Hi,
thanks for your review.

> +#ifdef __MACH__
> +#   define MACH
> +#else
> +#   define MACH #
> It is not a good idea to bypass .const_data

Mach-O uses the ".const_data" directive, which is invalid for ELF.
For ELF the equivalent directive is ".section .rodata":

> ELF     .section        .rodata
> MACH    .const_data
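
As a rough illustration of the same per-format choice, here is a small C sketch. It is not part of the patch and uses a different mechanism (one macro holding the directive text rather than line prefixes). The names `RODATA_DIRECTIVE` and `rodata_directive` are made up for this example, and treating every non-Mach-O target as ELF is an assumption.

```c
/* Hypothetical helper, not taken from the patch: pick the read-only
   data directive for the object format being targeted. */
#ifdef __MACH__
#   define RODATA_DIRECTIVE ".const_data"        /* Mach-O */
#else
#   define RODATA_DIRECTIVE ".section .rodata"   /* assumed to be ELF */
#endif

/* Returns the directive text chosen at compile time. */
const char *rodata_directive(void)
{
    return RODATA_DIRECTIVE;
}
```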

> +    ushll           v0.8h, v0.8b, #0
> ...
> +    mul             v16.8h, v0.8h, v24.8h
> Why not MULL?

That would not work for the rest of the computation.
Part of the data in v0 is used again in the next computation,
and I would then have to split each mla into a mull + add.
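
To make the widen-once pattern concrete, here is a scalar C model of the three instructions involved. It is only a sketch. The two-tap `filter_two_rows` function, its coefficients and the broadcast layout are hypothetical and do not reproduce the actual luma_vpp code. Values are assumed to stay within 16 bits.

```c
#include <stdint.h>

/* Scalar model of `ushll vd.8h, vn.8b, #0`:
   zero-extend eight unsigned bytes into eight 16-bit lanes. */
void ushll_8h(int16_t vd[8], const uint8_t vn[8])
{
    for (int i = 0; i < 8; i++)
        vd[i] = (int16_t)vn[i];
}

/* Scalar model of `mul vd.8h, vn.8h, vm.8h`: lane-wise product.
   Results are assumed to fit in 16 bits in this sketch. */
void mul_8h(int16_t vd[8], const int16_t vn[8], const int16_t vm[8])
{
    for (int i = 0; i < 8; i++)
        vd[i] = (int16_t)(vn[i] * vm[i]);
}

/* Scalar model of `mla vd.8h, vn.8h, vm.8h`: vd += vn * vm per lane. */
void mla_8h(int16_t vd[8], const int16_t vn[8], const int16_t vm[8])
{
    for (int i = 0; i < 8; i++)
        vd[i] = (int16_t)(vd[i] + vn[i] * vm[i]);
}

/* Hypothetical two-tap filter: each row is widened once, then used
   by mul/mla on 16-bit lanes. The widened rows w0 and w1 would stay
   available for later steps without widening again. */
void filter_two_rows(int16_t acc[8],
                     const uint8_t row0[8], const uint8_t row1[8],
                     int16_t c0, int16_t c1)
{
    int16_t w0[8], w1[8], k0[8], k1[8];

    ushll_8h(w0, row0);
    ushll_8h(w1, row1);
    for (int i = 0; i < 8; i++) {   /* coefficients broadcast to all lanes */
        k0[i] = c0;
        k1[i] = c1;
    }
    mul_8h(acc, w0, k0);            /* acc  = row0 * c0 */
    mla_8h(acc, w1, k1);            /* acc += row1 * c1 */
}
```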

> +    orr             v0.16b, v1.16b, v1.16b
> This is equal to MOV; I guess the compiler will replace it with the right
> instruction on ARM64

I replaced orr with mov instructions.
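
For reference, the vector mov in A64 is an alias of orr with both source operands set to the same register, so the change affects readability only. A scalar C model of the two forms (the function names are mine, for illustration only):

```c
#include <stdint.h>

/* Scalar model of `orr vd.16b, vn.16b, vm.16b`: byte-wise OR. */
void orr_16b(uint8_t vd[16], const uint8_t vn[16], const uint8_t vm[16])
{
    for (int i = 0; i < 16; i++)
        vd[i] = (uint8_t)(vn[i] | vm[i]);
}

/* Scalar model of `mov vd.16b, vn.16b`: plain register copy. */
void mov_16b(uint8_t vd[16], const uint8_t vn[16])
{
    for (int i = 0; i < 16; i++)
        vd[i] = vn[i];
}
```

Calling `orr_16b(v0, v1, v1)` gives the same result as `mov_16b(v0, v1)`, because x | x == x for every byte.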

> +    // sum row[0-7]
> +    dup             v18.2d, v16.d[1]
> +    dup             v19.2d, v17.d[1]
> +    add             v16.4h, v16.4h, v18.4h
> +    add             v17.4h, v17.4h, v19.4h
> +    trn1            v16.2d, v16.2d, v17.2d
> How about ADDP?

I replaced the five instructions above with the following three, and
performance improved:

    trn1            v20.2d, v16.2d, v17.2d
    trn2            v21.2d, v16.2d, v17.2d
    add             v16.8h, v20.8h, v21.8h
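
As a sanity check that both sequences produce the same v16, here is a scalar C model of each. It is a sketch only: the `vreg` type and the function names are made up for this example, and each register is modelled as eight 16-bit lanes.

```c
#include <stdint.h>

/* A 128-bit vector register viewed as eight 16-bit lanes (.8h). */
typedef struct { int16_t h[8]; } vreg;

/* Original sequence: dup, dup, add (.4h), add (.4h), trn1 (.2d). */
vreg reduce_dup_add(vreg v16, vreg v17)
{
    vreg v18, v19, out;

    for (int i = 0; i < 4; i++) {
        /* dup v18.2d, v16.d[1] / dup v19.2d, v17.d[1]:
           both 64-bit halves receive the upper half of the source */
        v18.h[i] = v18.h[4 + i] = v16.h[4 + i];
        v19.h[i] = v19.h[4 + i] = v17.h[4 + i];
    }
    for (int i = 0; i < 4; i++) {
        /* add vN.4h: low four lanes summed, upper 64 bits cleared */
        v16.h[i] = (int16_t)(v16.h[i] + v18.h[i]);
        v17.h[i] = (int16_t)(v17.h[i] + v19.h[i]);
        v16.h[4 + i] = 0;
        v17.h[4 + i] = 0;
    }
    for (int i = 0; i < 4; i++) {
        /* trn1 v16.2d, v16.2d, v17.2d: {v16.d[0], v17.d[0]} */
        out.h[i]     = v16.h[i];
        out.h[4 + i] = v17.h[i];
    }
    return out;
}

/* Replacement sequence: trn1 (.2d), trn2 (.2d), add (.8h). */
vreg reduce_trn_add(vreg v16, vreg v17)
{
    vreg v20, v21, out;

    for (int i = 0; i < 4; i++) {
        /* trn1 v20.2d, v16.2d, v17.2d: {v16.d[0], v17.d[0]} */
        v20.h[i]     = v16.h[i];
        v20.h[4 + i] = v17.h[i];
        /* trn2 v21.2d, v16.2d, v17.2d: {v16.d[1], v17.d[1]} */
        v21.h[i]     = v16.h[4 + i];
        v21.h[4 + i] = v17.h[4 + i];
    }
    for (int i = 0; i < 8; i++)
        /* add v16.8h, v20.8h, v21.8h */
        out.h[i] = (int16_t)(v20.h[i] + v21.h[i]);
    return out;
}

/* Returns 1 when all eight lanes match. */
int vreg_equal(vreg a, vreg b)
{
    for (int i = 0; i < 8; i++)
        if (a.h[i] != b.h[i])
            return 0;
    return 1;
}
```

In both versions the low four lanes of the result hold the lower half of v16 plus its upper half, and the high four lanes hold the same sum for v17.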

Please see attached the amended patch.

Thanks,
Sebastian

Attachment: 0001-arm64-port-luma_vpp.patch

_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
