Thanks for the feedback.  You are right, I can use VPERMQ to free up a 
register.  I can also remove the PAND mask by doing PSLLD/PSRLD.  That 
eliminates the need for an x86-64 block.
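
For reference, a minimal scalar C sketch of why a shift pair can stand in for the mask. It assumes the field being isolated is the low 10-bit component of a 32-bit v210 word (each word packs three 10-bit components plus 2 padding bits); the function names are hypothetical and this is not the code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Low 10-bit component via an AND mask (per 32-bit lane, what PAND with 0x3FF does). */
static uint32_t low10_mask(uint32_t w)
{
    return w & 0x3FFu;
}

/* Same result with a shift pair (per lane, PSLLD 22 followed by PSRLD 22):
 * the left shift discards the upper 22 bits, the logical right shift
 * moves the surviving 10 bits back down and zero-fills the top. */
static uint32_t low10_shift(uint32_t w)
{
    return (uint32_t)(w << 22) >> 22;
}

/* Hypothetical self-check: both forms agree on a few sample words. */
static void check_low10(void)
{
    const uint32_t samples[] = { 0u, 0x3FFu, 0x400u, 0x3FFFFFFFu,
                                 0xFFFFFFFFu, 0x12345678u };
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(low10_mask(samples[i]) == low10_shift(samples[i]));
}
```

The shift form needs no mask constant, which is what frees the register that the mask would otherwise occupy.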
I tried the naive 'unrolled' version with no permute, and it was much slower, 
about the same as the AVX/SSSE3 code.  VPERMQ/D is a single shuffle uop on port 
5, so it turns out to be useful.
I will submit a new patch with those improvements and the VBROADCASTI128 macro.
I modeled my code on 'v210enc.asm', which could also be updated to use
VBROADCASTI128.
Note, I'm running on Windows and it looks like 'checkasm' performance 
benchmarking is only enabled on Linux.  For my tests I put a 100x loop around 
the 'unpack_frame' call and ran:
ffmpeg.exe -s:v 1920x1080 -vcodec v210 -stream_loop 200 -i OddaView_1920x1080.v210 -f null -y NUL
If there is a better way, let me know...
Thanks,
Mike
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
