Hi, 2016-01-19 13:46 GMT+01:00 John Cox <j...@kynesim.co.uk>: > I've just done a fair bit of work on hevc_cabac decode for the Rasberry > Pi2 and I think that the patch is generally applicable. Patch is > attached but you may prefer to take it from git:
This work is certainly impressive, and most people would have come only with some of the "tricks" you used. Although it already represents quite a bit of work, I echo others' suggestions to have more incremental changes. > I have not yet run fate over it as I haven't yet finished downloading > the samples (the internet connection here isn't wildly fast), but I have > run it against the H265.1 conformance streams on both x86 and ARM and it > causes no regressions. Your patch fails on the later fate tests linked to range extensions (RExt sequences) on Win64. I didn't investigate why. Random thoughts: transform_skip, cross-channel residual, some bypass-coded elements (eg SAO). > 3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though > ctz does) That could be a patch in and by itself. So, referring to your changes, it would be nice to have the following changes split in their own patches: 1) significant coeff flag decoding, which probably is the largest gain (and therefore would be even nicer if further sliced): a) for instance, you avoid an indirection by flattening/merging context tables; b) other parts, which I fear may not translate that well for other platforms (at least without matching x86 code), or sequences 2) you use native sized integers in some places (not sure if that can cause issues); 3) bypass-coded stuff is a fairly large change (both in terms of code, review and impacting the cabac struct also used by h264); it would be nice knowing how much you gain here 4) the replacing of !!something by something when the flag is already 0/1 5) coefficient saturation 3) is indeed the largest chunk. I don't know what your profiling indicated, but the original code didn't seem that high-profile. But I haven't split it to see what it actually provided, but overall numbers look good: I quickly hacked (quickly being the keyword as it also means poor and potentially resulting in faulty conclusion) something that is close to 2) + 4) for reference. Benching REF+1)a) vs REF+1), it did seem slower on Win64/Haswell for significant flag decoding by a few cycles (around 1% of the codeblock) Benching REF+1)a) vs your patch, I see around 3% improvement with something that is fairly more optimized overall than ffmpeg's master, ie ff_hevc_hls_residual_coding is a lot more prevalent, which is probably also the case in your rpi2 benchmarks. Note: I don't think I'll review next iterations of the patch(set) with any shape of diligence, but some of the above parts (1.a, 4 and 5) are ok if not the cause of the fate issues. Best regards, -- Christophe _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel