On Tue, 22 May 2018 01:18:30 +0100, Mark Thompson <s...@jkqxz.net> wrote:
> On 21/05/18 07:50, Ruiling Song wrote:
> > This filter does HDR(HDR10/HLG) to SDR conversion with tone-mapping.
> >
> > An example command to use this filter with vaapi codecs:
> > FFMPEG -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device \
> >   opencl=ocl@va -hwaccel vaapi -hwaccel_device va -hwaccel_output_format \
> >   vaapi -i INPUT -filter_hw_device ocl -filter_complex \
> >   '[0:v]hwmap,tonemap_opencl=t=bt2020:tonemap=linear:format=p010[x1]; \
> >   [x1]hwmap=derive_device=vaapi:reverse=1' -c:v hevc_vaapi -profile 2 OUTPUT
> >
> > Signed-off-by: Ruiling Song <ruiling.s...@intel.com>
> > ---
>
> I assume you're testing with Beignet for this sort of mapping to work? I
> tried it with Beignet on Coffee Lake with 10-bit videos and it looks
> sensible, though it is rather hard to tell whether it is in some sense
> "correct".
It's also rather hard to define whether it is in some sense "correct". The
methodology employed here is generally based on ITU-R recommendations, but the
ITU-R describes multiple possible ways of doing tone mapping. They also
highlight their own curve function, which we don't use (for
performance/simplicity reasons - IIRC I gave it a try and the result was not
dissimilar enough from the hable function to be worth it, though my memory
could be wrong). There's nothing resembling an official "standard" way to tone
map defined by the ITU-R.

This algorithm is also generally based on the results obtained from the
"official" ACES implementation of HDR->SDR tone mapping (obtainable here:
https://github.com/ampas/aces-dev), with the key difference that we do
chroma-invariant tone mapping, whereas Hollywood tends to use
channel-independent tone mapping. I think the latter distorts the colors too
much for my taste and generally results in a worse-looking image.

The one important requirement for making chroma-invariant tone mapping work
well is a good desaturation algorithm. This one is based on original research
and experimentation. The desaturation strength at a parameter of 1.0 is
comparable to that of the ACES algorithm, although I pick a lower strength by
default (0.5), because I found the full strength too damaging for some types
of sources (particularly bright skies), as a consequence of the
chroma-invariant nature.

In addition to the desaturation step, the other key innovation, which I cannot
find mentioned in ITU-R literature, is the importance of adjusting the overall
average brightness before tone mapping. I suspect the ITU-R doesn't consider
this because it assumes that HDR sources actually follow its standards, which
in practice none seem to do. In theory, HDR material isn't supposed to have a
significantly higher average brightness than SDR material.
Apart from the addition of isolated super-highlights, nothing should have
changed about the image appearance. In practice, HDR directors like to point
their expensive cameras at very bright objects (e.g. the sun) and blind
viewers' eyes by failing to adjust the brightness during the mastering step.
Our algorithm compensates for this by essentially "correcting" the bad
mastering in real time. [1] Of course, the result is not as good as a human
doing it ahead of time, but unfortunately we don't have a say in this matter.

As long as the implementation is correct, I'd be confident in saying that it
produces pleasing results for all the sources I've thrown at it, often even
exceeding in quality the "official" SDR-mapped Blu-ray versions of the same
sources on the same scenes (partially due to the preserved wider color gamut).
To ascertain whether the implementation is correct, you could compare its
results to those obtained by the latest `mpv` (might need git master) or
`libplacebo`, both of which implement the same algorithm.

[1] The algorithm I use in mpv and libplacebo does this with one frame of
latency, because I don't want to round-trip through an intermediate buffer in
my processing chain, and there's no other way to communicate the measured
frame statistics back to the rest of the kernels in OpenGL/Vulkan land. I do
this because of my real-time requirements as well as the structure of my
processing chain. Since you are implementing an offline filter and neither of
these restrictions applies to you, I would recommend separating the peak
measurement step from the tone mapping step, so that the former completes
first and the latter can then use the statistics computed for the same frame.
If you don't do this, you run the risk of failing to tone map single-frame
inputs (e.g. screenshots), because no data about a previous frame is available
at the time.
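As a sketch of that two-pass structure (illustrative C; a peak-normalized
Reinhard curve stands in for the real tone mapping kernel, and the 0.25
average-brightness target is a made-up number, not the filter's):

```c
#include <stddef.h>

/* Pass 1: measure this frame's peak and average signal. */
static void measure_frame(const float *sig, size_t n,
                          float *out_peak, float *out_avg)
{
    float peak = 0.0f;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (sig[i] > peak)
            peak = sig[i];
        sum += sig[i];
    }
    *out_peak = peak;
    *out_avg  = (float)(sum / (double)n);
}

/* Pass 2: tone map using statistics measured from the *same* frame,
 * so even a single frame (e.g. a screenshot) is handled correctly. */
static void tonemap_frame(float *sig, size_t n, float peak, float avg)
{
    /* correct overly bright mastering by pre-scaling the average down
       toward a nominal SDR-ish level (0.25 is illustrative) */
    float scale = avg > 0.25f ? 0.25f / avg : 1.0f;
    float p = peak * scale;
    for (size_t i = 0; i < n; i++) {
        float x = sig[i] * scale;
        /* peak-normalized Reinhard: maps p -> 1.0 */
        sig[i] = x * (1.0f + x / (p * p)) / (1.0f + x);
    }
}
```

The point is only the control flow: pass 2 never starts before pass 1 has
finished, so no previous-frame fallback is needed.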
> > +// detect peak/average signal of a frame, the algorithm was ported from:
> > +// libplacebo (https://github.com/haasn/libplacebo)
> > +struct detection_result
> > +detect_peak_avg(global uint *util_buf, __local uint *sum_wg,
> > +                float signal, float peak) {
> > +    global uint *avg_buf = util_buf;
> > +    global uint *peak_buf = avg_buf + DETECTION_FRAMES + 1;
> > +    global uint *counter_wg_p = peak_buf + DETECTION_FRAMES + 1;
> > +    global uint *max_total_p = counter_wg_p + 1;
> > +    global uint *avg_total_p = max_total_p + 1;
> > +    global uint *frame_idx_p = avg_total_p + 1;
> > +    global uint *scene_frame_num_p = frame_idx_p + 1;
> > +
> > +    uint frame_idx = *frame_idx_p;
> > +    uint scene_frame_num = *scene_frame_num_p;
> > +
> > +    size_t lidx = get_local_id(0);
> > +    size_t lidy = get_local_id(1);
> > +    size_t lsizex = get_local_size(0);
> > +    size_t lsizey = get_local_size(1);
> > +    uint num_wg = get_num_groups(0) * get_num_groups(1);
> > +    size_t group_idx = get_group_id(0);
> > +    size_t group_idy = get_group_id(1);
> > +    struct detection_result r = {peak, sdr_avg};
> > +    *sum_wg = 0;
>
> This is technically a data race - maybe set it in only the first workitem?

I'm not sure where the data race is here. There's a barrier immediately below
it, which ensures that all of the *sum_wg writes must complete before
progressing further, no? So even though all of the threads conflict in their
write to *sum_wg, they all write the same thing and wait for each other before
continuing.

> > +    barrier(CLK_LOCAL_MEM_FENCE);
> > +
> > +    // update workgroup sum
> > +    atomic_add(sum_wg, (uint)(signal * REFERENCE_WHITE));
>
> I think the numbers you're adding together here sum to at most something like
> 16 * 16 * 100 * 1023? Can you make sure this can't overflow and add a
> comment on that.

It's not * 1023; the highest possible peak in practice is 100.0 (PQ's peak
brightness).
So the limit per workgroup is 16 * 16 * 10000, requiring 22 bits to avoid
overflow on a pathological input.

> > +    barrier(CLK_LOCAL_MEM_FENCE);
> > +
> > +    // update frame peak/avg using work-group-average.
> > +    if (lidx == 0 && lidy == 0) {
> > +        uint avg_wg = *sum_wg / (lsizex * lsizey);
> > +        atomic_max(&peak_buf[frame_idx], avg_wg);
> > +        atomic_add(&avg_buf[frame_idx], avg_wg);
>
> Similarly this one? (width/16 * height/16 * 100 * 1023, I think, which might
> overflow for 8K?)

For 8K it's 8192/16 * 4320/16 * 10000, requiring 31 bits to store without
theoretical risk of overflow.

There is actually a third source of overflow worth investigating, namely the
*avg_total_p variable, since it accumulates across frames. It stores a value
of up to 10000 * (DETECTION_FRAMES + 1). In practice, however, this shouldn't
cause any issues for typical buffer sizes, needing only 20 bits for a buffer
size of 100.

Note: in practice, none of these considerations are really worth worrying
about, since the average illumination of a scene is generally at most around
50, so a typical scene needs more like 23 bits rather than the 31-bit worst
case calculated above. The only scenario in which I could imagine such a worst
case occurring in normal content is if some mastering engineer mistakenly
implemented a "fade to white" by fading to the highest possible HDR peak, and
this somehow survived review by other humans, whose presumably functioning
retinas would be screaming in pain as their displays blasted 10000 cd/m²
during the fade.

> > +    // de-saturate
> > +    if (desat_param > 0.0f) {
> > +        float luma = get_luma_dst(rgb);
> > +        float base = 0.18f * dst_peak;
>
> Magic number might want some explaination.

It is derived from experimentation and visual comparisons with e.g. the ACES
algorithm. There is no theoretical basis for it.
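The bit-width arithmetic above is easy to double-check mechanically; a quick C
sketch (workgroup/frame sizes and the 10000 signal ceiling as discussed above;
a DETECTION_FRAMES of 100 corresponds to the "buffer size of 100" example):

```c
#include <stdint.h>

/* smallest number of bits that can represent v */
static unsigned bits_needed(uint64_t v)
{
    unsigned bits = 0;
    while (v) {
        bits++;
        v >>= 1;
    }
    return bits;
}

enum {
    MAX_SIGNAL = 10000,                     /* 100.0 (PQ peak) * REFERENCE_WHITE */
    WG_PIXELS  = 16 * 16,                   /* pixels per 16x16 workgroup */
    WG_8K      = (8192 / 16) * (4320 / 16), /* workgroups in an 8K frame */
    DETECTION_FRAMES = 100,                 /* example averaging buffer size */
};
```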
> +float3 ootf_hlg(float3 c) {
> +    float luma = get_luma_src(c);
> +    // assume a reference display with 1000 nits peak
> +    float factor = 1000.0f / REFERENCE_WHITE * pow(luma, 0.2f) / pow(12.0f, 1.2f);
> +    return c * factor;
> +}
> +
> +float3 inverse_ootf_hlg(float3 c) {
> +    // assume a reference display with 1000 nits peak
> +    c *= pow(12.0f, 1.2f) / (1000.0f / REFERENCE_WHITE);
> +    c /= pow(get_luma_dst(c), 0.2f / 1.2f);
> +    return c;
> +}

I would recommend parametrizing these by the peak variable. When you tone map
from HLG to HLG at a lower peak, the inverse OOTF call needs to use the new
peak. (You should also update the peak tagging in the frame's side-channel
data; I'm not sure if you do.)

Ditto for the forward OOTF: the `peak` needs to match the value you assume for
the source signal peak down below. You have it hard-coded as 12.0 for HLG,
which is the correct peak in scene-referred space, but that doesn't
necessarily match the display-referred case, which is what's relevant for tone
mapping. If you tune the OOTF for a 1000 nits peak display, the source peak
after applying the OOTF would be 10.0, not 12.0. Alternatively, you could tune
the OOTF for 1200 nits instead. (This is what libplacebo does, although I
think not intentionally. I'll change it to use 1000 nits as well.)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel