(I'm sending this as an email as lowest common denominator but feel an issue on 
the color-and-hdr repo would be a better interface for productive discussion. 
Please pop over to https://gitlab.freedesktop.org/pq/color-and-hdr/-/issues/10 
if you agree. Hopefully we can drive the discussion there but if there is a 
strong preference for email that works as well. :) )

I've wanted to start a thread to discuss the use of PWL APIs that were 
introduced by Uma a year ago and for which Bhanuprakash provided IGT tests. I 
have come to like the API but as we're getting closer to a real-world use of it 
I have a few questions and comments. As with a lot of complex APIs the devil is 
in the details. Some of those details are currently underspecified, or 
underdocumented and it's important that we all interpret the API the same way.

**The API**

The original patches posted by Uma:
https://patchwork.freedesktop.org/series/90822/
https://patchwork.freedesktop.org/series/90826/

The IGT tests for PWL API:
https://patchwork.freedesktop.org/series/96895/

I've rebased the kernel patches on a slightly more recent kernel, along with an 
AMD implementation:
https://gitlab.freedesktop.org/hwentland/linux/-/tree/color-and-hdr

I've also rebased them on an IGT tree, but that's not too up-to-date:
https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/tree/color-and-hdr


**Why do I like the API?**

In order to allow HW composition of HDR planes in linear space we need the 
ability to program at least a per-CRTC regamma (inv_EOTF) to go from linear to 
wire format post-blending. Since userspace might want to apply corrections on 
top of a simple transfer function (such as PQ, BT.709, etc.) it would like a 
way to set a custom LUT.

The existing CRTC gamma LUT defines equally spaced entries. As Pekka shows in 
[1] equally-spaced LUTs have unacceptable error for regamma/inv_EOTF. Hence we 
need finer granularity of our entries near zero while coarse granularity works 
fine toward the brighter values.

[1] https://gitlab.freedesktop.org/pq/color-and-hdr/-/merge_requests/9

HW (at least AMD and Intel HW) implements this ability as segmented piece-wise 
linear LUTs. These define segments of equally spaced entries. These segments 
are constrained by the HW implementation. I like how the PWL API allows 
different drivers to specify the constraints imposed by different HW while 
allowing userspace a generic way of parsing the PWL. This also avoids complex 
calculations in the kernel driver, which might be required for other APIs one 
could envision. If anyone likes I can elaborate on some ideas for an alternate 
API, though all of them will require non-trivial transformations by the kernel 
driver in order to program them to HW.


**Nitpicks**

The API defines everything inside the segments, including flags and values that 
apply to the entire PWL, such as DRM_MODE_LUT_GAMMA, 
DRM_MODE_LUT_REFLECT_NEGATIVE, input_bpc, and output_bpc. If these don't stay 
constant for segments it might complicate the interpretation of segments. I 
suggest we consider these as effectively applying to the entire PWL. We could 
encode them in an overall drm_color_lut struct that includes an array of 
drm_color_lut_range but that's probably not necessary, hence why I called this 
out as a nitpick. I would just like us to be aware of this ambiguity and 
document that these values applies to the entire PWL.


**How to read the PWL**

Let me first give a summary for how this LUT is used in userspace. If you're 
familiar with this please review and comment if I got things wrong. As I 
mentioned, a lot of this is underspecified at the moment so you're reading my 
interpretation.

You can see this behavior in plane_degamma_test [2] in the kms_color.c IGT test 
suite. I suggest the plane_degamma_test here here instead of the 
test_pipe_gamma test as the latter still has Intelisms (assumptions around 
Intel driver/HW behavior) and will not work for other drivers.

Iterate over all enums in PLANE_DEGAMMA_MODE and find a suitable one. How do we 
find the suitable one? More on that below.

Once we have the right PLANE_DEGAMMA_MODE we read the blob for the blob ID 
associated with the PLANE_DEGAMMA_MODE enum. We interpret the blob as an array 
of drm_color_lut_range. See get_segment_data [3].

We can think of our LUT/PWL as f(x) = y. For a traditional equally spaced LUT 
with 1024 entries x would be 0, 1, 2, ..., 1023. For a PWL LUT we need to parse 
the segment data provided in drm_color_lut_range.

Let's look at the 2nd-last entry of the nonlinear_pwl definition for the AMD 
driver [4] (I've correct it here and dropped the DRM_MODE_LUT_REUSE_LAST but 
it's still incorrect in the link) and simplify it to 4 entries for sake of 
readability:

{
        .flags = (DRM_MODE_LUT_GAMMA | DRM_MODE_LUT_REFLECT_NEGATIVE | 
DRM_MODE_LUT_INTERPOLATE | DRM_MODE_LUT_NON_DECREASING),
        .count = 4,
        .input_bpc = 13, .output_bpc = 18,
        .start = 1 << 12, .end = 1 << 13,
        .min = 0, .max = 1UL << 35
    },

We see we have 16 entries in the region from (1 << 12) to (1 << 13). We see 
input_bpc is 13, so our 1.0 float value is 1 << 13.

(Is it sensible to use input_bpc as our 1.0 floating point reference? Why?)

Since this segment is not reusing the last entry (doesn't have 
DRM_MODE_LUT_REUSE_LAST) we divide the region between 1 << 12 and 1 << 13 into 
4 equally spaced sections.

step_size = (segment->end - segment->start) / count

In our case our segment spans from 4096 to 8192 and yields a step_size of 1024.

Note that we need to calculate this in floating point, otherwise we're not 
guaranteed equal spacing.

This gives us these X entries for this particular segment:
4096, 5120, 6144, 7168

And normalized to 1 << 13 (input_bpc) for our 1.0 float value we get
0.5, 0.625, 0.75, 0.875

If the segment had the REUSE_LAST flag the values would look like
4096, 5461, 6826, 8192

and normalized to
0.5, 0.666626, 0.833252, 1

Though in the case of REUSE_LAST a more sensible definition might be with a 
count of 5 entries in this segment, instead of 16. But ultimatly that's up to 
the driver.

I attached a simple C program that parses a PWL and helps illustrate my 
interpretation. You'll just need to copy-paste or include your PWL definition 
(instead of color_gamma.h), and point the PWL define to your PWL variable. To 
build it you'll need to copy the drm-uapi folder from my IGT repo into the same 
directory as pwl-parser.c and then just build it with "gcc -Idrm-uapi 
pwl-parser.c".

The above entries come from running pwl-parser with the nonlinear_pwl from [5].

To illustrate these I added versions of your lut1d_error scripts that run on 
the SDR and HDR PWLs for AMD HW [6].

We should probably document the above in detail in the DRM/KMS API docs.

[2] 
https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/blob/color-and-hdr/tests/kms_color.c#L978
[3] 
https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/blob/color-and-hdr/tests/kms_color_helper.c#L393
[4] 
https://gitlab.freedesktop.org/hwentland/linux/-/blob/color-and-hdr/drivers/gpu/drm/amd/display/modules/color/color_gamma.h#L109
[5] 
https://gitlab.freedesktop.org/hwentland/linux/-/blob/color-and-hdr/drivers/gpu/drm/amd/display/modules/color/color_gamma.h#L38
[6] 
https://gitlab.freedesktop.org/hwentland/color-and-hdr/-/tree/precision/octave

**How to provide PWL entries?**

To provide the PWL values for each entry we'll have to sample our (custom) 
curve at the respective points specified by our PWL entries and providing those 
samples in a blob that is passed to PLANE_DEGAMMA_LUT. It's not much different 
from how we provide values for an equally spaced (legacy) LUT. As for sampling 
our curve, I remember seeing that Weston uses an LCMS function to sample the 
curve at required points. Last I checked it samples the curve at evenly spaced 
intervals but the LCMS function seemed to provide arbitrary sampling.


**How to pick a suitable PWL definition?**

Picking the right PWL definition out of the respective \_MODE enum isn't 
trivial. Without further information a userspace implementer has to understand 
the distribution of entries in all segments and perform a bunch of math to 
estimate the error for given curves.

A simpler approach might be if we defined common naming for our PWL enums. We 
can define the commonly expected cases. The two important parameters are 
within-range vs overrange entries, and linear vs non-linear outputs.

Within-range PWLs would cover [0.0, 1.0] entries and overrange [0.0, 128.0] to 
cover PQ and probably anything else. One could think of the within-range PWL as 
intended for SDR content and the over-range PWL for HDR content.

Linear PWLs would be intended for linear luminance outputs (or near-linear), 
and can be represented by equally spaced LUTs, such as a single-segment 
definitions. Non-linear PWLs would be intended for linear to non-linear 
transforms; Linear PWLs for non-linear to linear transforms or OOTFs.

This gives us four enums, plus one for bypass:
DRM_MODE_LUT_BYPASS
DRM_MODE_LUT_LINEAR_SDR
DRM_MODE_LUT_NONLINEAR_SDR
DRM_MODE_LUT_LINEAR_HDR
DRM_MODE_LUT_NONLINEAR_HDR

Drivers can provide appropriate PWLs based on the HW caps. Userspace can pick 
an appropriate one if it's available or fall back to either a sub-optimal PWL 
or to using a GPU transform instead.

Thoughts?

Thanks,
Harry
/*
 * Copyright 2022 Advanced Micro Devices, Inc.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
 * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
 * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
 * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 *
 * Authors: AMD
 *
 */

#define BIT(x) (1 << x)
#include "drm_mode.h"
#include "color_gamma.h"
#include <stdio.h>

#define PWL nonlinear_hdr_pwl

#define MAX_ENTRIES_PER_SEGMENT 512
#define MAX_ENTRIES 4096

int main(int argc, char *argv[])
{
	__u64 entries[MAX_ENTRIES];
	unsigned pwl_size = 0;
	unsigned segment_size = 0;
	unsigned segment_count = 0;
	unsigned i = 0;
	unsigned j = 0;
	unsigned overall_count = 0;
	__u64 norm = 0;

	pwl_size = sizeof(PWL);
	segment_size = sizeof(PWL[0]);
	segment_count = pwl_size / segment_size;

	printf("pwl_size = %d, segment_size = %d, segment_count = %d\n",
	       pwl_size, segment_size, segment_count);

	for (i = 0; i < segment_count; i++) {
		const struct drm_color_lut_range *segment = &PWL[i];
		__u64 segment_entries[MAX_ENTRIES_PER_SEGMENT];
		unsigned count = 0;
		__u64 step_size = 0;

		printf("Segment %d\n", i);
		printf("\t.flags = 0x%x\n", segment->flags);
		printf("\t.count = %d\n", segment->count);
		printf("\t.input_bpc = %d, .output_bpc = %d\n", segment->input_bpc, segment->output_bpc);
		printf("\t.start = 0x%lx, .end = %lx\n", segment->start, segment->end);
		printf("\t.min = 0x%lx, .max = %lx\n", segment->min, segment->max);

		norm = segment->input_bpc;

		if (segment->count == 1) {
			printf("\t- steps = %d\n", segment->start);
			printf("\t- normalized steps = %g\n", segment->start / ((1 << segment->input_bpc) + 0.0));
			entries[overall_count++] = segment->start;
			continue;
		}

		if (segment->flags & DRM_MODE_LUT_REUSE_LAST) {
			/*
			 * Entries are evenly spaced in [start, end]
			 * including "end" as an entry.
			 * 
			 * This is the same as count -1 entries when
			 * REUSE_LAST isn't used, with "end" added
			 * as the additional entry
			 */
			count = segment->count - 1;
		} else {
			/*
			 * entries are evenly spaced in [start, end]
			 * excluding "end" as an entry
			 * 
			 * this will look the same as a REUSE_LAST
			 * distribution of count + 1 with the last
			 * entry omitted
			 */
			count = segment->count;
		}

		step_size = (segment->end - segment->start) / (count);

		printf("\t- step_size count 0x%x, step_size 0x%x\n", count, step_size);

		for (j = 0; j < count; j++) {
			segment_entries[j] = segment->start + (j * step_size);
			entries[overall_count++] = segment_entries[j];
		}

		if (segment->flags & DRM_MODE_LUT_REUSE_LAST) {
			segment_entries[j] = segment->end;
			entries[overall_count++] = segment_entries[j];
		}

		/* print all entries */

		count = (segment->flags & DRM_MODE_LUT_REUSE_LAST) ? count + 1 : count;

		printf("\t- steps = ");
		for (j = 0; j < count; j++) {
			printf("%ld", segment_entries[j]);
			if (j < count-1)
				printf(", ");
		}
		printf("\n");

		printf("\t- normalized steps = ");
		for (j = 0; j < count; j++) {
			printf("%g", segment_entries[j] / ((1 << segment->input_bpc) + 0.0));
			if (j < count-1)
				printf(", ");
		}
		printf("\n");

	}

	printf("steps = ");
	for (i = 0; i < overall_count; i++) {
		printf("%ld", entries[i]);
		if (i < overall_count-1)
			printf(", ");
	}
	printf("\n");

	printf("normalized steps = ");
	for (i = 0; i < overall_count; i++) {
		printf("%g", entries[i] / ((1 << norm) + 0.0));
		if (i < overall_count-1)
			printf(", ");
	}
	printf("\n");
	printf("%d entries\n", overall_count);


	return 0;
}

Reply via email to