On 2023-11-04 19:01, Christopher Braga wrote:
> Just want to loop back to before we branched off deeper into the programming 
> performance talk
> On 10/26/2023 3:25 PM, Alex Goins wrote:
>> On Thu, 26 Oct 2023, Sebastian Wick wrote:
>>> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote:
>>>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT)
>>>> Alex Goins <ago...@nvidia.com> wrote:
>>>>> Thank you Harry and all other contributors for your work on this.
>>>>> Responses inline -
>>>>> On Mon, 23 Oct 2023, Pekka Paalanen wrote:
>>>>>> On Fri, 20 Oct 2023 11:23:28 -0400
>>>>>> Harry Wentland <harry.wentl...@amd.com> wrote:
>>>>>>> On 2023-10-20 10:57, Pekka Paalanen wrote:
>>>>>>>> On Fri, 20 Oct 2023 16:22:56 +0200
>>>>>>>> Sebastian Wick <sebastian.w...@redhat.com> wrote:
>>>>>>>>> Thanks for continuing to work on this!
>>>>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote:


>>>>> Actually, the current examples in the proposal don't include a
>>>>> multiplier color op, which might be useful. For AMD as above, but also
>>>>> for NVIDIA as the following issue arises:
>>>>> As discussed further below, the NVIDIA "degamma" LUT performs an
>>>>> implicit fixed
> If possible, let's declare this as two blocks. One that informatively 
> declares the conversion is present, and another for the de-gamma. This will 
> help with block-reuse between vendors.
>>>>> point to FP16 conversion. In that conversion, what fixed point
>>>>> 0xFFFFFFFF maps to in floating point varies depending on the source
>>>>> content. If it's SDR content, we want the max value in FP16 to be 1.0
>>>>> (80 nits), subject to a potential boost multiplier if we want SDR
>>>>> content to be brighter. If it's HDR PQ content, we want the max value
>>>>> in FP16 to be 125.0 (10,000 nits). My assumption is that this is also
>>>>> what AMD's "HDR Multiplier" stage is used for, is that correct?
>>>> It would be against the UAPI design principles to tag content as HDR or
>>>> SDR. What you can do instead is to expose a colorop with a multiplier of
>>>> 1.0 or 125.0 to match your hardware behaviour, then tell your hardware
>>>> that the input is SDR or HDR to get the expected multiplier. You will
>>>> never know what the content actually is, anyway.
>> Right, I didn't mean to suggest that we should tag content as HDR or SDR in
>> the UAPI, just relating to the end result in the pipe, ultimately it would be
>> determined by the multiplier color op.
> A multiplier could work but we should give OEMs the option to either 
> make it "informative" and fixed by the hardware, or fully configurable. With 
> the Qualcomm pipeline how we absorb FP16 pixel buffers, as well as how we 
> convert them to fixed point data actually has a dependency on the desired 
> de-gamma and gamma processing. So for an example:
> If a source pixel buffer is scRGB encoded FP16 content we would expect input 
> pixel content to be up to 7.5, with the IGC output reaching 125 as in the 
> NVIDIA case. Likewise gamma 2.2 encoded FP16 content would be 0-1 in and 0-1 
> out.
> So in the Qualcomm case the expectations are fixed depending on the use case.
> It is sounding to me like we would need to be able to declare three things 
> here:
> 1. Value range expectations *into* the de-gamma block. A multiplier wouldn't 
> work here because it would be more of a clipping operation. I guess we would 
> have to add an explicit clamping block as well.
> 2. Value range expectations at the *output* of the de-gamma processing 
> block. Also covered by using another multiplier block.
> 3. Value range expectations *into* a gamma processing block. This should be 
> covered by declaring a multiplier post-csc, but only assuming CSC output is 
> normalized in the desired value range. A clamping block would be preferable 
> because it describes what happens when it isn't.

What about adding informational input and output range properties
to colorops? I think Intel's PWL definitions had something like
that, but I'd have to take a look at that again. While I'm not
in favor of defining segmented LUTs in the uAPI, the input/output
ranges seem to be something of value.
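
As a rough sketch of what I mean, and purely as an assumption (none of
these names exist in the proposal or in any uAPI today), each colorop
could carry an informational range that user space checks when walking
a pipeline:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, illustration only: not part of any existing uAPI. */
struct colorop_range {
	double in_min, in_max;   /* values the op expects at its input */
	double out_min, out_max; /* values the op can produce at its output */
};

/*
 * Returns true when every op's output range fits inside the next op's
 * input range, i.e. nothing would be implicitly clipped between ops.
 */
bool pipeline_ranges_compatible(const struct colorop_range *ops, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		if (ops[i].out_min < ops[i + 1].in_min ||
		    ops[i].out_max > ops[i + 1].in_max)
			return false;
	}
	return true;
}
```

A de-gamma op producing [0.0, 125.0] followed by an op that only accepts
[0.0, 1.0] would fail this check, which is the kind of mismatch the
clamping and multiplier blocks above are trying to describe.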

> All this is do-able, but it seems like it would require the definition of 
> multiple color pipelines to expose the different limitations for color block 
> configuration combinations. Additionally, would it be easy for user space to 
> find the right pipeline?

I'm also a little concerned that some of these proposals mean we'd
have to expose an inordinate number of color pipelines, and that color
pipeline selection would become difficult and error-prone.


>>>> Given that elements like various kinds of look-up tables inherently
>>>> assume that the domain is [0.0, 1.0] (because it is a table that
>>>> has a beginning and an end, and the usual convention is that the
>>>> beginning is zero and the end is one), I think it is best to stick to
>>>> the [0.0, 1.0] range where possible. If we go out of that range, then
>>>> we have to define how a LUT would apply in a sensible way.
>> In my last reply I mentioned a static (but actually programmable) LUT that is
>> typically used to convert FP16 linear pixels to fixed point PQ before handing
>> them to the scaler and tone mapping operator. You're actually right that it
>> indexes in the fixed point [0.0, 1.0] range for the reasons you describe, but
>> because the input pixels are expected to be FP16 in the [0.0, 125.0] range,
>> it applies a non-programmable 1/125.0 normalization factor first.
>> In this case, you could think of the LUT as indexing on [0.0, 125.0], but as
>> you point out there would need to be some way to describe that. Maybe we
>> actually need a fractional multiplier / divider color op. NVIDIA pipes that
>> include this LUT would need to include a mandatory 1/125.0 factor
>> immediately prior to the LUT, then LUT can continue assuming a range of
>> [0.0, 1.0].
>> Assuming you are using the hardware in a conventional way, specifying a
>> multiplier of 1.0 after the "degamma" LUT would then map to the 80-nit PQ
>> range after the static (but actually programmable) PQ LUT, whereas
>> specifying a multiplier of 125.0 would map to the 10,000-nit PQ range, which
>> is what we want.
>> I guess it's kind of messy, but the effect would be that color ops other than
>> multipliers/dividers would still be in the [0.0, 1.0] domain, and any
>> multiplier that exceeds that range would have to be normalized by a divider
>> before any other color op.
> Hmm. A multiplier would resolve issues when input linear FP16 data has 
> different ideas on what 1.0 means with regard to nits values (think of Apple's 
> EDR as an example). For a client to go from their definition to hardware 
> definition of 1.0 = x nits, we would need to expose what the pipeline sees as 
> 1.0 though. So in this case the multiplier would be programmable, but the 
> divisor is informational? It seems like the latter would have an influence on 
> how the former is programmed.

A programmable multiplier would either need to be backed by a HW block
to perform the operation or require a driver to scale the LUT or matrix
values of an adjacent LUT or matrix block.
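
To illustrate the second option, here is a minimal sketch of a driver
folding the multiplier into a neighbouring block. The helper names are
made up for this mail, and it assumes the multiplier sits next to a 3x3
matrix or directly after a 1D LUT, and that the hardware can actually
store the scaled coefficients or entries:

```c
#include <stddef.h>

/*
 * Hypothetical helper, illustration only: fold a scalar multiplier into
 * a 3x3 matrix so that M' * v == multiplier * (M * v).
 */
void fold_multiplier_into_matrix(double m[3][3], double multiplier)
{
	for (size_t r = 0; r < 3; r++)
		for (size_t c = 0; c < 3; c++)
			m[r][c] *= multiplier;
}

/*
 * Same idea for a multiplier that follows a 1D LUT: scale the LUT's
 * output entries. A multiplier placed *before* a LUT cannot be folded
 * this way because it changes how the LUT is indexed.
 */
void fold_multiplier_into_lut(double *lut, size_t n, double multiplier)
{
	for (size_t i = 0; i < n; i++)
		lut[i] *= multiplier;
}
```

With a multiplier of 125.0, an identity matrix becomes 125.0 on the
diagonal, matching the 1.0 (80 nits) to 125.0 (10,000 nits) scaling
discussed above.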


>>>>>> Yeah, this is why we need a definition. I understand "informational" to
>>>>>> not change pixel values in any way. Previously I had some weird idea
>>>>>> that scaling doesn't alter color, but of course it may.
>>>>> On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do
>>>>> implicit fixed-point to FP16 conversions, and vice versa.
>>>> Above, I claimed that the UAPI should be defined in nominal
>>>> floating-point values, but I wonder, would that work? Would we need to
>>>> have explicit colorops for converting from raw pixel data values into
>>>> nominal floating-point in the UAPI?
>> Yeah, I think something like that is needed, or another solution as discussed
>> below. Even if we define the UAPI in terms of floating point, the actual
>> underlying pixel format needs to match the expectations of each stage as it
>> flows through the pipe.
> Strongly agree on this. Pixel format and block relationships definitely exist.

Interesting to see this isn't just an AMD thing. :)


>>>> Both blending and scaling are fundamentally the same operation: you
>>>> have two or more source colors (pixels), and you want to compute a
>>>> weighted average of them following what happens in nature, that is,
>>>> physics, as that is what humans are used to.
>>>> Both blending and scaling will suffer from the same problems if the
>>>> operation is performed on not light-linear values. The result of the
>>>> weighted average does not correspond to physics.
>>>> The problem may be hard to observe with natural imagery, but Josh's
>>>> example shows it very clearly. Maybe that effect is sometimes useful
>>>> for some imagery in some use cases, but it is still an accidental
>>>> side-effect. You might get even better results if you don't rely on
>>>> accidental side-effects but design a separate operation for the exact
>>>> goal you have.
>>>> Mind, by scaling we mean changing image size. Not scaling color values.
>> Fair enough, but it might not always be a choice given the hardware.
> Agreeing with Alex here. I get there is some debate over the best way to do 
> this, but I think it is best to leave it up to the driver to declare how that 
> is done.



>>>> What I was left puzzled about after the XDC workshop is whether it is
>>>> possible to pre-load configurations in the background (slow), and then
>>>> quickly switch between them. Hardware-wise, I mean.
>> This works fine for our "fast" LUTs, you just point them to a surface in
>> video memory and they flip to it. You could keep multiple surfaces around and
>> flip between them without having to reprogram them in software. We can easily
>> do that with enumerated curves, populating them when the driver initializes
>> instead of waiting for the client to request them. You can even point
>> multiple hardware LUTs to the same video memory surface, if they need the
>> same curve.
>>> We could define that pipelines with a lower ID are to be preferred over
>>> higher IDs.
>> Sure, but this isn't just an issue with a pipeline as a whole, but the
>> individual elements within it and how to use them in a given context.
>>> The issue is that if programming a pipeline becomes too slow to be
>>> useful it probably should just not be made available to user space.
>> It's not that programming the pipeline is overall too slow. The LUTs we have
>> that are relatively slow to program are meant to be set infrequently, or even
>> just once, to allow the scaler and tone mapping operator to operate in fixed
>> point PQ space. You might still want the tone mapper, so you would choose a
>> pipeline that includes them, but when it comes to e.g. animating a night
>> light, you would want to choose a different LUT for that purpose.
>>> The prepare-commit idea for blob properties would help to make the
>>> pipelines usable again, but until then it's probably a good idea to just
>>> not expose those pipelines.
>> The prepare-commit idea actually wouldn't work for these LUTs, because they
>> are programmed using methods instead of pointing them to a surface. I'm
>> actually not sure how slow it actually is, would need to benchmark it. I
>> think not exposing them at all would be overkill, since it would mean you
>> can't use the preblending scaler or tonemapper, and animation isn't necessary
>> for that.
>> The AMD 3DLUT is another example of a LUT that is slow to update, and it
>> would obviously be a major loss if that wasn't exposed. There just needs to
>> be some way for clients to know if they are going to kill performance by
>> trying to change it every frame.
>> Thanks,
>> Alex
> To clarify, what are we defining as slow to update here? Something we aren't 
> able to update within a frame (let's say at a low frame rate such as 30 fps 
> for discussion's sake)? A block that requires a programming sequence of 
> disable + program + enable to update? Defining performance seems like it can 
> get murky if we start to consider frame concurrent updates among multiple 
> color blocks as well.

I think any definition for slow would need to be imprecise on some level.
In the AMD 3DLUT case we can take around 8 ms. Some compositors need the
programming time to be well under 1 ms, even for low frame rates. Those
compositors might want to know if an operation might be undesirable if
they care about latency. I'm not sure we could reliably indicate more.
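
If we did expose something, I'd imagine it being no more than a coarse
hint. The following is only a sketch with invented names, not a
proposal for actual property names or semantics:

```c
#include <stdbool.h>

/* Hypothetical per-colorop hint; all names here are illustrative. */
struct colorop_hint {
	bool slow_to_program; /* driver's coarse, imprecise indication */
};

/*
 * A latency-sensitive compositor might avoid reprogramming a slow op
 * every frame; other compositors can ignore the hint entirely.
 */
bool should_update_every_frame(const struct colorop_hint *hint,
			       bool latency_sensitive)
{
	return !(hint->slow_to_program && latency_sensitive);
}
```

The point of keeping it boolean is that the driver only has to say "this
may be slow", not promise a specific programming time.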


> Thanks,
> Christopher
>>>> Thanks,
>>>> pq
