Re: [RFC] Plane color pipeline KMS uAPI

2023-06-13 Thread Christopher Braga




On 6/13/2023 4:23 AM, Pekka Paalanen wrote:

On Mon, 12 Jun 2023 12:56:57 -0400
Christopher Braga  wrote:


On 6/12/2023 5:21 AM, Pekka Paalanen wrote:

On Fri, 9 Jun 2023 19:11:25 -0400
Christopher Braga  wrote:
   

On 6/9/2023 12:30 PM, Simon Ser wrote:

Hi Christopher,

On Friday, June 9th, 2023 at 17:52, Christopher Braga  
wrote:
  

The new COLOROP objects also expose a number of KMS properties. Each has a
type, a reference to the next COLOROP object in the linked list, and other
type-specific properties. Here is an example for a 1D LUT operation:

Color operation 42
├─ "type": enum {Bypass, 1D curve} = 1D curve
├─ "1d_curve_type": enum {LUT, sRGB, PQ, BT.709, HLG, …} = LUT

The options sRGB / PQ / BT.709 / HLG would select hard-coded 1D
curves? Will different hardware be allowed to expose a subset of these
enum values?


Yes. Only hardcoded LUTs supported by the HW are exposed as enum entries.
  

├─ "lut_size": immutable range = 4096
├─ "lut_data": blob
└─ "next": immutable color operation ID = 43
 

Some hardware has per channel 1D LUT values, while others use the same
LUT for all channels.  We will definitely need to expose this in the
UAPI in some form.


Hm, I was assuming per-channel 1D LUTs here, just like the existing GAMMA_LUT/
DEGAMMA_LUT properties work. If some hardware can't support that, it'll need
to get exposed as another color operation block.
  

To configure this hardware block, user-space can fill a KMS blob with
4096 u32 entries, then set "lut_data" to the blob ID. Other color
operation types might have different properties.
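As a sketch of what that might look like from user space (the identity ramp here is just an illustrative choice; the array would then be wrapped in a KMS blob, e.g. via libdrm's drmModeCreatePropertyBlob(), and the returned blob ID set on "lut_data"):

```c
#include <stddef.h>
#include <stdint.h>

#define LUT_SIZE 4096

/* Hypothetical sketch: build a 4096-entry u32 LUT as described in the
 * RFC, here a linear identity ramp spread across the full u32 range.
 * User space would wrap this array in a KMS blob (e.g. with libdrm's
 * drmModeCreatePropertyBlob()) and set "lut_data" to the blob ID. */
static void fill_identity_lut(uint32_t lut[LUT_SIZE])
{
	for (size_t i = 0; i < LUT_SIZE; i++)
		lut[i] = (uint32_t)((i * 0xFFFFFFFFull) / (LUT_SIZE - 1));
}
```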
 

The bit-depth of the LUT is an important piece of information we should
include by default. Are we assuming that the DRM driver will always
reduce the input values to the resolution supported by the pipeline?
This could result in differences between the hardware behavior
and the shader behavior.

Additionally, some pipelines are floating point while others are fixed.
How would user space know if it needs to pack 32 bit integer values vs
32 bit float values?


Again, I'm deferring to the existing GAMMA_LUT/DEGAMMA_LUT. These use a common
definition of LUT blob (u16 elements) and it's up to the driver to convert.

Using a very precise format for the uAPI has the nice property of making the
uAPI much simpler to use. User-space sends high precision data and it's up to
drivers to map that to whatever the hardware accepts.
 

Conversion from a larger uint type to a smaller type sounds low effort,
however if a block works in a floating point space things are going to
get messy really quickly. If the block operates in FP16 space and the
interface is 16 bits we are good, but going from 32 bits to FP16 (such
as in the matrix case or 3DLUT) is less than ideal.


Hi Christopher,

are you thinking of precision loss, or the overhead of conversion?

Conversion from N-bit fixed point to N-bit floating-point is generally
lossy, too, and the other direction as well.

What exactly would be messy?
   

Overhead of conversion is the primary concern here. Having to extract
and / or calculate the significand + exponent components in the kernel
is burdensome and imo a task better suited for user space. This also has
to be done on every blob set, meaning that if user space is re-using
pre-calculated blobs we would be repeating the same conversion
operations in kernel space unnecessarily.


What is burdensome in that calculation? I don't think you would need to
use any actual floating-point instructions. The logarithm for finding the
exponent amounts to finding the highest bit set in an integer, and
everything is conveniently expressed in base-2. Finding the significand
is just masking the integer based on the exponent.
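For what it's worth, the integer-only conversion being described could look roughly like this (a sketch for 0.16 unsigned fixed point to IEEE 754 binary16 bits, truncating rather than rounding; not any driver's actual code):

```c
#include <stdint.h>

/* Sketch: convert 0.16 unsigned fixed point (v / 65536, in [0, 1))
 * to binary16 bits using only integer operations. */
static uint16_t u16_fixed_to_fp16(uint16_t v)
{
	if (v == 0)
		return 0;

	/* The exponent comes from the highest set bit. */
	int msb = 15;
	while (!(v & (1u << msb)))
		msb--;
	int exp = msb - 16;		/* the value is v / 2^16 */

	if (exp < -14)			/* below binary16's normal range */
		return (uint16_t)(v << 8);	/* subnormal: mant = v * 2^8 */

	/* The significand is the integer shifted/masked below the MSB. */
	uint32_t frac = (uint32_t)v << (31 - msb);	/* MSB now at bit 31 */
	uint16_t mant = (uint16_t)((frac >> 21) & 0x3ff); /* top 10 fraction bits */

	return (uint16_t)(((exp + 15) << 10) | mant);	/* binary16 bias = 15 */
}
```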

Oh it definitely can be done, but I think this is just a difference of
opinion at this point. At the end of the day we will do it if we have
to, but it would be preferable if a more agreeable common type were used.



Can you not cache the converted data, keyed by the DRM blob unique
identity vs. the KMS property it is attached to?

If the userspace compositor has N common transforms (ex: standard P3 ->
sRGB matrix), they would likely have N unique blobs. Obviously from the
kernel end we wouldn't want to cache the transform of every blob passed
down through the UAPI.




You can assume that userspace will not be re-creating DRM blobs without
a reason to believe the contents have changed. If the same blob is set
on the same property repeatedly, I would definitely not expect a driver
to convert the data again.

If the blob ID is unchanged there is no issue since caching the last
result is already common. As you say, blobs are immutable so no update
is needed. I'd question why the compositor keeps trying to send down the
same blob ID though.


If a driver does that, it seems like it
should be easy to avoid, though I'm no kernel dev. Even if the
conversion was just a memcpy, I would still posit it needs to be
avoided when the data has obviously not changed. Blobs are immutable.

Re: Refresh rates with multiple monitors

2023-06-13 Thread Daniel Stone
Hi,

On Tue, 13 Jun 2023 at 10:20, Pekka Paalanen  wrote:

> On Tue, 13 Jun 2023 01:11:44 + (UTC)
> Joe M  wrote:
> > As I understand, there is one global wl_display. Is there always one
> > wl_compositor too?
>
> That is inconsequential.
>

Yeah, I think the really consequential thing is that a wl_display really
just represents a connection to a Wayland server (aka compositor).

Display targets (e.g. 'the HDMI connector on the left', 'the DSI panel')
are represented by wl_output objects. There is one of those for each output.

Cheers,
Daniel


Re: Refresh rates with multiple monitors

2023-06-13 Thread Pekka Paalanen
On Tue, 13 Jun 2023 01:11:44 + (UTC)
Joe M  wrote:

> Hi, I was wondering about the internals of Wayland (wl_compositor?)
> with multiple physical screens/displays attached. I'm using EGL so if
> those details are contextual to the answer please include if possible.

Hi,

I wrote a bit of an introduction here first to give some depth to the
answer, so pardon for straying a bit.

The first thing to recap is that Wayland is not a program you could run.
Wayland is not an implementation but only a language that applications
and display servers use to talk to each other.

Some vocabulary:
- A Wayland compositor is a display server.
- An application is a Wayland client.
- An output is usually a monitor.
- Repainting is the action of rendering a new composition for an output.

Wayland does pose some assumptions, especially related to a window that
happens to be on multiple outputs simultaneously:
- Each output is allowed to be repainted independently of others.
- An output can be repainted regardless of client actions at any time.
- A client draws the image of a window, and that one image is used on
  any outputs as necessary.
- A client does not need to draw, if the window image does not need
  changes.
- From client perspective a window has a single update loop
  (timings), therefore it can synchronise to only one timing source
  (output) at a time.

Wayland does not define how or when Wayland compositors should repaint
their outputs. Wayland also does not define what to use for the timings
of a window. Compositor implementations decide on those details as they
see fit.

A popular approach is for a compositor to repaint each output
independently and without tearing, using whatever is the latest image
for each window.

> As I understand, there is one global wl_display. Is there always one
> wl_compositor too?

That is inconsequential.

Protocol objects (wl_proxy - an instance of, say, wl_compositor) are
always private to a Wayland client, but multiple protocol objects even
from different clients can refer to the same underlying "thing", like
a wl_output object refers to an output.

Sometimes there is no particular "thing" to refer to. Both wl_display
and wl_compositor essentially refer to the compositor as a whole. They
are merely pieces of API. Our jargon calls wl_compositor a "singleton
global". wl_display is even more fundamental and on client side it
represents the Wayland connection to a compositor.

> I'm able to create a surface in two different apps (or multiple
> instances of the same app), and call "set_fullscreen" on each one.
> Wayland (or, weston, I guess?) does the right thing and puts them on
> separate physical screens.
> Now, eglSwapBuffers takes as parameters the EGLDisplay and the
> EGLSurface. Is the vsync that the two apps observe at all
> interdependent, as a result of the display singleton?

No fundamental dependency there. What actually happens depends on how
the compositor in question is implemented and on which outputs the
windows are shown.

> If one monitor's mode is 30Hz and the other 60Hz, will both apps be
> constrained to the 30Hz refresh?

I believe most, if not all, compositor implementations allow each app
to have its own pace according to the monitor it is on. IOW, no.


Thanks,
pq




Re: [RFC] Plane color pipeline KMS uAPI

2023-06-13 Thread Pekka Paalanen
On Mon, 12 Jun 2023 12:56:57 -0400
Christopher Braga  wrote:

> On 6/12/2023 5:21 AM, Pekka Paalanen wrote:
> > On Fri, 9 Jun 2023 19:11:25 -0400
> > Christopher Braga  wrote:
> >   
> >> On 6/9/2023 12:30 PM, Simon Ser wrote:  
> >>> Hi Christopher,
> >>>
> >>> On Friday, June 9th, 2023 at 17:52, Christopher Braga 
> >>>  wrote:
> >>>  
> > The new COLOROP objects also expose a number of KMS properties. Each 
> > has a
> > type, a reference to the next COLOROP object in the linked list, and 
> > other
> > type-specific properties. Here is an example for a 1D LUT operation:
> >
> >Color operation 42
> >├─ "type": enum {Bypass, 1D curve} = 1D curve
> >├─ "1d_curve_type": enum {LUT, sRGB, PQ, BT.709, HLG, …} = LUT  
>  The options sRGB / PQ / BT.709 / HLG would select hard-coded 1D
>  curves? Will different hardware be allowed to expose a subset of these
>  enum values?  
> >>>
> >>> Yes. Only hardcoded LUTs supported by the HW are exposed as enum entries.
> >>>  
> >├─ "lut_size": immutable range = 4096
> >├─ "lut_data": blob
> >└─ "next": immutable color operation ID = 43
> > 
>  Some hardware has per channel 1D LUT values, while others use the same
>  LUT for all channels.  We will definitely need to expose this in the
>  UAPI in some form.  
> >>>
> >>> Hm, I was assuming per-channel 1D LUTs here, just like the existing 
> >>> GAMMA_LUT/
> >>> DEGAMMA_LUT properties work. If some hardware can't support that, it'll 
> >>> need
> >>> to get exposed as another color operation block.
> >>>  
> > To configure this hardware block, user-space can fill a KMS blob with
> > 4096 u32
> > entries, then set "lut_data" to the blob ID. Other color operation types
> > might
> > have different properties.
> > 
>  The bit-depth of the LUT is an important piece of information we should
>  include by default. Are we assuming that the DRM driver will always
>  reduce the input values to the resolution supported by the pipeline?
>  This could result in differences between the hardware behavior
>  and the shader behavior.
> 
>  Additionally, some pipelines are floating point while others are fixed.
>  How would user space know if it needs to pack 32 bit integer values vs
>  32 bit float values?  
> >>>
> >>> Again, I'm deferring to the existing GAMMA_LUT/DEGAMMA_LUT. These use a 
> >>> common
> >>> definition of LUT blob (u16 elements) and it's up to the driver to 
> >>> convert.
> >>>
> >>> Using a very precise format for the uAPI has the nice property of making 
> >>> the
> >>> uAPI much simpler to use. User-space sends high precision data and it's 
> >>> up to
> >>> drivers to map that to whatever the hardware accepts.
> >>> 
> >> Conversion from a larger uint type to a smaller type sounds low effort,
> >> however if a block works in a floating point space things are going to
> >> get messy really quickly. If the block operates in FP16 space and the
> >> interface is 16 bits we are good, but going from 32 bits to FP16 (such
> >> as in the matrix case or 3DLUT) is less than ideal.  
> > 
> > Hi Christopher,
> > 
> > are you thinking of precision loss, or the overhead of conversion?
> > 
> > Conversion from N-bit fixed point to N-bit floating-point is generally
> > lossy, too, and the other direction as well.
> > 
> > What exactly would be messy?
> >   
> Overhead of conversion is the primary concern here. Having to extract 
> and / or calculate the significand + exponent components in the kernel 
> is burdensome and imo a task better suited for user space. This also has 
> to be done every blob set, meaning that if user space is re-using 
> pre-calculated blobs we would be repeating the same conversion 
> operations in kernel space unnecessarily.

What is burdensome in that calculation? I don't think you would need to
use any actual floating-point instructions. The logarithm for finding the
exponent amounts to finding the highest bit set in an integer, and
everything is conveniently expressed in base-2. Finding the significand
is just masking the integer based on the exponent.

Can you not cache the converted data, keyed by the DRM blob unique
identity vs. the KMS property it is attached to?

You can assume that userspace will not be re-creating DRM blobs without
a reason to believe the contents have changed. If the same blob is set
on the same property repeatedly, I would definitely not expect a driver
to convert the data again. If a driver does that, it seems like it
should be easy to avoid, though I'm no kernel dev. Even if the
conversion was just a memcpy, I would still posit it needs to be
avoided when the data has obviously not changed. Blobs are immutable.
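A driver-side cache along these lines might be as simple as keying on the blob ID (a hypothetical sketch, not any existing driver's code; names are made up):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LUT_SIZE 4096

/* Hypothetical driver-side sketch: remember the last conversion keyed
 * by DRM blob ID. Blobs are immutable, so an unchanged ID means the
 * previously converted data is still valid and no reconversion is
 * needed when the same blob is set on the property again. */
struct lut_cache {
	uint32_t blob_id;		/* 0 = cache empty */
	uint16_t hw_lut[LUT_SIZE];	/* data in the hardware's native format */
};

/* Returns true if hw_lut already holds the conversion for blob_id. */
static bool lut_cache_hit(const struct lut_cache *c, uint32_t blob_id)
{
	return c->blob_id != 0 && c->blob_id == blob_id;
}

static void lut_cache_store(struct lut_cache *c, uint32_t blob_id,
			    const uint16_t *hw_data)
{
	c->blob_id = blob_id;
	memcpy(c->hw_lut, hw_data, sizeof(c->hw_lut));
}
```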

Userspace having to use hardware-specific number formats would probably
not be well received.

> I agree normalization of the value causing precision loss and rounding 

Refresh rates with multiple monitors

2023-06-13 Thread Joe M
Hi, I was wondering about the internals of Wayland (wl_compositor?) with 
multiple physical screens/displays attached. I'm using EGL so if those details 
are contextual to the answer please include if possible.
As I understand, there is one global wl_display. Is there always one 
wl_compositor too?
I'm able to create a surface in two different apps (or multiple instances of the 
same app), and call "set_fullscreen" on each one. Wayland (or, weston, I 
guess?) does the right thing and puts them on separate physical screens.
Now, eglSwapBuffers takes as parameters the EGLDisplay and the EGLSurface. Is 
the vsync that the two apps observe at all interdependent, as a result of the 
display singleton?
If one monitor's mode is 30Hz and the other 60Hz, will both apps be constrained 
to the 30Hz refresh?
Thanks!