Re: [RFC] Plane color pipeline KMS uAPI
On 6/13/2023 4:23 AM, Pekka Paalanen wrote:
> On Mon, 12 Jun 2023 12:56:57 -0400 Christopher Braga wrote:
> > On 6/12/2023 5:21 AM, Pekka Paalanen wrote:
> > > On Fri, 9 Jun 2023 19:11:25 -0400 Christopher Braga wrote:
> > > > On 6/9/2023 12:30 PM, Simon Ser wrote:
> > > > > Hi Christopher,
> > > > >
> > > > > On Friday, June 9th, 2023 at 17:52, Christopher Braga wrote:
> > > > >
> > > > > > > The new COLOROP objects also expose a number of KMS properties. Each
> > > > > > > has a type, a reference to the next COLOROP object in the linked list,
> > > > > > > and other type-specific properties. Here is an example for a 1D LUT
> > > > > > > operation:
> > > > > > >
> > > > > > >     Color operation 42
> > > > > > >     ├─ "type": enum {Bypass, 1D curve} = 1D curve
> > > > > > >     ├─ "1d_curve_type": enum {LUT, sRGB, PQ, BT.709, HLG, …} = LUT
> > > > > >
> > > > > > The options sRGB / PQ / BT.709 / HLG would select hard-coded 1D curves?
> > > > > > Will different hardware be allowed to expose a subset of these enum
> > > > > > values?
> > > > >
> > > > > Yes. Only hardcoded LUTs supported by the HW are exposed as enum entries.
> > > > >
> > > > > > >     ├─ "lut_size": immutable range = 4096
> > > > > > >     ├─ "lut_data": blob
> > > > > > >     └─ "next": immutable color operation ID = 43
> > > > > >
> > > > > > Some hardware has per channel 1D LUT values, while others use the same
> > > > > > LUT for all channels. We will definitely need to expose this in the
> > > > > > UAPI in some form.
> > > > >
> > > > > Hm, I was assuming per-channel 1D LUTs here, just like the existing
> > > > > GAMMA_LUT/DEGAMMA_LUT properties work. If some hardware can't support
> > > > > that, it'll need to get exposed as another color operation block.
> > > > >
> > > > > > > To configure this hardware block, user-space can fill a KMS blob with
> > > > > > > 4096 u32 entries, then set "lut_data" to the blob ID. Other color
> > > > > > > operation types might have different properties.
> > > > > >
> > > > > > The bit-depth of the LUT is an important piece of information we should
> > > > > > include by default. Are we assuming that the DRM driver will always
> > > > > > reduce the input values to the resolution supported by the pipeline?
> > > > > > This could result in differences between the hardware behavior and the
> > > > > > shader behavior. Additionally, some pipelines are floating point while
> > > > > > others are fixed.
> > > > > > How would user space know if it needs to pack 32 bit integer values
> > > > > > vs 32 bit float values?
> > > > >
> > > > > Again, I'm deferring to the existing GAMMA_LUT/DEGAMMA_LUT. These use a
> > > > > common definition of LUT blob (u16 elements) and it's up to the driver
> > > > > to convert.
> > > > >
> > > > > Using a very precise format for the uAPI has the nice property of making
> > > > > the uAPI much simpler to use. User-space sends high precision data and
> > > > > it's up to drivers to map that to whatever the hardware accepts.
> > > >
> > > > Conversion from a larger uint type to a smaller type sounds low effort,
> > > > however if a block works in a floating point space things are going to
> > > > get messy really quickly. If the block operates in FP16 space and the
> > > > interface is 16 bits we are good, but going from 32 bits to FP16 (such
> > > > as in the matrix case or 3DLUT) is less than ideal.
> > >
> > > Hi Christopher,
> > >
> > > are you thinking of precision loss, or the overhead of conversion?
> > >
> > > Conversion from N-bit fixed point to N-bit floating-point is generally
> > > lossy, too, and the other direction as well.
> > >
> > > What exactly would be messy?
> >
> > Overhead of conversion is the primary concern here. Having to extract
> > and / or calculate the significand + exponent components in the kernel
> > is burdensome and imo a task better suited for user space. This also has
> > to be done every blob set, meaning that if user space is re-using
> > pre-calculated blobs we would be repeating the same conversion
> > operations in kernel space unnecessarily.
>
> What is burdensome in that calculation? I don't think you would need to
> use any actual floating-point instructions. Logarithm for finding the
> exponent is about finding the highest bit set in an integer and everything
> is conveniently expressed in base-2. Finding significand is just masking
> the integer based on the exponent.

Oh it definitely can be done, but I think this is just a difference of
opinion at this point. At the end of the day we will do it if we have to,
but it is just more optimal if a more agreeable common type is used.
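To make the point above concrete, here is a minimal sketch of the integer-only conversion Pekka describes: building an IEEE-754 binary32 bit pattern from an unsigned integer using only bit operations. The function name is illustrative, not part of any kernel API, and it truncates rather than rounds for brevity.

```c
#include <stdint.h>

/* Build the IEEE-754 binary32 bit pattern for an unsigned integer
 * without any floating-point instructions: the biased exponent comes
 * from the index of the highest set bit, and the significand is the
 * bits below it, aligned to the 23-bit mantissa field. */
static uint32_t u32_to_f32_bits(uint32_t v)
{
	if (v == 0)
		return 0;

	int msb = 31;
	while (!(v & (1u << msb)))	/* find highest set bit */
		msb--;

	uint32_t exp = 127 + (uint32_t)msb;	/* biased exponent */
	uint32_t frac = v & ((1u << msb) - 1);	/* bits below the MSB */
	uint32_t mant = msb > 23 ? frac >> (msb - 23)	/* truncate (lossy) */
				 : frac << (23 - msb);
	return (exp << 23) | mant;
}
```

So the mechanics are cheap, as Pekka says; the disagreement above is only about whether this belongs in the kernel or in user space.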
> Can you not cache the converted data, keyed by the DRM blob unique
> identity vs. the KMS property it is attached to?

If the userspace compositor has N common transforms (ex: standard P3 ->
sRGB matrix), they would likely have N unique blobs. Obviously from the
kernel end we wouldn't want to cache the transform of every blob passed
down through the UAPI.

> You can assume that userspace will not be re-creating DRM blobs without
> a reason to believe the contents have changed. If the same blob is set
> on the same property repeatedly, I would definitely not expect a driver
> to convert the data again.

If the blob ID is unchanged there is no issue since caching the last
result is already common. As you say, blobs are immutable so no update is
needed. I'd question why the compositor keeps trying to send down the
same blob ID though.

> If a driver does that, it seems like it should be easy to avoid, though
> I'm no kernel dev.
>
> Even if the conversion was just a memcpy, I would still posit it needs
> to be avoided when the data has obviously not changed.
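The "caching the last result" scheme both sides agree on could look roughly like the following driver-side sketch. All names here are illustrative, not real DRM internals, and the u32-to-u16 truncation stands in for whatever hardware-specific conversion a driver actually performs.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical per-property cache: remember which blob ID the
 * hardware-ready LUT was built from, and skip reconversion when the
 * same immutable blob is set again. */
struct lut_cache {
	uint32_t last_blob_id;	/* 0: nothing cached yet */
	uint16_t hw_lut[4096];	/* converted, hardware-ready samples */
};

/* Returns true if a conversion was performed, false on a cache hit. */
static bool lut_cache_update(struct lut_cache *c, uint32_t blob_id,
			     const uint32_t *data, size_t n)
{
	if (blob_id != 0 && blob_id == c->last_blob_id)
		return false;	/* same immutable blob: reuse hw_lut */

	for (size_t i = 0; i < n; i++)
		c->hw_lut[i] = (uint16_t)(data[i] >> 16); /* e.g. u32 -> u16 */

	c->last_blob_id = blob_id;
	return true;
}
```

Because blobs are immutable and identified by ID, a single stored ID per property is enough; no content hashing is needed.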
Re: Refresh rates with multiple monitors
Hi,

On Tue, 13 Jun 2023 at 10:20, Pekka Paalanen wrote:
> On Tue, 13 Jun 2023 01:11:44 + (UTC) Joe M wrote:
> > As I understand, there is one global wl_display. Is there always one
> > wl_compositor too?
>
> That is inconsequential.

Yeah, I think the really consequential thing is that a wl_display really
just represents a connection to a Wayland server (aka compositor).
Display targets (e.g. 'the HDMI connector on the left', 'the DSI panel')
are represented by wl_output objects. There is one of those for each
output.

Cheers,
Daniel
Re: Refresh rates with multiple monitors
On Tue, 13 Jun 2023 01:11:44 + (UTC) Joe M wrote:
> Hi, I was wondering about the internals of Wayland (wl_compositor?)
> with multiple physical screens/displays attached. I'm using EGL so if
> those details are contextual to the answer please include if possible.

Hi,

I wrote a bit of an introduction here first to give some depth to the
answer, so pardon for straying a bit.

The first thing to recap is that Wayland is not a program you could run.
Wayland is not an implementation but only a language that applications
and display servers use to talk to each other.

Some vocabulary:
- A Wayland compositor is a display server.
- An application is a Wayland client.
- An output is usually a monitor.
- Repainting is the action of rendering a new composition for an output.

Wayland does pose some assumptions, especially related to a window that
happens to be on multiple outputs simultaneously:
- Each output is allowed to be repainted independently of others.
- An output can be repainted regardless of client actions at any time.
- A client draws the image of a window, and that one image is used on any
  outputs as necessary.
- A client does not need to draw, if the window image does not need
  changes.
- From client perspective a window has a single update loop (timings),
  therefore it can synchronise to only one timing source (output) at a
  time.

Wayland does not define how or when Wayland compositors should repaint
their outputs. Wayland also does not define what to use for the timings
of a window. Compositor implementations decide on those details as they
see fit. A popular approach is for a compositor to repaint each output
independently and without tearing, using whatever is the latest image for
each window.

> As I understand, there is one global wl_display. Is there always one
> wl_compositor too?

That is inconsequential.
Protocol objects (wl_proxy - an instance of, say, wl_compositor) are
always private to a Wayland client, but multiple protocol objects even
from different clients can refer to the same underlying "thing", like a
wl_output object refers to an output. Sometimes there is no particular
"thing" to refer to. Both wl_display and wl_compositor essentially refer
to the compositor as a whole. They are merely pieces of API. Our jargon
calls wl_compositor a "singleton global". wl_display is even more
fundamental and on client side it represents the Wayland connection to a
compositor.

> I'm able to create a surface in two different apps (or multiple
> instance of the same app), and call "set_fullscreen" on each one.
> Wayland (or, weston, I guess?) does the right thing and puts them on
> separate physical screens.
>
> Now, eglSwapBuffers takes as parameters the EGLDisplay and the
> EGLSurface. Is the vsync that the two apps observe at all
> interdependent, as a result of the display singleton?

No fundamental dependency there. What actually happens depends on how the
compositor in question is implemented and on which outputs the windows
are shown.

> If one monitor's mode is 30Hz and the other 60Hz, will both apps be
> constrained to the 30hz refresh?

I believe most, if not all, compositor implementations allow each app to
have its own pace according to the monitor it is on. IOW, no.

Thanks,
pq
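The independent-pacing answer above can be illustrated with a toy model (plain C, not Wayland API): each output repaints on its own clock, so over one second a 60 Hz output presents about twice as many frames as a 30 Hz one, and the faster client is never held back to 30 Hz. The struct and function names are invented for this sketch.

```c
#include <stdint.h>

/* Toy model of a compositor repainting two outputs independently:
 * each output has its own vblank period and frame counter. */
struct output {
	uint64_t period_us;	/* vblank period in microseconds */
	uint64_t next_us;	/* next repaint deadline */
	unsigned frames;	/* frames presented so far */
};

/* Advance the model to time now_us, repainting every output whose
 * deadline has passed -- each on its own clock, never waiting on
 * another output. */
static void tick(struct output *outs, int n, uint64_t now_us)
{
	for (int i = 0; i < n; i++) {
		while (outs[i].next_us <= now_us) {
			outs[i].frames++;
			outs[i].next_us += outs[i].period_us;
		}
	}
}
```

Running the model for one simulated second with a ~60 Hz and a ~30 Hz output yields roughly 60 and 30 frames respectively, matching the "each app paces to its own monitor" behaviour described above.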
Re: [RFC] Plane color pipeline KMS uAPI
On Mon, 12 Jun 2023 12:56:57 -0400 Christopher Braga wrote:
> On 6/12/2023 5:21 AM, Pekka Paalanen wrote:
> > On Fri, 9 Jun 2023 19:11:25 -0400 Christopher Braga wrote:
> > > On 6/9/2023 12:30 PM, Simon Ser wrote:
> > > > Hi Christopher,
> > > >
> > > > On Friday, June 9th, 2023 at 17:52, Christopher Braga wrote:
> > > >
> > > > > > The new COLOROP objects also expose a number of KMS properties. Each
> > > > > > has a type, a reference to the next COLOROP object in the linked
> > > > > > list, and other type-specific properties. Here is an example for a
> > > > > > 1D LUT operation:
> > > > > >
> > > > > >     Color operation 42
> > > > > >     ├─ "type": enum {Bypass, 1D curve} = 1D curve
> > > > > >     ├─ "1d_curve_type": enum {LUT, sRGB, PQ, BT.709, HLG, …} = LUT
> > > > >
> > > > > The options sRGB / PQ / BT.709 / HLG would select hard-coded 1D
> > > > > curves? Will different hardware be allowed to expose a subset of
> > > > > these enum values?
> > > >
> > > > Yes. Only hardcoded LUTs supported by the HW are exposed as enum
> > > > entries.
> > > >
> > > > > >     ├─ "lut_size": immutable range = 4096
> > > > > >     ├─ "lut_data": blob
> > > > > >     └─ "next": immutable color operation ID = 43
> > > > >
> > > > > Some hardware has per channel 1D LUT values, while others use the
> > > > > same LUT for all channels. We will definitely need to expose this in
> > > > > the UAPI in some form.
> > > >
> > > > Hm, I was assuming per-channel 1D LUTs here, just like the existing
> > > > GAMMA_LUT/DEGAMMA_LUT properties work. If some hardware can't support
> > > > that, it'll need to get exposed as another color operation block.
> > > >
> > > > > > To configure this hardware block, user-space can fill a KMS blob
> > > > > > with 4096 u32 entries, then set "lut_data" to the blob ID. Other
> > > > > > color operation types might have different properties.
> > > > >
> > > > > The bit-depth of the LUT is an important piece of information we
> > > > > should include by default. Are we assuming that the DRM driver will
> > > > > always reduce the input values to the resolution supported by the
> > > > > pipeline? This could result in differences between the hardware
> > > > > behavior and the shader behavior.
> > > > > Additionally, some pipelines are floating point while others are
> > > > > fixed. How would user space know if it needs to pack 32 bit integer
> > > > > values vs 32 bit float values?
> > > >
> > > > Again, I'm deferring to the existing GAMMA_LUT/DEGAMMA_LUT. These use
> > > > a common definition of LUT blob (u16 elements) and it's up to the
> > > > driver to convert.
> > > >
> > > > Using a very precise format for the uAPI has the nice property of
> > > > making the uAPI much simpler to use. User-space sends high precision
> > > > data and it's up to drivers to map that to whatever the hardware
> > > > accepts.
> > >
> > > Conversion from a larger uint type to a smaller type sounds low effort,
> > > however if a block works in a floating point space things are going to
> > > get messy really quickly. If the block operates in FP16 space and the
> > > interface is 16 bits we are good, but going from 32 bits to FP16 (such
> > > as in the matrix case or 3DLUT) is less than ideal.
> >
> > Hi Christopher,
> >
> > are you thinking of precision loss, or the overhead of conversion?
> >
> > Conversion from N-bit fixed point to N-bit floating-point is generally
> > lossy, too, and the other direction as well.
> >
> > What exactly would be messy?
>
> Overhead of conversion is the primary concern here. Having to extract
> and / or calculate the significand + exponent components in the kernel
> is burdensome and imo a task better suited for user space. This also has
> to be done every blob set, meaning that if user space is re-using
> pre-calculated blobs we would be repeating the same conversion
> operations in kernel space unnecessarily.

What is burdensome in that calculation? I don't think you would need to
use any actual floating-point instructions. Logarithm for finding the
exponent is about finding the highest bit set in an integer and
everything is conveniently expressed in base-2. Finding significand is
just masking the integer based on the exponent.
Can you not cache the converted data, keyed by the DRM blob unique
identity vs. the KMS property it is attached to?

You can assume that userspace will not be re-creating DRM blobs without a
reason to believe the contents have changed. If the same blob is set on
the same property repeatedly, I would definitely not expect a driver to
convert the data again. If a driver does that, it seems like it should be
easy to avoid, though I'm no kernel dev.

Even if the conversion was just a memcpy, I would still posit it needs to
be avoided when the data has obviously not changed.

Blobs are immutable. Userspace having to use hardware-specific number
formats would probably not be well received.

> I agree normalization of the value causing precision loss and rounding
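The "common uAPI format, driver converts" model referred to above (u16 LUT elements in the blob, as GAMMA_LUT/DEGAMMA_LUT use today) can be sketched as a per-sample mapping. The 10-bit hardware width and the function name are assumptions for illustration only; a real driver would use whatever width its pipeline actually has.

```c
#include <stdint.h>

/* Sketch: map one high-precision u16 LUT sample from the uAPI blob to
 * a hypothetical 10-bit hardware pipeline. Rounds to nearest when
 * dropping the low 6 bits, then clamps to the 10-bit maximum. */
static uint16_t u16_sample_to_hw10(uint16_t v)
{
	uint32_t r = ((uint32_t)v + (1u << 5)) >> 6;	/* round, drop 6 bits */
	return r > 0x3FF ? 0x3FF : (uint16_t)r;		/* clamp to 10 bits */
}
```

This is exactly the shape of conversion the thread argues is cheap for fixed-point targets; the contested case is when the hardware side is floating point rather than a narrower integer.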
Refresh rates with multiple monitors
Hi,

I was wondering about the internals of Wayland (wl_compositor?) with
multiple physical screens/displays attached. I'm using EGL so if those
details are contextual to the answer please include if possible.

As I understand, there is one global wl_display. Is there always one
wl_compositor too?

I'm able to create a surface in two different apps (or multiple instance
of the same app), and call "set_fullscreen" on each one. Wayland (or,
weston, I guess?) does the right thing and puts them on separate physical
screens.

Now, eglSwapBuffers takes as parameters the EGLDisplay and the
EGLSurface. Is the vsync that the two apps observe at all interdependent,
as a result of the display singleton?

If one monitor's mode is 30Hz and the other 60Hz, will both apps be
constrained to the 30hz refresh?

Thanks!