[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-14 Thread Tom Cooksey
> >> > Turning to DRM/KMS, it seems the supported formats of a plane 
> >> > can be queried using drm_mode_get_plane. However, there doesn't 
> >> > seem to be a way to query the supported formats of a crtc? If 
> >> > display HW only supports scanning out from a single buffer 
> >> > (like pl111 does), I think it won't have any planes and a fb can 
> >> > only be set on the crtc. In which case, how should user-space 
> >> > query which pixel formats that crtc supports?
> >>
> >> it is exposed for drm plane's.  What is missing is to expose the
> >> primary-plane associated with the crtc.
> >
> > Cool - so a patch which adds a way to query what formats a crtc
> > supports would be welcome?
> 
> well, I kinda think we want something that exposes the "primary plane"
> of the crtc.. I'm thinking something roughly like:
> 
> -
> diff --git a/include/uapi/drm/drm_mode.h b/include/uapi/drm/drm_mode.h
> index 53db7ce..c7ffca8 100644
> --- a/include/uapi/drm/drm_mode.h
> +++ b/include/uapi/drm/drm_mode.h
> @@ -157,6 +157,12 @@ struct drm_mode_get_plane {
>  struct drm_mode_get_plane_res {
>   __u64 plane_id_ptr;
>   __u32 count_planes;
> + /* The primary planes are in matching order to crtc_id_ptr in
> +  * drm_mode_card_res (and same length).  For crtc_id[n], its
> +  * primary plane is given by primary_plane_id[n].
> +  */
> + __u32 count_primary_planes;
> + __u64 primary_plane_id_ptr;
>  };

Yup - I think that works and allows userspace to query the supported
formats of the crtc. Great!
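To make the pairing concrete: with Rob's sketch, crtc_id_ptr (from drm_mode_card_res) and primary_plane_id_ptr are parallel arrays, so userspace just indexes one with the other. A minimal illustrative lookup (the helper name is ours, not part of any proposed API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* With the proposed extension, crtc_ids[] and primary_plane_ids[] are
 * parallel arrays of equal length: the primary plane of crtc_ids[n] is
 * primary_plane_ids[n].  Purely illustrative userspace-side helper. */
static uint32_t primary_plane_for_crtc(const uint32_t *crtc_ids,
                                       const uint32_t *primary_plane_ids,
                                       size_t count, uint32_t crtc_id)
{
    for (size_t n = 0; n < count; n++)
        if (crtc_ids[n] == crtc_id)
            return primary_plane_ids[n];
    return 0; /* 0 is never a valid DRM object id: "not found" */
}
```

Once userspace has the primary plane's id, the existing drm_mode_get_plane query yields the format list, which is exactly the per-crtc format query asked for above.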




> which is why you want to let userspace figure out the pitch and then
> tell the display driver what size it wants, rather than using dumb
> buffer ioctl ;-)
> 
> Ok, you could have a generic TELL_ME_WHAT_STRIDE_TO_USE ioctl or
> property or what have you.. but I think that would be hard to get
> right for all cases, and most people don't really care about that
> because they already need a gpu/display specific xorg driver and/or
> gl/egl talking to their kernel driver.  You are in a slightly special
> case, since you are providing GL driver independently of the display
> driver.  But I think that is easier to handle by just telling your
> customers "here, fill out this function(s) to allocate buffer for
> scanout" (and, well, I guess you'd need one to query for
> pitch/stride), rather than trying to cram everything into the kernel.

I fear we're going round in circles here, so time to step back a sec.

My first goal is to figure out how to solve our immediate problem of
how to allocate buffers in our DDX which doesn't abuse the dumb buffer
interface. As stated at the start of this thread, we need to allocate
two types of buffers:

1) Those which will be shared between GPU & Display. These must be
allocated in such a way as to satisfy both devices' constraints.

2) Those which will only be used by the GPU (DRI2 buffers, pixmaps
which have been imported into a client's EGL, etc.)

It must be possible to obtain handles to both those types of buffers
in a single DRM/GEM name-space to allow us to implement DRI2.


I think we can satisfy the first buffer type by adjusting the display
DRM's dumb buffer alloc function to always allocate buffers which also
satisfy the GPU's constraints. In pl111_drm, that means all buffers
are allocated with a 64-byte stride alignment, even on SoCs without a
GPU. Not a big issue really and if it were we could make it a Kconfig
option or something.
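The adjustment amounts to rounding the dumb-buffer pitch up to the GPU's alignment. A sketch of that calculation, assuming the 64-byte stride alignment quoted above (the helper name is ours, not pl111_drm's):

```c
#include <assert.h>
#include <stdint.h>

/* Round the natural pitch of a dumb buffer up to a 64-byte boundary so
 * the allocation also satisfies the GPU's stride constraint.  64 is the
 * alignment discussed in the text; illustrative only. */
static uint32_t aligned_pitch(uint32_t width, uint32_t bpp)
{
    uint32_t pitch = width * ((bpp + 7) / 8); /* bytes per scanline */
    return (pitch + 63) & ~63u;               /* round up to 64 bytes */
}
```

The display driver would report the rounded-up pitch back through drm_mode_create_dumb's pitch field, so userspace never needs to know why the stride is larger than width * bytes-per-pixel.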

From what I have now understood, allocating GPU-only buffers should
be done with a device-specific ioctl. We then have a choice about
which DRM driver we should add that device-specific ioctl to. We could
either add it to the display controller's DRM or we could add it to
the GPU's DRM.

Adding it to the GPU's DRM requires user-space to jump through quite
a lot of hoops: In order to get both the scan-out GEM buffers and DRI2
GEM buffers in a single device's name-space, it would have to use
PRIME to export, say dumb scan-out buffers from the display's DRM as
dma_buf fds, then import those dma_buf fds into the GPU's DRM and then
use flink to give those imported scan-out buffers a name in the GPU's
DRM's namespace. Yuck.

No, I think it is easier to just add allocating GPU DRI2 buffers as
a device-specific ioctl on the display controller's DRM. Indeed, this   
appears to be what OMAP and Exynos DRM drivers (and maybe others) do.
One device does all the allocations and thus all buffers are already
in the same namespace, no faffing with exporting & importing buffers
in the DDX required.

We will need to figure out a way in the xf86-video-armsoc DDX to
abstract those driver-specific allocation ioctls. Using GBM is an
interesting idea - looking at the interface it seems to be very,
_very_ similar to Android's gralloc! Though I don't see how to get
a system-wide name for a buffer I can pass back to a client via DRI2?
I assume gbm_bo_handle is process-local? In the short term, I think
we'll just use run-time detection 

[RFC 1/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-14 Thread Tom Cooksey
> >> > > > So in the above, after X receives the second DRI2SwapBuffers,
> >> > > > it doesn't need to get scheduled again for the next frame to 
> >> > > > be both rendered by the GPU and issued to the display for 
> >> > > > scanout.
> >> > > 
> >> > > well, this is really only an issue if you are so loaded that you
> >> > > don't get a chance to schedule for ~16ms.. which is pretty long
> >> > > time.
> >>
> >> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't
> >> it? Hmmm, that does sound very long. Will try out some experiments 
> >> and see.
> >
> > We're looking at moving the flip queue into the DDX driver, however
> > it's not as straight-forward as I thought. With the current design,
> > all rate-limiting happens on the client side. So even if you only
> > have double buffering, using KDS you can queue up as many
> > asynchronous GPU-render/scan-out pairs as you want. It's up to EGL
> > in the client application to figure out there's a lot of frames in-
> > flight and so should probably block the application's render thread
> > in eglSwapBuffers to let the GPU and/or display catch up a bit.
> >
> > If we only allow a single outstanding page-flip job in DRM, there'd
> > be a race if we returned a buffer to the client which had an
> > outstanding page-flip queued up in the DDX: The client could issue
> > a render job to the buffer just as the DDX processed the page-flip
> > from the queue, making the scan-out block until the GPU rendered
> > the next frame. It would also mean the previous frame would have
> > been lost as it never got scanned out before the GPU rendered the
> > next-next frame to it.
>
> You wouldn't unconditionally send the swap-done event to the client
> when the queue is "full".  (Well, for omap and msm, the queue depth is
> 1, for triple buffer.. I think usually you don't want to do more than
> triple buffer.)  The client would never get a buffer that wasn't
> already done being scanned out, so there shouldn't be a race.
> 
> Basically, in DDX, when you get a ScheduleSwap, there are two cases:
> 1) you are still waiting for previous page-flip event from kernel, in
> which case you queue the swap and don't immediately send the event
> back to the client.  When the previous page flip completes, you
> schedule the new one and then send back the event to the client.
> 2) you are not waiting for a previous page-flip, in which case you
> schedule the new page-flip and send the event to the client.
> 
> (I hope that is clear.. I suppose maybe a picture here would help, 
> but sadly I don't have anything handy)
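In lieu of Rob's missing picture, his two-case ScheduleSwap logic can be written down as a toy state machine (all names ours, purely illustrative; queue depth 1, as he describes for omap/msm):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the DDX-side flip queue described above. */
struct ddx {
    bool flip_pending;  /* still waiting for a page-flip event? */
    bool swap_queued;   /* a ScheduleSwap deferred until that event */
    int flips_issued;   /* page-flips handed to the kernel */
    int events_sent;    /* swap-done events sent back to the client */
};

static void schedule_swap(struct ddx *d)
{
    if (d->flip_pending) {
        /* case 1: queue the swap; don't notify the client yet */
        d->swap_queued = true;
    } else {
        /* case 2: flip immediately and notify the client */
        d->flips_issued++;
        d->flip_pending = true;
        d->events_sent++;
    }
}

static void page_flip_event(struct ddx *d)
{
    d->flip_pending = false;
    if (d->swap_queued) {
        /* drain the deferred swap, then send the client its event */
        d->swap_queued = false;
        d->flips_issued++;
        d->flip_pending = true;
        d->events_sent++;
    }
}
```

Because the swap-done event is withheld while a flip is pending, the client can never hold a buffer that is still being scanned out, which is the property the race discussion below turns on.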

So your solution depends on the client-side EGL using page flip events
to figure out when to block the application thread when CPU is running
ahead of the GPU/display. We (currently) use the number of uncompleted
frames sent to the GPU to block the application thread. So there would
be a race if we moved the flip queue into the DDX and did nothing
else. However, I'm not proposing we do nothing else. :-)

Our proposal was to instead use waiting on the reply of the
DRI2GetBuffers request to block the application thread when the client
is submitting frames faster than the display can display them.
I've not really looked into using the DRI2BufferSwapComplete event in
our EGL implementation - it always felt like we'd be at risk of the
application somehow stealing the event and causing us to deadlock.
But - that may well be a completely irrational fear. :-) Anyway, I'll
take a look, thanks for the pointer!


Cheers,

Tom






___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel



[RFC 1/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-13 Thread Tom Cooksey
> > > > So in the above, after X receives the second DRI2SwapBuffers, it
> > > > doesn't need to get scheduled again for the next frame to be both
> > > > rendered by the GPU and issued to the display for scanout.
> > >
> > > well, this is really only an issue if you are so loaded that you
> > > don't get a chance to schedule for ~16ms.. which is pretty long
> > > time.
> 
> Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it?
> Hmmm, that does sound very long. Will try out some experiments and see.

We're looking at moving the flip queue into the DDX driver, however
it's not as straight-forward as I thought. With the current design,
all rate-limiting happens on the client side. So even if you only have
double buffering, using KDS you can queue up as many asynchronous
GPU-render/scan-out pairs as you want. It's up to EGL in the client
application to figure out there's a lot of frames in-flight and so
should probably block the application's render thread in
eglSwapBuffers to let the GPU and/or display catch up a bit.

If we only allow a single outstanding page-flip job in DRM, there'd be
a race if we returned a buffer to the client which had an outstanding
page-flip queued up in the DDX: The client could issue a render job to
the buffer just as the DDX processed the page-flip from the queue,
making the scan-out block until the GPU rendered the next frame. It
would also mean the previous frame would have been lost as it never
got scanned out before the GPU rendered the next-next frame to it.

So instead, I think we'll have to block (suspend?) a client in 
ScheduleSwap if the next buffer it would obtain with DRI2GetBuffers
has an outstanding page-flip in the user-space queue. We then wake
the client up again _after_ we get the page-flip event for the
previous page flip and have issued the page-flip to the next buffer
to the DRM. That way the DRM display driver has already registered its
intention to use the buffer with KDS before the client ever gets hold
of it.

Note: I say KDS here, but I assume the same issues will apply on any
implicit buffer-based synchronization. I.e. dma-fence.

I don't think it's really a problem, but I mention it to see if you
can spot a reason why the above wouldn't work before we go and
implement it - it's a fairly big change to the DDX. Can you see any
issues with it? PrepareAccess gets interesting...
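The blocking condition above is small enough to state as code. A toy model, assuming a double-buffered swap chain (all names ours; the real DDX state is of course richer):

```c
#include <assert.h>
#include <stdbool.h>

#define NBUF 2

/* Toy model of the scheme described above: hold the client in
 * ScheduleSwap whenever the buffer DRI2GetBuffers would hand it next
 * still has a page-flip outstanding in the user-space queue. */
struct swap_chain {
    bool flip_outstanding[NBUF]; /* page-flip queued on this buffer? */
    int next;                    /* buffer the client would get next */
};

/* Would ScheduleSwap have to block (suspend) the client right now? */
static bool must_block(const struct swap_chain *c)
{
    return c->flip_outstanding[c->next];
}

/* Page-flip event: the flip involving `buf` has been issued to the DRM
 * and completed, so a blocked client may now be woken. */
static void flip_complete(struct swap_chain *c, int buf)
{
    c->flip_outstanding[buf] = false;
}
```

The key ordering property is that flip_complete runs after the DRM display driver has registered its intention to use the buffer with KDS, so by the time must_block returns false the client can no longer race the scan-out.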



Cheers,

Tom










[RFC 1/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-09 Thread Tom Cooksey
> > > So in the above, after X receives the second DRI2SwapBuffers, it
> > > doesn't need to get scheduled again for the next frame to be both
> > > rendered by the GPU and issued to the display for scanout.
> >
> > well, this is really only an issue if you are so loaded that you
> > don't get a chance to schedule for ~16ms.. which is pretty long time.

Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it?
Hmmm, that does sound very long. Will try out some experiments and see.


> > If you are triple buffering, it should not end up in the critical 
> > path (since the gpu already has the 3rd buffer to start on the next
> > frame). And, well, if you do it all in the kernel you probably need
> > to toss things over to a workqueue anyways.
> 
> Just a quick comment on the kernel flip queue issue.
> 
> 16 ms scheduling latency sounds awful but totally doable with a less
> than stellar ddx driver going into limbo land and so preventing your
> single threaded X from doing more useful stuff. Is this really the 
> linux scheduler being stupid?

Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and
so rely on pixman to perform all 2D operations which does indeed tie
up that thread for fairly long periods of time.

We've had internal discussions about introducing a thread (gulp) in
the DDX to off-load drawing operations to. I think we were all a bit
scared by that idea though.


BTW: I wasn't suggesting it was the linux scheduler being stupid, just
that there is sometimes lots of contention over the CPU cores and X
is just one thread among many wanting to run.


> At least my impression was that the hw/kernel flip queue is to save
> power so that you can queue up a few frames and everything goes to
> sleep for half a second or so (at 24fps or whatever movie you're
> showing). Needing to schedule 5 frames ahead with pageflips under
> load is just guaranteed to result in really horrible interactivity
> and so awful user experience.

Agreed. There's always a tradeoff between tolerance to variable frame
rendering time/system latency (lot of buffers) and UI latency (few
buffers). 

As a side note, video playback is one use-case for explicit sync
objects which implicit/buffer-based sync doesn't handle: Queue up lots
of video frames for display, but mark those "display buffer" 
operations as depending on explicit sync objects which get signalled 
by the audio clock. Not sure Android actually does that yet though. 
Anyway, off topic.


Cheers,

Tom







[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-09 Thread Tom Cooksey

> > Turning to DRM/KMS, it seems the supported formats of a plane can be
> > queried using drm_mode_get_plane. However, there doesn't seem to be a
> > way to query the supported formats of a crtc? If display HW only
> > supports scanning out from a single buffer (like pl111 does), I think
> > it won't have any planes and a fb can only be set on the crtc. In
> > which case, how should user-space query which pixel formats that crtc
> > supports?
> 
> it is exposed for drm plane's.  What is missing is to expose the
> primary-plane associated with the crtc.

Cool - so a patch which adds a way to query what formats a crtc
supports would be welcome?

What about a way to query the stride alignment constraints?

Presumably using the drm_mode_get_property mechanism would be the
right way to implement that?


> > As with v4l2, DRM doesn't appear to have a way to query the stride
> > constraints? Assuming there is a way to query the stride constraints,
> > there also isn't a way to specify them when creating a buffer with
> > DRM, though perhaps the existing pitch parameter of
> > drm_mode_create_dumb could be used to allow user-space to pass in a
> > minimum stride as well as receive the allocated stride?
> >
> 
> well, you really shouldn't be using create_dumb..  you should have a
> userspace piece that is specific to the drm driver, and knows how to
> use that driver's gem allocate ioctl.

Sorry, why does this need a driver-specific allocation function? It's
just a display controller driver and I just want to allocate a scan-
out buffer - all I'm asking is for the display controller driver to
use a minimum stride alignment so I can export the buffer and use
another device to fill it with data.

The whole point is to be able to allocate the buffer in such a way
that another device can access it. So the driver _can't_ use a
special, device specific format, nor can it allocate it from a
private memory pool because doing so would preclude it from being
shared with another device.

That other device doesn't need to be a GPU either, it could just as
easily be a camera/ISP or video decoder.



> >> > So presumably you're talking about a GPU driver being the exporter
> >> > here? If so, how could the GPU driver do these kind of tricks on
> >> > memory shared with another device?
> >>
> >> Yes, that is gpu-as-exporter.  If someone else is allocating
> >> buffers, it is up to them to do these tricks or not.  Probably 
> >> there is a pretty good chance that if you aren't a GPU you don't 
> >> need those sort of tricks for fast allocation of transient upload 
> >> buffers, staging textures, temporary pixmaps, etc.  Ie. I don't 
> >> really think a v4l camera or video decoder would benefit from that 
> >> sort of optimization.
> >
> > Right - but none of those are really buffers you'd want to export
> 
> > with dma_buf to share with another device are they? In which case, 
> > why not just have dma_buf figure out the constraints and allocate 
> > the memory?
>
> maybe not.. but (a) you don't necessarily know at creation time if it
> is going to be exported (maybe you know if it is definitely not going
> to be exported, but the converse is not true),

I can't actually think of an example where you would not know at
allocation time whether a buffer was going to be exported. Do you
have a case in mind?

Regardless, you'd certainly have to know if a buffer will be exported
pretty quickly, before it's used so that you can import it into
whatever devices are going to access it. Otherwise if it gets
allocated before you export it, the allocation won't satisfy the
constraints of the other devices which will need to access it and
importing will fail. Assuming of course deferred allocation of the
backing pages as discussed earlier in the thread.
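The "satisfy the constraints of the other devices" step above is, for the common case of power-of-two stride alignments, just taking the strictest requirement across all importers. A purely illustrative sketch of that merge:

```c
#include <assert.h>
#include <stdint.h>

/* Combine the stride-alignment requirements of every device that will
 * import a shared buffer.  For power-of-two alignments the combined
 * constraint is simply the largest one (each smaller power of two
 * divides it).  Illustrative only; real constraint negotiation would
 * also cover contiguity, placement, tiling, etc. */
static uint32_t merge_alignments(const uint32_t *aligns, int n)
{
    uint32_t merged = 1;
    for (int i = 0; i < n; i++)
        if (aligns[i] > merged)
            merged = aligns[i];
    return merged;
}
```

This is exactly why the allocation has to happen after all importers are known (or be deferred until first use): the merged alignment isn't computable until the last device has attached.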



> and (b) there isn't
> really any reason to special case the allocation in the driver because
> it is going to be exported.

Not sure I follow you here? Surely you absolutely have to special-case
the allocation if the buffer is to be exported because you have to
take the other devices' constraints into account when you allocate? Or
do you mean you don't need to special-case the GEM buffer object
creation, only the allocation of the backing pages? Though I'm not
sure how that distinction is useful - at the end of the day, you need
to special-case allocation of the backing pages.


> helpers that can be used by simple drivers, yes.  Forcing the way the
> buffer is allocated, for sure not.  Currently, for example, there is
> no issue to export a buffer allocated from stolen-mem.

Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk
of memory reserved at boot for the GPU which the OS can't touch? I
guess I view such memory as accessible to all media devices on the 
system and as such, needs to be managed by a central allocator which
dma_buf can use to allocate from.

I guess if that stolen-mem is managed by a single device then in
essence that device becomes the central

[RFC 1/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-09 Thread Tom Cooksey
Hi Daniel, Rob.

Thank you both for your reviews - greatly appreciated!

> > > Known issues:
> > >  * It still includes code to use KDS, which is not going upstream.
> >
> > reviews on <...July/042462.html> can't hurt
> >
> > although you might consider submitting a reduced functionality driver
> > w/ KDS bits removed in the mean time.. then when the fence stuff is
> > merged it is just an incremental patch rather than a whole driver ;-)
> 
> Yeah, I think the KDS bits and comments need to go first before
> merging.

Right, as I expected really. Though as I said we'll probably wait for
fences to land and switch over to that before asking for it to be
merged. A pl111 KMS driver with neither KDS nor implicit fences is 
useless to us. Having said that, if someone else would have a use for
a fence/KDS-less pl111 KMS driver, please let me know!



> > > +/*
> > > + * Number of flips allowed in flight at any one time. Any more
> > > + * flips requested beyond this value will cause the caller to 
> > > + * block until earlier flips have completed.
> > > + *
> > > + * For performance reasons, this must be greater than the number
> > > + * of buffers used in the rendering pipeline. Note that the 
> > > + * rendering pipeline can contain different types of buffer, e.g.:
> > > + * - 2 final framebuffers
> > > + * - >2 geometry buffers for GPU use-cases
> > > + * - >2 vertex buffers for GPU use-cases
> > > + *
> > > + * For example, a system using 5 geometry buffers could have 5
> > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 
> > > + * 5 or greater.
> > > + *
> > > + * Whilst there may be more intermediate buffers (such as
> > > + * vertex/geometry) than final framebuffers, KDS is used to 
> > > + * ensure that GPU rendering waits for the next off-screen 
> > > + * buffer, so it doesn't overwrite an on-screen buffer and 
> > > + * produce tearing.
> > > + */
> > > +
> >
> > fwiw, this is at least different from how other drivers do triple
> > (or >double) buffering.  In other drivers (intel, omap, and
> > msm/freedreno, that I know of, maybe others too) the xorg driver
> > dri2 bits implement the double buffering (ie. send flip event back
> > to client immediately and queue up the flip and call page-flip
> > after the pageflip event back from kernel).
> >
> > I'm not saying not to do it this way, I guess I'd like to hear
> > what other folks think.  I kinda prefer doing this in userspace 
> > as it keeps the kernel bits simpler (plus it would then work 
> > properly on exynosdrm or other kms drivers).
> 
> Yeah, if this is just a sw queue then I don't think it makes sense
> to have it in the kernel. Afaik the current pageflip interface drm
> exposes allows one outstanding flip only, and you _must_ wait for
> the flip complete event before you can submit the second one.

Right, I'll have a think about this. I think our idea was to issue
enough page-flips into the kernel to make sure that any process
scheduling latencies on a heavily loaded system don't cause us to
miss a v_sync deadline. At the moment we issue the page flip from DRI2
schedule_swap. If we were to move that to the page flip event handler
of the previous page-flip, we're potentially adding in extra latency.

I.e. Currently we have:

DRI2SwapBuffers
 - drm_mode_page_flip to buffer B
DRI2SwapBuffers
 - drm_mode_page_flip to buffer A (gets queued in kernel)
...
v_sync! (at this point buffer B is scanned out)
 - release buffer A's KDS resource/signal buffer A's fence
- queued GPU job to render next frame to buffer A scheduled on HW
...
GPU interrupt! (at this point buffer A is ready to be scanned out)
 - release buffer A's KDS resource/signal buffer A's fence
- second page flip executed, buffer A's address written to scanout
  register, takes effect on next v_sync.


So in the above, after X receives the second DRI2SwapBuffers, it
doesn't need to get scheduled again for the next frame to be both
rendered by the GPU and issued to the display for scanout.


If we were to move to a user-space queue, I think we have something
like this:

DRI2SwapBuffers
 - drm_mode_page_flip to buffer B
DRI2SwapBuffers
 - queue page flip to buffer A in DDX
...
v_sync! (at this point buffer B is scanned out)
 - release buffer A's KDS resource/signal buffer A's fence
- queued GPU job to render next frame to buffer A scheduled on HW
 - Send page flip event to X
...
GPU interrupt! (at this point buffer A is ready to be scanned out)
 - Release buffer A's KDS resource/signal buffer A's fence - but nothing
   is waiting on it
...
X gets scheduled, runs page flip handler
 - drm_mode_page_flip to buffer A
   - buffer A's address written to scanout register, takes effect on
 next v_sync.
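Rob's suggested user-space approach boils down to a tiny state machine: since the kernel accepts only one outstanding flip, the DDX submits a flip immediately when idle and otherwise holds (at most) the next one until the flip-complete event arrives. A minimal model of that queue, with made-up names rather than real DDX or libdrm symbols:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the DDX-side flip queue: the kernel accepts only one
 * outstanding page-flip, so the DDX tracks whether one is in flight
 * and holds at most one pending buffer until flip-complete. */
struct flip_queue {
    bool     flip_in_flight;  /* a drm_mode_page_flip is outstanding */
    bool     have_pending;    /* a swap arrived while one was in flight */
    uint32_t pending_fb;      /* fb queued for the next flip */
    uint32_t scanout_fb;      /* fb currently (or about to be) scanned out */
    int      kernel_flips;    /* how many flips reached the kernel */
};

/* Called from DRI2SwapBuffers: flip immediately if possible, else queue. */
static void schedule_swap(struct flip_queue *q, uint32_t fb)
{
    if (!q->flip_in_flight) {
        q->scanout_fb = fb;
        q->flip_in_flight = true;
        q->kernel_flips++;          /* stands in for drm_mode_page_flip */
    } else {
        q->have_pending = true;     /* only one flip may be outstanding */
        q->pending_fb = fb;
    }
}

/* Called when the kernel delivers the flip-complete (vblank) event. */
static void on_flip_complete(struct flip_queue *q)
{
    q->flip_in_flight = false;
    if (q->have_pending) {          /* X must be scheduled to run this */
        q->have_pending = false;
        schedule_swap(q, q->pending_fb);
    }
}
```

The second flip only reaches the kernel from on_flip_complete, which is exactly the extra X scheduling round-trip at issue here.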


So here, X must get scheduled again after processing the second
DRI2SwapBuffers in order to have the next frame displayed. This
increases the likelihood that we're not able to write the address of
buf

RE: [RFC 1/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-09 Thread Tom Cooksey
> > > So in the above, after X receives the second DRI2SwapBuffers, it
> > > doesn't need to get scheduled again for the next frame to be both
> > > rendered by the GPU and issued to the display for scanout.
> >
> > well, this is really only an issue if you are so loaded that you
> > don't get a chance to schedule for ~16ms.. which is a pretty long time.

Yes - it really is 16ms (minus interrupt/workqueue latency) isn't it?
Hmmm, that does sound very long. Will try out some experiments and see.


> > If you are triple buffering, it should not end up in the critical 
> > path (since the gpu already has the 3rd buffer to start on the next
> > frame). And, well, if you do it all in the kernel you probably need
> > to toss things over to a workqueue anyways.
> 
> Just a quick comment on the kernel flip queue issue.
> 
> 16 ms scheduling latency sounds awful but totally doable with a less
> than stellar ddx driver going into limbo land and so preventing your
> single threaded X from doing more useful stuff. Is this really the 
> linux scheduler being stupid?

Ahahhaaa!! Yes!!! Really good point. We generally don't have 2D HW and
so rely on pixman to perform all 2D operations which does indeed tie
up that thread for fairly long periods of time.

We've had internal discussions about introducing a thread (gulp) in
the DDX to off-load drawing operations to. I think we were all a bit
scared by that idea though.


BTW: I wasn't suggesting it was the linux scheduler being stupid, just
that there is sometimes lots of contention over the CPU cores and X
is just one thread among many wanting to run.


> At least my impression was that the hw/kernel flip queue is to save
> power so that you can queue up a few frames and everything goes to
> sleep for half a second or so (at 24fps or whatever movie you're
> showing). Needing to schedule 5 frames ahead with pageflips under
> load is just guaranteed to result in really horrible interactivity
> and so awful user experience.

Agreed. There's always a tradeoff between tolerance to variable frame
rendering time/system latency (lot of buffers) and UI latency (few
buffers). 

As a side note, video playback is one use-case for explicit sync
objects which implicit/buffer-based sync doesn't handle: Queue up lots
of video frames for display, but mark those "display buffer" 
operations as depending on explicit sync objects which get signalled 
by the audio clock. Not sure Android actually does that yet though. 
Anyway, off topic.


Cheers,

Tom





___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-09 Thread Tom Cooksey

> > Turning to DRM/KMS, it seems the supported formats of a plane can be
> > queried using drm_mode_get_plane. However, there doesn't seem to be a
> > way to query the supported formats of a crtc? If display HW only
> > supports scanning out from a single buffer (like pl111 does), I think
> > it won't have any planes and a fb can only be set on the crtc. In
> > which case, how should user-space query which pixel formats that crtc
> > supports?
> 
> it is exposed for drm planes.  What is missing is to expose the
> primary-plane associated with the crtc.

Cool - so a patch which adds a way to query what formats a crtc
supports would be welcome?

What about a way to query the stride alignment constraints?

Presumably using the drm_mode_get_property mechanism would be the
right way to implement that?


> > As with v4l2, DRM doesn't appear to have a way to query the stride
> > constraints? Assuming there is a way to query the stride constraints,
> > there also isn't a way to specify them when creating a buffer with
> > DRM, though perhaps the existing pitch parameter of
> > drm_mode_create_dumb could be used to allow user-space to pass in a
> > minimum stride as well as receive the allocated stride?
> >
> 
> well, you really shouldn't be using create_dumb..  you should have a
> userspace piece that is specific to the drm driver, and knows how to
> use that driver's gem allocate ioctl.

Sorry, why does this need a driver-specific allocation function? It's
just a display controller driver and I just want to allocate a scan-
out buffer - all I'm asking is for the display controller driver to
use a minimum stride alignment so I can export the buffer and use
another device to fill it with data.

The whole point is to be able to allocate the buffer in such a way
that another device can access it. So the driver _can't_ use a
special, device specific format, nor can it allocate it from a
private memory pool because doing so would preclude it from being
shared with another device.

That other device doesn't need to be a GPU either, it could just as
easily be a camera/ISP or video decoder.



> >> > So presumably you're talking about a GPU driver being the exporter
> >> > here? If so, how could the GPU driver do these kind of tricks on
> >> > memory shared with another device?
> >>
> >> Yes, that is gpu-as-exporter.  If someone else is allocating
> >> buffers, it is up to them to do these tricks or not.  Probably 
> >> there is a pretty good chance that if you aren't a GPU you don't 
> >> need those sort of tricks for fast allocation of transient upload 
> >> buffers, staging textures, temporary pixmaps, etc.  Ie. I don't 
> >> really think a v4l camera or video decoder would benefit from that 
> >> sort of optimization.
> >
> > Right - but none of those are really buffers you'd want to export
> > with dma_buf to share with another device are they? In which case,
> > why not just have dma_buf figure out the constraints and allocate
> > the memory?
>
> maybe not.. but (a) you don't necessarily know at creation time if it
> is going to be exported (maybe you know if it is definitely not going
> to be exported, but the converse is not true),

I can't actually think of an example where you would not know if a
buffer was going to be exported or not at allocation time? Do you have
a case in mind?

Regardless, you'd certainly have to know if a buffer will be exported
pretty quickly, before it's used so that you can import it into
whatever devices are going to access it. Otherwise if it gets
allocated before you export it, the allocation won't satisfy the
constraints of the other devices which will need to access it and
importing will fail. Assuming of course deferred allocation of the
backing pages as discussed earlier in the thread.



> and (b) there isn't
> really any reason to special case the allocation in the driver because
> it is going to be exported.

Not sure I follow you here? Surely you absolutely have to special-case
the allocation if the buffer is to be exported because you have to
take the other devices' constraints into account when you allocate? Or
do you mean you don't need to special-case the GEM buffer object
creation, only the allocation of the backing pages? Though I'm not
sure how that distinction is useful - at the end of the day, you need
to special-case allocation of the backing pages.
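If an exporter (or a shared dma_buf helper) did compute the lowest common denominator over its attachments at first-map time, the merge itself is the easy part; the hard part is expressing the constraints at all. A sketch under the assumption that each attached device carries something like the extended device_dma_parameters discussed earlier in the thread (every struct and function name here is hypothetical, not real kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device placement constraints, loosely modelled on
 * struct device_dma_parameters plus the extra fields discussed in the
 * thread.  None of these names exist in the kernel. */
struct placement_constraints {
    bool          needs_contiguous;   /* device has no IOMMU */
    unsigned long alignment;          /* required start alignment, power of 2 */
    unsigned int  max_segments;       /* max sglist entries it can handle */
};

/* Lowest-common-denominator merge over all current attachments: any
 * device needing contiguous memory forces contiguous, the strictest
 * (largest) alignment wins, and the smallest sglist limit wins. */
static void merge_constraints(const struct placement_constraints *devs,
                              size_t ndevs,
                              struct placement_constraints *out)
{
    out->needs_contiguous = false;
    out->alignment = 1;
    out->max_segments = (unsigned int)-1;

    for (size_t i = 0; i < ndevs; i++) {
        out->needs_contiguous |= devs[i].needs_contiguous;
        if (devs[i].alignment > out->alignment)
            out->alignment = devs[i].alignment;
        if (devs[i].max_segments < out->max_segments)
            out->max_segments = devs[i].max_segments;
    }
}
```

This is the logic that would otherwise be duplicated in every exporting driver, which is the argument for a helper.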


> helpers that can be used by simple drivers, yes.  Forcing the way the
> buffer is allocated, for sure not.  Currently, for example, there is
> no issue to export a buffer allocated from stolen-mem.

Where stolen-mem is the PC-world's version of a carveout? I.e. A chunk
of memory reserved at boot for the GPU which the OS can't touch? I
guess I view such memory as accessible to all media devices on the
system and, as such, it needs to be managed by a central allocator
which dma_buf can use to allocate from.

I guess if that stolen-mem is managed by a single device then in
essence that device becomes the central


[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-07 Thread Tom Cooksey

> >> > Didn't you say that programmatically describing device placement
> >> > constraints was an unbounded problem? I guess we would have to
> >> > accept that it's not possible to describe all possible constraints
> >> > and instead find a way to describe the common ones?
> >>
> >> well, the point I'm trying to make, is by dividing your constraints
> >> into two groups, one that impacts and is handled by userspace, and
> >> one that is in the kernel (ie. where the pages go), you cut down 
> >> the number of permutations that the kernel has to care about
> >>  considerably. And kernel already cares about, for example, what 
> >> range of addresses that a device can dma to/from.  I think really 
> >> the only thing missing is the max # of sglist entries (contiguous 
> >> or not)
> >
> > I think it's more than physically contiguous or not.
> >
> > For example, it can be more efficient to use large page sizes on
> > devices with IOMMUs to reduce TLB traffic. I think the size and even
> > the availability of large pages varies between different IOMMUs.
> 
> sure.. but I suppose if we can spiff out dma_params to express "I need
> contiguous", perhaps we can add some way to express "I prefer
> as-contiguous-as-possible".. either way, this is about where the pages
> are placed, and not about the layout of pixels within the page, so
> should be in kernel.  It's something that is missing, but I believe
> that it belongs in dma_params and hidden behind dma_alloc_*() for
> simple drivers.

Thinking about it, isn't this more a property of the IOMMU? I mean,
are there any cases where an IOMMU had a large page mode but you
wouldn't want to use it? So when allocating the memory, you'd have to
take into account not just the constraints of the devices themselves,
but also of any IOMMUs any of the devices sit behind?


> > There's also the issue of buffer stride alignment. As I say, if the
> > buffer is to be written by a tile-based GPU like Mali, it's more
> > efficient if the buffer's stride is aligned to the max AXI bus burst
> > length. Though I guess a buffer stride only makes sense as a concept
> > when interpreting the data as a linear-layout 2D image, so perhaps
> > belongs in user-space along with format negotiation?
> >
> 
> Yeah.. this isn't about where the pages go, but about the arrangement
> within a page.
> 
> And, well, except for hw that supports the same tiling (or
> compressed-fb) in display+gpu, you probably aren't sharing tiled
> buffers.

You'd only want to share a buffer between devices if those devices can
understand the same pixel format. That pixel format can't be device-
specific or opaque, it has to be explicit. I think drm_fourcc.h is
what defines all the possible pixel formats. This is the enum I used
in EGL_EXT_image_dma_buf_import at least. So if we get to the point
where multiple devices can understand a tiled or compressed format, I
assume we could just add that format to drm_fourcc.h and possibly
v4l2's v4l2_mbus_pixelcode enum in v4l2-mediabus.h.
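For reference, drm_fourcc.h builds each format code by packing four ASCII characters little-endian, so adding a new shared format really is just defining another code. This is how DRM_FORMAT_XRGB8888 gets its value:

```c
#include <assert.h>
#include <stdint.h>

/* The fourcc packing used by drm_fourcc.h: four ASCII characters in
 * little-endian order. */
#define fourcc_code(a, b, c, d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
                                 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* e.g. 32-bit XRGB, a format both a GPU and a scanout engine
 * are likely to understand */
#define DRM_FORMAT_XRGB8888 fourcc_code('X', 'R', '2', '4')
```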

For user-space to negotiate a common pixel format and now stride
alignment, I guess it will obviously need a way to query what pixel
formats a device supports and what its stride alignment requirements
are.

I don't know v4l2 very well, but it certainly seems the pixel format
can be queried using V4L2_SUBDEV_FORMAT_TRY when attempting to set
a particular format. I couldn't however find a way to retrieve a list
of supported formats - it seems the mechanism is to try out each
format in turn to determine if it is supported. Is that right?

There doesn't however seem to be a way to query what stride constraints a
V4l2 device might have. Does HW abstracted by v4l2 typically have
such constraints? If so, how can we query them such that a buffer
allocated by a DRM driver can be imported into v4l2 and used with
that HW?

Turning to DRM/KMS, it seems the supported formats of a plane can be
queried using drm_mode_get_plane. However, there doesn't seem to be a
way to query the supported formats of a crtc? If display HW only
supports scanning out from a single buffer (like pl111 does), I think
it won't have any planes and a fb can only be set on the crtc. In
which case, how should user-space query which pixel formats that crtc
supports?

Assuming user-space can query the supported formats and find a common
one, it will need to allocate a buffer. Looks like 
drm_mode_create_dumb can do that, but it only takes a bpp parameter,
there's no format parameter. I assume then that user-space defines
the format and tells the DRM driver which format the buffer is in
when creating the fb with drm_mode_fb_cmd2, which does take a format
parameter? Is that right?

As with v4l2, DRM doesn't appear to have a way to query the stride
constraints? Assuming there is a way to query the stride constraints,
there also isn't a way to specify them when creating a buffer with
DRM, though perhaps the existing pitch parameter of
drm_mode_create_dumb could be used to allow user-space to pas


[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-06 Thread Tom Cooksey

> >> ... This is the purpose of the attach step,
> >> so you know all the devices involved in sharing up front before
> >> allocating the backing pages. (Or in the worst case, if you have a
> >> "late attacher" you at least know when no device is doing dma access
> >> to a buffer and can reallocate and move the buffer.)  A long time
> >> back, I had a patch that added a field or two to 'struct
> >> device_dma_parameters' so that it could be known if a device
> >> required contiguous buffers.. looks like that never got merged, so
> >> I'd need to dig that back up and resend it.  But the idea was to 
> >> have the 'struct device' encapsulate all the information that would 
> >> be needed to do-the-right-thing when it comes to placement.
> >
> > As I understand it, it's up to the exporting device to allocate the
> > memory backing the dma_buf buffer. I guess the latest possible point
> > you can allocate the backing pages is when map_dma_buf is first
> > called? At that point the exporter can iterate over the current set
> > of attachments, programmatically determine the all the constraints of
> > all the attached drivers and attempt to allocate the backing pages
> > in such a way as to satisfy all those constraints?
> 
> yes, this is the idea..  possibly some room for some helpers to help
> out with this, but that is all under the hood from userspace
> perspective
> 
> > Didn't you say that programmatically describing device placement
> > constraints was an unbounded problem? I guess we would have to
> > accept that it's not possible to describe all possible constraints
> > and instead find a way to describe the common ones?
> 
> well, the point I'm trying to make, is by dividing your constraints
> into two groups, one that impacts and is handled by userspace, and one
> that is in the kernel (ie. where the pages go), you cut down the
> number of permutations that the kernel has to care about considerably.
>  And kernel already cares about, for example, what range of addresses
> that a device can dma to/from.  I think really the only thing missing
> is the max # of sglist entries (contiguous or not)

I think it's more than physically contiguous or not.

For example, it can be more efficient to use large page sizes on
devices with IOMMUs to reduce TLB traffic. I think the size and even
the availability of large pages varies between different IOMMUs.

There's also the issue of buffer stride alignment. As I say, if the
buffer is to be written by a tile-based GPU like Mali, it's more
efficient if the buffer's stride is aligned to the max AXI bus burst
length. Though I guess a buffer stride only makes sense as a concept 
when interpreting the data as a linear-layout 2D image, so perhaps 
belongs in user-space along with format negotiation?
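The stride constraint described here is just a round-up of the minimal row size to the bus-burst alignment. A small sketch, where the 64-byte burst length is purely an illustrative assumption (the real value depends on the particular Mali/AXI integration):

```c
#include <assert.h>
#include <stdint.h>

/* Round a buffer stride up to a power-of-two alignment.  The 64-byte
 * value below is only an illustrative stand-in for "max AXI bus burst
 * length"; the real figure is implementation-specific. */
#define AXI_BURST_BYTES 64u

static uint32_t aligned_stride(uint32_t width, uint32_t bytes_per_pixel)
{
    uint32_t stride = width * bytes_per_pixel;
    return (stride + AXI_BURST_BYTES - 1) & ~(AXI_BURST_BYTES - 1);
}
```

A number like this is what user-space would feed into the pitch parameter of drm_mode_create_dumb if it were treated as a minimum stride, as suggested below.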


> > One problem with this is it duplicates a lot of logic in each
> > driver which can export a dma_buf buffer. Each exporter will need to
> > do pretty much the same thing: iterate over all the attachments,
> > determine of all the constraints (assuming that can be done) and
> > allocate pages such that the lowest-common-denominator is satisfied.
> >
> > Perhaps rather than duplicating that logic in every driver, we could
> > Instead move allocation of the backing pages into dma_buf itself?
> >
> 
> I tend to think it is better to add helpers as we see common patterns
> emerge, which drivers can opt-in to using.  I don't think that we
> should move allocation into dma_buf itself, but it would perhaps be
> useful to have dma_alloc_*() variants that could allocate for multiple
> devices.

A helper could work I guess, though I quite like the idea of having
dma_alloc_*() variants which take a list of devices to allocate memory
for.


> That would help for simple stuff, although I'd suspect
> eventually a GPU driver will move away from that.  (Since you probably
> want to play tricks w/ pools of pages that are pre-zero'd and in the
> correct cache state, use spare cycles on the gpu or dma engine to
> pre-zero uncached pages, and games like that.)

So presumably you're talking about a GPU driver being the exporter
here? If so, how could the GPU driver do these kind of tricks on
memory shared with another device?



> >> > Anyway, assuming user-space can figure out how a buffer should be
> >> > stored in memory, how does it indicate this to a kernel driver and
> >> > actually allocate it? Which ioctl on which device does user-space
> >> > call, with what parameters? Are you suggesting using something
> >> > like ION which exposes the low-level details of how buffers are 
> >> > laid out in physical memory to userspace? If not, what?
> >> > 
> >>
> >> no, userspace should not need to know this.  And having a central
> >> driver that knows this for all the other drivers in the system
> >> doesn't really solve anything and isn't really scalable.  At best
> >> you might want, in some cases, a flag you can pass when allocating.
> >> For example, some of the drivers have a 'SCANOUT' flag

[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-06 Thread Tom Cooksey
Hi Rob,

> >> > We may also then have additional constraints when sharing buffers
> >> > between the display HW and video decode or even camera ISP HW.
> >> > Programmatically describing buffer allocation constraints is very
> >> > difficult and I'm not sure you can actually do it - there's some
> >> > pretty complex constraints out there! E.g. I believe there's a
> >> > platform where Y and UV planes of the reference frame need to be
> >> > in separate DRAM banks for real-time 1080p decode, or something 
> >> > like that?
> >>
> >> yes, this was discussed.  This is different from pitch/format/size
> >> constraints.. it is really just a placement constraint (ie. where 
> >> do the physical pages go).  IIRC the conclusion was to use a dummy
> >> device with its own CMA pool for attaching the Y vs UV buffers.
> >>
> >> > Anyway, I guess my point is that even if we solve how to allocate
> >> > buffers which will be shared between the GPU and display HW such
> >> > that both sets of constraints are satisfied, that may not be the
> >> > end of the story.
> >> >
> >>
> >> that was part of the reason to punt this problem to userspace ;-)
> >>
> >> In practice, the kernel drivers don't usually know too much about
> >> the dimensions/format/etc.. that is really userspace level 
> >> knowledge. There are a few exceptions when the kernel needs to know
> >> how to setup GTT/etc for tiled buffers, but normally this sort of 
> >> information is up at the next level up (userspace, and 
> >> drm_framebuffer in case of scanout).  Userspace media frameworks 
> >> like GStreamer already have a concept of format/caps negotiation.  
> >> For non-display<->gpu sharing, I think this is probably where this 
> >> sort of constraint negotiation should be handled.
> >
> > I agree that user-space will know which devices will access the
> > buffer and thus can figure out at least a common pixel format. 
> > Though I'm not so sure userspace can figure out more low-level 
> > details like alignment and placement in physical memory, etc.
> > 
> 
> well, let's divide things up into two categories:
> 
> 1) the arrangement and format of pixels.. ie. what userspace would
> need to know if it mmap's a buffer.  This includes pixel format,
> stride, etc.  This should be negotiated in userspace, it would be
> crazy to try to do this in the kernel.

Absolutely. Pixel format has to be negotiated by user-space since, in
most cases, user-space can map the buffer and thus needs to know how
to interpret the data.



> 2) the physical placement of the pages.  Ie. whether it is contiguous
> or not.  Which bank the pages in the buffer are placed in, etc.  This
> is not visible to userspace.

Seems sensible to me.


> ... This is the purpose of the attach step,
> so you know all the devices involved in sharing up front before
> allocating the backing pages. (Or in the worst case, if you have a
> "late attacher" you at least know when no device is doing dma access
> to a buffer and can reallocate and move the buffer.)  A long time
> back, I had a patch that added a field or two to 'struct
> device_dma_parameters' so that it could be known if a device required
> contiguous buffers.. looks like that never got merged, so I'd need to
> dig that back up and resend it.  But the idea was to have the 'struct
> device' encapsulate all the information that would be needed to
> do-the-right-thing when it comes to placement.

As I understand it, it's up to the exporting device to allocate the
memory backing the dma_buf buffer. I guess the latest possible point
you can allocate the backing pages is when map_dma_buf is first
called? At that point the exporter can iterate over the current set
of attachments, programmatically determine all the constraints of
all the attached drivers and attempt to allocate the backing pages
in such a way as to satisfy all those constraints?

Didn't you say that programmatically describing device placement
constraints was an unbounded problem? I guess we would have to
accept that it's not possible to describe all possible constraints
and instead find a way to describe the common ones?

One problem with this is that it duplicates a lot of logic in each
driver which can export a dma_buf buffer. Each exporter will need to
do pretty much the same thing: iterate over all the attachments,
determine all the constraints (assuming that can be done) and
allocate pages such that the lowest-common-denominator is satisfied.

Perhaps rather than duplicating that logic in every driver, we could
instead move allocation of the backing pages into dma_buf itself?


> > Anyway, assuming user-space can figure out how a buffer should be
> > stored in memory, how does it indicate this to a kernel driver and
> > actually allocate it? Which ioctl on which device does user-space
> > call, with what parameters? Are you suggesting using something like
> > ION which exposes the low-level details of how buffers are laid out
> in
> > physical memory to userspace?

[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-06 Thread Tom Cooksey
Hi Rob,

+lkml

> >> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey 
> >> wrote:
> >> >> >  * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to
> >> >> >also allocate buffers for the GPU. Still not sure how to 
> >> >> >resolve this as we don't use DRM for our GPU driver.
> >> >>
> >> >> any thoughts/plans about a DRM GPU driver?  Ideally long term
> >> >> (esp. once the dma-fence stuff is in place), we'd have 
> >> >> gpu-specific drm (gpu-only, no kms) driver, and SoC/display
> >> >> specific drm/kms driver, using prime/dmabuf to share between
> >> >> the two.
> >> >
> >> > The "extra" buffers we were allocating from armsoc DDX were really
> >> > being allocated through DRM/GEM so we could get a flink name
> >> > for them and pass a reference to them back to our GPU driver on
> >> > the client side. If it weren't for our need to access those
> >> > extra off-screen buffers with the GPU we wouldn't need to
> >> > allocate them with DRM at all. So, given they are really "GPU"
> >> > buffers, it does absolutely make sense to allocate them in a
> >> > different driver to the display driver.
> >> >
> >> > However, to avoid unnecessary memcpys & related cache
> >> > maintenance ops, we'd also like the GPU to render into buffers
> >> > which are scanned out by the display controller. So let's say
> >> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
> >> > out buffers with the display's DRM driver but a custom ioctl
> >> > on the GPU's DRM driver to allocate non scanout, off-screen
> >> > buffers. Sounds great, but I don't think that really works
> >> > with DRI2. If we used two drivers to allocate buffers, which
> >> > of those drivers do we return in DRI2ConnectReply? Even if we
> >> > solve that somehow, GEM flink names are name-spaced to a
> >> > single device node (AFAIK). So when we do a DRI2GetBuffers,
> >> > how does the EGL in the client know which DRM device owns GEM
> >> > flink name "1234"? We'd need some pretty dirty hacks.
> >>
> >> You would return the name of the display driver allocating the
> >> buffers.  On the client side you can use generic ioctls to go from
> >> flink -> handle -> dmabuf.  So the client side would end up opening
> >> both the display drm device and the gpu, but without needing to know
> >> too much about the display.
> >
> > I think the bit I was missing was that a GEM bo for a buffer imported
> > using dma_buf/PRIME can still be flink'd. So the display controller's
> > DRM driver allocates scan-out buffers via the DUMB buffer allocate
> > ioctl. Those scan-out buffers can then be exported from the
> > display's DRM driver and imported into the GPU's DRM driver using
> > PRIME. Once imported into the GPU's driver, we can use flink to get a
> > name for that buffer within the GPU DRM driver's name-space to return
> > to the DRI2 client. That same namespace is also what DRI2 back-
> > buffers are allocated from, so I think that could work... Except...
> 
> (and.. the general direction is that things will move more to just use
> dmabuf directly, ie. wayland or dri3)

I agree, DRI2 is the only reason why we need a system-wide ID. I also
prefer buffers to be passed around by dma_buf fd, but we still need to
support DRI2 and will do for some time I expect.



> >> > Anyway, that latter case also gets quite difficult. The "GPU"
> >> > DRM driver would need to know the constraints of the display
> >> > controller when allocating buffers intended to be scanned out.
> >> > For example, pl111 typically isn't behind an IOMMU and so
> >> > requires physically contiguous memory. We'd have to teach the
> >> > GPU's DRM driver about the constraints of the display HW. Not
> >> > exactly a clean driver model. :-(
> >> >
> >> > I'm still a little stuck on how to proceed, so any ideas
> >> > would be greatly appreciated! My current train of thought is
> >> > having a kind of SoC-specific DRM driver which allocates
> >> > buffers for both display and GPU within a single GEM
> >> > namespace. That SoC-specific DRM driver could then know the
> >> > constraints of both the GPU and the display 

RE: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-06 Thread Tom Cooksey

> >> ... This is the purpose of the attach step,
> >> so you know all the devices involved in sharing up front before
> >> allocating the backing pages. (Or in the worst case, if you have a
> >> "late attacher" you at least know when no device is doing dma access
> >> to a buffer and can reallocate and move the buffer.)  A long time
> >> back, I had a patch that added a field or two to 'struct
> >> device_dma_parameters' so that it could be known if a device
> >> required contiguous buffers.. looks like that never got merged, so
> >> I'd need to dig that back up and resend it.  But the idea was to 
> >> have the 'struct device' encapsulate all the information that would 
> >> be needed to do-the-right-thing when it comes to placement.
> >
> > As I understand it, it's up to the exporting device to allocate the
> > memory backing the dma_buf buffer. I guess the latest possible point
> > you can allocate the backing pages is when map_dma_buf is first
> > called? At that point the exporter can iterate over the current set
> > of attachments, programmatically determine all the constraints of
> > all the attached drivers and attempt to allocate the backing pages
> > in such a way as to satisfy all those constraints?
> 
> yes, this is the idea..  possibly some room for some helpers to help
> out with this, but that is all under the hood from userspace
> perspective
> 
> > Didn't you say that programmatically describing device placement
> > constraints was an unbounded problem? I guess we would have to
> > accept that it's not possible to describe all possible constraints
> > and instead find a way to describe the common ones?
> 
> well, the point I'm trying to make, is by dividing your constraints
> into two groups, one that impacts and is handled by userspace, and one
> that is in the kernel (ie. where the pages go), you cut down the
> number of permutations that the kernel has to care about considerably.
>  And kernel already cares about, for example, what range of addresses
> that a device can dma to/from.  I think really the only thing missing
> is the max # of sglist entries (contiguous or not)

I think it's about more than just physically contiguous or not.

For example, it can be more efficient to use large page sizes on
devices with IOMMUs to reduce TLB traffic. I think the size and even
the availability of large pages varies between different IOMMUs.

There's also the issue of buffer stride alignment. As I say, if the
buffer is to be written by a tile-based GPU like Mali, it's more
efficient if the buffer's stride is aligned to the max AXI bus burst
length. Though I guess a buffer stride only makes sense as a concept
when interpreting the data as a linear-layout 2D image, so perhaps it
belongs in user-space along with format negotiation?


> > One problem with this is it duplicates a lot of logic in each
> > driver which can export a dma_buf buffer. Each exporter will need to
> > do pretty much the same thing: iterate over all the attachments,
> > determine all the constraints (assuming that can be done) and
> > allocate pages such that the lowest-common-denominator is satisfied.
> >
> > Perhaps rather than duplicating that logic in every driver, we could
> > instead move allocation of the backing pages into dma_buf itself?
> >
> 
> I tend to think it is better to add helpers as we see common patterns
> emerge, which drivers can opt-in to using.  I don't think that we
> should move allocation into dma_buf itself, but it would perhaps be
> useful to have dma_alloc_*() variants that could allocate for multiple
> devices.

A helper could work I guess, though I quite like the idea of having
dma_alloc_*() variants which take a list of devices to allocate memory
for.



[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-08-05 Thread Tom Cooksey
Hi Rob,

+linux-media, +linaro-mm-sig for discussion of video/camera
buffer constraints...


> On Fri, Jul 26, 2013 at 11:58 AM, Tom Cooksey 
> wrote:
> >> >  * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
> >> >allocate buffers for the GPU. Still not sure how to resolve
> >> >this as we don't use DRM for our GPU driver.
> >>
> >> any thoughts/plans about a DRM GPU driver?  Ideally long term (esp.
> >> once the dma-fence stuff is in place), we'd have gpu-specific drm
> >> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
> >> using prime/dmabuf to share between the two.
> >
> > The "extra" buffers we were allocating from armsoc DDX were really
> > being allocated through DRM/GEM so we could get a flink name
> > for them and pass a reference to them back to our GPU driver on
> > the client side. If it weren't for our need to access those
> > extra off-screen buffers with the GPU we wouldn't need to
> > allocate them with DRM at all. So, given they are really "GPU"
> > buffers, it does absolutely make sense to allocate them in a
> > different driver to the display driver.
> >
> > However, to avoid unnecessary memcpys & related cache
> > maintenance ops, we'd also like the GPU to render into buffers
> > which are scanned out by the display controller. So let's say
> > we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
> > out buffers with the display's DRM driver but a custom ioctl
> > on the GPU's DRM driver to allocate non scanout, off-screen
> > buffers. Sounds great, but I don't think that really works
> > with DRI2. If we used two drivers to allocate buffers, which
> > of those drivers do we return in DRI2ConnectReply? Even if we
> > solve that somehow, GEM flink names are name-spaced to a
> > single device node (AFAIK). So when we do a DRI2GetBuffers,
> > how does the EGL in the client know which DRM device owns GEM
> > flink name "1234"? We'd need some pretty dirty hacks.
> 
> You would return the name of the display driver allocating the
> buffers.  On the client side you can use generic ioctls to go from
> flink -> handle -> dmabuf.  So the client side would end up opening
> both the display drm device and the gpu, but without needing to know
> too much about the display.

I think the bit I was missing was that a GEM bo for a buffer imported
using dma_buf/PRIME can still be flink'd. So the display controller's
DRM driver allocates scan-out buffers via the DUMB buffer allocate
ioctl. Those scan-out buffers can then be exported from the
display's DRM driver and imported into the GPU's DRM driver using
PRIME. Once imported into the GPU's driver, we can use flink to get a
name for that buffer within the GPU DRM driver's name-space to return
to the DRI2 client. That same namespace is also what DRI2 back-buffers
are allocated from, so I think that could work... Except...



> > Anyway, that latter case also gets quite difficult. The "GPU"
> > DRM driver would need to know the constraints of the display
> > controller when allocating buffers intended to be scanned out.
> > For example, pl111 typically isn't behind an IOMMU and so
> > requires physically contiguous memory. We'd have to teach the
> > GPU's DRM driver about the constraints of the display HW. Not
> > exactly a clean driver model. :-(
> >
> > I'm still a little stuck on how to proceed, so any ideas
> > would be greatly appreciated! My current train of thought is
> > having a kind of SoC-specific DRM driver which allocates
> > buffers for both display and GPU within a single GEM
> > namespace. That SoC-specific DRM driver could then know the
> > constraints of both the GPU and the display HW. We could then
> > use PRIME to export buffers allocated with the SoC DRM driver
> > and import them into the GPU and/or display DRM driver.
> 
> Usually if the display drm driver is allocating the buffers that might
> be scanned out, it just needs to have minimal knowledge of the GPU
> (pitch alignment constraints).  I don't think we need a 3rd device
> just to allocate buffers.

While Mali can render to pretty much any buffer, there is a mild
performance improvement to be had if the buffer stride is aligned to
the AXI bus's max burst length when drawing to the buffer.

So in some respects, there is a constraint on how buffers which the
GPU will draw to are allocated. I don't really like the idea
of teaching the display controller DRM driver about the GPU buffer
constraints, even if t

[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-07-26 Thread Tom Cooksey
Hi Rob,

> >  * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
> >allocate buffers for the GPU. Still not sure how to resolve this
> >as we don't use DRM for our GPU driver.
> 
> any thoughts/plans about a DRM GPU driver?  Ideally long term (esp.
> once the dma-fence stuff is in place), we'd have gpu-specific drm
> (gpu-only, no kms) driver, and SoC/display specific drm/kms driver,
> using prime/dmabuf to share between the two.

The "extra" buffers we were allocating from armsoc DDX were really
being allocated through DRM/GEM so we could get an flink name
for them and pass a reference to them back to our GPU driver on
the client side. If it weren't for our need to access those
extra off-screen buffers with the GPU we wouldn't need to
allocate them with DRM at all. So, given they are really "GPU"
buffers, it does absolutely make sense to allocate them in a
different driver to the display driver.

However, to avoid unnecessary memcpys & related cache
maintenance ops, we'd also like the GPU to render into buffers
which are scanned out by the display controller. So let's say
we continue using DRM_IOCTL_MODE_CREATE_DUMB to allocate scan
out buffers with the display's DRM driver but a custom ioctl
on the GPU's DRM driver to allocate non scanout, off-screen
buffers. Sounds great, but I don't think that really works
with DRI2. If we used two drivers to allocate buffers, which
of those drivers do we return in DRI2ConnectReply? Even if we
solve that somehow, GEM flink names are name-spaced to a
single device node (AFAIK). So when we do a DRI2GetBuffers,
how does the EGL in the client know which DRM device owns GEM
flink name "1234"? We'd need some pretty dirty hacks.

So then we looked at allocating _all_ buffers with the GPU's
DRM driver. That solves the DRI2 single-device-name and single
name-space issue. It also means the GPU would _never_ render
into buffers allocated through DRM_IOCTL_MODE_CREATE_DUMB.
One thing I wasn't sure about is if there was an objection
to using PRIME to export scanout buffers allocated with
DRM_IOCTL_MODE_CREATE_DUMB and then importing them into a GPU
driver to be rendered into? Is that a concern?

Anyway, that latter case also gets quite difficult. The "GPU"
DRM driver would need to know the constraints of the display
controller when allocating buffers intended to be scanned out.
For example, pl111 typically isn't behind an IOMMU and so
requires physically contiguous memory. We'd have to teach the
GPU's DRM driver about the constraints of the display HW. Not
exactly a clean driver model. :-(

I'm still a little stuck on how to proceed, so any ideas
would be greatly appreciated! My current train of thought is
having a kind of SoC-specific DRM driver which allocates
buffers for both display and GPU within a single GEM
namespace. That SoC-specific DRM driver could then know the
constraints of both the GPU and the display HW. We could then
use PRIME to export buffers allocated with the SoC DRM driver
and import them into the GPU and/or display DRM driver.

Note: While it doesn't use the DRM framework, the Mali T6xx
kernel driver has supported importing buffers through dma_buf
for some time. I've even written an EGL extension :-):




> I'm not entirely sure that the directions that the current CDF
> proposals are headed is necessarily the right way forward.  I'd prefer
> to see small/incremental evolution of KMS (ie. add drm_bridge and
> drm_panel, and refactor the existing encoder-slave).  Keeping it
> inside drm means that we can evolve it more easily, and avoid layers
> of glue code for no good reason.

I think CDF could allow vendors to re-use code they've written
for their Android driver stack in DRM drivers more easily. Though
I guess ideally KMS would evolve to a point where it could be used
by an Android driver stack, i.e. support explicit fences.


Cheers,

Tom








[RFC 0/1] drm/pl111: Initial drm/kms driver for pl111

2013-07-25 Thread tom . cooksey
From: Tom Cooksey 

Please find below the current state of our pl111 DRM/KMS driver. This
is lightly tested on a Versatile Express using X running the
xf86-video-armsoc DDX driver[i] with the patches applied to drm-next
as of ~last week. To actually see anything on the DVI output, you
must also apply Pawel Moll's VExpress DVI mux driver[ii] to select
the video signal from the ca9x4 core tile.

[i] 
<https://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary>
[ii] <https://patchwork.kernel.org/patch/1765981/>


Known issues:
 * It uses KDS. We intend to switch to whatever implicit per-buffer
   synchronisation mechanism gets merged, once something is merged.
 * It abuses flags parameter of DRM_IOCTL_MODE_CREATE_DUMB to also
   allocate buffers for the GPU. Still not sure how to resolve this
   as we don't use DRM for our GPU driver.
 * Doesn't handle page flip event sequence numbers and timestamps
 * The v_sync handling needs work in general - a work queue is a
   little overkill
 * Doesn't support the import half of PRIME properly, only export
 * Need to validate src rectangle size in
   pl111_drm_cursor_plane_update()
 * Only supports 640x480 mode, which is hard-coded. We intend to
   rebase on top of CDF once it is merged, which hopefully will
   handle a lot of the EDID parsing & mode setting for us (once
   Pawel's CDF patches for VExpress also land).

I appreciate that's a fairly hefty list of known issues already!
However, we're waiting for both CDF & dma_buf sync mechanisms to land
before we can address some of those. So in the mean-time, I thought
someone might be interested in taking a look at what we have so far,
which is why I'm posting this now. Needless to say the code will need
to be refactored a fair bit; however, I'm keen to get any additional
feedback anyone cares to give.


Cheers,

Tom

Tom Cooksey (1):
  drm/pl111: Initial drm/kms driver for pl111 display controller

 drivers/gpu/drm/Kconfig |2 +
 drivers/gpu/drm/Makefile|1 +
 drivers/gpu/drm/pl111/Kbuild|   14 +
 drivers/gpu/drm/pl111/Kconfig   |9 +
 drivers/gpu/drm/pl111/pl111_clcd_ext.h  |   78 
 drivers/gpu/drm/pl111/pl111_drm.h   |  227 
 drivers/gpu/drm/pl111/pl111_drm_connector.c |  166 +
 drivers/gpu/drm/pl111/pl111_drm_crtc.c  |  432 ++
 drivers/gpu/drm/pl111/pl111_drm_cursor.c|   97 +
 drivers/gpu/drm/pl111/pl111_drm_device.c|  319 +
 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c   |  339 ++
 drivers/gpu/drm/pl111/pl111_drm_encoder.c   |  106 ++
 drivers/gpu/drm/pl111/pl111_drm_fb.c|  152 
 drivers/gpu/drm/pl111/pl111_drm_funcs.h |  127 +++
 drivers/gpu/drm/pl111/pl111_drm_gem.c   |  287 +++
 drivers/gpu/drm/pl111/pl111_drm_pl111.c |  513 +++
 drivers/gpu/drm/pl111/pl111_drm_platform.c  |  150 
 drivers/gpu/drm/pl111/pl111_drm_suspend.c   |   35 ++
 drivers/gpu/drm/pl111/pl111_drm_vma.c   |  214 +++
 19 files changed, 3268 insertions(+)
 create mode 100644 drivers/gpu/drm/pl111/Kbuild
 create mode 100644 drivers/gpu/drm/pl111/Kconfig
 create mode 100644 drivers/gpu/drm/pl111/pl111_clcd_ext.h
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm.h
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_connector.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_crtc.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_cursor.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_device.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_dma_buf.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_encoder.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_fb.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_funcs.h
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_gem.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_pl111.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_platform.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_suspend.c
 create mode 100644 drivers/gpu/drm/pl111/pl111_drm_vma.c

-- 
1.7.9.5


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


abuse of dumb ioctls in exynos

2013-04-24 Thread Tom Cooksey
Hi Dave!

I guess I should have opened a discussion around armsoc a lot earlier
than now as you clearly have some frustrations! Sorry about that.

It also sounds like you have some ideas over how we should approach
the technical side and those I really want to understand.


> -Original Message-
> From: Dave Airlie [mailto:airlied at gmail.com]
> Sent: 23 April 2013 21:29
> To: Tom Cooksey
> Cc: dri-devel; Inki Dae
> Subject: Re: abuse of dumb ioctls in exynos
> 
> >
> > Having a flag to indicate a dumb buffer allocation is to be used as a
> > scan-out buffer would be useful for xf86-video-armsoc. We're trying to
> > keep that driver as generic as possible and currently the main device-
> > specific bits are what flags to pass to DRM_IOCTL_MODE_CREATE_DUMB for
> > scanout & non-scanout buffer allocations. If a generic scanout flag could
> > be added, it would simplify armsoc a fair bit and also allow the DRM
> > drivers we're using armsoc with to comply with the don't pass device-
> > specific flags to create dumb.
> >
> > For reference, the device-specific bits of armsoc are currently abstracted
> > here:
> >
> > Note: We are still using DRM_IOCTL_MODE_CREATE_DUMB to allocate pixmap
> > and DRI2 buffers and have not come across any issues with doing that.
> > Certainly both Mali-400 & Mali-T6xx render to linear RGBA buffers and
> > the display controllers in SoCs shipping Mali also seem to happily
> > scan-out linear RGB buffers. Getting armsoc to run on OMAP (again) might
> > need a device-specific allocation function to allocate the tiled format
> > used on OMAP, but only for efficient 90-degree rotations (if I understood
> > Rob correctly). So maybe we could also one day add a "this buffer will be
> > rotated 90 degrees" flag?
> 
> What part of don't use dumb buffer for acceleration is hard to understand?
> 
> Christ I called them DUMB. Lets try this again.
> 
> DON'T USE DUMB BUFFERS FOR ALLOCATING BUFFERS USED FOR ACCELERATION.

Right, I _think_ I understand your opinion on that. :-)

The reason we (currently) use the dumb buffer interface is because it
does pretty much exactly what we need it to, as we only want linear
RGB buffers:

On Mali & probably other tiled-based GPUs, the back buffer only gets
written once per frame, when the GPU writes its on-die tile buffer to
system memory. As such, we don't need the complicated memory layouts
immediate-mode renderers use to improve cache efficiency, etc.

What's more, the 2D hardware typically found on SoCs we're targeting
isn't advanced enough to implement all of the EXA operations and
frequently falls back to software rendering, which only works with
linear RGB buffers.

Another option we nearly went with is to use ION to allocate all
buffers, using the PRIME ioctls to import those buffers we want
to scanout into the display controller's DRM driver. ION's a pretty
good fit, but requires some SoC-specific logic in userspace to
figure out, e.g., that the display controller doesn't have an IOMMU
and that we must therefore allocate from a contiguous ION heap. By
allocating
via the DUMB interface and specifying a scanout hint, we can leave
that decision to the DRM driver and keep userspace entirely generic.
The other reason to go with DUMB rather than ION was because ION
wasn't upstream.


> Now that we've cleared that up, armsoc is a big bag of shit, I've
> spent a few hours on it in the last few weeks trying to get anything
> to run on my chromebook and really armsoc needs to be put out of its
> misery.

This is why we need a bug tracker! To objectively quantify "big bag
of shit" and fix it. :-)


> The only working long term strategy for ARM I see is to abstract the
> common modesetting code into a new library, 

Would you mind elaborating a little on this? I assume you're not talking
about libkms? What operations would be performed by this driver which
would need to be abstracted in userspace which aren't already nicely
abstracted by KMS? Once we have a new library of some description, I
assume you're suggesting we modify armsoc to use it? That seems a good
idea as it also means we can use that to implement the HWComposer HAL
on Android, and thus the same driver code can be used with minimal
changes on X11, Android, Wayland, Mir and whatever other new window
system comes along. That's really the point I'm trying to get to.


> and write a per-GPU
> driver.

So in our bit of the ARM ecosystem, the GPU is just the bit which
draws 3D graphics. The 2D drawing hardware is separate, as is the
display controller as is the video codec. This is reflected in the
driver model: The GPU driver is totally bespoke, the display
controller interface


abuse of dumb ioctls in exynos

2013-04-23 Thread Tom Cooksey
> It appears exynos is passing the generic flags from the dumb ioctls
> straight into the GEM creation code.
> 
> The dumb flags are NOT driver specific, and are NOT to be used in this
> fashion. Please remove this use of the flags from your driver.
> 
> I was going to add one new flag to the interface for SCANOUT vs CURSOR
> for some drivers.

Having a flag to indicate a dumb buffer allocation is to be used as a 
scan-out buffer would be useful for xf86-video-armsoc. We're trying to
keep that driver as generic as possible and currently the main device-
specific bits are what flags to pass to DRM_IOCTL_MODE_CREATE_DUMB for
scanout & non-scanout buffer allocations. If a generic scanout flag could
be added, it would simplify armsoc a fair bit and also allow the DRM
drivers we're using armsoc with to comply with the don't pass device-
specific flags to create dumb.

For reference, the device-specific bits of armsoc are currently abstracted
here:




Note: We are still using DRM_IOCTL_MODE_CREATE_DUMB to allocate pixmap
and DRI2 buffers and have not come across any issues with doing that.
Certainly both Mali-400 & Mali-T6xx render to linear RGBA buffers and
the display controllers in SoCs shipping Mali also seem to happily
scan-out linear RGB buffers. Getting armsoc to run on OMAP (again) might
need a device-specific allocation function to allocate the tiled format
used on OMAP, but only for efficient 90-degree rotations (if I understood
Rob correctly). So maybe we could also one day add a "this buffer will be
rotated 90 degrees" flag?


Cheers,

Tom

PS: I've stuck in a fd.o bugzilla ticket to move xf86-video-armsoc to
freedesktop.org infrastructure, so hopefully it will live in a more
appropriate place soon, not to mention have a mailing list, etc.!









Status of exporting an fbdev framebuffer with dma_buf?

2013-04-09 Thread Tom Cooksey
Hi All,

Last year Laurent posted an RFC patch[i] to add support for exporting an
fbdev framebuffer through dma_buf. Looking through the mailing list
archives, it doesn't appear to have progressed beyond an RFC? What would
be needed to get this merged? It would be useful for our Mali T6xx
driver (which supports importing dma_buf buffers) to allow the GPU to
draw directly into the framebuffer on platforms which lack a DRM/KMS
driver.

[i] Subject: "[RFC/PATCH] fb: Add dma-buf support", sent 20/06/2012.


Cheers,

Tom









[Mesa-dev] [RFC] New dma_buf -> EGLImage EGL extension - Final spec published!

2013-02-25 Thread Tom Cooksey
Hi All,

The final spec has had enum values assigned and been published on Khronos:

http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_import.txt

Thanks to all who've provided input.


Cheers,

Tom



> -Original Message-
> From: mesa-dev-bounces+tom.cooksey=arm.com at lists.freedesktop.org 
> [mailto:mesa-dev-
> bounces+tom.cooksey=arm.com at lists.freedesktop.org] On Behalf Of Tom Cooksey
> Sent: 04 October 2012 13:10
> To: mesa-dev at lists.freedesktop.org; linaro-mm-sig at lists.linaro.org; dri-
> devel at lists.freedesktop.org; linux-media at vger.kernel.org
> Subject: [Mesa-dev] [RFC] New dma_buf -> EGLImage EGL extension - New draft!
> 
> Hi All,
> 
> After receiving a fair bit of feedback (thanks!), I've updated the
> EGL_EXT_image_dma_buf_import spec and expanded it to resolve a number
> of the issues. Please find the latest draft below and let me know any
> additional feedback you might have, either on the lists or by private
> e-mail - I don't mind which.
> 
> I think the only remaining issue now is if we need a mechanism whereby
> an application can query which drm_fourcc.h formats EGL supports, or
> if just failing with EGL_BAD_MATCH when the application has used one
> EGL doesn't support is sufficient. Any thoughts?
> 
> 
> Cheers,
> 
> Tom
> 
> 
> 8<
> 
> 
> Name
> 
>     EXT_image_dma_buf_import
> 
> Name Strings
> 
> EGL_EXT_image_dma_buf_import
> 
> Contributors
> 
> Jesse Barker
> Rob Clark
> Tom Cooksey
> 
> Contacts
> 
> Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
> Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)
> 
> Status
> 
> DRAFT
> 
> Version
> 
> Version 4, October 04, 2012
> 
> Number
> 
> EGL Extension ???
> 
> Dependencies
> 
> EGL 1.2 is required.
> 
> EGL_KHR_image_base is required.
> 
> The EGL implementation must be running on a Linux kernel supporting the
> dma_buf buffer sharing mechanism.
> 
> This extension is written against the wording of the EGL 1.2 
> Specification.
> 
> Overview
> 
> This extension allows creating an EGLImage from a Linux dma_buf file
> descriptor or multiple file descriptors in the case of multi-plane YUV
> images.
> 
> New Types
> 
> None
> 
> New Procedures and Functions
> 
> None
> 
> New Tokens
> 
> Accepted by the <target> parameter of eglCreateImageKHR:
> 
> EGL_LINUX_DMA_BUF_EXT
> 
> Accepted as an attribute in the <attrib_list> parameter of
> eglCreateImageKHR:
> 
> EGL_LINUX_DRM_FOURCC_EXT
> EGL_DMA_BUF_PLANE0_FD_EXT
> EGL_DMA_BUF_PLANE0_OFFSET_EXT
> EGL_DMA_BUF_PLANE0_PITCH_EXT
> EGL_DMA_BUF_PLANE1_FD_EXT
> EGL_DMA_BUF_PLANE1_OFFSET_EXT
> EGL_DMA_BUF_PLANE1_PITCH_EXT
> EGL_DMA_BUF_PLANE2_FD_EXT
> EGL_DMA_BUF_PLANE2_OFFSET_EXT
> EGL_DMA_BUF_PLANE2_PITCH_EXT
> EGL_YUV_COLOR_SPACE_HINT_EXT
> EGL_SAMPLE_RANGE_HINT_EXT
> EGL_YUV_CHROMA_HORIZONTAL_SITING_HINT_EXT
> EGL_YUV_CHROMA_VERTICAL_SITING_HINT_EXT
> 
> Accepted as the value for the EGL_YUV_COLOR_SPACE_HINT_EXT attribute:
> 
> EGL_ITU_REC601_EXT
> EGL_ITU_REC709_EXT
> EGL_ITU_REC2020_EXT
> 
> Accepted as the value for the EGL_SAMPLE_RANGE_HINT_EXT attribute:
> 
> EGL_YUV_FULL_RANGE_EXT
> EGL_YUV_NARROW_RANGE_EXT
> 
> Accepted as the value for the EGL_YUV_CHROMA_HORIZONTAL_SITING_HINT_EXT &
> EGL_YUV_CHROMA_VERTICAL_SITING_HINT_EXT attributes:
> 
> EGL_YUV_CHROMA_SITING_0_EXT
> EGL_YUV_CHROMA_SITING_0_5_EXT
> 
> 
> Additions to Chapter 2 of the EGL 1.2 Specification (EGL Operation)
> 
> Add to section 2.5.1 "EGLImage Specification" (as defined by the
> EGL_KHR_image_base specification), in the description of
> eglCreateImageKHR:
> 
>"Values accepted for <target> are listed in Table aaa, below.
> 
>   +-------------------------+--------------------------------------------+
>   |  <target>               |  Notes                                     |
>   +-------------------------+--------------------------------------------+
>   |  EGL_LINUX_DMA_BUF_EXT  |   Used for EGLImages imported from Linux   |
>   |                         |   dma_buf file descriptors                 |
>   +-------------------------+--------------------------------------------+
>

RE: [Mesa-dev] [RFC] New dma_buf -> EGLImage EGL extension - Final spec published!

2013-02-25 Thread Tom Cooksey
Hi All,

The final spec has had enum values assigned and been published on Khronos:

http://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_image_dma_buf_import.txt

Thanks to all who've provided input.


Cheers,

Tom



> -Original Message-
> From: mesa-dev-bounces+tom.cooksey=arm@lists.freedesktop.org 
> [mailto:mesa-dev-
> bounces+tom.cooksey=arm@lists.freedesktop.org] On Behalf Of Tom Cooksey
> Sent: 04 October 2012 13:10
> To: mesa-...@lists.freedesktop.org; linaro-mm-...@lists.linaro.org; dri-
> de...@lists.freedesktop.org; linux-me...@vger.kernel.org
> Subject: [Mesa-dev] [RFC] New dma_buf -> EGLImage EGL extension - New draft!
> 
> Hi All,
> 
> After receiving a fair bit of feedback (thanks!), I've updated the
> EGL_EXT_image_dma_buf_import spec
> and expanded it to resolve a number of the issues. Please find the latest 
> draft below and let
> me
> know any additional feedback you might have, either on the lists or by 
> private e-mail - I
> don't mind
> which.
> 
> I think the only remaining issue now is if we need a mechanism whereby an 
> application can
> query
> which drm_fourcc.h formats EGL supports or if just failing with EGL_BAD_MATCH 
> when the
> application
> has use one EGL doesn't support is sufficient. Any thoughts?
> 
> 
> Cheers,
> 
> Tom
> 
> 
> 8<
> 
> 
> Name
> 
>     EXT_image_dma_buf_import
> 
> Name Strings
> 
> EGL_EXT_image_dma_buf_import
> 
> Contributors
> 
> Jesse Barker
> Rob Clark
> Tom Cooksey
> 
> Contacts
> 
> Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
> Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)
> 
> Status
> 
> DRAFT
> 
> Version
> 
> Version 4, October 04, 2012
> 
> Number
> 
> EGL Extension ???
> 
> Dependencies
> 
> EGL 1.2 is required.
> 
> EGL_KHR_image_base is required.
> 
> The EGL implementation must be running on a Linux kernel supporting the
> dma_buf buffer sharing mechanism.
> 
> This extension is written against the wording of the EGL 1.2 
> Specification.
> 
> Overview
> 
> This extension allows creating an EGLImage from a Linux dma_buf file
> descriptor or multiple file descriptors in the case of multi-plane YUV
> images.
> 
> New Types
> 
> None
> 
> New Procedures and Functions
> 
> None
> 
> New Tokens
> 
> Accepted by the <target> parameter of eglCreateImageKHR:
> 
> EGL_LINUX_DMA_BUF_EXT
> 
> Accepted as an attribute in the <attrib_list> parameter of
> eglCreateImageKHR:
> 
> EGL_LINUX_DRM_FOURCC_EXT
> EGL_DMA_BUF_PLANE0_FD_EXT
> EGL_DMA_BUF_PLANE0_OFFSET_EXT
> EGL_DMA_BUF_PLANE0_PITCH_EXT
> EGL_DMA_BUF_PLANE1_FD_EXT
> EGL_DMA_BUF_PLANE1_OFFSET_EXT
> EGL_DMA_BUF_PLANE1_PITCH_EXT
> EGL_DMA_BUF_PLANE2_FD_EXT
> EGL_DMA_BUF_PLANE2_OFFSET_EXT
> EGL_DMA_BUF_PLANE2_PITCH_EXT
> EGL_YUV_COLOR_SPACE_HINT_EXT
> EGL_SAMPLE_RANGE_HINT_EXT
> EGL_YUV_CHROMA_HORIZONTAL_SITING_HINT_EXT
> EGL_YUV_CHROMA_VERTICAL_SITING_HINT_EXT
> 
> Accepted as the value for the EGL_YUV_COLOR_SPACE_HINT_EXT attribute:
> 
> EGL_ITU_REC601_EXT
> EGL_ITU_REC709_EXT
> EGL_ITU_REC2020_EXT
> 
> Accepted as the value for the EGL_SAMPLE_RANGE_HINT_EXT attribute:
> 
> EGL_YUV_FULL_RANGE_EXT
> EGL_YUV_NARROW_RANGE_EXT
> 
> Accepted as the value for the EGL_YUV_CHROMA_HORIZONTAL_SITING_HINT_EXT &
> EGL_YUV_CHROMA_VERTICAL_SITING_HINT_EXT attributes:
> 
> EGL_YUV_CHROMA_SITING_0_EXT
> EGL_YUV_CHROMA_SITING_0_5_EXT
> 
> 
> Additions to Chapter 2 of the EGL 1.2 Specification (EGL Operation)
> 
> Add to section 2.5.1 "EGLImage Specification" (as defined by the
> EGL_KHR_image_base specification), in the description of
> eglCreateImageKHR:
> 
>"Values accepted for <target> are listed in Table aaa, below.
> 
>   +-------------------------+--------------------------------------------+
>   |        <target>         |                   Notes                    |
>   +-------------------------+--------------------------------------------+
>   |  EGL_LINUX_DMA_BUF_EXT  |   Used for EGLImages imported from Linux   |
>   |                         |   dma_buf file descriptors                 |
>   +-------------------------+--------------------------------------------+
>    Table aaa.  Legal values for eglCreateImageKHR <target> parameter

[Linaro-mm-sig] [RFC] New dma_buf -> EGLImage EGL extension

2012-10-04 Thread Tom Cooksey
Hi Rob,

> -Original Message-
> From: robdclark at gmail.com [mailto:robdclark at gmail.com] On Behalf Of Rob 
> Clark
> Sent: 03 October 2012 13:39
> To: Maarten Lankhorst
> Cc: Tom Cooksey; mesa-dev at lists.freedesktop.org; linaro-mm-sig at 
> lists.linaro.org; dri-
> devel at lists.freedesktop.org; Jesse Barker; linux-media at vger.kernel.org
> Subject: Re: [Linaro-mm-sig] [RFC] New dma_buf -> EGLImage EGL extension
>
> On Tue, Oct 2, 2012 at 2:10 PM, Maarten Lankhorst
>  wrote:
> > How do you want to deal with the case where Y' and CbCr are different 
> > hardware buffers?
> > Could some support for 2d arrays be added in case Y' and CbCr are separated 
> > into top/bottom
> fields?
> > How are semi-planar/planar formats handled that have a different 
> > width/height for Y' and
> CbCr? (YUV420)
>
> The API works (AFAIU) like drm addfb2 ioctl, take I420 for example,
> you could either do:
>
>   single buffer:
>  fd0 = fd
>  offset0 = 0
>  pitch0 = width
>  fd1 = fd
>  offset1 = width * height
>  pitch1 = width / 2
>  fd2 = fd
>  offset2 = offset1 + (width * height / 4)
>  pitch2 = width / 2
>
>   multiple buffers:
>  offset0 = offset1 = offset2 = 0
>  fd0 = fd_luma
>  fd1 = fd_u
>  fd2 = fd_v
>  ... and so on

Yup, that's pretty much how I'd envisaged it.
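
The single-buffer I420 arithmetic quoted above can be written out as a small C sketch (the struct and function names are illustrative, not part of any API; it assumes tightly-packed 8-bit samples with no row padding):

```c
#include <assert.h>
#include <stddef.h>

/* Plane layout for a tightly-packed I420 image in one dma_buf:
 * a full-resolution Y plane followed by quarter-resolution U and V planes. */
struct plane_layout { size_t offset, pitch; };

static void i420_single_buffer_layout(size_t width, size_t height,
                                      struct plane_layout p[3])
{
    p[0].offset = 0;                                   /* Y plane       */
    p[0].pitch  = width;
    p[1].offset = width * height;                      /* U follows Y   */
    p[1].pitch  = width / 2;
    p[2].offset = p[1].offset + (width * height) / 4;  /* V follows U   */
    p[2].pitch  = width / 2;
}
```

For a 640x480 image this puts the U plane at byte 307200 and the V plane at byte 384000, matching the addfb2-style offset/pitch scheme described above.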


> for interlaced/stereo.. is sticking our heads in sand an option?  :-P
>
> You could get lots of permutations for data layout of fields between
> interlaced and stereo.  One option might be to ignore and let the user
> create two egl-images and deal with blending in the shader?

I think for interlaced video the only option really is to create two EGLImages 
as the two fields have to be displayed at different times. If the application 
wanted to display them progressively they'd have to run a de-interlacing filter 
over the two images. Perhaps writing such a filter as a GLSL shader might not 
be such a bad idea, but it's kinda the app's problem. Same deal with stereo.
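
For the interlaced case, one way the two per-field EGLImages could be described over a single buffer is by doubling the pitch so each image sees only every other row. A minimal sketch, assuming a single-plane format with the two fields row-interleaved; the names are hypothetical, not from the extension:

```c
#include <assert.h>
#include <stddef.h>

/* Describe the two fields of a row-interleaved buffer as two independent
 * images: each field starts on a different row and skips every other line. */
struct field_view { size_t offset, pitch, height; };

static void split_fields(size_t pitch, size_t height,
                         struct field_view *top, struct field_view *bottom)
{
    top->offset    = 0;          /* even lines start at row 0        */
    top->pitch     = pitch * 2;  /* step over the other field's rows */
    top->height    = height / 2;
    bottom->offset = pitch;      /* odd lines start one row in       */
    bottom->pitch  = pitch * 2;
    bottom->height = height / 2;
}
```

Each resulting (offset, pitch) pair could then be imported as its own EGLImage and displayed (or de-interlaced in a shader) at the appropriate time.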


Cheers,

Tom

PS: I've updated the spec and sent out a new draft.


-- IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium.  Thank you.



[RFC] New dma_buf -> EGLImage EGL extension - New draft!

2012-10-04 Thread Tom Cooksey
Hi All,

After receiving a fair bit of feedback (thanks!), I've updated the
EGL_EXT_image_dma_buf_import spec and expanded it to resolve a number of
the issues. Please find the latest draft below and let me know any
additional feedback you might have, either on the lists or by private
e-mail - I don't mind which.

I think the only remaining issue now is if we need a mechanism whereby an
application can query which drm_fourcc.h formats EGL supports, or if just
failing with EGL_BAD_MATCH when the application has used one EGL doesn't
support is sufficient. Any thoughts?


Cheers,

Tom


8<


Name

EXT_image_dma_buf_import

Name Strings

EGL_EXT_image_dma_buf_import

Contributors

Jesse Barker
    Rob Clark
Tom Cooksey

Contacts

Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)

Status

DRAFT

Version

Version 4, October 04, 2012

Number

EGL Extension ???

Dependencies

EGL 1.2 is required.

EGL_KHR_image_base is required.

The EGL implementation must be running on a Linux kernel supporting the
dma_buf buffer sharing mechanism.

This extension is written against the wording of the EGL 1.2 Specification.

Overview

This extension allows creating an EGLImage from a Linux dma_buf file
descriptor or multiple file descriptors in the case of multi-plane YUV
images.

New Types

None

New Procedures and Functions

None

New Tokens

Accepted by the <target> parameter of eglCreateImageKHR:

EGL_LINUX_DMA_BUF_EXT

Accepted as an attribute in the <attrib_list> parameter of
eglCreateImageKHR:

EGL_LINUX_DRM_FOURCC_EXT
EGL_DMA_BUF_PLANE0_FD_EXT
EGL_DMA_BUF_PLANE0_OFFSET_EXT
EGL_DMA_BUF_PLANE0_PITCH_EXT
EGL_DMA_BUF_PLANE1_FD_EXT
EGL_DMA_BUF_PLANE1_OFFSET_EXT
EGL_DMA_BUF_PLANE1_PITCH_EXT
EGL_DMA_BUF_PLANE2_FD_EXT
EGL_DMA_BUF_PLANE2_OFFSET_EXT
EGL_DMA_BUF_PLANE2_PITCH_EXT
EGL_YUV_COLOR_SPACE_HINT_EXT
EGL_SAMPLE_RANGE_HINT_EXT
EGL_YUV_CHROMA_HORIZONTAL_SITING_HINT_EXT
EGL_YUV_CHROMA_VERTICAL_SITING_HINT_EXT

Accepted as the value for the EGL_YUV_COLOR_SPACE_HINT_EXT attribute:

EGL_ITU_REC601_EXT
EGL_ITU_REC709_EXT
EGL_ITU_REC2020_EXT

Accepted as the value for the EGL_SAMPLE_RANGE_HINT_EXT attribute:

EGL_YUV_FULL_RANGE_EXT
EGL_YUV_NARROW_RANGE_EXT

Accepted as the value for the EGL_YUV_CHROMA_HORIZONTAL_SITING_HINT_EXT &
EGL_YUV_CHROMA_VERTICAL_SITING_HINT_EXT attributes:

EGL_YUV_CHROMA_SITING_0_EXT
EGL_YUV_CHROMA_SITING_0_5_EXT


Additions to Chapter 2 of the EGL 1.2 Specification (EGL Operation)

Add to section 2.5.1 "EGLImage Specification" (as defined by the
EGL_KHR_image_base specification), in the description of
eglCreateImageKHR:

   "Values accepted for <target> are listed in Table aaa, below.

      +-------------------------+--------------------------------------------+
      |        <target>         |                   Notes                    |
      +-------------------------+--------------------------------------------+
      |  EGL_LINUX_DMA_BUF_EXT  |   Used for EGLImages imported from Linux   |
      |                         |   dma_buf file descriptors                 |
      +-------------------------+--------------------------------------------+
       Table aaa.  Legal values for eglCreateImageKHR <target> parameter

...

If <target> is EGL_LINUX_DMA_BUF_EXT, <dpy> must be a valid display, <ctx>
must be EGL_NO_CONTEXT, and <buffer> must be NULL, cast into the type
EGLClientBuffer. The details of the image are specified by the attributes
passed into eglCreateImageKHR. Required attributes and their values are as
follows:

* EGL_WIDTH & EGL_HEIGHT: The logical dimensions of the buffer in pixels

* EGL_LINUX_DRM_FOURCC_EXT: The pixel format of the buffer, as specified
  by drm_fourcc.h and used as the pixel_format parameter of the
  drm_mode_fb_cmd2 ioctl.

* EGL_DMA_BUF_PLANE0_FD_EXT: The dma_buf file descriptor of plane 0 of
  the image.

* EGL_DMA_BUF_PLANE0_OFFSET_EXT: The offset from the start of the
  dma_buf of the first sample in plane 0, in bytes.

* EGL_DMA_BUF_PLANE0_PITCH_EXT: The number of bytes between the start of
  subsequent rows of samples in plane 0. May have special meaning for
  non-linear formats.

For images in an RGB color-space or those using a single-plane YUV format,
only the first plane's file descriptor, offset & pitch should be specified.
For semi-planar YUV formats, the chroma samples are stored in plane 1 and
for fully planar formats, U-samples are stored in plane 1 and V-samples are
stored in plane 2. Planes 
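
Taken together, the required attributes above can be assembled into an eglCreateImageKHR attribute list. Below is a minimal sketch for a two-plane (semi-planar) NV12 buffer, assuming the chroma plane sits directly after the luma plane in the same dma_buf; the token values are those assigned in the published extension (defined locally so the sketch is self-contained), while the fd/stride values and the nv12_attribs helper are hypothetical:

```c
#include <assert.h>

/* Token values as assigned in the published EGL_EXT_image_dma_buf_import
 * spec and core EGL; defined here only to keep the sketch self-contained. */
#define EGL_HEIGHT                    0x3056
#define EGL_WIDTH                     0x3057
#define EGL_NONE                      0x3038
#define EGL_LINUX_DRM_FOURCC_EXT      0x3271
#define EGL_DMA_BUF_PLANE0_FD_EXT     0x3272
#define EGL_DMA_BUF_PLANE0_OFFSET_EXT 0x3273
#define EGL_DMA_BUF_PLANE0_PITCH_EXT  0x3274
#define EGL_DMA_BUF_PLANE1_FD_EXT     0x3275
#define EGL_DMA_BUF_PLANE1_OFFSET_EXT 0x3276
#define EGL_DMA_BUF_PLANE1_PITCH_EXT  0x3277
#define DRM_FORMAT_NV12               0x3231564e  /* fourcc('N','V','1','2') */

/* Fill attribs[] (19 ints, EGLint in a real build) for a semi-planar NV12
 * image whose CbCr plane follows the luma plane in the same dma_buf. */
static void nv12_attribs(int fd, int width, int height, int stride,
                         int attribs[19])
{
    int i = 0;
    attribs[i++] = EGL_WIDTH;                     attribs[i++] = width;
    attribs[i++] = EGL_HEIGHT;                    attribs[i++] = height;
    attribs[i++] = EGL_LINUX_DRM_FOURCC_EXT;      attribs[i++] = DRM_FORMAT_NV12;
    attribs[i++] = EGL_DMA_BUF_PLANE0_FD_EXT;     attribs[i++] = fd;
    attribs[i++] = EGL_DMA_BUF_PLANE0_OFFSET_EXT; attribs[i++] = 0;
    attribs[i++] = EGL_DMA_BUF_PLANE0_PITCH_EXT;  attribs[i++] = stride;
    attribs[i++] = EGL_DMA_BUF_PLANE1_FD_EXT;     attribs[i++] = fd;
    attribs[i++] = EGL_DMA_BUF_PLANE1_OFFSET_EXT; attribs[i++] = stride * height;
    attribs[i++] = EGL_DMA_BUF_PLANE1_PITCH_EXT;  attribs[i++] = stride;
    attribs[i++] = EGL_NONE;                      /* list terminator */
}
```

The list would then be passed as eglCreateImageKHR(dpy, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT, NULL, attribs), per the spec text above.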




[Linaro-mm-sig] [RFC] New dma_buf -> EGLImage EGL extension

2012-10-02 Thread Tom Cooksey
Hi Maarten,

Thanks for taking a look at this! Responses in-line...


Cheers,

Tom


> -Original Message-
> From: Maarten Lankhorst [mailto:m.b.lankhorst at gmail.com]
> Sent: 02 October 2012 13:10
> To: Tom Cooksey
> Cc: mesa-dev at lists.freedesktop.org; linaro-mm-sig at lists.linaro.org; dri-
> devel at lists.freedesktop.org; 'Jesse Barker'; linux-media at vger.kernel.org
> Subject: Re: [Linaro-mm-sig] [RFC] New dma_buf -> EGLImage EGL extension
> 
> Hey,
> 
> Bit late reply, hopefully not too late.
> 
> On 30-08-12 16:00, Tom Cooksey wrote:
> > Hi All,
> >
> > Over the last few months I've been working on & off with a few people from
> > Linaro on a new EGL extension. The extension allows constructing an EGLImage
> > from a (set of) dma_buf file descriptors, including support for multi-plane
> > YUV. I envisage the primary use-case of this extension to be importing video
> > frames from v4l2 into the EGL/GLES graphics driver to texture from.
> > Originally the intent was to develop this as a Khronos-ratified extension.
> > However, this is a little too platform-specific to be an officially
> > sanctioned Khronos extension. It also goes against the general "EGLStream"
> > direction the EGL working group is going in. As such, the general feeling
> > was to make this an EXT "multi-vendor" extension with no official stamp of
> > approval from Khronos. As this is no-longer intended to be a Khronos
> > extension, I've re-written it to be a lot more Linux & dma_buf specific. It
> > also allows me to circulate the extension more widely (I.e. To those outside
> > Khronos membership).
> >
> > ARM are implementing this extension for at least our Mali-T6xx driver and
> > likely earlier drivers too. I am sending this e-mail to solicit feedback,
> > both from other vendors who might implement this extension (Mesa3D?) and
> > from potential users of the extension. However, any feedback is welcome.
> > Please find the extension text as it currently stands below. There are several
> > open issues which I've proposed solutions for, but I'm not really happy with
> > those proposals and hoped others could chip-in with better ideas. There are
> > likely other issues I've not thought about which also need to be added and
> > addressed.
> >
> > Once there's a general consensus or if no-one's interested, I'll update the
> > spec, move it out of Draft status and get it added to the Khronos registry,
> > which includes assigning values for the new symbols.
> >
> >
> > Cheers,
> >
> > Tom
> >
> >
> > -8<-
> >
> >
> > Name
> >
> > EXT_image_dma_buf_import
> >
> > Name Strings
> >
> > EGL_EXT_image_dma_buf_import
> >
> > Contributors
> >
> > Jesse Barker
> > Rob Clark
> > Tom Cooksey
> >
> > Contacts
> >
> > Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
> > Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)
> >
> > Status
> >
> > DRAFT
> >
> > Version
> >
> > Version 3, August 16, 2012
> >
> > Number
> >
> > EGL Extension ???
> >
> > Dependencies
> >
> > EGL 1.2 is required.
> >
> > EGL_KHR_image_base is required.
> >
> > The EGL implementation must be running on a Linux kernel supporting the
> > dma_buf buffer sharing mechanism.
> >
> > This extension is written against the wording of the EGL 1.2
> > Specification.
> >
> > Overview
> >
> > This extension allows creating an EGLImage from a Linux dma_buf file
> > descriptor or multiple file descriptors in the case of multi-plane YUV
> > images.
> >
> > New Types
> >
> > None
> >
> > New Procedures and Functions
> >
> > None
> >
> > New Tokens
> >
> > Accepted by the <target> parameter of eglCreateImageKHR:
> >
> > EGL_LINUX_DMA_BUF_EXT
> >
> > Accepted as an attribute in the <attrib_list> parameter of
> > eglCreateImageKHR:
> >
> > EGL_LINUX_DRM_FOURCC_EXT
> > EGL_DMA_BUF_PLANE0_FD_EXT
> > EGL_DMA_BUF_PLANE0_OFFSET_EXT
> > EGL_DMA_BUF_PLANE0_PITCH_EXT
> > EGL_DMA_BUF_PLANE1_FD_EXT
> >


[RFC] New dma_buf -> EGLImage EGL extension

2012-08-30 Thread Tom Cooksey
Hi All,

Over the last few months I've been working on & off with a few people from
Linaro on a new EGL extension. The extension allows constructing an EGLImage
from a (set of) dma_buf file descriptors, including support for multi-plane
YUV. I envisage the primary use-case of this extension to be importing video
frames from v4l2 into the EGL/GLES graphics driver to texture from.
Originally the intent was to develop this as a Khronos-ratified extension.
However, this is a little too platform-specific to be an officially
sanctioned Khronos extension. It also goes against the general "EGLStream"
direction the EGL working group is going in. As such, the general feeling
was to make this an EXT "multi-vendor" extension with no official stamp of
approval from Khronos. As this is no longer intended to be a Khronos
extension, I've re-written it to be a lot more Linux & dma_buf specific. It
also allows me to circulate the extension more widely (i.e. to those outside
Khronos membership).

ARM are implementing this extension for at least our Mali-T6xx driver and
likely earlier drivers too. I am sending this e-mail to solicit feedback,
both from other vendors who might implement this extension (Mesa3D?) and
from potential users of the extension. However, any feedback is welcome.
Please find the extension text as it currently stands below. There are several
open issues which I've proposed solutions for, but I'm not really happy with
those proposals and hoped others could chip-in with better ideas. There are
likely other issues I've not thought about which also need to be added and
addressed.

Once there's a general consensus or if no-one's interested, I'll update the
spec, move it out of Draft status and get it added to the Khronos registry,
which includes assigning values for the new symbols.


Cheers,

Tom


-8<-


Name

EXT_image_dma_buf_import

Name Strings

EGL_EXT_image_dma_buf_import

Contributors

Jesse Barker
Rob Clark
Tom Cooksey

Contacts

Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)

Status

DRAFT

Version

Version 3, August 16, 2012

Number

EGL Extension ???

Dependencies

EGL 1.2 is required.

EGL_KHR_image_base is required.

The EGL implementation must be running on a Linux kernel supporting the
dma_buf buffer sharing mechanism.

This extension is written against the wording of the EGL 1.2
Specification.

Overview

This extension allows creating an EGLImage from a Linux dma_buf file
descriptor or multiple file descriptors in the case of multi-plane YUV
images.

New Types

None

New Procedures and Functions

None

New Tokens

Accepted by the <target> parameter of eglCreateImageKHR:

EGL_LINUX_DMA_BUF_EXT

Accepted as an attribute in the <attrib_list> parameter of
eglCreateImageKHR:

EGL_LINUX_DRM_FOURCC_EXT
EGL_DMA_BUF_PLANE0_FD_EXT
EGL_DMA_BUF_PLANE0_OFFSET_EXT
EGL_DMA_BUF_PLANE0_PITCH_EXT
EGL_DMA_BUF_PLANE1_FD_EXT
EGL_DMA_BUF_PLANE1_OFFSET_EXT
EGL_DMA_BUF_PLANE1_PITCH_EXT
EGL_DMA_BUF_PLANE2_FD_EXT
EGL_DMA_BUF_PLANE2_OFFSET_EXT
EGL_DMA_BUF_PLANE2_PITCH_EXT

Additions to Chapter 2 of the EGL 1.2 Specification (EGL Operation)

Add to section 2.5.1 "EGLImage Specification" (as defined by the
EGL_KHR_image_base specification), in the description of
eglCreateImageKHR:

   "Values accepted for <target> are listed in Table aaa, below.

      +-------------------------+--------------------------------------------+
      |        <target>         |                   Notes                    |
      +-------------------------+--------------------------------------------+
      |  EGL_LINUX_DMA_BUF_EXT  |   Used for EGLImages imported from Linux   |
      |                         |   dma_buf file descriptors                 |
      +-------------------------+--------------------------------------------+
       Table aaa.  Legal values for eglCreateImageKHR <target> parameter

...

If <target> is EGL_LINUX_DMA_BUF_EXT, <dpy> must be a valid display, <ctx>
must be EGL_NO_CONTEXT, and <buffer> must be NULL, cast into the type
EGLClientBuffer. The details of the image are specified by the attributes
passed into eglCreateImageKHR. Required attributes and their values are as
follows:

* EGL_WIDTH & EGL_HEIGHT: The logical dimensions of the buffer in pixels

* EGL_LINUX_DRM_FOURCC_EXT: The pixel format of the buffer, as specified
  by drm_fourcc.h and used as the pixel_format parameter of the
  drm_mode_fb_cmd2 ioctl.

* EGL_DMA_BUF_PLANE0_FD_EXT: The dma_buf file descriptor of plane 0 of
  the image.

* EGL_DMA_BUF_PLANE0_OFFSET_EXT: The offset from the start of the
  dma_buf of the first sample in plane 0, in bytes.

* EGL_DMA_BUF_PLANE0_PITCH_EX

[RFC] New dma_buf -> EGLImage EGL extension

2012-08-30 Thread Tom Cooksey
Hi All,

Over the last few months I've been working on & off with a few people from
Linaro on a new EGL extension. The extension allows constructing an EGLImage
from a (set of) dma_buf file descriptors, including support for multi-plane
YUV. I envisage the primary use-case of this extension to be importing video
frames from v4l2 into the EGL/GLES graphics driver to texture from.
Originally the intent was to develop this as a Khronos-ratified extension.
However, this is a little too platform-specific to be an officially
sanctioned Khronos extension. It also goes against the general "EGLStream"
direction the EGL working group is going in. As such, the general feeling
was to make this an EXT "multi-vendor" extension with no official stamp of
approval from Khronos. As this is no-longer intended to be a Khronos
extension, I've re-written it to be a lot more Linux & dma_buf specific. It
also allows me to circulate the extension more widely (I.e. To those outside
Khronos membership).

ARM are implementing this extension for at least our Mali-T6xx driver and
likely earlier drivers too. I am sending this e-mail to solicit feedback,
both from other vendors who might implement this extension (Mesa3D?) and
from potential users of the extension. However, any feedback is welcome.
Please find the extension text as it currently stands below. There several
open issues which I've proposed solutions for, but I'm not really happy with
those proposals and hoped others could chip-in with better ideas. There are
likely other issues I've not thought about which also need to be added and
addressed.

Once there's a general consensus or if no-one's interested, I'll update the
spec, move it out of Draft status and get it added to the Khronos registry,
which includes assigning values for the new symbols.


Cheers,

Tom


-8<-


Name

EXT_image_dma_buf_import

Name Strings

EGL_EXT_image_dma_buf_import

Contributors

Jesse Barker
Rob Clark
Tom Cooksey

Contacts

Jesse Barker (jesse 'dot' barker 'at' linaro 'dot' org)
Tom Cooksey (tom 'dot' cooksey 'at' arm 'dot' com)

Status

DRAFT

Version

Version 3, August 16, 2012

Number

EGL Extension ???

Dependencies

EGL 1.2 is required.

EGL_KHR_image_base is required.

The EGL implementation must be running on a Linux kernel supporting the
dma_buf buffer sharing mechanism.

This extension is written against the wording of the EGL 1.2
Specification.

Overview

This extension allows creating an EGLImage from a Linux dma_buf file
descriptor or multiple file descriptors in the case of multi-plane YUV
images.

New Types

None

New Procedures and Functions

None

New Tokens

Accepted by the  parameter of eglCreateImageKHR:

EGL_LINUX_DMA_BUF_EXT

Accepted as an attribute in the  parameter of
eglCreateImageKHR:

EGL_LINUX_DRM_FOURCC_EXT
EGL_DMA_BUF_PLANE0_FD_EXT
EGL_DMA_BUF_PLANE0_OFFSET_EXT
EGL_DMA_BUF_PLANE0_PITCH_EXT
EGL_DMA_BUF_PLANE1_FD_EXT
EGL_DMA_BUF_PLANE1_OFFSET_EXT
EGL_DMA_BUF_PLANE1_PITCH_EXT
EGL_DMA_BUF_PLANE2_FD_EXT
EGL_DMA_BUF_PLANE2_OFFSET_EXT
EGL_DMA_BUF_PLANE2_PITCH_EXT

Additions to Chapter 2 of the EGL 1.2 Specification (EGL Operation)

Add to section 2.5.1 "EGLImage Specification" (as defined by the
EGL_KHR_image_base specification), in the description of
eglCreateImageKHR:

   "Values accepted for  are listed in Table aaa, below.

 
+-++
  | |  Notes
|
 
+-++
  |  EGL_LINUX_DMA_BUF_EXT  |   Used for EGLImages imported from Linux
|
  | |   dma_buf file descriptors
|
 
+-++
   Table aaa.  Legal values for eglCreateImageKHR  parameter

...

If <target> is EGL_LINUX_DMA_BUF_EXT, <dpy> must be a valid display,
<ctx> must be EGL_NO_CONTEXT, and <buffer> must be NULL, cast into the
type EGLClientBuffer. The details of the image are specified by the
attributes passed into eglCreateImageKHR. Required attributes and their
values are as follows:

* EGL_WIDTH & EGL_HEIGHT: The logical dimensions of the buffer in
  pixels.

* EGL_LINUX_DRM_FOURCC_EXT: The pixel format of the buffer, as
  specified by drm_fourcc.h and used as the pixel_format parameter of
  the drm_mode_fb_cmd2 ioctl.

* EGL_DMA_BUF_PLANE0_FD_EXT: The dma_buf file descriptor of plane 0
  of the image.

* EGL_DMA_BUF_PLANE0_OFFSET_EXT: The offset from the start of the
  dma_buf of the first sample in plane 0, in bytes.

* EGL_DMA_BUF_PLANE0_PITCH_EXT: The number of bytes between the start
  of subsequent rows of samples in plane 0.

[RFC] dma-fence: dma-buf synchronization (v2)

2012-07-13 Thread Tom Cooksey
Hi Rob,

Yes, sorry we've been a bit slack progressing KDS publicly. Your
approach looks interesting and seems like it could enable both implicit
and explicit synchronization. A good compromise.


> From: Rob Clark 
> 
> A dma-fence can be attached to a buffer which is being filled or
> consumed by hw, to allow userspace to pass the buffer without waiting
> to another device.  For example, userspace can call page_flip ioctl to
> display the next frame of graphics after kicking the GPU but while the
> GPU is still rendering.  The display device sharing the buffer with the
> GPU would attach a callback to get notified when the GPU's rendering-
> complete IRQ fires, to update the scan-out address of the display,
> without having to wake up userspace.
> 
> A dma-fence is transient, one-shot deal.  It is allocated and attached
> to dma-buf's list of fences.  When the one that attached it is done,
> with the pending operation, it can signal the fence removing it from
> the dma-buf's list of fences:
> 
>   + dma_buf_attach_fence()
>   + dma_fence_signal()

It would be useful to have two lists of fences, those around writes to
the buffer and those around reads. The idea being that if you only want
to read from a buffer, you don't need to wait for fences around other
read operations, you only need to wait for the "last" writer fence. If
you do want to write to the buffer however, you need to wait for all
the read fences and the last writer fence. The use-case is when EGL
swap behaviour is EGL_BUFFER_PRESERVED. You have the display controller
reading the buffer with its fence defined to be signalled when it is
no-longer scanning out that buffer. It can only stop scanning out that
buffer when it is given another buffer to scan-out. If that next buffer
must be rendered by copying the currently scanned-out buffer into it
(one possible option for implementing EGL_BUFFER_PRESERVED) then you
essentially deadlock if the scan-out job blocks the "render the next
frame" job. 

There are probably variations of this idea; perhaps you only need a flag
to indicate whether a fence is around a read-only or read-write access?


> The intention is to provide a userspace interface (presumably via
> eventfd) later, to be used in conjunction with dma-buf's mmap support
> for sw access to buffers (or for userspace apps that would prefer to
> do their own synchronization).

From our experience with our own KDS, we've come up with an interesting
approach to synchronizing userspace applications which have a buffer
mmap'd. We wanted to avoid userspace being able to block jobs running
on hardware while still allowing userspace to participate. Our original
idea was to have a lock/unlock ioctl interface on a dma_buf but have
a timeout whereby the application's lock would be broken if held for
too long. That at least bounded how long userspace could potentially
block hardware making progress, though was pretty "harsh".

The approach we have now settled on is to instead only allow an
application to wait for all jobs currently pending for a buffer. So
there's no way userspace can prevent anything else from using a
buffer, other than not issuing jobs which will use that buffer.
Also, the interface we settled on was to add a poll handler to
dma_buf; that way userspace can select() on multiple dma_buf
buffers in one syscall. It can also choose whether it wants to wait
for only the last writer fence, i.e. wait until it can read (POLLIN),
or wait for all fences because it wants to write to the buffer
(POLLOUT). We quite like this, though it does restrict the utility a
little. An idea worth considering anyway.


My other thought is around atomicity. Could this be extended to
(safely) allow for hardware devices which might want to access
multiple buffers simultaneously? I think it probably can with
some tweaks to the interface? An atomic function which does 
something like "give me all the fences for all these buffers 
and add this fence to each instead/as-well-as"?


Cheers,

Tom






___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[Linaro-mm-sig] [RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-06-07 Thread Tom Cooksey


> >>> The bigger issue is the previous point about how to deal
> >>> with cases where the CPU doesn't really need to get involved as an
> >>> intermediary.
> >>>
> >>> CPU fallback access to the buffer is the only legit case where we
> >>> need a standardized API to userspace (since CPU access isn't already
> >>> associated w/ some other kernel device file where some extra ioctl
> >>> can be added)
> >>
> >> The CPU case will still need to wait on an arbitrarily backed sync
> >> primitive. It shouldn't need to know if it's backed by the gpu,
> >> camera, or dsp.
> >
> > Right, this is the one place we definitely need something.. some
> > userspace code would just get passed a dmabuf file descriptor and
> > want to mmap it and do something, without really knowing where it
> > came from. I *guess* we'll have to add some ioctl's to the dmabuf
> > fd.
> 
> I personally favor having sync primitives have their own anon inode
> vs. strictly coupling them with dma_buf.

I think this is really the crux of the matter - do we associate sync
objects with buffers or not. The approach ARM are suggesting _is_ to
associate the sync objects with the buffer and do this by adding
kds_resource* as a member of struct dma_buf. The main reason I want
to do this is because it doesn't require changes to existing
interfaces. Specifically, DRM/KMS & v4l2. These user/kernel interfaces
already allow userspace to specify the handle of a buffer the driver
should perform an operation on. What dma_buf has done is allowed those
driver-specific buffer handles to be exported from one driver and
imported into another. While new ioctls have been added to the v4l2 &
DRM interfaces for dma_buf, they have only been to allow the import &
export of driver-specific buffer objects. Once imported as a driver
specific buffer object, existing ioctls are re-used to perform
operations on those buffers (at least this is what PRIME does for DRM,
I'm not so sure about v4l2?). But my point is that no new "page flip
to this dma_buf fd" ioctl has been added to KMS, you use the existing
drm_mode_crtc_page_flip and specify an fb_id which has been imported
from a dma_buf.

If we associate sync objects with buffers, none of those device
specific ioctls which perform operations on buffer objects need to
be modified. It's just that internally, those drivers use kds or
something similar to make sure they don't tread on each other's
toes.

The alternative is to not associate sync objects with buffers and
have them be distinct entities, exposed to userspace. This gives
userspace more power and flexibility and might allow for use-cases
which an implicit synchronization mechanism can't satisfy - I'd
be curious to know any specifics here. However, every driver which
needs to participate in the synchronization mechanism will need
to have its interface with userspace modified to allow the sync
objects to be passed to the drivers. This seemed like a lot of
work to me, which is why I prefer the implicit approach. However,
I don't actually know what work is needed and think it should be
explored, i.e. how much work is it to add explicit sync object
support to the DRM & v4l2 interfaces?

E.g. I believe DRM/GEM's job dispatch API is "in-order", in which
case it might be easy to just add "wait for this fence" and
"signal this fence" ioctls. It seems vmwgfx already has
something similar to this? Could this work over having
to specify a list of sync objects to wait on and another list
of sync objects to signal for every operation (exec buf/page
flip)? What about for v4l2?

I guess my other thought is that implicit vs explicit is not
mutually exclusive, though I'd guess there'd be interesting
deadlocks to have to debug if both were in use _at the same
time_. :-)


Cheers,

Tom








[RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-05-25 Thread Tom Cooksey
> > There are multiple ways synchronization can be achieved, 
> > fences/sync objects is one common approach, however we're
> > presenting a different approach. Personally, I quite like 
> > fence sync objects, however we believe it requires a lot of 
> > userspace interfaces to be changed to pass around sync object
> > handles. Our hope is that the kds approach will require less 
> > effort to make use of as no existing userspace interfaces need
> > to be changed. E.g. To use explicit fences, the struct
> > drm_mode_crtc_page_flip would need a new members to pass in the
> > handle(s) of sync object(s) which the flip depends on (I.e.
> > don't flip until these fences fire). The additional benefit of
> > our approach is that it prevents userspace specifying dependency
> > loops which can cause a deadlock (see kds.txt for an explanation
> > of what I mean here).
> 
> It is easy to cause cyclic dependencies with implicit fences unless you
> are very sure that client can only cause linear implicit dependencies.

I'm not sure I know what you mean by linear implicit dependencies?


> But clients already have synchronization dependencies with userspace.
> That makes implicit synchronization possibly cause unexpected
> deadlocks.

Again, not sure what you mean here? Do you mean that userspace can
submit a piece of work to a driver which depends on something
else happening in userspace?


> Explicit synchronization is easier to debug because developer using
> explicit synchronization can track the dependencies in userspace. But
> of course that makes userspace API harder to use than API using
> implicitly synchronization.
> 
> But implicit synchronization can avoid client deadlock issues.
> Providing if client may never block "fence" from triggering in finite
> time when it is granted access. The page flip can be synchronized in
> that manner if client can't block HW from processing queued rendering.

Yes, I guess this is the critical point - this approach assumes that
when a client starts using a resource, it will only do so for a finite
amount of time. If userspace wanted to participate in the scheme, we
would probably need some kind of timeout, otherwise userspace could
prevent other devices from accessing a resource.


> You were talking about adding new parameter to page flip ioctl. I fail
> to see need for it because page flip already has fb object as parameter
> that should map to the implicit synchronization fence through dma_buf.

This is the point I was trying to make. With explicit fence objects
you do have to add a new parameter, whereas with this kds implicit
approach you do not - the buffer itself becomes the sync object.



> > While KDS defines a very generic mechanism, I am proposing that 
> > this code or at least the concepts be merged with the existing 
> 
> > dma_buf code, so a the struct kds_resource members get moved to
> > struct dma_buf, kds_* functions get renamed to dma_buf_*
> > functions, etc. So I guess what I'm saying is please don't review
> > the actual code just yet, only the concepts the code describes,
> > where kds_resource == dma_duf.
> 
> But the documented functionality sounds very much deadlock prone. If
> userspace gets exclusive access and needs to wait for implicit access
> synchronization.
> 
> app A has access to buffer X
> app B requests exclusive access to buffer X and blocks waiting for access
> app A makes synchronous IPC call to app B
> 
> I didn't read the actual code at all to figure out if that is possible
> scenario. But it sounds like possible scenario based on documentation
> talking EGL depending on exclusive access.

The intention was to use this mechanism for synchronizing between
drivers rather than between userspace processes; I think the userspace
access is somewhat an afterthought which will probably need some more
thought. In the example you give, app A making a synchronous IPC call
to app B breaks the requirement that clients complete in a finite
time, which in the case of userspace access could be enforced by a
timeout. Though I would have thought there's a better way to handle
this than just a timeout.


Cheers,

Tom






[RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-05-25 Thread Tom Cooksey
Hi All,

I realise it's been a while since this was last discussed, however I'd like
to bring up kernel-side synchronization again. By kernel-side
synchronization, I mean allowing multiple drivers/devices wanting to access
the same buffer to do so without bouncing up to userspace to resolve
dependencies such as "the display controller can't start scanning out a
buffer until the GPU has finished rendering into it". As such, this is
really just an optimization which reduces latency between, e.g., the GPU
finishing a rendering job and that buffer being scanned out. I appreciate
this particular example is already solved on desktop graphics cards as the
display controller and 3D core are both controlled by the same driver, so no
"generic" mechanism is needed. However on ARM SoCs, the 3D core (like an ARM
Mali) and display controller tend to be driven by separate drivers, so some
mechanism is needed to allow both drivers to synchronize their access to
buffers.

There are multiple ways synchronization can be achieved, fences/sync objects
is one common approach, however we're presenting a different approach.
Personally, I quite like fence sync objects, however we believe it requires
a lot of userspace interfaces to be changed to pass around sync object
handles. Our hope is that the kds approach will require less effort to make
use of, as no existing userspace interfaces need to be changed. E.g. to use
explicit fences, the struct drm_mode_crtc_page_flip would need a new member
to pass in the handle(s) of sync object(s) which the flip depends on (i.e.
don't flip until these fences fire). The additional benefit of our approach
is that it prevents userspace specifying dependency loops which can cause a
deadlock (see kds.txt for an explanation of what I mean here).

I have waited until now to bring this up again because I am now able to
share the code I was trying (and failing I think) to explain previously. The
code has now been released under the GPLv2 from ARM Mali's developer portal,
however I've attempted to turn that into a patch to allow it to be discussed
on this list. Please find the patch inline below.

While KDS defines a very generic mechanism, I am proposing that this code or
at least the concepts be merged with the existing dma_buf code, so the
struct kds_resource members get moved to struct dma_buf, kds_* functions get
renamed to dma_buf_* functions, etc. So I guess what I'm saying is please
don't review the actual code just yet, only the concepts the code describes,
where kds_resource == dma_buf.


Cheers,

Tom



Author: Tom Cooksey 
Date:   Fri May 25 10:45:27 2012 +0100

Add new system to allow synchronizing access to resources

See Documentation/kds.txt for details, however the general
idea is that this kds framework synchronizes multiple drivers
("clients") wanting to access the same resources, where a
resource is typically a 2D image buffer being shared around
using dma-buf.

Note: This patch is created by extracting the sources from the
tarball on <http://www.malideveloper.com/open-source-mali-gpus-linux-kernel-device-drivers---dev-releases.php>
and putting them in roughly the right places.

diff --git a/Documentation/kds.txt b/Documentation/kds.txt
new file mode 100644
index 000..a96db21
--- /dev/null
+++ b/Documentation/kds.txt
@@ -0,0 +1,113 @@
+#
+# (C) COPYRIGHT 2012 ARM Limited. All rights reserved.
+#
+# This program is free software and is provided to you under the terms of
the GNU General Public License version 2
+# as published by the Free Software Foundation, and any use by you of this
program is subject to the terms of such GNU licence.
+#
+# A copy of the licence is included with the program, and can also be
obtained from Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA. 
+#
+#
+
+
+==============================
+kds - Kernel Dependency System
+==============================
+
+Introduction
+
+kds provides a mechanism for clients to atomically lock down multiple
abstract resources.
+This can be done either synchronously or asynchronously.
+Abstract resources are used to allow a set of clients to use kds to control
access to any
+resource; an example is structured memory buffers.
+
+kds supports both exclusive locking of a buffer and shared access to
buffers.
+
+kds can be built either as an integrated feature of the kernel or as a
module.
+It supports being compiled as a module both in-tree and out-of-tree.
+
+
+Concepts
+
+A core concept in kds is abstract resources.
+A kds resource is just an abstraction for some client object, kds doesn't
care what it is.
+Typically EGL will consider UMP buffers as being resources, thus each UMP
buffer has
+a kds resource for synchronizing access to the buffer.
+
+kds allows a client to create and destroy the abstract resource objects.
+A new resour



[RFC] Synchronizing access to buffers shared with dma-buf between drivers/devices

2012-05-25 Thread Tom Cooksey
Hi All,

I realise it's been a while since this was last discussed, however I'd like
to bring up kernel-side synchronization again. By kernel-side
synchronization, I mean allowing multiple drivers/devices wanting to access
the same buffer to do so without bouncing up to userspace to resolve
dependencies such as "the display controller can't start scanning out a
buffer until the GPU has finished rendering into it". As such, this is
really just an optimization which reduces latency between E.g. The GPU
finishing a rendering job and that buffer being scanned out. I appreciate
this particular example is already solved on desktop graphics cards as the
display controller and 3D core are both controlled by the same driver, so no
"generic" mechanism is needed. However on ARM SoCs, the 3D core (like an ARM
Mali) and display controller tend to be driven by separate drivers, so some
mechanism is needed to allow both drivers to synchronize their access to
buffers.

There are multiple ways synchronization can be achieved, fences/sync objects
is one common approach, however we're presenting a different approach.
Personally, I quite like fence sync objects, however we believe it requires
a lot of userspace interfaces to be changed to pass around sync object
handles. Our hope is that the kds approach will require less effort to make
use of as no existing userspace interfaces need to be changed. E.g. To use
explicit fences, the struct drm_mode_crtc_page_flip would need a new members
to pass in the handle(s) of sync object(s) which the flip depends on (I.e.
don't flip until these fences fire). The additional benefit of our approach
is that it prevents userspace specifying dependency loops which can cause a
deadlock (see kds.txt for an explanation of what I mean here).
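To make that point concrete, here is a hypothetical sketch of the kind of member that would have to be added to the page-flip request for explicit fences. All names here are invented for illustration; this is not real or proposed UAPI:

```c
#include <assert.h>
#include <stdint.h>

/* Abbreviated stand-in for struct drm_mode_crtc_page_flip, plus the
 * hypothetical additions an explicit-fence scheme would require. */
struct fake_page_flip {
    uint32_t crtc_id;
    uint32_t fb_id;
    uint32_t flags;
    uint32_t reserved;
    uint64_t user_data;
    /* Invented fields: the flip would not be performed until every
     * sync object listed here has signalled. */
    uint32_t count_in_fences;
    uint64_t in_fence_handles_ptr; /* array of count_in_fences u32 handles */
};

/* A flip with no in-fences behaves exactly like today's ioctl. */
static inline int flip_needs_wait(const struct fake_page_flip *f)
{
    return f->count_in_fences != 0;
}
```

The point of KDS is that no such interface change is needed: the dependency ("don't flip until rendering finishes") is tracked per-buffer inside the kernel instead of being passed around as handles.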

I have waited until now to bring this up again because I am now able to
share the code I was trying (and failing I think) to explain previously. The
code has now been released under the GPLv2 from ARM Mali's developer portal,
however I've attempted to turn that into a patch to allow it to be discussed
on this list. Please find the patch inline below.

While KDS defines a very generic mechanism, I am proposing that this code or
at least the concepts be merged with the existing dma_buf code, so that the
struct kds_resource members get moved to struct dma_buf, kds_* functions get
renamed to dma_buf_* functions, etc. So I guess what I'm saying is please
don't review the actual code just yet, only the concepts the code describes,
where kds_resource == dma_buf.


Cheers,

Tom



Author: Tom Cooksey 
Date:   Fri May 25 10:45:27 2012 +0100

Add new system to allow synchronizing access to resources

See Documentation/kds.txt for details, however the general
idea is that this kds framework synchronizes multiple drivers
("clients") wanting to access the same resources, where a
resource is typically a 2D image buffer being shared around
using dma-buf.

Note: This patch is created by extracting the sources from the
tarball on <http://www.malideveloper.com/open-source-mali-gpus-linux-kernel-device-drivers---dev-releases.php> and putting them in
roughly the right places.

diff --git a/Documentation/kds.txt b/Documentation/kds.txt
new file mode 100644
index 000..a96db21
--- /dev/null
+++ b/Documentation/kds.txt
@@ -0,0 +1,113 @@
+#
+# (C) COPYRIGHT 2012 ARM Limited. All rights reserved.
+#
+# This program is free software and is provided to you under the terms of
+# the GNU General Public License version 2 as published by the Free
+# Software Foundation, and any use by you of this program is subject to
+# the terms of such GNU licence.
+#
+# A copy of the licence is included with the program, and can also be
+# obtained from Free Software Foundation, Inc., 51 Franklin Street,
+# Fifth Floor, Boston, MA 02110-1301, USA.
+#
+#
+
+
+==============================
+kds - Kernel Dependency System
+==============================
+
+Introduction
+
+kds provides a mechanism for clients to atomically lock down multiple
+abstract resources.
+This can be done either synchronously or asynchronously.
+Abstract resources are used to allow a set of clients to use kds to
+control access to any resource; an example is structured memory buffers.
+
+kds supports both exclusive locking of a buffer and shared access to
+buffers.
+
+kds can be built either as an integrated feature of the kernel or as a
+module.
+It supports being compiled as a module both in-tree and out-of-tree.
+
+
+Concepts
+
+A core concept in kds is abstract resources.
+A kds resource is just an abstraction for some client object; kds doesn't
+care what it is.
+Typically EGL will consider UMP buffers as being a resource, thus each
+UMP buffer has a kds resource for synchronization to the buffer.
+
+kds allows a client to create and destroy the abstract resource object

[Linaro-mm-sig] New "xf86-video-armsoc" DDX driver

2012-05-24 Thread Tom Cooksey


> -Original Message-
> From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: 21 May 2012 10:04
> To: Dave Airlie
> Cc: Tom Cooksey; linaro-mm-sig at lists.linaro.org; xorg-
> devel at lists.x.org; dri-devel at lists.freedesktop.org
> Subject: Re: [Linaro-mm-sig] New "xf86-video-armsoc" DDX driver
> 
> On Mon, May 21, 2012 at 09:55:06AM +0100, Dave Airlie wrote:
> > > * Define a new x-server sub-module interface to allow a seperate
> > > > .so 2D driver to be loaded (this is the approach the current
> > > > OMAP DDX uses).
> >
> > This seems the sanest.
> 
> Or go the intel glamour route and stitch together a somewhat generic 2d
> accel code on top of GL. That should give you reasonable (albeit likely
> not stellar) X render performance.
> -Daniel

I'm not sure that would perform well on a tile-based deferred renderer
like Mali. To perform well, we need to gather an entire frame's worth
of rendering/draw-calls before passing them to the GPU to render. I
believe this is not the typical use-case of EXA? How much of the
framebuffer is re-drawn between flushes?


Cheers,

Tom








New "xf86-video-armsoc" DDX driver

2012-05-21 Thread Tom Cooksey
Hi All,

For the last few months we (ARM MPD... "The Mali guys") have been working on
getting X.Org up and running with Mali T6xx (ARM's next-generation GPU IP).
The approach is very similar (well identical I think) to how things work on
OMAP: We use a DRM driver to manage the display controller via KMS. The KMS
driver also allocates both scan-out and pixmap/back buffers via the
DRM_IOCTL_MODE_CREATE_DUMB ioctl which is internally implemented with GEM.
When returning buffers to DRI clients, the x-server uses flink to get a
global handle to a buffer which it passes back to the DRI client (in our
case the Mali-T600 X11 EGL winsys). The client then uses the new PRIME
ioctls to export the GEM buffer it received from the x-server to a dma_buf
fd. This fd is then passed into the T6xx kernel driver via our own job
dispatch user/kernel API (we're not using DRM for driving the GPU, only the
display controller).

Note: ARM doesn't generally provide the display controller IP block, so this
is really for our customers/Linaro to develop, though we do have something
hacked up for ARM's own PL111 display controller on our Versatile Express
development platform which we'll be open sourcing/up-streaming asap via
Linaro. 

We believe most ARM SoCs are likely to work the same way, at least those
with 3rd-party GPU IP blocks/drivers (so everyone except Qualcomm & nVidia).
As mentioned, this is certainly how the OMAP integration works. As such,
we've taken the OMAP DDX driver Rob Clark wrote and hacked on it to make it
work for Mali. The patch is actually relatively small, which is not really
too surprising as all the driver is doing is allocating buffers and managing
a display controller via a device-agnostic interface (KMS). All the
device-specific code is kept in the DRM driver and the client GLES/EGL
library. Given that the DDX driver doesn't contain any device-specific code,
we're going to take the OMAP DDX as a baseline and try and make it more
generic. Our immediate goals are to make it work on our own Versatile
Express development platform and on Samsung's Exynos 5250 SoC, however our
hope is to have a single DDX driver which can cover OMAP, Exynos, ST-E's
Nova/Thor platforms and probably others too. It's even been suggested it
could work with Mesa's sw backend(?).

Anyway, the DDX is very much a work-in-progress and is still heavily branded
OMAP, even though it's working (almost) perfectly on VExpress & Exynos too
(re-branding isn't too high-up our priority list at the moment). We are
actively developing this driver and will be doing so in a public git
repository hosted by Linaro. We will not be maintaining any private
repository behind ARM's firewall or anything like that - you'll see what we
see. The first patches have now been pushed, so if anyone's interested in
seeing what we have so far or wants to track development, the tree is here:

http://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summary

Note: When we originally spoke to Rob Clark about this, he suggested we take
the already-generic xf86-video-modesetting and just add the dri2 code to it.
This is indeed how we started out, however as we progressed it became clear
that the majority of the code we wanted was in the omap driver and we were
having to work fairly hard to keep some of the original modesetting code.
This is why we've now changed tactic and just forked the OMAP driver,
something Rob is more than happy for us to do.


One thing the DDX driver isn't doing yet is making use of 2D hw blocks. In
the short-term, we will simply create a branch off of the "generic" master
for each SoC and add 2D hardware support there. We do however want a more
permanent solution which doesn't need a separate branch per SoC. Some of the
suggested solutions are:

* Add a new generic DRM ioctl API for larger 2D operations (I would imagine
small blits/blends would be done in SW).
* Use SW rendering for everything other than solid blits and use v4l2's
blitting API for those (importing/exporting buffers to be blitted using
dma_buf). The theory here is that most UIs are rendered with GLES and so you
only need 2D hardware for blits. I think we'll prototype this approach on
Exynos.
* Define a new x-server sub-module interface to allow a separate .so 2D
driver to be loaded (this is the approach the current OMAP DDX uses).

We are hoping someone might have some advice & suggestions on how to proceed
with regards to 2D. We're also very interested in any feedback, both on the
DDX driver specifically and on the approach we're taking in general.


Cheers,

Tom






New "xf86-video-armsoc" DDX driver

2012-05-21 Thread Tom Cooksey
Hi All,

For the last few months we (ARM MPD... "The Mali guys") have been working on
getting X.Org up and running with Mali T6xx (ARM's next-generation GPU IP).
The approach is very similar (well identical I think) to how things work on
OMAP: We use a DRM driver to manage the display controller via KMS. The KMS
driver also allocates both scan-out and pixmap/back buffers via the
DRM_IOCTL_MODE_CREATE_DUMB ioctl which is internally implemented with GEM.
When returning buffers to DRI clients, the x-server uses flink to get a
global handle to a buffer which it passes back to the DRI client (in our
case the Mali-T600 X11 EGL winsys). The client then uses the new PRIME
ioctls to export the GEM buffer it received from the x-server to a dma_buf
fd. This fd is then passed into the T6xx kernel driver via our own job
dispatch user/kernel API (we're not using DRM for driving the GPU, only the
display controller).

Note: ARM doesn't generally provide the display controller IP block, so this
is really for our customers/Linaro to develop, though we do have something
hacked up for ARM's own PL111 display controller on our Versatile Express
development platform which we'll be open sourcing/up-streaming asap via
Linaro. 

We believe most ARM SoCs are likely to work the same way, at least those
with 3rd-party GPU IP blocks/drivers (so everyone except Qualcomm & nVidia).
As mentioned, this is certainly how the OMAP integration works. As such,
we've taken the OMAP DDX driver Rob Clark wrote and hacked on it to make it
work for Mali. The patch is actually relatively small, which is not really
too surprising as all the driver is doing is allocating buffers and managing
a display controller via a device-agnostic interface (KMS). All the
device-specific code is kept in the DRM driver and the client GLES/EGL
library. Given that the DDX driver doesn't contain any device-specific code,
we're going to take the OMAP DDX as a baseline and try and make it more
generic. Our immediate goals are to make it work on our own Versatile
Express development platform and on Samsung's Exynos 5250 SoC, however our
hope is to have a single DDX driver which can cover OMAP, Exynos, ST-E's
Nova/Thor platforms and probably others too. It's even been suggested it
could work with Mesa's sw backend(?).

Anyway, the DDX is very much a work-in-progress and is still heavily branded
OMAP, even though it's working (almost) perfectly on VExpress & Exynos too
(re-branding isn't too high-up our priority list at the moment). We are
actively developing this driver and will be doing so in a public git
repository hosted by Linaro. We will not be maintaining any private
repository behind ARM's firewall or anything like that - you'll see what we
see. The first patches have now been pushed, so if anyone's interested in
seeing what we have so far or wants to track development, the tree is here:

http://git.linaro.org/gitweb?p=arm/xorg/driver/xf86-video-armsoc.git;a=summa
ry

Note: When we originally spoke to Rob Clark about this, he suggested we take
the already-generic xf86-video-modesetting and just add the dri2 code to it.
This is indeed how we started out, however as we progressed it became clear
that the majority of the code we wanted was in the omap driver and were
having to work fairly hard to keep some of the original modesetting code.
This is why we've now changed tactic and just forked the OMAP driver,
something Rob is more than happy for us to do.


One thing the DDX driver isn't doing yet is making use of 2D hw blocks. In
the short-term, we will simply create a branch off of the "generic" master
for each SoC and add 2D hardware support there. We do however want a more
permanent solution which doesn't need a separate branch per SoC. Some of the
suggested solutions are:

* Add a new generic DRM ioctl API for larger 2D operations (I would imagine
small blits/blends would be done in SW).
* Use SW rendering for everything other than solid blits and use v4l2's
blitting API for those (importing/exporting buffers to be blitted using
dma_buf). The theory here is that most UIs are rendered with GLES and so you
only need 2D hardware for blits. I think we'll prototype this approach on
Exynos.
* Define a new x-server sub-module interface to allow a seperate .so 2D
driver to be loaded (this is the approach the current OMAP DDX uses).

We are hoping someone might have some advice & suggestions on how to proceed
with regards to 2D. We're also very interested in any feedback, both on the
DDX driver specifically and on the approach we're taking in general.


Cheers,

Tom




___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[PATCH] RFC: dma-buf: userspace mmap support

2012-03-19 Thread Tom Cooksey


> -Original Message-
> From: Alan Cox [mailto:alan at lxorguk.ukuu.org.uk]
> Sent: 19 March 2012 16:57
> To: Tom Cooksey
> Cc: 'Rob Clark'; linaro-mm-sig at lists.linaro.org; dri-
> devel at lists.freedesktop.org; linux-media at vger.kernel.org;
> rschultz at google.com; Rob Clark; sumit.semwal at linaro.org;
> patches at linaro.org
> Subject: Re: [PATCH] RFC: dma-buf: userspace mmap support
> 
> > If the API was to also be used for synchronization it would have to
> > include an atomic "prepare multiple" ioctl which blocked until all
> > the buffers listed by the application were available. In the same
> 
> Too slow already. You are now serializing stuff while what we want to
> do
> really is
> 
>   nobody_else_gets_buffers_next([list])
>   on available(buffer)
>   dispatch_work(buffer)
> 
> so that you can maximise parallelism without allowing deadlocks. If
> you've got a high memory bandwith and 8+ cores the 'stop everything'
> model isn't great.

Yes, sorry I wasn't clear here. By atomic I meant that a job starts
using all buffers at the same time, once they are available. You are
right, a job waiting for a list of buffers to become available should
not prevent other jobs running or queuing new jobs (eughh). We actually
have the option of using asynchronous call-backs in KDS: A driver lists
all the buffers it needs when adding a job and that job gets added to
the FIFO of each buffer as an atomic operation. However, once the job
is added to all the FIFOs, that atomic operation is complete and another
job can be "queued" up. When a job completes, it is removed from each
buffer's FIFO. At that point, all the "next" jobs in each buffer's FIFO
are evaluated to see if they can run. If they can run, the job's
"start" call-back is called. There's also a synchronous mode of
operation where a blocked thread is "woken up" instead of calling a
call-back function. It is this synchronous mode I would imagine
would be used for user-space access.
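The per-buffer FIFO scheme described above can be modelled in a few lines of plain C. This is only a userspace sketch with invented names (struct resource, add_job, complete_job); the real KDS implementation differs:

```c
#include <assert.h>

#define MAX_WAITERS 8

struct job {
    int deps_remaining; /* buffers this job is still waiting behind */
    int started;        /* set when the "start" callback would fire */
};

/* One abstract resource (buffer) with a FIFO of pending jobs. */
struct resource {
    struct job *fifo[MAX_WAITERS];
    int head, tail;
};

/* Adding a job appends it to every resource's FIFO as one atomic step
 * (a single lock would cover this in the kernel). It starts immediately
 * only if it is at the head of every FIFO it joined. */
static void add_job(struct job *j, struct resource **res, int n)
{
    j->deps_remaining = 0;
    for (int i = 0; i < n; i++) {
        res[i]->fifo[res[i]->tail++ % MAX_WAITERS] = j;
        if (res[i]->fifo[res[i]->head % MAX_WAITERS] != j)
            j->deps_remaining++; /* another job is ahead of us here */
    }
    if (j->deps_remaining == 0)
        j->started = 1; /* "start" callback fires */
}

/* Completing a job removes it from each FIFO and re-evaluates the next
 * job in each FIFO to see whether it can now run. */
static void complete_job(struct job *j, struct resource **res, int n)
{
    (void)j;
    for (int i = 0; i < n; i++) {
        res[i]->head++;
        if (res[i]->head < res[i]->tail) {
            struct job *next = res[i]->fifo[res[i]->head % MAX_WAITERS];
            if (--next->deps_remaining == 0)
                next->started = 1;
        }
    }
}
```

With two jobs queued against the same two buffers, the second job's "start" fires only once the first completes, without any thread blocking in between, which is the asynchronous mode described above.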


> > This might be a good argument for keeping synchronization and cache
> > maintenance separate, though even ignoring synchronization I would
> > think being able to issue cache maintenance operations for multiple
> > buffers in a single ioctl might present some small efficiency gains.
> > However as Rob points out, CPU access is already in slow/legacy
> > territory.
> 
> Dangerous assumption. I do think they should be separate. For one it
> makes the case of synchronization needed but hardware cache management
> much easier to split cleanly. Assuming CPU access is slow/legacy
> reflects a certain model of relatively slow CPU and accelerators
> where falling off the acceleration path is bad. On a higher end
> processor falling off the acceleration path isn't a performance
> matter so much as a power concern.

On some GPU architectures, glReadPixels is a _very_ heavy-weight
operation, so is very much a performance issue and I think always
will be. However I think this might be a special case for certain
GPUs: Other GPU architectures or device-types might be able to
share data with the CPU without such a large impact to performance.
The example of writing subtitles onto a video frame decoded by
a v4l2 hardware codec seems a good example.

> > KDS we differentiated jobs which needed "exclusive access" to a
> > buffer and jobs which needed "shared access" to a buffer. Multiple
> > jobs could access a buffer at the same time if those jobs all
> 
> Makes sense as it's a reader/writer lock and it reflects MESI/MOESI
> caching and cache policy in some hardware/software assists.

Actually, this got me thinking... Several ARM implementations rely
on CPU/NEON to perform X.Org's 2D operations and those tend to
operate directly on the framebuffer. So in that case, both the CPU
and display controller need to access the same buffer at the same
time, even though one of them is writing to the buffer. This is
the main reason we called it shared/exclusive access in KDS rather
than read-only/read-write access. In such scenarios, you'd still
want to do a CPU cache flush after CPU-based 2D drawing is complete
to make sure the display controller "saw" those changes. So yes,
perhaps there's actually a use-case where synchronization must be
kept separate to cache-maintenance? In which case, it is worth
making the proposed prepare/finish API more explicit in that it is
a CPU cache invalidate and CPU cache flush operation only? Or are
there other things one might want to do in prepare/finish?
Automatic cache domain tracking for example?
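The shared/exclusive rule above amounts to a reader/writer-style admission check, except that "shared" means "can coexist" rather than strictly "read-only" (the CPU/NEON 2D case writes the framebuffer while the display scans it out). A minimal model, with invented names:

```c
#include <assert.h>

enum access { ACCESS_SHARED, ACCESS_EXCLUSIVE };

struct buffer_state {
    int shared_count; /* active shared accessors (scanout, texture reads, ...) */
    int exclusive;    /* 1 while an exclusive accessor (e.g. GPU render) holds it */
};

/* Returns 1 and records the access if it may start now, else 0. */
static int try_access(struct buffer_state *b, enum access a)
{
    if (b->exclusive)
        return 0;              /* exclusive holder active: nobody else */
    if (a == ACCESS_SHARED) {
        b->shared_count++;     /* shared accessors coexist */
        return 1;
    }
    if (b->shared_count)
        return 0;              /* exclusive must wait for shared users */
    b->exclusive = 1;
    return 1;
}

static void end_access(struct buffer_state *b, enum access a)
{
    if (a == ACCESS_SHARED)
        b->shared_count--;
    else
        b->exclusive = 0;
}
```

Note that nothing here says a shared accessor cannot write; the contract is only that all concurrent shared accessors tolerate each other, with any cache maintenance (e.g. the post-2D-drawing flush) handled separately.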


> > display controller will be reading the front buffer, but the GPU
> > might also need to read that front buffer. 

[PATCH] RFC: dma-buf: userspace mmap support

2012-03-19 Thread Tom Cooksey


> -Original Message-
> From: Alan Cox [mailto:alan at lxorguk.ukuu.org.uk]
> Sent: 17 March 2012 20:17
> To: Tom Cooksey
> Cc: 'Rob Clark'; linaro-mm-sig at lists.linaro.org; dri-
> devel at lists.freedesktop.org; linux-media at vger.kernel.org;
> rschultz at google.com; Rob Clark; sumit.semwal at linaro.org;
> patches at linaro.org
> Subject: Re: [PATCH] RFC: dma-buf: userspace mmap support
> 
> > > dma-buf file descriptor.  Userspace access to the buffer should be
> > > bracketed with DMA_BUF_IOCTL_{PREPARE,FINISH}_ACCESS ioctl calls to
> > > give the exporting driver a chance to deal with cache
> synchronization
> > > and such for cached userspace mappings without resorting to page
> 
> There should be flags indicating if this is necessary. We don't want
> extra
> syscalls on hardware that doesn't need it. The other question is what
> info is needed as you may only want to poke a few pages out of cache
> and
> the prepare/finish on its own gives no info.
> 
> > E.g. If another device was writing to the buffer, the prepare ioctl
> > could block until that device had finished accessing that buffer.
> 
> How do you avoid deadlocks on this ? We need very clear ways to ensure
> things always complete in some form given multiple buffer
> owner/requestors and the fact this API has no "prepare-multiple-
> buffers"
> support.

Yes, good point.

If the API was to also be used for synchronization it would have to
include an atomic "prepare multiple" ioctl which blocked until all
the buffers listed by the application were available. In the same
way, the kernel interface would also need to allow drivers to pass a
list of buffers a job will access in an atomic "add job" operation.
Actually, our current "KDS" (Kernel Dependency System) implementation
already works like this.

This might be a good argument for keeping synchronization and cache
maintenance separate, though even ignoring synchronization I would
think being able to issue cache maintenance operations for multiple
buffers in a single ioctl might present some small efficiency gains.
However as Rob points out, CPU access is already in slow/legacy
territory.

Note: Making the ioctl a "prepare multiple" would at least prevent
accidental dead-locks due to cross-dependencies, etc., but I think
some kind of watchdog/timeout would be useful on userspace locks to
stop a malicious application from preventing other devices and
processes from using buffers indefinitely.
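One standard way a "prepare multiple" implementation stays deadlock-free is to acquire the per-buffer locks in a single global order (e.g. ascending buffer id), regardless of the order the caller listed them. A tiny illustration of why that removes lock-order cycles (names invented):

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_id(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Canonicalise the order in which a caller's buffer locks are taken.
 * Two callers listing the same buffers in opposite orders end up
 * acquiring in the identical sequence, so neither can hold a lock the
 * other needs while waiting for one the other holds. */
static void sort_lock_order(int *ids, int n)
{
    qsort(ids, n, sizeof(int), cmp_id);
}
```

This only prevents deadlock between well-behaved prepare-multiple callers; the watchdog/timeout mentioned above would still be wanted against a process that takes locks and simply never calls finish.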

Finally, it's probably worth mentioning that when we implemented
KDS we differentiated jobs which needed "exclusive access" to a
buffer and jobs which needed "shared access" to a buffer. Multiple
jobs could access a buffer at the same time if those jobs all
indicated they only needed shared access. Typically this would be
a job which will only read a buffer, such as a display controller
or texture read. The main use-case for this was implementing EGL's
preserved swap behaviour when using "buffer flipping". Here, the
display controller will be reading the front buffer, but the GPU
might also need to read that front buffer. So perhaps adding
"read-only" & "read-write" access flags to prepare could also be
interpreted as shared & exclusive accesses, if we went down this
route for synchronization that is. :-)


Cheers,

Tom








[PATCH] RFC: dma-buf: userspace mmap support

2012-03-16 Thread Tom Cooksey

> From: Rob Clark 
> 
> Enable optional userspace access to dma-buf buffers via mmap() on the
> dma-buf file descriptor.  Userspace access to the buffer should be
> bracketed with DMA_BUF_IOCTL_{PREPARE,FINISH}_ACCESS ioctl calls to
> give the exporting driver a chance to deal with cache synchronization
> and such for cached userspace mappings without resorting to page
> faulting tricks.  The reasoning behind this is that, while drm
> drivers tend to have all the mechanisms in place for dealing with
> page faulting tricks, other driver subsystems may not.  And in
> addition, while page faulting tricks make userspace simpler, there
> are some associated overheads.

Speaking for the ARM Mali T6xx driver point of view, this API looks
good for us. Our use-case for mmap is glReadPixels and
glTex[Sub]Image2D on buffers the driver has imported via dma_buf. In
the case of glReadPixels, the finish ioctl isn't strictly necessary
as the CPU won't have written to the buffer and so doesn't need
flushing. As such, we'd get an additional cache flush which isn't
really necessary. But hey, it's glReadPixels - it's supposed to be
slow. :-)

I think requiring the finish ioctl in the API contract is a good
idea, even if the CPU has only done a ro access as it allows future
enhancements*. To "fix" the unnecessary flush in glReadPixels, I
think we'd like to keep the finish but see an "access type"
parameter added to the prepare ioctl, indicating whether the access
is ro or rw, so the cache flush in finish can be skipped if the
access was ro. As Rebecca says, a debug feature could even be added
to
re-map the pages as ro in prepare(ro) to catch naughty accesses. I'd
also go as far as to say the debug feature should completely unmap
the pages after finish too. Though for us, both the access-type
parameter and debug features are "nice to haves" - we can make
progress with the code as it currently stands (assuming exporters
start using the API that is).

Something which also came up when discussing internally is the topic
of mmap APIs of the importing device driver. For example, I believe
DRM has an mmap API on GEM buffer objects. If a new dma_buf import
ioctl was added to GEM (maybe the PRIME patches already add this),
how would GEM's bo mmap API work?


* Future enhancements: The prepare/finish bracketing could be used
as part of a wider synchronization scheme with other devices.
E.g. If another device was writing to the buffer, the prepare ioctl
could block until that device had finished accessing that buffer.
In the same way, another device could be blocked from accessing that
buffer until the client process called finish. We have already
started playing with such a scheme in the T6xx driver stack we're
terming "kernel dependency system". In this scheme each buffer has a
FIFO of "buffer consumers" waiting to access a buffer. The idea
being that a "buffer consumer" is fairly abstract and could be any
device or userspace process participating in the synchronization
scheme. Examples would be GPU jobs, display controller "scan-out"
jobs, etc.

So for example, a userspace application could dispatch a GPU
fragment shading job into the GPU's kernel driver which will write
to a KMS scanout buffer. The application then immediately issues a
drm_mode_crtc_page_flip ioctl on the display controller's DRM driver
to display the soon-to-be-rendered buffer. Inside the kernel, the
GPU driver adds the fragment job to the dma_buf's FIFO. As the FIFO
was empty, dma_buf calls into the GPU kernel driver to tell it that
it "owns" access to the buffer, and the GPU driver schedules the job to
run on the GPU. Upon receiving the drm_mode_crtc_page_flip ioctl,
the DRM/KMS driver adds a scan-out job to the buffer's FIFO.
However, the FIFO already has the GPU's fragment shading job in it
so nothing happens until the GPU job completes. When the GPU job
completes, the GPU driver calls into dma_buf to mark its job
complete. dma_buf then takes the next job in its FIFO, which is the
KMS driver's scan-out job, and calls into the KMS driver to schedule
the pageflip. The result? A buffer gets scanned out as soon as it has
finished being rendered without needing a round-trip to userspace.
Sure, there are easier ways to achieve that goal, but the idea is
that the mechanism can be used to synchronize access across multiple
devices, which makes it useful for lots of other use-cases too.


As I say, we have already implemented something which works as I
describe but where the buffers are abstract resources not linked to
dma_buf. I'd like to discuss the finer points of the mechanisms
further, but if it's looking like there's interest in this approach
we'll start re-writing the code we have to sit on-top of dma_buf
and posting it as RFCs to the various lists. The intention is to get
this to mainline, if mainline wants it. :-)

Personally, what I particularly like about this approach to
synchronization is that it doesn't require any interfaces to be
modified. I think/hope that makes it easier t
