subject:"Re\: \[RFC v2 1\/2\] dma\-buf\: Introduce dma buffer sharing mechanism"

2012/1/9 Daniel Vetter dan...@ffwll.ch:
On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote:
I has test dmabuf based drm gem module for exynos and I found one problem.
you can refer to this test repository:
http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf

at this repository, I added some exception codes for resource release
in addition to Dave's patch sets.

let's suppose we use dmabuf based vb2 and drm gem with physically
continuous memory(no IOMMU) and we try to share allocated buffer
between them(v4l2 and drm driver).

for this problem, I added get_shared_cnt() callback to dma-buf.h but
I'm not sure that this is good way. maybe there may be better way.
if there is any missing point, please let me know.

when the exporter creates the dma_buf handle(for example, gem - fd),
I think the refcount of gem object should be increased at this point,
and decreased by dma_buf-ops-release() again because when the
dma_buf is created and dma_buf_export() is called, this dma_buf refers
to the gem object one time. and in case of inporter(fd - gem),
file-f_count of the dma_buf is increased and then when this gem
object is released by user request such as drm close or
drn_gem_close_ioctl, dma_buf_put() should be called by
dma_buf-ops-detach() to decrease file-f_count again because the gem
object refers to the dma_buf. for this, you can refer to my test
repository I mentioned above. but the problem is that when a buffer is
released by one side, another can't know whether the buffer already
was released or not.
note : in case of sharing a buffer between v4l2 and drm driver, the
memory info would be copied vb2_xx_buf to xx_gem or xx_gem to
vb2_xx_buf through sg table. in this case, only memory info is used to
share, not some objects.

If this doesn't work like that currently, we have a bug, and exporting the
reference count or something similar can't fix that.

Yours, Daniel

PS: Please cut down the original mail when replying, otherwise it's pretty
hard to find your response ;-)

Ok, got it. thanks. :)

--
Daniel Vetter
Mail: dan...@ffwll.ch
Mobile: +41 (0)79 365 57 48
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012-01-09 Thread Daniel Vetter

On Mon, Jan 09, 2012 at 07:10:25PM +0900, InKi Dae wrote:
2012/1/9 Daniel Vetter dan...@ffwll.ch:
On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote:
I has test dmabuf based drm gem module for exynos and I found one problem.
you can refer to this test repository:
http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf

at this repository, I added some exception codes for resource release
in addition to Dave's patch sets.

let's suppose we use dmabuf based vb2 and drm gem with physically
continuous memory(no IOMMU) and we try to share allocated buffer
between them(v4l2 and drm driver).

for this problem, I added get_shared_cnt() callback to dma-buf.h but
I'm not sure that this is good way. maybe there may be better way.
if there is any missing point, please let me know.

Nope, dma_buf_put should not be called by -detach. The importer gets his
reference when importing the dma_buf and needs to drop that reference
himself when it's done using the buffer by calling dma_buf_put (i.e. after
the last -detach call). I think adding separate reference counting to
-attach and -detach is a waste of time and only papers over buggy
importers.

Additionally the importer does _not_ control the lifetime of an dma_buf
object and it's underlying backing storage. It hence may _never_ free the
backing storage itself, that's the job of the exporter.

With that cleared up, referencing the exporters underlying buffer object
from the dma_buf will just do the right thing.

note : in case of sharing a buffer between v4l2 and drm driver, the
memory info would be copied vb2_xx_buf to xx_gem or xx_gem to
vb2_xx_buf through sg table. in this case, only memory info is used to
share, not some objects.

Hm, maybe I need to take a look at the currently proposed v4l dma_buf
patches ;-) atm I don't have an idea what exactly you're talking about.

Yours, Daniel
--
Daniel Vetter
Mail: dan...@ffwll.ch
Mobile: +41 (0)79 365 57 48
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012/1/9 Daniel Vetter dan...@ffwll.ch:
 On Mon, Jan 09, 2012 at 07:10:25PM +0900, InKi Dae wrote:
 2012/1/9 Daniel Vetter dan...@ffwll.ch:
  On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote:
  I has test dmabuf based drm gem module for exynos and I found one problem.
  you can refer to this test repository:
  http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf
 
  at this repository, I added some exception codes for resource release
  in addition to Dave's patch sets.
 
  let's suppose we use dmabuf based vb2 and drm gem with physically
  continuous memory(no IOMMU) and we try to share allocated buffer
  between them(v4l2 and drm driver).
 
  1. request memory allocation through drm gem interface.
  2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the
  gem object.
  - internally, private gem based dmabuf moudle calls drm_buf_export()
  to register allocated gem object to fd.
  3. request qbuf with the fd(got from 2) and DMABUF type to set the
  buffer to v4l2 based device.
  - internally, vb2 plug in module gets a buffer to the fd and then
  calls dmabuf-ops-map_dmabuf() callback to get the sg table
  containing physical memory info to the gem object. and then the
  physical memory info would be copied to vb2_xx_buf object.
  for DMABUF feature for v4l2 and videobuf2 framework, you can refer to
  this repository:
  git://github.com/robclark/kernel-omap4.git drmplane-dmabuf
 
  after that, if v4l2 driver want to release vb2_xx_buf object with
  allocated memory region by user request, how should we do?. refcount
  to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf
  object is released videobuf2 framework don't know who is using the
  physical memory region. so this physical memory region is released and
  when drm driver tries to access the region or to release it also, a
  problem would be induced.
 
  for this problem, I added get_shared_cnt() callback to dma-buf.h but
  I'm not sure that this is good way. maybe there may be better way.
  if there is any missing point, please let me know.
 
  The dma_buf object needs to hold a reference on the underlying
  (necessarily reference-counted) buffer object when the exporter creates
  the dma_buf handle. This reference should then get dropped in the
  exporters dma_buf-ops-release() function, which is only getting called
  when the last reference to the dma_buf disappears.
 

 when the exporter creates the dma_buf handle(for example, gem - fd),
 I think the refcount of gem object should be increased at this point,
 and decreased by dma_buf-ops-release() again because when the
 dma_buf is created and dma_buf_export() is called, this dma_buf refers
 to the gem object one time. and in case of inporter(fd - gem),
 file-f_count of the dma_buf is increased and then when this gem
 object is released by user request such as drm close or
 drn_gem_close_ioctl, dma_buf_put() should be called by
 dma_buf-ops-detach() to decrease file-f_count again because the gem
 object refers to the dma_buf. for this, you can refer to my test
 repository I mentioned above. but the problem is that when a buffer is
 released by one side, another can't know whether the buffer already
 was released or not.

 Nope, dma_buf_put should not be called by -detach. The importer gets his
 reference when importing the dma_buf and needs to drop that reference
 himself when it's done using the buffer by calling dma_buf_put (i.e. after
 the last -detach call).

I'm afraid that there may be my missing points. I'm confusing. who is
Importer and who is Exporter you think? Importer is fd goes to private
buffer and Exporter is private buffer goes to fd? if so, yes, when the
importer needs to drop that reference(the importer want to release
that buffer), dma_buf_put() should be called somewhere and in my case,
that function is called by drm_prime_gem_destory(). this function is
included at Dave's patch sets and also dma_buf_detatch() is called
there. and I just thought that here is right place. I didn't find the
place dma_buf_put() is called anywhere. could you please tell me where
dma_buf_put() should be called at you think?.

for this, you can refer to Dave's repository:
http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf

 I think adding separate reference counting to
 -attach and -detach is a waste of time and only papers over buggy
 importers.


 I mean  when fd goes to private buffer, that reference(file-f_count)
would be increased by dma_buf_get() and  only -detach is used to drop
that reference.

 Additionally the importer does _not_ control the lifetime of an dma_buf
 object and it's underlying backing storage. It hence may _never_ free the
 backing storage itself, that's the job of the exporter.


yes, my understanding is that if user app requested close(fd) call to
the Importer, after close(fd), file-f_count would be decreased but
would still has 1. because file-f_count already was 1 by the
Exporter.

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012-01-09 Thread Rob Clark

On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote:
 note : in case of sharing a buffer between v4l2 and drm driver, the
 memory info would be copied vb2_xx_buf to xx_gem or xx_gem to
 vb2_xx_buf through sg table. in this case, only memory info is used to
 share, not some objects.

which v4l2/vb2 patches are you looking at?  The patches I was using,
vb2 holds a reference to the 'struct dma_buf *' internally, not just
keeping the sg_table

BR,
-R
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012-01-09 Thread Daniel Vetter

On Mon, Jan 09, 2012 at 09:06:56PM +0900, InKi Dae wrote:
 2012/1/9 Daniel Vetter dan...@ffwll.ch:
  On Mon, Jan 09, 2012 at 07:10:25PM +0900, InKi Dae wrote:
  2012/1/9 Daniel Vetter dan...@ffwll.ch:
   On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote:
   I has test dmabuf based drm gem module for exynos and I found one 
   problem.
   you can refer to this test repository:
   http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf
  
   at this repository, I added some exception codes for resource release
   in addition to Dave's patch sets.
  
   let's suppose we use dmabuf based vb2 and drm gem with physically
   continuous memory(no IOMMU) and we try to share allocated buffer
   between them(v4l2 and drm driver).
  
   1. request memory allocation through drm gem interface.
   2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the
   gem object.
   - internally, private gem based dmabuf moudle calls drm_buf_export()
   to register allocated gem object to fd.
   3. request qbuf with the fd(got from 2) and DMABUF type to set the
   buffer to v4l2 based device.
   - internally, vb2 plug in module gets a buffer to the fd and then
   calls dmabuf-ops-map_dmabuf() callback to get the sg table
   containing physical memory info to the gem object. and then the
   physical memory info would be copied to vb2_xx_buf object.
   for DMABUF feature for v4l2 and videobuf2 framework, you can refer to
   this repository:
   git://github.com/robclark/kernel-omap4.git drmplane-dmabuf
  
   after that, if v4l2 driver want to release vb2_xx_buf object with
   allocated memory region by user request, how should we do?. refcount
   to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf
   object is released videobuf2 framework don't know who is using the
   physical memory region. so this physical memory region is released and
   when drm driver tries to access the region or to release it also, a
   problem would be induced.
  
   for this problem, I added get_shared_cnt() callback to dma-buf.h but
   I'm not sure that this is good way. maybe there may be better way.
   if there is any missing point, please let me know.
  
   The dma_buf object needs to hold a reference on the underlying
   (necessarily reference-counted) buffer object when the exporter creates
   the dma_buf handle. This reference should then get dropped in the
   exporters dma_buf-ops-release() function, which is only getting called
   when the last reference to the dma_buf disappears.
  
 
  when the exporter creates the dma_buf handle(for example, gem - fd),
  I think the refcount of gem object should be increased at this point,
  and decreased by dma_buf-ops-release() again because when the
  dma_buf is created and dma_buf_export() is called, this dma_buf refers
  to the gem object one time. and in case of inporter(fd - gem),
  file-f_count of the dma_buf is increased and then when this gem
  object is released by user request such as drm close or
  drn_gem_close_ioctl, dma_buf_put() should be called by
  dma_buf-ops-detach() to decrease file-f_count again because the gem
  object refers to the dma_buf. for this, you can refer to my test
  repository I mentioned above. but the problem is that when a buffer is
  released by one side, another can't know whether the buffer already
  was released or not.
 
  Nope, dma_buf_put should not be called by -detach. The importer gets his
  reference when importing the dma_buf and needs to drop that reference
  himself when it's done using the buffer by calling dma_buf_put (i.e. after
  the last -detach call).
 
 I'm afraid that there may be my missing points. I'm confusing. who is
 Importer and who is Exporter you think? Importer is fd goes to private
 buffer and Exporter is private buffer goes to fd? if so, yes, when the
 importer needs to drop that reference(the importer want to release
 that buffer), dma_buf_put() should be called somewhere and in my case,
 that function is called by drm_prime_gem_destory(). this function is
 included at Dave's patch sets and also dma_buf_detatch() is called
 there. and I just thought that here is right place. I didn't find the
 place dma_buf_put() is called anywhere. could you please tell me where
 dma_buf_put() should be called at you think?.
 
 for this, you can refer to Dave's repository:
 http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf

I haven't really looked at Dave's latest prime patches, but he reported
some reference counting issues last time around we chatted about it on
irc. So maybe you're just right and the dma_buf_put is indeed missing from
drm_prime_gem_destroy ;-) But as I've said, haven't really reviewed the
code, so I'm likely completely wrong.

Cheers, Daniel
-- 
Daniel Vetter
Mail: dan...@ffwll.ch
Mobile: +41 (0)79 365 57 48
___
dri-devel mailing list
dri-devel@lists.freedesktop.org

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012/1/10 Rob Clark r...@ti.com:
 On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote:
 note : in case of sharing a buffer between v4l2 and drm driver, the
 memory info would be copied vb2_xx_buf to xx_gem or xx_gem to
 vb2_xx_buf through sg table. in this case, only memory info is used to
 share, not some objects.

 which v4l2/vb2 patches are you looking at?  The patches I was using,
 vb2 holds a reference to the 'struct dma_buf *' internally, not just
 keeping the sg_table


yes, not keeping the sg_table. I mean... see a example below please.

static void vb2_dma_contig_map_dmabuf(void *mem_priv)
{
struct sg_table *sg;
 ...
 sg = dma_buf_map_attachment(buf-db_attach, dir);
 ...
 buf-dma_addr = sg_dma_address(sg-sgl);
 ...
}

at least with no IOMMU, the memory information(containing physical
memory address) would be copied to vb2_xx_buf object if drm gem
exported its own buffer and vb2 wants to use that buffer at this time,
sg table is used to share that buffer. and the problem I pointed out
is that this buffer(also physical memory region) could be released by
vb2 framework(as you know, vb2_xx_buf object and the memory region for
buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that
so some problems would be induced once drm gem tries to release or
access that buffer. and I have tried to resolve this issue adding
get_shared_cnt() callback to dma-buf.h but I'm not sure that this is
good way. maybe there would be better way.

Thanks.

 BR,
 -R
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012-01-09 Thread Rob Clark

On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote:
 2012/1/10 Rob Clark r...@ti.com:
 On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote:
 note : in case of sharing a buffer between v4l2 and drm driver, the
 memory info would be copied vb2_xx_buf to xx_gem or xx_gem to
 vb2_xx_buf through sg table. in this case, only memory info is used to
 share, not some objects.

 which v4l2/vb2 patches are you looking at?  The patches I was using,
 vb2 holds a reference to the 'struct dma_buf *' internally, not just
 keeping the sg_table


 yes, not keeping the sg_table. I mean... see a example below please.

 static void vb2_dma_contig_map_dmabuf(void *mem_priv)
 {
    struct sg_table *sg;
     ...
     sg = dma_buf_map_attachment(buf-db_attach, dir);
     ...
     buf-dma_addr = sg_dma_address(sg-sgl);
     ...
 }

 at least with no IOMMU, the memory information(containing physical
 memory address) would be copied to vb2_xx_buf object if drm gem
 exported its own buffer and vb2 wants to use that buffer at this time,
 sg table is used to share that buffer. and the problem I pointed out
 is that this buffer(also physical memory region) could be released by
 vb2 framework(as you know, vb2_xx_buf object and the memory region for
 buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that
 so some problems would be induced once drm gem tries to release or
 access that buffer. and I have tried to resolve this issue adding
 get_shared_cnt() callback to dma-buf.h but I'm not sure that this is
 good way. maybe there would be better way.

the exporter (in this case your driver's drm/gem bits) shouldn't
release that mapping / sgtable until the importer (in this case v4l2)
calls dma_buf_unmap fxn..

It would be an error if the importer did a dma_buf_put() without first
calling dma_buf_unmap_attachment() (if currently mapped) and then
dma_buf_detach() (if currently attached).  Perhaps somewhere there
should be some sanity checking debug code which could be enabled to do
a WARN_ON() if the importer does the wrong thing.  It shouldn't really
be part of the API, I don't think, but it actually does seem like a
good thing, esp. as new drivers start trying to use dmabuf, to have
some debug options which could be enabled.

It is entirely possible that something was missed on the vb2 patches,
but the way it is intended to work is like this:
https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92

where it does a detach() before the dma_buf_put(), and the vb2-contig
backend checks here that it is also unmapped():
https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251

BR,
-R

 Thanks.

 BR,
 -R
 ___
 dri-devel mailing list
 dri-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/dri-devel
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012-01-09 Thread Semwal, Sumit

On Tue, Jan 10, 2012 at 7:44 AM, Rob Clark r...@ti.com wrote:
 On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote:
 2012/1/10 Rob Clark r...@ti.com:
 at least with no IOMMU, the memory information(containing physical
 memory address) would be copied to vb2_xx_buf object if drm gem
 exported its own buffer and vb2 wants to use that buffer at this time,
 sg table is used to share that buffer. and the problem I pointed out
 is that this buffer(also physical memory region) could be released by
 vb2 framework(as you know, vb2_xx_buf object and the memory region for
 buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that
 so some problems would be induced once drm gem tries to release or
 access that buffer. and I have tried to resolve this issue adding
 get_shared_cnt() callback to dma-buf.h but I'm not sure that this is
 good way. maybe there would be better way.
Hi Inki,
As also mentioned in the documentation patch, importer (the user of
the buffer) - in this case for current RFC patches on
v4l2-as-a-user[1] vb2 framework - shouldn't release the backing memory
of the buffer directly - it should only use the dma-buf callbacks in
the right sequence to let the exporter know that it is done using this
buffer, so the exporter can release it if allowed and needed.

 the exporter (in this case your driver's drm/gem bits) shouldn't
 release that mapping / sgtable until the importer (in this case v4l2)
 calls dma_buf_unmap fxn..

 It would be an error if the importer did a dma_buf_put() without first
 calling dma_buf_unmap_attachment() (if currently mapped) and then
 dma_buf_detach() (if currently attached).  Perhaps somewhere there
 should be some sanity checking debug code which could be enabled to do
 a WARN_ON() if the importer does the wrong thing.  It shouldn't really
 be part of the API, I don't think, but it actually does seem like a
 good thing, esp. as new drivers start trying to use dmabuf, to have
 some debug options which could be enabled.

 It is entirely possible that something was missed on the vb2 patches,
 but the way it is intended to work is like this:
 https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92

 where it does a detach() before the dma_buf_put(), and the vb2-contig
 backend checks here that it is also unmapped():
 https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251

The proposed RFC for V4L2 adaptation at [1] does exactly the same
thing; detach() before dma_buf_put(), and check in vb2-contig backend
for unmapped() as mentioned above.


 BR,
 -R

BR,
Sumit.

[1]: V4l2 as a dma-buf user RFC:
http://comments.gmane.org/gmane.linux.drivers.video-input-infrastructure/42966
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2012/1/10 Semwal, Sumit sumit.sem...@ti.com:
 On Tue, Jan 10, 2012 at 7:44 AM, Rob Clark r...@ti.com wrote:
 On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote:
 2012/1/10 Rob Clark r...@ti.com:
 at least with no IOMMU, the memory information(containing physical
 memory address) would be copied to vb2_xx_buf object if drm gem
 exported its own buffer and vb2 wants to use that buffer at this time,
 sg table is used to share that buffer. and the problem I pointed out
 is that this buffer(also physical memory region) could be released by
 vb2 framework(as you know, vb2_xx_buf object and the memory region for
 buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that
 so some problems would be induced once drm gem tries to release or
 access that buffer. and I have tried to resolve this issue adding
 get_shared_cnt() callback to dma-buf.h but I'm not sure that this is
 good way. maybe there would be better way.
 Hi Inki,
 As also mentioned in the documentation patch, importer (the user of
 the buffer) - in this case for current RFC patches on
 v4l2-as-a-user[1] vb2 framework - shouldn't release the backing memory
 of the buffer directly - it should only use the dma-buf callbacks in
 the right sequence to let the exporter know that it is done using this
 buffer, so the exporter can release it if allowed and needed.

thank you for your comments.:) and below are some tables about dmabuf
operations with ideal use and these tables indicate reference count of
when buffer is created, shared and released. so if there are any
problems, please let me know. P.S. these are just simple cases so
there would be others.


in case of using only drm gem and dmabuf,

operations   gem refcountfile f_count   buf refcount

1. gem create   A:1   A:0
2. export(handle A - fd)A:2A:1  A:0
3. import(fd - handle B)A:2, B:1 A:2  A:1
4. file close(A)  A:2, B:1 A:1  A:1
5. gem close(A)A:1, B:1 A:1  A:1
6. gem close(B)A:1, B:0 A:1  A:0
7. file close(A)  A:0A:0
---
3. handle B shares the buf of handle A.
6. release handle B but its buf.
7. release gem handle A and dmabuf of file A and also physical memory region.


and in case of using drm gem, vb2 and dmabuf,

operations  gem, vb2 refcountfile f_count   buf refcount

1. gem create   A:1 A:0
   (GEM side)
2. export(handle A - fd)A:2 A:1  A:0
   (GEM side)
3. import(fd - handle B)A:2, B:1  A:2  A:1
   (VB2 side)
4. file close(A)  A:2, B:1  A:1  A:1
   (VB2 side)
5. vb2 close(B) A:2, B:0  A:1  A:0
   (VB2 side)
6. gem close(A)A:1A:1  A:0
   (GEM side)
7. file close(A)  A:0A:0
   (GEM side)

3. vb2 handle B is shared with the buf of gem handle A.
5. release vb2 handle B and decrease refcount of the buf pointed by it.
7. release gem handle A and dmabuf of file A and also physical memory region.


 the exporter (in this case your driver's drm/gem bits) shouldn't
 release that mapping / sgtable until the importer (in this case v4l2)
 calls dma_buf_unmap fxn..

 It would be an error if the importer did a dma_buf_put() without first
 calling dma_buf_unmap_attachment() (if currently mapped) and then
 dma_buf_detach() (if currently attached).  Perhaps somewhere there
 should be some sanity checking debug code which could be enabled to do
 a WARN_ON() if the importer does the wrong thing.  It shouldn't really
 be part of the API, I don't think, but it actually does seem like a
 good thing, esp. as new drivers start trying to use dmabuf, to have
 some debug options which could be enabled.

 It is entirely possible that something was missed on the vb2 patches,
 but the way it is intended to work is like this:
 https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92

 where it does a detach() before the dma_buf_put(), and the vb2-contig
 backend checks here that it is also unmapped():
 https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251

 The proposed RFC for V4L2 adaptation at [1] does exactly the same
 thing; detach() before dma_buf_put(),

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-07 Thread Arnd Bergmann

On Wednesday 07 December 2011, Semwal, Sumit wrote:
 
  Do you have a use case for making the interface compile-time disabled?
  I had assumed that any code using it would make no sense if it's not
  available so you don't actually need this.

 Ok. Though if we keep the interface compile-time disabled, the users
 can actually check and fail or fall-back gracefully when the API is
 not available; If I remove it, anyways the users would need to do the
 same compile time check whether API is available or not, right?

If you have to do a compile-time check for the config symbol, it's better
to do it the way you did here than in the caller.

My guess was that no caller would actually require this, because when you
write a part of a subsystem to interact with the dma-buf infrastructure,
you would always disable compilation of an extire file that deals with 
everything related to struct dma_buf, not just stub out the calls.

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-07 Thread Semwal, Sumit

On Wed, Dec 7, 2011 at 3:41 PM, Arnd Bergmann a...@arndb.de wrote:
 On Wednesday 07 December 2011, Semwal, Sumit wrote:
 
  Do you have a use case for making the interface compile-time disabled?
  I had assumed that any code using it would make no sense if it's not
  available so you don't actually need this.

 Ok. Though if we keep the interface compile-time disabled, the users
 can actually check and fail or fall-back gracefully when the API is
 not available; If I remove it, anyways the users would need to do the
 same compile time check whether API is available or not, right?

 If you have to do a compile-time check for the config symbol, it's better
 to do it the way you did here than in the caller.

 My guess was that no caller would actually require this, because when you
 write a part of a subsystem to interact with the dma-buf infrastructure,
 you would always disable compilation of an extire file that deals with
 everything related to struct dma_buf, not just stub out the calls.

Right; that would be ideal, but we may not be able to ask each user to
do so - especially when the sharing part might be interspersed in
existing buffer handling code. So for now, I would like to keep it as
it-is.

        Arnd

BR,
~Sumit.
 --
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-07 Thread Arnd Bergmann

On Wednesday 07 December 2011, Semwal, Sumit wrote:
 Right; that would be ideal, but we may not be able to ask each user to
 do so - especially when the sharing part might be interspersed in
 existing buffer handling code. So for now, I would like to keep it as
 it-is.

Ok, fair enough. It certainly doesn't hurt.

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-07 Thread Semwal, Sumit

Hi Daniel, Rob,


On Tue, Dec 6, 2011 at 3:41 AM, Rob Clark r...@ti.com wrote:
 On Mon, Dec 5, 2011 at 3:23 PM, Daniel Vetter dan...@ffwll.ch wrote:
 On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote:
 On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote:
  In the patch 2, you have a section about migration that mentions that
  it is possible to export a buffer that can be migrated after it
  is already mapped into one user driver. How does that work when
  the physical addresses are mapped into a consumer device already?

 I think you can do physical migration if you are attached, but
 probably not if you are mapped.

 Yeah, that's very much how I see this, and also why map/unmap (at least
 for simple users like v4l) should only bracket actual usage. GPU memory
 managers need to be able to move around buffers while no one is using
 them.

 [snip]

  +     /* allow allocator to take care of cache ops */
  +     void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
  +     void (*sync_sg_for_device)(struct dma_buf *, struct device *);
 
  I don't see how this works with multiple consumers: For the streaming
  DMA mapping, there must be exactly one owner, either the device or
  the CPU. Obviously, this rule needs to be extended when you get to
  multiple devices and multiple device drivers, plus possibly user
  mappings. Simply assigning the buffer to the device from one
  driver does not block other drivers from touching the buffer, and
  assigning it to the cpu does not stop other hardware that the
  code calling sync_sg_for_cpu is not aware of.
 
  The only way to solve this that I can think of right now is to
  mandate that the mappings are all coherent (i.e. noncachable
  on noncoherent architectures like ARM). If you do that, you no
  longer need the sync_sg_for_* calls.

 My original thinking was that you either need DMABUF_CPU_{PREP,FINI}
 ioctls and corresponding dmabuf ops, which userspace is required to
 call before / after CPU access.  Or just remove mmap() and do the
 mmap() via allocating device and use that device's equivalent
 DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls.  That
 would give you a way to (a) synchronize with gpu/asynchronous
 pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing
 buffer (ie. wait all devices have dma_buf_unmap_attachment'd).  And
 that gives you a convenient place to do cache operations on
 noncoherent architecture.

 I sort of preferred having the DMABUF shim because that lets you pass
 a buffer around userspace without the receiving code knowing about a
 device specific API.  But the problem I eventually came around to: if
 your GL stack (or some other userspace component) is batching up
 commands before submission to kernel, the buffers you need to wait for
 completion might not even be submitted yet.  So from kernel
 perspective they are ready for cpu access.  Even though in fact they
 are not in a consistent state from rendering perspective.  I don't
 really know a sane way to deal with that.  Maybe the approach instead
 should be a userspace level API (in libkms/libdrm?) to provide
 abstraction for userspace access to buffers rather than dealing with
 this at the kernel level.

 Well, there's a reason GL has an explicit flush and extensions for sync
 objects. It's to support such scenarios where the driver batches up gpu
 commands before actually submitting them.

 Hmm.. what about other non-GL APIs..  maybe vaapi/vdpau or similar?
 (Or something that I haven't thought of.)

 Also, recent gpus have all (or
 shortly will grow) multiple execution pipelines, so it's also important
 that you sync up with the right command stream. Syncing up with all of
 them is generally frowned upon for obvious reasons ;-)

 Well, I guess I am happy enough with something that is at least
 functional.  Usespace access would (I think) mainly be weird edge case
 type stuff.  But...

snip

 On the topic of a coherency model for dmabuf, I think we need to look at
 dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and
 cpu_finish or whatever they might get called) as barriers:

 So after a dma_buf_map, all previsously completed dma operations (i.e.
 unmap already called) and any cpu writes (i.e. cpu_finish called) will be
 coherent. Similar rule holds for cpu access through the userspace mmap,
 only writes completed before the cpu_start will show up.

 Similar, writes done by the device are only guaranteed to show up after
 the _unmap. Dito for cpu writes and cpu_finish.

 In short we always need two function calls to denote the start/end of the
 critical section.

 Yup, this was exactly my assumption.  But I guess it is better to spell it 
 out.

Thanks for the excellent discussion - it indeed is very good learning
for the relatively-inexperienced me :)

So, for the purpose of dma-buf framework, could I summarize the
following and rework accordingly?:
1. remove mmap() dma_buf_op [and mmap fop], and

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-07 Thread Arnd Bergmann

On Wednesday 07 December 2011, Semwal, Sumit wrote:
 Thanks for the excellent discussion - it indeed is very good learning
 for the relatively-inexperienced me :)
 
 So, for the purpose of dma-buf framework, could I summarize the
 following and rework accordingly?:
 1. remove mmap() dma_buf_op [and mmap fop], and introduce cpu_start(),
 cpu_finish() ops to bracket cpu accesses to the buffer. Also add
 DMABUF_CPU_START / DMABUF_CPU_FINI IOCTLs?

I think we'd be better off for now without the extra ioctls and
just document that a shared buffer must not be exported to user
space using mmap at all, to avoid those problems. Serialization
between GPU and CPU is on a higher level than the dma_buf framework
IMHO.

 2. remove sg_sync* ops for now (and we'll see if we need to add them
 later if needed)

Just removing the sg_sync_* operations is not enough. We have to make
the decision whether we want to allow
a) only coherent mappings of the buffer into kernel memory (requiring
an extension to the dma_map_ops on ARM to not flush caches at map/unmap
time)
b) not allowing any in-kernel mappings (same requirement on ARM, also
limits the usefulness of the dma_buf if we cannot access it from the
kernel or from user space)
c) only allowing streaming mappings, even if those are non-coherent
(requiring strict serialization between CPU (in-kernel) and dma users of
the buffer)

This issue has to be solved or we get random data corruption.

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-06 Thread Arnd Bergmann

On Monday 05 December 2011, Rob Clark wrote:
  On the topic of a coherency model for dmabuf, I think we need to look at
  dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and
  cpu_finish or whatever they might get called) as barriers:
 
  So after a dma_buf_map, all previsously completed dma operations (i.e.
  unmap already called) and any cpu writes (i.e. cpu_finish called) will be
  coherent. Similar rule holds for cpu access through the userspace mmap,
  only writes completed before the cpu_start will show up.
 
  Similar, writes done by the device are only guaranteed to show up after
  the _unmap. Dito for cpu writes and cpu_finish.
 
  In short we always need two function calls to denote the start/end of the
  critical section.
 
 Yup, this was exactly my assumption.  But I guess it is better to spell it 
 out.

I still don't understand how this is going to help you if you let
multiple drivers enter and leave the critical section without serializing
against one another. That doesn't sound like what I know as critical
section.

Given some reasonable constraints (all devices must be in the same coherency
domain, for instance), you can probably define it in a way that you can
have multiple devices mapping the same buffer at the same time, and
when no device has mapped the buffer you can have as many concurrent
kernel and user space accesses on the same buffer as you like. But you
must still guarantee that no software touches a noncoherent buffer while
it is mapped into any device and vice versa.

Why can't we just mandate that all mappings into the kernel must be
coherent and that user space accesses must either be coherent as well
or be done by user space that uses explicit serialization with all
DMA accesses?

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-06 Thread Daniel Vetter

On Tue, Dec 06, 2011 at 01:16:58PM +, Arnd Bergmann wrote:
 On Monday 05 December 2011, Rob Clark wrote:
   On the topic of a coherency model for dmabuf, I think we need to look at
   dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and
   cpu_finish or whatever they might get called) as barriers:
  
   So after a dma_buf_map, all previsously completed dma operations (i.e.
   unmap already called) and any cpu writes (i.e. cpu_finish called) will be
   coherent. Similar rule holds for cpu access through the userspace mmap,
   only writes completed before the cpu_start will show up.
  
   Similar, writes done by the device are only guaranteed to show up after
   the _unmap. Dito for cpu writes and cpu_finish.
  
   In short we always need two function calls to denote the start/end of the
   critical section.
 
  Yup, this was exactly my assumption.  But I guess it is better to spell it 
  out.

 I still don't understand how this is going to help you if you let
 multiple drivers enter and leave the critical section without serializing
 against one another. That doesn't sound like what I know as critical
 section.

I already regret to having added that last critical section remark.
Think barriers. It's just that you need a barrier in both directions that
bracket the actual usage. In i915-land we call the first one generally
invalidate (so that caches on the target domain don't contain stale data)
and that second one flush (to get any data out of caches).

 Given some reasonable constraints (all devices must be in the same coherency
 domain, for instance), you can probably define it in a way that you can
 have multiple devices mapping the same buffer at the same time, and
 when no device has mapped the buffer you can have as many concurrent
 kernel and user space accesses on the same buffer as you like. But you
 must still guarantee that no software touches a noncoherent buffer while
 it is mapped into any device and vice versa.

 Why can't we just mandate that all mappings into the kernel must be
 coherent and that user space accesses must either be coherent as well
 or be done by user space that uses explicit serialization with all
 DMA accesses?

I agree with your points here, afaics the contentious issue is just
whether dma_buf should _enforce_ this strict ordering. I'm leading towards
a no for the following reasons:

- gpu people love nonblocking interfaces (and love to come up with
  abuses). In the generic case we'd need some more functions to properly
  flush everything while 2 devices access a buffer concurrently (which is
  imo a bit unrealistic). But e.g. 2 gpus rendering in SLI mode very much
  want to access the same buffer at the same time (and the
  kernel+userspace gpu driver already needs all the information about
  caches to make that happen, at least on x86).

- Buffer sharing alone has already some great potential for deadlock and
  lock recursion issues. Making dma_buf into something that very much acts
  like a new locking primitive itself (even exposed to userspace) will
  make this much worse. I've seen some of the kernel/userspace shared
  hwlock code of dri1 yonder, and it's horrible (and at least for the case
  of the dri1 hwlock, totally broken).

- All current subsystem already have the concept to pass the ownership of
  a buffer between the device and userspace (sometimes even more than just
  2 domains, like in i915 ...). Userspace already needs to use this
  interface to get anything resembling correct data. I don't see any case
  where userspace can't enforce passing around buffer ownership if
  multiple devices are involved (we obviously need to clarify subsystem
  interfaces to make it clear when a buffer is in use and when another
  device taking part in the sharing could use it). So I don't see how the
  kernel enforcing strict access ordering helps implementing correct
  userspace.

- I don't see any security needs that would make it necessary for the
  kernel to enforce any consistency guarantees for concurrent access -
  we're only dealing with pixel data in all the currently discussed
  generic use-cases. So I think garbage as an end-result is acceptable if
  userspace does stupid things (or fails at trying to be clever).

Cheers, Daniel
-- 
Daniel Vetter
Mail: dan...@ffwll.ch
Mobile: +41 (0)79 365 57 48
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

2011-12-06 Thread Semwal, Sumit

Hi Arnd,

Thanks for your review!
On Mon, Dec 5, 2011 at 10:48 PM, Arnd Bergmann a...@arndb.de wrote:
 On Friday 02 December 2011, Sumit Semwal wrote:
 This is the first step in defining a dma buffer sharing mechanism.

 This looks very nice, but there are a few things I don't understand yet
 and a bunch of trivial comments I have about things I spotted.

 Do you have prototype exporter and consumer drivers that you can post
 for clarification?

 In the patch 2, you have a section about migration that mentions that
 it is possible to export a buffer that can be migrated after it
 is already mapped into one user driver. How does that work when
 the physical addresses are mapped into a consumer device already?
I guess I need to clear it up in the documentation - when I said once
all ongoing access is completed - I meant to say once all current
users have finished accessing and have unmapped this buffer. So I
agree with Rob - the migration would only be possible for attached
but unmapped buffers. I will update the documentation.

 diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
 index 21cf46f..07d8095 100644
 --- a/drivers/base/Kconfig
 +++ b/drivers/base/Kconfig
 @@ -174,4 +174,14 @@ config SYS_HYPERVISOR

  source drivers/base/regmap/Kconfig

 +config DMA_SHARED_BUFFER
 +     bool Buffer framework to be shared between drivers
 +     default n
 +     depends on ANON_INODES

 I would make this 'select ANON_INODES', like the other users of this
 feature.
Sure.


 +     return dmabuf;
 +}
 +EXPORT_SYMBOL(dma_buf_export);

 I agree with Konrad, this should definitely be EXPORT_SYMBOL_GPL,
 because it's really a low-level function that I would expect
 to get used by in-kernel subsystems providing the feature to
 users and having back-end drivers, but it's not the kind of thing
 we want out-of-tree drivers to mess with.
Agreed.


 +/**
 + * dma_buf_fd - returns a file descriptor for the given dma_buf
 + * @dmabuf:  [in]    pointer to dma_buf for which fd is required.
 + *
 + * On success, returns an associated 'fd'. Else, returns error.
 + */
 +int dma_buf_fd(struct dma_buf *dmabuf)
 +{
 +     int error, fd;
 +
 +     if (!dmabuf-file)
 +             return -EINVAL;
 +
 +     error = get_unused_fd_flags(0);

 Why not simply get_unused_fd()?
:) oversight. Will correct.


 +/**
 + * dma_buf_attach - Add the device to dma_buf's attachments list; 
 optionally,
 + * calls attach() of dma_buf_ops to allow device-specific attach 
 functionality
 + * @dmabuf:  [in]    buffer to attach device to.
 + * @dev:     [in]    device to be attached.
 + *
 + * Returns struct dma_buf_attachment * for this attachment; may return NULL.
 + *

 Or may return a negative error code. It's better to be consistent here:
 either always return NULL on error, or change the allocation error to
 ERR_PTR(-ENOMEM).
Ok, that makes sense.


 + */
 +struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 +                                             struct device *dev)
 +{
 +     struct dma_buf_attachment *attach;
 +     int ret;
 +
 +     BUG_ON(!dmabuf || !dev);
 +
 +     attach = kzalloc(sizeof(struct dma_buf_attachment), GFP_KERNEL);
 +     if (attach == NULL)
 +             goto err_alloc;
 +
 +     mutex_lock(dmabuf-lock);
 +
 +     attach-dev = dev;
 +     attach-dmabuf = dmabuf;
 +     if (dmabuf-ops-attach) {
 +             ret = dmabuf-ops-attach(dmabuf, dev, attach);
 +             if (!ret)
 +                     goto err_attach;

 You probably mean if (ret) here instead of if (!ret), right?
yes - a stupid one! will correct.


 +     /* allow allocator to take care of cache ops */
 +     void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
 +     void (*sync_sg_for_device)(struct dma_buf *, struct device *);

 I don't see how this works with multiple consumers: For the streaming
 DMA mapping, there must be exactly one owner, either the device or
 the CPU. Obviously, this rule needs to be extended when you get to
 multiple devices and multiple device drivers, plus possibly user
 mappings. Simply assigning the buffer to the device from one
 driver does not block other drivers from touching the buffer, and
 assigning it to the cpu does not stop other hardware that the
 code calling sync_sg_for_cpu is not aware of.

 The only way to solve this that I can think of right now is to
 mandate that the mappings are all coherent (i.e. noncachable
 on noncoherent architectures like ARM). If you do that, you no
 longer need the sync_sg_for_* calls.
I will take yours and Daniel's suggestion, and remove these; if at all
they're needed, we can add them back again later, with
/s/device/attachment as suggested by Daniel.

 +#ifdef CONFIG_DMA_SHARED_BUFFER

 Do you have a use case for making the interface compile-time disabled?
 I had assumed that any code using it would make no sense if it's not
 available so you don't actually need this.
Ok. Though if we keep the interface compile-time disabled, the users
can actually check and

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Friday 02 December 2011, Sumit Semwal wrote:
 This is the first step in defining a dma buffer sharing mechanism.

This looks very nice, but there are a few things I don't understand yet
and a bunch of trivial comments I have about things I spotted.

Do you have prototype exporter and consumer drivers that you can post
for clarification?

In the patch 2, you have a section about migration that mentions that
it is possible to export a buffer that can be migrated after it
is already mapped into one user driver. How does that work when
the physical addresses are mapped into a consumer device already?

 diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
 index 21cf46f..07d8095 100644
 --- a/drivers/base/Kconfig
 +++ b/drivers/base/Kconfig
 @@ -174,4 +174,14 @@ config SYS_HYPERVISOR
  
  source drivers/base/regmap/Kconfig
  
 +config DMA_SHARED_BUFFER
 + bool Buffer framework to be shared between drivers
 + default n
 + depends on ANON_INODES

I would make this 'select ANON_INODES', like the other users of this
feature.

 + return dmabuf;
 +}
 +EXPORT_SYMBOL(dma_buf_export);

I agree with Konrad, this should definitely be EXPORT_SYMBOL_GPL,
because it's really a low-level function that I would expect
to get used by in-kernel subsystems providing the feature to
users and having back-end drivers, but it's not the kind of thing
we want out-of-tree drivers to mess with.

 +/**
 + * dma_buf_fd - returns a file descriptor for the given dma_buf
 + * @dmabuf:  [in]pointer to dma_buf for which fd is required.
 + *
 + * On success, returns an associated 'fd'. Else, returns error.
 + */
 +int dma_buf_fd(struct dma_buf *dmabuf)
 +{
 + int error, fd;
 +
 + if (!dmabuf-file)
 + return -EINVAL;
 +
 + error = get_unused_fd_flags(0);

Why not simply get_unused_fd()?

 +/**
 + * dma_buf_attach - Add the device to dma_buf's attachments list; optionally,
 + * calls attach() of dma_buf_ops to allow device-specific attach 
 functionality
 + * @dmabuf:  [in]buffer to attach device to.
 + * @dev: [in]device to be attached.
 + *
 + * Returns struct dma_buf_attachment * for this attachment; may return NULL.
 + *

Or may return a negative error code. It's better to be consistent here:
either always return NULL on error, or change the allocation error to
ERR_PTR(-ENOMEM).

 + */
 +struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 + struct device *dev)
 +{
 + struct dma_buf_attachment *attach;
 + int ret;
 +
 + BUG_ON(!dmabuf || !dev);
 +
 + attach = kzalloc(sizeof(struct dma_buf_attachment), GFP_KERNEL);
 + if (attach == NULL)
 + goto err_alloc;
 +
 + mutex_lock(dmabuf-lock);
 +
 + attach-dev = dev;
 + attach-dmabuf = dmabuf;
 + if (dmabuf-ops-attach) {
 + ret = dmabuf-ops-attach(dmabuf, dev, attach);
 + if (!ret)
 + goto err_attach;

You probably mean if (ret) here instead of if (!ret), right?

 + /* allow allocator to take care of cache ops */
 + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
 + void (*sync_sg_for_device)(struct dma_buf *, struct device *);

I don't see how this works with multiple consumers: For the streaming
DMA mapping, there must be exactly one owner, either the device or
the CPU. Obviously, this rule needs to be extended when you get to
multiple devices and multiple device drivers, plus possibly user
mappings. Simply assigning the buffer to the device from one
driver does not block other drivers from touching the buffer, and
assigning it to the cpu does not stop other hardware that the
code calling sync_sg_for_cpu is not aware of.

The only way to solve this that I can think of right now is to
mandate that the mappings are all coherent (i.e. noncachable
on noncoherent architectures like ARM). If you do that, you no
longer need the sync_sg_for_* calls.

 +#ifdef CONFIG_DMA_SHARED_BUFFER

Do you have a use case for making the interface compile-time disabled?
I had assumed that any code using it would make no sense if it's not
available so you don't actually need this.

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 05, 2011 at 05:18:48PM +, Arnd Bergmann wrote:
 On Friday 02 December 2011, Sumit Semwal wrote:
  +   /* allow allocator to take care of cache ops */
  +   void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
  +   void (*sync_sg_for_device)(struct dma_buf *, struct device *);

 I don't see how this works with multiple consumers: For the streaming
 DMA mapping, there must be exactly one owner, either the device or
 the CPU. Obviously, this rule needs to be extended when you get to
 multiple devices and multiple device drivers, plus possibly user
 mappings. Simply assigning the buffer to the device from one
 driver does not block other drivers from touching the buffer, and
 assigning it to the cpu does not stop other hardware that the
 code calling sync_sg_for_cpu is not aware of.

 The only way to solve this that I can think of right now is to
 mandate that the mappings are all coherent (i.e. noncachable
 on noncoherent architectures like ARM). If you do that, you no
 longer need the sync_sg_for_* calls.

Woops, totally missed the addition of these. Can somebody explain to used
to rather coherent x86 what we need these for and the code-flow would look
like in a typical example. I was kinda assuming that devices would bracket
their use of a buffer with the attachment_map/unmap calls and any cache
coherency magic that might be needed would be somewhat transparent to
users of the interface?

The map call gets the dma_data_direction parameter, so it should be able
to do the right thing. And because we keep the attachement around, any
caching of mappings should be possible, too.

Yours, Daniel

PS: Slightly related, because it will make the coherency nightmare worse,
afaict: Can we kill mmap support?
-- 
Daniel Vetter
Mail: dan...@ffwll.ch
Mobile: +41 (0)79 365 57 48
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Monday 05 December 2011 19:55:44 Daniel Vetter wrote:
  The only way to solve this that I can think of right now is to
  mandate that the mappings are all coherent (i.e. noncachable
  on noncoherent architectures like ARM). If you do that, you no
  longer need the sync_sg_for_* calls.
 
 Woops, totally missed the addition of these. Can somebody explain to used
 to rather coherent x86 what we need these for and the code-flow would look
 like in a typical example. I was kinda assuming that devices would bracket
 their use of a buffer with the attachment_map/unmap calls and any cache
 coherency magic that might be needed would be somewhat transparent to
 users of the interface?

I'll describe how the respective functions work in the streaming mapping
API (dma_map_*): You start out with a buffer that is owned by the CPU,
i.e. the kernel can access it freely. When you call dma_map_sg or similar,
a noncoherent device reading the buffer requires the cache to be flushed
in order to see the data that was written by the CPU into the cache.

After dma_map_sg, the device can perform both read and write accesses
(depending on the flag to the map call), but the CPU is no longer allowed
to read (which would allocate a cache line that may become invalid but
remain marked as clean) or write (which would create a dirty cache line
without writing it back) that buffer.

Once the device is done, the driver calls dma_unmap_* and the buffer is
again owned by the CPU. The device can no longer access it (in fact
the address may be no longer be backed if there is an iommu) and the CPU
can again read and write the buffer. On ARMv6 and higher, possibly some
other architectures, dma_unmap_* also needs to invalidate the cache
for the buffer, because due to speculative prefetching, there may also
be a new clean cache line with stale data from an earlier version of
the buffer.

Since map/unmap is an expensive operation, the interface was extended
to pass back the ownership to the CPU and back to the device while leaving
the buffer mapped. dma_sync_sg_for_cpu invalidates the cache in the same
way as dma_unmap_sg, so the CPU can access the buffer, and
dma_sync_sg_for_device hands it back to the device by performing the
same cache flush that dma_map_sg would do.

You could for example do this if you want video input with a
cacheable buffer, or in an rdma scenario with a buffer accessed
by a remote machine.

In case of software iommu (swiotlb, dmabounce), the map and sync
functions don't do cache management but instead copy data between
a buffer accessed by hardware and a different buffer accessed
by the user.

 The map call gets the dma_data_direction parameter, so it should be able
 to do the right thing. And because we keep the attachement around, any
 caching of mappings should be possible, too.
 
 Yours, Daniel
 
 PS: Slightly related, because it will make the coherency nightmare worse,
 afaict: Can we kill mmap support?

The mmap support is required (and only make sense) for consistent mappings,
not for streaming mappings. The provider must ensure that if you map
something uncacheable into the kernel in order to provide consistency,
any mapping into user space must also be uncacheable. A driver
that wants to have the buffer mapped to user space as many do should
not need to know whether it is required to do cacheable or uncacheable
mapping because of some other driver, and it should not need to worry
about how to set up uncached mappings in user space.

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote:
 On Friday 02 December 2011, Sumit Semwal wrote:
 This is the first step in defining a dma buffer sharing mechanism.

 This looks very nice, but there are a few things I don't understand yet
 and a bunch of trivial comments I have about things I spotted.

 Do you have prototype exporter and consumer drivers that you can post
 for clarification?

There is some dummy drivers based on an earlier version.  And airlied
has a prime (multi-gpu) prototype:

http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf

I've got a nearly working camera+display prototype:

https://github.com/robclark/kernel-omap4/commits/dmabuf

 In the patch 2, you have a section about migration that mentions that
 it is possible to export a buffer that can be migrated after it
 is already mapped into one user driver. How does that work when
 the physical addresses are mapped into a consumer device already?

I think you can do physical migration if you are attached, but
probably not if you are mapped.

 diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
 index 21cf46f..07d8095 100644
 --- a/drivers/base/Kconfig
 +++ b/drivers/base/Kconfig
 @@ -174,4 +174,14 @@ config SYS_HYPERVISOR

  source drivers/base/regmap/Kconfig

 +config DMA_SHARED_BUFFER
 +     bool Buffer framework to be shared between drivers
 +     default n
 +     depends on ANON_INODES

 I would make this 'select ANON_INODES', like the other users of this
 feature.

 +     return dmabuf;
 +}
 +EXPORT_SYMBOL(dma_buf_export);

 I agree with Konrad, this should definitely be EXPORT_SYMBOL_GPL,
 because it's really a low-level function that I would expect
 to get used by in-kernel subsystems providing the feature to
 users and having back-end drivers, but it's not the kind of thing
 we want out-of-tree drivers to mess with.

 +/**
 + * dma_buf_fd - returns a file descriptor for the given dma_buf
 + * @dmabuf:  [in]    pointer to dma_buf for which fd is required.
 + *
 + * On success, returns an associated 'fd'. Else, returns error.
 + */
 +int dma_buf_fd(struct dma_buf *dmabuf)
 +{
 +     int error, fd;
 +
 +     if (!dmabuf-file)
 +             return -EINVAL;
 +
 +     error = get_unused_fd_flags(0);

 Why not simply get_unused_fd()?

 +/**
 + * dma_buf_attach - Add the device to dma_buf's attachments list; 
 optionally,
 + * calls attach() of dma_buf_ops to allow device-specific attach 
 functionality
 + * @dmabuf:  [in]    buffer to attach device to.
 + * @dev:     [in]    device to be attached.
 + *
 + * Returns struct dma_buf_attachment * for this attachment; may return NULL.
 + *

 Or may return a negative error code. It's better to be consistent here:
 either always return NULL on error, or change the allocation error to
 ERR_PTR(-ENOMEM).

 + */
 +struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 +                                             struct device *dev)
 +{
 +     struct dma_buf_attachment *attach;
 +     int ret;
 +
 +     BUG_ON(!dmabuf || !dev);
 +
 +     attach = kzalloc(sizeof(struct dma_buf_attachment), GFP_KERNEL);
 +     if (attach == NULL)
 +             goto err_alloc;
 +
 +     mutex_lock(dmabuf-lock);
 +
 +     attach-dev = dev;
 +     attach-dmabuf = dmabuf;
 +     if (dmabuf-ops-attach) {
 +             ret = dmabuf-ops-attach(dmabuf, dev, attach);
 +             if (!ret)
 +                     goto err_attach;

 You probably mean if (ret) here instead of if (!ret), right?

 +     /* allow allocator to take care of cache ops */
 +     void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
 +     void (*sync_sg_for_device)(struct dma_buf *, struct device *);

 I don't see how this works with multiple consumers: For the streaming
 DMA mapping, there must be exactly one owner, either the device or
 the CPU. Obviously, this rule needs to be extended when you get to
 multiple devices and multiple device drivers, plus possibly user
 mappings. Simply assigning the buffer to the device from one
 driver does not block other drivers from touching the buffer, and
 assigning it to the cpu does not stop other hardware that the
 code calling sync_sg_for_cpu is not aware of.

 The only way to solve this that I can think of right now is to
 mandate that the mappings are all coherent (i.e. noncachable
 on noncoherent architectures like ARM). If you do that, you no
 longer need the sync_sg_for_* calls.

My original thinking was that you either need DMABUF_CPU_{PREP,FINI}
ioctls and corresponding dmabuf ops, which userspace is required to
call before / after CPU access.  Or just remove mmap() and do the
mmap() via allocating device and use that device's equivalent
DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls.  That
would give you a way to (a) synchronize with gpu/asynchronous
pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing
buffer (ie. wait all devices have dma_buf_unmap_attachment'd).  And
that gives you a convenient place

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 05, 2011 at 08:29:49PM +0100, Arnd Bergmann wrote:
 On Monday 05 December 2011 19:55:44 Daniel Vetter wrote:
   The only way to solve this that I can think of right now is to
   mandate that the mappings are all coherent (i.e. noncachable
   on noncoherent architectures like ARM). If you do that, you no
   longer need the sync_sg_for_* calls.
 
  Woops, totally missed the addition of these. Can somebody explain to used
  to rather coherent x86 what we need these for and the code-flow would look
  like in a typical example. I was kinda assuming that devices would bracket
  their use of a buffer with the attachment_map/unmap calls and any cache
  coherency magic that might be needed would be somewhat transparent to
  users of the interface?

 I'll describe how the respective functions work in the streaming mapping
 API (dma_map_*): You start out with a buffer that is owned by the CPU,
 i.e. the kernel can access it freely. When you call dma_map_sg or similar,
 a noncoherent device reading the buffer requires the cache to be flushed
 in order to see the data that was written by the CPU into the cache.

 After dma_map_sg, the device can perform both read and write accesses
 (depending on the flag to the map call), but the CPU is no longer allowed
 to read (which would allocate a cache line that may become invalid but
 remain marked as clean) or write (which would create a dirty cache line
 without writing it back) that buffer.

 Once the device is done, the driver calls dma_unmap_* and the buffer is
 again owned by the CPU. The device can no longer access it (in fact
 the address may be no longer be backed if there is an iommu) and the CPU
 can again read and write the buffer. On ARMv6 and higher, possibly some
 other architectures, dma_unmap_* also needs to invalidate the cache
 for the buffer, because due to speculative prefetching, there may also
 be a new clean cache line with stale data from an earlier version of
 the buffer.

 Since map/unmap is an expensive operation, the interface was extended
 to pass back the ownership to the CPU and back to the device while leaving
 the buffer mapped. dma_sync_sg_for_cpu invalidates the cache in the same
 way as dma_unmap_sg, so the CPU can access the buffer, and
 dma_sync_sg_for_device hands it back to the device by performing the
 same cache flush that dma_map_sg would do.

 You could for example do this if you want video input with a
 cacheable buffer, or in an rdma scenario with a buffer accessed
 by a remote machine.

 In case of software iommu (swiotlb, dmabounce), the map and sync
 functions don't do cache management but instead copy data between
 a buffer accessed by hardware and a different buffer accessed
 by the user.

Thanks a lot for this excellent overview. I think at least for the first
version of dmabuf we should drop the sync_* interfaces and simply require
users to bracket their usage of the buffer from the attached device by
map/unmap. A dma_buf provider is always free to cache the mapping and
simply call sync_sg_for of the streaming dma api.

If it later turns out that we want to be able to cache the sg list also on
the use-site in the driver (e.g. map it into some hw sg list) we can
always add that functionality later. I just fear that sync ops among N
devices is a bit ill-defined and we already have a plethora of ill-defined
issues at hand. Also the proposed api doesn't quite fit into what's
already there, I think an s/device/dma_buf_attachment/ would be more
consistent - otherwise the dmabuf provider might need to walk the list of
attachements to get at the right one for the device.

  The map call gets the dma_data_direction parameter, so it should be able
  to do the right thing. And because we keep the attachement around, any
  caching of mappings should be possible, too.
 
  Yours, Daniel
 
  PS: Slightly related, because it will make the coherency nightmare worse,
  afaict: Can we kill mmap support?

 The mmap support is required (and only make sense) for consistent mappings,
 not for streaming mappings. The provider must ensure that if you map
 something uncacheable into the kernel in order to provide consistency,
 any mapping into user space must also be uncacheable. A driver
 that wants to have the buffer mapped to user space as many do should
 not need to know whether it is required to do cacheable or uncacheable
 mapping because of some other driver, and it should not need to worry
 about how to set up uncached mappings in user space.

Either I've always missed it or no one ever described it that consisely,
but now I see the use-case for mmap: Simpler drivers (i.e. not gpus) might
need to expose a userspace mapping and only the provider knows how to do
that in a coherent fashion. I want this in the docs (if it's not there yet
...).

But even with that use-case in mind I still have some gripes with the
current mmap interfaces as proposed:
- This use-case explains why the dmabuf provider needs to expose an mmap
  function.

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote:
 On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote:
  In the patch 2, you have a section about migration that mentions that
  it is possible to export a buffer that can be migrated after it
  is already mapped into one user driver. How does that work when
  the physical addresses are mapped into a consumer device already?

 I think you can do physical migration if you are attached, but
 probably not if you are mapped.

Yeah, that's very much how I see this, and also why map/unmap (at least
for simple users like v4l) should only bracket actual usage. GPU memory
managers need to be able to move around buffers while no one is using
them.

[snip]

  +     /* allow allocator to take care of cache ops */
  +     void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
  +     void (*sync_sg_for_device)(struct dma_buf *, struct device *);
 
  I don't see how this works with multiple consumers: For the streaming
  DMA mapping, there must be exactly one owner, either the device or
  the CPU. Obviously, this rule needs to be extended when you get to
  multiple devices and multiple device drivers, plus possibly user
  mappings. Simply assigning the buffer to the device from one
  driver does not block other drivers from touching the buffer, and
  assigning it to the cpu does not stop other hardware that the
  code calling sync_sg_for_cpu is not aware of.
 
  The only way to solve this that I can think of right now is to
  mandate that the mappings are all coherent (i.e. noncachable
  on noncoherent architectures like ARM). If you do that, you no
  longer need the sync_sg_for_* calls.

 My original thinking was that you either need DMABUF_CPU_{PREP,FINI}
 ioctls and corresponding dmabuf ops, which userspace is required to
 call before / after CPU access.  Or just remove mmap() and do the
 mmap() via allocating device and use that device's equivalent
 DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls.  That
 would give you a way to (a) synchronize with gpu/asynchronous
 pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing
 buffer (ie. wait all devices have dma_buf_unmap_attachment'd).  And
 that gives you a convenient place to do cache operations on
 noncoherent architecture.

 I sort of preferred having the DMABUF shim because that lets you pass
 a buffer around userspace without the receiving code knowing about a
 device specific API.  But the problem I eventually came around to: if
 your GL stack (or some other userspace component) is batching up
 commands before submission to kernel, the buffers you need to wait for
 completion might not even be submitted yet.  So from kernel
 perspective they are ready for cpu access.  Even though in fact they
 are not in a consistent state from rendering perspective.  I don't
 really know a sane way to deal with that.  Maybe the approach instead
 should be a userspace level API (in libkms/libdrm?) to provide
 abstraction for userspace access to buffers rather than dealing with
 this at the kernel level.

Well, there's a reason GL has an explicit flush and extensions for sync
objects. It's to support such scenarios where the driver batches up gpu
commands before actually submitting them. Also, recent gpus have all (or
shortly will grow) multiple execution pipelines, so it's also important
that you sync up with the right command stream. Syncing up with all of
them is generally frowned upon for obvious reasons ;-)

So any userspace that interacts with an OpenGL driver needs to take care
of this anyway. But I think for simpler stuff (v4l) kernel only coherency
should work and userspace just needs to take care of gl interactions and
call glflush and friends at the right points. I think we can flesh this
out precisely when we spec the dmabuf EGL extension ... (or implement one
of the preexisting ones already around).

On the topic of a coherency model for dmabuf, I think we need to look at
dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and
cpu_finish or whatever they might get called) as barriers:

So after a dma_buf_map, all previsously completed dma operations (i.e.
unmap already called) and any cpu writes (i.e. cpu_finish called) will be
coherent. Similar rule holds for cpu access through the userspace mmap,
only writes completed before the cpu_start will show up.

Similar, writes done by the device are only guaranteed to show up after
the _unmap. Dito for cpu writes and cpu_finish.

In short we always need two function calls to denote the start/end of the
critical section.

Any concurrent operations are allowed to yield garbage, meaning any
combination of the old or either of the newly written contents (i.e.
non-overlapping writes might not actually all end up in the buffer,
but instead some old contents). Maybe we even need to loosen that to
the real undefined behaviour, but atm I can't think of an example.

-Daniel
-- 
Daniel Vetter
Mail: dan...@ffwll.ch

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Monday 05 December 2011 21:58:39 Daniel Vetter wrote:
 On Mon, Dec 05, 2011 at 08:29:49PM +0100, Arnd Bergmann wrote:
  ...
 
 Thanks a lot for this excellent overview. I think at least for the first
 version of dmabuf we should drop the sync_* interfaces and simply require
 users to bracket their usage of the buffer from the attached device by
 map/unmap. A dma_buf provider is always free to cache the mapping and
 simply call sync_sg_for of the streaming dma api.

I think we still have the same problem if we allow multiple drivers
to access a noncoherent buffer using map/unmap:

driver Adriver B

1.  read/write  
2.  read/write
3.  map()   
4.  read/write
5.  dma
6.  map()
7.  dma
8.  dma
9.  unmap()
10. dma
11. read/write
12. unmap() 



In step 4, the buffer is owned by device A, but accessed by driver B, which
is a bug. In step 11, the buffer is owned by device B but accessed by driver
A, which is the same bug on the other side. In steps 7 and 8, the buffer
is owned by both device A and B, which is currently undefined but would
be ok if both devices are on the same coherency domain. Whether that point
is meaningful depends on what the devices actually do. It would be ok
if both are only reading, but not if they write into the same location
concurrently.

As I mentioned originally, the problem could be completely avoided if
we only allow consistent (e.g. uncached) mappings or buffers that
are not mapped into the kernel virtual address space at all.

Alternatively, a clearer model would be to require each access to
nonconsistent buffers to be exclusive: a map() operation would have
to block until the current mapper (if any) has done an unmap(), and
any access from the CPU would also have to call a dma_buf_ops pointer
to serialize the CPU accesses with any device accesses. User
mappings of the buffer can be easily blocked during a DMA access
by unmapping the buffer from user space at map() time and blocking the
vm_ops-fault() operation until the unmap().

 If it later turns out that we want to be able to cache the sg list also on
 the use-site in the driver (e.g. map it into some hw sg list) we can
 always add that functionality later. I just fear that sync ops among N
 devices is a bit ill-defined and we already have a plethora of ill-defined
 issues at hand. Also the proposed api doesn't quite fit into what's
 already there, I think an s/device/dma_buf_attachment/ would be more
 consistent - otherwise the dmabuf provider might need to walk the list of
 attachements to get at the right one for the device.

Right, at last for the start, let's mandate just map/unmap and not provide
sync. I do wonder however whether we should implement consistent (possibly
uncached) or streaming (cacheable, but always owned by either the device
or the CPU, not both) buffers, or who gets to make the decision which
one is used if both are implemented.

   The map call gets the dma_data_direction parameter, so it should be able
   to do the right thing. And because we keep the attachement around, any
   caching of mappings should be possible, too.
  
   Yours, Daniel
  
   PS: Slightly related, because it will make the coherency nightmare worse,
   afaict: Can we kill mmap support?
 
  The mmap support is required (and only make sense) for consistent mappings,
  not for streaming mappings. The provider must ensure that if you map
  something uncacheable into the kernel in order to provide consistency,
  any mapping into user space must also be uncacheable. A driver
  that wants to have the buffer mapped to user space as many do should
  not need to know whether it is required to do cacheable or uncacheable
  mapping because of some other driver, and it should not need to worry
  about how to set up uncached mappings in user space.
 
 Either I've always missed it or no one ever described it that consisely,
 but now I see the use-case for mmap: Simpler drivers (i.e. not gpus) might
 need to expose a userspace mapping and only the provider knows how to do
 that in a coherent fashion. I want this in the docs (if it's not there yet
 ...).

It's currently implemented in the ARM/PowerPC-specific dma_mmap_coherent
function and documented (hardly) in arch/arm/include/asm/dma-mapping.h.

We should make clear in that this is actually an extension of the
regular dma mapping api that first needs to be made generic.

 But even with that use-case in mind I still have some gripes with the
 current mmap interfaces as proposed:
 - This use-case explains why the dmabuf provider needs to expose an mmap
   function. It

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Monday 05 December 2011 14:46:47 Rob Clark wrote:
 On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote:
  On Friday 02 December 2011, Sumit Semwal wrote:
  This is the first step in defining a dma buffer sharing mechanism.
 
  This looks very nice, but there are a few things I don't understand yet
  and a bunch of trivial comments I have about things I spotted.
 
  Do you have prototype exporter and consumer drivers that you can post
  for clarification?
 
 There is some dummy drivers based on an earlier version.  And airlied
 has a prime (multi-gpu) prototype:
 
 http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf
 
 I've got a nearly working camera+display prototype:
 
 https://github.com/robclark/kernel-omap4/commits/dmabuf

Ok, thanks. I think it would be good to post these for reference
in v3, with a clear indication that they are not being submitted
for discussion/inclusion yet.

  In the patch 2, you have a section about migration that mentions that
  it is possible to export a buffer that can be migrated after it
  is already mapped into one user driver. How does that work when
  the physical addresses are mapped into a consumer device already?
 
 I think you can do physical migration if you are attached, but
 probably not if you are mapped.

Ok, that's what I thought.

  You probably mean if (ret) here instead of if (!ret), right?
 
  + /* allow allocator to take care of cache ops */
  + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
  + void (*sync_sg_for_device)(struct dma_buf *, struct device *);
 
  I don't see how this works with multiple consumers: For the streaming
  DMA mapping, there must be exactly one owner, either the device or
  the CPU. Obviously, this rule needs to be extended when you get to
  multiple devices and multiple device drivers, plus possibly user
  mappings. Simply assigning the buffer to the device from one
  driver does not block other drivers from touching the buffer, and
  assigning it to the cpu does not stop other hardware that the
  code calling sync_sg_for_cpu is not aware of.
 
  The only way to solve this that I can think of right now is to
  mandate that the mappings are all coherent (i.e. noncachable
  on noncoherent architectures like ARM). If you do that, you no
  longer need the sync_sg_for_* calls.
 
 My original thinking was that you either need DMABUF_CPU_{PREP,FINI}
 ioctls and corresponding dmabuf ops, which userspace is required to
 call before / after CPU access.  Or just remove mmap() and do the
 mmap() via allocating device and use that device's equivalent
 DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls.  That
 would give you a way to (a) synchronize with gpu/asynchronous
 pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing
 buffer (ie. wait all devices have dma_buf_unmap_attachment'd).  And
 that gives you a convenient place to do cache operations on
 noncoherent architecture.

I wasn't even thinking of user mappings, as I replied to Daniel, I
think they are easy to solve (maybe not efficiently though)

 I sort of preferred having the DMABUF shim because that lets you pass
 a buffer around userspace without the receiving code knowing about a
 device specific API.  But the problem I eventually came around to: if
 your GL stack (or some other userspace component) is batching up
 commands before submission to kernel, the buffers you need to wait for
 completion might not even be submitted yet.  So from kernel
 perspective they are ready for cpu access.  Even though in fact they
 are not in a consistent state from rendering perspective.  I don't
 really know a sane way to deal with that.  Maybe the approach instead
 should be a userspace level API (in libkms/libdrm?) to provide
 abstraction for userspace access to buffers rather than dealing with
 this at the kernel level.

It would be nice if user space had no way to block out kernel drivers,
otherwise we have to be very careful to ensure that each map() operation
can be interrupted by a signal as the last resort to avoid deadlocks.

Arnd
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 5, 2011 at 3:23 PM, Daniel Vetter dan...@ffwll.ch wrote:
 On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote:
 On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote:
  In the patch 2, you have a section about migration that mentions that
  it is possible to export a buffer that can be migrated after it
  is already mapped into one user driver. How does that work when
  the physical addresses are mapped into a consumer device already?

 I think you can do physical migration if you are attached, but
 probably not if you are mapped.

 Yeah, that's very much how I see this, and also why map/unmap (at least
 for simple users like v4l) should only bracket actual usage. GPU memory
 managers need to be able to move around buffers while no one is using
 them.

 [snip]

  +     /* allow allocator to take care of cache ops */
  +     void (*sync_sg_for_cpu) (struct dma_buf *, struct device *);
  +     void (*sync_sg_for_device)(struct dma_buf *, struct device *);
 
  I don't see how this works with multiple consumers: For the streaming
  DMA mapping, there must be exactly one owner, either the device or
  the CPU. Obviously, this rule needs to be extended when you get to
  multiple devices and multiple device drivers, plus possibly user
  mappings. Simply assigning the buffer to the device from one
  driver does not block other drivers from touching the buffer, and
  assigning it to the cpu does not stop other hardware that the
  code calling sync_sg_for_cpu is not aware of.
 
  The only way to solve this that I can think of right now is to
  mandate that the mappings are all coherent (i.e. noncachable
  on noncoherent architectures like ARM). If you do that, you no
  longer need the sync_sg_for_* calls.

 My original thinking was that you either need DMABUF_CPU_{PREP,FINI}
 ioctls and corresponding dmabuf ops, which userspace is required to
 call before / after CPU access.  Or just remove mmap() and do the
 mmap() via allocating device and use that device's equivalent
 DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls.  That
 would give you a way to (a) synchronize with gpu/asynchronous
 pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing
 buffer (ie. wait all devices have dma_buf_unmap_attachment'd).  And
 that gives you a convenient place to do cache operations on
 noncoherent architecture.

 I sort of preferred having the DMABUF shim because that lets you pass
 a buffer around userspace without the receiving code knowing about a
 device specific API.  But the problem I eventually came around to: if
 your GL stack (or some other userspace component) is batching up
 commands before submission to kernel, the buffers you need to wait for
 completion might not even be submitted yet.  So from kernel
 perspective they are ready for cpu access.  Even though in fact they
 are not in a consistent state from rendering perspective.  I don't
 really know a sane way to deal with that.  Maybe the approach instead
 should be a userspace level API (in libkms/libdrm?) to provide
 abstraction for userspace access to buffers rather than dealing with
 this at the kernel level.

 Well, there's a reason GL has an explicit flush and extensions for sync
 objects. It's to support such scenarios where the driver batches up gpu
 commands before actually submitting them.

Hmm.. what about other non-GL APIs..  maybe vaapi/vdpau or similar?
(Or something that I haven't thought of.)

 Also, recent gpus have all (or
 shortly will grow) multiple execution pipelines, so it's also important
 that you sync up with the right command stream. Syncing up with all of
 them is generally frowned upon for obvious reasons ;-)

Well, I guess I am happy enough with something that is at least
functional.  Usespace access would (I think) mainly be weird edge case
type stuff.  But...

 So any userspace that interacts with an OpenGL driver needs to take care
 of this anyway. But I think for simpler stuff (v4l) kernel only coherency
 should work and userspace just needs to take care of gl interactions and
 call glflush and friends at the right points. I think we can flesh this
 out precisely when we spec the dmabuf EGL extension ... (or implement one
 of the preexisting ones already around).

.. yeah, I think egl/eglImage extension would be the right place to
hide this behind.  And I guess your GL stack should be able to figure
out which execution pipeline to sync, cache state of buffer, and
whatever other optimizations you might want to make.

 On the topic of a coherency model for dmabuf, I think we need to look at
 dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and
 cpu_finish or whatever they might get called) as barriers:

 So after a dma_buf_map, all previsously completed dma operations (i.e.
 unmap already called) and any cpu writes (i.e. cpu_finish called) will be
 coherent. Similar rule holds for cpu access through the userspace mmap,
 only writes completed before the cpu_start

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 5, 2011 at 4:09 PM, Arnd Bergmann a...@arndb.de wrote:

 https://github.com/robclark/kernel-omap4/commits/dmabuf

 Ok, thanks. I think it would be good to post these for reference
 in v3, with a clear indication that they are not being submitted
 for discussion/inclusion yet.

btw, don't look at this too closely at that tree yet.. where the
attach/detach is done in videobuf2 code isn't really correct.  But I
was going to get something functioning first.

BR,
-R
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism

On Mon, Dec 05, 2011 at 04:11:46PM -0600, Rob Clark wrote:
 On Mon, Dec 5, 2011 at 3:23 PM, Daniel Vetter dan...@ffwll.ch wrote:
  On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote:
  I sort of preferred having the DMABUF shim because that lets you pass
  a buffer around userspace without the receiving code knowing about a
  device specific API.  But the problem I eventually came around to: if
  your GL stack (or some other userspace component) is batching up
  commands before submission to kernel, the buffers you need to wait for
  completion might not even be submitted yet.  So from kernel
  perspective they are ready for cpu access.  Even though in fact they
  are not in a consistent state from rendering perspective.  I don't
  really know a sane way to deal with that.  Maybe the approach instead
  should be a userspace level API (in libkms/libdrm?) to provide
  abstraction for userspace access to buffers rather than dealing with
  this at the kernel level.
 
  Well, there's a reason GL has an explicit flush and extensions for sync
  objects. It's to support such scenarios where the driver batches up gpu
  commands before actually submitting them.

 Hmm.. what about other non-GL APIs..  maybe vaapi/vdpau or similar?
 (Or something that I haven't thought of.)

They generally all have a concept of when they've actually commited the
rendering to an X pixmap or egl image. Usually it's rather implicit, e.g.
the driver will submit any outstanding batches before returning from any
calls.
-Daniel
-- 
Daniel Vetter
Mail: dan...@ffwll.ch
Mobile: +41 (0)79 365 57 48
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism