Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
2012/1/10 InKi Dae daei...@gmail.com: 2012/1/10 Semwal, Sumit sumit.sem...@ti.com: On Tue, Jan 10, 2012 at 7:44 AM, Rob Clark r...@ti.com wrote: On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote: 2012/1/10 Rob Clark r...@ti.com: at least with no IOMMU, the memory information(containing physical memory address) would be copied to vb2_xx_buf object if drm gem exported its own buffer and vb2 wants to use that buffer at this time, sg table is used to share that buffer. and the problem I pointed out is that this buffer(also physical memory region) could be released by vb2 framework(as you know, vb2_xx_buf object and the memory region for buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that so some problems would be induced once drm gem tries to release or access that buffer. and I have tried to resolve this issue adding get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there would be better way. Hi Inki, As also mentioned in the documentation patch, importer (the user of the buffer) - in this case for current RFC patches on v4l2-as-a-user[1] vb2 framework - shouldn't release the backing memory of the buffer directly - it should only use the dma-buf callbacks in the right sequence to let the exporter know that it is done using this buffer, so the exporter can release it if allowed and needed. thank you for your comments.:) and below are some tables about dmabuf operations with ideal use and these tables indicate reference count of when buffer is created, shared and released. so if there are any problems, please let me know. P.S. these are just simple cases so there would be others. in case of using only drm gem and dmabuf, operations gem refcount file f_count buf refcount 1. gem create A:1 A:0 2. export(handle A - fd) A:2 A:1 A:0 3. import(fd - handle B) A:2, B:1 A:2 A:1 4. file close(A) A:2, B:1 A:1 A:1 5. gem close(A) A:1, B:1 A:1 A:1 6. gem close(B) A:1, B:0 A:1 A:0 7. file close(A) A:0 A:0 --- 3. handle B shares the buf of handle A. 6. release handle B but its buf. 7. release gem handle A and dmabuf of file A and also physical memory region. and in case of using drm gem, vb2 and dmabuf, operations gem, vb2 refcount file f_count buf refcount 1. gem create A:1 A:0 (GEM side) 2. export(handle A - fd) A:2 A:1 A:0 (GEM side) 3. import(fd - handle B) A:2, B:1 A:2 A:1 (VB2 side) 4. file close(A) A:2, B:1 A:1 A:1 (VB2 side) 5. vb2 close(B) A:2, B:0 A:1 A:0 (VB2 side) 6. gem close(A) A:1 A:1 A:0 (GEM side) 7. file close(A) A:0 A:0 (GEM side) 3. vb2 handle B is shared with the buf of gem handle A. 5. release vb2 handle B and decrease refcount of the buf pointed by it. 7. release gem handle A and dmabuf of file A and also physical memory region. Ah, sorry, it seems that file close shouldn't be called because file-f_count of the file would be dropped by Importer and if f_count is 0 then the file would be released by fput() so I'm not sure but again: in case of using only drm gem and dmabuf, operations gem refcount file f_count buf refcount -- 1. gem create(A) A:1 A:0 2. export(handle A - fd)A:2 A:1 A:0 3. import(fd - handle B)A:2, B:1 A:2 A:1 4. gem close(B)A:2, B:release A:1 A:0 5. gem close(A)A:1, A:1 A:0 6. gem close(A)A:release A:release A:release - and in case of using drm gem, vb2 and dmabuf, operations gem, vb2 refcountfile f_count buf refcount 1. gem create A:1 A:0
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
2012/1/10 Rob Clark r...@ti.com: On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote: 2012/1/10 Rob Clark r...@ti.com: On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote: note : in case of sharing a buffer between v4l2 and drm driver, the memory info would be copied vb2_xx_buf to xx_gem or xx_gem to vb2_xx_buf through sg table. in this case, only memory info is used to share, not some objects. which v4l2/vb2 patches are you looking at? The patches I was using, vb2 holds a reference to the 'struct dma_buf *' internally, not just keeping the sg_table yes, not keeping the sg_table. I mean... see a example below please. static void vb2_dma_contig_map_dmabuf(void *mem_priv) { struct sg_table *sg; ... sg = dma_buf_map_attachment(buf-db_attach, dir); ... buf-dma_addr = sg_dma_address(sg-sgl); ... } at least with no IOMMU, the memory information(containing physical memory address) would be copied to vb2_xx_buf object if drm gem exported its own buffer and vb2 wants to use that buffer at this time, sg table is used to share that buffer. and the problem I pointed out is that this buffer(also physical memory region) could be released by vb2 framework(as you know, vb2_xx_buf object and the memory region for buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that so some problems would be induced once drm gem tries to release or access that buffer. and I have tried to resolve this issue adding get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there would be better way. the exporter (in this case your driver's drm/gem bits) shouldn't release that mapping / sgtable until the importer (in this case v4l2) calls dma_buf_unmap fxn.. It would be an error if the importer did a dma_buf_put() without first calling dma_buf_unmap_attachment() (if currently mapped) and then dma_buf_detach() (if currently attached). Perhaps somewhere there should be some sanity checking debug code which could be enabled to do a WARN_ON() if the importer does the wrong thing. It shouldn't really be part of the API, I don't think, but it actually does seem like a good thing, esp. as new drivers start trying to use dmabuf, to have some debug options which could be enabled. It is entirely possible that something was missed on the vb2 patches, but the way it is intended to work is like this: https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92 where it does a detach() before the dma_buf_put(), and the vb2-contig backend checks here that it is also unmapped(): https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251 I think that we also used same concept as your. for this, you can refer to Dave's repository below and see the drm_prime_gem_destroy function. http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-prime-dmabufid=7cb374d6642e838e0e4836042e057e6d9139dcad but when it comes to releasing resources, I mistakely understood some parts of dmabuf concept so thank you for Rob and Sumit. that is very useful. BR, -R Thanks. BR, -R ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote: I has test dmabuf based drm gem module for exynos and I found one problem. you can refer to this test repository: http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf at this repository, I added some exception codes for resource release in addition to Dave's patch sets. let's suppose we use dmabuf based vb2 and drm gem with physically continuous memory(no IOMMU) and we try to share allocated buffer between them(v4l2 and drm driver). 1. request memory allocation through drm gem interface. 2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the gem object. - internally, private gem based dmabuf moudle calls drm_buf_export() to register allocated gem object to fd. 3. request qbuf with the fd(got from 2) and DMABUF type to set the buffer to v4l2 based device. - internally, vb2 plug in module gets a buffer to the fd and then calls dmabuf-ops-map_dmabuf() callback to get the sg table containing physical memory info to the gem object. and then the physical memory info would be copied to vb2_xx_buf object. for DMABUF feature for v4l2 and videobuf2 framework, you can refer to this repository: git://github.com/robclark/kernel-omap4.git drmplane-dmabuf after that, if v4l2 driver want to release vb2_xx_buf object with allocated memory region by user request, how should we do?. refcount to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf object is released videobuf2 framework don't know who is using the physical memory region. so this physical memory region is released and when drm driver tries to access the region or to release it also, a problem would be induced. for this problem, I added get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there may be better way. if there is any missing point, please let me know. The dma_buf object needs to hold a reference on the underlying (necessarily reference-counted) buffer object when the exporter creates the dma_buf handle. This reference should then get dropped in the exporters dma_buf-ops-release() function, which is only getting called when the last reference to the dma_buf disappears. If this doesn't work like that currently, we have a bug, and exporting the reference count or something similar can't fix that. Yours, Daniel PS: Please cut down the original mail when replying, otherwise it's pretty hard to find your response ;-) -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
2012/1/9 Daniel Vetter dan...@ffwll.ch: On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote: I has test dmabuf based drm gem module for exynos and I found one problem. you can refer to this test repository: http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf at this repository, I added some exception codes for resource release in addition to Dave's patch sets. let's suppose we use dmabuf based vb2 and drm gem with physically continuous memory(no IOMMU) and we try to share allocated buffer between them(v4l2 and drm driver). 1. request memory allocation through drm gem interface. 2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the gem object. - internally, private gem based dmabuf moudle calls drm_buf_export() to register allocated gem object to fd. 3. request qbuf with the fd(got from 2) and DMABUF type to set the buffer to v4l2 based device. - internally, vb2 plug in module gets a buffer to the fd and then calls dmabuf-ops-map_dmabuf() callback to get the sg table containing physical memory info to the gem object. and then the physical memory info would be copied to vb2_xx_buf object. for DMABUF feature for v4l2 and videobuf2 framework, you can refer to this repository: git://github.com/robclark/kernel-omap4.git drmplane-dmabuf after that, if v4l2 driver want to release vb2_xx_buf object with allocated memory region by user request, how should we do?. refcount to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf object is released videobuf2 framework don't know who is using the physical memory region. so this physical memory region is released and when drm driver tries to access the region or to release it also, a problem would be induced. for this problem, I added get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there may be better way. if there is any missing point, please let me know. The dma_buf object needs to hold a reference on the underlying (necessarily reference-counted) buffer object when the exporter creates the dma_buf handle. This reference should then get dropped in the exporters dma_buf-ops-release() function, which is only getting called when the last reference to the dma_buf disappears. when the exporter creates the dma_buf handle(for example, gem - fd), I think the refcount of gem object should be increased at this point, and decreased by dma_buf-ops-release() again because when the dma_buf is created and dma_buf_export() is called, this dma_buf refers to the gem object one time. and in case of inporter(fd - gem), file-f_count of the dma_buf is increased and then when this gem object is released by user request such as drm close or drn_gem_close_ioctl, dma_buf_put() should be called by dma_buf-ops-detach() to decrease file-f_count again because the gem object refers to the dma_buf. for this, you can refer to my test repository I mentioned above. but the problem is that when a buffer is released by one side, another can't know whether the buffer already was released or not. note : in case of sharing a buffer between v4l2 and drm driver, the memory info would be copied vb2_xx_buf to xx_gem or xx_gem to vb2_xx_buf through sg table. in this case, only memory info is used to share, not some objects. If this doesn't work like that currently, we have a bug, and exporting the reference count or something similar can't fix that. Yours, Daniel PS: Please cut down the original mail when replying, otherwise it's pretty hard to find your response ;-) Ok, got it. thanks. :) -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Jan 09, 2012 at 07:10:25PM +0900, InKi Dae wrote: 2012/1/9 Daniel Vetter dan...@ffwll.ch: On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote: I has test dmabuf based drm gem module for exynos and I found one problem. you can refer to this test repository: http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf at this repository, I added some exception codes for resource release in addition to Dave's patch sets. let's suppose we use dmabuf based vb2 and drm gem with physically continuous memory(no IOMMU) and we try to share allocated buffer between them(v4l2 and drm driver). 1. request memory allocation through drm gem interface. 2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the gem object. - internally, private gem based dmabuf moudle calls drm_buf_export() to register allocated gem object to fd. 3. request qbuf with the fd(got from 2) and DMABUF type to set the buffer to v4l2 based device. - internally, vb2 plug in module gets a buffer to the fd and then calls dmabuf-ops-map_dmabuf() callback to get the sg table containing physical memory info to the gem object. and then the physical memory info would be copied to vb2_xx_buf object. for DMABUF feature for v4l2 and videobuf2 framework, you can refer to this repository: git://github.com/robclark/kernel-omap4.git drmplane-dmabuf after that, if v4l2 driver want to release vb2_xx_buf object with allocated memory region by user request, how should we do?. refcount to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf object is released videobuf2 framework don't know who is using the physical memory region. so this physical memory region is released and when drm driver tries to access the region or to release it also, a problem would be induced. for this problem, I added get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there may be better way. if there is any missing point, please let me know. The dma_buf object needs to hold a reference on the underlying (necessarily reference-counted) buffer object when the exporter creates the dma_buf handle. This reference should then get dropped in the exporters dma_buf-ops-release() function, which is only getting called when the last reference to the dma_buf disappears. when the exporter creates the dma_buf handle(for example, gem - fd), I think the refcount of gem object should be increased at this point, and decreased by dma_buf-ops-release() again because when the dma_buf is created and dma_buf_export() is called, this dma_buf refers to the gem object one time. and in case of inporter(fd - gem), file-f_count of the dma_buf is increased and then when this gem object is released by user request such as drm close or drn_gem_close_ioctl, dma_buf_put() should be called by dma_buf-ops-detach() to decrease file-f_count again because the gem object refers to the dma_buf. for this, you can refer to my test repository I mentioned above. but the problem is that when a buffer is released by one side, another can't know whether the buffer already was released or not. Nope, dma_buf_put should not be called by -detach. The importer gets his reference when importing the dma_buf and needs to drop that reference himself when it's done using the buffer by calling dma_buf_put (i.e. after the last -detach call). I think adding separate reference counting to -attach and -detach is a waste of time and only papers over buggy importers. Additionally the importer does _not_ control the lifetime of an dma_buf object and it's underlying backing storage. It hence may _never_ free the backing storage itself, that's the job of the exporter. With that cleared up, referencing the exporters underlying buffer object from the dma_buf will just do the right thing. note : in case of sharing a buffer between v4l2 and drm driver, the memory info would be copied vb2_xx_buf to xx_gem or xx_gem to vb2_xx_buf through sg table. in this case, only memory info is used to share, not some objects. Hm, maybe I need to take a look at the currently proposed v4l dma_buf patches ;-) atm I don't have an idea what exactly you're talking about. Yours, Daniel -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
2012/1/9 Daniel Vetter dan...@ffwll.ch: On Mon, Jan 09, 2012 at 07:10:25PM +0900, InKi Dae wrote: 2012/1/9 Daniel Vetter dan...@ffwll.ch: On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote: I has test dmabuf based drm gem module for exynos and I found one problem. you can refer to this test repository: http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf at this repository, I added some exception codes for resource release in addition to Dave's patch sets. let's suppose we use dmabuf based vb2 and drm gem with physically continuous memory(no IOMMU) and we try to share allocated buffer between them(v4l2 and drm driver). 1. request memory allocation through drm gem interface. 2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the gem object. - internally, private gem based dmabuf moudle calls drm_buf_export() to register allocated gem object to fd. 3. request qbuf with the fd(got from 2) and DMABUF type to set the buffer to v4l2 based device. - internally, vb2 plug in module gets a buffer to the fd and then calls dmabuf-ops-map_dmabuf() callback to get the sg table containing physical memory info to the gem object. and then the physical memory info would be copied to vb2_xx_buf object. for DMABUF feature for v4l2 and videobuf2 framework, you can refer to this repository: git://github.com/robclark/kernel-omap4.git drmplane-dmabuf after that, if v4l2 driver want to release vb2_xx_buf object with allocated memory region by user request, how should we do?. refcount to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf object is released videobuf2 framework don't know who is using the physical memory region. so this physical memory region is released and when drm driver tries to access the region or to release it also, a problem would be induced. for this problem, I added get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there may be better way. if there is any missing point, please let me know. The dma_buf object needs to hold a reference on the underlying (necessarily reference-counted) buffer object when the exporter creates the dma_buf handle. This reference should then get dropped in the exporters dma_buf-ops-release() function, which is only getting called when the last reference to the dma_buf disappears. when the exporter creates the dma_buf handle(for example, gem - fd), I think the refcount of gem object should be increased at this point, and decreased by dma_buf-ops-release() again because when the dma_buf is created and dma_buf_export() is called, this dma_buf refers to the gem object one time. and in case of inporter(fd - gem), file-f_count of the dma_buf is increased and then when this gem object is released by user request such as drm close or drn_gem_close_ioctl, dma_buf_put() should be called by dma_buf-ops-detach() to decrease file-f_count again because the gem object refers to the dma_buf. for this, you can refer to my test repository I mentioned above. but the problem is that when a buffer is released by one side, another can't know whether the buffer already was released or not. Nope, dma_buf_put should not be called by -detach. The importer gets his reference when importing the dma_buf and needs to drop that reference himself when it's done using the buffer by calling dma_buf_put (i.e. after the last -detach call). I'm afraid that there may be my missing points. I'm confusing. who is Importer and who is Exporter you think? Importer is fd goes to private buffer and Exporter is private buffer goes to fd? if so, yes, when the importer needs to drop that reference(the importer want to release that buffer), dma_buf_put() should be called somewhere and in my case, that function is called by drm_prime_gem_destory(). this function is included at Dave's patch sets and also dma_buf_detatch() is called there. and I just thought that here is right place. I didn't find the place dma_buf_put() is called anywhere. could you please tell me where dma_buf_put() should be called at you think?. for this, you can refer to Dave's repository: http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf I think adding separate reference counting to -attach and -detach is a waste of time and only papers over buggy importers. I mean when fd goes to private buffer, that reference(file-f_count) would be increased by dma_buf_get() and only -detach is used to drop that reference. Additionally the importer does _not_ control the lifetime of an dma_buf object and it's underlying backing storage. It hence may _never_ free the backing storage itself, that's the job of the exporter. yes, my understanding is that if user app requested close(fd) call to the Importer, after close(fd), file-f_count would be decreased but would still has 1. because file-f_count already was 1 by the Exporter.
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote: note : in case of sharing a buffer between v4l2 and drm driver, the memory info would be copied vb2_xx_buf to xx_gem or xx_gem to vb2_xx_buf through sg table. in this case, only memory info is used to share, not some objects. which v4l2/vb2 patches are you looking at? The patches I was using, vb2 holds a reference to the 'struct dma_buf *' internally, not just keeping the sg_table BR, -R ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Jan 09, 2012 at 09:06:56PM +0900, InKi Dae wrote: 2012/1/9 Daniel Vetter dan...@ffwll.ch: On Mon, Jan 09, 2012 at 07:10:25PM +0900, InKi Dae wrote: 2012/1/9 Daniel Vetter dan...@ffwll.ch: On Mon, Jan 09, 2012 at 03:20:48PM +0900, InKi Dae wrote: I has test dmabuf based drm gem module for exynos and I found one problem. you can refer to this test repository: http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/exynos-drm-dmabuf at this repository, I added some exception codes for resource release in addition to Dave's patch sets. let's suppose we use dmabuf based vb2 and drm gem with physically continuous memory(no IOMMU) and we try to share allocated buffer between them(v4l2 and drm driver). 1. request memory allocation through drm gem interface. 2. request DRM_SET_PRIME ioctl with the gem handle to get a fd to the gem object. - internally, private gem based dmabuf moudle calls drm_buf_export() to register allocated gem object to fd. 3. request qbuf with the fd(got from 2) and DMABUF type to set the buffer to v4l2 based device. - internally, vb2 plug in module gets a buffer to the fd and then calls dmabuf-ops-map_dmabuf() callback to get the sg table containing physical memory info to the gem object. and then the physical memory info would be copied to vb2_xx_buf object. for DMABUF feature for v4l2 and videobuf2 framework, you can refer to this repository: git://github.com/robclark/kernel-omap4.git drmplane-dmabuf after that, if v4l2 driver want to release vb2_xx_buf object with allocated memory region by user request, how should we do?. refcount to vb2_xx_buf is dependent on videobuf2 framework. so when vb2_xx_buf object is released videobuf2 framework don't know who is using the physical memory region. so this physical memory region is released and when drm driver tries to access the region or to release it also, a problem would be induced. for this problem, I added get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there may be better way. if there is any missing point, please let me know. The dma_buf object needs to hold a reference on the underlying (necessarily reference-counted) buffer object when the exporter creates the dma_buf handle. This reference should then get dropped in the exporters dma_buf-ops-release() function, which is only getting called when the last reference to the dma_buf disappears. when the exporter creates the dma_buf handle(for example, gem - fd), I think the refcount of gem object should be increased at this point, and decreased by dma_buf-ops-release() again because when the dma_buf is created and dma_buf_export() is called, this dma_buf refers to the gem object one time. and in case of inporter(fd - gem), file-f_count of the dma_buf is increased and then when this gem object is released by user request such as drm close or drn_gem_close_ioctl, dma_buf_put() should be called by dma_buf-ops-detach() to decrease file-f_count again because the gem object refers to the dma_buf. for this, you can refer to my test repository I mentioned above. but the problem is that when a buffer is released by one side, another can't know whether the buffer already was released or not. Nope, dma_buf_put should not be called by -detach. The importer gets his reference when importing the dma_buf and needs to drop that reference himself when it's done using the buffer by calling dma_buf_put (i.e. after the last -detach call). I'm afraid that there may be my missing points. I'm confusing. who is Importer and who is Exporter you think? Importer is fd goes to private buffer and Exporter is private buffer goes to fd? if so, yes, when the importer needs to drop that reference(the importer want to release that buffer), dma_buf_put() should be called somewhere and in my case, that function is called by drm_prime_gem_destory(). this function is included at Dave's patch sets and also dma_buf_detatch() is called there. and I just thought that here is right place. I didn't find the place dma_buf_put() is called anywhere. could you please tell me where dma_buf_put() should be called at you think?. for this, you can refer to Dave's repository: http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf I haven't really looked at Dave's latest prime patches, but he reported some reference counting issues last time around we chatted about it on irc. So maybe you're just right and the dma_buf_put is indeed missing from drm_prime_gem_destroy ;-) But as I've said, haven't really reviewed the code, so I'm likely completely wrong. Cheers, Daniel -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
2012/1/10 Rob Clark r...@ti.com: On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote: note : in case of sharing a buffer between v4l2 and drm driver, the memory info would be copied vb2_xx_buf to xx_gem or xx_gem to vb2_xx_buf through sg table. in this case, only memory info is used to share, not some objects. which v4l2/vb2 patches are you looking at? The patches I was using, vb2 holds a reference to the 'struct dma_buf *' internally, not just keeping the sg_table yes, not keeping the sg_table. I mean... see a example below please. static void vb2_dma_contig_map_dmabuf(void *mem_priv) { struct sg_table *sg; ... sg = dma_buf_map_attachment(buf-db_attach, dir); ... buf-dma_addr = sg_dma_address(sg-sgl); ... } at least with no IOMMU, the memory information(containing physical memory address) would be copied to vb2_xx_buf object if drm gem exported its own buffer and vb2 wants to use that buffer at this time, sg table is used to share that buffer. and the problem I pointed out is that this buffer(also physical memory region) could be released by vb2 framework(as you know, vb2_xx_buf object and the memory region for buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that so some problems would be induced once drm gem tries to release or access that buffer. and I have tried to resolve this issue adding get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there would be better way. Thanks. BR, -R ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote: 2012/1/10 Rob Clark r...@ti.com: On Mon, Jan 9, 2012 at 4:10 AM, InKi Dae daei...@gmail.com wrote: note : in case of sharing a buffer between v4l2 and drm driver, the memory info would be copied vb2_xx_buf to xx_gem or xx_gem to vb2_xx_buf through sg table. in this case, only memory info is used to share, not some objects. which v4l2/vb2 patches are you looking at? The patches I was using, vb2 holds a reference to the 'struct dma_buf *' internally, not just keeping the sg_table yes, not keeping the sg_table. I mean... see a example below please. static void vb2_dma_contig_map_dmabuf(void *mem_priv) { struct sg_table *sg; ... sg = dma_buf_map_attachment(buf-db_attach, dir); ... buf-dma_addr = sg_dma_address(sg-sgl); ... } at least with no IOMMU, the memory information(containing physical memory address) would be copied to vb2_xx_buf object if drm gem exported its own buffer and vb2 wants to use that buffer at this time, sg table is used to share that buffer. and the problem I pointed out is that this buffer(also physical memory region) could be released by vb2 framework(as you know, vb2_xx_buf object and the memory region for buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that so some problems would be induced once drm gem tries to release or access that buffer. and I have tried to resolve this issue adding get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there would be better way. the exporter (in this case your driver's drm/gem bits) shouldn't release that mapping / sgtable until the importer (in this case v4l2) calls dma_buf_unmap fxn.. It would be an error if the importer did a dma_buf_put() without first calling dma_buf_unmap_attachment() (if currently mapped) and then dma_buf_detach() (if currently attached). Perhaps somewhere there should be some sanity checking debug code which could be enabled to do a WARN_ON() if the importer does the wrong thing. It shouldn't really be part of the API, I don't think, but it actually does seem like a good thing, esp. as new drivers start trying to use dmabuf, to have some debug options which could be enabled. It is entirely possible that something was missed on the vb2 patches, but the way it is intended to work is like this: https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92 where it does a detach() before the dma_buf_put(), and the vb2-contig backend checks here that it is also unmapped(): https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251 BR, -R Thanks. BR, -R ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Tue, Jan 10, 2012 at 7:44 AM, Rob Clark r...@ti.com wrote: On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote: 2012/1/10 Rob Clark r...@ti.com: at least with no IOMMU, the memory information(containing physical memory address) would be copied to vb2_xx_buf object if drm gem exported its own buffer and vb2 wants to use that buffer at this time, sg table is used to share that buffer. and the problem I pointed out is that this buffer(also physical memory region) could be released by vb2 framework(as you know, vb2_xx_buf object and the memory region for buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that so some problems would be induced once drm gem tries to release or access that buffer. and I have tried to resolve this issue adding get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there would be better way. Hi Inki, As also mentioned in the documentation patch, importer (the user of the buffer) - in this case for current RFC patches on v4l2-as-a-user[1] vb2 framework - shouldn't release the backing memory of the buffer directly - it should only use the dma-buf callbacks in the right sequence to let the exporter know that it is done using this buffer, so the exporter can release it if allowed and needed. the exporter (in this case your driver's drm/gem bits) shouldn't release that mapping / sgtable until the importer (in this case v4l2) calls dma_buf_unmap fxn.. It would be an error if the importer did a dma_buf_put() without first calling dma_buf_unmap_attachment() (if currently mapped) and then dma_buf_detach() (if currently attached). Perhaps somewhere there should be some sanity checking debug code which could be enabled to do a WARN_ON() if the importer does the wrong thing. It shouldn't really be part of the API, I don't think, but it actually does seem like a good thing, esp. as new drivers start trying to use dmabuf, to have some debug options which could be enabled. It is entirely possible that something was missed on the vb2 patches, but the way it is intended to work is like this: https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92 where it does a detach() before the dma_buf_put(), and the vb2-contig backend checks here that it is also unmapped(): https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251 The proposed RFC for V4L2 adaptation at [1] does exactly the same thing; detach() before dma_buf_put(), and check in vb2-contig backend for unmapped() as mentioned above. BR, -R BR, Sumit. [1]: V4l2 as a dma-buf user RFC: http://comments.gmane.org/gmane.linux.drivers.video-input-infrastructure/42966 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
2012/1/10 Semwal, Sumit sumit.sem...@ti.com: On Tue, Jan 10, 2012 at 7:44 AM, Rob Clark r...@ti.com wrote: On Mon, Jan 9, 2012 at 7:34 PM, InKi Dae daei...@gmail.com wrote: 2012/1/10 Rob Clark r...@ti.com: at least with no IOMMU, the memory information(containing physical memory address) would be copied to vb2_xx_buf object if drm gem exported its own buffer and vb2 wants to use that buffer at this time, sg table is used to share that buffer. and the problem I pointed out is that this buffer(also physical memory region) could be released by vb2 framework(as you know, vb2_xx_buf object and the memory region for buf-dma_addr pointing) but the Exporter(drm gem) couldn't know that so some problems would be induced once drm gem tries to release or access that buffer. and I have tried to resolve this issue adding get_shared_cnt() callback to dma-buf.h but I'm not sure that this is good way. maybe there would be better way. Hi Inki, As also mentioned in the documentation patch, importer (the user of the buffer) - in this case for current RFC patches on v4l2-as-a-user[1] vb2 framework - shouldn't release the backing memory of the buffer directly - it should only use the dma-buf callbacks in the right sequence to let the exporter know that it is done using this buffer, so the exporter can release it if allowed and needed. thank you for your comments.:) and below are some tables about dmabuf operations with ideal use and these tables indicate reference count of when buffer is created, shared and released. so if there are any problems, please let me know. P.S. these are just simple cases so there would be others. in case of using only drm gem and dmabuf, operations gem refcountfile f_count buf refcount 1. gem create A:1 A:0 2. export(handle A - fd)A:2A:1 A:0 3. import(fd - handle B)A:2, B:1 A:2 A:1 4. file close(A) A:2, B:1 A:1 A:1 5. gem close(A)A:1, B:1 A:1 A:1 6. gem close(B)A:1, B:0 A:1 A:0 7. file close(A) A:0A:0 --- 3. handle B shares the buf of handle A. 6. release handle B but its buf. 7. release gem handle A and dmabuf of file A and also physical memory region. and in case of using drm gem, vb2 and dmabuf, operations gem, vb2 refcountfile f_count buf refcount 1. gem create A:1 A:0 (GEM side) 2. export(handle A - fd)A:2 A:1 A:0 (GEM side) 3. import(fd - handle B)A:2, B:1 A:2 A:1 (VB2 side) 4. file close(A) A:2, B:1 A:1 A:1 (VB2 side) 5. vb2 close(B) A:2, B:0 A:1 A:0 (VB2 side) 6. gem close(A)A:1A:1 A:0 (GEM side) 7. file close(A) A:0A:0 (GEM side) 3. vb2 handle B is shared with the buf of gem handle A. 5. release vb2 handle B and decrease refcount of the buf pointed by it. 7. release gem handle A and dmabuf of file A and also physical memory region. the exporter (in this case your driver's drm/gem bits) shouldn't release that mapping / sgtable until the importer (in this case v4l2) calls dma_buf_unmap fxn.. It would be an error if the importer did a dma_buf_put() without first calling dma_buf_unmap_attachment() (if currently mapped) and then dma_buf_detach() (if currently attached). Perhaps somewhere there should be some sanity checking debug code which could be enabled to do a WARN_ON() if the importer does the wrong thing. It shouldn't really be part of the API, I don't think, but it actually does seem like a good thing, esp. as new drivers start trying to use dmabuf, to have some debug options which could be enabled. It is entirely possible that something was missed on the vb2 patches, but the way it is intended to work is like this: https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-core.c#L92 where it does a detach() before the dma_buf_put(), and the vb2-contig backend checks here that it is also unmapped(): https://github.com/robclark/kernel-omap4/blob/0961428143cd10269223e3d0f24bc3a66a96185f/drivers/media/video/videobuf2-dma-contig.c#L251 The proposed RFC for V4L2 adaptation at [1] does exactly the same thing; detach() before dma_buf_put(),
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Wednesday 07 December 2011, Semwal, Sumit wrote: Do you have a use case for making the interface compile-time disabled? I had assumed that any code using it would make no sense if it's not available so you don't actually need this. Ok. Though if we keep the interface compile-time disabled, the users can actually check and fail or fall-back gracefully when the API is not available; If I remove it, anyways the users would need to do the same compile time check whether API is available or not, right? If you have to do a compile-time check for the config symbol, it's better to do it the way you did here than in the caller. My guess was that no caller would actually require this, because when you write a part of a subsystem to interact with the dma-buf infrastructure, you would always disable compilation of an extire file that deals with everything related to struct dma_buf, not just stub out the calls. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Wed, Dec 7, 2011 at 3:41 PM, Arnd Bergmann a...@arndb.de wrote: On Wednesday 07 December 2011, Semwal, Sumit wrote: Do you have a use case for making the interface compile-time disabled? I had assumed that any code using it would make no sense if it's not available so you don't actually need this. Ok. Though if we keep the interface compile-time disabled, the users can actually check and fail or fall-back gracefully when the API is not available; If I remove it, anyways the users would need to do the same compile time check whether API is available or not, right? If you have to do a compile-time check for the config symbol, it's better to do it the way you did here than in the caller. My guess was that no caller would actually require this, because when you write a part of a subsystem to interact with the dma-buf infrastructure, you would always disable compilation of an extire file that deals with everything related to struct dma_buf, not just stub out the calls. Right; that would be ideal, but we may not be able to ask each user to do so - especially when the sharing part might be interspersed in existing buffer handling code. So for now, I would like to keep it as it-is. Arnd BR, ~Sumit. -- ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Wednesday 07 December 2011, Semwal, Sumit wrote: Right; that would be ideal, but we may not be able to ask each user to do so - especially when the sharing part might be interspersed in existing buffer handling code. So for now, I would like to keep it as it-is. Ok, fair enough. It certainly doesn't hurt. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
Hi Daniel, Rob, On Tue, Dec 6, 2011 at 3:41 AM, Rob Clark r...@ti.com wrote: On Mon, Dec 5, 2011 at 3:23 PM, Daniel Vetter dan...@ffwll.ch wrote: On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote: On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote: In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? I think you can do physical migration if you are attached, but probably not if you are mapped. Yeah, that's very much how I see this, and also why map/unmap (at least for simple users like v4l) should only bracket actual usage. GPU memory managers need to be able to move around buffers while no one is using them. [snip] + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. My original thinking was that you either need DMABUF_CPU_{PREP,FINI} ioctls and corresponding dmabuf ops, which userspace is required to call before / after CPU access. Or just remove mmap() and do the mmap() via allocating device and use that device's equivalent DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls. That would give you a way to (a) synchronize with gpu/asynchronous pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing buffer (ie. wait all devices have dma_buf_unmap_attachment'd). And that gives you a convenient place to do cache operations on noncoherent architecture. I sort of preferred having the DMABUF shim because that lets you pass a buffer around userspace without the receiving code knowing about a device specific API. But the problem I eventually came around to: if your GL stack (or some other userspace component) is batching up commands before submission to kernel, the buffers you need to wait for completion might not even be submitted yet. So from kernel perspective they are ready for cpu access. Even though in fact they are not in a consistent state from rendering perspective. I don't really know a sane way to deal with that. Maybe the approach instead should be a userspace level API (in libkms/libdrm?) to provide abstraction for userspace access to buffers rather than dealing with this at the kernel level. Well, there's a reason GL has an explicit flush and extensions for sync objects. It's to support such scenarios where the driver batches up gpu commands before actually submitting them. Hmm.. what about other non-GL APIs.. maybe vaapi/vdpau or similar? (Or something that I haven't thought of.) Also, recent gpus have all (or shortly will grow) multiple execution pipelines, so it's also important that you sync up with the right command stream. Syncing up with all of them is generally frowned upon for obvious reasons ;-) Well, I guess I am happy enough with something that is at least functional. Usespace access would (I think) mainly be weird edge case type stuff. But... snip On the topic of a coherency model for dmabuf, I think we need to look at dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and cpu_finish or whatever they might get called) as barriers: So after a dma_buf_map, all previsously completed dma operations (i.e. unmap already called) and any cpu writes (i.e. cpu_finish called) will be coherent. Similar rule holds for cpu access through the userspace mmap, only writes completed before the cpu_start will show up. Similar, writes done by the device are only guaranteed to show up after the _unmap. Dito for cpu writes and cpu_finish. In short we always need two function calls to denote the start/end of the critical section. Yup, this was exactly my assumption. But I guess it is better to spell it out. Thanks for the excellent discussion - it indeed is very good learning for the relatively-inexperienced me :) So, for the purpose of dma-buf framework, could I summarize the following and rework accordingly?: 1. remove mmap() dma_buf_op [and mmap fop], and
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Wednesday 07 December 2011, Semwal, Sumit wrote: Thanks for the excellent discussion - it indeed is very good learning for the relatively-inexperienced me :) So, for the purpose of dma-buf framework, could I summarize the following and rework accordingly?: 1. remove mmap() dma_buf_op [and mmap fop], and introduce cpu_start(), cpu_finish() ops to bracket cpu accesses to the buffer. Also add DMABUF_CPU_START / DMABUF_CPU_FINI IOCTLs? I think we'd be better off for now without the extra ioctls and just document that a shared buffer must not be exported to user space using mmap at all, to avoid those problems. Serialization between GPU and CPU is on a higher level than the dma_buf framework IMHO. 2. remove sg_sync* ops for now (and we'll see if we need to add them later if needed) Just removing the sg_sync_* operations is not enough. We have to make the decision whether we want to allow a) only coherent mappings of the buffer into kernel memory (requiring an extension to the dma_map_ops on ARM to not flush caches at map/unmap time) b) not allowing any in-kernel mappings (same requirement on ARM, also limits the usefulness of the dma_buf if we cannot access it from the kernel or from user space) c) only allowing streaming mappings, even if those are non-coherent (requiring strict serialization between CPU (in-kernel) and dma users of the buffer) This issue has to be solved or we get random data corruption. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Monday 05 December 2011, Rob Clark wrote: On the topic of a coherency model for dmabuf, I think we need to look at dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and cpu_finish or whatever they might get called) as barriers: So after a dma_buf_map, all previsously completed dma operations (i.e. unmap already called) and any cpu writes (i.e. cpu_finish called) will be coherent. Similar rule holds for cpu access through the userspace mmap, only writes completed before the cpu_start will show up. Similar, writes done by the device are only guaranteed to show up after the _unmap. Dito for cpu writes and cpu_finish. In short we always need two function calls to denote the start/end of the critical section. Yup, this was exactly my assumption. But I guess it is better to spell it out. I still don't understand how this is going to help you if you let multiple drivers enter and leave the critical section without serializing against one another. That doesn't sound like what I know as critical section. Given some reasonable constraints (all devices must be in the same coherency domain, for instance), you can probably define it in a way that you can have multiple devices mapping the same buffer at the same time, and when no device has mapped the buffer you can have as many concurrent kernel and user space accesses on the same buffer as you like. But you must still guarantee that no software touches a noncoherent buffer while it is mapped into any device and vice versa. Why can't we just mandate that all mappings into the kernel must be coherent and that user space accesses must either be coherent as well or be done by user space that uses explicit serialization with all DMA accesses? Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Tue, Dec 06, 2011 at 01:16:58PM +, Arnd Bergmann wrote: On Monday 05 December 2011, Rob Clark wrote: On the topic of a coherency model for dmabuf, I think we need to look at dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and cpu_finish or whatever they might get called) as barriers: So after a dma_buf_map, all previsously completed dma operations (i.e. unmap already called) and any cpu writes (i.e. cpu_finish called) will be coherent. Similar rule holds for cpu access through the userspace mmap, only writes completed before the cpu_start will show up. Similar, writes done by the device are only guaranteed to show up after the _unmap. Dito for cpu writes and cpu_finish. In short we always need two function calls to denote the start/end of the critical section. Yup, this was exactly my assumption. But I guess it is better to spell it out. I still don't understand how this is going to help you if you let multiple drivers enter and leave the critical section without serializing against one another. That doesn't sound like what I know as critical section. I already regret to having added that last critical section remark. Think barriers. It's just that you need a barrier in both directions that bracket the actual usage. In i915-land we call the first one generally invalidate (so that caches on the target domain don't contain stale data) and that second one flush (to get any data out of caches). Given some reasonable constraints (all devices must be in the same coherency domain, for instance), you can probably define it in a way that you can have multiple devices mapping the same buffer at the same time, and when no device has mapped the buffer you can have as many concurrent kernel and user space accesses on the same buffer as you like. But you must still guarantee that no software touches a noncoherent buffer while it is mapped into any device and vice versa. Why can't we just mandate that all mappings into the kernel must be coherent and that user space accesses must either be coherent as well or be done by user space that uses explicit serialization with all DMA accesses? I agree with your points here, afaics the contentious issue is just whether dma_buf should _enforce_ this strict ordering. I'm leading towards a no for the following reasons: - gpu people love nonblocking interfaces (and love to come up with abuses). In the generic case we'd need some more functions to properly flush everything while 2 devices access a buffer concurrently (which is imo a bit unrealistic). But e.g. 2 gpus rendering in SLI mode very much want to access the same buffer at the same time (and the kernel+userspace gpu driver already needs all the information about caches to make that happen, at least on x86). - Buffer sharing alone has already some great potential for deadlock and lock recursion issues. Making dma_buf into something that very much acts like a new locking primitive itself (even exposed to userspace) will make this much worse. I've seen some of the kernel/userspace shared hwlock code of dri1 yonder, and it's horrible (and at least for the case of the dri1 hwlock, totally broken). - All current subsystem already have the concept to pass the ownership of a buffer between the device and userspace (sometimes even more than just 2 domains, like in i915 ...). Userspace already needs to use this interface to get anything resembling correct data. I don't see any case where userspace can't enforce passing around buffer ownership if multiple devices are involved (we obviously need to clarify subsystem interfaces to make it clear when a buffer is in use and when another device taking part in the sharing could use it). So I don't see how the kernel enforcing strict access ordering helps implementing correct userspace. - I don't see any security needs that would make it necessary for the kernel to enforce any consistency guarantees for concurrent access - we're only dealing with pixel data in all the currently discussed generic use-cases. So I think garbage as an end-result is acceptable if userspace does stupid things (or fails at trying to be clever). Cheers, Daniel -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
Hi Arnd, Thanks for your review! On Mon, Dec 5, 2011 at 10:48 PM, Arnd Bergmann a...@arndb.de wrote: On Friday 02 December 2011, Sumit Semwal wrote: This is the first step in defining a dma buffer sharing mechanism. This looks very nice, but there are a few things I don't understand yet and a bunch of trivial comments I have about things I spotted. Do you have prototype exporter and consumer drivers that you can post for clarification? In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? I guess I need to clear it up in the documentation - when I said once all ongoing access is completed - I meant to say once all current users have finished accessing and have unmapped this buffer. So I agree with Rob - the migration would only be possible for attached but unmapped buffers. I will update the documentation. diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 21cf46f..07d8095 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -174,4 +174,14 @@ config SYS_HYPERVISOR source drivers/base/regmap/Kconfig +config DMA_SHARED_BUFFER + bool Buffer framework to be shared between drivers + default n + depends on ANON_INODES I would make this 'select ANON_INODES', like the other users of this feature. Sure. + return dmabuf; +} +EXPORT_SYMBOL(dma_buf_export); I agree with Konrad, this should definitely be EXPORT_SYMBOL_GPL, because it's really a low-level function that I would expect to get used by in-kernel subsystems providing the feature to users and having back-end drivers, but it's not the kind of thing we want out-of-tree drivers to mess with. Agreed. +/** + * dma_buf_fd - returns a file descriptor for the given dma_buf + * @dmabuf: [in] pointer to dma_buf for which fd is required. + * + * On success, returns an associated 'fd'. Else, returns error. + */ +int dma_buf_fd(struct dma_buf *dmabuf) +{ + int error, fd; + + if (!dmabuf-file) + return -EINVAL; + + error = get_unused_fd_flags(0); Why not simply get_unused_fd()? :) oversight. Will correct. +/** + * dma_buf_attach - Add the device to dma_buf's attachments list; optionally, + * calls attach() of dma_buf_ops to allow device-specific attach functionality + * @dmabuf: [in] buffer to attach device to. + * @dev: [in] device to be attached. + * + * Returns struct dma_buf_attachment * for this attachment; may return NULL. + * Or may return a negative error code. It's better to be consistent here: either always return NULL on error, or change the allocation error to ERR_PTR(-ENOMEM). Ok, that makes sense. + */ +struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, + struct device *dev) +{ + struct dma_buf_attachment *attach; + int ret; + + BUG_ON(!dmabuf || !dev); + + attach = kzalloc(sizeof(struct dma_buf_attachment), GFP_KERNEL); + if (attach == NULL) + goto err_alloc; + + mutex_lock(dmabuf-lock); + + attach-dev = dev; + attach-dmabuf = dmabuf; + if (dmabuf-ops-attach) { + ret = dmabuf-ops-attach(dmabuf, dev, attach); + if (!ret) + goto err_attach; You probably mean if (ret) here instead of if (!ret), right? yes - a stupid one! will correct. + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. I will take yours and Daniel's suggestion, and remove these; if at all they're needed, we can add them back again later, with /s/device/attachment as suggested by Daniel. +#ifdef CONFIG_DMA_SHARED_BUFFER Do you have a use case for making the interface compile-time disabled? I had assumed that any code using it would make no sense if it's not available so you don't actually need this. Ok. Though if we keep the interface compile-time disabled, the users can actually check and
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Friday 02 December 2011, Sumit Semwal wrote: This is the first step in defining a dma buffer sharing mechanism. This looks very nice, but there are a few things I don't understand yet and a bunch of trivial comments I have about things I spotted. Do you have prototype exporter and consumer drivers that you can post for clarification? In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 21cf46f..07d8095 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -174,4 +174,14 @@ config SYS_HYPERVISOR source drivers/base/regmap/Kconfig +config DMA_SHARED_BUFFER + bool Buffer framework to be shared between drivers + default n + depends on ANON_INODES I would make this 'select ANON_INODES', like the other users of this feature. + return dmabuf; +} +EXPORT_SYMBOL(dma_buf_export); I agree with Konrad, this should definitely be EXPORT_SYMBOL_GPL, because it's really a low-level function that I would expect to get used by in-kernel subsystems providing the feature to users and having back-end drivers, but it's not the kind of thing we want out-of-tree drivers to mess with. +/** + * dma_buf_fd - returns a file descriptor for the given dma_buf + * @dmabuf: [in]pointer to dma_buf for which fd is required. + * + * On success, returns an associated 'fd'. Else, returns error. + */ +int dma_buf_fd(struct dma_buf *dmabuf) +{ + int error, fd; + + if (!dmabuf-file) + return -EINVAL; + + error = get_unused_fd_flags(0); Why not simply get_unused_fd()? +/** + * dma_buf_attach - Add the device to dma_buf's attachments list; optionally, + * calls attach() of dma_buf_ops to allow device-specific attach functionality + * @dmabuf: [in]buffer to attach device to. + * @dev: [in]device to be attached. + * + * Returns struct dma_buf_attachment * for this attachment; may return NULL. + * Or may return a negative error code. It's better to be consistent here: either always return NULL on error, or change the allocation error to ERR_PTR(-ENOMEM). + */ +struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, + struct device *dev) +{ + struct dma_buf_attachment *attach; + int ret; + + BUG_ON(!dmabuf || !dev); + + attach = kzalloc(sizeof(struct dma_buf_attachment), GFP_KERNEL); + if (attach == NULL) + goto err_alloc; + + mutex_lock(dmabuf-lock); + + attach-dev = dev; + attach-dmabuf = dmabuf; + if (dmabuf-ops-attach) { + ret = dmabuf-ops-attach(dmabuf, dev, attach); + if (!ret) + goto err_attach; You probably mean if (ret) here instead of if (!ret), right? + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. +#ifdef CONFIG_DMA_SHARED_BUFFER Do you have a use case for making the interface compile-time disabled? I had assumed that any code using it would make no sense if it's not available so you don't actually need this. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 05, 2011 at 05:18:48PM +, Arnd Bergmann wrote: On Friday 02 December 2011, Sumit Semwal wrote: + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. Woops, totally missed the addition of these. Can somebody explain to used to rather coherent x86 what we need these for and the code-flow would look like in a typical example. I was kinda assuming that devices would bracket their use of a buffer with the attachment_map/unmap calls and any cache coherency magic that might be needed would be somewhat transparent to users of the interface? The map call gets the dma_data_direction parameter, so it should be able to do the right thing. And because we keep the attachement around, any caching of mappings should be possible, too. Yours, Daniel PS: Slightly related, because it will make the coherency nightmare worse, afaict: Can we kill mmap support? -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Monday 05 December 2011 19:55:44 Daniel Vetter wrote: The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. Woops, totally missed the addition of these. Can somebody explain to used to rather coherent x86 what we need these for and the code-flow would look like in a typical example. I was kinda assuming that devices would bracket their use of a buffer with the attachment_map/unmap calls and any cache coherency magic that might be needed would be somewhat transparent to users of the interface? I'll describe how the respective functions work in the streaming mapping API (dma_map_*): You start out with a buffer that is owned by the CPU, i.e. the kernel can access it freely. When you call dma_map_sg or similar, a noncoherent device reading the buffer requires the cache to be flushed in order to see the data that was written by the CPU into the cache. After dma_map_sg, the device can perform both read and write accesses (depending on the flag to the map call), but the CPU is no longer allowed to read (which would allocate a cache line that may become invalid but remain marked as clean) or write (which would create a dirty cache line without writing it back) that buffer. Once the device is done, the driver calls dma_unmap_* and the buffer is again owned by the CPU. The device can no longer access it (in fact the address may be no longer be backed if there is an iommu) and the CPU can again read and write the buffer. On ARMv6 and higher, possibly some other architectures, dma_unmap_* also needs to invalidate the cache for the buffer, because due to speculative prefetching, there may also be a new clean cache line with stale data from an earlier version of the buffer. Since map/unmap is an expensive operation, the interface was extended to pass back the ownership to the CPU and back to the device while leaving the buffer mapped. dma_sync_sg_for_cpu invalidates the cache in the same way as dma_unmap_sg, so the CPU can access the buffer, and dma_sync_sg_for_device hands it back to the device by performing the same cache flush that dma_map_sg would do. You could for example do this if you want video input with a cacheable buffer, or in an rdma scenario with a buffer accessed by a remote machine. In case of software iommu (swiotlb, dmabounce), the map and sync functions don't do cache management but instead copy data between a buffer accessed by hardware and a different buffer accessed by the user. The map call gets the dma_data_direction parameter, so it should be able to do the right thing. And because we keep the attachement around, any caching of mappings should be possible, too. Yours, Daniel PS: Slightly related, because it will make the coherency nightmare worse, afaict: Can we kill mmap support? The mmap support is required (and only make sense) for consistent mappings, not for streaming mappings. The provider must ensure that if you map something uncacheable into the kernel in order to provide consistency, any mapping into user space must also be uncacheable. A driver that wants to have the buffer mapped to user space as many do should not need to know whether it is required to do cacheable or uncacheable mapping because of some other driver, and it should not need to worry about how to set up uncached mappings in user space. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote: On Friday 02 December 2011, Sumit Semwal wrote: This is the first step in defining a dma buffer sharing mechanism. This looks very nice, but there are a few things I don't understand yet and a bunch of trivial comments I have about things I spotted. Do you have prototype exporter and consumer drivers that you can post for clarification? There is some dummy drivers based on an earlier version. And airlied has a prime (multi-gpu) prototype: http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf I've got a nearly working camera+display prototype: https://github.com/robclark/kernel-omap4/commits/dmabuf In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? I think you can do physical migration if you are attached, but probably not if you are mapped. diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 21cf46f..07d8095 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -174,4 +174,14 @@ config SYS_HYPERVISOR source drivers/base/regmap/Kconfig +config DMA_SHARED_BUFFER + bool Buffer framework to be shared between drivers + default n + depends on ANON_INODES I would make this 'select ANON_INODES', like the other users of this feature. + return dmabuf; +} +EXPORT_SYMBOL(dma_buf_export); I agree with Konrad, this should definitely be EXPORT_SYMBOL_GPL, because it's really a low-level function that I would expect to get used by in-kernel subsystems providing the feature to users and having back-end drivers, but it's not the kind of thing we want out-of-tree drivers to mess with. +/** + * dma_buf_fd - returns a file descriptor for the given dma_buf + * @dmabuf: [in] pointer to dma_buf for which fd is required. + * + * On success, returns an associated 'fd'. Else, returns error. + */ +int dma_buf_fd(struct dma_buf *dmabuf) +{ + int error, fd; + + if (!dmabuf-file) + return -EINVAL; + + error = get_unused_fd_flags(0); Why not simply get_unused_fd()? +/** + * dma_buf_attach - Add the device to dma_buf's attachments list; optionally, + * calls attach() of dma_buf_ops to allow device-specific attach functionality + * @dmabuf: [in] buffer to attach device to. + * @dev: [in] device to be attached. + * + * Returns struct dma_buf_attachment * for this attachment; may return NULL. + * Or may return a negative error code. It's better to be consistent here: either always return NULL on error, or change the allocation error to ERR_PTR(-ENOMEM). + */ +struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, + struct device *dev) +{ + struct dma_buf_attachment *attach; + int ret; + + BUG_ON(!dmabuf || !dev); + + attach = kzalloc(sizeof(struct dma_buf_attachment), GFP_KERNEL); + if (attach == NULL) + goto err_alloc; + + mutex_lock(dmabuf-lock); + + attach-dev = dev; + attach-dmabuf = dmabuf; + if (dmabuf-ops-attach) { + ret = dmabuf-ops-attach(dmabuf, dev, attach); + if (!ret) + goto err_attach; You probably mean if (ret) here instead of if (!ret), right? + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. My original thinking was that you either need DMABUF_CPU_{PREP,FINI} ioctls and corresponding dmabuf ops, which userspace is required to call before / after CPU access. Or just remove mmap() and do the mmap() via allocating device and use that device's equivalent DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls. That would give you a way to (a) synchronize with gpu/asynchronous pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing buffer (ie. wait all devices have dma_buf_unmap_attachment'd). And that gives you a convenient place
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 05, 2011 at 08:29:49PM +0100, Arnd Bergmann wrote: On Monday 05 December 2011 19:55:44 Daniel Vetter wrote: The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. Woops, totally missed the addition of these. Can somebody explain to used to rather coherent x86 what we need these for and the code-flow would look like in a typical example. I was kinda assuming that devices would bracket their use of a buffer with the attachment_map/unmap calls and any cache coherency magic that might be needed would be somewhat transparent to users of the interface? I'll describe how the respective functions work in the streaming mapping API (dma_map_*): You start out with a buffer that is owned by the CPU, i.e. the kernel can access it freely. When you call dma_map_sg or similar, a noncoherent device reading the buffer requires the cache to be flushed in order to see the data that was written by the CPU into the cache. After dma_map_sg, the device can perform both read and write accesses (depending on the flag to the map call), but the CPU is no longer allowed to read (which would allocate a cache line that may become invalid but remain marked as clean) or write (which would create a dirty cache line without writing it back) that buffer. Once the device is done, the driver calls dma_unmap_* and the buffer is again owned by the CPU. The device can no longer access it (in fact the address may be no longer be backed if there is an iommu) and the CPU can again read and write the buffer. On ARMv6 and higher, possibly some other architectures, dma_unmap_* also needs to invalidate the cache for the buffer, because due to speculative prefetching, there may also be a new clean cache line with stale data from an earlier version of the buffer. Since map/unmap is an expensive operation, the interface was extended to pass back the ownership to the CPU and back to the device while leaving the buffer mapped. dma_sync_sg_for_cpu invalidates the cache in the same way as dma_unmap_sg, so the CPU can access the buffer, and dma_sync_sg_for_device hands it back to the device by performing the same cache flush that dma_map_sg would do. You could for example do this if you want video input with a cacheable buffer, or in an rdma scenario with a buffer accessed by a remote machine. In case of software iommu (swiotlb, dmabounce), the map and sync functions don't do cache management but instead copy data between a buffer accessed by hardware and a different buffer accessed by the user. Thanks a lot for this excellent overview. I think at least for the first version of dmabuf we should drop the sync_* interfaces and simply require users to bracket their usage of the buffer from the attached device by map/unmap. A dma_buf provider is always free to cache the mapping and simply call sync_sg_for of the streaming dma api. If it later turns out that we want to be able to cache the sg list also on the use-site in the driver (e.g. map it into some hw sg list) we can always add that functionality later. I just fear that sync ops among N devices is a bit ill-defined and we already have a plethora of ill-defined issues at hand. Also the proposed api doesn't quite fit into what's already there, I think an s/device/dma_buf_attachment/ would be more consistent - otherwise the dmabuf provider might need to walk the list of attachements to get at the right one for the device. The map call gets the dma_data_direction parameter, so it should be able to do the right thing. And because we keep the attachement around, any caching of mappings should be possible, too. Yours, Daniel PS: Slightly related, because it will make the coherency nightmare worse, afaict: Can we kill mmap support? The mmap support is required (and only make sense) for consistent mappings, not for streaming mappings. The provider must ensure that if you map something uncacheable into the kernel in order to provide consistency, any mapping into user space must also be uncacheable. A driver that wants to have the buffer mapped to user space as many do should not need to know whether it is required to do cacheable or uncacheable mapping because of some other driver, and it should not need to worry about how to set up uncached mappings in user space. Either I've always missed it or no one ever described it that consisely, but now I see the use-case for mmap: Simpler drivers (i.e. not gpus) might need to expose a userspace mapping and only the provider knows how to do that in a coherent fashion. I want this in the docs (if it's not there yet ...). But even with that use-case in mind I still have some gripes with the current mmap interfaces as proposed: - This use-case explains why the dmabuf provider needs to expose an mmap function.
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote: On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote: In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? I think you can do physical migration if you are attached, but probably not if you are mapped. Yeah, that's very much how I see this, and also why map/unmap (at least for simple users like v4l) should only bracket actual usage. GPU memory managers need to be able to move around buffers while no one is using them. [snip] + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. My original thinking was that you either need DMABUF_CPU_{PREP,FINI} ioctls and corresponding dmabuf ops, which userspace is required to call before / after CPU access. Or just remove mmap() and do the mmap() via allocating device and use that device's equivalent DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls. That would give you a way to (a) synchronize with gpu/asynchronous pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing buffer (ie. wait all devices have dma_buf_unmap_attachment'd). And that gives you a convenient place to do cache operations on noncoherent architecture. I sort of preferred having the DMABUF shim because that lets you pass a buffer around userspace without the receiving code knowing about a device specific API. But the problem I eventually came around to: if your GL stack (or some other userspace component) is batching up commands before submission to kernel, the buffers you need to wait for completion might not even be submitted yet. So from kernel perspective they are ready for cpu access. Even though in fact they are not in a consistent state from rendering perspective. I don't really know a sane way to deal with that. Maybe the approach instead should be a userspace level API (in libkms/libdrm?) to provide abstraction for userspace access to buffers rather than dealing with this at the kernel level. Well, there's a reason GL has an explicit flush and extensions for sync objects. It's to support such scenarios where the driver batches up gpu commands before actually submitting them. Also, recent gpus have all (or shortly will grow) multiple execution pipelines, so it's also important that you sync up with the right command stream. Syncing up with all of them is generally frowned upon for obvious reasons ;-) So any userspace that interacts with an OpenGL driver needs to take care of this anyway. But I think for simpler stuff (v4l) kernel only coherency should work and userspace just needs to take care of gl interactions and call glflush and friends at the right points. I think we can flesh this out precisely when we spec the dmabuf EGL extension ... (or implement one of the preexisting ones already around). On the topic of a coherency model for dmabuf, I think we need to look at dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and cpu_finish or whatever they might get called) as barriers: So after a dma_buf_map, all previsously completed dma operations (i.e. unmap already called) and any cpu writes (i.e. cpu_finish called) will be coherent. Similar rule holds for cpu access through the userspace mmap, only writes completed before the cpu_start will show up. Similar, writes done by the device are only guaranteed to show up after the _unmap. Dito for cpu writes and cpu_finish. In short we always need two function calls to denote the start/end of the critical section. Any concurrent operations are allowed to yield garbage, meaning any combination of the old or either of the newly written contents (i.e. non-overlapping writes might not actually all end up in the buffer, but instead some old contents). Maybe we even need to loosen that to the real undefined behaviour, but atm I can't think of an example. -Daniel -- Daniel Vetter Mail: dan...@ffwll.ch
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Monday 05 December 2011 21:58:39 Daniel Vetter wrote: On Mon, Dec 05, 2011 at 08:29:49PM +0100, Arnd Bergmann wrote: ... Thanks a lot for this excellent overview. I think at least for the first version of dmabuf we should drop the sync_* interfaces and simply require users to bracket their usage of the buffer from the attached device by map/unmap. A dma_buf provider is always free to cache the mapping and simply call sync_sg_for of the streaming dma api. I think we still have the same problem if we allow multiple drivers to access a noncoherent buffer using map/unmap: driver Adriver B 1. read/write 2. read/write 3. map() 4. read/write 5. dma 6. map() 7. dma 8. dma 9. unmap() 10. dma 11. read/write 12. unmap() In step 4, the buffer is owned by device A, but accessed by driver B, which is a bug. In step 11, the buffer is owned by device B but accessed by driver A, which is the same bug on the other side. In steps 7 and 8, the buffer is owned by both device A and B, which is currently undefined but would be ok if both devices are on the same coherency domain. Whether that point is meaningful depends on what the devices actually do. It would be ok if both are only reading, but not if they write into the same location concurrently. As I mentioned originally, the problem could be completely avoided if we only allow consistent (e.g. uncached) mappings or buffers that are not mapped into the kernel virtual address space at all. Alternatively, a clearer model would be to require each access to nonconsistent buffers to be exclusive: a map() operation would have to block until the current mapper (if any) has done an unmap(), and any access from the CPU would also have to call a dma_buf_ops pointer to serialize the CPU accesses with any device accesses. User mappings of the buffer can be easily blocked during a DMA access by unmapping the buffer from user space at map() time and blocking the vm_ops-fault() operation until the unmap(). If it later turns out that we want to be able to cache the sg list also on the use-site in the driver (e.g. map it into some hw sg list) we can always add that functionality later. I just fear that sync ops among N devices is a bit ill-defined and we already have a plethora of ill-defined issues at hand. Also the proposed api doesn't quite fit into what's already there, I think an s/device/dma_buf_attachment/ would be more consistent - otherwise the dmabuf provider might need to walk the list of attachements to get at the right one for the device. Right, at last for the start, let's mandate just map/unmap and not provide sync. I do wonder however whether we should implement consistent (possibly uncached) or streaming (cacheable, but always owned by either the device or the CPU, not both) buffers, or who gets to make the decision which one is used if both are implemented. The map call gets the dma_data_direction parameter, so it should be able to do the right thing. And because we keep the attachement around, any caching of mappings should be possible, too. Yours, Daniel PS: Slightly related, because it will make the coherency nightmare worse, afaict: Can we kill mmap support? The mmap support is required (and only make sense) for consistent mappings, not for streaming mappings. The provider must ensure that if you map something uncacheable into the kernel in order to provide consistency, any mapping into user space must also be uncacheable. A driver that wants to have the buffer mapped to user space as many do should not need to know whether it is required to do cacheable or uncacheable mapping because of some other driver, and it should not need to worry about how to set up uncached mappings in user space. Either I've always missed it or no one ever described it that consisely, but now I see the use-case for mmap: Simpler drivers (i.e. not gpus) might need to expose a userspace mapping and only the provider knows how to do that in a coherent fashion. I want this in the docs (if it's not there yet ...). It's currently implemented in the ARM/PowerPC-specific dma_mmap_coherent function and documented (hardly) in arch/arm/include/asm/dma-mapping.h. We should make clear in that this is actually an extension of the regular dma mapping api that first needs to be made generic. But even with that use-case in mind I still have some gripes with the current mmap interfaces as proposed: - This use-case explains why the dmabuf provider needs to expose an mmap function. It
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Monday 05 December 2011 14:46:47 Rob Clark wrote: On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote: On Friday 02 December 2011, Sumit Semwal wrote: This is the first step in defining a dma buffer sharing mechanism. This looks very nice, but there are a few things I don't understand yet and a bunch of trivial comments I have about things I spotted. Do you have prototype exporter and consumer drivers that you can post for clarification? There is some dummy drivers based on an earlier version. And airlied has a prime (multi-gpu) prototype: http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-prime-dmabuf I've got a nearly working camera+display prototype: https://github.com/robclark/kernel-omap4/commits/dmabuf Ok, thanks. I think it would be good to post these for reference in v3, with a clear indication that they are not being submitted for discussion/inclusion yet. In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? I think you can do physical migration if you are attached, but probably not if you are mapped. Ok, that's what I thought. You probably mean if (ret) here instead of if (!ret), right? + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. My original thinking was that you either need DMABUF_CPU_{PREP,FINI} ioctls and corresponding dmabuf ops, which userspace is required to call before / after CPU access. Or just remove mmap() and do the mmap() via allocating device and use that device's equivalent DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls. That would give you a way to (a) synchronize with gpu/asynchronous pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing buffer (ie. wait all devices have dma_buf_unmap_attachment'd). And that gives you a convenient place to do cache operations on noncoherent architecture. I wasn't even thinking of user mappings, as I replied to Daniel, I think they are easy to solve (maybe not efficiently though) I sort of preferred having the DMABUF shim because that lets you pass a buffer around userspace without the receiving code knowing about a device specific API. But the problem I eventually came around to: if your GL stack (or some other userspace component) is batching up commands before submission to kernel, the buffers you need to wait for completion might not even be submitted yet. So from kernel perspective they are ready for cpu access. Even though in fact they are not in a consistent state from rendering perspective. I don't really know a sane way to deal with that. Maybe the approach instead should be a userspace level API (in libkms/libdrm?) to provide abstraction for userspace access to buffers rather than dealing with this at the kernel level. It would be nice if user space had no way to block out kernel drivers, otherwise we have to be very careful to ensure that each map() operation can be interrupted by a signal as the last resort to avoid deadlocks. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 5, 2011 at 3:23 PM, Daniel Vetter dan...@ffwll.ch wrote: On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote: On Mon, Dec 5, 2011 at 11:18 AM, Arnd Bergmann a...@arndb.de wrote: In the patch 2, you have a section about migration that mentions that it is possible to export a buffer that can be migrated after it is already mapped into one user driver. How does that work when the physical addresses are mapped into a consumer device already? I think you can do physical migration if you are attached, but probably not if you are mapped. Yeah, that's very much how I see this, and also why map/unmap (at least for simple users like v4l) should only bracket actual usage. GPU memory managers need to be able to move around buffers while no one is using them. [snip] + /* allow allocator to take care of cache ops */ + void (*sync_sg_for_cpu) (struct dma_buf *, struct device *); + void (*sync_sg_for_device)(struct dma_buf *, struct device *); I don't see how this works with multiple consumers: For the streaming DMA mapping, there must be exactly one owner, either the device or the CPU. Obviously, this rule needs to be extended when you get to multiple devices and multiple device drivers, plus possibly user mappings. Simply assigning the buffer to the device from one driver does not block other drivers from touching the buffer, and assigning it to the cpu does not stop other hardware that the code calling sync_sg_for_cpu is not aware of. The only way to solve this that I can think of right now is to mandate that the mappings are all coherent (i.e. noncachable on noncoherent architectures like ARM). If you do that, you no longer need the sync_sg_for_* calls. My original thinking was that you either need DMABUF_CPU_{PREP,FINI} ioctls and corresponding dmabuf ops, which userspace is required to call before / after CPU access. Or just remove mmap() and do the mmap() via allocating device and use that device's equivalent DRM_XYZ_GEM_CPU_{PREP,FINI} or DRM_XYZ_GEM_SET_DOMAIN ioctls. That would give you a way to (a) synchronize with gpu/asynchronous pipeline, (b) synchronize w/ multiple hw devices vs cpu accessing buffer (ie. wait all devices have dma_buf_unmap_attachment'd). And that gives you a convenient place to do cache operations on noncoherent architecture. I sort of preferred having the DMABUF shim because that lets you pass a buffer around userspace without the receiving code knowing about a device specific API. But the problem I eventually came around to: if your GL stack (or some other userspace component) is batching up commands before submission to kernel, the buffers you need to wait for completion might not even be submitted yet. So from kernel perspective they are ready for cpu access. Even though in fact they are not in a consistent state from rendering perspective. I don't really know a sane way to deal with that. Maybe the approach instead should be a userspace level API (in libkms/libdrm?) to provide abstraction for userspace access to buffers rather than dealing with this at the kernel level. Well, there's a reason GL has an explicit flush and extensions for sync objects. It's to support such scenarios where the driver batches up gpu commands before actually submitting them. Hmm.. what about other non-GL APIs.. maybe vaapi/vdpau or similar? (Or something that I haven't thought of.) Also, recent gpus have all (or shortly will grow) multiple execution pipelines, so it's also important that you sync up with the right command stream. Syncing up with all of them is generally frowned upon for obvious reasons ;-) Well, I guess I am happy enough with something that is at least functional. Usespace access would (I think) mainly be weird edge case type stuff. But... So any userspace that interacts with an OpenGL driver needs to take care of this anyway. But I think for simpler stuff (v4l) kernel only coherency should work and userspace just needs to take care of gl interactions and call glflush and friends at the right points. I think we can flesh this out precisely when we spec the dmabuf EGL extension ... (or implement one of the preexisting ones already around). .. yeah, I think egl/eglImage extension would be the right place to hide this behind. And I guess your GL stack should be able to figure out which execution pipeline to sync, cache state of buffer, and whatever other optimizations you might want to make. On the topic of a coherency model for dmabuf, I think we need to look at dma_buf_attachment_map/unmap (and also the mmap variants cpu_start and cpu_finish or whatever they might get called) as barriers: So after a dma_buf_map, all previsously completed dma operations (i.e. unmap already called) and any cpu writes (i.e. cpu_finish called) will be coherent. Similar rule holds for cpu access through the userspace mmap, only writes completed before the cpu_start
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 5, 2011 at 4:09 PM, Arnd Bergmann a...@arndb.de wrote: https://github.com/robclark/kernel-omap4/commits/dmabuf Ok, thanks. I think it would be good to post these for reference in v3, with a clear indication that they are not being submitted for discussion/inclusion yet. btw, don't look at this too closely at that tree yet.. where the attach/detach is done in videobuf2 code isn't really correct. But I was going to get something functioning first. BR, -R ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 05, 2011 at 04:11:46PM -0600, Rob Clark wrote: On Mon, Dec 5, 2011 at 3:23 PM, Daniel Vetter dan...@ffwll.ch wrote: On Mon, Dec 05, 2011 at 02:46:47PM -0600, Rob Clark wrote: I sort of preferred having the DMABUF shim because that lets you pass a buffer around userspace without the receiving code knowing about a device specific API. But the problem I eventually came around to: if your GL stack (or some other userspace component) is batching up commands before submission to kernel, the buffers you need to wait for completion might not even be submitted yet. So from kernel perspective they are ready for cpu access. Even though in fact they are not in a consistent state from rendering perspective. I don't really know a sane way to deal with that. Maybe the approach instead should be a userspace level API (in libkms/libdrm?) to provide abstraction for userspace access to buffers rather than dealing with this at the kernel level. Well, there's a reason GL has an explicit flush and extensions for sync objects. It's to support such scenarios where the driver batches up gpu commands before actually submitting them. Hmm.. what about other non-GL APIs.. maybe vaapi/vdpau or similar? (Or something that I haven't thought of.) They generally all have a concept of when they've actually commited the rendering to an X pixmap or egl image. Usually it's rather implicit, e.g. the driver will submit any outstanding batches before returning from any calls. -Daniel -- Daniel Vetter Mail: dan...@ffwll.ch Mobile: +41 (0)79 365 57 48 ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Mon, Dec 5, 2011 at 4:09 PM, Arnd Bergmann a...@arndb.de wrote: On Monday 05 December 2011 14:46:47 Rob Clark wrote: I sort of preferred having the DMABUF shim because that lets you pass a buffer around userspace without the receiving code knowing about a device specific API. But the problem I eventually came around to: if your GL stack (or some other userspace component) is batching up commands before submission to kernel, the buffers you need to wait for completion might not even be submitted yet. So from kernel perspective they are ready for cpu access. Even though in fact they are not in a consistent state from rendering perspective. I don't really know a sane way to deal with that. Maybe the approach instead should be a userspace level API (in libkms/libdrm?) to provide abstraction for userspace access to buffers rather than dealing with this at the kernel level. It would be nice if user space had no way to block out kernel drivers, otherwise we have to be very careful to ensure that each map() operation can be interrupted by a signal as the last resort to avoid deadlocks. map_dma_buf should be documented to be allowed to return -EINTR.. otherwise, yeah, that would be problematic. Arnd ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC v2 1/2] dma-buf: Introduce dma buffer sharing mechanism
On Fri, Dec 02, 2011 at 02:27:31PM +0530, Sumit Semwal wrote: This is the first step in defining a dma buffer sharing mechanism. A new buffer object dma_buf is added, with operations and API to allow easy sharing of this buffer object across devices. The framework allows: - different devices to 'attach' themselves to this buffer, to facilitate backing storage negotiation, using dma_buf_attach() API. - association of a file pointer with each user-buffer and associated allocator-defined operations on that buffer. This operation is called the 'export' operation. - this exported buffer-object to be shared with the other entity by asking for its 'file-descriptor (fd)', and sharing the fd across. - a received fd to get the buffer object back, where it can be accessed using the associated exporter-defined operations. - the exporter and user to share the scatterlist using map_dma_buf and unmap_dma_buf operations. Atleast one 'attach()' call is required to be made prior to calling the map_dma_buf() operation. Couple of building blocks in map_dma_buf() are added to ease introduction of sync'ing across exporter and users, and late allocation by the exporter. *OPTIONALLY*: mmap() file operation is provided for the associated 'fd', as wrapper over the optional allocator defined mmap(), to be used by devices that might need one. More details are there in the documentation patch. This is based on design suggestions from many people at the mini-summits[1], most notably from Arnd Bergmann a...@arndb.de, Rob Clark r...@ti.com and Daniel Vetter dan...@ffwll.ch. The implementation is inspired from proof-of-concept patch-set from Tomasz Stanislawski t.stanisl...@samsung.com, who demonstrated buffer sharing between two v4l2 devices. [2] [1]: https://wiki.linaro.org/OfficeofCTO/MemoryManagement [2]: http://lwn.net/Articles/454389 Signed-off-by: Sumit Semwal sumit.sem...@linaro.org Signed-off-by: Sumit Semwal sumit.sem...@ti.com You have a clone? You only need one SOB. --- drivers/base/Kconfig| 10 ++ drivers/base/Makefile |1 + drivers/base/dma-buf.c | 290 +++ include/linux/dma-buf.h | 176 4 files changed, 477 insertions(+), 0 deletions(-) create mode 100644 drivers/base/dma-buf.c create mode 100644 include/linux/dma-buf.h diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 21cf46f..07d8095 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -174,4 +174,14 @@ config SYS_HYPERVISOR source drivers/base/regmap/Kconfig +config DMA_SHARED_BUFFER + bool Buffer framework to be shared between drivers + default n + depends on ANON_INODES + help + This option enables the framework for buffer-sharing between + multiple drivers. A buffer is associated with a file using driver + APIs extension; the file's descriptor can then be passed on to other + driver. + endmenu diff --git a/drivers/base/Makefile b/drivers/base/Makefile index 99a375a..d0df046 100644 --- a/drivers/base/Makefile +++ b/drivers/base/Makefile @@ -8,6 +8,7 @@ obj-$(CONFIG_DEVTMPFS)+= devtmpfs.o obj-y+= power/ obj-$(CONFIG_HAS_DMA)+= dma-mapping.o obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o +obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o obj-$(CONFIG_ISA)+= isa.o obj-$(CONFIG_FW_LOADER) += firmware_class.o obj-$(CONFIG_NUMA) += node.o diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c new file mode 100644 index 000..4b9005e --- /dev/null +++ b/drivers/base/dma-buf.c @@ -0,0 +1,290 @@ +/* + * Framework for buffer objects that can be shared across devices/subsystems. + * + * Copyright(C) 2011 Linaro Limited. All rights reserved. + * Author: Sumit Semwal sumit.sem...@ti.com OK, so the SOB should be from @ti.com then. + * + * Many thanks to linaro-mm-sig list, and specially + * Arnd Bergmann a...@arndb.de, Rob Clark r...@ti.com and + * Daniel Vetter dan...@ffwll.ch for their support in creation and + * refining of this idea. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published by + * the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program. If not, see http://www.gnu.org/licenses/. + */ + +#include linux/fs.h +#include linux/slab.h +#include linux/dma-buf.h +#include linux/anon_inodes.h +#include linux/export.h + +static inline int is_dma_buf_file(struct