Re: intel hw and caching interface to TTM..
Dave Airlie wrote:

OK. We're using this functionality in Poulsbo, so we should probably sort this out to avoid breaking things.

Okay, I'll try and fix it back up tomorrow. I've attached my first attempt at fixing this up, but it messes up: by the time the flags hit the ttm binding function they've lost the CACHED_MAPPED bit and I end up with snooped bindings. drm_bo_mt_compatible seems to be where it disappears, and also where my sanity started to end. Are you sure there isn't a nicer way of doing those flags/masks? The current system is vastly confusing to trace through. Can you suggest what I'm missing to fix this up?

Dave, as a general rule, bo_mem->mask is the request and bo_mem->flags is the code's response to that request. The move code deals with DRM_BO_MASK_MEMTYPE, whereas the rest of the flags (~DRM_BO_MASK_MEMTYPE) are transferred from the mask in drm_buffer_object_validate, which runs after the move and bind. So I guess you should extend DRM_BO_MASK_MEMTYPE and do a quick sanity check (perhaps just a passthrough) in drm_bo_mt_compatible. That should do the trick. That said, it's probably possible to do this in a nicer way, yes.

Dave.

diff --git a/linux-core/drm_agpsupport.c b/linux-core/drm_agpsupport.c
index 8c7f570..4015659 100644
--- a/linux-core/drm_agpsupport.c
+++ b/linux-core/drm_agpsupport.c
@@ -541,11 +541,17 @@ static int drm_agp_bind_ttm(struct drm_ttm_backend *backend,
 		container_of(backend, struct drm_agp_ttm_backend, backend);
 	DRM_AGP_MEM *mem = agp_be->mem;
 	int ret;
+	int snooped = (bo_mem->flags & DRM_BO_FLAG_CACHED) && !(bo_mem->flags & DRM_BO_FLAG_CACHED_MAPPED);

 	DRM_DEBUG("drm_agp_bind_ttm\n");
 	mem->is_flushed = TRUE;
-	mem->type = (bo_mem->flags & DRM_BO_FLAG_CACHED) ? AGP_USER_CACHED_MEMORY :
-		AGP_USER_MEMORY;
+	mem->type = AGP_USER_MEMORY;
+	/* CACHED MAPPED implies not snooped memory */
+	if (snooped) {
+		DRM_ERROR("snooped is set oh no 0x%016llx\n", (unsigned long long)bo_mem->flags);
+		mem->type = AGP_USER_CACHED_MEMORY;
+	}
+
 	ret = drm_agp_bind_memory(mem, bo_mem->mm_node->start);
 	if (ret) {
 		DRM_ERROR("AGP Bind memory failed\n");
diff --git a/linux-core/drm_bo.c b/linux-core/drm_bo.c
index 16203c7..05b9f5a 100644
--- a/linux-core/drm_bo.c
+++ b/linux-core/drm_bo.c
@@ -1029,7 +1029,7 @@ static int drm_bo_busy(struct drm_buffer_object * bo)
 	return 0;
 }

-static int drm_bo_read_cached(struct drm_buffer_object * bo)
+static int drm_bo_evict_cached(struct drm_buffer_object * bo)
 {
 	int ret = 0;

@@ -1177,15 +1177,11 @@ static int drm_buffer_object_map(struct drm_file *file_priv, uint32_t handle,
 				goto out;
 			}

-			if ((map_flags & DRM_BO_FLAG_READ) &&
-			    (bo->mem.flags & DRM_BO_FLAG_READ_CACHED) &&
-			    (!(bo->mem.flags & DRM_BO_FLAG_CACHED))) {
-				drm_bo_read_cached(bo);
-			}
+			if (bo->mem.flags & DRM_BO_FLAG_CACHED_MAPPED)
+				drm_bo_evict_cached(bo);
+
 			break;
-		} else if ((map_flags & DRM_BO_FLAG_READ) &&
-			   (bo->mem.flags & DRM_BO_FLAG_READ_CACHED) &&
-			   (!(bo->mem.flags & DRM_BO_FLAG_CACHED))) {
+		} else if (bo->mem.flags & DRM_BO_FLAG_CACHED_MAPPED) {

 			/*
 			 * We are already mapped with different flags.
@@ -1292,6 +1288,7 @@ int drm_bo_move_buffer(struct drm_buffer_object * bo, uint64_t new_mem_flags,
 	if (ret)
 		return ret;

+	DRM_DEBUG("new mem flags 0x%016llx\n", new_mem_flags);
 	mem.num_pages = bo->num_pages;
 	mem.size = mem.num_pages << PAGE_SHIFT;
 	mem.mask = new_mem_flags;
@@ -1666,7 +1663,6 @@ int drm_buffer_object_create(struct drm_device *dev,
 		DRM_BO_FLAG_MAPPABLE;
 	atomic_inc(&bm->count);
 	ret = drm_bo_new_mask(bo, mask, hint);
-
 	if (ret)
 		goto out_err;
diff --git a/linux-core/drm_ttm.c b/linux-core/drm_ttm.c
index df9e7e4..83aa28a 100644
--- a/linux-core/drm_ttm.c
+++ b/linux-core/drm_ttm.c
@@ -327,9 +327,10 @@ int drm_bind_ttm(struct drm_ttm * ttm, struct drm_bo_mem_reg *bo_mem)
 	if (ret)
 		return ret;

+	DRM_DEBUG("bo->mem flags is 0x%016llx\n", bo_mem->flags);
 	if (ttm->state == ttm_unbound && !(bo_mem->flags & DRM_BO_FLAG_CACHED)) {
 		drm_set_caching(ttm, DRM_TTM_PAGE_UNCACHED);
-	} else if ((bo_mem->flags & DRM_BO_FLAG_CACHED) &&
+	} else if ((bo_mem->flags & DRM_BO_FLAG_CACHED_MAPPED) &&
 		   bo_driver->ttm_cache_flush)
 		bo_driver->ttm_cache_flush(ttm);
diff --git a/shared-core/drm.h
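Thomas's rule - bo_mem->mask is the request, bo_mem->flags is the response, and only bits inside DRM_BO_MASK_MEMTYPE survive the move/bind path - can be sketched in plain user-space C. The flag values below are illustrative and the helper resolve_flags() is hypothetical (the real logic is spread across drm_bo_mem_space and drm_buffer_object_validate); this only shows why a bit left out of the memtype mask vanishes before drm_bind_ttm sees it:

```c
#include <stdint.h>

/* Illustrative flag bits; only (1ULL << 19) for CACHED_MAPPED matches the patch. */
#define DRM_BO_FLAG_MEM_TT        (1ULL << 1)
#define DRM_BO_FLAG_MEM_LOCAL     (1ULL << 2)
#define DRM_BO_FLAG_CACHED        (1ULL << 10)
#define DRM_BO_FLAG_CACHED_MAPPED (1ULL << 19)

/* The move/bind code only consults DRM_BO_MASK_MEMTYPE, so any bit that must
 * reach the TTM binding function has to be part of this mask. */
#define DRM_BO_MASK_MEMTYPE \
    (DRM_BO_FLAG_MEM_TT | DRM_BO_FLAG_MEM_LOCAL | DRM_BO_FLAG_CACHED | \
     DRM_BO_FLAG_CACHED_MAPPED)

/* Hypothetical passthrough: memtype bits are taken from the request mask at
 * move/bind time; all other bits are transferred later (modeled here by
 * simply preserving the old non-memtype flags). */
static uint64_t resolve_flags(uint64_t old_flags, uint64_t mask)
{
    return (mask & DRM_BO_MASK_MEMTYPE) | (old_flags & ~DRM_BO_MASK_MEMTYPE);
}
```

With CACHED_MAPPED inside the mask, a request for a cached-mapped TT binding keeps the bit; remove it from DRM_BO_MASK_MEMTYPE and the same request would reach the AGP backend looking like a plain snoop-eligible CACHED binding.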
Re: intel hw and caching interface to TTM..
OK. We're using this functionality in Poulsbo, so we should probably sort this out to avoid breaking things.

Okay, I'll try and fix it back up tomorrow. I've attached my first attempt at fixing this up, but it messes up: by the time the flags hit the ttm binding function they've lost the CACHED_MAPPED bit and I end up with snooped bindings. drm_bo_mt_compatible seems to be where it disappears, and also where my sanity started to end. Are you sure there isn't a nicer way of doing those flags/masks? The current system is vastly confusing to trace through. Can you suggest what I'm missing to fix this up?

Dave.

diff --git a/linux-core/drm_agpsupport.c b/linux-core/drm_agpsupport.c
index 8c7f570..4015659 100644
--- a/linux-core/drm_agpsupport.c
+++ b/linux-core/drm_agpsupport.c
@@ -541,11 +541,17 @@ static int drm_agp_bind_ttm(struct drm_ttm_backend *backend,
 		container_of(backend, struct drm_agp_ttm_backend, backend);
 	DRM_AGP_MEM *mem = agp_be->mem;
 	int ret;
+	int snooped = (bo_mem->flags & DRM_BO_FLAG_CACHED) && !(bo_mem->flags & DRM_BO_FLAG_CACHED_MAPPED);

 	DRM_DEBUG("drm_agp_bind_ttm\n");
 	mem->is_flushed = TRUE;
-	mem->type = (bo_mem->flags & DRM_BO_FLAG_CACHED) ? AGP_USER_CACHED_MEMORY :
-		AGP_USER_MEMORY;
+	mem->type = AGP_USER_MEMORY;
+	/* CACHED MAPPED implies not snooped memory */
+	if (snooped) {
+		DRM_ERROR("snooped is set oh no 0x%016llx\n", (unsigned long long)bo_mem->flags);
+		mem->type = AGP_USER_CACHED_MEMORY;
+	}
+
 	ret = drm_agp_bind_memory(mem, bo_mem->mm_node->start);
 	if (ret) {
 		DRM_ERROR("AGP Bind memory failed\n");
diff --git a/linux-core/drm_bo.c b/linux-core/drm_bo.c
index 16203c7..05b9f5a 100644
--- a/linux-core/drm_bo.c
+++ b/linux-core/drm_bo.c
@@ -1029,7 +1029,7 @@ static int drm_bo_busy(struct drm_buffer_object * bo)
 	return 0;
 }

-static int drm_bo_read_cached(struct drm_buffer_object * bo)
+static int drm_bo_evict_cached(struct drm_buffer_object * bo)
 {
 	int ret = 0;

@@ -1177,15 +1177,11 @@ static int drm_buffer_object_map(struct drm_file *file_priv, uint32_t handle,
 				goto out;
 			}

-			if ((map_flags & DRM_BO_FLAG_READ) &&
-			    (bo->mem.flags & DRM_BO_FLAG_READ_CACHED) &&
-			    (!(bo->mem.flags & DRM_BO_FLAG_CACHED))) {
-				drm_bo_read_cached(bo);
-			}
+			if (bo->mem.flags & DRM_BO_FLAG_CACHED_MAPPED)
+				drm_bo_evict_cached(bo);
+
 			break;
-		} else if ((map_flags & DRM_BO_FLAG_READ) &&
-			   (bo->mem.flags & DRM_BO_FLAG_READ_CACHED) &&
-			   (!(bo->mem.flags & DRM_BO_FLAG_CACHED))) {
+		} else if (bo->mem.flags & DRM_BO_FLAG_CACHED_MAPPED) {

 			/*
 			 * We are already mapped with different flags.
@@ -1292,6 +1288,7 @@ int drm_bo_move_buffer(struct drm_buffer_object * bo, uint64_t new_mem_flags,
 	if (ret)
 		return ret;

+	DRM_DEBUG("new mem flags 0x%016llx\n", new_mem_flags);
 	mem.num_pages = bo->num_pages;
 	mem.size = mem.num_pages << PAGE_SHIFT;
 	mem.mask = new_mem_flags;
@@ -1666,7 +1663,6 @@ int drm_buffer_object_create(struct drm_device *dev,
 		DRM_BO_FLAG_MAPPABLE;
 	atomic_inc(&bm->count);
 	ret = drm_bo_new_mask(bo, mask, hint);
-
 	if (ret)
 		goto out_err;
diff --git a/linux-core/drm_ttm.c b/linux-core/drm_ttm.c
index df9e7e4..83aa28a 100644
--- a/linux-core/drm_ttm.c
+++ b/linux-core/drm_ttm.c
@@ -327,9 +327,10 @@ int drm_bind_ttm(struct drm_ttm * ttm, struct drm_bo_mem_reg *bo_mem)
 	if (ret)
 		return ret;

+	DRM_DEBUG("bo->mem flags is 0x%016llx\n", bo_mem->flags);
 	if (ttm->state == ttm_unbound && !(bo_mem->flags & DRM_BO_FLAG_CACHED)) {
 		drm_set_caching(ttm, DRM_TTM_PAGE_UNCACHED);
-	} else if ((bo_mem->flags & DRM_BO_FLAG_CACHED) &&
+	} else if ((bo_mem->flags & DRM_BO_FLAG_CACHED_MAPPED) &&
 		   bo_driver->ttm_cache_flush)
 		bo_driver->ttm_cache_flush(ttm);
diff --git a/shared-core/drm.h b/shared-core/drm.h
index 3a10273..9d00e9f 100644
--- a/shared-core/drm.h
+++ b/shared-core/drm.h
@@ -700,10 +700,12 @@ struct drm_fence_arg {
  */
 #define DRM_BO_FLAG_NO_MOVE     (1ULL << 8)

-/* Mask: Make sure the buffer is in cached memory when mapped for reading.
+/* Mask: Make sure the buffer is in cached memory when mapped
  * Flags: Acknowledge.
+ * Buffers allocated with this flag should not be used for suballocators
  */
-#define DRM_BO_FLAG_READ_CACHED		(1ULL << 19)
+#define DRM_BO_FLAG_CACHED_MAPPED	(1ULL << 19)
+
 /* Mask: Force DRM_BO_FLAG_CACHED flag strictly also if it is set.
 *
Re: intel hw and caching interface to TTM..
Dave Airlie wrote:

Hi,

So currently the TTM interface allows the user to specify a cacheable allocation, and on Intel hardware this gets conflated with using the Intel snooped memory type in the GART. This is bad, as the Intel snooped memory type comes with its own set of special rules and sucks for lots of things. However, I want to be able to use cacheable CPU memory with the GPU to a) make things go fast, b) avoid lots of SMP cross-talking and cache flushing (see a), and c) make buffer object creation faster.

Cool.

So this led me to the patch at: http://cgit.freedesktop.org/~airlied/drm/diff/?h=intel-hackery&id=da14a0bbb8849cdc91ca87786fde90ac36fe1198 I could add back the snooped option if we want, using a driver private flag.

Dave, I'd like to see the flag DRM_BO_FLAG_CACHED really mean cache-coherent memory, that is, cache coherent also while visible to the GPU. There are HW implementations out there (Poulsbo at least) where this option actually seems to work, although it's considerably slower for things like texturing. It's also a requirement for user bo's, since they will have VMAs that we can't kill and remap. Could we perhaps change the flag DRM_BO_FLAG_READ_CACHED to mean DRM_BO_FLAG_MAPPED_CACHED to implement the behaviour you describe? This will also indicate that the buffer cannot be used for user-space sub-allocators, as we in that case must be able to guarantee that the CPU can access parts of the buffer while other parts are validated for the GPU.

This patch evicts the buffer on mapping so the GPU doesn't see anything via the aperture or otherwise, and flushes before validating into the aperture. It doesn't contain the chipset flush patch yet, which is required to actually make it work (add agp_chipset_flush to i915_dma.c before submitting the batchbuffer).

OK.
This works and appears to be nice and fast: all userspace buffers can be allocated _LOCAL | _CACHED and validated to _TT later without any major cache flushing overhead when we have clflush, and without SMP overhead at all, as cache flushing is cache coherent on the Intel chipsets I've played with so far (CPU coherent, not GPU).

Does this mean that clflush() on one processor flushes the cache line on all processors in an SMP system? No need for preemption guarding and IPIs?

I of course need to make this code not so x86 specific, so I might add a page flush hook to the driver interface and put the flushing code in the driver side. This also leads me into backwards compatibility: the chipset flushing changes to AGP are required for all of this good stuff. Options are 1) resurrect linux-agp-compat and add the chipset flushing code - easier; or 2) try and hack chipset flushing into drm_compat.c - probably more difficult than I would like to bother with. So comments please on whether a comeback for linux-agp-compat is a good or bad thing.

Perhaps an intel-specific TTM backend?

Dave.

/Thomas

- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now http://get.splunk.com/ -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: intel hw and caching interface to TTM..
Dave, I'd like to see the flag DRM_BO_FLAG_CACHED really mean cache-coherent memory, that is, cache coherent also while visible to the GPU. There are HW implementations out there (Poulsbo at least) where this option actually seems to work, although it's considerably slower for things like texturing. It's also a requirement for user bo's, since they will have VMAs that we can't kill and remap.

Most PCIE cards will be cache coherent; AGP cards, however, not so much. So we need to think about whether a generic _CACHED makes sense, especially for something like radeon: will I have to pass different flags depending on the GART type? This seems like uggh.. so maybe a separate flag makes more sense..

Could we perhaps change the flag DRM_BO_FLAG_READ_CACHED to mean DRM_BO_FLAG_MAPPED_CACHED to implement the behaviour you describe? This will also indicate that the buffer cannot be used for user-space sub-allocators, as we in that case must be able to guarantee that the CPU can access parts of the buffer while other parts are validated for the GPU.

Yes; to be honest, sub-allocators for most use-cases should be avoided if possible. We should be able to make the kernel interface fast enough for most things if we don't have to switch caching flags on the fly at map/destroy etc..

Does this mean that clflush() on one processor flushes the cache line on all processors in an SMP system? No need for preemption guarding and IPIs?

Yes, this is specified to do this. TLB flushing would require an IPI, but we should mostly be able to avoid doing TLB flushes from what I can see. The problem I was seeing on i9xx was that the chipset had its own Global Write Buffer, which wasn't that well documented.. so stuff could hide in it..

So comments please on whether a comeback for linux-agp-compat is a good or bad thing.. Perhaps an intel-specific TTM backend?
Actually, I did a backport into an i915_compat.c and it seems like I can make that work. However, something in the changes you made in your branch broke something on me, so I'll have to spend tomorrow rebasing my tree with your patches to see what went wrong; something doesn't appear to be getting flushed at the right time.

Dave.
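The clflush property Dave confirms above - that flushing a line on one CPU evicts it from every cache in the coherence domain, so no IPIs are needed for the data - is what makes the cheap _LOCAL | _CACHED to _TT transition work. A minimal user-space range-flush helper might look like the sketch below; the CLFLUSH instruction (via GCC/Clang's `__builtin_ia32_clflush`) is real, but the 64-byte line-size assumption and the non-x86 fallback are ours:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption: 64-byte lines, true on the Intel parts discussed */

/* Flush every cache line backing [addr, addr + len). On x86, CLFLUSH
 * invalidates the line in all caches in the coherence domain, so no
 * cross-CPU IPIs are needed for the data itself (TLB shootdowns are a
 * separate issue, as is the i9xx Global Write Buffer, which needs the
 * chipset flush discussed in the thread). */
static void clflush_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
#if defined(__x86_64__) || defined(__i386__)
    for (; p < end; p += CACHE_LINE)
        __builtin_ia32_clflush((const void *)p);
#else
    (void)p; (void)end; /* non-x86: this is where a driver page-flush hook would go */
#endif
}
```

Flushing is purely a write-back/invalidate; it never changes the data the CPU observes, which is why the kernel can do it lazily just before validating a buffer into the aperture.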
Re: intel hw and caching interface to TTM..
Dave Airlie wrote:

Dave, I'd like to see the flag DRM_BO_FLAG_CACHED really mean cache-coherent memory, that is, cache coherent also while visible to the GPU. There are HW implementations out there (Poulsbo at least) where this option actually seems to work, although it's considerably slower for things like texturing. It's also a requirement for user bo's, since they will have VMAs that we can't kill and remap.

Most PCIE cards will be cache coherent; AGP cards, however, not so much. So we need to think about whether a generic _CACHED makes sense, especially for something like radeon: will I have to pass different flags depending on the GART type? This seems like uggh.. so maybe a separate flag makes more sense..

Could we perhaps change the flag DRM_BO_FLAG_READ_CACHED to mean DRM_BO_FLAG_MAPPED_CACHED to implement the behaviour you describe? This will also indicate that the buffer cannot be used for user-space sub-allocators, as we in that case must be able to guarantee that the CPU can access parts of the buffer while other parts are validated for the GPU.

Yes; to be honest, sub-allocators for most use-cases should be avoided if possible. We should be able to make the kernel interface fast enough for most things if we don't have to switch caching flags on the fly at map/destroy etc..

Hmm - if that was true, why do we have malloc() and friends? Aren't they just sub-allocators for brk() and mmap()? There is more to this than performance: applications out there can allocate extraordinarily large numbers of small textures, which can only sanely be dealt with as light-weight userspace suballocations of a sensibly-sized buffer. (We don't do this yet, but will need to at some point!) The reasons for this are granularity (ie wasted space in the allocation), the memory overhead of managing all these allocations, and, perhaps third, performance.
If you think about what goes on in a 3d driver, you are always doing sub-allocations of some sort or another, though that's more obvious when you start doing state objects that have an independent lifecycle, as opposed to just emitting state linearly into a command buffer. For managing objects of a few dozen bytes, obviously you are going to want to do that in userspace. So there is a continuum where successively larger buffers increasingly justify whatever additional cost there is to go directly to the kernel to allocate them. But for sufficiently small or frequently allocated buffers, there will always be a crossover point where it is faster to do it in userspace. It certainly makes sense to speed up the kernel paths, but that won't make the crossover point go away - it'll just shift it more or less depending on how successful you are.

Keith
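Keith's crossover argument is essentially the case for a user-space suballocator: carve many small objects out of one kernel-allocated buffer object so each glyph or state object doesn't need its own bo and VMA. A deliberately minimal bump-pointer version (entirely hypothetical - this is not TTM API, and a real one would be the slab/free-list allocator Thomas mentions) might look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Sub-allocate small objects from one large backing buffer, so each tiny
 * allocation does not pay the kernel round-trip, VMA setup, and GART/page
 * table cost discussed in the thread. */
struct subheap {
    uint8_t *base;   /* start of the (already validated) backing buffer */
    size_t   size;   /* total bytes available */
    size_t   used;   /* bump pointer; a real allocator would track free spans */
};

static void *sub_alloc(struct subheap *h, size_t n, size_t align)
{
    /* align must be a power of two */
    size_t off = (h->used + (align - 1)) & ~(align - 1);
    if (off + n > h->size)
        return NULL;    /* caller falls back to a real kernel allocation */
    h->used = off + n;
    return h->base + off;
}
```

The granularity point is visible even here: the allocator wastes at most `align - 1` bytes per object, versus a full page (or more) per object when every allocation is its own buffer object.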
Re: intel hw and caching interface to TTM..
Dave Airlie wrote:

Dave, I'd like to see the flag DRM_BO_FLAG_CACHED really mean cache-coherent memory, that is, cache coherent also while visible to the GPU. There are HW implementations out there (Poulsbo at least) where this option actually seems to work, although it's considerably slower for things like texturing. It's also a requirement for user bo's, since they will have VMAs that we can't kill and remap.

Most PCIE cards will be cache coherent; AGP cards, however, not so much. So we need to think about whether a generic _CACHED makes sense, especially for something like radeon: will I have to pass different flags depending on the GART type? This seems like uggh.. so maybe a separate flag makes more sense..

OK. We're using this functionality in Poulsbo, so we should probably sort this out to avoid breaking things.

Could we perhaps change the flag DRM_BO_FLAG_READ_CACHED to mean DRM_BO_FLAG_MAPPED_CACHED to implement the behaviour you describe? This will also indicate that the buffer cannot be used for user-space sub-allocators, as we in that case must be able to guarantee that the CPU can access parts of the buffer while other parts are validated for the GPU.

Yes; to be honest, sub-allocators for most use-cases should be avoided if possible. We should be able to make the kernel interface fast enough for most things if we don't have to switch caching flags on the fly at map/destroy etc..

Yes, Eric seems to have the same opinion. I'm not quite sure I understand the reasoning behind it. Is it the added complexity, or something else? While it's super to have a fast kernel interface, the inherent latency and allocation granularity will probably always make a user-space sub-allocator a desirable thing - particularly something like a slab allocator, which would also to some extent avoid fragmentation. My view of TTM has changed to be a bit from the opposite side: let's say we have a fast user-space per-client allocator.
What kernel functionality would we require to make sure that it can assume it's the sole owner of the memory it manages? For a repeated usage pattern like batch-buffers, we end up allocating pages from the kernel, setting up one VMA per buffer, modifying GART and page tables, and in the worst case even changing caching policy, for each and every use. Even if this can be made reasonably fast, I think it's a CPU overhead we really shouldn't be paying??

/Thomas
Re: intel hw and caching interface to TTM..
OK. We're using this functionality in Poulsbo, so we should probably sort this out to avoid breaking things.

Okay, I'll try and fix it back up tomorrow..

Yes, Eric seems to have the same opinion. I'm not quite sure I understand the reasoning behind it. Is it the added complexity, or something else? While it's super to have a fast kernel interface, the inherent latency and allocation granularity will probably always make a user-space sub-allocator a desirable thing - particularly something like a slab allocator, which would also to some extent avoid fragmentation. My view of TTM has changed to be a bit from the opposite side: let's say we have a fast user-space per-client allocator. What kernel functionality would we require to make sure that it can assume it's the sole owner of the memory it manages?

We have slightly different use-cases. I'm probably more targeting 3d desktop uses, where sharing buffers is more important (think pixmaps etc), so making the allocator go as fast as possible makes sense for this case, as we need to have it in the kernel to do the sharing...

For a repeated usage pattern like batch-buffers, we end up allocating pages from the kernel, setting up one VMA per buffer, modifying GART and page tables, and in the worst case even changing caching policy, for each and every use. Even if this can be made reasonably fast, I think it's a CPU overhead we really shouldn't be paying??

Or we end up allocating a large amount of space to store on-the-fly buffers, which requires more tuning per application; some apps may not need 65k of batchbuffer space etc.. So while I see the need for suballocators (we need one for glyphs and small pixmaps on the 2D side in any case), I also see the need to make things faster for the non-sub-allocated use case. So I don't think the goals of 3D driver vs 3D desktop are mutually exclusive; I think your work has gone too far towards the single app case, and our work is trying to pull it back towards our use case..

Dave.
Re: intel hw and caching interface to TTM..
Dave Airlie wrote:

OK. We're using this functionality in Poulsbo, so we should probably sort this out to avoid breaking things.

Okay, I'll try and fix it back up tomorrow..

Yes, Eric seems to have the same opinion. I'm not quite sure I understand the reasoning behind it. Is it the added complexity, or something else? While it's super to have a fast kernel interface, the inherent latency and allocation granularity will probably always make a user-space sub-allocator a desirable thing - particularly something like a slab allocator, which would also to some extent avoid fragmentation. My view of TTM has changed to be a bit from the opposite side: let's say we have a fast user-space per-client allocator. What kernel functionality would we require to make sure that it can assume it's the sole owner of the memory it manages?

We have slightly different use-cases. I'm probably more targeting 3d desktop uses, where sharing buffers is more important (think pixmaps etc), so making the allocator go as fast as possible makes sense for this case, as we need to have it in the kernel to do the sharing...

Agreed.

For a repeated usage pattern like batch-buffers, we end up allocating pages from the kernel, setting up one VMA per buffer, modifying GART and page tables, and in the worst case even changing caching policy, for each and every use. Even if this can be made reasonably fast, I think it's a CPU overhead we really shouldn't be paying??

Or we end up allocating a large amount of space to store on-the-fly buffers, which requires more tuning per application; some apps may not need 65k of batchbuffer space etc.. So while I see the need for suballocators (we need one for glyphs and small pixmaps on the 2D side in any case), I also see the need to make things faster for the non-sub-allocated use case. So I don't think the goals of 3D driver vs 3D desktop are mutually exclusive; I think your work has gone too far towards the single app case, and our work is trying to pull it back towards our use case..
Indeed. There are certainly different use-cases; in some cases a sub-allocator is beneficial, and in some cases we gain more by not using one. I had a concern that they would be considered generally useless and that the needed kernel support would go away.

/Thomas