On Thu, Apr 17, 2025 at 1:06 AM Christian König
<christian.koe...@amd.com> wrote:
>
> Am 16.04.25 um 23:51 schrieb Juan Yescas:
>
> On Wed, Apr 16, 2025 at 4:34 AM Christian König
> <christian.koe...@amd.com> wrote:
>
>
> Am 15.04.25 um 19:19 schrieb Juan Yescas:
>
> This change sets the allocation orders for the different page sizes
> (4k, 16k, 64k) based on PAGE_SHIFT. Before this change, the orders
> for large page sizes were calculated incorrectly, which caused the
> system heap to allocate from 2% to 4% more memory on 16KiB page size
> kernels.
>
> This change was tested on 4k/16k page size kernels.
>
> Signed-off-by: Juan Yescas <jyes...@google.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index 26d5dc89ea16..54674c02dcb4 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -50,8 +50,15 @@ static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
> * to match with the sizes often found in IOMMUs. Using order 4 pages instead
> * of order 0 pages can significantly improve the performance of many IOMMUs
> * by reducing TLB pressure and time spent updating page tables.
> + *
> + * Note: When the order is 0, the minimum allocation is PAGE_SIZE. The possible
> + * page sizes for ARM devices could be 4K, 16K and 64K.
> */
> -static const unsigned int orders[] = {8, 4, 0};
> +#define ORDER_1M (20 - PAGE_SHIFT)
> +#define ORDER_64K (16 - PAGE_SHIFT)
> +#define ORDER_FOR_PAGE_SIZE (0)
> +static const unsigned int orders[] = {ORDER_1M, ORDER_64K, ORDER_FOR_PAGE_SIZE};
> +#
>
> Good catch, but I think the defines are just overkill.
>
> What you should do instead is to subtract page shift when using the array.
>
> There are several occurrences of orders[] in the file, and I think it
> is better to do the calculations up front in the array. Would you be
> OK if we get rid of the defines, as per your suggestion, and do the
> calculations in the definition of the array instead? Something like
> this:
>
> static const unsigned int orders[] = {20 - PAGE_SHIFT, 16 - PAGE_SHIFT, 0};
>
>
> Yeah that totally works for me as well. Just make sure that a code comment
> nearby explains why 20, 16 and 0.
>
Thanks, the changes were added in the [PATCH v3].
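For reference, here is how the array evaluates on each kernel
(PAGE_SHIFT is 12, 14 or 16 for 4K, 16K or 64K pages; with the old
{8, 4, 0}, a 16K kernel would have used 4MiB/256KiB chunks instead):

/* orders[] = {20 - PAGE_SHIFT, 16 - PAGE_SHIFT, 0} */
/*  4K kernel (PAGE_SHIFT = 12): {8, 4, 0} -> 1MiB, 64KiB,  4KiB */
/* 16K kernel (PAGE_SHIFT = 14): {6, 2, 0} -> 1MiB, 64KiB, 16KiB */
/* 64K kernel (PAGE_SHIFT = 16): {4, 0, 0} -> 1MiB, 64KiB, 64KiB */

(On 64K kernels the last two entries collapse to the same order.)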
> Apart from that, using 1M, 64K and then falling back to 4K just sounds
> random to me. We have especially pushed back on 64K more than once
> because it is actually not beneficial in almost all cases.
>
> In the hardware where the driver is used, 64K is beneficial for:
>
> Arm SMMUv3 (4KB granule size):
> 64KB (the contiguous bit groups 16 contiguous 4KB pages into a single
> TLB entry)
>
> SysMMU, which benefits from 64KB (“large” page) mappings on 4k/16k
> page size kernels.
>
>
> Question: could this HW also work with pages larger than 64K? (I
> strongly expect yes.)
>
Yes. If the granule size in the SMMUv3 is:
- 4k, we can have 2MiB pages
- 16k, we can have 32MiB pages
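(For context: a block entry at the next translation level maps an
entire lower-level table, so with a 4k granule that is 512 * 4KiB =
2MiB, and with a 16k granule 2048 * 16KiB = 32MiB.)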
> I suggest fixing the code in system_heap_allocate() to not
> over-allocate, and instead just try the available orders like TTM
> does. This has proven to work independent of the architecture.
>
> Do you mean to have an implementation similar to __ttm_pool_alloc()?
>
>
> Yeah something like that.
>
> https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/ttm/ttm_pool.c?h=v6.15-rc2#n728
>
> If that is the case, we can try the change, run some benchmarks and
> submit the patch if we see positive results.
>
>
> Could be that this doesn't matter for your platform, but it's minimal
> extra overhead to ask for larger chunks first, and it avoids
> fragmenting everything into 64K chunks.
>
> That is kind of important since the code in DMA-heaps should be platform
> neutral if possible.
I agree, I'll make the change in another patch.
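Roughly what I have in mind, as an untested sketch only (modeled on
the fallback loop in __ttm_pool_alloc(); the helper name and the exact
gfp handling are illustrative, not the final patch):

static struct page *alloc_largest_fit(unsigned long remaining,
				      unsigned int max_order)
{
	unsigned int order = min_t(unsigned int, max_order,
				   get_order(remaining));
	struct page *page;

	for (; order; order--) {
		/* Skip orders that would allocate past 'remaining'. */
		if ((PAGE_SIZE << order) > remaining)
			continue;
		page = alloc_pages(HIGH_ORDER_GFP, order);
		if (page)
			return page;
	}
	/* Fall back to single pages. */
	return alloc_pages(LOW_ORDER_GFP, 0);
}

That tries every intermediate order instead of just the three entries
in orders[], so e.g. a 1.5MiB request can be served as 1MiB + 512KiB.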
Thanks
Juan
>
> Regards,
> Christian.
>
>
> Thanks
> Juan
>
> Regards,
> Christian.
>
> #define NUM_ORDERS ARRAY_SIZE(orders)
>
> static struct sg_table *dup_sg_table(struct sg_table *table)
>
>