On 13.06.19 at 18:37, Kuehling, Felix wrote:
> [SNIP]
>>>> There is an issue found by KFDTest.BigBufStressTest. It allocates
>>>> buffers up to 3/8 of the total 256GB system memory, each buffer is
>>>> 128MB, and then uses a queue to write to the buffers. If the pfn of a
>>>> page returned to ttm_mem_global_alloc_page is below 4GB, the page is
>>>> accounted to the dma32 zone and will exhaust the 2GB limit; then
>>>> ttm_check_swapping will schedule ttm_shrink_work to start eviction. It
>>>> takes minutes to finish the restore (it retries many times if busy),
>>>> and the test fails because of a queue timeout. This eviction is
>>>> unnecessary because we still have enough free system memory.
>>> No, that is definitely necessary. For example, on my laptop it's the
>>> sound system which needs DMA32.
>>>
>>> Without this, when an application uses a lot of memory we run into the
>>> OOM killer as soon as some audio starts playing.
>>>
>> I did not realize TTM is used by other drivers. I agree we cannot remove
>> the dma32 zone; this would break other device drivers which depend on
>> the dma32 zone.
> If I understand Christian correctly, the point is not that other drivers
> use TTM, but that other drivers need dma32 memory (memory with 32-bit
> physical addresses). If TTM used up all that memory, it would break
> those other drivers. As a good steward of memory, TTM limits its usage
> of dma32 memory in order to allow other drivers to have access to it.

Yes, exactly.

>
> Regards,
>   Felix
>
>
>>>> It's a random case that happens about 1 time in 5. I can change the
>>>> test to increase the timeout value to work around this, but this
>>>> seems to be a TTM bug. It will slow application performance a lot if
>>>> this random issue happens.
>>> One thing we could try is to dig into why the kernel gives us DMA32
>>> pages when there are still other pages available. Please take a look at
>>> /proc/buddyinfo on that box for this.
>>>
>> Thanks for the advice. From buddyinfo, the machine has 4 nodes, each
>> node has 64GB of memory, and the dma32 zone is on node 0.
>> BigBufStressTest allocates 96GB. If I force the test onto node 1 with
>> "numactl --cpunodebind=1 ./kfdtest", there is no eviction at all. If I
>> force the test onto node 0, it always has eviction and restore because
>> it uses up all memory on node 0, including the dma32 zone. I will change
>> the test app to avoid running on node 0 to work around this.

That is a really interesting test case you got here.

I actually think that exhausting DMA32 before looking into another NUMA
node is a bug in the core MM.

Anyway, I will put it on my TODO list to improve the handling in TTM;
it shouldn't be more than a day or two of work.

Until that's done, the workaround of not using node 0 should do it.

Christian.

>>
>> Thanks,
>> Philip
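
For anyone who wants to repeat the /proc/buddyinfo check discussed above, here is a minimal sketch that sums the free memory per NUMA node and zone. It is only an illustration: the function name parse_buddyinfo is made up for this mail, and the 4 KiB page size is an assumption that may not hold on every platform.

```python
import os
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def parse_buddyinfo(text):
    """Return {(node, zone): free_bytes} from /proc/buddyinfo-style text.

    Each line lists, per allocation order, the number of free blocks of
    2**order pages, so free pages = sum(count << order).
    """
    free = {}
    for line in text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)", line)
        if not m:
            continue  # skip anything that is not a zone line
        node, zone = int(m.group(1)), m.group(2)
        counts = m.group(3).split()
        pages = sum(int(c) << order for order, c in enumerate(counts))
        free[(node, zone)] = pages * PAGE_SIZE
    return free


if __name__ == "__main__":
    path = "/proc/buddyinfo"
    if os.path.exists(path):
        with open(path) as f:
            for (node, zone), nbytes in sorted(parse_buddyinfo(f.read()).items()):
                print(f"node {node} zone {zone:8s} {nbytes / 2**20:10.1f} MiB free")
```

Running this before and during the test on node 0 and on node 1 should show whether the DMA32 zone on node 0 is being drained while other nodes still have free memory in their Normal zones.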