[+Philip]

On 2018-04-20 10:47 AM, Michel Dänzer wrote:
> On 2018-04-11 11:37 AM, Christian König wrote:
>> On 11.04.2018 at 06:00, Gabriel C wrote:
>>> 2018-04-09 11:42 GMT+02:00 Christian König
>>> <ckoenig.leichtzumer...@gmail.com>:
>>>> On 07.04.2018 at 00:00, Jean-Marc Valin wrote:
>>>>> Hi Christian,
>>>>>
>>>>> Thanks for the info. FYI, I've also opened a Firefox bug for that at:
>>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
>>>>> Feel free to comment since you have a better understanding of what's
>>>>> going on.
>>>>>
>>>>> One last question: right now I'm running 4.15.0 with the "offending"
>>>>> patch reverted. Is that safe to run or are there possible bad
>>>>> interactions with other changes?
>>>> That should work without problems.
>>>>
>>>> But I just had another idea as well: if you want, you could still test
>>>> the new code path which will be used in 4.17.
>>>>
>>> While Firefox may do some strange things, this is not only about Firefox.
>>>
>>> With your patches my EPYC box is unusable with 4.15++ kernels.
>>> The whole desktop is acting weird. This one is using
>>> a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>>
>>> Box is 2 * EPYC 7281 with 128 GB ECC RAM
>>>
>>> Also a 14C Xeon box with a HD7700 is broken the same way.
>> The hardware is irrelevant for this. We need to know what software stack
>> you use on top of it.
>>
>> E.g. desktop environment/Mesa and DDX version etc...
>>
>>> Everything breaks in X: scrolling, moving windows, flickering, etc.
>>>
>>> Reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and
>>> 648bc3574716400acc06f99915815f80d9563783
>>> from a 4.15 kernel makes things work again.
>>>
>>>> Backporting all the detection logic is too invasive, but you could
>>>> just go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcibly
>>>> use the other code path.
>>>>
>>>> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
>>>>
>>> Well, you really can't be serious about these suggestions? Are you?
>>>
>>> Telling people to #if 0 random code is not a solution.
>> That is for testing and not a permanent solution.
>>
>>> You broke existing working userland with your patches, and at least
>>> please fix that for 4.16.
>>>
>>> I can help testing code for 4.17/++ if you wish, but that is a
>>> *different* story.
>> Please test Alex's amd-staging-drm-next branch from
>> git://people.freedesktop.org/~agd5f/linux.
> I think we're still missing something here.
>
> I'm currently running 4.16.2 + the DRM subsystem changes which are going
> into 4.17 (so I have the changes Christian is referring to) with a
> Kaveri APU, and I'm seeing similar symptoms as described by Jean-Marc.
> Some observations:
>
> Firefox, Thunderbird, or worst, gnome-shell, can freeze for up to about
> a minute, during which the kernel is spending most of one
> core's cycles inside alloc_pages (__alloc_pages_nodemask to be more
> precise), called from ttm_alloc_new_pages.

Philip debugged a similar problem with a KFD memory stress test about
two weeks ago, where the kernel was seemingly stuck in an infinite loop
trying to allocate huge pages. I'm pasting his analysis for the record:

> [...] it uses huge_flags GFP_TRANSHUGE to call alloc_pages(), this
> seems a corner case inside __alloc_pages_slowpath(), it never exits
> but goes to retry path every time. It can reclaim pages and
> did_some_progress (as a result, no_progress_loops is reset to 0 every
> loop, never reach MAX_RECLAIM_RETRIES) but cannot finish huge page
> allocations under this specific memory pressure.

As a workaround to unblock our release branch testing we removed
transparent huge page allocation from ttm_get_pages. We're seeing this
as far back as 4.13 on our release branch.

If we're really talking about the same problem, I don't think it's
caused by recent page allocator changes, but rather exposed by recent
TTM changes.

Regards,
  Felix

>
> At least in the case of Firefox, this happens due to Mesa internal BO
> allocations for glTex(Sub)Image, so it's not obvious that Firefox is
> doing something wrong.
>
> I never noticed this before this week. Before, I was running 4.15.y +
> DRM subsystem changes from 4.16. Maybe something has changed in core
> code, trying harder to allocate huge pages.
>
>
> Maybe TTM should only try to use any huge pages that happen to be
> available, not spend any (/ "too much"?) additional effort trying to
> free up huge pages?
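
For illustration only, the retry behaviour described in Philip's analysis can be modelled with a small user-space sketch. This is a toy model in Python, not kernel code, and it follows the quoted description rather than the actual __alloc_pages_slowpath() source. The function simulate_slowpath, its parameters, and the iteration cap are hypothetical names introduced here, and the value 16 for MAX_RECLAIM_RETRIES is an assumption.

```python
# Toy model of the retry loop as described in the quoted analysis.
# Assumption: the retry limit is 16; the real value may differ.
MAX_RECLAIM_RETRIES = 16


def simulate_slowpath(reclaim_progress, huge_page_free_at=None, iteration_cap=1000):
    """Simulate repeated attempts to allocate one huge page.

    reclaim_progress: True if every reclaim pass frees some (small) pages.
    huge_page_free_at: iteration at which a huge page becomes available,
        or None if fragmentation never lets one form.
    iteration_cap: hypothetical safety cap so the simulation terminates.
    Returns a tuple (outcome, iterations).
    """
    no_progress_loops = 0
    for iteration in range(1, iteration_cap + 1):
        # Allocation succeeds only once a huge page is actually available.
        if huge_page_free_at is not None and iteration >= huge_page_free_at:
            return ("allocated", iteration)
        # Reclaim pass: progress resets the counter, as in the description.
        if reclaim_progress:
            no_progress_loops = 0
        else:
            no_progress_loops += 1
        # The loop only gives up when the counter exceeds the limit.
        if no_progress_loops > MAX_RECLAIM_RETRIES:
            return ("gave_up", iteration)
    return ("still_retrying", iteration_cap)
```

In this model, simulate_slowpath(True) is still retrying when it hits the cap, because each reclaim pass resets the counter while no huge page ever becomes available. simulate_slowpath(False) gives up once the counter exceeds the limit.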