Re: [Patch v2] drm/ttm: Allow direct reclaim to allocate local memory
On 7/8/24 6:06 PM, Rajneesh Bhardwaj wrote: > Limiting the allocation of higher order pages to the closest NUMA node > and enabling direct memory reclaim provides not only failsafe against > situations when memory becomes too much fragmented and the allocator is > not able to satisfy the request from the local node but falls back to > remote pages (HUGEPAGE) but also offers performance improvement. > Accessing remote pages suffers due to bandwidth limitations and could be > avoided if memory becomes defragmented and in most cases without using > manual compaction. (/proc/sys/vm/compact_memory) > > Note: On certain distros such as RHEL, the proactive compaction is > disabled. (https://tinyurl.com/4f32f7rs) > > Cc: Dave Airlie > Cc: Vlastimil Babka > Cc: Daniel Vetter > Reviewed-by: Christian König > Signed-off-by: Rajneesh Bhardwaj > --- > drivers/gpu/drm/ttm/ttm_pool.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c > index 6e1fd6985ffc..cc27d5c7afe8 100644 > --- a/drivers/gpu/drm/ttm/ttm_pool.c > +++ b/drivers/gpu/drm/ttm/ttm_pool.c > @@ -91,7 +91,7 @@ static struct page *ttm_pool_alloc_page(struct ttm_pool > *pool, gfp_t gfp_flags, >*/ > if (order) > gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | > - __GFP_KSWAPD_RECLAIM; > + __GFP_RECLAIM | __GFP_THISNODE; __GFP_RECLAIM includes ___GFP_DIRECT_RECLAIM, so what if the caller is in a context that can't sleep? Then it would be a bug to add that. OTOH the only caller I see is ttm_pool_alloc(), which seems to start with GFP_USER, and that already includes __GFP_RECLAIM, so either way I see no reason to be adding it here, other than it might be a potential bug in case other callers are added later and have a more restricted context. As for the __GFP_THISNODE addition, that should be fine as this seems to be an opportunistic allocation with a fallback that's decreasing the attempted order. Also I note that the caller might be adding __GFP_RETRY_MAYFAIL if (ctx->gfp_retry_mayfail) gfp_flags |= __GFP_RETRY_MAYFAIL; But here adding __GFP_NORETRY makes __GFP_RETRY_MAYFAIL ineffective, as __alloc_pages_slowpath() evaluates __GFP_NORETRY (to decide to give up) earlier than __GFP_RETRY_MAYFAIL (to decide to try a bit harder). That's not introduced by this patch but maybe something to look into, if __GFP_RETRY_MAYFAIL is expected to be useful here for trying harder. If it's instead intended only for the final fallback order-0 case where the gfp_flags adjustment above doesn't apply, then __GFP_RETRY_MAYFAIL will cause the allocation to fail instead of applying the infamous implicit "too small to fail" rule for order-0 allocations and invoking OOM. If that's the reason for it to be used here, all is fine and great. Vlastimil > > if (!pool->use_dma_alloc) { > p = alloc_pages_node(pool->nid, gfp_flags, order);
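To illustrate the point (a minimal sketch only, not the actual TTM code; the helper name is made up): GFP_USER already contains __GFP_RECLAIM, so for the opportunistic high-order attempts only __GFP_THISNODE and the "don't try too hard" flags need adding, with a plain order-0 fallback at the end:

static struct page *alloc_huge_local_first(int nid, unsigned int max_order)
{
        unsigned int order;
        struct page *p;

        for (order = max_order; order; order--) {
                /* Opportunistic: local node only, no retries, no warnings. */
                p = alloc_pages_node(nid, GFP_USER | __GFP_THISNODE |
                                     __GFP_NOMEMALLOC | __GFP_NORETRY |
                                     __GFP_NOWARN, order);
                if (p)
                        return p;
        }

        /* Final order-0 attempt without the restrictions above. */
        return alloc_pages_node(nid, GFP_USER, 0);
}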
Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
On 1/29/24 08:44, Christian König wrote: > Am 26.01.24 um 17:29 schrieb Matthew Brost: >> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote: >>> Am 25.01.24 um 18:30 schrieb Matthew Brost: On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote: > Am 24.01.24 um 22:08 schrieb Matthew Brost: >> All entities must be drained in the DRM scheduler run job worker to >> avoid the following case. An entity found that is ready, no job found >> ready on entity, and run job worker goes idle with other entities + jobs >> ready. Draining all ready entities (i.e. loop over all ready entities) >> in the run job worker ensures all job that are ready will be scheduled. > That doesn't make sense. drm_sched_select_entity() only returns entities > which are "ready", e.g. have a job to run. > That is what I thought too, hence my original design but it is not exactly true. Let me explain. drm_sched_select_entity() returns an entity with a non-empty spsc queue (job in queue) and no *current* waiting dependecies [1]. Dependecies for an entity can be added when drm_sched_entity_pop_job() is called [2][3] returning a NULL job. Thus we can get into a scenario where 2 entities A and B both have jobs and no current dependecies. A's job is waiting B's job, entity A gets selected first, a dependecy gets installed in drm_sched_entity_pop_job(), run work goes idle, and now we deadlock. >>> And here is the real problem. run work doesn't goes idle in that moment. >>> >>> drm_sched_run_job_work() should restarts itself until there is either no >>> more space in the ring buffer or it can't find a ready entity any more. >>> >>> At least that was the original design when that was all still driven by a >>> kthread. >>> >>> It can perfectly be that we messed this up when switching from kthread to a >>> work item. >>> >> Right, that what this patch does - the run worker does not go idle until >> no ready entities are found. That was incorrect in the original patch >> and fixed here. Do you have any issues with this fix? It has been tested >> 3x times and clearly fixes the issue. > > Ah! Yes in this case that patch here is a little bit ugly as well. > > The original idea was that run_job restarts so that we are able to pause > the submission thread without searching for an entity to submit more. > > I strongly suggest to replace the while loop with a call to > drm_sched_run_job_queue() so that when the entity can't provide a job we > just restart the queuing work. Note it's already included in rc2, so any changes need to be a followup fix. If these are important, then please make sure they get to rc3 :)
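For the record, Christian's suggestion would make the worker look roughly like this (a sketch only, using the 6.8 scheduler field/helper names as I understand them; not a tested patch):

static void drm_sched_run_job_work(struct work_struct *w)
{
        struct drm_gpu_scheduler *sched =
                container_of(w, struct drm_gpu_scheduler, work_run_job);
        struct drm_sched_entity *entity;
        struct drm_sched_job *sched_job;

        if (READ_ONCE(sched->pause_submit))
                return;

        entity = drm_sched_select_entity(sched);
        if (!entity)
                return;

        sched_job = drm_sched_entity_pop_job(entity);
        if (!sched_job) {
                complete_all(&entity->entity_idle);
                /*
                 * A dependency was installed while popping the job; requeue
                 * the work item so other ready entities still get a chance.
                 */
                drm_sched_run_job_queue(sched);
                return;
        }

        /* ... submit sched_job as before ... */
}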
Re: [PATCH] drm/sched: Drain all entities in DRM sched run job worker
On 1/24/24 22:08, Matthew Brost wrote: > All entities must be drained in the DRM scheduler run job worker to > avoid the following case. An entity found that is ready, no job found > ready on entity, and run job worker goes idle with other entities + jobs > ready. Draining all ready entities (i.e. loop over all ready entities) > in the run job worker ensures all job that are ready will be scheduled. > > Cc: Thorsten Leemhuis > Reported-by: Mikhail Gavrilov > Closes: > https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuowtoeee+q26z...@mail.gmail.com/ > Reported-and-tested-by: Mario Limonciello > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124 > Link: > https://lore.kernel.org/all/20240123021155.2775-1-mario.limoncie...@amd.com/ > Reported-by: Vlastimil Babka Can change to Reported-and-tested-by: Vlastimil Babka Thanks! > Closes: > https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd15...@amd.com/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a > Signed-off-by: Matthew Brost > --- > drivers/gpu/drm/scheduler/sched_main.c | 15 +++ > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > b/drivers/gpu/drm/scheduler/sched_main.c > index 550492a7a031..85f082396d42 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct > *w) > struct drm_sched_entity *entity; > struct dma_fence *fence; > struct drm_sched_fence *s_fence; > - struct drm_sched_job *sched_job; > + struct drm_sched_job *sched_job = NULL; > int r; > > if (READ_ONCE(sched->pause_submit)) > return; > > - entity = drm_sched_select_entity(sched); > + /* Find entity with a ready job */ > + while (!sched_job && (entity = drm_sched_select_entity(sched))) { > + sched_job = drm_sched_entity_pop_job(entity); > + if (!sched_job) > + complete_all(&entity->entity_idle); > + } > if (!entity) > - return; > - > - sched_job = drm_sched_entity_pop_job(entity); > - if (!sched_job) { > - complete_all(&entity->entity_idle); > return; /* No more work */ > - } > > s_fence = sched_job->s_fence; >
Re: [RFC] Revert "drm/sched: Split free_job into own work item"
On 1/23/24 03:11, Mario Limonciello wrote: > commit f7fe64ad0f22 ("drm/sched: Split free_job into own work item") > causes graphics hangs at GDM or right after logging in on a > Framework 13 AMD laptop (containing a Phoenix APU). > > This reverts commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7. > > Fixes: f7fe64ad0f22 ("drm/sched: Split free_job into own work item") > Signed-off-by: Mario Limonciello > --- > This is a regression introduced in 6.8-rc1, bisected from 6.7. > This revert done on top of 6.8-rc1 fixes the issue. Applying this revert on 6.8-rc1 fixed my issues reported here: https://lore.kernel.org/all/2faccc1a-7fdd-499b-aa0a-bd54f4068...@suse.cz/ Let me know if there's another fix instead of revert so I can test. Thanks, Vlastimil > > I'm happy to gather any data to use to properly debug if that is > preferable to a revert. > --- > drivers/gpu/drm/scheduler/sched_main.c | 133 + > include/drm/gpu_scheduler.h| 4 +- > 2 files changed, 48 insertions(+), 89 deletions(-) >
Re: [git pull] drm for 6.8
On 1/24/24 16:31, Donald Carr wrote: > On Wed, Jan 24, 2024 at 7:06 AM Vlastimil Babka wrote: >> When testing the rc1 on my openSUSE Tumbleweed desktop, I've started >> experiencing "frozen desktop" (KDE/Wayland) issues. The symptoms are that >> everything freezes including mouse cursor. After a while it either resolves, >> or e.g. firefox crashes (if it was actively used when it froze) or it's >> frozen for too long and I reboot with alt-sysrq-b. When it's frozen I can >> still ssh to the machine, and there's nothing happening in dmesg. >> The machine is based on Amd Ryzen 7 2700 and Radeon RX7600. >> >> I've bisected the merge commits so far and now will try to dig into this >> one. I've noticed there was also a drm fixes PR later in the merge window but >> since it was also merged into rc1 and thus didn't prevent the issue for me, >> I guess it's not relevant here? >> >> Because the reproduction wasn't very deterministic I considered a commit bad >> even if it didn't lead to completely frozen desktop and a forced reboot. >> Even the multi-second hangs that resolved were a regression compared to 6.7 >> anyway. >> >> If there are known issues and perhaps candidate fixes already, please do >> tell. > > I am experiencing the exact same symptoms; sddm (on weston) starts > perfectly, launching a KDE wayland session freezes at various points > (leading to plenty of premature celebration), but normally on the > handoff from sddm to kde (replete with terminal cursor on screen) > > Working perfectly as of the end of 6.7 final release, broken as of 6.8 rc1. > Sometimes sddm can be successfully restarted via ssh, other times > restarting sddm is slow and fails to complete. Big thanks to Thorsten who suggested I look at the following: https://lore.kernel.org/all/20240123021155.2775-1-mario.limoncie...@amd.com/ https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuowtoeee+q26z...@mail.gmail.com/ Instead of further bisection I've applied Mario's revert from the first link on top of 6.8-rc1 and the issue seems gone for me now. Vlastimil > Yours sincerely, > Donald
Re: [git pull] drm for 6.8
On 1/10/24 20:49, Dave Airlie wrote: > Hi Linus, > > This is the main drm pull request for 6.8. When testing the rc1 on my openSUSE Tumbleweed desktop, I've started experiencing "frozen desktop" (KDE/Wayland) issues. The symptoms are that everything freezes including mouse cursor. After a while it either resolves, or e.g. firefox crashes (if it was actively used when it froze) or it's frozen for too long and I reboot with alt-sysrq-b. When it's frozen I can still ssh to the machine, and there's nothing happening in dmesg. The machine is based on Amd Ryzen 7 2700 and Radeon RX7600. I've bisected the merge commits so far and now will try to dig into this one. I've noticed there was also a drm fixes PR later in the merge window but since it was also merged into rc1 and thus didn't prevent the issue for me, I guess it's not relevant here? Because the reproduction wasn't very deterministic I considered a commit bad even if it didn't lead to completely frozen desktop and a forced reboot. Even the multi-second hangs that resolved were a regression compared to 6.7 anyway. If there are known issues and perhaps candidate fixes already, please do tell. Thanks, Vlastimil git bisect start '--first-parent' # status: waiting for both good and bad commits # bad: [6613476e225e090cc9aad49be7fa504e290dd33d] Linux 6.8-rc1 git bisect bad 6613476e225e090cc9aad49be7fa504e290dd33d # status: waiting for good commit(s), bad commit known # good: [0dd3ee31125508cd67f7e7172247f05b7fd1753a] Linux 6.7 git bisect good 0dd3ee31125508cd67f7e7172247f05b7fd1753a # bad: [b4442cadca2f97239c8b80f64af7937897b867b1] Merge tag 'x86_tdx_for_6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad b4442cadca2f97239c8b80f64af7937897b867b1 # bad: [c4c6044d35f06a93115e691e79436839962c203e] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm git bisect bad c4c6044d35f06a93115e691e79436839962c203e # bad: [42bff4d0f9b9c8b669c5cef25c5116f41eb45c6b] Merge tag 'pwm/for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm git bisect bad 42bff4d0f9b9c8b669c5cef25c5116f41eb45c6b # good: [32720aca900b226653c843bb4e06b8125312f214] Merge tag 'fsnotify_for_v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs git bisect good 32720aca900b226653c843bb4e06b8125312f214 # good: [5bad490858c3ebdbb47e622e8f9049f828d2abba] Merge tag 'soc-defconfig-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc git bisect good 5bad490858c3ebdbb47e622e8f9049f828d2abba # good: [70d201a40823acba23899342d62bc2644051ad2e] Merge tag 'f2fs-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs git bisect good 70d201a40823acba23899342d62bc2644051ad2e # bad: [141d9c6e003b806d8faeddeec7053ee2691ea61a] Merge tag 'firewire-updates-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 git bisect bad 141d9c6e003b806d8faeddeec7053ee2691ea61a # bad: [61f4c3e6711477b8a347ca5fe89e5e6613e0a147] Merge tag 'linux-watchdog-6.8-rc1' of git://www.linux-watchdog.org/linux-watchdog git bisect bad 61f4c3e6711477b8a347ca5fe89e5e6613e0a147 # bad: [7912a6391f3ee7eb9f9a69227a209d502679bc0c] Merge tag 'sound-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound git bisect bad 7912a6391f3ee7eb9f9a69227a209d502679bc0c # bad: [cf65598d5909acf5e7b7dc9e21786e386356bc81] Merge tag 'drm-next-2024-01-10' of git://anongit.freedesktop.org/drm/drm git bisect bad cf65598d5909acf5e7b7dc9e21786e386356bc81 # first bad commit: [cf65598d5909acf5e7b7dc9e21786e386356bc81] 
Merge tag 'drm-next-2024-01-10' of git://anongit.freedesktop.org/drm/drm
Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()
On 7/12/23 18:24, Felix Kuehling wrote: > Allocations in the heap and stack tend to be small, with several > allocations sharing the same page. Sharing the same page for different > allocations with different access patterns leads to thrashing when we > migrate data back and forth on GPU and CPU access. To avoid this we > disable HMM migrations for head and stack VMAs. I wonder how well that really works in practice? AFAIK "heaps" (malloc()) today use various arenas obtained by mmap() rather than a single brk()-managed space, and programs might be multithreaded and thus have multiple stacks, while vma_is_stack() will recognize only the initial one... Vlastimil > Regards, > Felix > > > Am 2023-07-12 um 10:42 schrieb Christoph Hellwig: >> On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote: >>> Use the helpers to simplify code. >> Nothing against your addition of a helper, but a GPU driver really >> should have no business even looking at this information.. >> >> >
Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
On 6/22/23 10:53, Qi Zheng wrote: > The shrinker_rwsem is a global read-write lock in > shrinkers subsystem, which protects most operations > such as slab shrink, registration and unregistration > of shrinkers, etc. This can easily cause problems in > the following cases. > > 1) When the memory pressure is high and there are many >filesystems mounted or unmounted at the same time, >slab shrink will be affected (down_read_trylock() >failed). > >Such as the real workload mentioned by Kirill Tkhai: > >``` >One of the real workloads from my experience is start >of an overcommitted node containing many starting >containers after node crash (or many resuming containers >after reboot for kernel update). In these cases memory >pressure is huge, and the node goes round in long reclaim. >``` > > 2) If a shrinker is blocked (such as the case mentioned >in [1]) and a writer comes in (such as mount a fs), >then this writer will be blocked and cause all >subsequent shrinker-related operations to be blocked. > > Even if there is no competitor when shrinking slab, there > may still be a problem. If we have a long shrinker list > and we do not reclaim enough memory with each shrinker, > then the down_read_trylock() may be called with high > frequency. Because of the poor multicore scalability of > atomic operations, this can lead to a significant drop > in IPC (instructions per cycle). > > We used to implement the lockless slab shrink with > SRCU [1], but then kernel test robot reported -88.8% > regression in stress-ng.ramfs.ops_per_sec test case [2], > so we reverted it [3]. > > This commit uses the refcount+RCU method [4] proposed by > by Dave Chinner to re-implement the lockless global slab > shrink. The memcg slab shrink is handled in the subsequent > patch. > > Currently, the shrinker instances can be divided into > the following three types: > > a) global shrinker instance statically defined in the kernel, > such as workingset_shadow_shrinker. > > b) global shrinker instance statically defined in the kernel > modules, such as mmu_shrinker in x86. > > c) shrinker instance embedded in other structures. > > For case a, the memory of shrinker instance is never freed. > For case b, the memory of shrinker instance will be freed > after the module is unloaded. But we will call synchronize_rcu() > in free_module() to wait for RCU read-side critical section to > exit. For case c, the memory of shrinker instance will be > dynamically freed by calling kfree_rcu(). So we can use > rcu_read_{lock,unlock}() to ensure that the shrinker instance > is valid. > > The shrinker::refcount mechanism ensures that the shrinker > instance will not be run again after unregistration. So the > structure that records the pointer of shrinker instance can be > safely freed without waiting for the RCU read-side critical > section. > > In this way, while we implement the lockless slab shrink, we > don't need to be blocked in unregister_shrinker() to wait > RCU read-side critical section. 
> > The following are the test results: > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & > > 1) Before applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo > ops/s >(secs)(secs)(secs) (real time) (usr+sys > time) > ramfs880623 60.02 7.71226.93 14671.45 > 3753.09 > ramfs: > 1 System Management Interrupt > for a 60.03s run time: > 5762.40s available CPU time >7.71s user time ( 0.13%) > 226.93s system time ( 3.94%) > 234.64s total time ( 4.07%) > load average: 8.54 3.06 2.11 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.03s (1 min, 0.03 secs) > > 2) After applying this patchset: > > setting to a 60 second run per stressor > dispatching hogs: 9 ramfs > stressor bogo ops real time usr time sys time bogo ops/s bogo > ops/s >(secs)(secs)(secs) (real time) (usr+sys > time) > ramfs847562 60.02 7.44230.22 14120.66 > 3566.23 > ramfs: > 4 System Management Interrupts > for a 60.12s run time: > 5771.95s available CPU time >7.44s user time ( 0.13%) > 230.22s system time ( 3.99%) > 237.66s total time ( 4.12%) > load average: 8.18 2.43 0.84 > passed: 9: ramfs (9) > failed: 0 > skipped: 0 > successful run completed in 60.12s (1 min, 0.12 secs) > > We can see that the ops/s has hardly changed. > > [1]. > https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/ > [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/ > [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/ > [4]. https://lore.kernel.org/
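For readers of the archive, the refcount + RCU walk described in the cover text boils down to roughly the following (the ->refcount/->done fields and helper shapes are taken from the description above or assumed; this is not the series' exact API):

static unsigned long shrink_slab_lockless(gfp_t gfp_mask, int nid,
                                          struct mem_cgroup *memcg,
                                          int priority)
{
        struct shrinker *shrinker;
        unsigned long freed = 0;

        rcu_read_lock();
        list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };
                unsigned long ret;

                /* Skip shrinkers that are already being unregistered. */
                if (!refcount_inc_not_zero(&shrinker->refcount))
                        continue;

                /* The refcount pins the shrinker, so sleeping is fine. */
                rcu_read_unlock();
                ret = do_shrink_slab(&sc, shrinker, priority);
                if (ret == SHRINK_EMPTY)
                        ret = 0;
                freed += ret;
                rcu_read_lock();

                /*
                 * The last put lets a waiting unregistration free the
                 * shrinker (via RCU), so dropping it inside the read
                 * section keeps the list cursor valid. The ->done
                 * completion is an assumption of this sketch.
                 */
                if (refcount_dec_and_test(&shrinker->refcount))
                        complete(&shrinker->done);
        }
        rcu_read_unlock();

        return freed;
}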
Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file
On 6/22/23 10:53, Qi Zheng wrote: > The mm/vmscan.c file is too large, so separate the shrinker-related > code from it into a separate file. No functional changes. > > Signed-off-by: Qi Zheng Maybe do this move as patch 01, so that the subsequent changes are already done in the new file?
Re: [PATCH 01/29] mm: shrinker: add shrinker::private_data field
On 6/22/23 10:53, Qi Zheng wrote: > To prepare for the dynamic allocation of shrinker instances > embedded in other structures, add a private_data field to > struct shrinker, so that we can use shrinker::private_data > to record and get the original embedded structure. > > Signed-off-by: Qi Zheng I would fold this to 02/29, less churn. > --- > include/linux/shrinker.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 224293b2dd06..43e6fcabbf51 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -70,6 +70,8 @@ struct shrinker { > int seeks; /* seeks to recreate an obj */ > unsigned flags; > > + void *private_data; > + > /* These are for internal use */ > struct list_head list; > #ifdef CONFIG_MEMCG
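To illustrate what the field is for once shrinkers are allocated dynamically later in the series (a hypothetical example; my_cache and the callback wiring are made up, only the private_data field comes from this patch):

struct my_cache {
        unsigned long nr_cached;
        struct shrinker *shrinker;      /* shrinker->private_data == this */
};

static unsigned long my_cache_count(struct shrinker *sh,
                                    struct shrink_control *sc)
{
        /* Recover the embedding structure via the new field. */
        struct my_cache *cache = sh->private_data;

        return READ_ONCE(cache->nr_cached);
}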
Re: [PATCH v2 0/6] drm/gud: Use the shadow plane helper
On 12/1/22 15:16, Konstantin Ryabitsev wrote: On Thu, Dec 01, 2022 at 02:34:41PM +0100, Javier Martinez Canillas wrote: Konstantin, Can you add a rule in b4 to exclude sta...@vger.kernel.org (sta...@kernel.org as well?) from getting the whole patchset? sta...@kernel.org is a pipe to /dev/null so that's not needed to be messed with. As for this needing special casing in b4, it's rare that you send out a patch series and only want 1 or 2 of them in stable, right? Not really, it's very common for a patch-series to contain fixes (that could go to stable if applicable) and change that are not suitable for stable. The problem as Noralf mentioned is that the b4 tool doesn't seem to allow Cc'ing individual patches to different recipients, and everyone get the whole set. So either b4 needs to have this support, exclude sta...@vger.kernel.org when sending a set or sta...@vger.kernel.org ignore patches without a Fixes: tag. I think what I can do is a special logic for Cc: trailers: - Any Cc: trailers we find in the cover letter receive the entire series - Any Cc: trailers in individual patches only receive these individual patches I usually do that with git send-email and a custom --cc-cmd script, but additionally it sends the cover letter also to everyone that's on any individual patch's Cc, so everyone gets at least the cover letter + their specific patch(es). But that extra logic could be optional. Thank you for being patient -- we'll get this right, I promise. -K
Re: [PATCH mm-unstable v1 09/20] mm/gup: reliable R/O long-term pinning in COW mappings
On 11/16/22 11:26, David Hildenbrand wrote: > We already support reliable R/O pinning of anonymous memory. However, > assume we end up pinning (R/O long-term) a pagecache page or the shared > zeropage inside a writable private ("COW") mapping. The next write access > will trigger a write-fault and replace the pinned page by an exclusive > anonymous page in the process page tables to break COW: the pinned page no > longer corresponds to the page mapped into the process' page table. > > Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a > COW mapping, let's properly break COW first before R/O long-term > pinning something that's not an exclusive anon page inside a COW > mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page > instead that can get pinned safely. > > With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable > R/O long-term pinning in COW mappings. > > With this change, the new R/O long-term pinning tests for non-anonymous > memory succeed: > # [RUN] R/O longterm GUP pin ... with shared zeropage > ok 151 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP pin ... with memfd > ok 152 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP pin ... with tmpfile > ok 153 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP pin ... with huge zeropage > ok 154 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) > ok 155 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) > ok 156 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP-fast pin ... with shared zeropage > ok 157 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP-fast pin ... with memfd > ok 158 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP-fast pin ... with tmpfile > ok 159 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP-fast pin ... with huge zeropage > ok 160 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) > ok 161 Longterm R/O pin is reliable > # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB) > ok 162 Longterm R/O pin is reliable > > Note 1: We don't care about short-term R/O-pinning, because they have > snapshot semantics: they are not supposed to observe modifications that > happen after pinning. > > As one example, assume we start direct I/O to read from a page and store > page content into a file: modifications to page content after starting > direct I/O are not guaranteed to end up in the file. So even if we'd pin > the shared zeropage, the end result would be as expected -- getting zeroes > stored to the file. > > Note 2: For shared mappings we'll now always fallback to the slow path to > lookup the VMA when R/O long-term pining. While that's the necessary price > we have to pay right now, it's actually not that bad in practice: most > FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with > FOLL_FORCE because they tried dealing with COW mappings correctly ... > > Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, > such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd > populate exclusive anon pages that we can pin. There was a concern that > this could affect the memlock limit of existing setups. > > For example, a VM running with VFIO could run into the memlock limit and > fail to run. 
However, we essentially had the same behavior already in > commit 17839856fd58 ("gup: document and work around "COW can break either > way" issue") which got merged into some enterprise distros, and there were > not any such complaints. So most probably, we're fine. > > Signed-off-by: David Hildenbrand Reviewed-by: Vlastimil Babka
Re: [PATCH mm-unstable v1 08/20] mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
On 11/16/22 11:26, David Hildenbrand wrote: > Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a > COW (i.e., private writable) mapping and adjust the documentation > accordingly. > > FAULT_FLAG_UNSHARE will now also break COW when encountering the shared > zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by > properly replacing the mapped page/pfn by a private copy (an exclusive > anonymous page). > > Note that only do_wp_page() needs care: hugetlb_wp() already handles > FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it > correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE > such that we can handle FAULT_FLAG_UNSHARE on the PTE level. > > This change is a requirement for reliable long-term R/O pinning in > COW mappings. > > Signed-off-by: David Hildenbrand Reviewed-by: Vlastimil Babka > --- > include/linux/mm_types.h | 8 > mm/memory.c | 4 > 2 files changed, 4 insertions(+), 8 deletions(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 5e7f4fac1e78..5e9aaad8c7b2 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1037,9 +1037,9 @@ typedef struct { > * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. > * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. > * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal > signals. > - * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and > mark > - * exclusive) a possibly shared anonymous page that is > - * mapped R/O. > + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a > + * COW mapping, making sure that an exclusive anon page > is > + * mapped after the fault. > * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached. > *We should only access orig_pte if this flag set. > * > @@ -1064,7 +1064,7 @@ typedef struct { > * > * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal. > * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when > - * no existing R/O-mapped anonymous page is encountered. > + * applied to mappings that are not COW mappings. > */ > enum fault_flag { > FAULT_FLAG_WRITE = 1 << 0, > diff --git a/mm/memory.c b/mm/memory.c > index d47ad33c6487..56b21ab1e4d2 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3432,10 +3432,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > } > wp_page_reuse(vmf); > return 0; > - } else if (unshare) { > - /* No anonymous page -> nothing to do. */ > - pte_unmap_unlock(vmf->pte, vmf->ptl); > - return 0; > } > copy: > /*
Re: [PATCH mm-unstable v1 07/20] mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings
On 11/16/22 11:26, David Hildenbrand wrote: > If we already have a PMD/PUD mapped write-protected in a private mapping > and we want to break COW either due to FAULT_FLAG_WRITE or > FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on > the PTE path. > > Let's just split (->zap) + fallback in that case. > > This is a preparation for more generic FAULT_FLAG_UNSHARE support in > COW mappings. > > Signed-off-by: David Hildenbrand Reviewed-by: Vlastimil Babka Nits: > --- > mm/memory.c | 24 +++- > 1 file changed, 15 insertions(+), 9 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index c35e6cd32b6a..d47ad33c6487 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4802,6 +4802,7 @@ static inline vm_fault_t create_huge_pmd(struct > vm_fault *vmf) > static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) > { > const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; > + vm_fault_t ret; > > if (vma_is_anonymous(vmf->vma)) { > if (likely(!unshare) && > @@ -4809,11 +4810,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault > *vmf) > return handle_userfault(vmf, VM_UFFD_WP); > return do_huge_pmd_wp_page(vmf); > } > - if (vmf->vma->vm_ops->huge_fault) { > - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); > > - if (!(ret & VM_FAULT_FALLBACK)) > - return ret; > + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { > + if (vmf->vma->vm_ops->huge_fault) { I guess it could have been a single if with &&, and with the reduced indentation the 'ret' declaration could have stayed inside. AFAICS the later patches don't build more on top of this anyway. But also fine keeping as is. (same for the hunk below) > + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); > + if (!(ret & VM_FAULT_FALLBACK)) > + return ret; > + } > } > > /* COW or write-notify handled on pte level: split pmd. */ > @@ -4839,14 +4842,17 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, > pud_t orig_pud) > { > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ > defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) > + vm_fault_t ret; > + > /* No support for anonymous transparent PUD pages yet */ > if (vma_is_anonymous(vmf->vma)) > goto split; > - if (vmf->vma->vm_ops->huge_fault) { > - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); > > - if (!(ret & VM_FAULT_FALLBACK)) > - return ret; > + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) { > + if (vmf->vma->vm_ops->huge_fault) { > + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD); > + if (!(ret & VM_FAULT_FALLBACK)) > + return ret; > + } > } > split: > /* COW or write-notify not handled on PUD level: split pud.*/
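In other words, roughly (an illustration of the nit only, not a replacement hunk):

        /* in wp_huge_pmd(), instead of the nested ifs: */
        if ((vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
            vmf->vma->vm_ops->huge_fault) {
                vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);

                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
        }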
Re: [PATCH mm-unstable v1 06/20] mm: rework handling in do_wp_page() based on private vs. shared mappings
On 11/16/22 11:26, David Hildenbrand wrote: > We want to extent FAULT_FLAG_UNSHARE support to anything mapped into a > COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages. > Let's prepare for that by handling shared mappings first such that we can > handle private mappings last. > > While at it, use folio-based functions instead of page-based functions > where we touch the code either way. > > Signed-off-by: David Hildenbrand Reviewed-by: Vlastimil Babka
Re: [PATCH mm-unstable v1 05/20] mm: add early FAULT_FLAG_WRITE consistency checks
On 11/16/22 11:26, David Hildenbrand wrote: > Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to > care in all other handlers and might get "surprises" if we forget to do > so. > > Write faults without VM_MAYWRITE don't make any sense, and our > maybe_mkwrite() logic could have hidden such abuse for now. > > Write faults without VM_WRITE on something that is not a COW mapping is > similarly broken, and e.g., do_wp_page() could end up placing an > anonymous page into a shared mapping, which would be bad. > > This is a preparation for reliable R/O long-term pinning of pages in > private mappings, whereby we want to make sure that we will never break > COW in a read-only private mapping. > > Signed-off-by: David Hildenbrand Reviewed-by: Vlastimil Babka > --- > mm/memory.c | 8 > 1 file changed, 8 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index e014435a87db..c4fa378ec2a0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -5170,6 +5170,14 @@ static vm_fault_t sanitize_fault_flags(struct > vm_area_struct *vma, >*/ > if (!is_cow_mapping(vma->vm_flags)) > *flags &= ~FAULT_FLAG_UNSHARE; > + } else if (*flags & FAULT_FLAG_WRITE) { > + /* Write faults on read-only mappings are impossible ... */ > + if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE))) > + return VM_FAULT_SIGSEGV; > + /* ... and FOLL_FORCE only applies to COW mappings. */ > + if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) && > + !is_cow_mapping(vma->vm_flags))) > + return VM_FAULT_SIGSEGV; > } > return 0; > }
Re: [PATCH mm-unstable v1 04/20] mm: add early FAULT_FLAG_UNSHARE consistency checks
On 11/16/22 11:26, David Hildenbrand wrote: > For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which > implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not > dealing with a COW mapping, such that we treat it like a read fault as > documented and don't have to worry about the flag throughout all fault > handlers. > > While at it, centralize the check for mutual exclusion of > FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that > either flag is set in the WP handler. > > Signed-off-by: David Hildenbrand > --- > mm/huge_memory.c | 3 --- > mm/hugetlb.c | 5 - > mm/memory.c | 23 --- > 3 files changed, 20 insertions(+), 11 deletions(-) Reviewed-by: Vlastimil Babka
Re: [PATCH mm-unstable v1 01/20] selftests/vm: anon_cow: prepare for non-anonymous COW tests
On 11/16/22 11:26, David Hildenbrand wrote: > Originally, the plan was to have a separate tests for testing COW of > non-anonymous (e.g., shared zeropage) pages. > > Turns out, that we'd need a lot of similar functionality and that there > isn't a really good reason to separate it. So let's prepare for non-anon > tests by renaming to "cow". > > Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka
Re: [PATCH v2 0/8] Fix several device private page reference counting issues
On 9/28/22 14:01, Alistair Popple wrote: > This series aims to fix a number of page reference counting issues in > drivers dealing with device private ZONE_DEVICE pages. These result in > use-after-free type bugs, either from accessing a struct page which no > longer exists because it has been removed or accessing fields within the > struct page which are no longer valid because the page has been freed. > > During normal usage it is unlikely these will cause any problems. However > without these fixes it is possible to crash the kernel from userspace. > These crashes can be triggered either by unloading the kernel module or > unbinding the device from the driver prior to a userspace task exiting. In > modules such as Nouveau it is also possible to trigger some of these issues > by explicitly closing the device file-descriptor prior to the task exiting > and then accessing device private memory. Hi, as this series was noticed to create a CVE [1], do you think a stable backport is warranted? I think the "It is possible to launch the attack remotely." in [1] is incorrect though, right? It looks to me that patch 1 would be needed since the CONFIG_DEVICE_PRIVATE introduction, while the following few only to kernels with 27674ef6c73f (probably not so critical as that includes no LTS)? Thanks, Vlastimil [1] https://nvd.nist.gov/vuln/detail/CVE-2022-3523 > This involves some minor changes to both PowerPC and AMD GPU code. > Unfortunately I lack hardware to test either of those so any help there > would be appreciated. The changes mimic what is done in for both Nouveau > and hmm-tests though so I doubt they will cause problems. > > To: Andrew Morton > To: linux...@kvack.org > Cc: linux-ker...@vger.kernel.org > Cc: amd-...@lists.freedesktop.org > Cc: nouv...@lists.freedesktop.org > Cc: dri-devel@lists.freedesktop.org > > Alistair Popple (8): > mm/memory.c: Fix race when faulting a device private page > mm: Free device private pages have zero refcount > mm/memremap.c: Take a pgmap reference on page allocation > mm/migrate_device.c: Refactor migrate_vma and migrate_deivce_coherent_page() > mm/migrate_device.c: Add migrate_device_range() > nouveau/dmem: Refactor nouveau_dmem_fault_copy_one() > nouveau/dmem: Evict device private memory during release > hmm-tests: Add test for migrate_device_range() > > arch/powerpc/kvm/book3s_hv_uvmem.c | 17 +- > drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 19 +- > drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 2 +- > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 11 +- > drivers/gpu/drm/nouveau/nouveau_dmem.c | 108 +++ > include/linux/memremap.h | 1 +- > include/linux/migrate.h | 15 ++- > lib/test_hmm.c | 129 ++--- > lib/test_hmm_uapi.h | 1 +- > mm/memory.c | 16 +- > mm/memremap.c| 30 ++- > mm/migrate.c | 34 +-- > mm/migrate_device.c | 239 +--- > mm/page_alloc.c | 8 +- > tools/testing/selftests/vm/hmm-tests.c | 49 +- > 15 files changed, 516 insertions(+), 163 deletions(-) > > base-commit: 088b8aa537c2c767765f1c19b555f21ffe555786
Re: [PATCH v2 01/16] slab: Remove __malloc attribute from realloc functions
On 9/28/22 19:13, Kees Cook wrote: On Wed, Sep 28, 2022 at 09:26:15AM +0200, Geert Uytterhoeven wrote: Hi Kees, On Fri, Sep 23, 2022 at 10:35 PM Kees Cook wrote: The __malloc attribute should not be applied to "realloc" functions, as the returned pointer may alias the storage of the prior pointer. Instead of splitting __malloc from __alloc_size, which would be a huge amount of churn, just create __realloc_size for the few cases where it is needed. Additionally removes the conditional test for __alloc_size__, which is always defined now. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Vlastimil Babka Cc: Roman Gushchin Cc: Hyeonggon Yoo <42.hye...@gmail.com> Cc: Marco Elver Cc: linux...@kvack.org Signed-off-by: Kees Cook Thanks for your patch, which is now commit 63caa04ec60583b1 ("slab: Remove __malloc attribute from realloc functions") in next-20220927. nore...@ellerman.id.au reported all gcc8-based builds to fail (e.g. [1], more at [2]): In file included from : ./include/linux/percpu.h: In function ‘__alloc_reserved_percpu’: ././include/linux/compiler_types.h:279:30: error: expected declaration specifiers before ‘__alloc_size__’ #define __alloc_size(x, ...) __alloc_size__(x, ## __VA_ARGS__) __malloc ^~ ./include/linux/percpu.h:120:74: note: in expansion of macro ‘__alloc_size’ [...] It's building fine with e.g. gcc-9 (which is my usual m68k cross-compiler). Reverting this commit on next-20220927 fixes the issue. [1] http://kisskb.ellerman.id.au/kisskb/buildresult/14803908/ [2] http://kisskb.ellerman.id.au/kisskb/head/1bd8b75fe6adeaa89d02968bdd811ffe708cf839/ Eek! Thanks for letting me know. I'm confused about this -- __alloc_size__ wasn't optional in compiler_attributes.h -- but obviously I broke something! I'll go figure this out. Even in latest next I can see at the end of include/linux/compiler-gcc.h /* * Prior to 9.1, -Wno-alloc-size-larger-than (and therefore the "alloc_size" * attribute) do not work, and must be disabled. */ #if GCC_VERSION < 90100 #undef __alloc_size__ #endif -Kees
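So presumably compiler_types.h needs to keep a fallback for that case, something like the following (a sketch of the idea, not necessarily the fix that ends up being applied):

/* Fallback for compilers (gcc < 9.1) where __alloc_size__ is #undef'd. */
#ifdef __alloc_size__
# define __alloc_size(x, ...)   __alloc_size__(x, ## __VA_ARGS__) __malloc
# define __realloc_size(x, ...) __alloc_size__(x, ## __VA_ARGS__)
#else
# define __alloc_size(x, ...)   __malloc
# define __realloc_size(x, ...)
#endif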
Re: [PATCH v2 01/16] slab: Remove __malloc attribute from realloc functions
On 9/28/22 09:26, Geert Uytterhoeven wrote: Hi Kees, On Fri, Sep 23, 2022 at 10:35 PM Kees Cook wrote: The __malloc attribute should not be applied to "realloc" functions, as the returned pointer may alias the storage of the prior pointer. Instead of splitting __malloc from __alloc_size, which would be a huge amount of churn, just create __realloc_size for the few cases where it is needed. Additionally removes the conditional test for __alloc_size__, which is always defined now. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Vlastimil Babka Cc: Roman Gushchin Cc: Hyeonggon Yoo <42.hye...@gmail.com> Cc: Marco Elver Cc: linux...@kvack.org Signed-off-by: Kees Cook Thanks for your patch, which is now commit 63caa04ec60583b1 ("slab: Remove __malloc attribute from realloc functions") in next-20220927. nore...@ellerman.id.au reported all gcc8-based builds to fail (e.g. [1], more at [2]): In file included from : ./include/linux/percpu.h: In function ‘__alloc_reserved_percpu’: ././include/linux/compiler_types.h:279:30: error: expected declaration specifiers before ‘__alloc_size__’ #define __alloc_size(x, ...) __alloc_size__(x, ## __VA_ARGS__) __malloc ^~ ./include/linux/percpu.h:120:74: note: in expansion of macro ‘__alloc_size’ [...] It's building fine with e.g. gcc-9 (which is my usual m68k cross-compiler). Reverting this commit on next-20220927 fixes the issue. So IIUC it was wrong to remove the #ifdefs? [1] http://kisskb.ellerman.id.au/kisskb/buildresult/14803908/ [2] http://kisskb.ellerman.id.au/kisskb/head/1bd8b75fe6adeaa89d02968bdd811ffe708cf839/ --- include/linux/compiler_types.h | 13 + include/linux/slab.h | 12 ++-- mm/slab_common.c | 4 ++-- 3 files changed, 13 insertions(+), 16 deletions(-) diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index 4f2a819fd60a..f141a6f6b9f6 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -271,15 +271,12 @@ struct ftrace_likely_data { /* * Any place that could be marked with the "alloc_size" attribute is also - * a place to be marked with the "malloc" attribute. Do this as part of the - * __alloc_size macro to avoid redundant attributes and to avoid missing a - * __malloc marking. + * a place to be marked with the "malloc" attribute, except those that may + * be performing a _reallocation_, as that may alias the existing pointer. + * For these, use __realloc_size(). */ -#ifdef __alloc_size__ -# define __alloc_size(x, ...) __alloc_size__(x, ## __VA_ARGS__) __malloc -#else -# define __alloc_size(x, ...) __malloc -#endif +#define __alloc_size(x, ...) __alloc_size__(x, ## __VA_ARGS__) __malloc +#define __realloc_size(x, ...) __alloc_size__(x, ## __VA_ARGS__) #ifndef asm_volatile_goto #define asm_volatile_goto(x...) 
asm goto(x) diff --git a/include/linux/slab.h b/include/linux/slab.h index 0fefdf528e0d..41bd036e7551 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -184,7 +184,7 @@ int kmem_cache_shrink(struct kmem_cache *s); /* * Common kmalloc functions provided by all allocators */ -void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __alloc_size(2); +void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2); void kfree(const void *objp); void kfree_sensitive(const void *objp); size_t __ksize(const void *objp); @@ -647,10 +647,10 @@ static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_ * @new_size: new size of a single member of the array * @flags: the type of memory to allocate (see kmalloc) */ -static inline __alloc_size(2, 3) void * __must_check krealloc_array(void *p, - size_t new_n, - size_t new_size, - gfp_t flags) +static inline __realloc_size(2, 3) void * __must_check krealloc_array(void *p, + size_t new_n, + size_t new_size, + gfp_t flags) { size_t bytes; @@ -774,7 +774,7 @@ static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t fla } extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags) - __alloc_size(3); + __realloc_size(3); extern void kvfree(const void *addr); extern void kvfree_sensitive(const void *addr, size_t len); diff --git a/mm/slab_common.c b/mm/slab_common.c index 1799
Re: [PATCH v2 13/16] mempool: Use kmalloc_size_roundup() to match ksize() usage
On 9/23/22 22:28, Kees Cook wrote: Round up allocations with kmalloc_size_roundup() so that mempool's use of ksize() is always accurate and no special handling of the memory is needed by KASAN, UBSAN_BOUNDS, nor FORTIFY_SOURCE. Cc: Andrew Morton Cc: linux...@kvack.org Signed-off-by: Kees Cook --- mm/mempool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mempool.c b/mm/mempool.c index 96488b13a1ef..0f3107b28e6b 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -526,7 +526,7 @@ EXPORT_SYMBOL(mempool_free_slab); */ void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data) { - size_t size = (size_t)pool_data; + size_t size = kmalloc_size_roundup((size_t)pool_data); Hm, it is kinda wasteful to call into kmalloc_size_roundup() for every allocation when the input is always the same. We could do it just once in mempool_init_node() by adjusting pool->pool_data? But looking more closely, I wonder why poison_element() and kasan_unpoison_element() in mm/mempool.c even have to use ksize()/__ksize() and not just operate on the requested size (again, pool->pool_data). If no users of kmalloc mempools use ksize() to write beyond the requested size, then we don't have to unpoison/poison that area either? return kmalloc(size, gfp_mask); } EXPORT_SYMBOL(mempool_kmalloc);
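In other words, something like this at init time (a sketch; the helper and where exactly mempool_init_node() would call it are assumptions):

static void mempool_kmalloc_round_up(mempool_t *pool)
{
        /*
         * For kmalloc-backed pools, pool_data carries the object size, so
         * rounding it up once here lets mempool_kmalloc() and the element
         * poisoning helpers see the bucket size without calling
         * kmalloc_size_roundup() on every allocation.
         */
        if (pool->alloc == mempool_kmalloc)
                pool->pool_data =
                        (void *)kmalloc_size_roundup((size_t)pool->pool_data);
}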
Re: [PATCH v2 02/16] slab: Introduce kmalloc_size_roundup()
On 9/23/22 22:28, Kees Cook wrote: In the effort to help the compiler reason about buffer sizes, the __alloc_size attribute was added to allocators. This improves the scope of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well, as the vast majority of callers are not expecting to use more memory than what they asked for. There is, however, one common exception to this: anticipatory resizing of kmalloc allocations. These cases all use ksize() to determine the actual bucket size of a given allocation (e.g. 128 when 126 was asked for). This comes in two styles in the kernel: 1) An allocation has been determined to be too small, and needs to be resized. Instead of the caller choosing its own next best size, it wants to minimize the number of calls to krealloc(), so it just uses ksize() plus some additional bytes, forcing the realloc into the next bucket size, from which it can learn how large it is now. For example: data = krealloc(data, ksize(data) + 1, gfp); data_len = ksize(data); 2) The minimum size of an allocation is calculated, but since it may grow in the future, just use all the space available in the chosen bucket immediately, to avoid needing to reallocate later. A good example of this is skbuff's allocators: data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); ... /* kmalloc(size) might give us more room than requested. * Put skb_shared_info exactly at the end of allocated zone, * to allow max possible filling before reallocation. */ osize = ksize(data); size = SKB_WITH_OVERHEAD(osize); In both cases, the "how much was actually allocated?" question is answered _after_ the allocation, where the compiler hinting is not in an easy place to make the association any more. This mismatch between the compiler's view of the buffer length and the code's intention about how much it is going to actually use has already caused problems[1]. It is possible to fix this by reordering the use of the "actual size" information. We can serve the needs of users of ksize() and still have accurate buffer length hinting for the compiler by doing the bucket size calculation _before_ the allocation. Code can instead ask "how large an allocation would I get for a given size?". Introduce kmalloc_size_roundup(), to serve this function so we can start replacing the "anticipatory resizing" uses of ksize(). [1] https://github.com/ClangBuiltLinux/linux/issues/1599 https://github.com/KSPP/linux/issues/183 Cc: Vlastimil Babka Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: linux...@kvack.org Signed-off-by: Kees Cook OK, added patch 1+2 to slab.git for-next branch. Had to adjust this one a bit, see below. --- include/linux/slab.h | 31 +++ mm/slab.c| 9 ++--- mm/slab_common.c | 20 3 files changed, 57 insertions(+), 3 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 41bd036e7551..727640173568 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -188,7 +188,21 @@ void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __r void kfree(const void *objp); void kfree_sensitive(const void *objp); size_t __ksize(const void *objp); + +/** + * ksize - Report actual allocation size of associated object + * + * @objp: Pointer returned from a prior kmalloc()-family allocation. + * + * This should not be used for writing beyond the originally requested + * allocation size. 
Either use krealloc() or round up the allocation size + * with kmalloc_size_roundup() prior to allocation. If this is used to + * access beyond the originally requested allocation size, UBSAN_BOUNDS + * and/or FORTIFY_SOURCE may trip, since they only know about the + * originally allocated size via the __alloc_size attribute. + */ size_t ksize(const void *objp); + #ifdef CONFIG_PRINTK bool kmem_valid_obj(void *object); void kmem_dump_obj(void *object); @@ -779,6 +793,23 @@ extern void kvfree(const void *addr); extern void kvfree_sensitive(const void *addr, size_t len); unsigned int kmem_cache_size(struct kmem_cache *s); + +/** + * kmalloc_size_roundup - Report allocation bucket size for the given size + * + * @size: Number of bytes to round up from. + * + * This returns the number of bytes that would be available in a kmalloc() + * allocation of @size bytes. For example, a 126 byte request would be + * rounded up to the next sized kmalloc bucket, 128 bytes. (This is strictly + * for the general-purpose kmalloc()-based allocations, and is not for the + * pre-sized kmem_cache_alloc()-based allocations.) + * + * Use this to kmalloc() the full bucket size ahead of time instead of using
Re: [PATCH 00/12] slab: Introduce kmalloc_size_roundup()
On 9/22/22 23:49, Kees Cook wrote: > On Thu, Sep 22, 2022 at 11:05:47PM +0200, Vlastimil Babka wrote: >> On 9/22/22 17:55, Kees Cook wrote: >> > On Thu, Sep 22, 2022 at 09:10:56AM +0200, Christian König wrote: >> > [...] >> > > So when this patch set is about to clean up this use case it should >> > > probably >> > > also take care to remove ksize() or at least limit it so that it won't be >> > > used for this use case in the future. >> > >> > Yeah, my goal would be to eliminate ksize(), and it seems possible if >> > other cases are satisfied with tracking their allocation sizes directly. >> >> I think we could leave ksize() to determine the size without a need for >> external tracking, but from now on forbid callers from using that hint to >> overflow the allocation size they actually requested? Once we remove the >> kasan/kfence hooks in ksize() that make the current kinds of usage possible, >> we should be able to catch any offenders of the new semantics that would >> appear? > > That's correct. I spent the morning working my way through the rest of > the ksize() users I didn't clean up yesterday, and in several places I > just swapped in __ksize(). But that wouldn't even be needed if we just > removed the kasan unpoisoning from ksize(), etc. > > I am tempted to leave it __ksize(), though, just to reinforce that it's > not supposed to be used "normally". What do you think? Sounds good. Note in linux-next there's now a series in slab.git planned for 6.1 that moves __ksize() declaration to mm/slab.h to make it more private. But we don't want random users outside mm and related kasan/kfence subsystems to include mm/slab.h, so we'll have to expose it again instead of ksize().
Re: [PATCH 00/12] slab: Introduce kmalloc_size_roundup()
On 9/22/22 17:55, Kees Cook wrote: > On Thu, Sep 22, 2022 at 09:10:56AM +0200, Christian König wrote: >> Am 22.09.22 um 05:10 schrieb Kees Cook: >> > Hi, >> > >> > This series fixes up the cases where callers of ksize() use it to >> > opportunistically grow their buffer sizes, which can run afoul of the >> > __alloc_size hinting that CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE >> > use to perform dynamic buffer bounds checking. >> >> Good cleanup, but one question: What other use cases we have for ksize() >> except the opportunistically growth of buffers? > > The remaining cases all seem to be using it as a "do we need to resize > yet?" check, where they don't actually track the allocation size > themselves and want to just depend on the slab cache to answer it. This > is most clearly seen in the igp code: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/intel/igb/igb_main.c?h=v6.0-rc6#n1204 > > My "solution" there kind of side-steps it, and leaves ksize() as-is: > https://lore.kernel.org/linux-hardening/20220922031013.2150682-8-keesc...@chromium.org/ > > The more correct solution would be to add per-v_idx size tracking, > similar to the other changes I sent: > https://lore.kernel.org/linux-hardening/20220922031013.2150682-11-keesc...@chromium.org/ > > I wonder if perhaps I should just migrate some of this code to using > something like struct membuf. > >> Off hand I can't see any. >> >> So when this patch set is about to clean up this use case it should probably >> also take care to remove ksize() or at least limit it so that it won't be >> used for this use case in the future. > > Yeah, my goal would be to eliminate ksize(), and it seems possible if > other cases are satisfied with tracking their allocation sizes directly. I think we could leave ksize() to determine the size without a need for external tracking, but from now on forbid callers from using that hint to overflow the allocation size they actually requested? Once we remove the kasan/kfence hooks in ksize() that make the current kinds of usage possible, we should be able to catch any offenders of the new semantics that would appear? > -Kees >
Re: [PATCH 3/3] mm/mempool: use might_alloc()
On 6/5/22 17:25, Daniel Vetter wrote: > mempool are generally used for GFP_NOIO, so this wont benefit all that > much because might_alloc currently only checks GFP_NOFS. But it does > validate against mmu notifier pte zapping, some might catch some > drivers doing really silly things, plus it's a bit more meaningful in > what we're checking for here. > > Signed-off-by: Daniel Vetter > Cc: Andrew Morton > Cc: linux...@kvack.org Reviewed-by: Vlastimil Babka > --- > mm/mempool.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/mempool.c b/mm/mempool.c > index b933d0fc21b8..96488b13a1ef 100644 > --- a/mm/mempool.c > +++ b/mm/mempool.c > @@ -379,7 +379,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) > gfp_t gfp_temp; > > VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO); > - might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM); > + might_alloc(gfp_mask); > > gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */ > gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
Re: [PATCH 1/3] mm/page_alloc: use might_alloc()
On 6/5/22 17:25, Daniel Vetter wrote: > ... instead of open codding it. Completely equivalent code, just > a notch more meaningful when reading. > > Signed-off-by: Daniel Vetter > Cc: Andrew Morton > Cc: linux...@kvack.org Reviewed-by: Vlastimil Babka > --- > mm/page_alloc.c | 5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2db95780e003..24d170cb 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5177,10 +5177,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, > unsigned int order, > *alloc_flags |= ALLOC_CPUSET; > } > > - fs_reclaim_acquire(gfp_mask); > - fs_reclaim_release(gfp_mask); > - > - might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM); > + might_alloc(gfp_mask); > > if (should_fail_alloc_page(gfp_mask, order)) > return false;
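For reference, might_alloc() is (per include/linux/sched/mm.h) essentially the open-coded sequence being removed here, with gfpflags_allow_blocking() being the __GFP_DIRECT_RECLAIM test, which is why the change is behavior-neutral:

static inline void might_alloc(gfp_t gfp_mask)
{
        fs_reclaim_acquire(gfp_mask);
        fs_reclaim_release(gfp_mask);

        might_sleep_if(gfpflags_allow_blocking(gfp_mask));
}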
Re: [PATCH 2/3] mm/slab: delete cache_alloc_debugcheck_before()
On 6/5/22 17:25, Daniel Vetter wrote: > It only does a might_sleep_if(GFP_RECLAIM) check, which is already > covered by the might_alloc() in slab_pre_alloc_hook(). And all callers > of cache_alloc_debugcheck_before() call that beforehand already. > > Signed-off-by: Daniel Vetter > Cc: Christoph Lameter > Cc: Pekka Enberg > Cc: David Rientjes > Cc: Joonsoo Kim > Cc: Andrew Morton > Cc: Vlastimil Babka > Cc: Roman Gushchin > Cc: linux...@kvack.org Thanks, added to slab/for-5.20/cleanup as it's slab-specific and independent from 1/3 and 3/3. > --- > mm/slab.c | 10 -- > 1 file changed, 10 deletions(-) > > diff --git a/mm/slab.c b/mm/slab.c > index b04e40078bdf..75779ac5f5ba 100644 > --- a/mm/slab.c > +++ b/mm/slab.c > @@ -2973,12 +2973,6 @@ static void *cache_alloc_refill(struct kmem_cache > *cachep, gfp_t flags) > return ac->entry[--ac->avail]; > } > > -static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep, > - gfp_t flags) > -{ > - might_sleep_if(gfpflags_allow_blocking(flags)); > -} > - > #if DEBUG > static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep, > gfp_t flags, void *objp, unsigned long caller) > @@ -3219,7 +3213,6 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, > int nodeid, size_t orig_ > if (unlikely(ptr)) > goto out_hooks; > > - cache_alloc_debugcheck_before(cachep, flags); > local_irq_save(save_flags); > > if (nodeid == NUMA_NO_NODE) > @@ -3304,7 +3297,6 @@ slab_alloc(struct kmem_cache *cachep, struct list_lru > *lru, gfp_t flags, > if (unlikely(objp)) > goto out; > > - cache_alloc_debugcheck_before(cachep, flags); > local_irq_save(save_flags); > objp = __do_cache_alloc(cachep, flags); > local_irq_restore(save_flags); > @@ -3541,8 +3533,6 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t > flags, size_t size, > if (!s) > return 0; > > - cache_alloc_debugcheck_before(s, flags); > - > local_irq_disable(); > for (i = 0; i < size; i++) { > void *objp = kfence_alloc(s, s->object_size, flags) ?: > __do_cache_alloc(s, flags);
Re: [PATCH] lib/stackdepot: always do filter_irq_stacks() in stack_depot_save()
On 11/30/21 10:57, Marco Elver wrote: > The non-interrupt portion of interrupt stack traces before interrupt > entry is usually arbitrary. Therefore, saving stack traces of interrupts > (that include entries before interrupt entry) to stack depot leads to > unbounded stackdepot growth. > > As such, use of filter_irq_stacks() is a requirement to ensure > stackdepot can efficiently deduplicate interrupt stacks. > > Looking through all current users of stack_depot_save(), none (except > KASAN) pass the stack trace through filter_irq_stacks() before passing > it on to stack_depot_save(). > > Rather than adding filter_irq_stacks() to all current users of > stack_depot_save(), it became clear that stack_depot_save() should > simply do filter_irq_stacks(). Agree. > Signed-off-by: Marco Elver Acked-by: Vlastimil Babka Thanks. > --- > lib/stackdepot.c | 13 + > mm/kasan/common.c | 1 - > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/lib/stackdepot.c b/lib/stackdepot.c > index b437ae79aca1..519c7898c7f2 100644 > --- a/lib/stackdepot.c > +++ b/lib/stackdepot.c > @@ -305,6 +305,9 @@ EXPORT_SYMBOL_GPL(stack_depot_fetch); > * (allocates using GFP flags of @alloc_flags). If @can_alloc is %false, > avoids > * any allocations and will fail if no space is left to store the stack > trace. > * > + * If the stack trace in @entries is from an interrupt, only the portion up > to > + * interrupt entry is saved. > + * > * Context: Any context, but setting @can_alloc to %false is required if > * alloc_pages() cannot be used from the current context. Currently > * this is the case from contexts where neither %GFP_ATOMIC nor > @@ -323,6 +326,16 @@ depot_stack_handle_t __stack_depot_save(unsigned long > *entries, > unsigned long flags; > u32 hash; > > + /* > + * If this stack trace is from an interrupt, including anything before > + * interrupt entry usually leads to unbounded stackdepot growth. > + * > + * Because use of filter_irq_stacks() is a requirement to ensure > + * stackdepot can efficiently deduplicate interrupt stacks, always > + * filter_irq_stacks() to simplify all callers' use of stackdepot. > + */ > + nr_entries = filter_irq_stacks(entries, nr_entries); > + > if (unlikely(nr_entries == 0) || stack_depot_disable) > goto fast_exit; > > diff --git a/mm/kasan/common.c b/mm/kasan/common.c > index 8428da2aaf17..efaa836e5132 100644 > --- a/mm/kasan/common.c > +++ b/mm/kasan/common.c > @@ -36,7 +36,6 @@ depot_stack_handle_t kasan_save_stack(gfp_t flags, bool > can_alloc) > unsigned int nr_entries; > > nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0); > - nr_entries = filter_irq_stacks(entries, nr_entries); > return __stack_depot_save(entries, nr_entries, flags, can_alloc); > } > >
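For illustration, a typical stack depot user after this change looks like the sketch below (a hypothetical caller, modelled on the KASAN hunk above); no caller-side filter_irq_stacks() is needed because __stack_depot_save() now truncates interrupt traces at interrupt entry itself:

#include <linux/kernel.h>
#include <linux/stackdepot.h>
#include <linux/stacktrace.h>

/* Hypothetical stack depot user after this change. */
static depot_stack_handle_t my_save_stack(gfp_t flags)
{
        unsigned long entries[64];
        unsigned int nr_entries;

        nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
        return stack_depot_save(entries, nr_entries, flags);
}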
Re: [next] [dragonboard 410c] Unable to handle kernel paging request at virtual address 00000000007c4240
On 10/22/21 05:38, Andrew Morton wrote: > On Thu, 21 Oct 2021 19:51:20 +0200 Vlastimil Babka wrote: > >> >> Then we have to figure out how to order a fix between DRM and mmotm... >> > >> > That is the question! The problem exists only in the merge of the >> > two. On current DRM side stack_depot_init() exists but it's __init and >> > does not look safe to call multiple times. And obviously my changes >> > don't exist at all in mmotm. >> > >> > I guess one (admittedly hackish) option is to first add a patch in >> > drm-next (or drm-misc-next) that makes it safe to call >> > stack_depot_init() multiple times in non-init context. It would be >> > dropped in favour of your changes once the trees get merged together. >> > >> > Or is there some way for __drm_stack_depot_init() to detect whether it >> > should call stack_depot_init() or not, i.e. whether your changes are >> > there or not? >> >> Let's try the easiest approach first. AFAIK mmotm series is now split to >> pre-next and post-next part > > It has been this way for many years! Aha, great. Looks like a few months ago I misinterpreted the thread about adding the folio tree to -next. >> and moving my patch >> lib-stackdepot-allow-optional-init-and-stack_table-allocation-by-kvmalloc.patch >> with the following fixup to the post-next part should solve this. Would that >> work, Andrew? Thanks. > > For this reason. No probs, thanks. Thanks! > I merge up the post-linux-next parts late in the merge window. I do > need to manually check that the prerequisites are in mainline, because > sometimes the patches apply OK but don't make sense. >
Re: [next] [dragonboard 410c] Unable to handle kernel paging request at virtual address 00000000007c4240
On 10/21/21 10:40, Jani Nikula wrote: > On Thu, 21 Oct 2021, Vlastimil Babka wrote: >> This one seems a bit more tricky and I could really use some advice. >> cd06ab2fd48f adds stackdepot usage to drm_modeset_lock which itself has a >> number of different users and requiring those to call stack_depot_init() >> would be likely error prone. Would it be ok to add the call of >> stack_depot_init() (guarded by #ifdef CONFIG_DRM_DEBUG_MODESET_LOCK) to >> drm_modeset_lock_init()? It will do a mutex_lock()/unlock(), and kvmalloc() >> on first call. >> I don't know how much of hotpath this is, but hopefully should be acceptable >> in debug config. Or do you have better suggestion? Thanks. > > I think that should be fine. > > Maybe add __drm_stack_depot_init() in the existing #if > IS_ENABLED(CONFIG_DRM_DEBUG_MODESET_LOCK), similar to the other > __drm_stack_depot_*() functions, with an empty stub for > CONFIG_DRM_DEBUG_MODESET_LOCK=n, and call it unconditionally in > drm_modeset_lock_init(). Good idea. >> Then we have to figure out how to order a fix between DRM and mmotm... > > That is the question! The problem exists only in the merge of the > two. On current DRM side stack_depot_init() exists but it's __init and > does not look safe to call multiple times. And obviously my changes > don't exist at all in mmotm. > > I guess one (admittedly hackish) option is to first add a patch in > drm-next (or drm-misc-next) that makes it safe to call > stack_depot_init() multiple times in non-init context. It would be > dropped in favour of your changes once the trees get merged together. > > Or is there some way for __drm_stack_depot_init() to detect whether it > should call stack_depot_init() or not, i.e. whether your changes are > there or not? Let's try the easiest approach first. AFAIK mmotm series is now split to pre-next and post-next part and moving my patch lib-stackdepot-allow-optional-init-and-stack_table-allocation-by-kvmalloc.patch with the following fixup to the post-next part should solve this. Would that work, Andrew? Thanks. 8< >From 719e91df5571034b62fa992f6738b00f8d29020e Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 21 Oct 2021 19:43:33 +0200 Subject: [PATCH] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3 Due to cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") landing recently to -next adding a new stack depot user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate call to stack_depot_init() there as well. 
Signed-off-by: Vlastimil Babka --- drivers/gpu/drm/drm_modeset_lock.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/drm_modeset_lock.c b/drivers/gpu/drm/drm_modeset_lock.c index c97323365675..918065982db4 100644 --- a/drivers/gpu/drm/drm_modeset_lock.c +++ b/drivers/gpu/drm/drm_modeset_lock.c @@ -107,6 +107,11 @@ static void __drm_stack_depot_print(depot_stack_handle_t stack_depot) kfree(buf); } + +static void __drm_stack_depot_init(void) +{ + stack_depot_init(); +} #else /* CONFIG_DRM_DEBUG_MODESET_LOCK */ static depot_stack_handle_t __drm_stack_depot_save(void) { @@ -115,6 +120,9 @@ static depot_stack_handle_t __drm_stack_depot_save(void) static void __drm_stack_depot_print(depot_stack_handle_t stack_depot) { } +static void __drm_stack_depot_init(void) +{ +} #endif /* CONFIG_DRM_DEBUG_MODESET_LOCK */ /** @@ -359,6 +367,7 @@ void drm_modeset_lock_init(struct drm_modeset_lock *lock) { ww_mutex_init(&lock->mutex, &crtc_ww_class); INIT_LIST_HEAD(&lock->head); + __drm_stack_depot_init(); } EXPORT_SYMBOL(drm_modeset_lock_init); -- 2.33.0
Re: [next] [dragonboard 410c] Unable to handle kernel paging request at virtual address 00000000007c4240
On 10/20/21 20:24, Naresh Kamboju wrote: > Following kernel crash noticed on linux next 20211020 tag. > while booting on arm64 architecture dragonboard 410c device. > > I see the following config is enabled in 20211020 tag builds. > CONFIG_STACKDEPOT=y > > Crash log, > [ 18.583097] Unable to handle kernel paging request at virtual > address 007c4240 > [ 18.583521] Mem abort info: > [ 18.590286] ESR = 0x9604 > [ 18.592920] EC = 0x25: DABT (current EL), IL = 32 bits > [ 18.596103] SET = 0, FnV = 0 > [ 18.601512] EA = 0, S1PTW = 0 > [ 18.604384] FSC = 0x04: level 0 translation fault > [ 18.607447] Data abort info: > [ 18.612296] ISV = 0, ISS = 0x0004 > [ 18.615451] CM = 0, WnR = 0 > [ 18.618990] user pgtable: 4k pages, 48-bit VAs, pgdp=8b4c7000 > [ 18.622054] [007c4240] pgd=, p4d= > [ 18.628974] Internal error: Oops: 9604 [#1] SMP > [ 18.635073] Modules linked in: adv7511 cec snd_soc_lpass_apq8016 > snd_soc_lpass_cpu snd_soc_lpass_platform snd_soc_msm8916_digital > qcom_camss qrtr snd_soc_apq8016_sbc videobuf2_dma_sg qcom_pon > qcom_spmi_vadc snd_soc_qcom_common qcom_q6v5_mss qcom_vadc_common > rtc_pm8xxx qcom_spmi_temp_alarm msm qcom_pil_info v4l2_fwnode > qcom_q6v5 snd_soc_msm8916_analog qcom_sysmon qcom_common v4l2_async > qnoc_msm8916 qcom_rng gpu_sched qcom_glink_smem venus_core > videobuf2_memops icc_smd_rpm qmi_helpers drm_kms_helper v4l2_mem2mem > mdt_loader display_connector i2c_qcom_cci videobuf2_v4l2 crct10dif_ce > videobuf2_common socinfo drm rmtfs_mem fuse > [ 18.672948] CPU: 0 PID: 178 Comm: kworker/u8:3 Not tainted > 5.15.0-rc6-next-20211020 #1 > [ 18.695000] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) > [ 18.695012] Workqueue: events_unbound deferred_probe_work_func > [ 18.695033] pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 18.715282] pc : __stack_depot_save+0x13c/0x4e0 > [ 18.722130] lr : stack_depot_save+0x14/0x20 > [ 18.726641] sp : 800014a23500 > [ 18.730801] x29: 800014a23500 x28: 000f8848 x27: > 800013acdf68 > [ 18.734294] x26: x25: 007c4240 x24: > 800014a23780 > [ 18.741413] x23: 0008 x22: 800014a235b8 x21: > 0008 > [ 18.748530] x20: c32f8848 x19: 1038cc18 x18: > > [ 18.755649] x17: 80002d9f8000 x16: 800010004000 x15: > c426 > [ 18.762767] x14: x13: 800014a23780 x12: > > [ 18.769885] x11: 1038cc80 x10: 8000136e9ba0 x9 : > 800014a235f4 > [ 18.777003] x8 : 0001 x7 : b664620b x6 : > 11a58b4a > [ 18.784121] x5 : 1aa43464 x4 : 9e7d8b67 x3 : > 0001 > [ 18.791239] x2 : 2800 x1 : 800013acd000 x0 : > f2d429d8 > [ 18.798358] Call trace: > [ 18.805451] __stack_depot_save+0x13c/0x4e0 > [ 18.807716] stack_depot_save+0x14/0x20 > [ 18.811881] __drm_stack_depot_save+0x44/0x70 [drm] > [ 18.815710] modeset_lock.part.0+0xe0/0x1a4 [drm] > [ 18.820571] drm_modeset_lock_all_ctx+0x2d4/0x334 [drm] This stack_depot_save path appears to be new from Jani's commit cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") And there's a semantic conflict with my patch in mmotm: - sha1 (valid only in next-20211020) 5e6d063de5cd ("lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()") - lore: https://lore.kernel.org/all/20211013073005.11351-1-vba...@suse.cz/ - patchwork: https://patchwork.freedesktop.org/series/95549/#rev3 With my patch, to-be callers of stack_depot_save() need to call stack_depot_init() at least once, to avoid unnecessary runtime overhead otherwise I have added that calls into three DRM contexts in my patch, but didn't see cd06ab2fd48f yet at the time. 
This one seems a bit more tricky and I could really use some advice. cd06ab2fd48f adds stackdepot usage to drm_modeset_lock which itself has a number of different users and requiring those to call stack_depot_init() would be likely error prone. Would it be ok to add the call of stack_depot_init() (guarded by #ifdef CONFIG_DRM_DEBUG_MODESET_LOCK) to drm_modeset_lock_init()? It will do a mutex_lock()/unlock(), and kvmalloc() on first call. I don't know how much of hotpath this is, but hopefully should be acceptable in debug config. Or do you have better suggestion? Thanks. Then we have to figure out how to order a fix between DRM and mmotm... > [ 18.825435] drm_client_firmware_config.constprop.0.isra.0+0xc0/0x5d0 [drm] > [ 18.830478] drm_client_modeset_probe+0x328/0xbb0 [drm] > [ 18.837413] __drm_fb_helper_initial_config_and_unlock+0x54/0x5b4 > [drm_kms_helper] > [ 18.842633] drm_fb_helper_initial_config+0x5c/0x70 [drm_kms_helper] > [ 18.850266] msm_fbdev_init+0x98/0x100 [msm] > [ 18.856767] msm_drm_bind+0x650
Re: [PATCH v3] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()
On 10/13/21 09:30, Vlastimil Babka wrote: > Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated > from memblock, even if stack depot ends up not actually used. The default size > of stack_table is 4MB on 32-bit, 8MB on 64-bit. > > This is fine for use-cases such as KASAN which is also a config option and > has overhead on its own. But it's an issue for functionality that has to be > actually enabled on boot (page_owner) or depends on hardware (GPU drivers) > and thus the memory might be wasted. This was raised as an issue [1] when > attempting to add stackdepot support for SLUB's debug object tracking > functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable > slub_debug on boot only when needed, or create only specific kmem caches with > debugging for testing purposes. > > It would thus be more efficient if stackdepot's table was allocated only when > actually going to be used. This patch thus makes the allocation (and whole > stack_depot_init() call) optional: > > - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current > well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN > select this flag. > - Other users have to call stack_depot_init() as part of their own init when > it's determined that stack depot will actually be used. This may depend on > both config and runtime conditions. Convert current users which are > page_owner and several in the DRM subsystem. Same will be done for SLUB > later. > - Because the init might now be called after the boot-time memblock allocation > has given all memory to the buddy allocator, change stack_depot_init() to > allocate stack_table with kvmalloc() when memblock is no longer available. > Also handle allocation failure by disabling stackdepot (could have > theoretically happened even with memblock allocation previously), and don't > unnecessarily align the memblock allocation to its own size anymore. > > [1] > https://lore.kernel.org/all/CAMuHMdW=eovzm1re5fvoen87nkfilmm2+ah7enu2kxehcvb...@mail.gmail.com/ ... > --- > Changes in v3: > - stack_depot_init_mutex made static and moved inside stack_depot_init() > Reported-by: kernel test robot > - use !stack_table condition instead of stack_table == NULL > reported by checkpatch on freedesktop.org patchwork The last change above was missing because I forgot git commit --amend before git format-patch. More importantly there was a bot report for FLATMEM. Please add this fixup. Thanks. 8< >From a971a1670491f8fbbaab579eef3c756a5263af95 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 7 Oct 2021 10:49:09 +0200 Subject: [PATCH] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup On FLATMEM, we call page_ext_init_flatmem_late() just before kmem_cache_init() which means stack_depot_init() (called by page owner init) will not recognize properly it should use kvmalloc() and not memblock_alloc(). memblock_alloc() will also not issue a warning and return a block memory that can be invalid and cause kernel page fault when saving stacks, as reported by the kernel test robot [1]. Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so that slab_is_available() is true during stack_depot_init(). SPARSEMEM doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(), but a different page_ext_init() even later in the boot process. Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue. 
While at it, also actually resolve a checkpatch warning in stack_depot_init() from DRM CI, which was supposed to be in the original patch already. [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/ Signed-off-by: Vlastimil Babka Reported-by: kernel test robot --- init/main.c | 7 +-- lib/stackdepot.c | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/init/main.c b/init/main.c index ca2765c8e45c..0ab632f681c5 100644 --- a/init/main.c +++ b/init/main.c @@ -845,9 +845,12 @@ static void __init mm_init(void) stack_depot_early_init(); mem_init(); mem_init_print_info(); - /* page_owner must be initialized after buddy is ready */ - page_ext_init_flatmem_late(); kmem_cache_init(); + /* +* page_owner must be initialized after buddy is ready, and also after +* slab is ready so that stack_depot_init() works properly +*/ + page_ext_init_flatmem_late(); kmemleak_init(); pgtable_init(); debug_objects_mem_init(); diff --git a/lib/stackdepot.c b/lib/stackdepot.c index 049d7d025d78..1f8ea6d0899b 100644 --- a/lib/stackdepot.c +++ b/lib/stackdepot.c @@ -172,7 +172,7 @@ __ref int stack_depot_init(void) static DEFINE_MUT
Re: [lib/stackdepot] 1cd8ce52c5: BUG:unable_to_handle_page_fault_for_address
On 10/14/21 12:16, Mike Rapoport wrote: > On Thu, Oct 14, 2021 at 11:33:03AM +0200, Vlastimil Babka wrote: >> On 10/14/21 10:54, kernel test robot wrote: >> >> In my local testing of the patch, when stackdepot was initialized through >> page owner init, it was using kvmalloc() so slab_is_available() was true. >> Looks like the exact order of slab vs page_owner alloc is non-deterministic, >> could be arch-dependent or just random ordering of init calls. A wrong order >> will exploit the apparent fact that slab_is_available() is not a good >> indicator of using memblock vs page allocator, and we would need a better >> one. >> Thoughts? > > The order of slab vs page_owner is deterministic, but it is different for > FLATMEM and SPARSEMEM. And page_ext_init_flatmem_late() that initializes > page_ext for FLATMEM is called exactly between buddy and slab setup: Oh, so it was due to FLATMEM, thanks for figuring that out! > static void __init mm_init(void) > { > ... > > mem_init(); > mem_init_print_info(); > /* page_owner must be initialized after buddy is ready */ > page_ext_init_flatmem_late(); > kmem_cache_init(); > > ... > } > > I've stared for a while at page_ext init and it seems that the > page_ext_init_flatmem_late() can be simply dropped because there is anyway > a call to invoke_init_callbacks() in page_ext_init() that is called much > later in the boot process. Yeah, but page_ext_init() only does something for SPARSEMEM, and is empty on FLATMEM. Otherwise it would be duplicating all the work. So I'll just move page_ext_init_flatmem_late() below kmem_cache_init() in mm_init(). Thanks again!
Re: [lib/stackdepot] 1cd8ce52c5: BUG:unable_to_handle_page_fault_for_address
On 10/14/21 10:54, kernel test robot wrote: > > > Greeting, > > FYI, we noticed the following commit (built with gcc-9): > > commit: 1cd8ce52c520c26c513899fb5aee42b8e5f60d0d ("[PATCH v2] lib/stackdepot: > allow optional init and stack_table allocation by kvmalloc()") > url: > https://github.com/0day-ci/linux/commits/Vlastimil-Babka/lib-stackdepot-allow-optional-init-and-stack_table-allocation-by-kvmalloc/20211012-170816 > base: git://anongit.freedesktop.org/drm-intel for-linux-next > > in testcase: rcutorture > version: > with following parameters: > > runtime: 300s > test: cpuhotplug > torture_type: srcud > > test-description: rcutorture is rcutorture kernel module load/unload test. > test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt > > > on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G > > caused below changes (please refer to attached dmesg/kmsg for entire > log/backtrace): > > > +-+++ > | | a94a6d76c9 | 1cd8ce52c5 | > +-+++ > | boot_successes | 30 | 0 | > | boot_failures | 0 | 7 | > | BUG:kernel_NULL_pointer_dereference,address | 0 | 2 | > | Oops:#[##] | 0 | 7 | > | EIP:stack_depot_save| 0 | 7 | > | Kernel_panic-not_syncing:Fatal_exception| 0 | 7 | > | BUG:unable_to_handle_page_fault_for_address | 0 | 5 | > +-+++ > > > If you fix the issue, kindly add following tag > Reported-by: kernel test robot > > > > [ 319.147926][ T259] BUG: unable to handle page fault for address: 0ec74110 > [ 319.149309][ T259] #PF: supervisor read access in kernel mode > [ 319.150362][ T259] #PF: error_code(0x) - not-present page > [ 319.151372][ T259] *pde = > [ 319.151964][ T259] Oops: [#1] SMP > [ 319.152617][ T259] CPU: 0 PID: 259 Comm: systemd-rc-loca Not tainted > 5.15.0-rc1-00270-g1cd8ce52c520 #1 > [ 319.154514][ T259] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.12.0-1 04/01/2014 > [ 319.156200][ T259] EIP: stack_depot_save+0x12a/0x4d0 Cc Mike Rapoport, looks like: - memblock_alloc() should have failed (I think, because page allocator already took over?), but didn't. So apparently we got some area that wasn't fully mapped. - using slab_is_available() is not accurate enough to detect when to use memblock or page allocator (kvmalloc in case of my patch). I have used it because memblock_alloc_internal() checks the same condition to issue a warning. Relevant part of dmesg.xz that was attached: [1.589075][T0] Dentry cache hash table entries: 524288 (order: 9, 2097152 bytes, linear) [1.592396][T0] Inode-cache hash table entries: 262144 (order: 8, 1048576 bytes, linear) [2.916844][T0] allocated 31496920 bytes of page_ext - this means we were allocating from page allocator by alloc_pages_exact_nid() already [2.918197][T0] mem auto-init: stack:off, heap alloc:off, heap free:on [2.919683][T0] mem auto-init: clearing system memory may take some time... [2.921239][T0] Initializing HighMem for node 0 (000b67fe:000bffe0) [ 23.023619][T0] Initializing Movable for node 0 (:) [ 245.194520][T0] Checking if this processor honours the WP bit even in supervisor mode...Ok. 
[ 245.196847][T0] Memory: 2914460K/3145208K available (20645K kernel code, 5953K rwdata, 12624K rodata, 760K init, 8112K bss, 230748K reserved, 0K cma-reserved, 155528K highmem) [ 245.200521][T0] Stack Depot allocating hash table with memblock_alloc - initializing stack depot as part of initializing page_owner, uses memblock_alloc() because slab_is_available() is still false [ 245.212005][T0] Node 0, zone Normal: page owner found early allocated 0 pages [ 245.213867][T0] Node 0, zone HighMem: page owner found early allocated 0 pages [ 245.216126][T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1 - printed by slub's kmem_cache_init() after create_kmalloc_caches() setting slab_state to UP, making slab_is_available() true, but too late In my local testing of the patch, when stackdepot was initialized through page owner init, it was using kvmalloc() so slab_is_available() was true. Looks like the exact order of slab vs page_owner alloc is non-deterministic, could be arch-dependent or just random ordering of init calls. A wrong order will exploit the apparent fact that slab_is_available() is not a good indicator of using memblock vs page allocator, and we would need a better one. Thoughts?
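For context, the allocator decision in the patch's stack_depot_init() is essentially the sketch below (simplified, not the exact hunk). On FLATMEM, page_owner's init runs before kmem_cache_init(), so slab_is_available() is still false and the memblock path is taken; but memblock has apparently already handed its memory to the buddy allocator at that point and, instead of failing, returns a region that is not fully mapped, which then faults in stack_depot_save():

#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Simplified sketch of the allocator choice under discussion. */
static void *stack_table_alloc(size_t size)
{
        if (!slab_is_available())
                return memblock_alloc(size, PAGE_SIZE);

        return kvmalloc(size, GFP_KERNEL);
}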
[PATCH v3] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()
Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue [1] when attempting to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create only specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. 
[1] https://lore.kernel.org/all/CAMuHMdW=eovzm1re5fvoen87nkfilmm2+ah7enu2kxehcvb...@mail.gmail.com/ Signed-off-by: Vlastimil Babka Acked-by: Dmitry Vyukov Reviewed-by: Marco Elver # stackdepot Cc: Marco Elver Cc: Vijayanand Jitta Cc: Maarten Lankhorst Cc: Maxime Ripard Cc: Thomas Zimmermann Cc: David Airlie Cc: Daniel Vetter Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Geert Uytterhoeven Cc: Oliver Glitta Cc: Imran Khan --- Changes in v3: - stack_depot_init_mutex made static and moved inside stack_depot_init() Reported-by: kernel test robot - use !stack_table condition instead of stack_table == NULL reported by checkpatch on freedesktop.org patchwork drivers/gpu/drm/drm_dp_mst_topology.c | 1 + drivers/gpu/drm/drm_mm.c| 4 +++ drivers/gpu/drm/i915/intel_runtime_pm.c | 3 +++ include/linux/stackdepot.h | 25 --- init/main.c | 2 +- lib/Kconfig | 4 +++ lib/Kconfig.kasan | 2 +- lib/stackdepot.c| 33 + mm/page_owner.c | 2 ++ 9 files changed, 60 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c index 86d13d6bc463..b0ebdc843a00 100644 --- a/drivers/gpu/drm/drm_dp_mst_topology.c +++ b/drivers/gpu/drm/drm_dp_mst_topology.c @@ -5493,6 +5493,7 @@ int drm_dp_mst_topology_mgr_init(struct drm_dp_mst_topology_mgr *mgr, mutex_init(&mgr->probe_lock); #if IS_ENABLED(CONFIG_DRM_DEBUG_DP_MST_TOPOLOGY_REFS) mutex_init(&mgr->topology_ref_history_lock); + stack_depot_init(); #endif INIT_LIST_HEAD(&mgr->tx_msg_downq); INIT_LIST_HEAD(&mgr->destroy_port_list); diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c index 93d48a6f04ab..5916228ea0c9 100644 --- a/drivers/gpu/drm/drm_mm.c +++ b/drivers/gpu/drm/drm_mm.c @@ -983,6 +983,10 @@ void drm_mm_init(struct drm_mm *mm, u64 start, u64 size) add_hole(&mm->head_node); mm->scan_active = 0; + +#ifdef CONFIG_DRM_DEBUG_MM + stack_depot_init(); +#endif } EXPORT_SYMBOL(drm_mm_init); diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/intel_runtime_pm.c index eaf7688f517d..d083506986e1 100644 --- a/drivers/gpu/drm/i915/intel_runtime_pm.c +++ b/drivers/gpu/drm/i915/intel_runtime_pm.c @@ -78,6 +78,9 @@ static void __print_depot_stack(depot_stack_handle_t stack, static void init_intel_runtime_pm_wakeref(struct intel_runtime_pm *rpm) { spin_lock_init(&rpm->debug.lock); + + if (rpm->available) + stack_depot_init(); } static noinline depot_stack_handle_t diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h index 6bb4bc1a5f54..40fc5e92194f 100644 --- a/include/linux/stackdepot.h +++ b/include/linux/stackdepot.h @@ -13,6 +13,22 @@ typedef u32 depot_stack_handle_t; +/* + * Every
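The usage model for converted (lazy) users then becomes the following, shown here as a hypothetical new user with invented names; existing stack_depot_save() call sites need no change:

#include <linux/init.h>
#include <linux/stackdepot.h>

static bool my_debug_enabled __ro_after_init;   /* e.g. set by a boot parameter */

static int __init my_debug_feature_init(void)
{
        /* Only pay for the 4/8 MB hash table when the feature is in use. */
        if (my_debug_enabled)
                stack_depot_init();     /* mutex-guarded, may be called repeatedly */
        return 0;
}
late_initcall(my_debug_feature_init);

Users that must keep the boot-time memblock allocation, such as KASAN, instead select STACKDEPOT_ALWAYS_INIT, as described in the changelog above.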
[PATCH v2] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()
Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue [1] when attempting to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create only specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. [1] https://lore.kernel.org/all/CAMuHMdW=eovzm1re5fvoen87nkfilmm2+ah7enu2kxehcvb...@mail.gmail.com/ Signed-off-by: Vlastimil Babka Acked-by: Dmitry Vyukov Cc: Marco Elver Cc: Vijayanand Jitta Cc: Maarten Lankhorst Cc: Maxime Ripard Cc: Thomas Zimmermann Cc: David Airlie Cc: Daniel Vetter Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Geert Uytterhoeven Cc: Oliver Glitta Cc: Imran Khan --- Changes in v2: - Rebase to v5.15-rc5. - Stylistic changes suggested by Marco Elver. 
drivers/gpu/drm/drm_dp_mst_topology.c | 1 + drivers/gpu/drm/drm_mm.c| 4 drivers/gpu/drm/i915/intel_runtime_pm.c | 3 +++ include/linux/stackdepot.h | 25 --- init/main.c | 2 +- lib/Kconfig | 4 lib/Kconfig.kasan | 2 +- lib/stackdepot.c| 32 + mm/page_owner.c | 2 ++ 9 files changed, 59 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c index 86d13d6bc463..b0ebdc843a00 100644 --- a/drivers/gpu/drm/drm_dp_mst_topology.c +++ b/drivers/gpu/drm/drm_dp_mst_topology.c @@ -5493,6 +5493,7 @@ int drm_dp_mst_topology_mgr_init(struct drm_dp_mst_topology_mgr *mgr, mutex_init(&mgr->probe_lock); #if IS_ENABLED(CONFIG_DRM_DEBUG_DP_MST_TOPOLOGY_REFS) mutex_init(&mgr->topology_ref_history_lock); + stack_depot_init(); #endif INIT_LIST_HEAD(&mgr->tx_msg_downq); INIT_LIST_HEAD(&mgr->destroy_port_list); diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c index 93d48a6f04ab..5916228ea0c9 100644 --- a/drivers/gpu/drm/drm_mm.c +++ b/drivers/gpu/drm/drm_mm.c @@ -983,6 +983,10 @@ void drm_mm_init(struct drm_mm *mm, u64 start, u64 size) add_hole(&mm->head_node); mm->scan_active = 0; + +#ifdef CONFIG_DRM_DEBUG_MM + stack_depot_init(); +#endif } EXPORT_SYMBOL(drm_mm_init); diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/intel_runtime_pm.c index eaf7688f517d..d083506986e1 100644 --- a/drivers/gpu/drm/i915/intel_runtime_pm.c +++ b/drivers/gpu/drm/i915/intel_runtime_pm.c @@ -78,6 +78,9 @@ static void __print_depot_stack(depot_stack_handle_t stack, static void init_intel_runtime_pm_wakeref(struct intel_runtime_pm *rpm) { spin_lock_init(&rpm->debug.lock); + + if (rpm->available) + stack_depot_init(); } static noinline depot_stack_handle_t diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h index 6bb4bc1a5f54..40fc5e92194f 100644 --- a/include/linux/stackdepot.h +++ b/include/linux/stackdepot.h @@ -13,6 +13,22 @@ typedef u32 depot_stack_handle_t; +/* + * Every user of stack depot has to call this during its own init when it's + * decided that it will be calling stack_depot_save() later. + * + * The alternative is to select STACKDEPOT_ALWAYS_I
Re: [PATCH] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()
On 10/11/21 19:08, Marco Elver wrote: > On Mon, 11 Oct 2021 at 19:02, Vlastimil Babka wrote: > [...] >> > On the other hand, the lazy initialization mode you're introducing >> > requires an explicit stack_depot_init() call somewhere and isn't as >> > straightforward as before. >> > >> > Not sure what is best. My intuition tells me STACKDEPOT_LAZY_INIT would >> > be safer as it's a deliberate opt-in to the lazy initialization >> > behaviour. >> >> I think it should be fine with ALWAYS_INIT. There are not many stackdepot >> users being added, and anyone developing a new one will very quickly find >> out if they forget to call stack_depot_init()? > > I think that's fine. > >> > Preferences? >> > >> > [...] >> >> --- a/drivers/gpu/drm/drm_mm.c >> >> +++ b/drivers/gpu/drm/drm_mm.c >> >> @@ -980,6 +980,10 @@ void drm_mm_init(struct drm_mm *mm, u64 start, u64 >> >> size) >> >> add_hole(&mm->head_node); >> >> >> >> mm->scan_active = 0; >> >> + >> >> +#ifdef CONFIG_DRM_DEBUG_MM >> >> +stack_depot_init(); >> >> +#endif >> > >> > DRM_DEBUG_MM implies STACKDEPOT. Not sure what is more readable to drm >> > maintainers, but perhaps it'd be nicer to avoid the #ifdef here, and >> > instead just keep the no-op version of stack_depot_init() in >> > . I don't have a strong preference. >> >> Hm, but in case STACKDEPOT is also selected by something else (e.g. >> CONFIG_PAGE_OWNER) which uses lazy init but isn't enabled on boot, then >> without #ifdef CONFIG_DRM_DEBUG_MM above, this code would call a >> stack_depot_init() (that's not a no-op) even in case it's not going to be >> using it, so not what we want to achieve. >> But it could be changed to use IS_ENABLED() if that's preferred by DRM folks. > > You're right -- but I'll leave this to DRM folks. Ah, the file only includes stackdepot.h in a #ifdef CONFIG_DRM_DEBUG_MM section so I will keep the #ifdef here for a minimal change, unless requested otherwise.
Re: [PATCH] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()
On 10/7/21 13:01, Marco Elver wrote: > On Thu, Oct 07, 2021 at 11:58AM +0200, Vlastimil Babka wrote: > [...] >> - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current >> well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN >> select this flag. >> - Other users have to call stack_depot_init() as part of their own init when >> it's determined that stack depot will actually be used. This may depend on >> both config and runtime conditions. Convert current users which are >> page_owner and several in the DRM subsystem. Same will be done for SLUB >> later. >> - Because the init might now be called after the boot-time memblock >> allocation >> has given all memory to the buddy allocator, change stack_depot_init() to >> allocate stack_table with kvmalloc() when memblock is no longer available. >> Also handle allocation failure by disabling stackdepot (could have >> theoretically happened even with memblock allocation previously), and don't >> unnecessarily align the memblock allocation to its own size anymore. > ... >> Hi, I'd appreciate review of the DRM parts - namely that I've got correctly >> that stack_depot_init() is called from the proper init functions and iff >> stack_depot_save() is going to be used later. Thanks! > > For ease of review between stackdepot and DRM changes, I thought it'd be > nice to split into 2 patches, but not sure it'll work, because you're > changing the semantics of the normal STACKDEPOT. Yeah, that's why it's a single patch. As the DRM parts are clearly separated to their files, I think review should be fine. > One option would be to flip it around, and instead have > STACKDEPOT_LAZY_INIT, but that seems counter-intuitive if the majority > of STACKDEPOT users are LAZY_INIT users. Agree. > On the other hand, the lazy initialization mode you're introducing > requires an explicit stack_depot_init() call somewhere and isn't as > straightforward as before. > > Not sure what is best. My intuition tells me STACKDEPOT_LAZY_INIT would > be safer as it's a deliberate opt-in to the lazy initialization > behaviour. I think it should be fine with ALWAYS_INIT. There are not many stackdepot users being added, and anyone developing a new one will very quickly find out if they forget to call stack_depot_init()? > Preferences? > > [...] >> --- a/drivers/gpu/drm/drm_mm.c >> +++ b/drivers/gpu/drm/drm_mm.c >> @@ -980,6 +980,10 @@ void drm_mm_init(struct drm_mm *mm, u64 start, u64 size) >> add_hole(&mm->head_node); >> >> mm->scan_active = 0; >> + >> +#ifdef CONFIG_DRM_DEBUG_MM >> +stack_depot_init(); >> +#endif > > DRM_DEBUG_MM implies STACKDEPOT. Not sure what is more readable to drm > maintainers, but perhaps it'd be nicer to avoid the #ifdef here, and > instead just keep the no-op version of stack_depot_init() in > . I don't have a strong preference. Hm, but in case STACKDEPOT is also selected by something else (e.g. CONFIG_PAGE_OWNER) which uses lazy init but isn't enabled on boot, then without #ifdef CONFIG_DRM_DEBUG_MM above, this code would call a stack_depot_init() (that's not a no-op) even in case it's not going to be using it, so not what we want to achieve. But it could be changed to use IS_ENABLED() if that's preferred by DRM folks. BTW it's possible that there won't be any DRM review because this failed to apply: https://patchwork.freedesktop.org/series/95549/ DRM folks, any hint how to indicate that the base was next-20211001? 
>> @@ -30,13 +40,4 @@ int stack_depot_snprint(depot_stack_handle_t handle, char >> *buf, size_t size, >> >> void stack_depot_print(depot_stack_handle_t stack); >> >> -#ifdef CONFIG_STACKDEPOT >> -int stack_depot_init(void); >> -#else >> -static inline int stack_depot_init(void) >> -{ >> -return 0; >> -} >> -#endif /* CONFIG_STACKDEPOT */ >> - > > Could we avoid the IS_ENABLED() in init/main.c by adding a wrapper here: > > +#ifdef CONFIG_STACKDEPOT_ALWAYS_INIT > +static inline int stack_depot_early_init(void) { return > stack_depot_init(); } > +#else > +static inline int stack_depot_early_init(void) { return 0; } > +#endif /* CONFIG_STACKDEPOT_ALWAYS_INIT */ We could, but it's a wrapper made for only a single caller... >> #endif >> diff --git a/init/main.c b/init/main.c >> index ee4d3e1b3eb9..b6a5833d98f5 100644 >> --- a/init/main.c >> +++ b/init/main.c >> @@ -844,7 +844,8 @@ static void __init mm_init(void) >> init_mem_debug
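For completeness, the IS_ENABLED() variant discussed for drm_mm_init() would amount to the sketch below (hypothetical helper name); it behaves the same as the #ifdef thanks to compile-time constant folding, but it would also need <linux/stackdepot.h> included outside the existing CONFIG_DRM_DEBUG_MM-only block, which is why the follow-up above keeps the #ifdef as the minimal change:

#include <linux/kconfig.h>
#include <linux/stackdepot.h>

static void drm_mm_stack_depot_init(void)
{
        if (IS_ENABLED(CONFIG_DRM_DEBUG_MM))
                stack_depot_init();
}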
[PATCH] lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()
Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue when trying to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. [1] https://lore.kernel.org/all/CAMuHMdW=eovzm1re5fvoen87nkfilmm2+ah7enu2kxehcvb...@mail.gmail.com/ Signed-off-by: Vlastimil Babka Cc: Marco Elver Cc: Vijayanand Jitta Cc: Maarten Lankhorst Cc: Maxime Ripard Cc: Thomas Zimmermann Cc: David Airlie Cc: Daniel Vetter Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Cc: Geert Uytterhoeven Cc: Oliver Glitta Cc: Imran Khan --- Hi, I'd appreciate review of the DRM parts - namely that I've got correctly that stack_depot_init() is called from the proper init functions and iff stack_depot_save() is going to be used later. Thanks! 
drivers/gpu/drm/drm_dp_mst_topology.c | 1 + drivers/gpu/drm/drm_mm.c| 4 drivers/gpu/drm/i915/intel_runtime_pm.c | 3 +++ include/linux/stackdepot.h | 19 --- init/main.c | 3 ++- lib/Kconfig | 3 +++ lib/Kconfig.kasan | 1 + lib/stackdepot.c| 32 + mm/page_owner.c | 2 ++ 9 files changed, 53 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c b/drivers/gpu/drm/drm_dp_mst_topology.c index 2d1adab9e360..bbe972d59dae 100644 --- a/drivers/gpu/drm/drm_dp_mst_topology.c +++ b/drivers/gpu/drm/drm_dp_mst_topology.c @@ -5490,6 +5490,7 @@ int drm_dp_mst_topology_mgr_init(struct drm_dp_mst_topology_mgr *mgr, mutex_init(&mgr->probe_lock); #if IS_ENABLED(CONFIG_DRM_DEBUG_DP_MST_TOPOLOGY_REFS) mutex_init(&mgr->topology_ref_history_lock); + stack_depot_init(); #endif INIT_LIST_HEAD(&mgr->tx_msg_downq); INIT_LIST_HEAD(&mgr->destroy_port_list); diff --git a/drivers/gpu/drm/drm_mm.c b/drivers/gpu/drm/drm_mm.c index 7d1c578388d3..8257f9d4f619 100644 --- a/drivers/gpu/drm/drm_mm.c +++ b/drivers/gpu/drm/drm_mm.c @@ -980,6 +980,10 @@ void drm_mm_init(struct drm_mm *mm, u64 start, u64 size) add_hole(&mm->head_node); mm->scan_active = 0; + +#ifdef CONFIG_DRM_DEBUG_MM + stack_depot_init(); +#endif } EXPORT_SYMBOL(drm_mm_init); diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/intel_runtime_pm.c index 0d85f3c5c526..806c32ab410b 100644 --- a/drivers/gpu/drm/i915/intel_runtime_pm.c +++ b/drivers/gpu/drm/i915/intel_runtime_pm.c @@ -68,6 +68,9 @@ static noinline depot_stack_handle_t __save_depot_stack(void) static void init_intel_runtime_pm_wakeref(struct intel_runtime_pm *rpm) { spin_lock_init(&rpm->debug.lock); + + if (rpm->available) + stack_depot_init(); } static noinline depot_stack_handle_t diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h index c34b55a6e554..60ba99a43745 100644 --- a/include/linux/stackdepot.h +++ b/include/linux/stackdepot.h @@ -15,6 +15,16 @@ typedef u32 depot_stack_handle_t; +/* + * Every user of stack depot has to call this during its own init when it's + * decided that it will be cal
Re: [PATCH 0/1] lib, stackdepot: Add helper to print stack entries into buffer.
On 9/10/21 16:10, Imran Khan wrote: > This change is in response to discussion at [1]. > The patch has been created on top of my earlier changes [2] and [3]. > If needed I can resend all of these patches together, though my > earlier patches have been Acked. I think you sent those at the beginning of merge window, so it would be best to gather everything in a self-contained series now and resend. I suggested another change for one of those anyway. You can of course resend including the Acks you already got, as you did already with "[PATCH v2 1/1] lib, stackdepot: Add helper to print stack entries into buffer." > [1] https://lore.kernel.org/lkml/e6f6fb85-1d83-425b-9e36-b5784cc9e...@suse.cz/ > [2] https://lore.kernel.org/lkml/fe94ffd8-d235-87d8-9c3d-80f7f73e0...@suse.cz/ > [3] https://lore.kernel.org/lkml/85f4f073-0b5a-9052-0ba9-74d450608...@suse.cz/ > > Imran Khan (1): > lib, stackdepot: Add helper to print stack entries into buffer. > > drivers/gpu/drm/drm_dp_mst_topology.c | 5 + > drivers/gpu/drm/drm_mm.c| 5 + > drivers/gpu/drm/i915/i915_vma.c | 5 + > drivers/gpu/drm/i915/intel_runtime_pm.c | 20 +--- > include/linux/stackdepot.h | 3 +++ > lib/stackdepot.c| 23 +++ > mm/page_owner.c | 5 + > 7 files changed, 35 insertions(+), 31 deletions(-) >
Re: [PATCH 1/1] lib, stackdepot: Add helper to print stack entries into buffer.
On 9/10/21 16:10, Imran Khan wrote: > To print stack entries into a buffer, users of stackdepot, > first get a list of stack entries using stack_depot_fetch > and then print this list into a buffer using stack_trace_snprint. > Provide a helper in stackdepot for this purpose. > Also change above mentioned users to use this helper. > > Signed-off-by: Imran Khan > Suggested-by: Vlastimil Babka Acked-by: Vlastimil Babka A comment below: > --- a/lib/stackdepot.c > +++ b/lib/stackdepot.c > @@ -214,6 +214,29 @@ static inline struct stack_record *find_stack(struct > stack_record *bucket, > return NULL; > } > > +/** > + * stack_depot_snprint - print stack entries from a depot into a buffer > + * > + * @handle: Stack depot handle which was returned from > + * stack_depot_save(). > + * @buf: Pointer to the print buffer > + * > + * @size:Size of the print buffer > + * > + * @spaces: Number of leading spaces to print > + * > + * Return: Number of bytes printed. > + */ > +int stack_depot_snprint(depot_stack_handle_t handle, char *buf, size_t size, > +int spaces) > +{ > + unsigned long *entries; > + unsigned int nr_entries; > + > + nr_entries = stack_depot_fetch(handle, &entries); > + return stack_trace_snprint(buf, size, entries, nr_entries, 0); stack_trace_snprint() has a WARN_ON(!entries). So maybe we should not call it if nr_entries is 0 (because e.g. handle was 0) as the warnings are not useful in that case.
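Concretely, the suggested guard would turn the helper into something like this (a sketch only, not a posted revision; it also forwards @spaces, which the hunk above hard-codes to 0):

int stack_depot_snprint(depot_stack_handle_t handle, char *buf, size_t size,
                        int spaces)
{
        unsigned long *entries;
        unsigned int nr_entries;

        nr_entries = stack_depot_fetch(handle, &entries);
        /* avoid stack_trace_snprint()'s WARN_ON(!entries) for an empty handle */
        return nr_entries ? stack_trace_snprint(buf, size, entries, nr_entries,
                                                spaces) : 0;
}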
Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount
On 8/25/21 19:49, Ralph Campbell wrote: > > On 8/25/21 4:15 AM, Vlastimil Babka wrote: >> On 8/25/21 05:48, Alex Sierra wrote: >>> From: Ralph Campbell >>> >>> ZONE_DEVICE struct pages have an extra reference count that complicates the >>> code for put_page() and several places in the kernel that need to check the >>> reference count to see that a page is not being used (gup, compaction, >>> migration, etc.). Clean up the code so the reference count doesn't need to >>> be treated specially for ZONE_DEVICE. >> That's certainly welcome. I just wonder what was the reason to use 1 in the >> first place and why it's no longer necessary? > > I'm sure this is a long story that I don't know most of the history. > I'm guessing that ZONE_DEVICE started out with a reference count of > one since that is what most "normal" struct page pages start with. > Then put_page() is used to free newly initialized struct pages to the > slab/slob/slub page allocator. > This made it easy to call get_page()/put_page() on ZONE_DEVICE pages > since get_page() asserts that the caller has a reference. > However, most drivers that create ZONE_DEVICE struct pages just insert > a PTE into the user page tables and don't increment/decrement the > reference count. MEMORY_DEVICE_FS_DAX used the >1 to 1 reference count > transition to signal that a page was idle so that made put_page() a > bit more complex. Then MEMORY_DEVICE_PRIVATE pages were added and this > special casing of what "idle" meant got more complicated and more parts > of mm had to check for is_device_private_page(). > My goal was to make ZONE_DEVICE struct pages reference counts be zero > based and allocated/freed similar to the page allocator so that more > of the mm code could be used, like THP ZONE_DEVICE pages, without special > casing ZONE_DEVICE. Thanks for the explanation. I was afraid there was something more fundamental that required to catch the 2->1 refcount transition, seems like it's not. I agree with the simplification!
Re: [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount
On 8/25/21 05:48, Alex Sierra wrote: > From: Ralph Campbell > > ZONE_DEVICE struct pages have an extra reference count that complicates the > code for put_page() and several places in the kernel that need to check the > reference count to see that a page is not being used (gup, compaction, > migration, etc.). Clean up the code so the reference count doesn't need to > be treated specially for ZONE_DEVICE. That's certainly welcome. I just wonder what was the reason to use 1 in the first place and why it's no longer necessary?
Re: [PATCH 1/2] mm/vmscan: add sync_shrinkers function
On 4/9/21 9:17 AM, Christian König wrote: > To be able to switch to a spinlock and reduce lock contention in the TTM > shrinker we don't want to hold a mutex while unmapping and freeing pages > from the pool. Does using spinlock instead of mutex really reduce lock contention? > But then we somehow need to prevent a race between (for example) the shrinker > trying to free pages and hotplug trying to remove the device which those pages > belong to. > > Taking and releasing the shrinker semaphore on the write side after > unmapping and freeing all pages should make sure that no shrinker is running > in > paralell any more. So you explain this in this commit log for adding the function, but then the next patch just adds a sync_shrinkers() call without any comment. I would expect there a comment explaining why it's done there - what it protects against, as it's not an obvious pattern IMHO. > Signed-off-by: Christian König > --- > include/linux/shrinker.h | 1 + > mm/vmscan.c | 10 ++ > 2 files changed, 11 insertions(+) > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h > index 0f80123650e2..6b75dc372fce 100644 > --- a/include/linux/shrinker.h > +++ b/include/linux/shrinker.h > @@ -92,4 +92,5 @@ extern void register_shrinker_prepared(struct shrinker > *shrinker); > extern int register_shrinker(struct shrinker *shrinker); > extern void unregister_shrinker(struct shrinker *shrinker); > extern void free_prealloced_shrinker(struct shrinker *shrinker); > +extern void sync_shrinkers(void); > #endif > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 562e87cbd7a1..46cd9c215d73 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -408,6 +408,16 @@ void unregister_shrinker(struct shrinker *shrinker) > } > EXPORT_SYMBOL(unregister_shrinker); > > +/** > + * sync_shrinker - Wait for all running shrinkers to complete. > + */ > +void sync_shrinkers(void) > +{ > + down_write(&shrinker_rwsem); > + up_write(&shrinker_rwsem); > +} > +EXPORT_SYMBOL(sync_shrinkers); > + > #define SHRINK_BATCH 128 > > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
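To make the requested call-site comment concrete, the pattern the commit message describes looks roughly like this (a hypothetical pool teardown with invented helper names, not the actual TTM code):

#include <linux/shrinker.h>
#include <linux/slab.h>

struct my_pool;                                         /* hypothetical */
void my_pool_unmap_and_free_all_pages(struct my_pool *p);       /* hypothetical */

static void my_pool_fini(struct my_pool *pool)
{
        /* Unmap and free everything so a shrinker run finds nothing new. */
        my_pool_unmap_and_free_all_pages(pool);

        /*
         * A shrinker invocation that started before the pool was emptied may
         * still be running; sync_shrinkers() takes and releases shrinker_rwsem
         * for write, so once it returns no shrinker runs in parallel and the
         * pool (and the device behind it) can be torn down safely.
         */
        sync_shrinkers();

        kfree(pool);
}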
Re: [PATCH 1/8] mm: slab: provide krealloc_array()
On 10/27/20 1:17 PM, Bartosz Golaszewski wrote: From: Bartosz Golaszewski When allocating an array of elements, users should check for multiplication overflow or preferably use one of the provided helpers like: kmalloc_array(). There's no krealloc_array() counterpart but there are many users who use regular krealloc() to reallocate arrays. Let's provide an actual krealloc_array() implementation. Signed-off-by: Bartosz Golaszewski Makes sense. Acked-by: Vlastimil Babka --- include/linux/slab.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/include/linux/slab.h b/include/linux/slab.h index dd6897f62010..0e6683affee7 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -592,6 +592,17 @@ static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) return __kmalloc(bytes, flags); } +static __must_check inline void * +krealloc_array(void *p, size_t new_n, size_t new_size, gfp_t flags) +{ + size_t bytes; + + if (unlikely(check_mul_overflow(new_n, new_size, &bytes))) + return NULL; + + return krealloc(p, bytes, flags); +} + /** * kcalloc - allocate memory for an array. The memory is set to zero. * @n: number of elements. ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
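A typical caller then looks like the sketch below (made-up structures, for illustration); note that on failure the old array is left untouched, as with krealloc():

#include <linux/slab.h>

struct line {
        unsigned int offset;    /* made-up payload */
};

struct line_table {
        struct line *lines;
        size_t num;
};

static int line_table_resize(struct line_table *tbl, size_t new_num)
{
        struct line *tmp;

        /* overflow of new_num * sizeof(*tmp) is checked inside the helper */
        tmp = krealloc_array(tbl->lines, new_num, sizeof(*tbl->lines),
                             GFP_KERNEL);
        if (!tmp)
                return -ENOMEM;

        tbl->lines = tmp;
        tbl->num = new_num;
        return 0;
}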
Re: [PATCH 04/65] mm: Extract might_alloc() debug check
On 10/23/20 2:21 PM, Daniel Vetter wrote: Extracted from slab.h, which seems to have the most complete version including the correct might_sleep() check. Roll it out to slob.c. Motivated by a discussion with Paul about possibly changing call_rcu behaviour to allocate memory, but only roughly every 500th call. There are a lot fewer places in the kernel that care about whether allocating memory is allowed or not (due to deadlocks with reclaim code) than places that care whether sleeping is allowed. But debugging these also tends to be a lot harder, so nice descriptive checks could come in handy. I might have some use eventually for annotations in drivers/gpu. Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does not consult the PF_MEMALLOC flags. But there is no flag equivalent for GFP_NOWAIT, hence this check can't go wrong due to memalloc_no*_save/restore contexts. Cc: Paul E. McKenney Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Vlastimil Babka Cc: Mathieu Desnoyers Cc: Sebastian Andrzej Siewior Cc: Michel Lespinasse Cc: Daniel Vetter Cc: Waiman Long Cc: Thomas Gleixner Cc: Randy Dunlap Cc: linux...@kvack.org Cc: linux-fsde...@vger.kernel.org Cc: Dave Chinner Cc: Qian Cai Cc: linux-...@vger.kernel.org Signed-off-by: Daniel Vetter Looks useful. Acked-by: Vlastimil Babka ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Bug 204407] New: Bad page state in process Xorg
On 8/17/19 2:20 AM, Petr Vandrovec wrote: > Vlastimil Babka wrote on 8/16/2019 5:47 AM: >> On 8/15/19 9:13 PM, Petr Vandrovec wrote: >>> With iommu=off disks are visible, but USB keyboard (and other USB >>> devices) does not work: >> >> I've been told iommu=soft might help. > > Thanks. I've rebuilt kernel without IOMMU. > > Unfortunately I was not able to reproduce original problem with latest > kernel - neither with CMA nor without CMA. System seems stable as a rock. That's rather fortunate I would say :) But if you have any unrelated regressions in 5.3, please report them accordingly. > This is the change on which I've tried to reproduce it with your > debugging patches: > > commit 41de59634046b19cd53a1983594a95135c656997 (HEAD -> master, > origin/master, origin/HEAD) > Merge: e22a97a2a85d 1ee1119d184b > Author: Linus Torvalds > Date: Wed Aug 14 15:29:53 2019 -0700 > > Merge tag 'Wimplicit-fallthrough-5.3-rc5' of > git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux > > Petr > ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Bug 204407] New: Bad page state in process Xorg
On 8/15/19 9:13 PM, Petr Vandrovec wrote: > Vlastimil Babka wrote on 8/15/2019 7:32 AM: >> >> Does the issue still happen with rc4? Could you apply the 3 attached >> patches (work in progress), configure-enable CONFIG_DEBUG_PAGEALLOC and >> CONFIG_PAGE_OWNER and boot kernel with debug_pagealloc=on page_owner=on >> parameters? That should print stacktraces of allocation and first >> freeing (assuming this is a double free). > > Unfortunately -rc4 does not find any my SATA disks due to some > misunderstanding between AHCI driver and HPT642L adapter (there is no > device 07:00.1, HPT is single-function device at 07:00.0): > > [ 18.003015] scsi host6: ahci > [ 18.006605] DMAR: DRHD: handling fault status reg 2 > [ 18.006619] DMAR: [DMA Write] Request device [07:00.1] fault addr > fffe [fault reason 02] Present bit in context entry is clear > [ 18.076616] DMAR: DRHD: handling fault status reg 102 > [ 18.085910] DMAR: [DMA Write] Request device [07:00.1] fault addr > fffa [fault reason 02] Present bit in context entry is clear > [ 18.100989] DMAR: DRHD: handling fault status reg 202 > [ 18.110985] DMAR: [DMA Write] Request device [07:00.1] fault addr > fffe [fault reason 02] Present bit in context entry is clear Worth reporting as well, not nice regression. > With iommu=off disks are visible, but USB keyboard (and other USB > devices) does not work: I've been told iommu=soft might help. > [ 18.174802] ehci-pci :00:1a.0: swiotlb buffer is full (sz: 8 > bytes), total 0 (slots), used 0 (slots) > [ 18.174804] ehci-pci :00:1a.0: overflow 0x000ffdc75ae8+8 of > DMA mask bus mask 0 > [ 18.174815] WARNING: CPU: 2 PID: 508 at kernel/dma/direct.c:35 > report_addr+0x2e/0x50 > [ 18.174816] Modules linked in: > [ 18.174818] CPU: 2 PID: 508 Comm: kworker/2:1 Tainted: G > T 5.3.0-rc4-64-00058-gd717b092e0b2 #77 > [ 18.174819] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A16 > 02/05/2018 > [ 18.174822] Workqueue: usb_hub_wq hub_event > > I'll try to find -rc4 configuration that has enabled debugging and can boot. > > > Petr > > ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Bug 204407] New: Bad page state in process Xorg
On 8/3/19 1:29 AM, Petr Vandrovec wrote: > On Fri, Aug 2, 2019, 3:59 PM Matthew Wilcox <mailto:wi...@infradead.org>> wrote: > > That doesn't help because we call reset_page_owner() in the free > page path. > > We could turn on tracing because we call trace_mm_page_free() in this > path. That requires the reporter to be able to reproduce the problem, > and it's not clear to me whether this is a "happened once" or "every > time I do this, it happens" problem. > > > It happened on 3 of the boots with that kernel. 4th time box either > spontaneously rebooted when X started, or watchdog restarted box shortly > after starting X server. > > So I believe I should be able to reproduce it with additional patches or > extra flags enabled. Does the issue still happen with rc4? Could you apply the 3 attached patches (work in progress), configure-enable CONFIG_DEBUG_PAGEALLOC and CONFIG_PAGE_OWNER and boot kernel with debug_pagealloc=on page_owner=on parameters? That should print stacktraces of allocation and first freeing (assuming this is a double free). Vlastimil >From 5b4c46cb1d7a8bca3e8d98433b19e60b28fb5796 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 15 Aug 2019 14:06:50 +0200 Subject: [PATCH 1/3] mm, page_owner: record page owner for each subpage Currently, page owner info is only recorded for the first page of a high-order allocation, and copied to tail pages in the event of a split page. With the plan to keep previous owner info after freeing the page, it would be benefical to record page owner for each subpage upon allocation. This increases the overhead for high orders, but that should be acceptable for a debugging option. Signed-off-by: Vlastimil Babka --- mm/page_owner.c | 33 + 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/mm/page_owner.c b/mm/page_owner.c index addcbb2ae4e4..695cd5fdf7fd 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -154,18 +154,23 @@ static noinline depot_stack_handle_t save_stack(gfp_t flags) return handle; } -static inline void __set_page_owner_handle(struct page_ext *page_ext, - depot_stack_handle_t handle, unsigned int order, gfp_t gfp_mask) +static inline void __set_page_owner_handle(struct page *page, + struct page_ext *page_ext, depot_stack_handle_t handle, + unsigned int order, gfp_t gfp_mask) { struct page_owner *page_owner; + int i; - page_owner = get_page_owner(page_ext); - page_owner->handle = handle; - page_owner->order = order; - page_owner->gfp_mask = gfp_mask; - page_owner->last_migrate_reason = -1; + for (i = 0; i < (1 << order); i++) { + page_owner = get_page_owner(page_ext); + page_owner->handle = handle; + page_owner->order = order; + page_owner->gfp_mask = gfp_mask; + page_owner->last_migrate_reason = -1; + __set_bit(PAGE_EXT_OWNER, &page_ext->flags); - __set_bit(PAGE_EXT_OWNER, &page_ext->flags); + page_ext = lookup_page_ext(page + i); + } } noinline void __set_page_owner(struct page *page, unsigned int order, @@ -178,7 +183,7 @@ noinline void __set_page_owner(struct page *page, unsigned int order, return; handle = save_stack(gfp_mask); - __set_page_owner_handle(page_ext, handle, order, gfp_mask); + __set_page_owner_handle(page, page_ext, handle, order, gfp_mask); } void __set_page_owner_migrate_reason(struct page *page, int reason) @@ -204,8 +209,11 @@ void __split_page_owner(struct page *page, unsigned int order) page_owner = get_page_owner(page_ext); page_owner->order = 0; - for (i = 1; i < (1 << order); i++) - __copy_page_owner(page, page + i); + for (i = 1; i < (1 << order); i++) { + page_ext = 
lookup_page_ext(page + i); + page_owner = get_page_owner(page_ext); + page_owner->order = 0; + } } void __copy_page_owner(struct page *oldpage, struct page *newpage) @@ -562,7 +570,8 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone) continue; /* Found early allocated page */ - __set_page_owner_handle(page_ext, early_handle, 0, 0); + __set_page_owner_handle(page, page_ext, early_handle, + 0, 0); count++; } cond_resched(); -- 2.22.0 >From 063f233cc154819bb39a005726260496eea12265 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 15 Aug 2019 14:48:54 +0200 Subject: [PATCH 2/3] mm, page_owner: keep owner info when freeing the page For debugging purposes it might be useful to keep the owner info even after page has been freed and include it in e.g. dump_page() when detecting a bad page state. For that, change the PAGE_EXT_OWNER flag meaning to "page owner info has been set at least once" and add new PAGE_EXT_OWNER_ACTIVE for tracking whether page is supposed to be currently allocated or free. Adjust dump_page() and page_owner file printing accordingly. Signed-off-by: Vlastimil Ba
Re: [PATCH] driver : staging : ion: optimization for decreasing memory fragmentation
You should have CC'd the ION maintainers/lists per ./scripts/get_maintainer.pl - CCing now. On 3/14/19 12:06 PM, Zhaoyang Huang wrote: > From: Zhaoyang Huang > > Two action for this patch: > 1. set a batch size for system heap's shrinker, which can have it buffer > reasonable page blocks in pool for future allocation. > 2. reverse the order sequence when free page blocks, the purpose is also > to have system heap keep as more big blocks as it can. > > By testing on an android system with 2G RAM, the changes with setting > batch = 48MB can help reduce the fragmentation obviously and improve > big block allocation speed for 15%. > > Signed-off-by: Zhaoyang Huang > --- > drivers/staging/android/ion/ion_heap.c| 12 +++- > drivers/staging/android/ion/ion_system_heap.c | 2 +- > 2 files changed, 12 insertions(+), 2 deletions(-) > > diff --git a/drivers/staging/android/ion/ion_heap.c > b/drivers/staging/android/ion/ion_heap.c > index 31db510..9e9caf2 100644 > --- a/drivers/staging/android/ion/ion_heap.c > +++ b/drivers/staging/android/ion/ion_heap.c > @@ -16,6 +16,8 @@ > #include > #include "ion.h" > > +unsigned long ion_heap_batch = 0; > + > void *ion_heap_map_kernel(struct ion_heap *heap, > struct ion_buffer *buffer) > { > @@ -303,7 +305,15 @@ int ion_heap_init_shrinker(struct ion_heap *heap) > heap->shrinker.count_objects = ion_heap_shrink_count; > heap->shrinker.scan_objects = ion_heap_shrink_scan; > heap->shrinker.seeks = DEFAULT_SEEKS; > - heap->shrinker.batch = 0; > + heap->shrinker.batch = ion_heap_batch; > > return register_shrinker(&heap->shrinker); > } > + > +static int __init ion_system_heap_batch_init(char *arg) > +{ > + ion_heap_batch = memparse(arg, NULL); > + > + return 0; > +} > +early_param("ion_batch", ion_system_heap_batch_init); > diff --git a/drivers/staging/android/ion/ion_system_heap.c > b/drivers/staging/android/ion/ion_system_heap.c > index 701eb9f..d249f8d 100644 > --- a/drivers/staging/android/ion/ion_system_heap.c > +++ b/drivers/staging/android/ion/ion_system_heap.c > @@ -182,7 +182,7 @@ static int ion_system_heap_shrink(struct ion_heap *heap, > gfp_t gfp_mask, > if (!nr_to_scan) > only_scan = 1; > > - for (i = 0; i < NUM_ORDERS; i++) { > + for (i = NUM_ORDERS - 1; i >= 0; i--) { > pool = sys_heap->pools[i]; > > if (only_scan) { > ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 02/12] blk: use for_each_if
On 07/12/2018 08:45 AM, Joe Perches wrote: > On Wed, 2018-07-11 at 20:50 +0200, Daniel Vetter wrote: >> On Wed, Jul 11, 2018 at 8:30 PM, Jens Axboe wrote: >>> On 7/11/18 10:45 AM, Tejun Heo wrote: On Wed, Jul 11, 2018 at 09:40:58AM -0700, Tejun Heo wrote: > On Mon, Jul 09, 2018 at 10:36:40AM +0200, Daniel Vetter wrote: >> Makes the macros resilient against if {} else {} blocks right >> afterwards. >> >> Signed-off-by: Daniel Vetter >> Cc: Tejun Heo >> Cc: Jens Axboe >> Cc: Shaohua Li >> Cc: Kate Stewart >> Cc: Greg Kroah-Hartman >> Cc: Joseph Qi >> Cc: Daniel Vetter >> Cc: Arnd Bergmann > > Acked-by: Tejun Heo > > Jens, it'd probably be best to route this through block tree. Oops, this requires an earlier patch to move the for_each_if def to a common header and should be routed together. >>> >>> Yeah, this is a problem with the submission. >>> >>> Always (ALWAYS) CC folks on at least the cover letter and generic >>> earlier patches. Getting just one patch sent like this is mostly >>> useless, and causes more harm than good. >> >> Ime sending a patch with more than 20 or so recipients means it gets >> stuck everywhere in moderation queues. Or outright spam filters. I >> thought the correct way to do this is to cc: mailing lists (lkml has >> them all), but apparently that's not how it's done. Despite that all >> the patch series I get never have the cover letter addressed to me >> either. >> >> So what's the magic way to make this possible? > > Jens' advice is crap. This statement was rather offensive and totally uncalled for, AFAICS. Why did you write it like that? > There is no generic way to make this possible. > > BCC's don't work, series that touch multiple subsystems > get rejected when the recipient list is too large. I don't know what's the usual limit for recipient list, probably never hit it myself, but for series that are not so large, I use this approach to make sure the cover letter is CC'd to everyone that's CC'd in any patch in the series: - add per-patch Cc:'s to the git commit logs - clear out *.patch from the working dir - git format-patch --cover-letter ... - edit cover letter - git send-email ... --cc-cmd=./cc.sh ... where cc.sh contains this: #/bin/sh if [[ $1 == *cover-letter* ]]; then grep '<.*@.*>' -h *.patch | sed 's/^.*: //' | sort | uniq else grep '<.*@.*>' -h $1 | sed 's/^.*: //' | sort | uniq fi That proceses all tags besides Cc (acked-by, reported-by etc) and turns them to Cc's for each patch (or does git now do that by itself as well?) and for cover letter, it accumulates that from all the patches. Vlastimil > I think you did it correctly. > ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
drm/radeon spamming alloc_contig_range: [xxx, yyy) PFNs busy busy
On 12/01/2016 10:02 PM, Michal Nazarewicz wrote: > On Thu, Dec 01 2016, Michal Hocko wrote: >> I am not familiar with this code so I cannot really argue but a quick >> look at rmem_cma_setup doesn't suggest any specific placing or >> anything... > > early_cma parses the "cma" command line argument which can specify where > exactly the default CMA area is to be located. Furthermore, CMA areas > can be assigned per-device (via the Device Tree IIRC). OK, but the context of this bug report is a generic CMA pool and generic dma alloc, which tries CMA first and then falls back to alloc_pages_node(). If a device really requires specific placing as you suggest, then it probably uses a different allocation interface, otherwise there would be some flag to disallow the alloc_pages_node() fallback?
drm/radeon spamming alloc_contig_range: [xxx, yyy) PFNs busy busy
On 12/01/2016 08:21 AM, Michal Hocko wrote: > Forgot to CC Joonsoo. The email thread starts more or less here > http://lkml.kernel.org/r/20161130092239.GD18437@dhcp22.suse.cz > > On Thu 01-12-16 08:15:07, Michal Hocko wrote: >> On Wed 30-11-16 20:19:03, Robin H. Johnson wrote: >> [...] >> > alloc_contig_range: [83f2a3, 83f2a4) PFNs busy >> >> Huh, do I get it right that the request was for a _single_ page? Why do >> we need CMA for that? Ugh, good point. I assumed that was just the PFNs that it failed to migrate away, but it seems that's indeed the whole requested range. Yeah, it sounds like some part of the dma-cma chain could be smarter and attempt CMA only for e.g. costly orders.
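A rough sketch of that "costly orders only" idea, purely for illustration: dma_alloc_pages_sketch() is a made-up name and not the actual dma-cma code, and dma_alloc_from_contiguous() is used with the three-argument signature of that era, which should be double-checked against the tree.

#include <linux/device.h>
#include <linux/dma-contiguous.h>
#include <linux/gfp.h>

/* Hypothetical: only bother with CMA for costly orders, where the buddy
 * allocator is likely to fail; fall back to the normal path otherwise. */
static struct page *dma_alloc_pages_sketch(struct device *dev, size_t count,
					   unsigned int order, gfp_t gfp)
{
	struct page *page = NULL;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		page = dma_alloc_from_contiguous(dev, count, order);

	if (!page)
		page = alloc_pages_node(dev_to_node(dev), gfp, order);

	return page;
}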
drm/radeon spamming alloc_contig_range: [xxx, yyy) PFNs busy busy
On 12/01/2016 07:21 AM, Robin H. Johnson wrote: > On Wed, Nov 30, 2016 at 10:24:59PM +0100, Vlastimil Babka wrote: >> [add more CC's] >> >> On 11/30/2016 09:19 PM, Robin H. Johnson wrote: >> > Somewhere in the Radeon/DRM codebase, CMA page allocation has either >> > regressed in the timeline of 4.5->4.9, and/or the drm/radeon code is >> > doing something different with pages. >> >> Could be that it didn't use dma_generic_alloc_coherent() before, or you >> didn't >> have the generic CMA pool configured. > v4.9-rc7-23-gded6e842cf49: > [0.00] cma: Reserved 16 MiB at 0x00083e40 > [0.00] Memory: 32883108K/33519432K available (6752K kernel code, 1244K > rwdata, 4716K rodata, 1772K init, 2720K bss, 619940K reserved, 16384K > cma-reserved) > >> What's the output of "grep CMA" on your >> .config? > > # grep CMA .config |grep -v -e SECMARK= -e CONFIG_BCMA -e CONFIG_USB_HCD_BCMA > -e INPUT_CMA3000 -e CRYPTO_CMAC > CONFIG_CMA=y > # CONFIG_CMA_DEBUG is not set > # CONFIG_CMA_DEBUGFS is not set > CONFIG_CMA_AREAS=7 > CONFIG_DMA_CMA=y > CONFIG_CMA_SIZE_MBYTES=16 > CONFIG_CMA_SIZE_SEL_MBYTES=y > # CONFIG_CMA_SIZE_SEL_PERCENTAGE is not set > # CONFIG_CMA_SIZE_SEL_MIN is not set > # CONFIG_CMA_SIZE_SEL_MAX is not set > CONFIG_CMA_ALIGNMENT=8 > >> Or any kernel boot options with cma in name? > None. > > >> By default config this should not be used on x86. > What do you mean by that statement? I mean that the 16 mbytes for generic CMA area is not a default on x86: config CMA_SIZE_MBYTES int "Size in Mega Bytes" depends on !CMA_SIZE_SEL_PERCENTAGE default 0 if X86 default 16 Which explains why it's rare to see these reports in the context such as yours. I'd recommend just disabling it, as the primary use case for CMA are devices on mobile phones that don't have any other fallback (unlike the dma alloc). > It should be disallowed to enable CONFIG_CMA? Radeon and CMA should be > mutually exclusive? I don't think this is a specific problem of radeon. But looks like it's a heavy user of the dma alloc. There might be others. >> > Given that I haven't seen ANY other reports of this, I'm inclined to >> > believe the problem is drm/radeon specific (if I don't start X, I can't >> > reproduce the problem). >> >> It's rather CMA specific, the allocation attemps just can't be 100% reliable >> due >> to how CMA works. The question is if it should be spewing in the log in the >> context of dma-cma, which has a fallback allocation option. It even uses >> __GFP_NOWARN, perhaps the CMA path should respect that? > Yes, I'd say if there's a fallback without much penalty, nowarn makes > sense. If the fallback just tries multiple addresses until success, then > the warning should only be issued when too many attempts have been made. On the other hand, if the warnings are correlated with high kernel CPU usage, it's arguably better to be warned. >> >> > The rate of the problem starts slow, and also is relatively low on an idle >> > system (my screens blank at night, no xscreensaver running), but it still >> > ramps >> > up over time (to the point of generating 2.5GB/hour of "(timestamp) >> > alloc_contig_range: [83e4d9, 83e4da) PFNs busy"), with various addresses >> > (~100 >> > unique ranges for a day). >> > >> > My X workload is ~50 chrome tabs and ~20 terminals (over 3x 24" monitors >> > w/ 9 >> > virtual desktops per monitor). >> So IIUC, except the messages, everything actually works fine? 
> There's high kernel CPU usage that seems to roughly correlate with the > messages, but I can't yet tell if that's due to the syslog itself, or > repeated alloc_contig_range requests. You could try running perf top.
drm/radeon spamming alloc_contig_range: [xxx, yyy) PFNs busy busy
[add more CC's] On 11/30/2016 09:19 PM, Robin H. Johnson wrote: > Somewhere in the Radeon/DRM codebase, CMA page allocation has either > regressed in the timeline of 4.5->4.9, and/or the drm/radeon code is > doing something different with pages. Could be that it didn't use dma_generic_alloc_coherent() before, or you didn't have the generic CMA pool configured. What's the output of "grep CMA" on your .config? Or any kernel boot options with cma in name? By default config this should not be used on x86. > Given that I haven't seen ANY other reports of this, I'm inclined to > believe the problem is drm/radeon specific (if I don't start X, I can't > reproduce the problem). It's rather CMA specific, the allocation attemps just can't be 100% reliable due to how CMA works. The question is if it should be spewing in the log in the context of dma-cma, which has a fallback allocation option. It even uses __GFP_NOWARN, perhaps the CMA path should respect that? > The rate of the problem starts slow, and also is relatively low on an idle > system (my screens blank at night, no xscreensaver running), but it still > ramps > up over time (to the point of generating 2.5GB/hour of "(timestamp) > alloc_contig_range: [83e4d9, 83e4da) PFNs busy"), with various addresses (~100 > unique ranges for a day). > > My X workload is ~50 chrome tabs and ~20 terminals (over 3x 24" monitors w/ 9 > virtual desktops per monitor). So IIUC, except the messages, everything actually works fine? > I added a stack trace & rate limit to alloc_contig_range's PFNs busy message > (patch in previous email on LKML/-MM lists); and they point to radeon. > > alloc_contig_range: [83f2a3, 83f2a4) PFNs busy > CPU: 3 PID: 8518 Comm: X Not tainted 4.9.0-rc7-00024-g6ad4037e18ec #27 > Hardware name: System manufacturer System Product Name/P8Z68 DELUXE, BIOS > 0501 05/09/2011 > ad50c3d7f730 b236c873 0083f2a3 0083f2a4 > ad50c3d7f810 b2183b38 999dff4d8040 20fca8c0 > 0083f400 0083f000 0083f2a3 0004 > Call Trace: > [] dump_stack+0x85/0xc2 > [] alloc_contig_range+0x368/0x370 > [] cma_alloc+0x127/0x2e0 > [] dma_alloc_from_contiguous+0x38/0x40 > [] dma_generic_alloc_coherent+0x91/0x1d0 > [] x86_swiotlb_alloc_coherent+0x25/0x50 > [] ttm_dma_populate+0x48a/0x9a0 [ttm] > [] ? __kmalloc+0x1b6/0x250 > [] radeon_ttm_tt_populate+0x22a/0x2d0 [radeon] > [] ? ttm_dma_tt_init+0x67/0xc0 [ttm] > [] ttm_tt_bind+0x37/0x70 [ttm] > [] ttm_bo_handle_move_mem+0x528/0x5a0 [ttm] > [] ? shmem_alloc_inode+0x1a/0x30 > [] ttm_bo_validate+0x114/0x130 [ttm] > [] ? _raw_write_unlock+0xe/0x10 > [] ttm_bo_init+0x31d/0x3f0 [ttm] > [] radeon_bo_create+0x19b/0x260 [radeon] > [] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon] > [] radeon_gem_object_create+0xad/0x180 [radeon] > [] radeon_gem_create_ioctl+0x5f/0xf0 [radeon] > [] drm_ioctl+0x21b/0x4d0 [drm] > [] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon] > [] radeon_drm_ioctl+0x4c/0x80 [radeon] > [] do_vfs_ioctl+0x92/0x5c0 > [] SyS_ioctl+0x79/0x90 > [] do_syscall_64+0x73/0x190 > [] entry_SYSCALL64_slow_path+0x25/0x25 > > The Radeon card in my case is a VisionTek HD 7750 Eyefinity 6, which is > reported as: > > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. 
[AMD/ATI] > Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] (prog-if 00 [VGA controller]) > Subsystem: VISIONTEK Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] > Flags: bus master, fast devsel, latency 0, IRQ 58 > Memory at c000 (64-bit, prefetchable) [size=256M] > Memory at fbe0 (64-bit, non-prefetchable) [size=256K] > I/O ports at e000 [size=256] > Expansion ROM at 000c [disabled] [size=128K] > Capabilities: [48] Vendor Specific Information: Len=08 > Capabilities: [50] Power Management version 3 > Capabilities: [58] Express Legacy Endpoint, MSI 00 > Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 > > Capabilities: [150] Advanced Error Reporting > Kernel driver in use: radeon > Kernel modules: radeon, amdgpu >
[PATCH v6v3 02/12] mm: migrate: support non-lru movable page migration
On 05/31/2016 02:01 AM, Minchan Kim wrote: > Per Vlastimil's review comment. > > Thanks for the detailed review, Vlastimil! > If you have another concern, feel free to say. I don't for now :) [...] > Cc: Rik van Riel > Cc: Vlastimil Babka > Cc: Joonsoo Kim > Cc: Mel Gorman > Cc: Hugh Dickins > Cc: Rafael Aquini > Cc: virtualization@lists.linux-foundation.org > Cc: Jonathan Corbet > Cc: John Einar Reitan > Cc: dri-devel@lists.freedesktop.org > Cc: Sergey Senozhatsky > Signed-off-by: Gioh Kim > Signed-off-by: Minchan Kim Acked-by: Vlastimil Babka
[PATCH v6v2 02/12] mm: migrate: support non-lru movable page migration
On 05/30/2016 06:25 PM, Minchan Kim wrote: >>> --- a/mm/compaction.c >>> +++ b/mm/compaction.c >>> @@ -81,6 +81,39 @@ static inline bool migrate_async_suitable(int >>> migratetype) >>> >>> #ifdef CONFIG_COMPACTION >>> >>> +int PageMovable(struct page *page) >>> +{ >>> + struct address_space *mapping; >>> + >>> + VM_BUG_ON_PAGE(!PageLocked(page), page); >>> + if (!__PageMovable(page)) >>> + return 0; >>> + >>> + mapping = page_mapping(page); >>> + if (mapping && mapping->a_ops && mapping->a_ops->isolate_page) >>> + return 1; >>> + >>> + return 0; >>> +} >>> +EXPORT_SYMBOL(PageMovable); >>> + >>> +void __SetPageMovable(struct page *page, struct address_space *mapping) >>> +{ >>> + VM_BUG_ON_PAGE(!PageLocked(page), page); >>> + VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page); >>> + page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE); >>> +} >>> +EXPORT_SYMBOL(__SetPageMovable); >>> + >>> +void __ClearPageMovable(struct page *page) >>> +{ >>> + VM_BUG_ON_PAGE(!PageLocked(page), page); >>> + VM_BUG_ON_PAGE(!PageMovable(page), page); >>> + page->mapping = (void *)((unsigned long)page->mapping & >>> + PAGE_MAPPING_MOVABLE); >>> +} >>> +EXPORT_SYMBOL(__ClearPageMovable); >> >> The second confusing thing is that the function is named >> __ClearPageMovable(), but what it really clears is the mapping >> pointer, >> which is not at all the opposite of what __SetPageMovable() does. >> >> I know it's explained in the documentation, but it also deserves a >> comment here so it doesn't confuse everyone who looks at it. >> Even better would be a less confusing name for the function, but I >> can't offer one right now. > > To me, __ClearPageMovable naming is suitable for user POV. > It effectively makes the page unmovable. The confusion is just caused > by the implementation and I don't prefer exported API depends on the > implementation. So I want to add just comment. > > I didn't add comment above the function because I don't want to export > internal implementation to the user. I think they don't need to know it. > > index a7df2ae71f2a..d1d2063b4fd9 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -108,6 +108,11 @@ void __ClearPageMovable(struct page *page) > { > VM_BUG_ON_PAGE(!PageLocked(page), page); > VM_BUG_ON_PAGE(!PageMovable(page), page); > + /* > +* Clear registered address_space val with keeping > PAGE_MAPPING_MOVABLE > +* flag so that VM can catch up released page by driver after > isolation. > +* With it, VM migration doesn't try to put it back. > +*/ > page->mapping = (void *)((unsigned long)page->mapping & > PAGE_MAPPING_MOVABLE); OK, that's fine! >> >>> diff --git a/mm/util.c b/mm/util.c >>> index 917e0e3d0f8e..b756ee36f7f0 100644 >>> --- a/mm/util.c >>> +++ b/mm/util.c >>> @@ -399,10 +399,12 @@ struct address_space *page_mapping(struct page *page) >>> } >>> >>> mapping = page->mapping; >> >> I'd probably use READ_ONCE() here to be safe. Not all callers are >> under page lock? > > I don't understand. Yeah, all caller are not under page lock but at least, > new user of movable pages should call it under page_lock. > Yeah, I will write the rule down in document. > In this case, what kinds of problem do you see? After more thinking, probably none. It wouldn't prevent any extra races. Sorry for the noise.
[PATCH v6v2 02/12] mm: migrate: support non-lru movable page migration
On 05/30/2016 03:39 AM, Minchan Kim wrote: > After isolation, VM calls migratepage of driver with isolated page. > The function of migratepage is to move content of the old page to new page > and set up fields of struct page newpage. Keep in mind that you should > clear PG_movable of oldpage via __ClearPageMovable under page_lock if you > migrated the oldpage successfully and returns 0. This "clear PG_movable" is one of the reasons I was confused about what __ClearPageMovable() really does. There's no actual "PG_movable" page flag and the function doesn't clear even the actual mapping flag :) Also same thing in the Documentation/ part. Something like "... you should indicate to the VM that the oldpage is no longer movable via __ClearPageMovable() ..."? > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -81,6 +81,39 @@ static inline bool migrate_async_suitable(int migratetype) > > #ifdef CONFIG_COMPACTION > > +int PageMovable(struct page *page) > +{ > + struct address_space *mapping; > + > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + if (!__PageMovable(page)) > + return 0; > + > + mapping = page_mapping(page); > + if (mapping && mapping->a_ops && mapping->a_ops->isolate_page) > + return 1; > + > + return 0; > +} > +EXPORT_SYMBOL(PageMovable); > + > +void __SetPageMovable(struct page *page, struct address_space *mapping) > +{ > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page); > + page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE); > +} > +EXPORT_SYMBOL(__SetPageMovable); > + > +void __ClearPageMovable(struct page *page) > +{ > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE(!PageMovable(page), page); > + page->mapping = (void *)((unsigned long)page->mapping & > + PAGE_MAPPING_MOVABLE); > +} > +EXPORT_SYMBOL(__ClearPageMovable); The second confusing thing is that the function is named __ClearPageMovable(), but what it really clears is the mapping pointer, which is not at all the opposite of what __SetPageMovable() does. I know it's explained in the documentation, but it also deserves a comment here so it doesn't confuse everyone who looks at it. Even better would be a less confusing name for the function, but I can't offer one right now. > diff --git a/mm/util.c b/mm/util.c > index 917e0e3d0f8e..b756ee36f7f0 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -399,10 +399,12 @@ struct address_space *page_mapping(struct page *page) > } > > mapping = page->mapping; I'd probably use READ_ONCE() here to be safe. Not all callers are under page lock? > - if ((unsigned long)mapping & PAGE_MAPPING_FLAGS) > + if ((unsigned long)mapping & PAGE_MAPPING_ANON) > return NULL; > - return mapping; > + > + return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); > } > +EXPORT_SYMBOL(page_mapping); > > /* Slow path of page_mapcount() for compound pages */ > int __page_mapcount(struct page *page) >
[PATCH v6 02/12] mm: migrate: support non-lru movable page migration
On 05/30/2016 03:33 AM, Minchan Kim wrote: >> >> >>> + page->mapping = (void *)((unsigned long)page->mapping & >>> + PAGE_MAPPING_MOVABLE); >> >> This should be negated to clear... use ~PAGE_MAPPING_MOVABLE ? > > No. > > The intention is to clear only mapping value but PAGE_MAPPING_MOVABLE > flag. So, any new migration trial will be failed because PageMovable > checks page's mapping value but ongoing migraion handling can catch > whether it's movable page or not with the type bit. Oh, OK, I got that wrong. I'll point out in the reply to the v6v2 what misled me :) >> >> So this effectively prevents movable compound pages from being >> migrated. Are you sure no users of this functionality are going to >> have compound pages? I assumed that they could, and so made the code >> like this, with the is_lru variable (which is redundant after your >> change). > > This implementation at the moment disables effectively non-lru compound > page migration but I'm not a god so I can't make sure no one doesn't want > it in future. If someone want it, we can support it then because this work > doesn't prevent it by design. Oh well. As long as the balloon pages or zsmalloc don't already use compound pages... > > I thouht PageCompound check right before isolate_movable_page in > isolate_migratepages_block will filter it out mostly but yeah > it is racy without zone->lru_lock so it could reach to isolate_movable_page. > However, PageMovable check in there investigates mapping, mapping->a_ops, > and a_ops->isolate_page to verify whether it's movable page or not. > > I thought it's sufficient to filter THP page. I guess, yeah. >> >> [...] >> >>> @@ -755,33 +844,69 @@ static int move_to_new_page(struct page *newpage, >>> struct page *page, >>> enum migrate_mode mode) >>> { >>> struct address_space *mapping; >>> - int rc; >>> + int rc = -EAGAIN; >>> + bool is_lru = !__PageMovable(page); >>> >>> VM_BUG_ON_PAGE(!PageLocked(page), page); >>> VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); >>> >>> mapping = page_mapping(page); >>> - if (!mapping) >>> - rc = migrate_page(mapping, newpage, page, mode); >>> - else if (mapping->a_ops->migratepage) >>> - /* >>> -* Most pages have a mapping and most filesystems provide a >>> -* migratepage callback. Anonymous pages are part of swap >>> -* space which also has its own migratepage callback. This >>> -* is the most common path for page migration. >>> -*/ >>> - rc = mapping->a_ops->migratepage(mapping, newpage, page, mode); >>> - else >>> - rc = fallback_migrate_page(mapping, newpage, page, mode); >>> + /* >>> +* In case of non-lru page, it could be released after >>> +* isolation step. In that case, we shouldn't try >>> +* fallback migration which is designed for LRU pages. >>> +*/ >> >> Hmm but is_lru was determined from !__PageMovable() above, also well >> after the isolation step. So if the driver already released it, we >> wouldn't detect it? And this function is all under same page lock, >> so if __PageMovable was true above, so will be PageMovable below? > > You are missing what I mentioned above. > We should keep the type bit to catch what you are saying(i.e., driver > already released). > > __PageMovable just checks PAGE_MAPPING_MOVABLE flag and PageMovable > checks page->mapping valid while __ClearPageMovable reset only > valid vaule of mapping, not PAGE_MAPPING_MOVABLE flag. > > I wrote it down in Documentation/vm/page_migration. > > "For testing of non-lru movable page, VM supports __PageMovable function. 
> However, it doesn't guarantee to identify non-lru movable page because > page->mapping field is unified with other variables in struct page. > As well, if driver releases the page after isolation by VM, page->mapping > doesn't have stable value although it has PAGE_MAPPING_MOVABLE > (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether > page is LRU or non-lru movable once the page has been isolated. Because > LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also > good for just peeking to test non-lru movable pages before more expensive > checking with lock_page in pfn scanning to select victim. > > For guaranteeing non-lru movable page, VM provides PageMovable function. > Unlike __PageMovable, PageMovable functions validates page->mapping and > mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden > destroying of page->mapping. > > Driver using __SetPageMovable should clear the flag via __ClearMovablePage > under page_lock before the releasing the page." Right, I get it now. >>> + if (!((unsigned long)page->mapping & PAGE_MAPPING_FLAGS)) >>> page->mapping = NULL; >> >> The two lines above make little sense to me without a comment. > > I folded this. > > @@ -901,7 +901,12 @@ st
[PATCH v6 02/12] mm: migrate: support non-lru movable page migration
On 05/20/2016 04:23 PM, Minchan Kim wrote: > We have allowed migration for only LRU pages until now and it was > enough to make high-order pages. But recently, embedded system(e.g., > webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory) > so we have seen several reports about troubles of small high-order > allocation. For fixing the problem, there were several efforts > (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page, > reserved memory, vmalloc and so on) but if there are lots of > non-movable pages in system, their solutions are void in the long run. > > So, this patch is to support facility to change non-movable pages > with movable. For the feature, this patch introduces functions related > to migration to address_space_operations as well as some page flags. > > If a driver want to make own pages movable, it should define three functions > which are function pointers of struct address_space_operations. > > 1. bool (*isolate_page) (struct page *page, isolate_mode_t mode); > > What VM expects on isolate_page function of driver is to return *true* > if driver isolates page successfully. On returing true, VM marks the page > as PG_isolated so concurrent isolation in several CPUs skip the page > for isolation. If a driver cannot isolate the page, it should return *false*. > > Once page is successfully isolated, VM uses page.lru fields so driver > shouldn't expect to preserve values in that fields. > > 2. int (*migratepage) (struct address_space *mapping, > struct page *newpage, struct page *oldpage, enum migrate_mode); > > After isolation, VM calls migratepage of driver with isolated page. > The function of migratepage is to move content of the old page to new page > and set up fields of struct page newpage. Keep in mind that you should > clear PG_movable of oldpage via __ClearPageMovable under page_lock if you > migrated the oldpage successfully and returns 0. > If driver cannot migrate the page at the moment, driver can return -EAGAIN. > On -EAGAIN, VM will retry page migration in a short time because VM interprets > -EAGAIN as "temporal migration failure". On returning any error except > -EAGAIN, > VM will give up the page migration without retrying in this time. > > Driver shouldn't touch page.lru field VM using in the functions. > > 3. void (*putback_page)(struct page *); > > If migration fails on isolated page, VM should return the isolated page > to the driver so VM calls driver's putback_page with migration failed page. > In this function, driver should put the isolated page back to the own data > structure. > > 4. non-lru movable page flags > > There are two page flags for supporting non-lru movable page. > > * PG_movable > > Driver should use the below function to make page movable under page_lock. > > void __SetPageMovable(struct page *page, struct address_space *mapping) > > It needs argument of address_space for registering migration family functions > which will be called by VM. Exactly speaking, PG_movable is not a real flag of > struct page. Rather than, VM reuses page->mapping's lower bits to represent > it. > > #define PAGE_MAPPING_MOVABLE 0x2 > page->mapping = page->mapping | PAGE_MAPPING_MOVABLE; Interesting, let's see how that works out... Overal this looks much better than the last version I checked! [...] > @@ -357,29 +360,37 @@ PAGEFLAG(Idle, idle, PF_ANY) > * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. 
> * > * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled, > - * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit; > - * and then page->mapping points, not to an anon_vma, but to a private > + * the PAGE_MAPPING_MOVABLE bit may be set along with the PAGE_MAPPING_ANON > + * bit; and then page->mapping points, not to an anon_vma, but to a private > * structure which KSM associates with that merged page. See ksm.h. > * > - * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used. > + * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable > + * page and then page->mapping points a struct address_space. > * > * Please note that, confusingly, "page_mapping" refers to the inode > * address_space which maps the page from disk; whereas "page_mapped" > * refers to user virtual address space into which the page is mapped. > */ > -#define PAGE_MAPPING_ANON1 > -#define PAGE_MAPPING_KSM 2 > -#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM) > +#define PAGE_MAPPING_ANON0x1 > +#define PAGE_MAPPING_MOVABLE 0x2 > +#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) > +#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) > > -static __always_inline int PageAnonHead(struct page *page) > +static __always_inline int PageMappingFlag(struct page *page) PageMappingFlags()? [...] > diff --git a/mm/compaction.c b/mm/compaction.c > index 1427366ad673..2
[PATCH v3 02/16] mm/compaction: support non-lru movable page migration
On 04/04/2016 07:12 AM, Minchan Kim wrote: > On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote: >> Might have been better as a separate migration patch and then a >> compaction patch. It's prefixed mm/compaction, but most changed are >> in mm/migrate.c > > Indeed. The title is rather misleading but not sure it's a good idea > to separate compaction and migration part. Guess it's better to see the new functions together with its user after all, OK. > I will just resend to change the tile from "mm/compaction" to > "mm/migration". OK! >> Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that >> page->mapping->a_ops->isolate_page exists for PageMovable() pages. >> What if it's a false positive on a PG_reclaim page? Can we rely on >> PG_reclaim always (and without races) implying PageLRU() so that we >> don't even attempt isolate_movable_page()? > > For now, we shouldn't have such a false positive because PageMovable > checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable > under PG_lock. > > But I read your question about user-mapped drvier pages so we cannot > use _mapcount anymore so I will find another thing. A option is this. > > static inline int PageMovable(struct page *page) > { > int ret = 0; > struct address_space *mapping; > struct address_space_operations *a_op; > > if (!test_bit(PG_movable, &(page->flags)) > goto out; > > mapping = page->mapping; > if (!mapping) > goto out; > > a_op = mapping->a_op; > if (!aop) > goto out; > if (a_op->isolate_page) > ret = 1; > out: > return ret; > > } > > It works under PG_lock but with this, we need trylock_page to peek > whether it's movable non-lru or not for scanning pfn. Hm I hoped that with READ_ONCE() we could do the peek safely without trylock_page, if we use it only as a heuristic. But I guess it would require at least RCU-level protection of the page->mapping->a_op->isolate_page chain. > For avoiding that, we need another function to peek which just checks > PG_movable bit instead of all things. > > > /* > * If @page_locked is false, we cannot guarantee page->mapping's stability > * so just the function checks with PG_movable which could be false positive > * so caller should check it again under PG_lock to check > a_ops->isolate_page. > */ > static inline int PageMovable(struct page *page, bool page_locked) > { > int ret = 0; > struct address_space *mapping; > struct address_space_operations *a_op; > > if (!test_bit(PG_movable, &(page->flags)) > goto out; > > if (!page_locked) { > ret = 1; > goto out; > } > > mapping = page->mapping; > if (!mapping) > goto out; > > a_op = mapping->a_op; > if (!aop) > goto out; > if (a_op->isolate_page) > ret = 1; > out: > return ret; > } I wouldn't put everything into single function, but create something like __PageMovable() just for the unlocked peek. Unlike the zone->lru_lock, we don't keep page_lock() across iterations in isolate_migratepages_block(), as obviously each page has different lock. So the page_locked parameter would be always passed as constant, and at that point it's better to have separate functions. So I guess the question is how many false positives from overlap with PG_reclaim the scanner will hit if we give up on PAGE_MOVABLE_MAPCOUNT_VALUE, as that will increase number of page locks just to realize that it's not actual PageMovable() page... > Thanks for detail review, Vlastimil! > I will resend new versions after vacation in this week. You're welcome, great!
[PATCH v3 02/16] mm/compaction: support non-lru movable page migration
Might have been better as a separate migration patch and then a compaction patch. It's prefixed mm/compaction, but most changed are in mm/migrate.c On 03/30/2016 09:12 AM, Minchan Kim wrote: > We have allowed migration for only LRU pages until now and it was > enough to make high-order pages. But recently, embedded system(e.g., > webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory) > so we have seen several reports about troubles of small high-order > allocation. For fixing the problem, there were several efforts > (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page, > reserved memory, vmalloc and so on) but if there are lots of > non-movable pages in system, their solutions are void in the long run. > > So, this patch is to support facility to change non-movable pages > with movable. For the feature, this patch introduces functions related > to migration to address_space_operations as well as some page flags. > > Basically, this patch supports two page-flags and two functions related > to page migration. The flag and page->mapping stability are protected > by PG_lock. > > PG_movable > PG_isolated > > bool (*isolate_page) (struct page *, isolate_mode_t); > void (*putback_page) (struct page *); > > Duty of subsystem want to make their pages as migratable are > as follows: > > 1. It should register address_space to page->mapping then mark > the page as PG_movable via __SetPageMovable. > > 2. It should mark the page as PG_isolated via SetPageIsolated > if isolation is sucessful and return true. Ah another thing to document (especially in the comments/Doc) is that the subsystem must not expect anything to survive in page.lru (or fields that union it) after having isolated successfully. > 3. If migration is successful, it should clear PG_isolated and > PG_movable of the page for free preparation then release the > reference of the page to free. > > 4. If migration fails, putback function of subsystem should > clear PG_isolated via ClearPageIsolated. > > 5. If a subsystem want to release isolated page, it should > clear PG_isolated but not PG_movable. Instead, VM will do it. Under lock? Or just with ClearPageIsolated? > Cc: Vlastimil Babka > Cc: Mel Gorman > Cc: Hugh Dickins > Cc: dri-devel at lists.freedesktop.org > Cc: virtualization at lists.linux-foundation.org > Signed-off-by: Gioh Kim > Signed-off-by: Minchan Kim > --- > Documentation/filesystems/Locking | 4 + > Documentation/filesystems/vfs.txt | 5 + > fs/proc/page.c | 3 + > include/linux/fs.h | 2 + > include/linux/migrate.h| 2 + > include/linux/page-flags.h | 31 ++ > include/uapi/linux/kernel-page-flags.h | 1 + > mm/compaction.c| 14 ++- > mm/migrate.c | 174 > + > 9 files changed, 217 insertions(+), 19 deletions(-) > > diff --git a/Documentation/filesystems/Locking > b/Documentation/filesystems/Locking > index 619af9bfdcb3..0bb79560abb3 100644 > --- a/Documentation/filesystems/Locking > +++ b/Documentation/filesystems/Locking > @@ -195,7 +195,9 @@ unlocks and drops the reference. 
> int (*releasepage) (struct page *, int); > void (*freepage)(struct page *); > int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset); > + bool (*isolate_page) (struct page *, isolate_mode_t); > int (*migratepage)(struct address_space *, struct page *, struct page > *); > + void (*putback_page) (struct page *); > int (*launder_page)(struct page *); > int (*is_partially_uptodate)(struct page *, unsigned long, unsigned > long); > int (*error_remove_page)(struct address_space *, struct page *); > @@ -219,7 +221,9 @@ invalidatepage: yes > releasepage:yes > freepage: yes > direct_IO: > +isolate_page:yes > migratepage:yes (both) > +putback_page:yes > launder_page: yes > is_partially_uptodate: yes > error_remove_page: yes > diff --git a/Documentation/filesystems/vfs.txt > b/Documentation/filesystems/vfs.txt > index b02a7d598258..4c1b6c3b4bc8 100644 > --- a/Documentation/filesystems/vfs.txt > +++ b/Documentation/filesystems/vfs.txt > @@ -592,9 +592,14 @@ struct address_space_operations { > int (*releasepage) (struct page *, int); > void (*freepage)(struct page *); > ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t > offset); > + /* isolate a page for migration */ > + bool (*isola
[PATCH 3/6] drm: Fix drm_vblank_pre/post_modeset regression from Linux 4.4
On 02/12/2016 08:30 PM, Mario Kleiner wrote: > Changes to drm_update_vblank_count() in Linux 4.4 broke the > behaviour of the pre/post modeset functions as the new update > code doesn't deal with hw vblank counter resets inbetween calls > to drm_vblank_pre_modeset an drm_vblank_post_modeset, as it > should. > > This causes mistreatment of such hw counter resets as counter > wraparound, and thereby large forward jumps of the software > vblank counter which in turn cause vblank event dispatching > and vblank waits to fail/hang --> userspace clients hang. > > This symptom was reported on radeon-kms to cause a infinite > hang of KDE Plasma 5 shell's login procedure, preventing users > from logging in. > > Fix this by detecting when drm_update_vblank_count() is called > inside a pre->post modeset interval. If so, clamp valid vblank > increments to the safe values 0 and 1, pretty much restoring > the update behavior of the old update code of Linux 4.3 and > earlier. Also reset the last recorded hw vblank count at call > to drm_vblank_post_modeset() to be safe against hw that after > modesetting, dpms on etc. only fires its first vblank irq after > drm_vblank_post_modeset() was already called. > > Reported-by: Vlastimil Babka > Signed-off-by: Mario Kleiner > Reviewed-by: Daniel Vetter > Tested-by: Vlastimil Babka FWIW I have applied patches 1-4 of this v2 to 4.4.2 and it works fine so far on my radeon. Patch 6/6 was for nouveau so I ignored it, and 5/6 didn't have stable tag (and from the description I couldn't decide if I should include it or not). Thanks, Vlastimil
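For readers following along without the diff: the heart of the fix as described above amounts to clamping the computed counter increment while a crtc sits between drm_vblank_pre_modeset() and drm_vblank_post_modeset(). The sketch below is my paraphrase, not the actual drm_update_vblank_count() change, and the exact condition in the patch may differ.

#include <linux/types.h>

/* Sketch only: while in the pre/post modeset interval, a hw counter
 * reset must not be misread as a 24-bit wraparound, so trust at most
 * one new vblank per update. */
static u32 clamp_vblank_diff_in_modeset(u32 diff, bool in_modeset)
{
	if (in_modeset && diff > 1)
		return 1;
	return diff;
}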
[PATCH 3/6] drm: Fix drm_vblank_pre/post_modeset regression from Linux 4.4
On 02/08/2016 02:13 AM, Mario Kleiner wrote: > Changes to drm_update_vblank_count() in Linux 4.4 broke the > behaviour of the pre/post modeset functions as the new update > code doesn't deal with hw vblank counter resets inbetween calls > to drm_vblank_pre_modeset an drm_vblank_post_modeset, as it > should. > > This causes mistreatment of such hw counter resets as counter > wraparound, and thereby large forward jumps of the software > vblank counter which in turn cause vblank event dispatching > and vblank waits to fail/hang --> userspace clients hang. > > This symptom was reported on radeon-kms to cause a infinite > hang of KDE Plasma 5 shell's login procedure, preventing users > from logging in. > > Fix this by detecting when drm_update_vblank_count() is called > inside a pre->post modeset interval. If so, clamp valid vblank > increments to the safe values 0 and 1, pretty much restoring > the update behavior of the old update code of Linux 4.3 and > earlier. Also reset the last recorded hw vblank count at call > to drm_vblank_post_modeset() to be safe against hw that after > modesetting, dpms on etc. only fires its first vblank irq after > drm_vblank_post_modeset() was already called. > > Reported-by: Vlastimil Babka FWIW, I've applied the whole patchset to 4.4 and the kde5 login problem didn't occur. I can test the next version too. Thanks, Vlastimil
[PATCH 1/2] drm/radeon: Use drm_vblank_off/on to fix vblank counter trouble.
On 02/08/2016 02:58 AM, Mario Kleiner wrote: Now i just need to actually code and test it first. >>> >>> Ping, any news? :) >>> > > Ok, so that series "drm vblank regression fixes for Linux 4.4+" i just > sent out should hopefully fix this bug and related bugs. Thanks, I'll test on Wed hopefully when I'm at the affected desktop. > > thanks, > -mario >
[PATCH 1/2] drm/radeon: Use drm_vblank_off/on to fix vblank counter trouble.
On 01/22/2016 06:08 PM, Mario Kleiner wrote: > Anyway, some more hours of thinking and code browsing later, now i think > i have a simple and safe solution which should hopefully restore the > drm_vblank_pre/post_modeset behaviour with only a few lines of core > code. At the same time it should fix up another bug in that new > drm_update_vblank_count code that i just realized, in a way simple > enough for a stable fix. > > Now i just need to actually code and test it first. Ping, any news? :)
[PATCH 1/2] drm/radeon: Use drm_vblank_off/on to fix vblank counter trouble.
Is there a PATCH 2/2 which I can't find, or is the subject wrong? On 01/21/2016 09:16 AM, Mario Kleiner wrote: > The hardware vblank counter of AMD gpu's resets to zero during a > modeset. The new implementation of drm_update_vblank_count() from > commit 4dfd6486 "drm: Use vblank timestamps to guesstimate how > many vblanks were missed", introduced in Linux 4.4, treats that > as a counter wraparound and causes the software vblank counter > to jump forward by a large distance of up to 2^24 counts. This > interacts badly with 32-bit wraparound handling in > drm_handle_vblank_events(), causing that function to no longer > deliver pending vblank events to clients. > > This leads to client hangs especially if clients perform OpenGL > or DRI3/Present animations while a modeset happens and triggers > the hw vblank counter reset. One prominent example is a hang of > KDE Plasma 5's startup progress splash screen during login, making > the KDE session unuseable. > > Another small potential race exists when executing a modeset while > vblank interrupts are enabled or just get enabled: The modeset updates > radeon_crtc->lb_vblank_lead_lines during radeon_display_bandwidth_update, > so if vblank interrupt handling or enable would try to access that variable > multiple times at the wrong moment as part of drm_update_vblank_counter, > while the scanout happens to be within lb_vblank_lead_lines before the > start of vblank, it could cause inconsistent vblank counting and again > trigger a jump of the software vblank counter, causing similar client > hangs. The most easy way to avoid this small race is to not allow > vblank enable or vblank irq's during modeset. > > This patch replaces calls to drm_vblank_pre/post_modeset in the > drivers dpms code with calls to drm_vblank_off/on, as recommended > for drivers with hw counters that reset to zero during modeset. > Those calls disable vblank interrupts during the modeset sequence > and reinitialize vblank counts and timestamps after the modeset > properly, taking hw counter reset into account, thereby fixing > the problem of forward jumping counters. > > During a modeset, calls to drm_vblank_get() will no-op/intentionally > fail, so no vblank events or pageflips can be queued during modesetting. > > Radeons static and dynpm power management uses drm_vblank_get to enable > vblank irqs to synchronize reclocking to start of vblank. If a modeset > would happen in parallel with such a power management action, drm_vblank_get > would be suppressed, sync to vblank wouldn't work and a visual glitch could > happen. However that glitch would hopefully be hidden by the blanking of > the crtc during modeset. A small fix to power management makes sure to > check for this and prevent unbalanced vblank reference counts due to > mismatched drm_vblank_get/put. > > Reported-by: Vlastimil Babka > Signed-off-by: Mario Kleiner FWIW, this seems to work for the kde5 login issue, thanks. Let me know if you need also some specific testing/debug output, or testing another approach if the "drm_vblank_on/off propaganda" is not acceptable :)
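As I understand the approach, the dpms path change boils down to the sketch below; mydrv_crtc_dpms() is made up, while drm_vblank_off()/drm_vblank_on() are the real 4.4-era calls the patch switches to, replacing drm_vblank_pre/post_modeset().

#include <drm/drmP.h>

/* Hypothetical dpms hook showing the replacement the patch describes:
 * drm_vblank_off()/on() instead of drm_vblank_pre/post_modeset(), so
 * vblank counts and timestamps are reinitialized after the hw counter
 * reset and vblank irqs/events are quiesced while the crtc is off. */
static void mydrv_crtc_dpms(struct drm_crtc *crtc, int mode)
{
	struct drm_device *dev = crtc->dev;
	unsigned int pipe = drm_crtc_index(crtc);

	switch (mode) {
	case DRM_MODE_DPMS_ON:
		/* ... enable the crtc ... */
		drm_vblank_on(dev, pipe);	/* was drm_vblank_post_modeset() */
		break;
	case DRM_MODE_DPMS_STANDBY:
	case DRM_MODE_DPMS_SUSPEND:
	case DRM_MODE_DPMS_OFF:
		drm_vblank_off(dev, pipe);	/* was drm_vblank_pre_modeset() */
		/* ... disable the crtc ... */
		break;
	}
}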
linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon
On 01/21/2016 09:28 AM, Mario Kleiner wrote: >> ... just like drm_vblank_pre/post_modeset. That those were broken is a >> regression which needs to be fixed anyway. I don't think switching to >> drm_vblank_on/off is suitable for stable trees. >> >> Looking at Vlastimil's original post again, I'd say the most likely >> culprit is 4dfd6486 ("drm: Use vblank timestamps to guesstimate how many >> vblanks were missed"). >> Yeah, this is what I bisected to. > Yes, i think reverting that one alone would likely fix it by reverting > to the old vblank update logic. Yep I said in the original mail that reverting on top of 4.4 fixed it. Well not just this single commit, but also some patches on top (e.g. radeon and amdgpu adaptations to that commit, IIRC it wouldn't have compiled otherwise). >> >>> Once drm_vblank_off is called, drm_vblank_get will no-op and return an >>> error, so clients can't enable vblank irqs during the modeset - pageflip >>> ioctl and waitvblank ioctl would fail while a modeset happens - >>> hopefully userspace handles this correctly everywhere. >> >> We've fixed xf86-video-ati for this. >> >> >>> I'll hack up a patch for demonstration now. >> >> You're a bit late to that party. :) >> >> http://lists.freedesktop.org/archives/dri-devel/2015-May/083614.html >> http://lists.freedesktop.org/archives/dri-devel/2015-July/086451.html >> >> > > Oops. Just sent out my little (so far untested) creations. Yes, they are > essentially the same as Daniel's patches. The only addition is to also > fix that other potential small race i describe by slightly moving the > xxx_pm_compute_clocks() calls around. And a fix for drm_vblank_get/put > imbalance in radeon_pm if vblank_on/off would be used. Thanks, I'll test. > > -mario >
linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon
On 01/18/2016 11:49 AM, Vlastimil Babka wrote: > On 01/16/2016 05:24 AM, Mario Kleiner wrote: >> >> >> On 01/15/2016 01:26 PM, Ville Syrjälä wrote: >>> On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote: >> >> I'm currently running... >> >> while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done >> >> ... in an endless loop on Linux 4.4 SMP PREEMPT on HD-5770 and so far i >> can't trigger a hang after hundreds of runs. >> >> Does this also hang for you? > > No, test mode seems to be fine. > >> I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" >> should probably give useful info around the time of the hang. > > Attached. Captured by having kdm running, switching to console, running > "dmesg -C ; dmesg -w > /tmp/dmesg", switch to kdm, enter password, see > frozen splashscreen, switch back, terminate dmesg. So somewhere around > the middle there should be where ksplashscreen starts... Hmm this looks suspicious? (!!! mine) [ 538.918990] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=385876589, diff=1, hw=622 hw_last=621 [ 538.918991] [drm:evergreen_irq_process] IH: D2 vblank [ 538.935035] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3 [ 538.935040] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3 [ 538.935041] [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=16808100, diff=1, hw=30885 hw_last=30884 [ 538.935042] [drm:evergreen_irq_process] IH: D1 vblank [ 538.939702] [drm:drm_wait_vblank] waiting on vblank count 385876590, crtc 1 [ 538.939704] [drm:drm_wait_vblank] returning 385876590 to client [ 538.939709] [drm:drm_wait_vblank] waiting on vblank count 385876590, crtc 1 [ 538.939710] [drm:drm_wait_vblank] returning 385876590 to client !!!538.939715] [drm:drm_queue_vblank_event] event on vblank count 385876591, current 385876590, crtc 1 [ 538.944452] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0 [ 538.944453] [drm:drm_wait_vblank] returning 16808101 to client [ 538.944458] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0 [ 538.944460] [drm:drm_wait_vblank] returning 16808101 to client [ 538.944465] [drm:drm_queue_vblank_event] event on vblank count 16808102, current 16808101, crtc 0 [ 538.948210] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0 [ 538.948212] [drm:drm_wait_vblank] returning 16808101 to client [ 538.948222] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0 [ 538.948224] [drm:drm_wait_vblank] returning 16808101 to client [ 538.949589] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0 [ 538.949591] [drm:drm_wait_vblank] returning 16808101 to client [ 538.951238] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start 6 [ 538.951245] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start 7 !!!538.951246] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=385876590, diff=16776597, hw=3 hw_last=622 [ 538.951247] [drm:evergreen_irq_process] IH: D2 vblank [ 538.951746] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 4 [ 538.951752] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 4 [ 538.951753] [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=16808101, diff=1, hw=30886 hw_last=30885 [ 538.951754] [drm:drm_handle_vblank_events] vblank event on 16808102, current 16808102 [ 538.951756] [drm:evergreen_irq_process] IH: D1 vblank [ 538.964570] [drm:radeon_get_vblank_counter_kms] crtc 1: dist 
from vblank start 7 [ 538.964581] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start -1058 [ 538.964583] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=402653187, diff=1, hw=4 hw_last=3 Could it be that the underflow caused some signed logic to misbehave and fail to detect that we passed 385876591? Later we have another such big skip (but this time nothing waits for it I guess): [ 541.337813] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=402653363, diff=16777040, hw=3 hw_last=179 >> Maybe also check XOrg.0.log for (WW) warnings related to flip. > > No such warnings there. > >> thanks, >> -mario >> >> >>>> Thanks, >>>> Vlastimil >>> >
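The two oversized diffs quoted above are exactly what an unsigned modular difference over a 24-bit counter produces when the hardware value steps backwards: 16776597 = 2^24 - 619 (hw 622 -> 3) and 16777040 = 2^24 - 176 (hw 179 -> 3). Below is a minimal standalone sketch of that arithmetic; the 24-bit width is inferred from the log values, and the final "still in the future" test only models the suspected wraparound/signed logic, not the literal drm_update_vblank_count()/event-handling code:

	#include <stdint.h>
	#include <stdio.h>

	#define HW_COUNTER_MASK ((1u << 24) - 1)	/* assumed 24-bit hw counter */

	int main(void)
	{
		uint32_t hw_last = 622, hw = 3;		/* first jump in the log */
		uint32_t diff = (hw - hw_last) & HW_COUNTER_MASK;

		printf("diff = %u\n", diff);		/* prints 16776597 */
		printf("diff = %u\n", (3u - 179u) & HW_COUNTER_MASK);	/* second jump: 16777040 */

		/*
		 * The queued event was for 385876591; after the bogus jump the
		 * software count becomes 385876590 + 16776597 = 402653187, as seen
		 * later in the log.  A wraparound-tolerant completion check that
		 * treats forward distances above 2^23 as "target still in the
		 * future" would then never fire that event -- one plausible reading
		 * of the suspected misbehaviour.
		 */
		uint32_t seq = 385876590u + diff;	/* 402653187 */
		uint32_t target = 385876591u;
		printf("still treated as pending: %d\n", (seq - target) > (1u << 23));
		return 0;
	}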
linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon
On 01/18/2016 11:49 AM, Vlastimil Babka wrote: > On 01/16/2016 05:24 AM, Mario Kleiner wrote: >> I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" >> should probably give useful info around the time of the hang. > > Attached. Captured by having kdm running, switching to console, running > "dmesg -C ; dmesg -w > /tmp/dmesg", switch to kdm, enter password, see > frozen splashscreen, switch back, terminate dmesg. So somewhere around > the middle there should be where ksplashscreen starts... > >> Maybe also check XOrg.0.log for (WW) warnings related to flip. > > No such warnings there. This is how gdb backtraces look like from the 4 threads of ksplashqml that's stuck. Thread 3 seems to be waiting on some response to radeon's ioctl? (gdb) info threads Id Target Id Frame 4Thread 0x7feb296f5700 (LWP 3643) "QXcbEventReader" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 3Thread 0x7feb199f8700 (LWP 3644) "ksplashqml" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 2Thread 0x7feb18ff2700 (LWP 3645) "QQmlThread" 0x7feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84 * 1Thread 0x7feb3b79f8c0 (LWP 3642) "ksplashqml" 0x7feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84 (gdb) bt #0 0x7feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84 #1 0x7feb32509432 in poll (__timeout=-1, __nfds=1, __fds=0x7ffce30ffb50) at /usr/include/bits/poll2.h:46 #2 _xcb_conn_wait (c=c at entry=0x17e25c0, cond=cond at entry=0x7ffce30ffc70, vector=vector at entry=0x0, count=count at entry=0x0) at xcb_conn.c:459 #3 0x7feb3250ad57 in wait_for_reply (c=c at entry=0x17e25c0, request=request at entry=883, e=e at entry=0x7ffce30ffd48) at xcb_in.c:516 #4 0x7feb3250aec1 in xcb_wait_for_reply64 (c=c at entry=0x17e25c0, request=883, e=e at entry=0x7ffce30ffd48) at xcb_in.c:560 #5 0x7feb32b80300 in _XReply (dpy=dpy at entry=0x17e12c0, rep=rep at entry=0x7ffce30ffdc0, extra=extra at entry=0, discard=discard at entry=0) at xcb_io.c:596 #6 0x7feb36eda712 in DRI2GetBuffersWithFormat (dpy=0x17e12c0, drawable=12582924, width=width at entry=0x181e528, height=height at entry=0x181e52c, attachments=0x7ffce30fff10, count=1, outCount=0x7ffce30ffef0) at dri2.c:491 #7 0x7feb36edaa17 in dri2GetBuffersWithFormat (driDrawable=, width=0x181e528, height=0x181e52c, attachments=, count=, out_count=0x7ffce30ffef0, loaderPrivate=0x1fb1290) at dri2_glx.c:900 #8 0x7feb20132618 in dri2_drawable_get_buffers (count=, atts=0x1817da0, drawable=0x1816d20) at dri2.c:213 #9 dri2_allocate_textures (ctx=0x1a453d0, drawable=0x1816d20, statts=0x1817da0, statts_count=2) at dri2.c:407 #10 0x7feb2012f17c in dri_st_framebuffer_validate (stctx=, stfbi=, statts=0x1817da0, count=2, out=0x7ffce3100050) at dri_drawable.c:83 #11 0x7feb2005b5fe in st_framebuffer_validate (stfb=0x1817940, st=st at entry=0x1b11f20) at state_tracker/st_manager.c:200 #12 0x7feb2005c88e in st_api_make_current (stapi=, stctxi=0x1b11f20, stdrawi=0x1816d20, streadi=0x1816d20) at state_tracker/st_manager.c:831 #13 0x7feb2012ecd1 in dri_make_current (cPriv=, driDrawPriv=0x181e500, driReadPriv=0x181e500) at dri_context.c:245 #14 0x7feb2012dcb6 in driBindContext (pcp=, pdp=, prp=) at dri_util.c:531 #15 0x7feb36edc38b in dri2_bind_context (context=0x1a70960, old=, draw=12582924, read=12582924) at dri2_glx.c:160 #16 0x7feb36eb99b7 in MakeContextCurrent (dpy=0x17e12c0, draw=draw at entry=12582924, read=read at entry=12582924, gc_user=0x1a70960) at glxcurrent.c:228 
#17 0x7feb36eb9b3b in glXMakeCurrent (dpy=, draw=draw at entry=12582924, gc=) at glxcurrent.c:262 #18 0x7feb288d9a2d in QGLXContext::makeCurrent (this=0x1a48760, surface=0x1a0ac40) at qglxintegration.cpp:476 #19 0x7feb3a0f8750 in QOpenGLContext::makeCurrent (this=0x18401e0, surface=0x1842d90) at kernel/qopenglcontext.cpp:936 #20 0x7feb3af63aef in QSGGuiThreadRenderLoop::renderWindow (this=this at entry=0x1913f50, window=0x1842d80) at /usr/src/debug/qtdeclarative-opensource-src-5.5.1/src/quick/scenegraph/qsgrenderloop.cpp:341 #21 0x7feb3af64d11 in QSGGuiThreadRenderLoop::event (this=0x1913f50, e=) at /usr/src/debug/qtdeclarative-opensource-src-5.5.1/src/quick/scenegraph/qsgrenderloop.cpp:474 #22 0x7feb39b7fbd9 in QCoreApplication::notify (this=, receiver=, event=) at kernel/qcoreapplication.cpp:1038 #23 0x7feb39b7fcf3 in QCoreApplication::notifyInternal (this=0x7ffce3100740, receiver=0x1913f50, event=event at entry=0x7ffce31004c0) at kernel/qcoreapplication.cpp:965 #24 0x7feb39bd23bd in sendEvent (event=0x7ffce31004c0, receiver=) at ../../src/corelib/kernel/qcoreapplication.h:224 #25 QTimerInfoLis
linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon
On 01/16/2016 05:24 AM, Mario Kleiner wrote: > > > On 01/15/2016 01:26 PM, Ville Syrjälä wrote: >> On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote: > > I'm currently running... > > while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done > > ... in an endless loop on Linux 4.4 SMP PREEMPT on HD-5770 and so far i > can't trigger a hang after hundreds of runs. > > Does this also hang for you? No, test mode seems to be fine. > I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" > should probably give useful info around the time of the hang. Attached. Captured by having kdm running, switching to console, running "dmesg -C ; dmesg -w > /tmp/dmesg", switch to kdm, enter password, see frozen splashscreen, switch back, terminate dmesg. So somewhere around the middle there should be where ksplashscreen starts... > Maybe also check XOrg.0.log for (WW) warnings related to flip. No such warnings there. > thanks, > -mario > > >>> Thanks, >>> Vlastimil >> -- next part -- A non-text attachment was scrubbed... Name: dmesg.gz Type: application/gzip Size: 77399 bytes Desc: not available URL: <http://lists.freedesktop.org/archives/dri-devel/attachments/20160118/1aa30eb5/attachment-0001.bin>
linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon
On 01/15/2016 01:26 PM, Ville Syrjälä wrote: > On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote: >> >> I have suspected that kwin is waiting for some event, but nevertheless >> tried bisecting the kernel between 4.3 and 4.4, which lead to: >> >> # first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use >> vblank timestamps to guesstimate how many vblanks were missed >> >> I can confirm that 4.4 works if I revert the following commits: >> 63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank >> counter/ts for new drm_update_vblank_count() (v3)" >> >> d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank >> counter/ts for new drm_update_vblank_count() (v2)" > > The sha1s don't seem to match what I have, so not sure which kernel tree Hm sorry, I pasted the sha1 of the reverts by mistake. Correct sha1 are: 5b5561b3660db734652fbd02b4b6cbe00434d96b "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)" fa4270d8e0257b4b76f11baa2866f4313d29aaf5 "drm: Don't zero vblank timestamps from the irq handler" 235fabe09b46469adad2c9e4cb0563758155187c "drm: Add DRM_DEBUG_VBL()" 4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to guesstimate how many vblanks were missed" 8e36f9d33c134d5c6448ad65b423a9fd94e045cf "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)" Also, it turns out that the process actually showing the progress is "ksplashqml", not kwin. It survives killing kwin, and restarting kwin just makes it shown on top again, or something. If I force kill ksplashqml instead of kwin, the desktop works including decorations and everything. ksplashqml itself also waits in kernel in poll(). I'll try some of your suggestions, thanks!
linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon
Hi, since kernel 4.4 I'm unable to log in to the kde5 desktop (on openSUSE Tumbleweed). There's a screen with a progress bar showing the startup, which normally fades away after reaching 100%. But with kernel 4.4, the progress gets stuck somewhere between 1/2 and 3/4 (not always the same). Top shows that kwin is using a few % of CPU but mostly sleeps in poll(). When I kill it from another console, I see that everything has actually started up; the progress bar screen was just obscuring it. The windows obviously don't have decorations etc. Starting kwin manually again shows me the progress bar screen at the same position again. I have suspected that kwin is waiting for some event, but nevertheless tried bisecting the kernel between 4.3 and 4.4, which led to: # first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use vblank timestamps to guesstimate how many vblanks were missed I can confirm that 4.4 works if I revert the following commits: 63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)" d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)" 31ace027c9f1f8e0a2b09bbf961e4db7b1f6cf19 "drm: Don't zero vblank timestamps from the irq handler" ac0567a4b132fa66e3edf3f913938af9daf7f916 "drm: Add DRM_DEBUG_VBL()" 4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to guesstimate how many vblanks were missed" All clean reverts, it just needs some fixup on top to use abs() instead of abs64() due to 79211c8ed19c055ca105502c8733800d442a0ae6. Unfortunately I don't know if this is a kernel problem or a kwin problem. I tried to CC maintainers of both; advice on what to try or what info to provide is welcome. The card is "CAICOS" with 1GB memory. Thanks, Vlastimil
[PATCH v7 3/6] mm: Introduce VM_LOCKONFAULT
On 08/26/2015 09:20 AM, Michal Hocko wrote: > On Tue 25-08-15 15:03:00, Eric B Munson wrote: > [...] >> Would you drop your objections to the VMA flag if I drop the portions of >> the patch that expose it to userspace? >> >> The rework to not use the VMA flag is pretty sizeable and is much more >> ugly IMO. I know that you are not wild about using bit 30 of 32 for >> this, but perhaps we can settle on not exporting it to userspace so we >> can reclaim it if we really need it in the future? > > Yes, that would be definitely more acceptable for me. I do understand > that you are not wild about changing mremap behavior. +1
[PATCH v7 3/6] mm: Introduce VM_LOCKONFAULT
On 08/25/2015 03:41 PM, Michal Hocko wrote: > On Fri 21-08-15 14:31:32, Eric B Munson wrote: > [...] >> I am in the middle of implementing lock on fault this way, but I cannot >> see how we will hanlde mremap of a lock on fault region. Say we have >> the following: >> >> addr = mmap(len, MAP_ANONYMOUS, ...); >> mlock(addr, len, MLOCK_ONFAULT); >> ... >> mremap(addr, len, 2 * len, ...) >> >> There is no way for mremap to know that the area being remapped was lock >> on fault so it will be locked and prefaulted by remap. How can we avoid >> this without tracking per vma if it was locked with lock or lock on >> fault? > > Yes mremap is a problem and it is very much similar to mmap(MAP_LOCKED). > It doesn't guarantee the full mlock semantic because it leaves partially > populated ranges behind without reporting any error. Hm, that's right. > Considering the current behavior I do not thing it would be terrible > thing to do what Konstantin was suggesting and populate only the full > ranges in a best effort mode (it is done so anyway) and document the > behavior properly. > " > If the memory segment specified by old_address and old_size is > locked (using mlock(2) or similar), then this lock is maintained > when the segment is resized and/or relocated. As a consequence, > the amount of memory locked by the process may change. > > If the range is already fully populated and the range is > enlarged the new range is attempted to be fully populated > as well to preserve the full mlock semantic but there is no > guarantee this will succeed. Partially populated (e.g. created by > mlock(MLOCK_ONFAULT)) ranges do not have the full mlock semantic > so they are not populated on resize. > " > > So what we have as a result is that partially populated ranges are > preserved and fully populated ones work in the best effort mode the same > way as they are now. > > Does that sound at least remotely reasonably? I'll basically repeat what I said earlier: - mremap scanning existing pte's to figure out the population would slow it down for no good reason - it would be unreliable anyway: - example: was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already - example: was the area not completely populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it fully? I think the first point is a pointless regression for workloads that use just plain mlock() and don't want the onfault semantics. Unless there's some shortcut? Does vma have a counter of how much is populated? (I don't think so?)
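To make the two "example" ambiguities above concrete, here is a hedged userspace sketch. mlock2()/MLOCK_ONFAULT is the interface proposed in this series, so the flag value and the call itself are assumptions (on a libc without mlock2() a raw syscall wrapper would be needed); error handling is omitted. Both sequences end with a fully populated, VM_LOCKED region, so a pte scan in mremap() could not tell which one is supposed to keep lock-on-fault semantics:

	#define _GNU_SOURCE
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MLOCK_ONFAULT
	#define MLOCK_ONFAULT 0x01		/* value used by the proposed patches */
	#endif

	int main(void)
	{
		size_t len = 16 * sysconf(_SC_PAGESIZE);

		/* Case 1: plain mlock() - the kernel prefaults everything. */
		char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		mlock(a, len);

		/* Case 2: lock-on-fault, but the process then touches every page. */
		char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		mlock2(b, len, MLOCK_ONFAULT);	/* proposed syscall */
		memset(b, 0, len);		/* fault in all pages by hand */

		/*
		 * Both areas are now fully populated and VM_LOCKED; only a sticky
		 * per-vma flag can record that 'b' must not be prefaulted when
		 * mremap() later grows it.
		 */
		return 0;
	}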
[PATCH v7 3/6] mm: Introduce VM_LOCKONFAULT
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote: > On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka wrote: >> On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote: >>>> >>>> >>>> I am in the middle of implementing lock on fault this way, but I cannot >>>> see how we will hanlde mremap of a lock on fault region. Say we have >>>> the following: >>>> >>>> addr = mmap(len, MAP_ANONYMOUS, ...); >>>> mlock(addr, len, MLOCK_ONFAULT); >>>> ... >>>> mremap(addr, len, 2 * len, ...) >>>> >>>> There is no way for mremap to know that the area being remapped was lock >>>> on fault so it will be locked and prefaulted by remap. How can we avoid >>>> this without tracking per vma if it was locked with lock or lock on >>>> fault? >>> >>> >>> remap can count filled ptes and prefault only completely populated areas. >> >> >> Does (and should) mremap really prefault non-present pages? Shouldn't it >> just prepare the page tables and that's it? > > As I see mremap prefaults pages when it extends mlocked area. > > Also quote from manpage > : If the memory segment specified by old_address and old_size is locked > : (using mlock(2) or similar), then this lock is maintained when the segment > is > : resized and/or relocated. As a consequence, the amount of memory locked > : by the process may change. Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process aulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?). The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly. >> >>> There might be a problem after failed populate: remap will handle them >>> as lock on fault. In this case we can fill ptes with swap-like non-present >>> entries to remember that fact and count them as should-be-locked pages. >> >> >> I don't think we should strive to have mremap try to fix the inherent >> unreliability of mmap (MAP_POPULATE)? > > I don't think so. MAP_POPULATE works only when mmap happens. > Flag MREMAP_POPULATE might be a good idea. Just for symmetry. Maybe, but please do it as a separate series.
[PATCH v7 3/6] mm: Introduce VM_LOCKONFAULT
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote: >> >> I am in the middle of implementing lock on fault this way, but I cannot >> see how we will hanlde mremap of a lock on fault region. Say we have >> the following: >> >> addr = mmap(len, MAP_ANONYMOUS, ...); >> mlock(addr, len, MLOCK_ONFAULT); >> ... >> mremap(addr, len, 2 * len, ...) >> >> There is no way for mremap to know that the area being remapped was lock >> on fault so it will be locked and prefaulted by remap. How can we avoid >> this without tracking per vma if it was locked with lock or lock on >> fault? > > remap can count filled ptes and prefault only completely populated areas. Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it? > There might be a problem after failed populate: remap will handle them > as lock on fault. In this case we can fill ptes with swap-like non-present > entries to remember that fact and count them as should-be-locked pages. I don't think we should strive to have mremap try to fix the inherent unreliability of mmap (MAP_POPULATE)?
difficult to pinpoint exhaustion of swap between 4.2.0-rc6 and 4.2.0-rc7
On 22.8.2015 6:48, Arthur Marsh wrote: > > > Vlastimil Babka wrote on 21/08/15 21:18: >> On 08/21/2015 01:37 PM, Vlastimil Babka wrote: >>> >>> That, said, looking at the memory values: >>> >>> rc6: Free+Buffers+A/I(Anon)+A/I(File)+Slab = 6769MB >>> rc7: ... = 4714MB >>> >>> That's 2GB unaccounted for. >> >> So one brute-force way to see who allocated those 2GB is to use the >> page_owner debug feature. You need to enable CONFIG_PAGE_OWNER and then >> follow the Usage part of Documentation/vm/page_owner.txt >> If you can do that, please send the sorted_page_owner.txt for rc7 when >> it's semi-nearing the exhausted swap. Then you could start doing a >> comparison run with rc6, but maybe it will be easy to figure from the >> rc7 log already. Thanks. >> > > Documentation/vm/page_owner.txt does not mention the need to do: > > mount -t debugfs none /sys/kernel/debug Ah, right... > Having done that when about 1.5 GiB swap was in use, the output of > sorted_page_owner.txt with the rc7+ kernel starts with: > > 699487 times: > Page allocated via order 0, mask 0x280da > [] __alloc_pages_nodemask+0x1de/0xb00 > [] handle_mm_fault+0x11bb/0x1480 > [] __do_page_fault+0x178/0x480 > [] do_page_fault+0x2b/0x40 > [] page_fault+0x28/0x30 > [] 0x That's userspace, that's fine. > 457823 times: > Page allocated via order 0, mask 0x202d0 > [] __alloc_pages_nodemask+0x1de/0xb00 > [] dma_generic_alloc_coherent+0xa4/0xf0 > [] x86_swiotlb_alloc_coherent+0x2d/0x60 > [] dma_alloc_attrs+0x4e/0x90 > [] ttm_dma_populate+0x502/0x900 [ttm] > [] radeon_ttm_tt_populate+0x216/0x2b0 [radeon] > [] ttm_tt_bind+0x44/0x80 [ttm] > [] ttm_bo_handle_move_mem+0x3b6/0x440 [ttm] There. 1800MB of present RAM was allocated through ttm/radeon in rc7(+). And apparently that doesn't happen with rc6. The problem is, there were no commits between rc6 and rc7 in drivers/gpu/drm/radeon/ or drivers/gpu/drm/ttm/. I'm CC'ing dri and some radeon devs anyway. Please find the rest of this thread on lkml. [...] > > Also, once when attempting to do: > > cat /sys/kernel/debug/page_owner > page_owner_full.txt > > I received the following error: > > > [18410.829060] cat: page allocation failure: order:5, mode:0x2040d0 > [18410.829068] CPU: 3 PID: 1732 Comm: cat Not tainted 4.2.0-rc7+ #1907 > [18410.829070] Hardware name: System manufacturer System Product > Name/M3A78 PRO, BIOS 170101/27/2011 > [18410.829073] 0005 88001f4d7a58 815a554d > 0034 > [18410.829078] 002040d0 88001f4d7ae8 8115dedc > 8800360b4540 > [18410.829082] 0005 8800360b4540 002040d0 > 88001f4d7bc0 > [18410.829085] Call Trace: > [18410.829091] [] dump_stack+0x4f/0x7b > [18410.829096] [] warn_alloc_failed+0xdc/0x130 > [18410.829099] [] ? > __alloc_pages_direct_compact+0xe8/0xf0 > [18410.829101] [] __alloc_pages_nodemask+0x891/0xb00 > [18410.829104] [] ? __lock_acquire+0xc05/0x1c70 > [18410.829107] [] cache_alloc_refill+0x33b/0x5b0 > [18410.829110] [] ? print_page_owner+0x54/0x350 > [18410.829112] [] __kmalloc+0x1be/0x330 > [18410.829114] [] print_page_owner+0x54/0x350 > [18410.829116] [] ? drain_pages_zone+0x76/0xa0 > [18410.829118] [] ? page_alloc_cpu_notify+0x50/0x50 > [18410.829119] [] ? drain_pages+0x1f/0x60 > [18410.829122] [] ? trace_hardirqs_on_caller+0x136/0x1c0 > [18410.829123] [] ? page_alloc_cpu_notify+0x50/0x50 > [18410.829126] [] ? preempt_count_sub+0x23/0x60 > [18410.829129] [] ? on_each_cpu_mask+0x5f/0xd0 > [18410.829131] [] read_page_owner+0x15f/0x180 > [18410.829134] [] __vfs_read+0x23/0xd0 > [18410.829137] [] ? security_file_permission+0x9b/0xc0 > [18410.829139] [] ? 
rw_verify_area+0x4a/0xe0 > [18410.829141] [] vfs_read+0x8d/0x140 > [18410.829143] [] ? lockdep_sys_exit+0x1/0x90 > [18410.829146] [] SyS_read+0x4d/0xb0 > [18410.829149] [] entry_SYSCALL_64_fastpath+0x12/0x76 > [18410.829151] Mem-Info: > [18410.829157] active_anon:715055 inactive_anon:205953 isolated_anon:15 > active_file:215967 inactive_file:199708 isolated_file:0 > unevictable:5132 dirty:4186 writeback:5030 unstable:0 > slab_reclaimable:49019 slab_unreclaimable:28035 > mapped:168002 shmem:124895 pagetables:20296 bounce:0 > free:14378 free_pcp:127 free_cma:0 > [18410.829164] DMA free:15872kB min:20kB low:24kB high:28kB > active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB > unevictable:0kB isolated(anon):0kB i
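(As a quick sanity check on the ~1800MB figure above: assuming 4 KiB pages, the 457823 order-0 allocations attributed to the ttm/radeon stack amount to 457823 * 4 KiB, roughly 1788 MiB.)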
[PATCH v7 3/6] mm: Introduce VM_LOCKONFAULT
On 08/19/2015 11:33 PM, Eric B Munson wrote: > On Wed, 12 Aug 2015, Michal Hocko wrote: > >> On Sun 09-08-15 01:22:53, Eric B Munson wrote: >> >> I do not like this very much to be honest. We have only few bits >> left there and it seems this is not really necessary. I thought that >> LOCKONFAULT acts as a modifier to the mlock call to tell whether to >> poppulate or not. The only place we have to persist it is >> mlockall(MCL_FUTURE) AFAICS. And this can be handled by an additional >> field in the mm_struct. This could be handled at __mm_populate level. >> So unless I am missing something this would be much more easier >> in the end we no new bit in VM flags would be necessary. >> >> This would obviously mean that the LOCKONFAULT couldn't be exported to >> the userspace but is this really necessary? > > Sorry for the latency here, I was on vacation and am now at plumbers. > > I am not sure that growing the mm_struct by another flags field instead > of using available bits in the vm_flags is the right choice. I was making the same objection on one of the earlier versions and since you sticked with a new vm flag, I thought it doesn't matter, as we could change it later if we run out of bits. But now I realize that since you export this difference to userspace (and below you say that it's by request), we won't be able to change it later. So it's a more difficult choice. > After this > patch, we still have 3 free bits on 32 bit architectures (2 after the > userfaultfd set IIRC). The group which asked for this feature here > wants the ability to distinguish between LOCKED and LOCKONFAULT regions > and without the VMA flag there isn't a way to do that. > > Do we know that these last two open flags are needed right now or is > this speculation that they will be and that none of the other VMA flags > can be reclaimed? I think it's the latter, we can expect that flags will be added rather than removed, as removal is hard or impossible.
[PATCH V6 3/6] mm: Introduce VM_LOCKONFAULT
On 07/29/2015 05:42 PM, Eric B Munson wrote: > The cost of faulting in all memory to be locked can be very high when > working with large mappings. If only portions of the mapping will be > used this can incur a high penalty for locking. > > For the example of a large file, this is the usage pattern for a large > statical language model (probably applies to other statical or graphical > models as well). For the security example, any application transacting > in data that cannot be swapped out (credit card data, medical records, > etc). > > This patch introduces the ability to request that pages are not > pre-faulted, but are placed on the unevictable LRU when they are finally > faulted in. The VM_LOCKONFAULT flag will be used together with > VM_LOCKED and has no effect when set without VM_LOCKED. Setting the > VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to > be added to the unevictable LRU when they are faulted or if they are > already present, but will not cause any missing pages to be faulted in. > > Exposing this new lock state means that we cannot overload the meaning > of the FOLL_POPULATE flag any longer. Prior to this patch it was used > to mean that the VMA for a fault was locked. This means we need the > new FOLL_MLOCK flag to communicate the locked state of a VMA. > FOLL_POPULATE will now only control if the VMA should be populated and > in the case of VM_LOCKONFAULT, it will not be set. > > Signed-off-by: Eric B Munson > Acked-by: Kirill A. Shutemov > Cc: Michal Hocko > Cc: Vlastimil Babka > Cc: Jonathan Corbet > Cc: "Kirill A. Shutemov" > Cc: linux-kernel at vger.kernel.org > Cc: dri-devel at lists.freedesktop.org > Cc: linux-mm at kvack.org > Cc: linux-api at vger.kernel.org > --- > drivers/gpu/drm/drm_vm.c | 8 +++- > fs/proc/task_mmu.c | 1 + > include/linux/mm.h | 2 ++ > kernel/fork.c| 2 +- > mm/debug.c | 1 + > mm/gup.c | 10 -- > mm/huge_memory.c | 2 +- > mm/hugetlb.c | 4 ++-- > mm/mlock.c | 2 +- > mm/mmap.c| 2 +- > mm/rmap.c| 4 ++-- > 11 files changed, 27 insertions(+), 11 deletions(-) > > diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c > index aab49ee..103a5f6 100644 > --- a/drivers/gpu/drm/drm_vm.c > +++ b/drivers/gpu/drm/drm_vm.c > @@ -699,9 +699,15 @@ int drm_vma_info(struct seq_file *m, void *data) > (void *)(unsigned long)virt_to_phys(high_memory)); > > list_for_each_entry(pt, &dev->vmalist, head) { > + char lock_flag = '-'; > + > vma = pt->vma; > if (!vma) > continue; > + if (vma->vm_flags & VM_LOCKONFAULT) > + lock_flag = 'f'; > + else if (vma->vm_flags & VM_LOCKED) > + lock_flag = 'l'; > seq_printf(m, > "\n%5d 0x%pK-0x%pK %c%c%c%c%c%c 0x%08lx000", > pt->pid, > @@ -710,7 +716,7 @@ int drm_vma_info(struct seq_file *m, void *data) > vma->vm_flags & VM_WRITE ? 'w' : '-', > vma->vm_flags & VM_EXEC ? 'x' : '-', > vma->vm_flags & VM_MAYSHARE ? 's' : 'p', > -vma->vm_flags & VM_LOCKED ? 'l' : '-', > +lock_flag, > vma->vm_flags & VM_IO ? 'i' : '-', > vma->vm_pgoff); > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index ca1e091..38d69fc 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -579,6 +579,7 @@ static void show_smap_vma_flags(struct seq_file *m, > struct vm_area_struct *vma) This function has the following comment: Don't forget to update Documentation/ on changes. [...] 
> --- a/mm/gup.c > +++ b/mm/gup.c > @@ -92,7 +92,7 @@ retry: >*/ > mark_page_accessed(page); > } > - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) { > + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { > /* >* The preliminary mapping check is mainly to avoid the >* pointless overhead of lock_page on the ZERO_PAGE > @@ -265,6 +265,9 @@ static int faultin_page(struct task_struct *tsk, struct > vm_area_struct *vma, > unsigned int fa
[PATCH 4/4] mm: remove direct calling of migration
On 07/13/2015 10:35 AM, Gioh Kim wrote: > From: Gioh Kim > > Migration is completely generalized so that migrating mobile page > is processed with lru-pages in move_to_new_page. > > Signed-off-by: Gioh Kim > Acked-by: Rafael Aquini Why not just fold this to Patch 3? You already modify this hunk there, and prior to patch 3, the hunk was balloon-pages specific. You made it look generic only to remove it, which is unneeded code churn and I don't think it adds anything wrt e.g. bisectability. > --- > mm/migrate.c | 15 --- > 1 file changed, 15 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 53f0081d..e6644ac 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -844,21 +844,6 @@ static int __unmap_and_move(struct page *page, struct > page *newpage, > } > } > > - if (unlikely(mobile_page(page))) { > - /* > - * A mobile page does not need any special attention from > - * physical to virtual reverse mapping procedures. > - * Skip any attempt to unmap PTEs or to remap swap cache, > - * in order to avoid burning cycles at rmap level, and perform > - * the page migration right away (proteced by page lock). > - */ > - lock_page(newpage); > - rc = page->mapping->a_ops->migratepage(page->mapping, > -newpage, page, mode); > - unlock_page(newpage); > - goto out_unlock; > - } > - > /* >* Corner case handling: >* 1. When a new swap-cache page is read into, it is added to the LRU >
[PATCH 2/4] mm/compaction: enable mobile-page migration
On 07/13/2015 10:35 AM, Gioh Kim wrote: > From: Gioh Kim > > Add framework to register callback functions and check page mobility. > There are some modes for page isolation so that isolate interface > has arguments of page address and isolation mode while putback > interface has only page address as argument. Note that unlike what subject suggest, this doesn't really enable mobile-page migration inside compaction, since that only happens with patch 3. This might theoretically affect some cherry-pick backports that don't care about balloon pages. I can imagine that can easily happen in the world of mobile devices? It would thus be somewhat cleaner if this patch was complete in that sense. > Signed-off-by: Gioh Kim > Acked-by: Rafael Aquini > --- > fs/proc/page.c | 3 ++ > include/linux/compaction.h | 80 > ++ > include/linux/fs.h | 2 + > include/linux/page-flags.h | 19 > include/uapi/linux/kernel-page-flags.h | 1 + > 5 files changed, 105 insertions(+) > > diff --git a/fs/proc/page.c b/fs/proc/page.c > index 7eee2d8..a4f5a00 100644 > --- a/fs/proc/page.c > +++ b/fs/proc/page.c > @@ -146,6 +146,9 @@ u64 stable_page_flags(struct page *page) > if (PageBalloon(page)) > u |= 1 << KPF_BALLOON; > > + if (PageMobile(page)) > + u |= 1 << KPF_MOBILE; PageMovable() would probably be as good a name and correspond to MIGRATE_MOVABLE somewhat, unlike a completely new term. Whatever driver starts to using this should probably change allocation flags to allocate MIGRATE_MOVABLE, so that it works fine with what fragmentation avoidance expects. Guess I should have said that earlier, but can you still reconsider? > + > u |= kpf_copy_bit(k, KPF_LOCKED,PG_locked); > > u |= kpf_copy_bit(k, KPF_SLAB, PG_slab); > diff --git a/include/linux/compaction.h b/include/linux/compaction.h > index aa8f61c..f693072 100644 > --- a/include/linux/compaction.h > +++ b/include/linux/compaction.h > @@ -1,6 +1,9 @@ > #ifndef _LINUX_COMPACTION_H > #define _LINUX_COMPACTION_H > > +#include > +#include > + > /* Return values for compact_zone() and try_to_compact_pages() */ > /* compaction didn't start as it was deferred due to past failures */ > #define COMPACT_DEFERRED0 > @@ -51,6 +54,70 @@ extern void compaction_defer_reset(struct zone *zone, int > order, > bool alloc_success); > extern bool compaction_restarting(struct zone *zone, int order); > > +static inline bool mobile_page(struct page *page) > +{ > + return page->mapping && (PageMobile(page) || PageBalloon(page)); > +} I would put this definition to linux/page-flags.h and rename it to page_mobile (or better page_movable()), which is more common ordering. > + > +static inline bool isolate_mobilepage(struct page *page, isolate_mode_t mode) Does this have to be in compaction.h? The only user is compaction.c so probably move it there, and if there ever is another module using this in the future, we can move it to a more appropriate place and declare it in e.g. mm/internal.h. > +{ > + bool ret = false; > + > + /* > + * Avoid burning cycles with pages that are yet under __free_pages(), > + * or just got freed under us. > + * > + * In case we 'win' a race for a mobile page being freed under us and > + * raise its refcount preventing __free_pages() from doing its job > + * the put_page() at the end of this block will take care of > + * release this page, thus avoiding a nasty leakage. 
> + */ > + if (unlikely(!get_page_unless_zero(page))) > + goto out; > + > + /* > + * As mobile pages are not isolated from LRU lists, concurrent > + * compaction threads can race against page migration functions > + * as well as race against the releasing a page. > + * > + * In order to avoid having an already isolated mobile page > + * being (wrongly) re-isolated while it is under migration, > + * or to avoid attempting to isolate pages being released, > + * lets be sure we have the page lock > + * before proceeding with the mobile page isolation steps. > + */ > + if (unlikely(!trylock_page(page))) > + goto out_putpage; > + > + if (!(mobile_page(page) && page->mapping->a_ops->isolatepage)) > + goto out_not_isolated; > + ret = page->mapping->a_ops->isolatepage(page, mode); > + if (!ret) > + goto out_not_isolated; > + unlock_page(page); > + return ret; > + > +out_not_isolated: > + unlock_page(page); > +out_putpage: > + put_page(page); > +out: > + return ret; > +} > + > +static inline void putback_mobilepage(struct page *page) Likewise, this could go to migrate.c. Or maybe together with isolate_mobilepage() if you don't want to split them. > +{ > + /* > + * 'lock_p
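As a concrete illustration of the "allocate MIGRATE_MOVABLE" remark earlier in this review (an assumption about how a driver would opt in, not something taken from the series itself), such a driver would typically request movable pages for the buffers it later exposes through these hooks:

	/* illustration only: ask for a page the anti-fragmentation code treats as movable */
	struct page *p = alloc_page(GFP_HIGHUSER_MOVABLE);	/* GFP_HIGHUSER | __GFP_MOVABLE */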
[PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it
On 07/22/2015 08:43 PM, Eric B Munson wrote: > On Wed, 22 Jul 2015, Vlastimil Babka wrote: > >> >> Hi, >> >> I think you should include a complete description of which >> transitions for vma states and mlock2/munlock2 flags applied on them >> are valid and what they do. It will also help with the manpages. >> You explained some to Jon in the last thread, but I think there >> should be a canonical description in changelog (if not also >> Documentation, if mlock is covered there). >> >> For example the scenario Jon asked, what happens after a >> mlock2(MLOCK_ONFAULT) followed by mlock2(MLOCK_LOCKED), and that the >> answer is "nothing". Your promised code comment for >> apply_vma_flags() doesn't suffice IMHO (and I'm not sure it's there, >> anyway?). > > I missed adding that comment to the code, will be there in V5 along with > the description in the changelog. Thanks! >> >> But the more I think about the scenario and your new VM_LOCKONFAULT >> vma flag, it seems awkward to me. Why should munlocking at all care >> if the vma was mlocked with MLOCK_LOCKED or MLOCK_ONFAULT? In either >> case the result is that all pages currently populated are munlocked. >> So the flags for munlock2 should be unnecessary. > > Say a user has a large area of interleaved MLOCK_LOCK and MLOCK_ONFAULT > mappings and they want to unlock only the ones with MLOCK_LOCK. With > the current implementation, this is possible in a single system call > that spans the entire region. With your suggestion, the user would have > to know what regions where locked with MLOCK_LOCK and call munlock() on > each of them. IMO, the way munlock2() works better mirrors the way > munlock() currently works when called on a large area of interleaved > locked and unlocked areas. Um OK, that scenario is possible in theory. But I have a hard time imagining that somebody would really want to do that. I think much more people would benefit from a simpler API. > >> >> I also think VM_LOCKONFAULT is unnecessary. VM_LOCKED should be >> enough - see how you had to handle the new flag in all places that >> had to handle the old flag? I think the information whether mlock >> was supposed to fault the whole vma is obsolete at the moment mlock >> returns. VM_LOCKED should be enough for both modes, and the flag to >> mlock2 could just control whether the pre-faulting is done. >> >> So what should be IMHO enough: >> - munlock can stay without flags >> - mlock2 has only one new flag MLOCK_ONFAULT. If specified, >> pre-faulting is not done, just set VM_LOCKED and mlock pages already >> present. >> - same with mmap(MAP_LOCKONFAULT) (need to define what happens when >> both MAP_LOCKED and MAP_LOCKONFAULT are specified). >> >> Now mlockall(MCL_FUTURE) muddles the situation in that it stores the >> information for future VMA's in current->mm->def_flags, and this >> def_flags would need to distinguish VM_LOCKED with population and >> without. But that could be still solvable without introducing a new >> vma flag everywhere. > > With you right up until that last paragraph. I have been staring at > this a while and I cannot come up a way to handle the > mlockall(MCL_ONFAULT) without introducing a new vm flag. It doesn't > have to be VM_LOCKONFAULT, we could use the model that Michal Hocko > suggested with something like VM_FAULTPOPULATE. 
However, we can't > really use this flag anywhere except the mlock code becuase we have to > be able to distinguish a caller that wants to use MLOCK_LOCK with > whatever control VM_FAULTPOPULATE might grant outside of mlock and a > caller that wants MLOCK_ONFAULT. That was a long way of saying we need > an extra vma flag regardless. However, if that flag only controls if > mlock pre-populates it would work and it would do away with most of the > places I had to touch to handle VM_LOCKONFAULT properly. Yes, it would be a good way. Adding a new vma flag is probably cleanest after all, but the flag would be set *in addition* to VM_LOCKED, *just* to prevent pre-faulting. The places that check VM_LOCKED for the actual page mlocking (i.e. try_to_unmap_one) would just keep checking VM_LOCKED. The places where VM_LOCKED is checked to trigger prepopulation, would skip that if VM_LOCKONFAULT is also set. Having VM_LOCKONFAULT set without also VM_LOCKED itself would be invalid state. This should work fine with the simplified API as I proposed so let me reiterate and try fill in the blanks: - mlock2 has only one new flag MLOCK_ONFAULT. If specified, VM_LOCKONFAULT is set in addition to VM_LOCKED and no prefaulting is done - old mlock syscall natura
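A tiny standalone model of the flag relationship proposed here; the bit values are illustrative only (VM_LOCKONFAULT is the hypothetical new bit), the point being that it never appears without VM_LOCKED and only suppresses prefaulting:

	#include <stdbool.h>
	#include <stdio.h>

	#define VM_LOCKED	0x00002000UL	/* illustrative values */
	#define VM_LOCKONFAULT	0x40000000UL	/* hypothetical bit, only ever set together with VM_LOCKED */

	/* mlock()/mmap(MAP_LOCKED)-style population: prefault only without the new bit */
	static bool wants_prefault(unsigned long vm_flags)
	{
		return (vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == VM_LOCKED;
	}

	/* try_to_unmap_one() and friends would keep testing only VM_LOCKED */
	static bool page_stays_mlocked(unsigned long vm_flags)
	{
		return vm_flags & VM_LOCKED;
	}

	int main(void)
	{
		unsigned long plain = VM_LOCKED;
		unsigned long onfault = VM_LOCKED | VM_LOCKONFAULT;

		printf("plain mlock:   prefault=%d mlocked=%d\n",
		       wants_prefault(plain), page_stays_mlocked(plain));
		printf("MLOCK_ONFAULT: prefault=%d mlocked=%d\n",
		       wants_prefault(onfault), page_stays_mlocked(onfault));
		return 0;
	}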
[PATCH V4 4/6] mm: mlock: Introduce VM_LOCKONFAULT and add mlock flags to enable it
On 07/21/2015 09:59 PM, Eric B Munson wrote: > The cost of faulting in all memory to be locked can be very high when > working with large mappings. If only portions of the mapping will be > used this can incur a high penalty for locking. > > For the example of a large file, this is the usage pattern for a large > statical language model (probably applies to other statical or graphical > models as well). For the security example, any application transacting > in data that cannot be swapped out (credit card data, medical records, > etc). > > This patch introduces the ability to request that pages are not > pre-faulted, but are placed on the unevictable LRU when they are finally > faulted in. This can be done area at a time via the > mlock2(MLOCK_ONFAULT) or the mlockall(MCL_ONFAULT) system calls. These > calls can be undone via munlock2(MLOCK_ONFAULT) or > munlockall2(MCL_ONFAULT). > > Applying the VM_LOCKONFAULT flag to a mapping with pages that are > already present required the addition of a function in gup.c to pin all > pages which are present in an address range. It borrows heavily from > __mm_populate(). > > To keep accounting checks out of the page fault path, users are billed > for the entire mapping lock as if MLOCK_LOCKED was used. Hi, I think you should include a complete description of which transitions for vma states and mlock2/munlock2 flags applied on them are valid and what they do. It will also help with the manpages. You explained some to Jon in the last thread, but I think there should be a canonical description in changelog (if not also Documentation, if mlock is covered there). For example the scenario Jon asked, what happens after a mlock2(MLOCK_ONFAULT) followed by mlock2(MLOCK_LOCKED), and that the answer is "nothing". Your promised code comment for apply_vma_flags() doesn't suffice IMHO (and I'm not sure it's there, anyway?). But the more I think about the scenario and your new VM_LOCKONFAULT vma flag, it seems awkward to me. Why should munlocking at all care if the vma was mlocked with MLOCK_LOCKED or MLOCK_ONFAULT? In either case the result is that all pages currently populated are munlocked. So the flags for munlock2 should be unnecessary. I also think VM_LOCKONFAULT is unnecessary. VM_LOCKED should be enough - see how you had to handle the new flag in all places that had to handle the old flag? I think the information whether mlock was supposed to fault the whole vma is obsolete at the moment mlock returns. VM_LOCKED should be enough for both modes, and the flag to mlock2 could just control whether the pre-faulting is done. So what should be IMHO enough: - munlock can stay without flags - mlock2 has only one new flag MLOCK_ONFAULT. If specified, pre-faulting is not done, just set VM_LOCKED and mlock pages already present. - same with mmap(MAP_LOCKONFAULT) (need to define what happens when both MAP_LOCKED and MAP_LOCKONFAULT are specified). Now mlockall(MCL_FUTURE) muddles the situation in that it stores the information for future VMA's in current->mm->def_flags, and this def_flags would need to distinguish VM_LOCKED with population and without. But that could be still solvable without introducing a new vma flag everywhere.
[PATCH 9/9] drm/exynos: Convert g2d_userptr_get_dma_addr() to use get_vaddr_frames()
On 05/06/2015 09:28 AM, Jan Kara wrote: > Convert g2d_userptr_get_dma_addr() to pin pages using get_vaddr_frames(). > This removes the knowledge about vmas and mmap_sem locking from exynos > driver. Also it fixes a problem that the function has been mapping user > provided address without holding mmap_sem. > > Signed-off-by: Jan Kara > --- > drivers/gpu/drm/exynos/exynos_drm_g2d.c | 89 ++ > drivers/gpu/drm/exynos/exynos_drm_gem.c | 97 > - > 2 files changed, 29 insertions(+), 157 deletions(-) > > diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c > b/drivers/gpu/drm/exynos/exynos_drm_g2d.c > index 81a250830808..265519c0fe2d 100644 > --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c > +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c ... > @@ -456,65 +458,37 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct > drm_device *drm_dev, > return ERR_PTR(-ENOMEM); > > atomic_set(&g2d_userptr->refcount, 1); > + g2d_userptr->size = size; > > start = userptr & PAGE_MASK; > offset = userptr & ~PAGE_MASK; > end = PAGE_ALIGN(userptr + size); > npages = (end - start) >> PAGE_SHIFT; > - g2d_userptr->npages = npages; > - > - pages = drm_calloc_large(npages, sizeof(struct page *)); > - if (!pages) { > - DRM_ERROR("failed to allocate pages.\n"); > - ret = -ENOMEM; > + vec = g2d_userptr->vec = frame_vector_create(npages); > + if (!vec) > goto err_free; > - } > > - down_read(¤t->mm->mmap_sem); > - vma = find_vma(current->mm, userptr); > - if (!vma) { > - up_read(¤t->mm->mmap_sem); > - DRM_ERROR("failed to get vm region.\n"); > + ret = get_vaddr_frames(start, npages, 1, 1, vec); Use true instead of 1.
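(With the helper's bool parameters spelled out as requested, the call would presumably read: ret = get_vaddr_frames(start, npages, true, true, vec);)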
[PATCH 3/9] media: omap_vout: Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns()
On 05/06/2015 09:28 AM, Jan Kara wrote: > Convert omap_vout_uservirt_to_phys() to use get_vaddr_pfns() instead of > hand made mapping of virtual address to physical address. Also the > function leaked page reference from get_user_pages() so fix that by > properly release the reference when omap_vout_buffer_release() is > called. > > Signed-off-by: Jan Kara > --- > drivers/media/platform/omap/omap_vout.c | 67 > +++-- > 1 file changed, 31 insertions(+), 36 deletions(-) > ... > + vec = frame_vector_create(1); > + if (!vec) > + return -ENOMEM; > > - if (res == nr_pages) { > - physp = __pa(page_address(&pages[0]) + > - (virtp & ~PAGE_MASK)); > - } else { > - printk(KERN_WARNING VOUT_NAME > - "get_user_pages failed\n"); > - return 0; > - } > + ret = get_vaddr_frames(virtp, 1, 1, 0, vec); Use true/false where appropriate.
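(I.e. presumably: ret = get_vaddr_frames(virtp, 1, true, false, vec);)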
[PATCH 2/9] mm: Provide new get_vaddr_frames() helper
On 05/06/2015 09:28 AM, Jan Kara wrote: > Provide new function get_vaddr_frames(). This function maps virtual > addresses from given start and fills given array with page frame numbers of > the corresponding pages. If given start belongs to a normal vma, the function > grabs reference to each of the pages to pin them in memory. If start > belongs to VM_IO | VM_PFNMAP vma, we don't touch page structures. Caller > must make sure pfns aren't reused for anything else while he is using > them. > > This function is created for various drivers to simplify handling of > their buffers. > > Signed-off-by: Jan Kara Acked-by: Vlastimil Babka With some nitpicks below... > --- > include/linux/mm.h | 44 +++ > mm/gup.c | 214 > + > 2 files changed, 258 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 0755b9fd03a7..dcd1f02a78e9 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -20,6 +20,7 @@ > #include > #include > #include > +#include > > struct mempolicy; > struct anon_vma; > @@ -1197,6 +1198,49 @@ long get_user_pages_unlocked(struct task_struct *tsk, > struct mm_struct *mm, > int write, int force, struct page **pages); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); > + > +/* Container for pinned pfns / pages */ > +struct frame_vector { > + unsigned int nr_allocated; /* Number of frames we have space for */ > + unsigned int nr_frames; /* Number of frames stored in ptrs array */ > + bool got_ref; /* Did we pin pages by getting page ref? */ > + bool is_pfns; /* Does array contain pages or pfns? */ > + void *ptrs[0]; /* Array of pinned pfns / pages. Use > + * pfns_vector_pages() or pfns_vector_pfns() > + * for access */ > +}; > + > +struct frame_vector *frame_vector_create(unsigned int nr_frames); > +void frame_vector_destroy(struct frame_vector *vec); > +int get_vaddr_frames(unsigned long start, unsigned int nr_pfns, > + bool write, bool force, struct frame_vector *vec); > +void put_vaddr_frames(struct frame_vector *vec); > +int frame_vector_to_pages(struct frame_vector *vec); > +void frame_vector_to_pfns(struct frame_vector *vec); > + > +static inline unsigned int frame_vector_count(struct frame_vector *vec) > +{ > + return vec->nr_frames; > +} > + > +static inline struct page **frame_vector_pages(struct frame_vector *vec) > +{ > + if (vec->is_pfns) { > + int err = frame_vector_to_pages(vec); > + > + if (err) > + return ERR_PTR(err); > + } > + return (struct page **)(vec->ptrs); > +} > + > +static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) > +{ > + if (!vec->is_pfns) > + frame_vector_to_pfns(vec); > + return (unsigned long *)(vec->ptrs); > +} > + > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); > diff --git a/mm/gup.c b/mm/gup.c > index 6297f6bccfb1..8db5c40e65c4 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -8,6 +8,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -936,6 +937,219 @@ int __mm_populate(unsigned long start, unsigned long > len, int ignore_errors) > return ret; /* 0 or negative error code */ > } > > +/* > + * get_vaddr_frames() - map virtual addresses to pfns > + * @start: starting user address > + * @nr_frames: number of pages / pfns from start to map > + * @write: whether pages will be written to by the caller > + * @force: whether to force write access even if user mapping is > + * readonly. This will result in the page being COWed even > + * in MAP_SHARED mappings. 
You do not want this. "You do not want this" and yet some of the drivers in later patches do. That looks weird. Explain better? > + * @vec: structure which receives pages / pfns of the addresses mapped. > + * It should have space for at least nr_frames entries. > + * > + * This function maps virtual addresses from @start and fills @vec structure > + * with page frame numbers or page pointers to corresponding pages (choice > + * depends on the type of the vma underlying the virtual address). If @start > + * belongs to a normal vma, the function grabs reference to each of the pages > + * to pin them in memory. If @start b
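For readers skimming the API being added here, a hedged sketch of how a caller is expected to use it, pieced together from the declarations quoted above (kernel-context code, error paths simplified; the merged version may differ in detail):

	#include <linux/mm.h>

	/* pin 'nr' pages of user memory starting at 'start' for writing */
	static int example_pin_user_range(unsigned long start, unsigned int nr)
	{
		struct frame_vector *vec;
		struct page **pages;
		int ret;

		vec = frame_vector_create(nr);
		if (!vec)
			return -ENOMEM;

		/* write=true, force=false - see the discussion of 'force' above */
		ret = get_vaddr_frames(start, nr, true, false, vec);
		if (ret < 0)			/* negative on error */
			goto out_destroy;

		pages = frame_vector_pages(vec);	/* ERR_PTR for VM_IO/VM_PFNMAP vmas */
		if (IS_ERR(pages)) {
			ret = PTR_ERR(pages);
			goto out_put;
		}

		/* ... use 'pages', e.g. set up DMA ... */
		ret = 0;
	out_put:
		put_vaddr_frames(vec);
	out_destroy:
		frame_vector_destroy(vec);
		return ret;
	}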
[Bug 87891] New: kernel BUG at mm/slab.c:2625!
On 11/12/2014 12:31 AM, Andrew Morton wrote: > > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). > > On Thu, 06 Nov 2014 17:28:41 + bugzilla-daemon at bugzilla.kernel.org > wrote: > >> https://bugzilla.kernel.org/show_bug.cgi?id=87891 >> >> Bug ID: 87891 >> Summary: kernel BUG at mm/slab.c:2625! >> Product: Memory Management >> Version: 2.5 >> Kernel Version: 3.17.2 >>Hardware: i386 >> OS: Linux >>Tree: Mainline >> Status: NEW >>Severity: blocking >>Priority: P1 >> Component: Slab Allocator >>Assignee: akpm at linux-foundation.org >>Reporter: luke-jr+linuxbugs at utopios.org >> Regression: No > > Well this is interesting. > > >> [359782.842112] kernel BUG at mm/slab.c:2625! >> ... >> [359782.843008] Call Trace: >> [359782.843017] [] __kmalloc+0xdf/0x200 >> [359782.843037] [] ? ttm_page_pool_free+0x35/0x180 [ttm] >> [359782.843060] [] ttm_page_pool_free+0x35/0x180 [ttm] >> [359782.843084] [] ttm_pool_shrink_scan+0xae/0xd0 [ttm] >> [359782.843108] [] shrink_slab_node+0x12b/0x2e0 >> [359782.843129] [] ? fragmentation_index+0x14/0x70 >> [359782.843150] [] ? zone_watermark_ok+0x1a/0x20 >> [359782.843171] [] shrink_slab+0xc8/0x110 >> [359782.843189] [] do_try_to_free_pages+0x300/0x410 >> [359782.843210] [] try_to_free_pages+0xbb/0x190 >> [359782.843230] [] __alloc_pages_nodemask+0x696/0xa90 >> [359782.843253] [] do_huge_pmd_anonymous_page+0xfa/0x3f0 >> [359782.843278] [] ? debug_smp_processor_id+0x17/0x20 >> [359782.843300] [] ? __lru_cache_add+0x57/0xa0 >> [359782.843321] [] handle_mm_fault+0x37e/0xdd0 > > It went pagefault > ->__alloc_pages_nodemask >->shrink_slab > ->ttm_pool_shrink_scan >->ttm_page_pool_free > ->kmalloc >->cache_grow > ->BUG_ON(flags & GFP_SLAB_BUG_MASK); > > And I don't really know why - I'm not seeing anything in there which > can set a GFP flag which is outside GFP_SLAB_BUG_MASK. However I see > lots of nits. > > Core MM: > > __alloc_pages_nodemask() does > > if (unlikely(!page)) { > /* >* Runtime PM, block IO and its error handling path >* can deadlock because I/O on the device might not >* complete. >*/ > gfp_mask = memalloc_noio_flags(gfp_mask); > page = __alloc_pages_slowpath(gfp_mask, order, > zonelist, high_zoneidx, nodemask, > preferred_zone, classzone_idx, migratetype); > } > > so it permanently alters the value of incoming arg gfp_mask. This > means that the following trace_mm_page_alloc() will print the wrong > value of gfp_mask, and if we later do the `goto retry_cpuset', we retry > with a possibly different gfp_mask. Isn't this a bug? I think so. I noticed and fixed it in the RFC about reducing alloc_pages* parameters [1], but it's buried in patch 2/4 Guess I should have made it a separate non-RFC patch. Will do soon hopefully. Vlastimil [1] https://lkml.org/lkml/2014/8/6/249
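The shape of the fix being referred to (a hedged sketch, not the literal patch from [1]): adjust a local copy of the mask for the NOIO slowpath attempt so the caller's gfp_mask stays intact for tracing and for a possible retry_cpuset pass:

	if (unlikely(!page)) {
		/*
		 * Adjust a local copy only; gfp_mask as passed by the caller
		 * is left untouched for trace_mm_page_alloc() and any retry.
		 */
		gfp_t alloc_mask = memalloc_noio_flags(gfp_mask);

		page = __alloc_pages_slowpath(alloc_mask, order,
				zonelist, high_zoneidx, nodemask,
				preferred_zone, classzone_idx, migratetype);
	}

	trace_mm_page_alloc(page, order, gfp_mask, migratetype);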
drm/sysfs lifetime interaction fixes
On 12.10.2013 14:54, Fengguang Wu wrote: > On Sat, Oct 12, 2013 at 07:47:05AM +1000, Dave Airlie wrote: >>>> This is my preferred method of fixing it as I don't think the lifetimes >>>> need >>>> to be tied so closely, though this requires review my someone to make sure >>>> my unregistering etc is correct and in the right place. >>> Apparently this fixes the problem for Fengguang, and the code looks >>> cleaner too. Thanks, >> Leaves the fixes or next question, since I suppose its not technically >> a regression, -next is probably fine, let me know if you'd l like them >> earlier. > Dave, I tested the two patches on top of drm-next and find that it > does help eliminate the lots of oops messages in the below kernels. > > However in the 2nd config, the patched kernel has one new oops type > than its base kernels (6aba5b6 and v3.12-rc3). v3.12-rc4 is also > tested for your reference. In the 3nd config, both patched and base > kernels have that oops: > > [ 96.969429] init: plymouth-upstart-bridge main process (309) terminated > with status 1 > * Asking all remaining processes to terminate... > [ 97.260371] BUG: Bad page map in process killall5 pte:4f426de0 > pmd:0f4f4067 > [ 97.261114] addr:3fc0 vm_flags:00100173 anon_vma:4f4066c0 mapping: > (null) index:3ffe6 > [ 97.261912] CPU: 0 PID: 334 Comm: killall5 Not tainted > 3.12.0-rc3-00156-gdaeb5e3 #1 > [ 97.262633] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 97.263192] 3fc0 4f4c1e14 4212e45c 4fbff9a0 4f4c1e4c 411a9c4b > 4262ade0 3fc0 > [ 97.264051] 00100173 4f4066c0 0003ffe6 4f426de0 0003ffe6 > 4fbff9a0 > [ 97.264906] 3fc0 3fc0 4f4c1e60 411ab50e 4f464000 > 4f4c1ed0 > [ 97.265751] Call Trace: > [ 97.266022] [<4212e45c>] dump_stack+0xbb/0x14b > [ 97.266456] [<411a9c4b>] print_bad_pte+0x28b/0x2c0 > [ 97.266931] [<411ab50e>] vm_normal_page+0xae/0xe0 > [ 97.267388] [<411b37f3>] munlock_vma_pages_range+0x143/0x320 > [ 97.267950] [<410d30fd>] ? sched_clock_cpu+0x20d/0x250 > [ 97.268451] [<411bacee>] exit_mmap+0x7e/0x200 > [ 97.268893] [<4131de9c>] ? __const_udelay+0x2c/0x40 > [ 97.269369] [<410adf28>] ? __rcu_read_unlock+0x68/0x150 > [ 97.269888] [<4123227b>] ? exit_aio+0x11b/0x280 > [ 97.270412] [<4123217c>] ? exit_aio+0x1c/0x280 > [ 97.270892] [<410829c7>] ? do_exit+0x697/0x1280 > [ 97.271332] [<4107a1a1>] mmput+0x81/0x170 > [ 97.271726] [<410829dc>] do_exit+0x6ac/0x1280 > [ 97.272166] [<410bad75>] ? hrtimer_nanosleep+0x1f5/0x210 > [ 97.272679] [<4215328a>] ? sysenter_exit+0xf/0x45 > [ 97.273151] [<41083751>] do_group_exit+0x131/0x160 > [ 97.273617] [<410837ad>] SyS_exit_group+0x2d/0x30 > [ 97.274088] [<42153251>] sysenter_do_call+0x12/0x3c > [ 97.274560] Disabling lock debugging due to kernel taint > * All processes ended within 1 seconds > > That oops message looks very like the one I reported for this commit. > > commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9 > Author: Vlastimil Babka > Date: Wed Sep 11 14:22:35 2013 -0700 > > mm: munlock: manual pte walk in fast path instead of follow_page_mask() > > Vlastimil Babka, has this bug been fixed in -rc4? > Yes, this should have been fixed by commit eadb41ae82f80210 "mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration", merged between rc3 and rc4. Vlastimil Babka