Hi,

On Mon, Sep 15, 2025 at 10:20 PM Venkat <[email protected]> wrote:
>
>
> > On 13 Sep 2025, at 8:18 AM, Julian Sun <[email protected]> wrote:
> >
> > Hi,
> >
> > Does this fix make sense to you?
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d0dfaa0ccaba..ed24dcece56a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3945,9 +3945,10 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> >  		 * Not necessary to wait for wb completion which might cause task hung,
> >  		 * only used to free resources. See memcg_cgwb_waitq_callback_fn().
> >  		 */
> > -		__add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
> >  		if (atomic_dec_and_test(&wait->done.cnt))
> > -			wake_up_all(wait->done.waitq);
> > +			kfree(wait);
> > +		else
> > +			__add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
> >  	}
> >  #endif
> >  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
>
> Hello,
>
> Thanks for the fix. This is fixing the reported issue.
Thanks for your testing and feedback.

>
> While sending out the patch please add below tag as well.
>
> Tested-by: Venkat Rao Bagalkote <[email protected]>

Sure, will do.

Could you please try again with the following patch? The previous one
might have caused a memory leak and had race conditions. I can't
reproduce it locally...

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 80257dba30f8..35da16928599 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3940,6 +3940,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	int __maybe_unused i;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
+	spin_lock(&memcg_cgwb_frn_waitq.lock);
 	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
 		struct cgwb_frn_wait *wait = memcg->cgwb_frn[i].wait;
 
@@ -3948,9 +3949,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 		 * only used to free resources. See memcg_cgwb_waitq_callback_fn().
 		 */
 		__add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
-		if (atomic_dec_and_test(&wait->done.cnt))
-			wake_up_all(wait->done.waitq);
+		if (atomic_dec_and_test(&wait->done.cnt)) {
+			list_del(&wait->wq_entry.entry);
+			kfree(wait);
+		}
 	}
+	spin_unlock(&memcg_cgwb_frn_waitq.lock);
 #endif
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);

>
> Regards,
> Venkat.
>
>
> > On Fri, Sep 12, 2025 at 8:33 PM Venkat <[email protected]> wrote:
> >>
> >>
> >>
> >>> On 12 Sep 2025, at 10:51 AM, Venkat Rao Bagalkote
> >>> <[email protected]> wrote:
> >>>
> >>> Greetings!!!
> >>>
> >>>
> >>> IBM CI has reported a kernel crash, while running generic/256 test case
> >>> on pmem device from xfstests suite on linux-next20250911 kernel.
> >>>
> >>>
> >>> xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> >>>
> >>> local.config:
> >>>
> >>> [xfs_dax]
> >>> export RECREATE_TEST_DEV=true
> >>> export TEST_DEV=/dev/pmem0
> >>> export TEST_DIR=/mnt/test_pmem
> >>> export SCRATCH_DEV=/dev/pmem0.1
> >>> export SCRATCH_MNT=/mnt/scratch_pmem
> >>> export MKFS_OPTIONS="-m reflink=0 -b size=65536 -s size=512"
> >>> export FSTYP=xfs
> >>> export MOUNT_OPTIONS="-o dax"
> >>>
> >>>
> >>> Test case: generic/256
> >>>
> >>>
> >>> Traces:
> >>>
> >>>
> >>> [ 163.371929] ------------[ cut here ]------------
> >>> [ 163.371936] kernel BUG at lib/list_debug.c:29!
> >>> [ 163.371946] Oops: Exception in kernel mode, sig: 5 [#1]
> >>> [ 163.371954] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
> >>> [ 163.371965] Modules linked in: xfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6
> >>> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> >>> nft_chain_nat nf_nat nf_conntrack bonding tls nf_defrag_ipv6 nf_defrag_ipv4
> >>> rfkill ip_set nf_tables nfnetlink sunrpc pseries_rng vmx_crypto dax_pmem
> >>> fuse ext4 crc16 mbcache jbd2 nd_pmem papr_scm sd_mod libnvdimm sg ibmvscsi
> >>> ibmveth scsi_transport_srp pseries_wdt
> >>> [ 163.372127] CPU: 22 UID: 0 PID: 130 Comm: kworker/22:0 Kdump: loaded
> >>> Not tainted 6.17.0-rc5-next-20250911 #1 VOLUNTARY
> >>> [ 163.372142] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200
> >>> 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries
> >>> [ 163.372155] Workqueue: cgroup_free css_free_rwork_fn
> >>> [ 163.372169] NIP: c000000000d051d4 LR: c000000000d051d0 CTR: 0000000000000000
> >>> [ 163.372176] REGS: c00000000ba079b0 TRAP: 0700 Not tainted (6.17.0-rc5-next-20250911)
> >>> [ 163.372183] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28000000 XER: 00000006
> >>> [ 163.372214] CFAR: c0000000002bae9c IRQMASK: 0
> >>> [ 163.372214] GPR00: c000000000d051d0 c00000000ba07c50 c00000000230a600 0000000000000075
> >>> [ 163.372214] GPR04: 0000000000000004 0000000000000001 c000000000507e2c 0000000000000001
> >>> [ 163.372214] GPR08: c000000d0cb87d13 0000000000000000 0000000000000000 a80e000000000000
> >>> [ 163.372214] GPR12: c00e0001a1970fa2 c000000d0ddec700 c000000000208e58 c000000107b5e190
> >>> [ 163.372214] GPR16: c00000000d3e5d08 c00000000b71cf78 c00000000d3e5d05 c00000000b71cf30
> >>> [ 163.372214] GPR20: c00000000b71cf08 c00000000b71cf10 c000000019f58588 c000000004704bc8
> >>> [ 163.372214] GPR24: c000000107b5e100 c000000004704bd0 0000000000000003 c000000004704bd0
> >>> [ 163.372214] GPR28: c000000004704bc8 c000000019f585a8 c000000019f53da8 c000000004704bc8
> >>> [ 163.372315] NIP [c000000000d051d4] __list_add_valid_or_report+0x124/0x188
> >>> [ 163.372326] LR [c000000000d051d0] __list_add_valid_or_report+0x120/0x188
> >>> [ 163.372335] Call Trace:
> >>> [ 163.372339] [c00000000ba07c50] [c000000000d051d0] __list_add_valid_or_report+0x120/0x188 (unreliable)
> >>> [ 163.372352] [c00000000ba07ce0] [c000000000834280] mem_cgroup_css_free+0xa0/0x27c
> >>> [ 163.372363] [c00000000ba07d50] [c0000000003ba198] css_free_rwork_fn+0xd0/0x59c
> >>> [ 163.372374] [c00000000ba07da0] [c0000000001f5d60] process_one_work+0x41c/0x89c
> >>> [ 163.372385] [c00000000ba07eb0] [c0000000001f76c0] worker_thread+0x558/0x848
> >>> [ 163.372394] [c00000000ba07f80] [c000000000209038] kthread+0x1e8/0x230
> >>> [ 163.372406] [c00000000ba07fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> >>> [ 163.372416] Code: 4b9b1099 60000000 7f63db78 4bae8245 60000000 e8bf0008
> >>> 3c62ff88 7fe6fb78 7fc4f378 38637d40 4b5b5c89 60000000 <0fe00000> 60000000
> >>> 60000000 7f83e378
> >>> [ 163.372453] ---[ end trace 0000000000000000 ]---
> >>> [ 163.380581] pstore: backend (nvram) writing error (-1)
> >>> [ 163.380593]
> >>>
> >>>
> >>> If you happen to fix this issue, please add below tag.
> >>>
> >>>
> >>> Reported-by: Venkat Rao Bagalkote <[email protected]>
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Venkat.
> >>>
> >>>
> >>
> >> After reverting the below commit, issue is not seen.
> >>
> >> commit 61bbf51e75df1a94cf6736e311cb96aeb79826a8
> >> Author: Julian Sun <[email protected]>
> >> Date:   Thu Aug 28 04:45:57 2025 +0800
> >>
> >>     memcg: don't wait writeback completion when release memcg
> >>
> >>     Recently, we encountered the following hung task:
> >>
> >>     INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
> >>     [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
> >>     [Wed Jul 30 17:47:45 2025] Call Trace:
> >>     [Wed Jul 30 17:47:45 2025]  __schedule+0x934/0xe10
> >>     [Wed Jul 30 17:47:45 2025]  ? complete+0x3b/0x50
> >>     [Wed Jul 30 17:47:45 2025]  ? _cond_resched+0x15/0x30
> >>     [Wed Jul 30 17:47:45 2025]  schedule+0x40/0xb0
> >>     [Wed Jul 30 17:47:45 2025]  wb_wait_for_completion+0x52/0x80
> >>     [Wed Jul 30 17:47:45 2025]  ? finish_wait+0x80/0x80
> >>     [Wed Jul 30 17:47:45 2025]  mem_cgroup_css_free+0x22/0x1b0
> >>     [Wed Jul 30 17:47:45 2025]  css_free_rwork_fn+0x42/0x380
> >>     [Wed Jul 30 17:47:45 2025]  process_one_work+0x1a2/0x360
> >>     [Wed Jul 30 17:47:45 2025]  worker_thread+0x30/0x390
> >>     [Wed Jul 30 17:47:45 2025]  ? create_worker+0x1a0/0x1a0
> >>     [Wed Jul 30 17:47:45 2025]  kthread+0x110/0x130
> >>     [Wed Jul 30 17:47:45 2025]  ? __kthread_cancel_work+0x40/0x40
> >>     [Wed Jul 30 17:47:45 2025]  ret_from_fork+0x1f/0x30
> >>
> >>     The direct cause is that memcg spends a long time waiting for dirty page
> >>     writeback of foreign memcgs during release.
> >>
> >>     The root causes are:
> >>     a. The wb may have multiple writeback tasks, containing millions of
> >>        dirty pages, as shown below:
> >>     >>> for work in list_for_each_entry("struct wb_writeback_work", \
> >>             wb.work_list.address_of_(), "list"):
> >>     ...     print(work.nr_pages, work.reason, hex(work))
> >>     ...
> >>     900628 WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
> >>     1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
> >>     1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0
> >>     1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0
> >>     1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340
> >>     2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400
> >>     2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0
> >>     3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0
> >>     3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0
> >>     3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40
> >>     3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800
> >>     3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600
> >>     3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0
> >>     3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40
> >>     3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340
> >>     3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280
> >>     3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040
> >>     3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00
> >>     3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800
> >>     3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240
> >>     3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400
> >>     3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680
> >>     3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
> >>     3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
> >>
> >>     b. The writeback might be severely throttled by wbt, with a speed
> >>        possibly less than 100kb/s, leading to a very long writeback time.
> >>     >>> wb.write_bandwidth
> >>     (unsigned long)24
> >>     >>> wb.write_bandwidth
> >>     (unsigned long)13
> >>
> >>     The wb_wait_for_completion() here is probably only used to prevent
> >>     use-after-free. Therefore, we manage 'done' separately and
> >>     automatically free it.
> >>
> >>     This allows us to remove wb_wait_for_completion() while preventing
> >>     the use-after-free issue.
> >>     Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
> >>     Signed-off-by: Julian Sun <[email protected]>
> >>     Acked-by: Tejun Heo <[email protected]>
> >>     Cc: Michal Hocko <[email protected]>
> >>     Cc: Roman Gushchin <[email protected]>
> >>     Cc: Johannes Weiner <[email protected]>
> >>     Cc: Shakeel Butt <[email protected]>
> >>     Cc: Muchun Song <[email protected]>
> >>     Cc: <[email protected]>
> >>     Signed-off-by: Andrew Morton <[email protected]>
> >>
> >> Regards,
> >> Venkat.
> >>
> >>>
> >>
>
>
> --
> Julian Sun <[email protected]>
>

Thanks,
--
Julian Sun <[email protected]>
