Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On 2021/4/11 12:25 PM, Theodore Ts'o wrote:
> On Sun, Apr 11, 2021 at 03:45:01AM +0800, Wen Yang wrote:
>> At this time, some logs are lost. It is suspected that the hard disk
>> itself is faulty.
>
> If you have a kernel crash dump, that means you can extract out the
> dmesg buffer, correct? Are there any I/O error messages in the kernel
> log?
>
> What is the basis of the suspicion that the hard drive is faulty?
> Kernel dmesg output? Error reporting from smartctl?

Hello, we are using a Bare-metal Cloud server
(https://www.semanticscholar.org/paper/High-density-Multi-tenant-Bare-metal-Cloud-Zhang-Zheng/ab1b5f0743816c8cb7188019d844ff3f7d565d9f/figure/3),
so there is no error log in dmesg or smartctl, and we have to check it in
the Bm-hypervisor. We finally found that the I/O processing on the
Bm-hypervisor is indeed abnormal.

>> There are many hard disks on our server. Maybe we should not occupy
>> 100% CPU for a long time just because one hard disk fails.
>
> It depends on the nature of the hard drive failure. How is it failing?
>
> One thing which we do need to be careful about is when focusing on how
> to prevent a failure caused by some particular (potentially extreme)
> scenarios, that we don't cause problems in more common scenarios (for
> example a heavily loaded server, and/or a case where the file system is
> almost full and we have multiple files "fighting" over a small number
> of free blocks).
>
> In general, my attitude is that the best way to protect against hard
> drive failures is to have processes which are monitoring the health of
> the system, and if there is evidence of a failed drive, to immediately
> kill all jobs which are relying on that drive (which we call "draining"
> a particular drive), and/or, if a sufficiently large percentage of the
> drives have failed or the machine can no longer do its job, to
> automatically move all of those jobs to other servers (e.g., "drain"
> the server), and then send the machine to some kind of data center
> repair service, where the failed hard drives can be replaced.
>
> I'm skeptical of attempts to make the file system somehow continue to
> "work" in the face of hard drive failures, since failures can be highly
> atypical, and what might work well in one failure scenario might be
> catastrophic in another. It's especially problematic if the HDD is not
> explicitly signalling an error condition, but rather is being slow
> (because it's doing a huge number of retries), or is returning data
> which is simply different from what was previously written. The best we
> can do in that case is to detect that something is wrong (this is where
> metadata checksums would be very helpful), and then either remount the
> file system r/o, or panic the machine, and/or signal to userspace that
> some particular file system should be drained.

Thank you. We generally separate the physical disks: one system disk and
several business disks. The Linux kernel runs on the system disk, and the
various services run on the business disks. In this way, even if a
business disk has a problem, it will not affect the entire system.

But the current implementation of mballoc may cause 100% of a CPU to be
occupied for a long time. Could we optimize it slightly, as follows:

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a02fadf..c73f212 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,8 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb,
 						void *bitmap, ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+static inline void ext4_mb_show_pa(struct super_block *sb);
+
 /*
  * The algorithm using this percpu seq counter goes below:
  * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -4217,9 +4219,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	struct ext4_prealloc_space *pa, *tmp;
 	struct list_head list;
 	struct ext4_buddy e4b;
+	int free_total = 0;
+	int busy, free;
 	int err;
-	int busy = 0;
-	int free, free_total = 0;
 
 	mb_debug(sb, "discard preallocation for group %u\n", group);
 	if (list_empty(&grp->bb_prealloc_list))
@@ -4247,6 +4249,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	INIT_LIST_HEAD(&list);
 
 repeat:
+	busy = 0;
 	free = 0;
 	ext4_lock_group(sb, group);
 	list_for_each_entry_safe(pa, tmp,
@@ -4255,6 +4258,8 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		if (atomic_read(&pa->pa_count)) {
 			spin_unlock(&pa->pa_lock);
 			busy = 1;
+			mb_debug(sb, "used pa while discarding for group %u\n", group);
+			ext4_mb_show_pa(sb);
 			continue;
 		}
 		if (pa->pa_deleted) {
@@ -4299,8 +4304,7 @@
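The control-flow change above can be illustrated with a small userspace model. This is a hand-written sketch, not kernel code: plain ints stand in for atomics and spinlocks, an array stands in for bb_prealloc_list, and the function name `discard_group` is invented for illustration. It shows why resetting "busy" at the top of each pass lets the loop terminate once no preallocation is actually in use.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the retry loop in the discard path.  Each pass walks the
 * group's preallocations; a PA that is still referenced (count > 0) sets
 * "busy" and is skipped, otherwise its blocks are freed.  Here the holder
 * drops one reference per pass, modeling a transiently busy PA.
 */
struct pa {
	int count;    /* stand-in for atomic pa_count */
	int len;      /* blocks this PA would free */
	bool deleted;
};

static int discard_group(struct pa *pas, int n, int needed)
{
	int free_total = 0;
	int busy, free, passes = 0;

repeat:
	busy = 0;                 /* the proposed fix: reset every pass */
	free = 0;
	passes++;
	for (int i = 0; i < n; i++) {
		if (pas[i].deleted)
			continue;
		if (pas[i].count) {   /* still referenced: skip for now */
			busy = 1;
			pas[i].count--;   /* holder drops its reference */
			continue;
		}
		pas[i].deleted = true;
		free += pas[i].len;
	}
	free_total += free;
	/* retry only while something was busy (bounded for safety here) */
	if (free_total < needed && busy && passes < 1000)
		goto repeat;
	return free_total;
}
```

With the reset in place, a PA that is busy for two passes is freed on the third and the loop exits; without it, one early busy pass would keep the condition true indefinitely whenever free_total stays below needed.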
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On Sun, Apr 11, 2021 at 03:45:01AM +0800, Wen Yang wrote:
> At this time, some logs are lost. It is suspected that the hard disk
> itself is faulty.

If you have a kernel crash dump, that means you can extract out the dmesg
buffer, correct? Are there any I/O error messages in the kernel log?

What is the basis of the suspicion that the hard drive is faulty? Kernel
dmesg output? Error reporting from smartctl?

> There are many hard disks on our server. Maybe we should not occupy
> 100% CPU for a long time just because one hard disk fails.

It depends on the nature of the hard drive failure. How is it failing?

One thing which we do need to be careful about is when focusing on how to
prevent a failure caused by some particular (potentially extreme)
scenarios, that we don't cause problems in more common scenarios (for
example a heavily loaded server, and/or a case where the file system is
almost full and we have multiple files "fighting" over a small number of
free blocks).

In general, my attitude is that the best way to protect against hard
drive failures is to have processes which are monitoring the health of
the system, and if there is evidence of a failed drive, to immediately
kill all jobs which are relying on that drive (which we call "draining" a
particular drive), and/or, if a sufficiently large percentage of the
drives have failed or the machine can no longer do its job, to
automatically move all of those jobs to other servers (e.g., "drain" the
server), and then send the machine to some kind of data center repair
service, where the failed hard drives can be replaced.

I'm skeptical of attempts to make the file system somehow continue to
"work" in the face of hard drive failures, since failures can be highly
atypical, and what might work well in one failure scenario might be
catastrophic in another. It's especially problematic if the HDD is not
explicitly signalling an error condition, but rather is being slow
(because it's doing a huge number of retries), or is returning data which
is simply different from what was previously written. The best we can do
in that case is to detect that something is wrong (this is where metadata
checksums would be very helpful), and then either remount the file system
r/o, or panic the machine, and/or signal to userspace that some
particular file system should be drained.

Cheers,

					- Ted
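The fallback options Ted mentions (remount r/o or panic once corruption is detected) correspond to existing ext4 error-behavior knobs. A sketch of how an operator might set them; `/dev/sdb1` and `/data` are placeholders, and this assumes e2fsprogs' tune2fs/dumpe2fs and a standard mount:

```shell
# Set the filesystem's persistent error behavior (what the kernel does
# when it detects on-disk corruption): continue, remount-ro, or panic.
tune2fs -e remount-ro /dev/sdb1        # /dev/sdb1 is a placeholder device

# The same policy can also be chosen per mount:
mount -o errors=panic /dev/sdb1 /data

# Verify the configured behavior from the superblock:
dumpe2fs -h /dev/sdb1 | grep -i 'errors behavior'
```

This is a configuration sketch only; on a multi-disk "business disk" layout like the one described above, `errors=remount-ro` on the business disks keeps a single failing disk from taking down the whole machine.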
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On 2021/4/9 1:47 PM, riteshh wrote:
> On 21/04/09 02:50AM, Wen Yang wrote:
>>> On Apr 7, 2021, at 5:16 AM, riteshh wrote:
>>>>
>>>> On 21/04/07 03:01PM, Wen Yang wrote:
>>>>> From: Wen Yang
>>>>>
>>>>> The kworker has occupied 100% of the CPU for several days:
>>>>>
>>>>> PID   USER PR NI VIRT RES SHR S %CPU  %MEM TIME+   COMMAND
>>>>> 68086 root 20 0  0    0   0   R 100.0 0.0  9718:18 kworker/u64:11
>>>>>
>>>>> And the stack obtained through sysrq is as follows:
>>>>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
>>>>> [20613144.850438] Call Trace:
>>>>> [20613144.850439] [] ext4_mb_new_blocks+0x429/0x550 [ext4]
>>>>> [20613144.850439] [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
>>>>> [20613144.850441] [] ext4_map_blocks+0x172/0x620 [ext4]
>>>>> [20613144.850442] [] ext4_writepages+0x7e5/0xf00 [ext4]
>>>>> [20613144.850443] [] do_writepages+0x1e/0x30
>>>>> [20613144.850444] [] __writeback_single_inode+0x45/0x320
>>>>> [20613144.850444] [] writeback_sb_inodes+0x272/0x600
>>>>> [20613144.850445] [] __writeback_inodes_wb+0x92/0xc0
>>>>> [20613144.850445] [] wb_writeback+0x268/0x300
>>>>> [20613144.850446] [] wb_workfn+0xb4/0x380
>>>>> [20613144.850447] [] process_one_work+0x189/0x420
>>>>> [20613144.850447] [] worker_thread+0x4e/0x4b0
>>>>>
>>>>> The CPU resources of the cloud server are precious, and the server
>>>>> cannot be restarted after running for a long time, so a configuration
>>>>> parameter is added to prevent this endless loop.
>>>>
>>>> Strange, if there is an endless loop here, then I would definitely see
>>>> if there is any accounting problem in pa->pa_count. Otherwise busy=1
>>>> should not be set every time. The ext4_mb_show_pa() function may help
>>>> debug this.
>>>>
>>>> If yes, then that means there always exists either a file
>>>> preallocation or a group preallocation. Maybe it is possible, in some
>>>> use case. Others may know of such a use case, if any.
>>>
>>> If this code is broken, then it doesn't make sense to me that we would
>>> leave it in the "run forever" state after the patch, and require a
>>> sysfs tunable to be set to have a properly working system?
>>>
>>> Is there anything particularly strange about the workload/system that
>>> might cause this? Filesystem is very full, memory is very low, etc?
>>
>> Hi Ritesh and Andreas,
>>
>> Thank you for your reply. Since there is still a faulty machine, we
>> have analyzed it again and found it is indeed a very special case:
>>
>> crash> struct ext4_group_info 8813bb5f72d0
>> struct ext4_group_info {
>>   bb_state = 0,
>>   bb_free_root = {
>>     rb_node = 0x0
>>   },
>>   bb_first_free = 1681,
>>   bb_free = 0,
>
> Not related to this issue, but the above two variable values don't look
> consistent.
>
>>   bb_fragments = 0,
>>   bb_largest_free_order = -1,
>>   bb_prealloc_list = {
>>     next = 0x880268291d78,
>>     prev = 0x880268291d78   ---> *** The list is empty
>>   },
>
> Ok. So when you collected the dump this list was empty.
>
>>   alloc_sem = {
>>     count = {
>>       counter = 0
>>     },
>>     wait_list = {
>>       next = 0x8813bb5f7308,
>>       prev = 0x8813bb5f7308
>>     },
>>     wait_lock = {
>>       raw_lock = {
>>         {
>>           val = {
>>             counter = 0
>>           },
>>           {
>>             locked = 0 '\000',
>>             pending = 0 '\000'
>>           },
>>           {
>>             locked_pending = 0,
>>             tail = 0
>>           }
>>         }
>>       }
>>     },
>>     osq = {
>>       tail = {
>>         counter = 0
>>       }
>>     },
>>     owner = 0x0
>>   },
>>   bb_counters = 0x8813bb5f7328
>> }
>> crash>
>>
>> crash> list 0x880268291d78 -l ext4_prealloc_space.pa_group_list -s
>
> No point of doing this I guess, since the list anyway is empty. What
> you may be seeing below is some garbage data.
>
>> ext4_prealloc_space.pa_count
>> 880268291d78
>>   pa_count = {
>>     counter = 1   ---> pa->pa_count
>>   }
>> 8813bb5f72f0
>>   pa_count = {
>>     counter = -30701
>>   }
>
> I guess, since the list is empty and you are seeing garbage, the
> counter value of the above node looks weird.
>
>> crash> struct -xo ext4_prealloc_space
>> struct ext4_prealloc_space {
>>    [0x0] struct list_head pa_inode_list;
>>   [0x10] struct list_head pa_group_list;
>>          union {
>>              struct list_head pa_tmp_list;
>>              struct callback_head pa_rcu;
>>   [0x20] } u;
>>   [0x30] spinlock_t pa_lock;
>>   [0x34] atomic_t pa_count;
>>   [0x38] unsigned int pa_deleted;
>>   [0x40] ext4_fsblk_t pa_pstart;
>>   [0x48] ext4_lblk_t pa_lstart;
>>   [0x4c] ext4_grpblk_t pa_len;
>>   [0x50] ext4_grpblk_t pa_free;
>>   [0x54] unsigned short pa_type;
>>   [0x58] spinlock_t *pa_obj_lock;
>>   [0x60] struct inode *pa_inode;
>> }
>> SIZE: 0x68
>>
>> crash> rd 0x880268291d68 20
>> 880268291d68:  881822f8a4c8  881822f8a4c8   ..."..."
>> 880268291d78:  8813bb5f72f0  8813bb5f72f0   .r_..r_.
>> 880268291d88:  1000          880db2371000   ..7.
>> 880268291d98:  00010001
>> 880268291da8:  00029c39      00170c41       9...A...
>> 880268291db8:  0016          881822f8a4d8   ..."
>> 880268291dc8:  881822f8a268  880268291af8   h.."..)h
>> 880268291dd8:
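Whether bb_prealloc_list in the dump above is "empty" turns entirely on how the kernel's circular list_head works: a list is empty only when head->next points back at the head's own address. A minimal userspace re-implementation of that check (this mirrors the kernel's list semantics but is not the kernel header itself):

```c
#include <assert.h>

/* Userspace re-implementation of the kernel's circular doubly-linked
 * list head.  An empty list's head points at itself; any other value of
 * head->next means at least one element is linked in. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Link a node in just before the head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->next = h;
	n->prev = h->prev;
	h->prev->next = n;
	h->prev = n;
}
```

Applied to the dump: bb_prealloc_list lives inside the ext4_group_info at ...72d0, but its next/prev hold ...1d78, a different address, so head->next != head and the list holds exactly one element rather than being empty.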
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On 21/04/09 12:18PM, Jan Kara wrote:
> On Fri 09-04-21 11:17:33, riteshh wrote:
> > On 21/04/09 02:50AM, Wen Yang wrote:
> > > > On Apr 7, 2021, at 5:16 AM, riteshh wrote:
> > > >>
> > > >> On 21/04/07 03:01PM, Wen Yang wrote:
> > > >>> From: Wen Yang
> > > >>>
> > > >>> The kworker has occupied 100% of the CPU for several days:
> > > >>>
> > > >>> PID   USER PR NI VIRT RES SHR S %CPU  %MEM TIME+   COMMAND
> > > >>> 68086 root 20 0  0    0   0   R 100.0 0.0  9718:18 kworker/u64:11
> > > >>>
> > > >>> And the stack obtained through sysrq is as follows:
> > > >>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
> > > >>> [20613144.850438] Call Trace:
> > > >>> [20613144.850439] [] ext4_mb_new_blocks+0x429/0x550 [ext4]
> > > >>> [20613144.850439] [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
> > > >>> [20613144.850441] [] ext4_map_blocks+0x172/0x620 [ext4]
> > > >>> [20613144.850442] [] ext4_writepages+0x7e5/0xf00 [ext4]
> > > >>> [20613144.850443] [] do_writepages+0x1e/0x30
> > > >>> [20613144.850444] [] __writeback_single_inode+0x45/0x320
> > > >>> [20613144.850444] [] writeback_sb_inodes+0x272/0x600
> > > >>> [20613144.850445] [] __writeback_inodes_wb+0x92/0xc0
> > > >>> [20613144.850445] [] wb_writeback+0x268/0x300
> > > >>> [20613144.850446] [] wb_workfn+0xb4/0x380
> > > >>> [20613144.850447] [] process_one_work+0x189/0x420
> > > >>> [20613144.850447] [] worker_thread+0x4e/0x4b0
> > > >>>
> > > >>> The CPU resources of the cloud server are precious, and the server
> > > >>> cannot be restarted after running for a long time, so a
> > > >>> configuration parameter is added to prevent this endless loop.
> > > >>
> > > >> Strange, if there is an endless loop here, then I would definitely
> > > >> see if there is any accounting problem in pa->pa_count. Otherwise
> > > >> busy=1 should not be set every time. The ext4_mb_show_pa() function
> > > >> may help debug this.
> > > >>
> > > >> If yes, then that means there always exists either a file
> > > >> preallocation or a group preallocation. Maybe it is possible, in
> > > >> some use case. Others may know of such a use case, if any.
> > > >
> > > > If this code is broken, then it doesn't make sense to me that we
> > > > would leave it in the "run forever" state after the patch, and
> > > > require a sysfs tunable to be set to have a properly working system?
> > > >
> > > > Is there anything particularly strange about the workload/system
> > > > that might cause this? Filesystem is very full, memory is very low,
> > > > etc?
> > >
> > > Hi Ritesh and Andreas,
> > >
> > > Thank you for your reply. Since there is still a faulty machine, we
> > > have analyzed it again and found it is indeed a very special case:
> > >
> > > crash> struct ext4_group_info 8813bb5f72d0
> > > struct ext4_group_info {
> > >   bb_state = 0,
> > >   bb_free_root = {
> > >     rb_node = 0x0
> > >   },
> > >   bb_first_free = 1681,
> > >   bb_free = 0,
> >
> > Not related to this issue, but the above two variable values don't
> > look consistent.
> >
> > >   bb_fragments = 0,
> > >   bb_largest_free_order = -1,
> > >   bb_prealloc_list = {
> > >     next = 0x880268291d78,
> > >     prev = 0x880268291d78   ---> *** The list is empty
> > >   },
> >
> > Ok. So when you collected the dump this list was empty.
>
> No, it is not empty. It has a single element. Note that the structure is
> at 8813bb5f72d0 so the pointers would have to be like 8813bb5f7xxx.

Errr, yes right. So the list is not empty. But I guess the other
arguments discussed in that mail should still be valid.

-ritesh
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On Fri 09-04-21 11:17:33, riteshh wrote:
> On 21/04/09 02:50AM, Wen Yang wrote:
> > > On Apr 7, 2021, at 5:16 AM, riteshh wrote:
> > >>
> > >> On 21/04/07 03:01PM, Wen Yang wrote:
> > >>> From: Wen Yang
> > >>>
> > >>> The kworker has occupied 100% of the CPU for several days:
> > >>>
> > >>> PID   USER PR NI VIRT RES SHR S %CPU  %MEM TIME+   COMMAND
> > >>> 68086 root 20 0  0    0   0   R 100.0 0.0  9718:18 kworker/u64:11
> > >>>
> > >>> And the stack obtained through sysrq is as follows:
> > >>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
> > >>> [20613144.850438] Call Trace:
> > >>> [20613144.850439] [] ext4_mb_new_blocks+0x429/0x550 [ext4]
> > >>> [20613144.850439] [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
> > >>> [20613144.850441] [] ext4_map_blocks+0x172/0x620 [ext4]
> > >>> [20613144.850442] [] ext4_writepages+0x7e5/0xf00 [ext4]
> > >>> [20613144.850443] [] do_writepages+0x1e/0x30
> > >>> [20613144.850444] [] __writeback_single_inode+0x45/0x320
> > >>> [20613144.850444] [] writeback_sb_inodes+0x272/0x600
> > >>> [20613144.850445] [] __writeback_inodes_wb+0x92/0xc0
> > >>> [20613144.850445] [] wb_writeback+0x268/0x300
> > >>> [20613144.850446] [] wb_workfn+0xb4/0x380
> > >>> [20613144.850447] [] process_one_work+0x189/0x420
> > >>> [20613144.850447] [] worker_thread+0x4e/0x4b0
> > >>>
> > >>> The CPU resources of the cloud server are precious, and the server
> > >>> cannot be restarted after running for a long time, so a
> > >>> configuration parameter is added to prevent this endless loop.
> > >>
> > >> Strange, if there is an endless loop here, then I would definitely
> > >> see if there is any accounting problem in pa->pa_count. Otherwise
> > >> busy=1 should not be set every time. The ext4_mb_show_pa() function
> > >> may help debug this.
> > >>
> > >> If yes, then that means there always exists either a file
> > >> preallocation or a group preallocation. Maybe it is possible, in
> > >> some use case. Others may know of such a use case, if any.
> > >
> > > If this code is broken, then it doesn't make sense to me that we
> > > would leave it in the "run forever" state after the patch, and
> > > require a sysfs tunable to be set to have a properly working system?
> > >
> > > Is there anything particularly strange about the workload/system
> > > that might cause this? Filesystem is very full, memory is very low,
> > > etc?
> >
> > Hi Ritesh and Andreas,
> >
> > Thank you for your reply. Since there is still a faulty machine, we
> > have analyzed it again and found it is indeed a very special case:
> >
> > crash> struct ext4_group_info 8813bb5f72d0
> > struct ext4_group_info {
> >   bb_state = 0,
> >   bb_free_root = {
> >     rb_node = 0x0
> >   },
> >   bb_first_free = 1681,
> >   bb_free = 0,
>
> Not related to this issue, but the above two variable values don't look
> consistent.
>
> >   bb_fragments = 0,
> >   bb_largest_free_order = -1,
> >   bb_prealloc_list = {
> >     next = 0x880268291d78,
> >     prev = 0x880268291d78   ---> *** The list is empty
> >   },
>
> Ok. So when you collected the dump this list was empty.

No, it is not empty. It has a single element. Note that the structure is
at 8813bb5f72d0 so the pointers would have to be like 8813bb5f7xxx.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On 21/04/09 02:50AM, Wen Yang wrote:
> > On Apr 7, 2021, at 5:16 AM, riteshh wrote:
> >>
> >> On 21/04/07 03:01PM, Wen Yang wrote:
> >>> From: Wen Yang
> >>>
> >>> The kworker has occupied 100% of the CPU for several days:
> >>>
> >>> PID   USER PR NI VIRT RES SHR S %CPU  %MEM TIME+   COMMAND
> >>> 68086 root 20 0  0    0   0   R 100.0 0.0  9718:18 kworker/u64:11
> >>>
> >>> And the stack obtained through sysrq is as follows:
> >>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
> >>> [20613144.850438] Call Trace:
> >>> [20613144.850439] [] ext4_mb_new_blocks+0x429/0x550 [ext4]
> >>> [20613144.850439] [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
> >>> [20613144.850441] [] ext4_map_blocks+0x172/0x620 [ext4]
> >>> [20613144.850442] [] ext4_writepages+0x7e5/0xf00 [ext4]
> >>> [20613144.850443] [] do_writepages+0x1e/0x30
> >>> [20613144.850444] [] __writeback_single_inode+0x45/0x320
> >>> [20613144.850444] [] writeback_sb_inodes+0x272/0x600
> >>> [20613144.850445] [] __writeback_inodes_wb+0x92/0xc0
> >>> [20613144.850445] [] wb_writeback+0x268/0x300
> >>> [20613144.850446] [] wb_workfn+0xb4/0x380
> >>> [20613144.850447] [] process_one_work+0x189/0x420
> >>> [20613144.850447] [] worker_thread+0x4e/0x4b0
> >>>
> >>> The CPU resources of the cloud server are precious, and the server
> >>> cannot be restarted after running for a long time, so a configuration
> >>> parameter is added to prevent this endless loop.
> >>
> >> Strange, if there is an endless loop here, then I would definitely see
> >> if there is any accounting problem in pa->pa_count. Otherwise busy=1
> >> should not be set every time. The ext4_mb_show_pa() function may help
> >> debug this.
> >>
> >> If yes, then that means there always exists either a file
> >> preallocation or a group preallocation. Maybe it is possible, in some
> >> use case. Others may know of such a use case, if any.
> >
> > If this code is broken, then it doesn't make sense to me that we would
> > leave it in the "run forever" state after the patch, and require a
> > sysfs tunable to be set to have a properly working system?
> >
> > Is there anything particularly strange about the workload/system that
> > might cause this? Filesystem is very full, memory is very low, etc?
>
> Hi Ritesh and Andreas,
>
> Thank you for your reply. Since there is still a faulty machine, we have
> analyzed it again and found it is indeed a very special case:
>
> crash> struct ext4_group_info 8813bb5f72d0
> struct ext4_group_info {
>   bb_state = 0,
>   bb_free_root = {
>     rb_node = 0x0
>   },
>   bb_first_free = 1681,
>   bb_free = 0,

Not related to this issue, but the above two variable values don't look
consistent.

>   bb_fragments = 0,
>   bb_largest_free_order = -1,
>   bb_prealloc_list = {
>     next = 0x880268291d78,
>     prev = 0x880268291d78   ---> *** The list is empty
>   },

Ok. So when you collected the dump this list was empty.

>   alloc_sem = {
>     count = {
>       counter = 0
>     },
>     wait_list = {
>       next = 0x8813bb5f7308,
>       prev = 0x8813bb5f7308
>     },
>     wait_lock = {
>       raw_lock = {
>         {
>           val = {
>             counter = 0
>           },
>           {
>             locked = 0 '\000',
>             pending = 0 '\000'
>           },
>           {
>             locked_pending = 0,
>             tail = 0
>           }
>         }
>       }
>     },
>     osq = {
>       tail = {
>         counter = 0
>       }
>     },
>     owner = 0x0
>   },
>   bb_counters = 0x8813bb5f7328
> }
> crash>
>
> crash> list 0x880268291d78 -l ext4_prealloc_space.pa_group_list -s

No point of doing this I guess, since the list anyway is empty. What you
may be seeing below is some garbage data.

> ext4_prealloc_space.pa_count
> 880268291d78
>   pa_count = {
>     counter = 1   ---> pa->pa_count
>   }
> 8813bb5f72f0
>   pa_count = {
>     counter = -30701
>   }

I guess, since the list is empty and you are seeing garbage, the counter
value of the above node looks weird.

> crash> struct -xo ext4_prealloc_space
> struct ext4_prealloc_space {
>    [0x0] struct list_head pa_inode_list;
>   [0x10] struct list_head pa_group_list;
>          union {
>              struct list_head pa_tmp_list;
>              struct callback_head pa_rcu;
>   [0x20] } u;
>   [0x30] spinlock_t pa_lock;
>   [0x34] atomic_t pa_count;
>   [0x38] unsigned int pa_deleted;
>   [0x40] ext4_fsblk_t pa_pstart;
>   [0x48] ext4_lblk_t pa_lstart;
>   [0x4c] ext4_grpblk_t pa_len;
>   [0x50] ext4_grpblk_t pa_free;
>   [0x54] unsigned short pa_type;
>   [0x58] spinlock_t *pa_obj_lock;
>   [0x60] struct inode *pa_inode;
> }
> SIZE: 0x68
>
> crash> rd 0x880268291d68 20
> 880268291d68:  881822f8a4c8  881822f8a4c8   ..."..."
> 880268291d78:  8813bb5f72f0  8813bb5f72f0   .r_..r_.
> 880268291d88:  1000          880db2371000   ..7.
> 880268291d98:
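crash's `list -l ext4_prealloc_space.pa_group_list -s pa_count` walk relies on the usual container_of arithmetic: subtract the member's offset (0x10 in the `struct -xo` output above) from each node address to reach the enclosing structure, then read pa_count at its offset. A userspace sketch with a simplified stand-in struct — the field set and types here are illustrative assumptions for an LP64 ABI, not the exact kernel layout:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Simplified stand-in for ext4_prealloc_space: just enough fields to
 * show the pointer arithmetic.  On LP64, pa_inode_list occupies 0x10
 * bytes, so pa_group_list lands at offset 0x10, matching the dump. */
struct prealloc_space {
	struct list_head pa_inode_list;   /* offset 0x00 */
	struct list_head pa_group_list;   /* offset 0x10 */
	int pa_count;                     /* plain int here, not atomic_t */
};

/* container_of: recover the struct from a pointer to one of its members
 * by subtracting the member's offset within the struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
```

So a pa_group_list node seen at address X belongs to the ext4_prealloc_space at X - 0x10; if the node pointer is garbage or points back into a different structure (as suspected above), the "pa_count" read this way is garbage too.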
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
On Apr 8, 2021, at 12:50 PM, Wen Yang wrote:
>
> Hi Ritesh and Andreas,
>
> Thank you for your reply. Since there is still a faulty machine, we have
> analyzed it again and found it is indeed a very special case:
>
> crash> struct ext4_group_info 8813bb5f72d0
> struct ext4_group_info {
>   bb_state = 0,
>   bb_free_root = {
>     rb_node = 0x0
>   },
>   bb_first_free = 1681,
>   bb_free = 0,
>   bb_fragments = 0,
>   bb_largest_free_order = -1,
>   bb_prealloc_list = {
>     next = 0x880268291d78,
>     prev = 0x880268291d78   ---> *** The list is empty
>   },
>   alloc_sem = {
>     count = {
>       counter = 0
>     },
>     wait_list = {
>       next = 0x8813bb5f7308,
>       prev = 0x8813bb5f7308
>     },
>     wait_lock = {
>       raw_lock = {
>         {
>           val = {
>             counter = 0
>           },
>           {
>             locked = 0 '\000',
>             pending = 0 '\000'
>           },
>           {
>             locked_pending = 0,
>             tail = 0
>           }
>         }
>       }
>     },
>     osq = {
>       tail = {
>         counter = 0
>       }
>     },
>     owner = 0x0
>   },
>   bb_counters = 0x8813bb5f7328
> }
> crash>
>
> crash> list 0x880268291d78 -l ext4_prealloc_space.pa_group_list -s
> ext4_prealloc_space.pa_count
> 880268291d78
>   pa_count = {
>     counter = 1   ---> pa->pa_count
>   }
> 8813bb5f72f0
>   pa_count = {
>     counter = -30701
>   }
>
> crash> struct -xo ext4_prealloc_space
> struct ext4_prealloc_space {
>    [0x0] struct list_head pa_inode_list;
>   [0x10] struct list_head pa_group_list;
>          union {
>              struct list_head pa_tmp_list;
>              struct callback_head pa_rcu;
>   [0x20] } u;
>   [0x30] spinlock_t pa_lock;
>   [0x34] atomic_t pa_count;
>   [0x38] unsigned int pa_deleted;
>   [0x40] ext4_fsblk_t pa_pstart;
>   [0x48] ext4_lblk_t pa_lstart;
>   [0x4c] ext4_grpblk_t pa_len;
>   [0x50] ext4_grpblk_t pa_free;
>   [0x54] unsigned short pa_type;
>   [0x58] spinlock_t *pa_obj_lock;
>   [0x60] struct inode *pa_inode;
> }
> SIZE: 0x68
>
> Before "goto repeat", it is necessary to check whether
> grp->bb_prealloc_list is empty. This patch may fix it.
> Please kindly give us your comments and suggestions.
> Thanks.

This patch definitely looks more reasonable than the previous one, but I
don't think it is correct? It isn't clear how the system could get into
this state, or stay there.

If bb_prealloc_list is empty, then "busy" should be 0, since busy = 1 is
only set inside "list_for_each_entry_safe(bb_prealloc_list)", and that
implies the list is *not* empty? The ext4_lock_group() is held over this
code, so the list should not be changing in this case, and even if the
list changed, it _should_ only affect the loop one time.

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 99bf091..8082e2d 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4290,7 +4290,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  	free_total += free;
> 
>  	/* if we still need more blocks and some PAs were used, try again */
> -	if (free_total < needed && busy) {
> +	if (free_total < needed && busy &&
> +	    !list_empty(&grp->bb_prealloc_list)) {
>  		ext4_unlock_group(sb, group);
>  		cond_resched();
>  		busy = 0;
>  		goto repeat;

Minor suggestion - moving "busy = 0" right after "repeat:" and removing
the "int busy = 0" initializer would make this code a bit more clear.

Cheers, Andreas
Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p
> On Apr 7, 2021, at 5:16 AM, riteshh wrote:
>>
>> On 21/04/07 03:01PM, Wen Yang wrote:
>>> From: Wen Yang
>>>
>>> The kworker has occupied 100% of the CPU for several days:
>>>
>>> PID   USER PR NI VIRT RES SHR S %CPU  %MEM TIME+   COMMAND
>>> 68086 root 20 0  0    0   0   R 100.0 0.0  9718:18 kworker/u64:11
>>>
>>> And the stack obtained through sysrq is as follows:
>>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
>>> [20613144.850438] Call Trace:
>>> [20613144.850439] [] ext4_mb_new_blocks+0x429/0x550 [ext4]
>>> [20613144.850439] [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
>>> [20613144.850441] [] ext4_map_blocks+0x172/0x620 [ext4]
>>> [20613144.850442] [] ext4_writepages+0x7e5/0xf00 [ext4]
>>> [20613144.850443] [] do_writepages+0x1e/0x30
>>> [20613144.850444] [] __writeback_single_inode+0x45/0x320
>>> [20613144.850444] [] writeback_sb_inodes+0x272/0x600
>>> [20613144.850445] [] __writeback_inodes_wb+0x92/0xc0
>>> [20613144.850445] [] wb_writeback+0x268/0x300
>>> [20613144.850446] [] wb_workfn+0xb4/0x380
>>> [20613144.850447] [] process_one_work+0x189/0x420
>>> [20613144.850447] [] worker_thread+0x4e/0x4b0
>>>
>>> The CPU resources of the cloud server are precious, and the server
>>> cannot be restarted after running for a long time, so a configuration
>>> parameter is added to prevent this endless loop.
>>
>> Strange, if there is an endless loop here, then I would definitely see
>> if there is any accounting problem in pa->pa_count. Otherwise busy=1
>> should not be set every time. The ext4_mb_show_pa() function may help
>> debug this.
>>
>> If yes, then that means there always exists either a file preallocation
>> or a group preallocation. Maybe it is possible, in some use case.
>> Others may know of such a use case, if any.
>
> If this code is broken, then it doesn't make sense to me that we would
> leave it in the "run forever" state after the patch, and require a sysfs
> tunable to be set to have a properly working system?
>
> Is there anything particularly strange about the workload/system that
> might cause this? Filesystem is very full, memory is very low, etc?

Hi Ritesh and Andreas,

Thank you for your reply. Since there is still a faulty machine, we have
analyzed it again and found it is indeed a very special case:

crash> struct ext4_group_info 8813bb5f72d0
struct ext4_group_info {
  bb_state = 0,
  bb_free_root = {
    rb_node = 0x0
  },
  bb_first_free = 1681,
  bb_free = 0,
  bb_fragments = 0,
  bb_largest_free_order = -1,
  bb_prealloc_list = {
    next = 0x880268291d78,
    prev = 0x880268291d78   ---> *** The list is empty
  },
  alloc_sem = {
    count = {
      counter = 0
    },
    wait_list = {
      next = 0x8813bb5f7308,
      prev = 0x8813bb5f7308
    },
    wait_lock = {
      raw_lock = {
        {
          val = {
            counter = 0
          },
          {
            locked = 0 '\000',
            pending = 0 '\000'
          },
          {
            locked_pending = 0,
            tail = 0
          }
        }
      }
    },
    osq = {
      tail = {
        counter = 0
      }
    },
    owner = 0x0
  },
  bb_counters = 0x8813bb5f7328
}
crash>

crash> list 0x880268291d78 -l ext4_prealloc_space.pa_group_list -s
ext4_prealloc_space.pa_count
880268291d78
  pa_count = {
    counter = 1   ---> pa->pa_count
  }
8813bb5f72f0
  pa_count = {
    counter = -30701
  }

crash> struct -xo ext4_prealloc_space
struct ext4_prealloc_space {
   [0x0] struct list_head pa_inode_list;
  [0x10] struct list_head pa_group_list;
         union {
             struct list_head pa_tmp_list;
             struct callback_head pa_rcu;
  [0x20] } u;
  [0x30] spinlock_t pa_lock;
  [0x34] atomic_t pa_count;
  [0x38] unsigned int pa_deleted;
  [0x40] ext4_fsblk_t pa_pstart;
  [0x48] ext4_lblk_t pa_lstart;
  [0x4c] ext4_grpblk_t pa_len;
  [0x50] ext4_grpblk_t pa_free;
  [0x54] unsigned short pa_type;
  [0x58] spinlock_t *pa_obj_lock;
  [0x60] struct inode *pa_inode;
}
SIZE: 0x68

crash> rd 0x880268291d68 20
880268291d68:  881822f8a4c8  881822f8a4c8   ..."..."
880268291d78:  8813bb5f72f0  8813bb5f72f0   .r_..r_.
880268291d88:  1000          880db2371000   ..7.
880268291d98:  00010001
880268291da8:  00029c39      00170c41       9...A...
880268291db8:  0016          881822f8a4d8   ..."
880268291dc8:  881822f8a268  880268291af8   h.."..)h
880268291dd8:  880268291dd0  ea00069a07c0   ..)h
880268291de8:  004d5bb7      1000           .[M.
880268291df8:  8801a681f000

Before "goto repeat", it is necessary to check whether
grp->bb_prealloc_list is empty. This patch may fix it.
Please kindly give us your comments and suggestions.
Thanks.

diff --git
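One way the loop could spin forever, consistent with ritesh's suspicion about pa_count accounting earlier in the thread: if some path takes a reference on a preallocation and never drops it, every discard pass finds pa_count > 0, sets busy, and retries. A toy userspace model of that leak — all names here are invented for illustration, with a plain int standing in for atomic_t:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy refcounted preallocation: a get() with no matching put() leaves
 * the count permanently above zero, so a discard loop that retries
 * while "busy" never terminates. */
struct toy_pa {
	int count;
	bool deleted;
};

static void pa_get(struct toy_pa *pa) { pa->count++; }
static void pa_put(struct toy_pa *pa) { pa->count--; }

/* Returns the number of passes needed to discard the PA, or -1 if the
 * loop would spin forever (bounded by max_passes for demonstration). */
static int try_discard(struct toy_pa *pa, int max_passes)
{
	for (int pass = 1; pass <= max_passes; pass++) {
		bool busy = false;

		if (!pa->deleted) {
			if (pa->count > 0)
				busy = true;      /* still referenced: skip, retry */
			else
				pa->deleted = true;
		}
		if (!busy)
			return pass;
	}
	return -1;  /* the endless-loop symptom seen by the kworker */
}
```

A balanced get/put sequence discards on the next pass; a leaked get makes every pass busy, which is the 100%-CPU behavior reported at the top of the thread. That is why auditing the get/put pairing is a more direct fix than bounding the loop with a tunable.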