Re: Btrfs umount hang
Here is a fresh snapshot hang example. Here kworker/u24:0 is blocked
waiting on btrfs_tree_read_lock, and again underlying it is
btrfs_search_slot after it has gone through cache_block_group. The other
two have blocked at the same place in cache_block_group, so I can
confirm it is hanging the same way as with umount.

Here is the struct extent_buffer passed to btrfs_tree_read_lock for
kworker/u24:0:

struct extent_buffer {
        ...
        lock_owner = 23286,
        write_locks = { counter = 1 },
        read_locks = { counter = 0 },
        blocking_writers = { counter = 1 },
        ...
}

So again the ioctl has taken the lock needed by its own underlying
operation. Btrfs pid 23286 is going to perform a write, so it can't skip
locking - and not just read locking, but write locking. The fact that it
owns the write lock is fine, and skip_locking in its btrfs_path struct
is zero, so up to this point everything is correctly locked. The problem
arises later, if the cache isn't available: then we are about to call
cache_block_group. Even if it does not hang on the wait at the beginning
of the cache_block_group function (as in the current case):

        while (cache->cached == BTRFS_CACHE_FAST) {
                struct btrfs_caching_control *ctl;

                ctl = cache->caching_ctl;
                atomic_inc(&ctl->count);
                prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
                spin_unlock(&cache->lock);

                schedule();

                finish_wait(&ctl->wait, &wait);
                put_caching_control(ctl);
                spin_lock(&cache->lock);
        }

it seems there is a high chance to hang later on btrfs_tree_read_lock,
even if there is absolutely no concurrent work going on, by calling:

cache_block_group
 ->load_free_space_cache
  ->lookup_free_space_inode
   ->__lookup_free_space_inode
    ->btrfs_search_slot
     ->btrfs_read_lock_root_node
      ->btrfs_tree_read_lock

since we are already holding the write lock. It looks like if we call
cache_block_group from find_free_extent, itself invoked via
btrfs_search_slot, we are about to get deadlocked too in case the data
isn't cached. It looks scary. I hope that I've missed something.
Best regards,
Angel

On Thu, Aug 10, 2017 at 1:30 AM, Angel Shtilianov wrote:
> Hi Jeff,
> That's exactly what I am observing in the backtraces.
> umount reaches cache_block_group passing through btrfs_search_slot,
> which looks like it got the wr_lock (to be confirmed tomorrow when I'm
> in the office), and all other btrfs worker threads are fighting for the
> btrfs_tree_read_lock.
> Umount is hanging on the cache_block_group wait; all others are
> hanging on the read_lock. So caching seems to be the bottleneck,
> blocking all others. Good catch! I'll take a look at the commit
> tomorrow morning (with a large cup of coffee of course).
>
> By the way, looking through the logs I can see one more pattern -
> umount after drop snapshot reaches a similar state. The common part is
> that drop snapshot will also call search slot, get the read lock, drop
> it and get the write lock. At the same time there is an umount causing
> fs_sync going on concurrently. I've seen at least three such hangs
> today and I've started the corresponding crashdumps getting copied to
> my machine. Tomorrow I'll take a look at them. It seems the common
> point is that umount again reaches the same place - cache_block_group -
> and hangs. Tomorrow I'll take a look at the crash dump for the
> workers' state.
>
> Thank you,
> Angel
>
> On Wed, Aug 9, 2017 at 6:42 PM, Jeff Mahoney wrote:
>> On 8/8/17 7:30 AM, Angel Shtilianov wrote:
>>> crash> bt -f 31625
>>> PID: 31625  TASK: 88046a833400  CPU: 7  COMMAND: "btrfs-transacti"
>>> wants to acquire struct extent_buffer 88000460aca0 lock, whose
>>> lock_owner is 27574.
>>>
>>> here is pid 27574:
>>> PID: 27574  TASK: 88038b469a00  CPU: 4  COMMAND: "kworker/u32:9"
>>> which is also trying to acquire eb lock 8802598b6200, and here
>>> the owner is 31696.
>>>
>>> 31696 is
>>> PID: 31696  TASK: 88044b59ce00  CPU: 5  COMMAND: "umount"
>>>
>>> So definitely here is a kind of deadlock.
>>> umount holds the lock needed by the workers to complete and waits for
>>> them to complete.
>>> Lockdep wouldn't complain about that.
>>> I am still about to investigate what has previously triggered/disabled
>>> lockdep.
>>> I have to obtain the log from the machine, but I need some time to get it.
>>>
>>> Jeff, you were right.
>>> Could you help demystify how we ended up here?
>>
>> Hi Angel -
>>
>> It looks like a regression introduced by 291c7d2f5, but that's a very
>> old commit. As that commit says, it's a rare occurrence to hit that
>> wait, and that's probably why we haven't seen this issue sooner.
>>
>> There's potential for this to happen whenever two threads are modifying
>> the tree at once and one needs to find a free extent. I'll need to
>> think a bit on how to fix it.
>>
>> -Jeff
>>
>>> Best regards,
>>> Angel
>>>
>>> On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney
Re: Btrfs umount hang
Hi Jeff,
That's exactly what I am observing in the backtraces.
umount reaches cache_block_group passing through btrfs_search_slot,
which looks like it got the wr_lock (to be confirmed tomorrow when I'm
in the office), and all other btrfs worker threads are fighting for the
btrfs_tree_read_lock.
Umount is hanging on the cache_block_group wait; all others are
hanging on the read_lock. So caching seems to be the bottleneck,
blocking all others. Good catch! I'll take a look at the commit
tomorrow morning (with a large cup of coffee of course).

By the way, looking through the logs I can see one more pattern -
umount after drop snapshot reaches a similar state. The common part is
that drop snapshot will also call search slot, get the read lock, drop
it and get the write lock. At the same time there is an umount causing
fs_sync going on concurrently. I've seen at least three such hangs
today and I've started the corresponding crashdumps getting copied to
my machine. Tomorrow I'll take a look at them. It seems the common
point is that umount again reaches the same place - cache_block_group -
and hangs. Tomorrow I'll take a look at the crash dump for the
workers' state.

Thank you,
Angel

On Wed, Aug 9, 2017 at 6:42 PM, Jeff Mahoney wrote:
> On 8/8/17 7:30 AM, Angel Shtilianov wrote:
>> crash> bt -f 31625
>> PID: 31625  TASK: 88046a833400  CPU: 7  COMMAND: "btrfs-transacti"
>> wants to acquire struct extent_buffer 88000460aca0 lock, whose
>> lock_owner is 27574.
>>
>> here is pid 27574:
>> PID: 27574  TASK: 88038b469a00  CPU: 4  COMMAND: "kworker/u32:9"
>> which is also trying to acquire eb lock 8802598b6200, and here
>> the owner is 31696.
>>
>> 31696 is
>> PID: 31696  TASK: 88044b59ce00  CPU: 5  COMMAND: "umount"
>>
>> So definitely here is a kind of deadlock.
>> umount holds the lock needed by the workers to complete and waits for
>> them to complete.
>> Lockdep wouldn't complain about that.
>> I am still about to investigate what has previously triggered/disabled
>> lockdep.
>> I have to obtain the log from the machine, but I need some time to get it.
>>
>> Jeff, you were right.
>> Could you help demystify how we ended up here?
>
> Hi Angel -
>
> It looks like a regression introduced by 291c7d2f5, but that's a very
> old commit. As that commit says, it's a rare occurrence to hit that
> wait, and that's probably why we haven't seen this issue sooner.
>
> There's potential for this to happen whenever two threads are modifying
> the tree at once and one needs to find a free extent. I'll need to
> think a bit on how to fix it.
>
> -Jeff
>
>> Best regards,
>> Angel
>>
>> On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney wrote:
>>> On 8/7/17 1:19 PM, Jeff Mahoney wrote:
>>>> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>>>>> Hi there,
>>>>> I'm investigating sporadic hanging during btrfs umount. The FS is
>>>>> contained in a loop mounted file.
>>>>> I have no reproduction scenario and the issue may happen once a day or
>>>>> once a month. It is rare, but frustrating.
>>>>> I have a crashdump (the server has been manually crashed and collected
>>>>> a crashdump), so I could take a look through the data structures.
>>>>> What happens is that umount is getting in D state and the kernel
>>>>> complains about hung tasks. We are using kernel 4.4.y. The actual back
>>>>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>>>>> used (4.4.30 through 4.4.70).
>>>>> Tasks like:
>>>>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>>>>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>>>>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>>>>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>>>>> by task umount:31696 (which is also blocked for more than 120 seconds)
>>>>> according to the lock debug.
>>>>>
>>>>> umount is hung in "cache_block_group", see the '>' mark:
>>>>>         while (cache->cached == BTRFS_CACHE_FAST) {
>>>>>                 struct btrfs_caching_control *ctl;
>>>>>
>>>>>                 ctl = cache->caching_ctl;
>>>>>                 atomic_inc(&ctl->count);
>>>>>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>>>                 spin_unlock(&cache->lock);
>>>>>
>>>>> >               schedule();
>>>>>
>>>>>                 finish_wait(&ctl->wait, &wait);
>>>>>                 put_caching_control(ctl);
>>>>>                 spin_lock(&cache->lock);
>>>>>         }
>>>>>
>>>>> The complete backtraces could be found in the attached log.
>>>>>
>>>>> Do you have any ideas?
>>>>
>>>> Hi Angel -
>>>>
>>>> In your log, it says lockdep is disabled. What tripped it earlier?
>>>> Lockdep really should be catching locking deadlocks in situations like
>>>> this, if that's really the underlying cause.
>>>
>>> Actually, I'm not sure if lockdep would catch this one. Here's my
>>> hypothesis:
>>>
>>> kworker/u32:9 is waiting on a read lock while reading the free space
>>> cache, which means it owns the
Re: Btrfs umount hang
On 8/8/17 7:30 AM, Angel Shtilianov wrote:
> crash> bt -f 31625
> PID: 31625  TASK: 88046a833400  CPU: 7  COMMAND: "btrfs-transacti"
> wants to acquire struct extent_buffer 88000460aca0 lock, whose
> lock_owner is 27574.
>
> here is pid 27574:
> PID: 27574  TASK: 88038b469a00  CPU: 4  COMMAND: "kworker/u32:9"
> which is also trying to acquire eb lock 8802598b6200, and here
> the owner is 31696.
>
> 31696 is
> PID: 31696  TASK: 88044b59ce00  CPU: 5  COMMAND: "umount"
>
> So definitely here is a kind of deadlock.
> umount holds the lock needed by the workers to complete and waits for
> them to complete.
> Lockdep wouldn't complain about that.
> I am still about to investigate what has previously triggered/disabled
> lockdep.
> I have to obtain the log from the machine, but I need some time to get it.
>
> Jeff, you were right.
> Could you help demystify how we ended up here?

Hi Angel -

It looks like a regression introduced by 291c7d2f5, but that's a very
old commit. As that commit says, it's a rare occurrence to hit that
wait, and that's probably why we haven't seen this issue sooner.

There's potential for this to happen whenever two threads are modifying
the tree at once and one needs to find a free extent. I'll need to
think a bit on how to fix it.

-Jeff

> Best regards,
> Angel
>
> On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney wrote:
>> On 8/7/17 1:19 PM, Jeff Mahoney wrote:
>>> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>>>> Hi there,
>>>> I'm investigating sporadic hanging during btrfs umount. The FS is
>>>> contained in a loop mounted file.
>>>> I have no reproduction scenario and the issue may happen once a day or
>>>> once a month. It is rare, but frustrating.
>>>> I have a crashdump (the server has been manually crashed and collected
>>>> a crashdump), so I could take a look through the data structures.
>>>> What happens is that umount is getting in D state and the kernel
>>>> complains about hung tasks. We are using kernel 4.4.y. The actual back
>>>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>>>> used (4.4.30 through 4.4.70).
>>>> Tasks like:
>>>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>>>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>>>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>>>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>>>> by task umount:31696 (which is also blocked for more than 120 seconds)
>>>> according to the lock debug.
>>>>
>>>> umount is hung in "cache_block_group", see the '>' mark:
>>>>         while (cache->cached == BTRFS_CACHE_FAST) {
>>>>                 struct btrfs_caching_control *ctl;
>>>>
>>>>                 ctl = cache->caching_ctl;
>>>>                 atomic_inc(&ctl->count);
>>>>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>>                 spin_unlock(&cache->lock);
>>>>
>>>> >               schedule();
>>>>
>>>>                 finish_wait(&ctl->wait, &wait);
>>>>                 put_caching_control(ctl);
>>>>                 spin_lock(&cache->lock);
>>>>         }
>>>>
>>>> The complete backtraces could be found in the attached log.
>>>>
>>>> Do you have any ideas?
>>>
>>> Hi Angel -
>>>
>>> In your log, it says lockdep is disabled. What tripped it earlier?
>>> Lockdep really should be catching locking deadlocks in situations like
>>> this, if that's really the underlying cause.
>>
>> Actually, I'm not sure if lockdep would catch this one. Here's my
>> hypothesis:
>>
>> kworker/u32:9 is waiting on a read lock while reading the free space
>> cache, which means it owns the cache->cached value and will issue the
>> wakeup when it completes.
>>
>> umount is waiting for the wakeup from kworker/u32:9 but is holding
>> some tree locks in write mode.
>>
>> If kworker/u32:9 is waiting on the locks that umount holds, we have a
>> deadlock.
>>
>> Can you dump the extent buffer that kworker/u32:9 is waiting on? Part
>> of that will contain the PID of the holder, and if it matches umount,
>> we've found the cause.
>>
>> -Jeff
>>
>> --
>> Jeff Mahoney
>> SUSE Labs
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Jeff Mahoney
SUSE Labs
Re: Btrfs umount hang
crash> bt -f 31625
PID: 31625  TASK: 88046a833400  CPU: 7  COMMAND: "btrfs-transacti"
wants to acquire struct extent_buffer 88000460aca0 lock, whose
lock_owner is 27574.

here is pid 27574:
PID: 27574  TASK: 88038b469a00  CPU: 4  COMMAND: "kworker/u32:9"
which is also trying to acquire eb lock 8802598b6200, and here
the owner is 31696.

31696 is
PID: 31696  TASK: 88044b59ce00  CPU: 5  COMMAND: "umount"

So definitely here is a kind of deadlock.
umount holds the lock needed by the workers to complete and waits for
them to complete.
Lockdep wouldn't complain about that.
I am still about to investigate what has previously triggered/disabled
lockdep.
I have to obtain the log from the machine, but I need some time to get it.

Jeff, you were right.
Could you help demystify how we ended up here?

Best regards,
Angel

On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney wrote:
> On 8/7/17 1:19 PM, Jeff Mahoney wrote:
>> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>>> Hi there,
>>> I'm investigating sporadic hanging during btrfs umount. The FS is
>>> contained in a loop mounted file.
>>> I have no reproduction scenario and the issue may happen once a day or
>>> once a month. It is rare, but frustrating.
>>> I have a crashdump (the server has been manually crashed and collected
>>> a crashdump), so I could take a look through the data structures.
>>> What happens is that umount is getting in D state and the kernel
>>> complains about hung tasks. We are using kernel 4.4.y. The actual back
>>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>>> used (4.4.30 through 4.4.70).
>>> Tasks like:
>>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>>> by task umount:31696 (which is also blocked for more than 120 seconds)
>>> according to the lock debug.
>>>
>>> umount is hung in "cache_block_group", see the '>' mark:
>>>         while (cache->cached == BTRFS_CACHE_FAST) {
>>>                 struct btrfs_caching_control *ctl;
>>>
>>>                 ctl = cache->caching_ctl;
>>>                 atomic_inc(&ctl->count);
>>>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>                 spin_unlock(&cache->lock);
>>>
>>> >               schedule();
>>>
>>>                 finish_wait(&ctl->wait, &wait);
>>>                 put_caching_control(ctl);
>>>                 spin_lock(&cache->lock);
>>>         }
>>>
>>> The complete backtraces could be found in the attached log.
>>>
>>> Do you have any ideas?
>>
>> Hi Angel -
>>
>> In your log, it says lockdep is disabled. What tripped it earlier?
>> Lockdep really should be catching locking deadlocks in situations like
>> this, if that's really the underlying cause.
>
> Actually, I'm not sure if lockdep would catch this one. Here's my
> hypothesis:
>
> kworker/u32:9 is waiting on a read lock while reading the free space
> cache, which means it owns the cache->cached value and will issue the
> wakeup when it completes.
>
> umount is waiting for the wakeup from kworker/u32:9 but is holding
> some tree locks in write mode.
>
> If kworker/u32:9 is waiting on the locks that umount holds, we have a
> deadlock.
>
> Can you dump the extent buffer that kworker/u32:9 is waiting on? Part
> of that will contain the PID of the holder, and if it matches umount,
> we've found the cause.
>
> -Jeff
>
> --
> Jeff Mahoney
> SUSE Labs
Re: Btrfs umount hang
On 8/7/17 1:19 PM, Jeff Mahoney wrote:
> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>> Hi there,
>> I'm investigating sporadic hanging during btrfs umount. The FS is
>> contained in a loop mounted file.
>> I have no reproduction scenario and the issue may happen once a day or
>> once a month. It is rare, but frustrating.
>> I have a crashdump (the server has been manually crashed and collected
>> a crashdump), so I could take a look through the data structures.
>> What happens is that umount is getting in D state and the kernel
>> complains about hung tasks. We are using kernel 4.4.y. The actual back
>> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
>> used (4.4.30 through 4.4.70).
>> Tasks like:
>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>> by task umount:31696 (which is also blocked for more than 120 seconds)
>> according to the lock debug.
>>
>> umount is hung in "cache_block_group", see the '>' mark:
>>         while (cache->cached == BTRFS_CACHE_FAST) {
>>                 struct btrfs_caching_control *ctl;
>>
>>                 ctl = cache->caching_ctl;
>>                 atomic_inc(&ctl->count);
>>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>                 spin_unlock(&cache->lock);
>>
>> >               schedule();
>>
>>                 finish_wait(&ctl->wait, &wait);
>>                 put_caching_control(ctl);
>>                 spin_lock(&cache->lock);
>>         }
>>
>> The complete backtraces could be found in the attached log.
>>
>> Do you have any ideas?
>
> Hi Angel -
>
> In your log, it says lockdep is disabled. What tripped it earlier?
> Lockdep really should be catching locking deadlocks in situations like
> this, if that's really the underlying cause.

Actually, I'm not sure if lockdep would catch this one. Here's my
hypothesis:

kworker/u32:9 is waiting on a read lock while reading the free space
cache, which means it owns the cache->cached value and will issue the
wakeup when it completes.

umount is waiting for the wakeup from kworker/u32:9 but is holding
some tree locks in write mode.

If kworker/u32:9 is waiting on the locks that umount holds, we have a
deadlock.

Can you dump the extent buffer that kworker/u32:9 is waiting on? Part
of that will contain the PID of the holder, and if it matches umount,
we've found the cause.

-Jeff

--
Jeff Mahoney
SUSE Labs
Re: Btrfs umount hang
On 8/7/17 10:12 AM, Angel Shtilianov wrote:
> Hi there,
> I'm investigating sporadic hanging during btrfs umount. The FS is
> contained in a loop mounted file.
> I have no reproduction scenario and the issue may happen once a day or
> once a month. It is rare, but frustrating.
> I have a crashdump (the server has been manually crashed and collected
> a crashdump), so I could take a look through the data structures.
> What happens is that umount is getting in D state and the kernel
> complains about hung tasks. We are using kernel 4.4.y. The actual back
> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
> used (4.4.30 through 4.4.70).
> Tasks like:
> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
> are getting blocked waiting for btrfs_tree_read_lock, which is owned
> by task umount:31696 (which is also blocked for more than 120 seconds)
> according to the lock debug.
>
> umount is hung in "cache_block_group", see the '>' mark:
>         while (cache->cached == BTRFS_CACHE_FAST) {
>                 struct btrfs_caching_control *ctl;
>
>                 ctl = cache->caching_ctl;
>                 atomic_inc(&ctl->count);
>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>                 spin_unlock(&cache->lock);
>
> >               schedule();
>
>                 finish_wait(&ctl->wait, &wait);
>                 put_caching_control(ctl);
>                 spin_lock(&cache->lock);
>         }
>
> The complete backtraces could be found in the attached log.
>
> Do you have any ideas?

Hi Angel -

In your log, it says lockdep is disabled. What tripped it earlier?
Lockdep really should be catching locking deadlocks in situations like
this, if that's really the underlying cause.

-Jeff

--
Jeff Mahoney
SUSE Labs
Re: Btrfs umount hang
On 7.08.2017 17:12, Angel Shtilianov wrote:
> Hi there,
> I'm investigating sporadic hanging during btrfs umount. The FS is
> contained in a loop mounted file.
> I have no reproduction scenario and the issue may happen once a day or
> once a month. It is rare, but frustrating.
> I have a crashdump (the server has been manually crashed and collected
> a crashdump), so I could take a look through the data structures.
> What happens is that umount is getting in D state and the kernel
> complains about hung tasks. We are using kernel 4.4.y. The actual back
> trace is from 4.4.70, but this happens with all the 4.4 kernels I've
> used (4.4.30 through 4.4.70).
> Tasks like:
> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
> are getting blocked waiting for btrfs_tree_read_lock, which is owned
> by task umount:31696 (which is also blocked for more than 120 seconds)
> according to the lock debug.
>
> umount is hung in "cache_block_group", see the '>' mark:
>         while (cache->cached == BTRFS_CACHE_FAST) {
>                 struct btrfs_caching_control *ctl;
>
>                 ctl = cache->caching_ctl;
>                 atomic_inc(&ctl->count);
>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>                 spin_unlock(&cache->lock);
>
> >               schedule();
>
>                 finish_wait(&ctl->wait, &wait);
>                 put_caching_control(ctl);
>                 spin_lock(&cache->lock);
>         }
>
> The complete backtraces could be found in the attached log.
>
> Do you have any ideas?
> Any help will be greatly appreciated.

So by the looks of it, while writing dirty bgs and requiring a free
block for the CoW process, cache_block_group() kicks off a caching
thread which should just go and read in the respective block group. So
the newly spawned caching_thread should actually wake up
cache_block_group either due to success, if it manages to find 2 megs:

if (total_found > (1024 * 1024 * 2)) {

or, in case of failure, after the out label. But in both cases it will
set cache->cached to something different than BTRFS_CACHE_FAST and it
ought to exit the loop. But from your description of the issue I take it
the process never comes back from the schedule, meaning it missed the
wakeup from caching_thread, and atm I cannot see how this could happen.
Can you print the state of the 'cache' parameter of cache_block_group?

> Best regards,
> Angel Shtilianov