Re: kjournald hang on ext3 to ext3 copy
On Jun 18, 2007 22:53 -0500, John Marconi wrote: > Andreas Dilger wrote: > >Two tips for debugging this kind of issue: > >- you need to have detailed stack traces (e.g. sysrq-t) of all the > > interesting processes > > > >- if a process is stuck inside a large function (e.g. 8379 in example) > > you need to provide the exact line number. this can be found by > > compiling > > the kernel with CONFIG_DEBUG_INFO (-g flag to gcc) and then doing > > "gdb vmlinux" and "p *(journal_commit_transaction+{offset})", where the > > byte offset is printed in the sysrq-t output, and then include the code > > surrounding that line from the source file > > > >- a process stuck in "start_this_handle()" is often just an innocent > > bystander. It is waiting for the currently committing transaction to > > complete before it can start a new filesystem-modifying operation > > (handle). > > That said, the journal handle acts like a lock and has been the cause of > > many deadlock problems (e.g. process 1 holds lock, waits for handle; > > process 2 holds transaction open waiting for lock). pdflush might be one > > of the "process 1" kind of tasks, and some other process is holding the > > transaction open preventing it from completing. > > I am not able to update the entire kernel to a new version for a variety > of reasons, however I can update certain parts in my system (such as the > filesystem). I did a diff of the 2.6.16 kernel against my kernel, and > the changes to jbd were minimal. I plan on looking at the latest > versions of the kernel to determine if anything has changed since 2.6.16. The problem may also be in the ext3 layer and not jbd. > I took a look at the place that kjournald was stuck - it is in the > journal_commit_transaction "while (comiit_transaction->t_updates)" loop > and it is trying to "spin_lock(&journal->j_state_lock). When I look at > pdflush, it is also trying to take the journal->j_state_lock. Do you > have any tips on finding out which process might own journal->j_state_lock? You can enable CONFIG_DEBUG_SPINLOCK in newer kernels and it appears the spinlock will set the "owner" field to the task struct. You still need to get access to this via e.g. "crash" or lkcd or something. Hmm, it seems this is only set for ppc and s390??? That is how I would debug this in any case. The other way (I've done this too many times in the past) is to look through all of the stack traces and figure out which ones are in a filesystem context, then check if any of them are blocked on locks while holding transactions open. Needs a detailed understanding of kernel callpaths. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kjournald hang on ext3 to ext3 copy
Andreas Dilger wrote: On Jun 16, 2007 08:17 -0500, John Marconi wrote: I am running into a situation in which one of my ext3 filesystems is getting hung during normal usage. There are three ext3 filesystems on a CompactFLASH. One is mounted as / and one as /tmp. In my test, I am copying a 100 MB file from /root to /tmp repeatedly. While doing this test, I eventually see the copying stop, and any attempts to access /tmp fail - if I even do ls /tmp the command will hang. I suspect kjournald because of the following ps output: PID PPID WCHAN:20 PCPU %MEM PSR COMM 884799 start_this_handle1.1 0.0 28 pdflush 885399 schedule_timeout 0.2 0.0 7 pdflush 188 1 kswapd 0.0 0.0 19 kswapd0 8051 1 mtd_blktrans_thread 0.0 0.0 22 mtdblockd 8243 1 kjournald0.0 0.0 0 kjournald 8305 1 schedule_timeout0.0 0.0 2 udevd 8378 1 kjournald0.0 0.0 0 kjournald 8379 1 journal_commit_trans 16.6 0.0 0 kjournald 8437 1 schedule_timeout 0.0 0.0 0 evlogd 8527 1 syslog0.0 0.0 1 klogd 8534 1 schedule_timeout 0.0 0.0 0 portmap 8569 1 schedule_timeout 0.0 0.0 0 rngd 8639 1 schedule_timeout 0.1 0.0 24 sshd 8741 8639 schedule_timeout0.0 0.0 0 sshd 8743 8741 wait0.0 0.0 9 bash 8857 8743 schedule_timeout4.9 0.0 7 cp 8664 1 schedule_timeout 0.0 0.0 0 xinetd 8679 1 schedule_timeout 0.0 0.0 0 evlnotifyd 8689 1 schedule_timeout 0.0 0.0 0 evlactiond 8704 1 wait 0.0 0.0 1 bash 8882 8704 -0.0 0.0 2 ps If I run ps repeatedly, I always see process 8379 in journal_commit_transaction, and it is always taking between 12% and 20% of processor 0 up. This process never completes. I also see process 8847 in start_this_handle forever as well - so I believe they are related. This system is using a 2.6.14 kernel. Please try to reproduce with a newer kernel, as this kind of problem might have been fixed already. Two tips for debugging this kind of issue: - you need to have detailed stack traces (e.g. sysrq-t) of all the interesting processes - if a process is stuck inside a large function (e.g. 8379 in example) you need to provide the exact line number. this can be found by compiling the kernel with CONFIG_DEBUG_INFO (-g flag to gcc) and then doing "gdb vmlinux" and "p *(journal_commit_transaction+{offset})", where the byte offset is printed in the sysrq-t output, and then include the code surrounding that line from the source file - a process stuck in "start_this_handle()" is often just an innocent bystander. It is waiting for the currently committing transaction to complete before it can start a new filesystem-modifying operation (handle). That said, the journal handle acts like a lock and has been the cause of many deadlock problems (e.g. process 1 holds lock, waits for handle; process 2 holds transaction open waiting for lock). pdflush might be one of the "process 1" kind of tasks, and some other process is holding the transaction open preventing it from completing. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. Andreas, Thanks for the information. I am not able to update the entire kernel to a new version for a variety of reasons, however I can update certain parts in my system (such as the filesystem). I did a diff of the 2.6.16 kernel against my kernel, and the changes to jbd were minimal. I plan on looking at the latest versions of the kernel to determine if anything has changed since 2.6.16. I took a look at the place that kjournald was stuck - it is in the journal_commit_transaction "while (comiit_transaction->t_updates)" loop and it is trying to "spin_lock(&journal->j_state_lock). When I look at pdflush, it is also trying to take the journal->j_state_lock. Do you have any tips on finding out which process might own journal->j_state_lock? Thanks again, John - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kjournald hang on ext3 to ext3 copy
On Jun 16, 2007 08:17 -0500, John Marconi wrote: > I am running into a situation in which one of my ext3 filesystems is > getting hung during normal usage. There are three ext3 filesystems on a > CompactFLASH. One is mounted as / and one as /tmp. In my test, I am > copying a 100 MB file from /root to /tmp repeatedly. While doing this > test, I eventually see the copying stop, and any attempts to access /tmp > fail - if I even do ls /tmp the command will hang. > > I suspect kjournald because of the following ps output: > PID PPID WCHAN:20 PCPU %MEM PSR COMM > 884799 start_this_handle1.1 0.0 28 pdflush > 885399 schedule_timeout 0.2 0.0 7 pdflush > 188 1 kswapd 0.0 0.0 19 kswapd0 > 8051 1 mtd_blktrans_thread 0.0 0.0 22 mtdblockd > 8243 1 kjournald0.0 0.0 0 kjournald > 8305 1 schedule_timeout0.0 0.0 2 udevd > 8378 1 kjournald0.0 0.0 0 kjournald > 8379 1 journal_commit_trans 16.6 0.0 0 kjournald > 8437 1 schedule_timeout 0.0 0.0 0 evlogd > 8527 1 syslog0.0 0.0 1 klogd > 8534 1 schedule_timeout 0.0 0.0 0 portmap > 8569 1 schedule_timeout 0.0 0.0 0 rngd > 8639 1 schedule_timeout 0.1 0.0 24 sshd > 8741 8639 schedule_timeout0.0 0.0 0 sshd > 8743 8741 wait0.0 0.0 9 bash > 8857 8743 schedule_timeout4.9 0.0 7 cp > 8664 1 schedule_timeout 0.0 0.0 0 xinetd > 8679 1 schedule_timeout 0.0 0.0 0 evlnotifyd > 8689 1 schedule_timeout 0.0 0.0 0 evlactiond > 8704 1 wait 0.0 0.0 1 bash > 8882 8704 -0.0 0.0 2 ps > > If I run ps repeatedly, I always see process 8379 in > journal_commit_transaction, and it is always taking between 12% and 20% > of processor 0 up. This process never completes. I also see process > 8847 in start_this_handle forever as well - so I believe they are related. > > This system is using a 2.6.14 kernel. Please try to reproduce with a newer kernel, as this kind of problem might have been fixed already. Two tips for debugging this kind of issue: - you need to have detailed stack traces (e.g. sysrq-t) of all the interesting processes - if a process is stuck inside a large function (e.g. 8379 in example) you need to provide the exact line number. this can be found by compiling the kernel with CONFIG_DEBUG_INFO (-g flag to gcc) and then doing "gdb vmlinux" and "p *(journal_commit_transaction+{offset})", where the byte offset is printed in the sysrq-t output, and then include the code surrounding that line from the source file - a process stuck in "start_this_handle()" is often just an innocent bystander. It is waiting for the currently committing transaction to complete before it can start a new filesystem-modifying operation (handle). That said, the journal handle acts like a lock and has been the cause of many deadlock problems (e.g. process 1 holds lock, waits for handle; process 2 holds transaction open waiting for lock). pdflush might be one of the "process 1" kind of tasks, and some other process is holding the transaction open preventing it from completing. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html