On Jun 16, 2007 08:17 -0500, John Marconi wrote:
I am running into a situation in which one of my ext3 filesystems is
getting hung during normal usage. There are three ext3 filesystems on a
CompactFLASH. One is mounted as / and one as /tmp. In my test, I am
copying a 100 MB file from /root to /tmp repeatedly. While doing this
test, I eventually see the copying stop, and any attempt to access /tmp
fails: even a plain ls /tmp hangs.
I suspect kjournald because of the following ps output:
  PID  PPID WCHAN:20             PCPU %MEM PSR COMM
 8847    99 start_this_handle     1.1  0.0  28 pdflush
 8853    99 schedule_timeout      0.2  0.0   7 pdflush
  188     1 kswapd                0.0  0.0  19 kswapd0
 8051     1 mtd_blktrans_thread   0.0  0.0  22 mtdblockd
 8243     1 kjournald             0.0  0.0   0 kjournald
 8305     1 schedule_timeout      0.0  0.0   2 udevd
 8378     1 kjournald             0.0  0.0   0 kjournald
 8379     1 journal_commit_trans 16.6  0.0   0 kjournald
 8437     1 schedule_timeout      0.0  0.0   0 evlogd
 8527     1 syslog                0.0  0.0   1 klogd
 8534     1 schedule_timeout      0.0  0.0   0 portmap
 8569     1 schedule_timeout      0.0  0.0   0 rngd
 8639     1 schedule_timeout      0.1  0.0  24 sshd
 8741  8639 schedule_timeout      0.0  0.0   0 sshd
 8743  8741 wait                  0.0  0.0   9 bash
 8857  8743 schedule_timeout      4.9  0.0   7 cp
 8664     1 schedule_timeout      0.0  0.0   0 xinetd
 8679     1 schedule_timeout      0.0  0.0   0 evlnotifyd
 8689     1 schedule_timeout      0.0  0.0   0 evlactiond
 8704     1 wait                  0.0  0.0   1 bash
 8882  8704 -                     0.0  0.0   2 ps
If I run ps repeatedly, I always see process 8379 in
journal_commit_transaction, and it always uses between 12% and 20%
of processor 0. This process never completes. I also see process
8847 stuck in start_this_handle forever, so I believe the two are related.
This system is using a 2.6.14 kernel.
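The "run ps repeatedly" check described above can be automated. The following is only a rough sketch, assuming a Linux /proc that exposes /proc/<pid>/wchan and assuming the two wait-channel prefixes seen in the ps output are the ones of interest; the function names and the JOURNAL_WCHANS tuple are made up for illustration.

```python
import os
import time

# Wait channels worth flagging. These are taken from the ps output above;
# treat them as an assumption and adjust them for your kernel.
JOURNAL_WCHANS = ("journal_commit_trans", "start_this_handle")


def read_wchans():
    """Take one snapshot {pid: wchan} from /proc/<pid>/wchan (Linux only)."""
    snap = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/wchan") as f:
                snap[int(entry)] = f.read().strip()
        except OSError:
            continue  # task exited meanwhile, or the file is not readable
    return snap


def persistent_waiters(samples, prefixes=JOURNAL_WCHANS):
    """Return {pid: wchan} for tasks found in the same matching wait
    channel in every snapshot of `samples` (a list of {pid: wchan})."""
    if not samples:
        return {}
    result = {}
    for pid, wchan in samples[0].items():
        if not wchan.startswith(prefixes):
            continue
        if all(s.get(pid) == wchan for s in samples[1:]):
            result[pid] = wchan
    return result


def sample(count=5, delay=2.0):
    """Collect `count` snapshots, `delay` seconds apart."""
    snaps = []
    for _ in range(count):
        snaps.append(read_wchans())
        time.sleep(delay)
    return snaps
```

Calling persistent_waiters(sample()) as root while the copy is hung should list only the tasks that never left the journal code. A task that merely sleeps in schedule_timeout is ignored on purpose, since that is the normal state of an idle daemon.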
Please try to reproduce with a newer kernel, as this kind of problem
might have been fixed already.
Three tips for debugging this kind of issue:
- you need to have detailed stack traces (e.g. sysrq-t) of all the
interesting processes
- if a process is stuck inside a large function (e.g. 8379 in your example)
you need to provide the exact line number. This can be found by compiling
the kernel with CONFIG_DEBUG_INFO (-g flag to gcc), running gdb vmlinux,
and doing list *(journal_commit_transaction+{offset}), where the byte
offset is the one printed in the sysrq-t output. Then include the code
surrounding that line from the source file.
- a process stuck in start_this_handle() is often just an innocent
bystander. It is waiting for the currently committing transaction to
complete before it can start a new filesystem-modifying operation (handle).
That said, the journal handle acts like a lock and has been the cause of
many deadlock problems (e.g. process 1 holds lock, waits for handle;
process 2 holds transaction open waiting for lock). pdflush might be a
task of the process 1 kind, while some other process holds the
transaction open and so prevents the commit from completing.
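
To make the difference between a bystander and a real deadlock concrete, here is a small sketch that treats the situation as a wait-for graph: each blocked task points at the task it is waiting on, and a deadlock is a cycle in that graph. The task names and the find_wait_cycle helper are purely illustrative; in practice the edges would have to be worked out by hand from the sysrq-t stack traces.

```python
def find_wait_cycle(waits_for):
    """Return the tasks forming a wait cycle, or [] if there is none.

    `waits_for` maps a blocked task to the single task it is waiting on.
    Tasks that are not blocked simply do not appear as keys.
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of waiters until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return []


# Innocent bystander: pdflush waits for the commit, and the commit itself
# is not blocked on anyone, so it will eventually finish.
bystander = {"pdflush": "kjournald"}

# Deadlock as described above: process 1 holds the lock and waits for a
# new handle (i.e. for the commit), the commit waits for process 2 to
# close its open handle, and process 2 waits for the lock held by process 1.
deadlock = {
    "process 1": "kjournald",
    "kjournald": "process 2",
    "process 2": "process 1",
}

print(find_wait_cycle(bystander))
print(find_wait_cycle(deadlock))
```

In the first graph the chain simply ends at kjournald, so no cycle is reported. In the second every task in the chain is waiting on another one, which is why stack traces of all the involved processes are needed to see the whole picture.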
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.