Re: kjournald hang on ext3 to ext3 copy

2007-06-18 Thread Andreas Dilger
On Jun 16, 2007  08:17 -0500, John Marconi wrote:
 I am running into a situation in which one of my ext3 filesystems is 
 getting hung during normal usage.  There are three ext3 filesystems on a 
 CompactFLASH.  One is mounted as / and one as /tmp.  In my test, I am 
 copying a 100 MB file from /root to /tmp repeatedly.  While doing this 
 test, I eventually see the copying stop, and any attempts to access /tmp 
 fail - if I even do ls /tmp the command will hang.
 
 I suspect kjournald because of the following ps output:
 PID  PPID   WCHAN:20  PCPU  %MEM  PSR  COMM
 884799 start_this_handle1.1  0.0  28 pdflush
 885399 schedule_timeout   0.2  0.0   7 pdflush
  188 1 kswapd   0.0  0.0  19   kswapd0
 8051 1 mtd_blktrans_thread   0.0  0.0  22   mtdblockd
 8243 1 kjournald0.0  0.0   0   kjournald
 8305 1 schedule_timeout0.0  0.0   2   udevd
 8378 1 kjournald0.0  0.0   0   kjournald
 8379 1 journal_commit_trans 16.6  0.0   0   kjournald
 8437 1 schedule_timeout   0.0  0.0   0   evlogd
 8527 1 syslog0.0  0.0   1   klogd
 8534 1 schedule_timeout   0.0  0.0   0   portmap
 8569 1 schedule_timeout   0.0  0.0   0   rngd
 8639 1 schedule_timeout   0.1  0.0  24   sshd
 8741  8639 schedule_timeout0.0  0.0   0 sshd
 8743  8741 wait0.0  0.0   9   bash
 8857  8743 schedule_timeout4.9  0.0   7 cp
 8664 1 schedule_timeout   0.0  0.0   0   xinetd
 8679 1 schedule_timeout   0.0  0.0   0   evlnotifyd
 8689 1 schedule_timeout   0.0  0.0   0   evlactiond
 8704 1 wait   0.0  0.0   1   bash
 8882  8704 -0.0  0.0   2 ps
 
 If I run ps repeatedly, I always see process 8379 in 
 journal_commit_transaction, and it is always taking between 12% and 20% 
 of processor 0 up.  This process never completes.  I also see process 
 8847 in start_this_handle forever as well - so I believe they are related. 
 
 This system is using a 2.6.14 kernel.

Please try to reproduce with a newer kernel, as this kind of problem
might have been fixed already.


Two tips for debugging this kind of issue:
- you need to have detailed stack traces (e.g. sysrq-t) of all the
  interesting processes

- if a process is stuck inside a large function (e.g. 8379 in example)
  you need to provide the exact line number.  this can be found by compiling
  the kernel with CONFIG_DEBUG_INFO (-g flag to gcc) and then doing
  gdb vmlinux and p *(journal_commit_transaction+{offset}), where the
  byte offset is printed in the sysrq-t output, and then include the code
  surrounding that line from the source file

- a process stuck in start_this_handle() is often just an innocent
  bystander.  It is waiting for the currently committing transaction to
  complete before it can start a new filesystem-modifying operation (handle).
  That said, the journal handle acts like a lock and has been the cause of
  many deadlock problems (e.g. process 1 holds lock, waits for handle;
  process 2 holds transaction open waiting for lock).  pdflush might be one
  of the process 1 kind of tasks, and some other process is holding the
  transaction open preventing it from completing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kjournald hang on ext3 to ext3 copy

2007-06-18 Thread Andreas Dilger
On Jun 18, 2007  22:53 -0500, John Marconi wrote:
 Andreas Dilger wrote:
 Two tips for debugging this kind of issue:
 - you need to have detailed stack traces (e.g. sysrq-t) of all the
   interesting processes
 
 - if a process is stuck inside a large function (e.g. 8379 in example)
   you need to provide the exact line number.  this can be found by 
   compiling
   the kernel with CONFIG_DEBUG_INFO (-g flag to gcc) and then doing
   gdb vmlinux and p *(journal_commit_transaction+{offset}), where the
   byte offset is printed in the sysrq-t output, and then include the code
   surrounding that line from the source file
 
 - a process stuck in start_this_handle() is often just an innocent
   bystander.  It is waiting for the currently committing transaction to
   complete before it can start a new filesystem-modifying operation 
   (handle).
   That said, the journal handle acts like a lock and has been the cause of
   many deadlock problems (e.g. process 1 holds lock, waits for handle;
   process 2 holds transaction open waiting for lock).  pdflush might be one
   of the process 1 kind of tasks, and some other process is holding the
   transaction open preventing it from completing.
 
 I am not able to update the entire kernel to a new version for a variety 
 of reasons, however I can update certain parts in my system (such as the 
 filesystem).  I did a diff of the 2.6.16 kernel against my kernel, and 
 the changes to jbd were minimal.  I plan on looking at the latest 
 versions of the kernel to determine if anything has changed since 2.6.16.

The problem may also be in the ext3 layer and not jbd.

 I took a look at the place that kjournald was stuck - it is in the 
 journal_commit_transaction while (comiit_transaction-t_updates) loop 
 and it is trying to spin_lock(journal-j_state_lock).  When I look at 
 pdflush, it is also trying to take the journal-j_state_lock.  Do you 
 have any tips on finding out which process might own journal-j_state_lock?

You can enable CONFIG_DEBUG_SPINLOCK in newer kernels and it appears the
spinlock will set the owner field to the task struct.  You still need
to get access to this via e.g. crash or lkcd or something.

Hmm, it seems this is only set for ppc and s390???  That is how I would
debug this in any case.  The other way (I've done this too many times
in the past) is to look through all of the stack traces and figure out
which ones are in a filesystem context, then check if any of them are
blocked on locks while holding transactions open.  Needs a detailed
understanding of kernel callpaths.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html