Hi Vladimir,

Vladimir V. Saveliev wrote:

On Thu, 2006-04-20 at 09:27 -0400, Jason Keltz wrote:
On April 5, I sent an email to the list about a problem we were having on our systems that seemed to be reiserfs related. Unfortunately, I didn't get any response. I'm trying one more time with additional information in the hope that someone might be able to help us solve this problem.

We're having problems with our file server crashing every 6-30 days.

Does it crash or lock up? When a system crashes it usually outputs some
information to the console. Does it do that in your case?
If it does not - please describe the system behaviour after the crash: is it
locked up completely, or do just some processes get blocked?

I apologize. My terminology wasn't correct. I guess the system itself is neither crashing nor hanging. NFS activity stops, and all the client machines get an "nfs server not responding" message. I can still connect to the file server via SSH, and it doesn't seem to be doing anything. What I did not try was reading from and writing to ALL of the reiserfs disks. I imagine that if I hit the disk that is causing the deadlock condition, the system will then hang, but I could be wrong. Any ideas on how I could add debugging code to the kernel to help solve the problem?

One other minor point that I may have left out: I applied the whole group of "quota" patches from [EMAIL PROTECTED], and not just one of the patches:

01-reiserfs-free-blocks.diff.gz  05-write_times.diff.gz
02-akpm-b_journal_head.diff.gz   06-reiserfs-quota-28.diff.gz
03-reiserfs-sync_fs-4.diff.gz    07-kinode-10.diff.gz
04-data-logging-40.diff.gz

My guess is that the problem is somewhere in these patches, since I don't see other people complaining about reiserfs patches, and I imagine that the 2.4 quota patch is used much less than 2.6 reiserfs with quota. We're working on an upgrade path from our existing systems to 2.6, but it won't happen for a few months, and we really need to solve this problem.

Thanks for any help you can provide.


I can't remember the last time it stayed up longer than 30 days. I've recently installed kdb in order to get more debugging information in the hopes that I might be able to get assistance in solving this problem. The server crashed for the second time yesterday since I got kdb installed (18 days since the last time). I've attached some minimal output from kdb (kreiserfsd process, rpc.mountd, lockd, and then 2 nfsd processes). Any assistance would be appreciated.

I e-mailed Neil Brown of NFS server fame, and he confirmed that the problem was not an NFS server problem, but a reiserfs one.

Neil gave me this excellent description of the problem:

The deadlock is happening in flushing out the journal.  Any process
that tries to write will need to wait for that journal flush to
complete, which it won't.

Hmm,
One nfsd thread is blocked waiting to write.  It holds a read lock on
the nfsd exports table.
There is an 'rpc.mountd' process that is trying to get a write lock on
the exports table.  It will block until the read lock is released.
The other nfsd threads are trying to get a read lock.  That won't
succeed until mountd gets its write lock and releases it.

Thus all but one nfsd threads are blocked by mountd which is blocked
by the remaining nfsd which is blocked by kreiserfsd.
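
To make the chain above easier to follow, here is a minimal Python sketch that writes it down as a wait-for graph and walks from a blocked task to whatever it is ultimately waiting on. The task labels and the "journal buffer I/O" node are illustrative names chosen for this sketch, not kernel identifiers. The lockd edge is taken from the kdb trace below (lockd is also sleeping in exp_readlock), not from Neil's description.

```python
# Wait-for graph: each blocked task maps to what it is waiting on.
# Labels are illustrative; they are not kernel symbols.
waits_on = {
    "nfsd (other threads)": "rpc.mountd",   # want the exports read lock
    "lockd": "rpc.mountd",                  # also in exp_readlock (from kdb trace)
    "rpc.mountd": "nfsd (writer)",          # wants the exports write lock
    "nfsd (writer)": "kreiserfsd",          # waits for the journal flush
    "kreiserfsd": "journal buffer I/O",     # assumed label for __wait_on_buffer
}


def blocking_chain(task, graph):
    """Follow wait-for edges from task until nothing further is known.

    Stops early if a task repeats, which would indicate a true cycle.
    """
    chain = [task]
    seen = {task}
    while chain[-1] in graph:
        nxt = graph[chain[-1]]
        chain.append(nxt)
        if nxt in seen:
            break
        seen.add(nxt)
    return chain


print(" -> ".join(blocking_chain("nfsd (other threads)", waits_on)))
```

Read this way, everything funnels into the single journal flush that kreiserfsd never finishes, which matches the conclusion that the NFS side is only a victim here.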

Special things about our nfs server:

The server runs stock kernel 2.4.32 with kdb patch and reiserfs quota patch.
NFS filesystems being exported are all reiserfs3.
It is a dual processor system running the SMP kernel.
All exported filesystems are on a 3ware raid card.

The kdb logs:

Apr 19 13:39:36 0xf67fc000 247 1 0 1 D 0xf67fc370 kreiserfsd
Apr 19 13:39:36 ESP        EIP        Function (args)
Apr 19 13:39:36 0xf67fde98 0xc011b144 schedule+0x2b4 (0xc0452e40, 0x0, 0xf67fc000, 0xc91f8a10, 0xc91f8a10)
Apr 19 13:39:36 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Apr 19 13:39:36 0xf67fdee0 0xc014680e __wait_on_buffer+0x6e (0xc91f89c0, 0x1601, 0xf67fdf30, 0x1, 0xeeee4000)
Apr 19 13:39:36 kernel .text 0xc0100000 0xc01467a0 0xc0146840
Apr 19 13:39:36 0xf67fdf08 0xf8943e79 [reiserfs]flush_commit_list+0x3e9 (0xf69ab800, 0xeae57b20, 0x0)
Apr 19 13:39:36 reiserfs .text 0xf8921060 0xf8943a90 0xf8943f60
Apr 19 13:39:36 0xf67fdf48 0xf8943a41 [reiserfs]flush_older_commits+0x91 (0xf69ab800, 0xeae570a0, 0xebc80578, 0xf8ac6000, 0xf6a2373c)
Apr 19 13:39:37 reiserfs .text 0xf8921060 0xf89439b0 0xf8943a90
Apr 19 13:39:37 0xf67fdf68 0xf8943b24 [reiserfs]flush_commit_list+0x94
Apr 19 13:39:37 reiserfs .text 0xf8921060 0xf8943a90 0xf8943f60
Apr 19 13:39:37 0xf67fdfa8 0xf894801d [reiserfs]flush_async_commits+0x3d (0xf69ab800, 0xd17ebee0, 0xf67fdfd8, 0xf67fdfdc, 0x20)
Apr 19 13:39:37 reiserfs .text 0xf8921060 0xf8947fe0 0xf8948020
Apr 19 13:39:37 0xf67fdfb8 0xf894652b [reiserfs]reiserfs_journal_commit_thread+0x1db
Apr 19 13:39:37 reiserfs .text 0xf8921060 0xf8946350 0xf89465f0
Apr 19 13:39:37 0xf67fdff4 0xc010741e arch_kernel_thread+0x2e
Apr 19 13:39:37 kernel .text 0xc0100000 0xc01073f0 0xc0107430

Apr 19 13:39:47 0xeeffe000 993 1 0 0 S 0xeeffe370 rpc.mountd
Apr 19 13:39:47 ESP        EIP        Function (args)
Apr 19 13:39:47 0xeefffe04 0xc011b144 schedule+0x2b4 (0x0, 0xeeffe000, 0xf8d30164, 0xee98ff88, 0xeeffe000)
Apr 19 13:39:47 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Apr 19 13:39:48 0xeefffe4c 0xc011b78f interruptible_sleep_on+0x4f
Apr 19 13:39:48 kernel .text 0xc0100000 0xc011b740 0xc011b7d0
Apr 19 13:39:48 0xeefffe6c 0xf8d24d8a [nfsd]exp_writelock+0x5a
Apr 19 13:39:48 nfsd .text 0xf8d1d060 0xf8d24d30 0xf8d24e30
Apr 19 13:39:48 0xeefffe74 0xf8d244e6 [nfsd]exp_export+0x76 (0xdff64004, 0xbfffd000, 0x814, 0xc0151015, 0xcf62dc20)
Apr 19 13:39:48 nfsd .text 0xf8d1d060 0xf8d24470 0xf8d24880
Apr 19 13:39:48 0xeefffed4 0xf8d1da4e [nfsd]handle_sys_nfsservctl+0x23e (0x3, 0xbfffd000, 0x0, 0xeeffe000)
Apr 19 13:39:48 nfsd .text 0xf8d1d060 0xf8d1d810 0xf8d1dc90
Apr 19 13:39:48 0xeeffffac 0xc0160ed6 sys_nfsservctl+0x76 (0x3, 0xbfffd000, 0x0, 0x420dabf7, 0x2)
Apr 19 13:39:48 kernel .text 0xc0100000 0xc0160e60 0xc0160f4b
Apr 19 13:39:48 0xeeffffc4 0xc0108f1f system_call+0x33
Apr 19 13:39:48 kernel .text 0xc0100000 0xc0108eec 0xc0108f24

Apr 19 13:40:11 0xeef70000     1021        1  0    1   D  0xeef70370  lockd
Apr 19 13:40:11 ESP        EIP        Function (args)
Apr 19 13:40:12 0xeef71f60 0xc011b144 schedule+0x2b4 (0x0, 0xeef70000, 0xeee93f88, 0xf8d30164, 0xeef70000)
Apr 19 13:40:12 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Apr 19 13:40:12 0xeef71fa8 0xc011b8af sleep_on+0x4f
Apr 19 13:40:12 kernel .text 0xc0100000 0xc011b860 0xc011b8f0
Apr 19 13:40:12 0xeef71fc8 0xf8d24d0a [nfsd]exp_readlock+0x2a (0xc28ed2e0, 0xeefa7000, 0x7fffffff, 0xeef70000, 0x4789)
Apr 19 13:40:12 nfsd .text 0xf8d1d060 0xf8d24ce0 0xf8d24d30
Apr 19 13:40:12 0xeef71fcc 0xf8d10149 [lockd]lockd+0x1c9
Apr 19 13:40:12 lockd .text 0xf8d0e060 0xf8d0ff80 0xf8d10260
Apr 19 13:40:12 0xeef71ff4 0xc010741e arch_kernel_thread+0x2e
Apr 19 13:40:12 kernel .text 0xc0100000 0xc01073f0 0xc0107430

almost every nfsd was like this:

Apr 19 13:39:49 0xef00a000      998        1  0    1   D  0xef00a370  nfsd
Apr 19 13:39:49 ESP        EIP        Function (args)
Apr 19 13:39:49 0xef00bf38 0xc011b144 schedule+0x2b4 (0x0, 0xef00a000, 0xeed05f88, 0xee997f88, 0x337)
Apr 19 13:39:49 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Apr 19 13:39:49 0xef00bf80 0xc011b8af sleep_on+0x4f
Apr 19 13:39:49 kernel .text 0xc0100000 0xc011b860 0xc011b8f0
Apr 19 13:39:49 0xef00bfa0 0xf8d24d0a [nfsd]exp_readlock+0x2a (0xc28ed260, 0xf002d800, 0x7530, 0xb4fae97, 0xef00a000)
Apr 19 13:39:49 nfsd .text 0xf8d1d060 0xf8d24ce0 0xf8d24d30
Apr 19 13:39:49 0xef00bfa4 0xf8d1d3a8 [nfsd]nfsd+0x1a8
Apr 19 13:39:49 nfsd .text 0xf8d1d060 0xf8d1d200 0xf8d1d580
Apr 19 13:39:49 0xef00bff4 0xc010741e arch_kernel_thread+0x2e
Apr 19 13:39:49 kernel .text 0xc0100000 0xc01073f0 0xc0107430

and one like this:

Apr 19 13:40:33 0xeeee4000     1043        1  0    0   D  0xeeee4370  nfsd
Apr 19 13:40:33 ESP        EIP        Function (args)
Apr 19 13:40:33 0xeeee5920 0xc011b144 schedule+0x2b4 (0x1, 0xeeee4000, 0xeae57b44, 0xeae57b44, 0x14ca5af6)
Apr 19 13:40:33 kernel .text 0xc0100000 0xc011ae90 0xc011b3d0
Apr 19 13:40:33 0xeeee5968 0xc01079b3 __down+0x83
Apr 19 13:40:33 kernel .text 0xc0100000 0xc0107930 0xc0107a10
Apr 19 13:40:33 0xeeee598c 0xc0107b5c __down_failed+0x8 (0xf69ab800, 0xeae57b20, 0xeeee59c4, 0x0, 0xf7fda438)
Apr 19 13:40:33 kernel .text 0xc0100000 0xc0107b54 0xc0107b60
Apr 19 13:40:34 0xeeee599c 0xf894969c [reiserfs].text.lock.journal+0x5
Apr 19 13:40:34 reiserfs .text 0xf8921060 0xf8949697 0xf89497f0
Apr 19 13:40:34 0xeeee599c 0xf8943b3a [reiserfs]flush_commit_list+0xaa (0xf69ab800, 0xeae57b20, 0x1, 0x0)
Apr 19 13:40:34 reiserfs .text 0xf8921060 0xf8943a90 0xf8943f60
Apr 19 13:40:34 0xeeee59dc 0xf8943417 [reiserfs]get_list_bitmap+0x77 (0xf69ab800, 0xe2279840, 0x1, 0x2, 0x0)
Apr 19 13:40:34 reiserfs .text 0xf8921060 0xf89433a0 0xf8943450
Apr 19 13:40:34 0xeeee5a00 0xf89491a1 [reiserfs]do_journal_end+0x7d1 (0xeeee5a74, 0xf69ab800, 0x1, 0x2, 0xf69ab800)
Apr 19 13:40:34 reiserfs .text 0xf8921060 0xf89489d0 0xf89495a0
Apr 19 13:40:34 0xeeee5a64 0xf894753a [reiserfs]do_journal_begin_r+0x14a
Apr 19 13:40:34 reiserfs .text 0xf8921060 0xf89473f0 0xf8947670
Apr 19 13:40:34 0xeeee5aa8 0xf89477f2 [reiserfs]journal_begin_Rsmp_c8218e28+0x72 (0xeeee5e54, 0xf69ab800, 0x384, 0x0, 0x2)
Apr 19 13:40:35 reiserfs .text 0xf8921060 0xf8947780 0xf8947850
Apr 19 13:40:35 0xeeee5acc 0xf89471f2 [reiserfs]reiserfs_restart_transaction+0x92 (0xeeee5e54, 0x384, 0x2ec7ca2, 0x1, 0x168ec)
Apr 19 13:40:35 [0]more>
Apr 19 13:40:36 reiserfs .text 0xf8921060 0xf8947160 0xf8947240
Apr 19 13:40:36 0xeeee5af4 0xf893e9f2 [reiserfs]prepare_for_delete_or_cut+0x622 (0xeeee5e54, 0xd6c0f580, 0xeeee5dd4, 0xeeee5db4, 0xeeee5bb8)
Apr 19 13:40:36 reiserfs .text 0xf8921060 0xf893e3d0 0xf893ebc0
Apr 19 13:40:36 0xeeee5b6c 0xf893fc08 [reiserfs]reiserfs_cut_from_item+0xd8 (0xeeee5e54, 0xeeee5dd4, 0xeeee5db4, 0xd6c0f580, 0x0)
Apr 19 13:40:36 reiserfs .text 0xf8921060 0xf893fb30 0xf89401d0
Apr 19 13:40:36 0xeeee5d84 0xf894050e [reiserfs]reiserfs_do_truncate+0x29e
Apr 19 13:40:36 reiserfs .text 0xf8921060 0xf8940270 0xf89407d0
Apr 19 13:40:36 0xeeee5e28 0xf893f6ad [reiserfs]reiserfs_delete_object+0x3d (0xeeee5e54, 0xd6c0f580, 0x24, 0xd6c0f5ec, 0xf69ab800)
Apr 19 13:40:36 reiserfs .text 0xf8921060 0xf893f670 0xf893f6f0
Apr 19 13:40:37 0xeeee5e44 0xf8929727 [reiserfs]reiserfs_delete_inode+0x107 (0xd6c0f580, 0xffffffff, 0x0)
Apr 19 13:40:37 reiserfs .text 0xf8921060 0xf8929620 0xf89297b0
Apr 19 13:40:37 0xeeee5e88 0xc015f12a iput+0x17a
Apr 19 13:40:37 kernel .text 0xc0100000 0xc015efb0 0xc015f2e0
Apr 19 13:40:37 0xeeee5ea4 0xc015ca25 d_delete+0xa5
Apr 19 13:40:37 kernel .text 0xc0100000 0xc015c980 0xc015ca40
Apr 19 13:40:37 0xeeee5eb8 0xc0153a85 vfs_unlink+0x185
Apr 19 13:40:37 kernel .text 0xc0100000 0xc0153900 0xc0153be0
Apr 19 13:40:37 0xeeee5ed4 0xf8d23a35 [nfsd]nfsd_unlink+0x125 (0xeeef4c00, 0xeeef4804, 0xffffc000, 0xeeee006c, 0xa)
Apr 19 13:40:37 nfsd .text 0xf8d1d060 0xf8d23910 0xf8d23b50
Apr 19 13:40:37 0xeeee5f10 0xf8d2888d [nfsd]nfsd3_proc_remove+0x7d (0xeeef4c00, 0xeeef4a00, 0xeeef4800)
Apr 19 13:40:37 nfsd .text 0xf8d1d060 0xf8d28810 0xf8d28920
Apr 19 13:40:38 0xeeee5f48 0xf8d1d64e [nfsd]nfsd_dispatch+0xce (0xeeef4c00, 0xeeee0018, 0xeeee5f8c, 0x94, 0x98)
Apr 19 13:40:38 nfsd .text 0xf8d1d060 0xf8d1d580 0xf8d1d765
Apr 19 13:40:38 [0]more>
Apr 19 13:40:38 0xeeee5f64 0xf8cfe2cf [sunrpc]svc_process_Rsmp_877fc141+0x45f (0xc28ed260, 0xeeef4c00, 0x7530, 0xb4f38b8, 0xeeee4000)
Apr 19 13:40:38 sunrpc .text 0xf8cf6060 0xf8cfde70 0xf8cfe3f5
Apr 19 13:40:39 0xeeee5fa4 0xf8d1d41f [nfsd]nfsd+0x21f
Apr 19 13:40:39 nfsd .text 0xf8d1d060 0xf8d1d200 0xf8d1d580
Apr 19 13:40:39 0xeeee5ff4 0xc010741e arch_kernel_thread+0x2e
Apr 19 13:40:39 kernel .text 0xc0100000 0xc01073f0 0xc0107430
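
For anyone who wants to skim these traces, the following is a rough Python sketch that pulls just the (module, function) pairs out of kdb output captured through syslog. The regular expression is my own guess at the frame layout seen above (ESP, EIP, then an optional [module] prefix and function+offset). It is not an official kdb format description, and the helper name is hypothetical.

```python
import re

# One stack frame looks like: "<ESP> <EIP> [module]function+0xoffset".
# The [module] prefix is absent for functions in the core kernel image.
FRAME = re.compile(
    r"0x[0-9a-f]+ 0x[0-9a-f]+ (?:\[(\w+)\])?([\w.]+)\+0x[0-9a-f]+"
)


def frames(log_text):
    """Return (module, function) pairs in the order they appear."""
    return [(m.group(1) or "kernel", m.group(2))
            for m in FRAME.finditer(log_text)]


# Two lines copied from the kreiserfsd trace above.
sample = (
    "Apr 19 13:39:36 0xf67fde98 0xc011b144 schedule+0x2b4 "
    "(0xc0452e40, 0x0, 0xf67fc000, 0xc91f8a10, 0xc91f8a10)\n"
    "Apr 19 13:39:37 0xf67fdf68 0xf8943b24 [reiserfs]flush_commit_list+0x94\n"
)

for module, function in frames(sample):
    print(module, function)
```

Because it scans for frames anywhere in the text and does not go line by line, it should also cope with several log entries run together on one line.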

Thanks for any help you can provide...

Jason Keltz


