Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread Jesper Juhl

On 30/11/06, David Chinner <[EMAIL PROTECTED]> wrote:

On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> >> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
> >>
> >> Call Trace:
> >> [] show_trace+0xb2/0x380
> >> [] dump_stack+0x15/0x20
> >> [] xfs_error_report+0x3c/0x50
> >> [] xfs_trans_cancel+0x6e/0x130
> >> [] xfs_create+0x5ee/0x6a0
> >> [] xfs_vn_mknod+0x156/0x2e0
> >> [] xfs_vn_create+0xb/0x10
> >> [] vfs_create+0x8c/0xd0
> >> [] nfsd_create_v3+0x31a/0x560
> >> [] nfsd3_proc_create+0x148/0x170
> >> [] nfsd_dispatch+0xf9/0x1e0
> >> [] svc_process+0x437/0x6e0
> >> [] nfsd+0x1cd/0x360
> >> [] child_rip+0xa/0x12
> >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> >> fs/xfs/xfs_trans.c.  Return address = 0x80359daa
> >
> >We shut down the filesystem because we cancelled a dirty transaction.
> >Once we start to dirty the incore objects, we can't roll back to
> >an unchanged state if a subsequent fatal error occurs during the
> >transaction and we have to abort it.
> >
> So you are saying that there's nothing I can do to prevent this from
> happening in the future?

Pretty much - we need to work out what is going wrong and
we can't do that from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

Is this reproducible?


Not on demand, no. It has happened only this once as far as I know and
for unknown reasons.



> >If I understand historic occurrences of this correctly, there is
> >a possibility that it can be triggered in ENOMEM situations. Was your
> >machine running out of memory when this occurred?
> >
> Not really. I just checked my monitoring software and, at the time
> this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.

OOC, there weren't any I/O errors reported before this shutdown?


No. I looked but found none.

Let me know if there's anything I can do to help.

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread David Chinner
On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> >> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
> >>
> >> Call Trace:
> >> [] show_trace+0xb2/0x380
> >> [] dump_stack+0x15/0x20
> >> [] xfs_error_report+0x3c/0x50
> >> [] xfs_trans_cancel+0x6e/0x130
> >> [] xfs_create+0x5ee/0x6a0
> >> [] xfs_vn_mknod+0x156/0x2e0
> >> [] xfs_vn_create+0xb/0x10
> >> [] vfs_create+0x8c/0xd0
> >> [] nfsd_create_v3+0x31a/0x560
> >> [] nfsd3_proc_create+0x148/0x170
> >> [] nfsd_dispatch+0xf9/0x1e0
> >> [] svc_process+0x437/0x6e0
> >> [] nfsd+0x1cd/0x360
> >> [] child_rip+0xa/0x12
> >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> >> fs/xfs/xfs_trans.c.  Return address = 0x80359daa
> >
> >We shut down the filesystem because we cancelled a dirty transaction.
> >Once we start to dirty the incore objects, we can't roll back to
> >an unchanged state if a subsequent fatal error occurs during the
> >transaction and we have to abort it.
> >
> So you are saying that there's nothing I can do to prevent this from
> happening in the future?

Pretty much - we need to work out what is going wrong and
we can't do that from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

Is this reproducible?

> >If I understand historic occurrences of this correctly, there is
> >a possibility that it can be triggered in ENOMEM situations. Was your
> >machine running out of memory when this occurred?
> >
> Not really. I just checked my monitoring software and, at the time
> this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.

OOC, there weren't any I/O errors reported before this shutdown?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread Jesper Juhl

On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:

On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> Hi,
>
> One of my NFS servers just gave me a nasty surprise that I think is
> relevant to tell you about:

Thanks, Jesper.

> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
>
> Call Trace:
> [] show_trace+0xb2/0x380
> [] dump_stack+0x15/0x20
> [] xfs_error_report+0x3c/0x50
> [] xfs_trans_cancel+0x6e/0x130
> [] xfs_create+0x5ee/0x6a0
> [] xfs_vn_mknod+0x156/0x2e0
> [] xfs_vn_create+0xb/0x10
> [] vfs_create+0x8c/0xd0
> [] nfsd_create_v3+0x31a/0x560
> [] nfsd3_proc_create+0x148/0x170
> [] nfsd_dispatch+0xf9/0x1e0
> [] svc_process+0x437/0x6e0
> [] nfsd+0x1cd/0x360
> [] child_rip+0xa/0x12
> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c.  Return address = 0x80359daa

We shut down the filesystem because we cancelled a dirty transaction.
Once we start to dirty the incore objects, we can't roll back to
an unchanged state if a subsequent fatal error occurs during the
transaction and we have to abort it.


So you are saying that there's nothing I can do to prevent this from
happening in the future?


If I understand historic occurrences of this correctly, there is
a possibility that it can be triggered in ENOMEM situations. Was your
machine running out of memory when this occurred?


Not really. I just checked my monitoring software and, at the time
this happened, the box had ~5.9G RAM free (of 8G total) and no swap
used (but 11G available).



> Filesystem "dm-1": Corruption of in-memory data detected.  Shutting
> down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
> nfsd: non-standard errno: 5

EIO gets returned in certain locations once the filesystem has
been shut down.


Makes sense.



> I unmounted the filesystem, ran xfs_repair which told me to try to
> mount it first to replay the log, so I did, unmounted it again, ran
> xfs_repair (which didn't find any problems) and finally mounted it and
> everything is good - the filesystem seems intact.

Yeah, the above error report typically is due to an in-memory
problem, not an on-disk issue.


Good to know.



> The server in question is running kernel 2.6.18.1

Can happen to XFS on any kernel version - got a report of this from
someone running a 2.4 kernel a couple of weeks ago.



Ok.  Thank you for your reply David.

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-28 Thread David Chinner
On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> Hi,
> 
> One of my NFS servers just gave me a nasty surprise that I think is
> relevant to tell you about:

Thanks, Jesper.

> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
> 
> Call Trace:
> [] show_trace+0xb2/0x380
> [] dump_stack+0x15/0x20
> [] xfs_error_report+0x3c/0x50
> [] xfs_trans_cancel+0x6e/0x130
> [] xfs_create+0x5ee/0x6a0
> [] xfs_vn_mknod+0x156/0x2e0
> [] xfs_vn_create+0xb/0x10
> [] vfs_create+0x8c/0xd0
> [] nfsd_create_v3+0x31a/0x560
> [] nfsd3_proc_create+0x148/0x170
> [] nfsd_dispatch+0xf9/0x1e0
> [] svc_process+0x437/0x6e0
> [] nfsd+0x1cd/0x360
> [] child_rip+0xa/0x12
> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c.  Return address = 0x80359daa

We shut down the filesystem because we cancelled a dirty transaction.
Once we start to dirty the incore objects, we can't roll back to
an unchanged state if a subsequent fatal error occurs during the
transaction and we have to abort it.

If I understand historic occurrences of this correctly, there is
a possibility that it can be triggered in ENOMEM situations. Was your
machine running out of memory when this occurred?

> Filesystem "dm-1": Corruption of in-memory data detected.  Shutting
> down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
> nfsd: non-standard errno: 5

EIO gets returned in certain locations once the filesystem has
been shut down.

> I unmounted the filesystem, ran xfs_repair which told me to try to
> mount it first to replay the log, so I did, unmounted it again, ran
> xfs_repair (which didn't find any problems) and finally mounted it and
> everything is good - the filesystem seems intact.

Yeah, the above error report typically is due to an in-memory
problem, not an on-disk issue.

> The server in question is running kernel 2.6.18.1

Can happen to XFS on any kernel version - got a report of this from
someone running a 2.4 kernel a couple of weeks ago.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-28 Thread Jesper Juhl

Hi,

One of my NFS servers just gave me a nasty surprise that I think is
relevant to tell you about:

Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
file fs/xfs/xfs_trans.c.  Caller 0x8034b47e

Call Trace:
[] show_trace+0xb2/0x380
[] dump_stack+0x15/0x20
[] xfs_error_report+0x3c/0x50
[] xfs_trans_cancel+0x6e/0x130
[] xfs_create+0x5ee/0x6a0
[] xfs_vn_mknod+0x156/0x2e0
[] xfs_vn_create+0xb/0x10
[] vfs_create+0x8c/0xd0
[] nfsd_create_v3+0x31a/0x560
[] nfsd3_proc_create+0x148/0x170
[] nfsd_dispatch+0xf9/0x1e0
[] svc_process+0x437/0x6e0
[] nfsd+0x1cd/0x360
[] child_rip+0xa/0x12
xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
fs/xfs/xfs_trans.c.  Return address = 0x80359daa
Filesystem "dm-1": Corruption of in-memory data detected.  Shutting
down filesystem: dm-1
Please umount the filesystem, and rectify the problem(s)
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
 (the above message repeats 1670 times, then the following)
xfs_force_shutdown(dm-1,0x1) called from line 424 of file
fs/xfs/xfs_rw.c.  Return address = 0x80359daa

I unmounted the filesystem, ran xfs_repair which told me to try to
mount it first to replay the log, so I did, unmounted it again, ran
xfs_repair (which didn't find any problems) and finally mounted it and
everything is good - the filesystem seems intact.
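
(As a command sequence, that recovery procedure looks roughly like the following. The mount point and the device node behind "dm-1" are not given in the report, so those paths are placeholders; only the external log device /dev/Log1/ws22_log is taken from the mount messages below.)

```shell
# Paths are illustrative -- substitute the real dm-1 device and mount point.
umount /mnt/ws22                                    # unmount the shut-down fs
mount -o logdev=/dev/Log1/ws22_log /dev/mapper/ws22 /mnt/ws22  # replay log
umount /mnt/ws22
xfs_repair -l /dev/Log1/ws22_log /dev/mapper/ws22   # should now find no errors
mount -o logdev=/dev/Log1/ws22_log /dev/mapper/ws22 /mnt/ws22
```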

Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Starting XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Ending XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Ending clean XFS mount for filesystem: dm-1


The server in question is running kernel 2.6.18.1


--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html