Re: PROBLEM: XFS in-memory corruption with reflinks and duperemove: XFS (dm-4): Internal error xfs_trans_cancel at line 1048 of file fs/xfs/xfs_trans.c. Caller xfs_reflink_remap_extent+0x100/0x560

2020-05-06 Thread Eric Sandeen
On 5/6/20 6:20 PM, Edwin Török wrote:
>> (Obviously, a full metadump would be useful for confirming the shape of
>> the refcount btree, but...first things first let's look at the filefrag
>> output.)
> I'll try to gather one, and find a place to store/share it.
> 
> Best regards,
> --Edwin

Metadumps are compact to start with and usually compress pretty well.
Obviously a very large filesystem with lots of metadata will take some
space, but it might not be that bad.
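
For reference, capturing and compressing one usually comes down to a couple
of commands. A rough sketch, assuming the filesystem is the dm-4 device from
the report and can be unmounted first, and that xfsprogs and xz are
installed:

    sudo umount /dev/dm-4                             # metadump wants the fs unmounted (or mounted read-only)
    sudo xfs_metadump -g /dev/dm-4 storage.metadump   # copies metadata only, no file contents; -g shows progress
    xz -9 storage.metadump                            # metadata-only images tend to compress very well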

-Eric


Re: PROBLEM: XFS in-memory corruption with reflinks and duperemove: XFS (dm-4): Internal error xfs_trans_cancel at line 1048 of file fs/xfs/xfs_trans.c. Caller xfs_reflink_remap_extent+0x100/0x560

2020-05-06 Thread Edwin Török
On Wed, 2020-05-06 at 15:47 -0700, Darrick J. Wong wrote:
> On Wed, May 06, 2020 at 12:07:12AM +0100, Edwin Török wrote:
> > 
> > On 5 May 2020 01:58:11 BST, "Darrick J. Wong" <
> > darrick.w...@oracle.com> wrote:
> > > On Mon, May 04, 2020 at 11:54:05PM +0100, Edwin Török wrote:
> > > > On Mon, 2020-05-04 at 08:21 -0700, Darrick J. Wong wrote:
> > > > > On Mon, Apr 27, 2020 at 10:15:57AM +0100, Edwin Török wrote:
> > > > > > On Tue, 2020-04-21 at 10:16 -0700, Darrick J. Wong wrote:
> > > > > > > On Sat, Apr 18, 2020 at 11:19:03AM +0100, Edwin Török
> > > > > > > wrote:
> > > > > > > > [1.] One line summary of the problem:
> > > > > > > > 
> > > > > > > > I 100% reproducibly get XFS in-memory data corruption
> > > > > > > > when
> > > > > > > > running
> > > > > > > > duperemove on an XFS filesystem with reflinks, even
> > > > > > > > after
> > > > > > > > running
> > > > > > > > xfs_repair and repeating the operation.
> > > > > > > > 
> > > > > > > > [2.] Full description of the problem/report:
> > > > > > > > Ubuntu bugreport here: 
> > > > > > > > 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1873555
> > > > > > > 
> > > > > > > Hmm.  I recently fixed an uninitialized variable error in
> > > > > > > xfs_reflink_remap_extent.
> > > > > > > 
> > > > > > > Does applying that patch to the ubuntu kernel (or running
> > > > > > > the
> > > > > > > same
> > > > > > > workload on the 5.7-rc2 ubuntu mainline kernel) fix this?
> > > > > > > 
> > > > > > > https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7-rc2/
> > > > > > 
> > > > > > [...]
> > > > > 
> > > > > A smaller testcase would help.
> > > > 
> > > > Found it! I don't think it is corruption at all, more like the
> > > > function caller not handling an expected error condition, see below:
> > > > 
> > > > I wasn't able to create a testcase that doesn't also need all my
> > > > data, but I found a faster repro which takes only a few minutes:
> > > > 
> > > > 
> > > > sudo lvconvert --merge /dev/mapper/storage-storage--snapshot
> > > > sudo lvcreate -L 32G -s -n storage-snapshot /dev/mapper/storage-backup
> > > > sudo mount -o noatime /dev/mapper/storage-backup /mnt/storage
> > > > sudo duperemove -d --hashfile hashes \
> > > >     /mnt/storage/from-restic/04683ed4/tmp/fast/c14-sd-92016-2-home/2017-06-07-233355/2017-06-07-233355/edwin/.mu/xapian/postlist.DB \
> > > >     /mnt/storage/from-restic/0fcf01c9/tmp/fast/c14-sd-92016-2-home/2017-05-23-233346/2017-05-23-233346/edwin/.mu/xapian/postlist.DB \
> > > >     /mnt/storage/from-restic/10350278/tmp/fast/c14-sd-92016-2-home/2017-06-17-233341/2017-06-17-233341/edwin/.mu/xapian/postlist.DB
> > > > sudo umount /mnt/storage
> > > > 
> > > > 
> > > > > Or... if you have bpftrace handy, using kretprobes to figure out
> > > > > which function starts the -EIO return that ultimately causes the
> > > > > remap to fail.
> > > > > 
> > > > > (Or failing that, 'trace-cmd -e xfs_buf_ioerr' to see if it
> > > > > uncovers anything.)
> > > > 
> > > > With this xfs.bt:
> > > > 
> > > > kprobe:xfs_iread_extents,
> > > > kprobe:xfs_bmap_add_extent_unwritten_real,
> > > > kprobe:xfs_bmap_del_extent_delay,
> > > > kprobe:xfs_bmap_del_extent_real,
> > > > kprobe:xfs_bmap_extents_to_btree,
> > > > kprobe:xfs_bmap_btree_to_extents,
> > > > kprobe:__xfs_bunmapi,
> > > > kprobe:xfs_defer_finish
> > > > {
> > > >         @st[tid] = kstack();
> > > >         @has[tid] = 1;
> > > > }
> > > > 
> > > > kretprobe:xfs_iread_extents,
> > > > kretprobe:xfs_bmap_add_extent_unwritten_real,
> > > > kretprobe:xfs_bmap_del_extent_delay,
> > > > kretprobe:xfs_bmap_del_extent_real,
> > > > kretprobe:xfs_bmap_extents_to_btree,
> > > > kretprobe:xfs_bmap_btree_to_extents,
> > > > kretprobe:__xfs_bunmapi,
> > > > kretprobe:xfs_defer_finish
> > > > /(retval != 0) && (@has[tid])/ {
> > > >         // kstack does not work here
> > > >         @errors[(int32)retval, probe, @st[tid]] = count();
> > > >         delete(@st[tid]);
> > > >         delete(@has[tid]);
> > > > }
> > > > 
> > > > BEGIN { printf("START\n"); }
> > > > END { print(@errors); }
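
Presumably the script is saved as xfs.bt and left running while the
duperemove reproduction runs in a second shell; interrupting bpftrace fires
the END block and prints the @errors map. A sketch of that workflow:

    sudo bpftrace xfs.bt      # prints "START" once the probes are attached
    # ... run the duperemove reproduction above in another terminal ...
    # Ctrl-C bpftrace afterwards; the END block dumps the collected @errors map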
> > > > 
> > > > I got this:
> > > > @errors[-28, kretprobe:xfs_bmap_del_extent_real, 
> > > > xfs_bmap_del_extent_real+1
> > > > kretprobe_trampoline+0
> > > > xfs_reflink_remap_blocks+286
> > > > xfs_file_remap_range+272
> > > > vfs_dedupe_file_range_one+301
> > > > vfs_dedupe_file_range+342
> > > > do_vfs_ioctl+832
> > > > ksys_ioctl+103
> > > > __x64_sys_ioctl+26
> > > > do_syscall_64+87
> > > > entry_SYSCALL_64_after_hwframe+68
> > > > ]: 1
> > > > 
> > > > -28 is ENOSPC, but I have more than 1TiB free (and plenty of free
> > > > inodes too).
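
The free-space and inode headroom are easy to double-check while the snapshot
is mounted; a trivial sketch, using the /mnt/storage mount point from the
reproduction above:

    df -h /mnt/storage    # free blocks
    df -i /mnt/storage    # free inodes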
> > > > 
> > > > Poking around in that code I found this block of code:
> > > >   >   /*
> > > >   >    * If it's the case where the directory code is running
> > > >   >    * with no block
> > > >   >

Re: Internal error xfs_trans_cancel

2016-06-26 Thread Daniel Wagner
On 06/26/2016 02:16 PM, Thorsten Leemhuis wrote:
> On 02.06.2016 15:29, Daniel Wagner wrote:
>>> Hmmm, Ok. I've been running the lockperf test and kernel builds all
>>> day on a filesystem that is identical in shape and size to yours
>>> (i.e. xfs_info output is the same) but I haven't reproduced it yet.
>> I don't know if that is important: I run the lockperf test and after
>> they have finished I do a kernel build.
>>
>>> Is it possible to get a metadump image of your filesystem to see if
>>> I can reproduce it on that?
>> Sure, see private mail.
> 
> Dave, Daniel, what's the latest status on this issue? 

I haven't had time to do more testing in the last couple of weeks. Tomorrow
I'll try to reproduce it again, though the last time I tried I couldn't
trigger it.

> It made it to my
> list of known 4.7 regressions after Christoph suggested it should be
> listed. But this thread looks stalled, as afaics nothing happened for
> three weeks apart from Josh (added to CC) mentioning he also saw it. Or
> is this discussed elsewhere? Or fixed already?

The discussion wandered over to the thread called 'crash in xfs in
current', where there are some instructions from Al on what to test:

Message-ID: <20160622014253.GS12670@dastard>

cheers,
daniel


Re: Internal error xfs_trans_cancel

2016-06-26 Thread Thorsten Leemhuis
On 02.06.2016 15:29, Daniel Wagner wrote:
>> Hmmm, Ok. I've been running the lockperf test and kernel builds all
>> day on a filesystem that is identical in shape and size to yours
>> (i.e. xfs_info output is the same) but I haven't reproduced it yet.
> I don't know if that is important: I run the lockperf test and after
> they have finished I do a kernel build.
> 
>> Is it possible to get a metadump image of your filesystem to see if
>> I can reproduce it on that?
> Sure, see private mail.

Dave, Daniel, what's the latest status on this issue? It made it to my
list of known 4.7 regressions after Christoph suggested it should be
listed. But this thread looks stalled, as afaics nothing happened for
three weeks apart from Josh (added to CC) mentioning he also saw it. Or
is this discussed elsewhere? Or fixed already?

Sincerely, your regression tracker for Linux 4.7 (http://bit.ly/28JRmJo)
 Thorsten


Re: Internal error xfs_trans_cancel

2016-06-13 Thread Josh Poimboeuf
On Wed, Jun 01, 2016 at 07:52:31AM +0200, Daniel Wagner wrote:
> Hi,
> 
> I got the error message below while compiling a kernel 
> on that system. I can't really say if I did something
> which made the file system unhappy before the crash.
> 
> 
> [Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of file 
> fs/xfs/xfs_trans.c.  Caller xfs_rename+0x453/0x960 [xfs]
> [  +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
> [  +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 
> 01/16/2014
> [  +0.48]  0286 c8be6bc3 885fa9473cb0 
> 813d146e
> [  +0.56]  885fa9ac5ed0 0001 885fa9473cc8 
> a0213cdc
> [  +0.53]  a02257b3 885fa9473cf0 a022eb36 
> 883faa502d00
> [  +0.53] Call Trace:
> [  +0.28]  [] dump_stack+0x63/0x85
> [  +0.69]  [] xfs_error_report+0x3c/0x40 [xfs]
> [  +0.65]  [] ? xfs_rename+0x453/0x960 [xfs]
> [  +0.64]  [] xfs_trans_cancel+0xb6/0xe0 [xfs]
> [  +0.65]  [] xfs_rename+0x453/0x960 [xfs]
> [  +0.62]  [] xfs_vn_rename+0xb3/0xf0 [xfs]
> [  +0.40]  [] vfs_rename+0x58c/0x8d0
> [  +0.32]  [] SyS_rename+0x371/0x390
> [  +0.36]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
> [  +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of 
> file fs/xfs/xfs_trans.c.  Return address = 0xa022eb4f
> [  +0.027680] XFS (sde1): Corruption of in-memory data detected.  Shutting 
> down filesystem
> [  +0.57] XFS (sde1): Please umount the filesystem and rectify the 
> problem(s)
> [Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
> [ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.

I saw this today.  I was just building/installing kernels, rebooting,
running kexec, running perf.


[ 1359.005573] [ cut here ]
[ 1359.010191] WARNING: CPU: 4 PID: 6031 at fs/inode.c:280 drop_nlink+0x3e/0x50
[ 1359.017231] Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser 
libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp 
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp 
coretemp kvm_intel kvm nfsd ipmi_ssif ipmi_devintf ipmi_si iTCO_wdt irqbypass 
iTCO_vendor_support ipmi_msghandler i7core_edac shpchp sg edac_core pcspkr wmi 
lpc_ich dcdbas mfd_core acpi_power_meter auth_rpcgss acpi_cpufreq nfs_acl lockd 
grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom iw_cxgb3 ib_core 
mgag200 ata_generic pata_acpi i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm mptsas scsi_transport_sas ata_piix 
mptscsih libata cxgb3 crc32c_intel i2c_core serio_raw mptbase bnx2 fjes mdio 
dm_mirror dm_region_hash dm_log dm_mod
[ 1359.088447] CPU: 4 PID: 6031 Comm: depmod Tainted: G  I 
4.7.0-rc3+ #4
[ 1359.095911] Hardware name: Dell Inc. PowerEdge R410/0N051F, BIOS 1.11.0 
07/20/2012
[ 1359.103461]  0286 a0bc39d9 8802143dfd18 
8134bb7f
[ 1359.110871]    8802143dfd58 
8108b671
[ 1359.118280]  0118575f7d13 880222c9a6e8 8803ec3874d8 
880428827000
[ 1359.125693] Call Trace:
[ 1359.128133]  [] dump_stack+0x63/0x84
[ 1359.133259]  [] __warn+0xd1/0xf0
[ 1359.138037]  [] warn_slowpath_null+0x1d/0x20
[ 1359.143855]  [] drop_nlink+0x3e/0x50
[ 1359.149017]  [] xfs_droplink+0x28/0x60 [xfs]
[ 1359.154864]  [] xfs_remove+0x231/0x350 [xfs]
[ 1359.160682]  [] ? security_inode_permission+0x3a/0x60
[ 1359.167309]  [] xfs_vn_unlink+0x58/0xa0 [xfs]
[ 1359.173213]  [] ? selinux_inode_unlink+0x13/0x20
[ 1359.179379]  [] vfs_unlink+0xda/0x190
[ 1359.184590]  [] do_unlinkat+0x263/0x2a0
[ 1359.189974]  [] SyS_unlinkat+0x1b/0x30
[ 1359.195272]  [] do_syscall_64+0x62/0x110
[ 1359.200743]  [] entry_SYSCALL64_slow_path+0x25/0x25
[ 1359.207178] ---[ end trace 0d397afdaff9f340 ]---
[ 1359.211830] XFS (dm-0): Internal error xfs_trans_cancel at line 984 of file 
fs/xfs/xfs_trans.c.  Caller xfs_remove+0x1d1/0x350 [xfs]
[ 1359.223723] CPU: 4 PID: 6031 Comm: depmod Tainted: GW I 
4.7.0-rc3+ #4
[ 1359.231185] Hardware name: Dell Inc. PowerEdge R410/0N051F, BIOS 1.11.0 
07/20/2012
[ 1359.238736]  0286 a0bc39d9 8802143dfd60 
8134bb7f
[ 1359.246147]  8803ec3874d8 0001 8802143dfd78 
a03176bb
[ 1359.253559]  a0328c21 8802143dfda0 a03327a6 
880222e7e180
[ 1359.260969] Call Trace:
[ 1359.263407]  [] dump_stack+0x63/0x84
[ 1359.268560]  [] xfs_error_report+0x3b/0x40 [xfs]
[ 1359.274755]  [] ? xfs_remove+0x1d1/0x350 [xfs]
[ 1359.280778]  [] xfs_trans_cancel+0xb6/0xe0 [xfs]
[ 1359.286973]  [] xfs_remove+0x1d1/0x350 [xfs]
[ 1359.292820]  [] xfs_vn_unlink+0x58/0xa0 [xfs]
[ 1359.298724]  [] ? selinux_inode_unlink+0x13/0x20
[ 1359.304890]  [

Re: Internal error xfs_trans_cancel

2016-06-02 Thread Daniel Wagner
> Hmmm, Ok. I've been running the lockperf test and kernel builds all
> day on a filesystem that is identical in shape and size to yours
> (i.e. xfs_info output is the same) but I haven't reproduced it yet.

I don't know if this is important: I run the lockperf tests, and after
they have finished I do a kernel build.

> Is it possible to get a metadump image of your filesystem to see if
> I can reproduce it on that?

Sure, see private mail.


Re: Internal error xfs_trans_cancel

2016-06-02 Thread Dave Chinner
On Thu, Jun 02, 2016 at 07:23:24AM +0200, Daniel Wagner wrote:
> > posix03 and posix04 just emit error messages:
> > 
> > posix04 -n 40 -l 100
> > posix04: invalid option -- 'l'
> > posix04: Usage: posix04 [-i iterations] [-n nr_children] [-s] 
> > .
> 
> I screwed this up. I have patched my version of lockperf to make
> all tests use the same option names. Though I forgot to send those
> patches. Will do now.
> 
> In this case you can use '-i' instead of '-l'.
> 
> > So I changed them to run "-i $l" instead, and that has a somewhat
> > undesired effect:
> > 
> > static void
> > kill_children()
> > {
> >         siginfo_t       infop;
> > 
> >         signal(SIGINT, SIG_IGN);
> >>        kill(0, SIGINT);
> >         while (waitid(P_ALL, 0, &infop, WEXITED) != -1);
> > }
> > 
> > Yeah, it sends a SIGINT to everything with a process group id. It
> > kills the parent shell:
> 
> Ah that rings a bell. I tuned the parameters so that I did not run into
> this problem. I'll do a patch for this one. It's pretty annoying.
> 
> > $ ./run-lockperf-tests.sh /mnt/scratch/
> > pid 9597's current affinity list: 0-15
> > pid 9597's new affinity list: 0,4,8,12
> > sh: 1: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: 
> > Directory nonexistent
> > posix01 -n 8 -l 100
> > posix02 -n 8 -l 100
> > posix03 -n 8 -i 100
> > 
> > $
> > 
> > So, I've just removed those tests from your script. I'll see if I
> > have any luck with reproducing the problem now.
> 
> I was able to reproduce it again with the same steps.

Hmmm, Ok. I've been running the lockperf test and kernel builds all
day on a filesystem that is identical in shape and size to yours
(i.e. xfs_info output is the same) but I haven't reproduced it yet.
Is it possible to get a metadump image of your filesystem to see if
I can reproduce it on that?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Internal error xfs_trans_cancel

2016-06-01 Thread Daniel Wagner
> posix03 and posix04 just emit error messages:
> 
> posix04 -n 40 -l 100
> posix04: invalid option -- 'l'
> posix04: Usage: posix04 [-i iterations] [-n nr_children] [-s] 
> .

I screwed this up. I have patched my version of lockperf to make
all tests use the same option names. Though I forgot to send those
patches. Will do now.

In this case you can use '-i' instead of '-l'.

> So I changed them to run "-i $l" instead, and that has a somewhat
> undesired effect:
> 
> static void
> kill_children()
> {
>         siginfo_t       infop;
> 
>         signal(SIGINT, SIG_IGN);
>>        kill(0, SIGINT);
>         while (waitid(P_ALL, 0, &infop, WEXITED) != -1);
> }
> 
> Yeah, it sends a SIGINT to everything with a process group id. It
> kills the parent shell:

Ah that rings a bell. I tuned the parameters so that I did not run into
this problem. I'll do a patch for this one. It's pretty annoying.

> $ ./run-lockperf-tests.sh /mnt/scratch/
> pid 9597's current affinity list: 0-15
> pid 9597's new affinity list: 0,4,8,12
> sh: 1: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: 
> Directory nonexistent
> posix01 -n 8 -l 100
> posix02 -n 8 -l 100
> posix03 -n 8 -i 100
> 
> $
> 
> So, I've just removed those tests from your script. I'll see if I
> have any luck with reproducing the problem now.

I was able to reproduce it again with the same steps.


Re: Internal error xfs_trans_cancel

2016-06-01 Thread Dave Chinner
On Wed, Jun 01, 2016 at 04:13:10PM +0200, Daniel Wagner wrote:
> >> Anything in the log before this?
> > 
> > Just the usual stuff, as I remember. Sorry, I haven't copied the whole log.
> 
> Just triggered it again. My steps for it are:
> 
> - run all lockperf test
> 
>   git://git.samba.org/jlayton/lockperf.git
> 
>   via my test script:
>  
> #!/bin/sh
>
> run_tests () {
.
> for c in `seq 8 32 128`; do
> for l in `seq 100 100 500`; do
> time run_tests "posix01 -n $c -l $l " $DIR/posix01-$c-$l.data
> time run_tests "posix02 -n $c -l $l " $DIR/posix02-$c-$l.data
> time run_tests "posix03 -n $c -l $l " $DIR/posix03-$c-$l.data
> time run_tests "posix04 -n $c -l $l " $DIR/posix04-$c-$l.data

posix03 and posix04 just emit error messages:

posix04 -n 40 -l 100
posix04: invalid option -- 'l'
posix04: Usage: posix04 [-i iterations] [-n nr_children] [-s] 
.


So I changed them to run "-i $l" instead, and that has a somewhat
undesired effect:

static void
kill_children()
{
        siginfo_t       infop;

        signal(SIGINT, SIG_IGN);
>       kill(0, SIGINT);
        while (waitid(P_ALL, 0, &infop, WEXITED) != -1);
}

Yeah, it sends a SIGINT to everything with a process group id. It
kills the parent shell:

$ ./run-lockperf-tests.sh /mnt/scratch/
pid 9597's current affinity list: 0-15
pid 9597's new affinity list: 0,4,8,12
sh: 1: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: 
Directory nonexistent
posix01 -n 8 -l 100
posix02 -n 8 -l 100
posix03 -n 8 -i 100

$

So, I've just removed those tests from your script. I'll see if I
have any luck with reproducing the problem now.
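
An alternative to dropping those tests would be to keep kill_children()'s
kill(0, SIGINT) away from the driving shell by running each test in its own
process group. A sketch, assuming util-linux setsid is available and using a
placeholder test file:

    # setsid makes the test a new session/process-group leader, so the
    # kill(0, SIGINT) inside kill_children() only reaches that test's children
    setsid posix03 -n 8 -i 100 /mnt/scratch/a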

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Internal error xfs_trans_cancel

2016-06-01 Thread Daniel Wagner
>   via my test script:

Looks like my email client did not agree with my formatting of the script.

https://www.monom.org/data/lglock/run-tests.sh


Re: Internal error xfs_trans_cancel

2016-06-01 Thread Daniel Wagner
>> Anything in the log before this?
> 
> Just the usual stuff, as I remember. Sorry, I haven't copied the whole log.

Just triggered it again. My steps for it are:

- run all lockperf test

  git://git.samba.org/jlayton/lockperf.git

  via my test script:
 
#!/bin/sh

run_tests () {
        echo $1

        for i in `seq 10`;
        do
                rm -rf /tmp/a;
                $1 /tmp/a > /dev/null
                sync
        done

        for i in `seq 100`;
        do
                rm -rf /tmp/a;
                $1 /tmp/a >> $2
                sync
        done
}

PATH=~/src/lockperf:$PATH

DIR=$1-`uname -r`
if [ ! -d "$DIR" ]; then
        mkdir $DIR
fi

Re: Internal error xfs_trans_cancel

2016-06-01 Thread Daniel Wagner

On 06/01/2016 09:10 AM, Dave Chinner wrote:
> On Wed, Jun 01, 2016 at 07:52:31AM +0200, Daniel Wagner wrote:
>> I got the error message below while compiling a kernel 
>> on that system. I can't really say if I did something
>> which made the file system unhappy before the crash.
>>
>>
>> [Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of 
>> file fs/xfs/xfs_trans.c.  Caller xfs_rename+0x453/0x960 [xfs]
> 
> Anything in the log before this?

Just the usual stuff, as I remember. Sorry, I haven't copied the whole log.
 
>> [  +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
>> [  +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 
>> 01/16/2014
>> [  +0.48]  0286 c8be6bc3 885fa9473cb0 
>> 813d146e
>> [  +0.56]  885fa9ac5ed0 0001 885fa9473cc8 
>> a0213cdc
>> [  +0.53]  a02257b3 885fa9473cf0 a022eb36 
>> 883faa502d00
>> [  +0.53] Call Trace:
>> [  +0.28]  [] dump_stack+0x63/0x85
>> [  +0.69]  [] xfs_error_report+0x3c/0x40 [xfs]
>> [  +0.65]  [] ? xfs_rename+0x453/0x960 [xfs]
>> [  +0.64]  [] xfs_trans_cancel+0xb6/0xe0 [xfs]
>> [  +0.65]  [] xfs_rename+0x453/0x960 [xfs]
>> [  +0.62]  [] xfs_vn_rename+0xb3/0xf0 [xfs]
>> [  +0.40]  [] vfs_rename+0x58c/0x8d0
>> [  +0.32]  [] SyS_rename+0x371/0x390
>> [  +0.36]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
>> [  +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of 
>> file fs/xfs/xfs_trans.c.  Return address = 0xa022eb4f
>> [  +0.027680] XFS (sde1): Corruption of in-memory data detected.  Shutting 
>> down filesystem
>> [  +0.57] XFS (sde1): Please umount the filesystem and rectify the 
>> problem(s)
>> [Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
>> [ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.
> 
> Doesn't normally happen, and there's not a lot to go on here.

Restarted the box and did a couple of kernel builds and
everything was fine.

> Can
> you provide the info listed in the link below so we have some idea
> of what configuration the error occurred on?

Sure, forgot that in the first post.

> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

# uname -r
4.7.0-rc1-3-g1f55b0d



# xfs_repair -V
xfs_repair version 4.5.0 

# cat /proc/cpuinfo | grep CPU | wc -l
64

# cat /proc/meminfo 
MemTotal:   528344752 kB
MemFree:526838036 kB
MemAvailable:   525265612 kB
Buffers:2716 kB
Cached:   216896 kB
SwapCached:0 kB
Active:   119924 kB
Inactive: 116552 kB
Active(anon):  17416 kB
Inactive(anon): 1108 kB
Active(file): 102508 kB
Inactive(file):   115444 kB
Unevictable:   0 kB
Mlocked:   0 kB
SwapTotal: 0 kB
SwapFree:  0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 16972 kB
Mapped:25288 kB
Shmem:  1616 kB
Slab: 184920 kB
SReclaimable:  60028 kB
SUnreclaim:   124892 kB
KernelStack:   13120 kB
PageTables: 2292 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit:264172376 kB
Committed_AS: 270612 kB
VmallocTotal:   34359738367 kB
VmallocUsed:   0 kB
VmallocChunk:  0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
CmaTotal:  0 kB
CmaFree:   0 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
DirectMap4k:  232256 kB
DirectMap2M: 7061504 kB
DirectMap1G:531628032 kB

# cat /proc/mounts 
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=264153644k,nr_inodes=66038411,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup 
rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup 
rw,nosu

Re: Internal error xfs_trans_cancel

2016-06-01 Thread Dave Chinner
On Wed, Jun 01, 2016 at 07:52:31AM +0200, Daniel Wagner wrote:
> Hi,
> 
> I got the error message below while compiling a kernel 
> on that system. I can't really say if I did something
> which made the file system unhappy before the crash.
> 
> 
> [Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of file 
> fs/xfs/xfs_trans.c.  Caller xfs_rename+0x453/0x960 [xfs]

Anything in the log before this?

> [  +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
> [  +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 
> 01/16/2014
> [  +0.48]  0286 c8be6bc3 885fa9473cb0 
> 813d146e
> [  +0.56]  885fa9ac5ed0 0001 885fa9473cc8 
> a0213cdc
> [  +0.53]  a02257b3 885fa9473cf0 a022eb36 
> 883faa502d00
> [  +0.53] Call Trace:
> [  +0.28]  [] dump_stack+0x63/0x85
> [  +0.69]  [] xfs_error_report+0x3c/0x40 [xfs]
> [  +0.65]  [] ? xfs_rename+0x453/0x960 [xfs]
> [  +0.64]  [] xfs_trans_cancel+0xb6/0xe0 [xfs]
> [  +0.65]  [] xfs_rename+0x453/0x960 [xfs]
> [  +0.62]  [] xfs_vn_rename+0xb3/0xf0 [xfs]
> [  +0.40]  [] vfs_rename+0x58c/0x8d0
> [  +0.32]  [] SyS_rename+0x371/0x390
> [  +0.36]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
> [  +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of 
> file fs/xfs/xfs_trans.c.  Return address = 0xa022eb4f
> [  +0.027680] XFS (sde1): Corruption of in-memory data detected.  Shutting 
> down filesystem
> [  +0.57] XFS (sde1): Please umount the filesystem and rectify the 
> problem(s)
> [Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
> [ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.

Doesn't normally happen, and there's not a lot to go on here. Can
you provide the info listed in the link below so we have some idea
of what configuration the error occurred on?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

You didn't run out of space or something unusual like that?  Does
'xfs_repair -n ' report any errors?
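
For completeness, the no-modify check is just the following; a sketch
assuming the sde1 filesystem from the report can be unmounted first:

    umount /dev/sde1          # xfs_repair will not run on a mounted filesystem
    xfs_repair -n /dev/sde1   # -n: check only, report problems, change nothing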

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Internal error xfs_trans_cancel

2016-05-31 Thread Daniel Wagner
Hi,

I got the error message below while compiling a kernel 
on that system. I can't really say if I did something
which made the file system unhappy before the crash.


[Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of file 
fs/xfs/xfs_trans.c.  Caller xfs_rename+0x453/0x960 [xfs]
[  +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
[  +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 
01/16/2014
[  +0.48]  0286 c8be6bc3 885fa9473cb0 
813d146e
[  +0.56]  885fa9ac5ed0 0001 885fa9473cc8 
a0213cdc
[  +0.53]  a02257b3 885fa9473cf0 a022eb36 
883faa502d00
[  +0.53] Call Trace:
[  +0.28]  [] dump_stack+0x63/0x85
[  +0.69]  [] xfs_error_report+0x3c/0x40 [xfs]
[  +0.65]  [] ? xfs_rename+0x453/0x960 [xfs]
[  +0.64]  [] xfs_trans_cancel+0xb6/0xe0 [xfs]
[  +0.65]  [] xfs_rename+0x453/0x960 [xfs]
[  +0.62]  [] xfs_vn_rename+0xb3/0xf0 [xfs]
[  +0.40]  [] vfs_rename+0x58c/0x8d0
[  +0.32]  [] SyS_rename+0x371/0x390
[  +0.36]  [] entry_SYSCALL_64_fastpath+0x1a/0xa4
[  +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of 
file fs/xfs/xfs_trans.c.  Return address = 0xa022eb4f
[  +0.027680] XFS (sde1): Corruption of in-memory data detected.  Shutting down 
filesystem
[  +0.57] XFS (sde1): Please umount the filesystem and rectify the 
problem(s)
[Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
[ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.


cheers,
daniel


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread Jesper Juhl

On 30/11/06, David Chinner <[EMAIL PROTECTED]> wrote:

On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> >> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
> >>
> >> Call Trace:
> >> [] show_trace+0xb2/0x380
> >> [] dump_stack+0x15/0x20
> >> [] xfs_error_report+0x3c/0x50
> >> [] xfs_trans_cancel+0x6e/0x130
> >> [] xfs_create+0x5ee/0x6a0
> >> [] xfs_vn_mknod+0x156/0x2e0
> >> [] xfs_vn_create+0xb/0x10
> >> [] vfs_create+0x8c/0xd0
> >> [] nfsd_create_v3+0x31a/0x560
> >> [] nfsd3_proc_create+0x148/0x170
> >> [] nfsd_dispatch+0xf9/0x1e0
> >> [] svc_process+0x437/0x6e0
> >> [] nfsd+0x1cd/0x360
> >> [] child_rip+0xa/0x12
> >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> >> fs/xfs/xfs_trans.c.  Return address = 0x80359daa
> >
> >We shut down the filesystem because we cancelled a dirty transaction.
> >Once we start to dirty the incore objects, we can't roll back to
> >an unchanged state if a subsequent fatal error occurs during the
> >transaction and we have to abort it.
> >
> So you are saying that there's nothing I can do to prevent this from
> happening in the future?

Pretty much - we need to work out what is going wrong and
we can't from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

Is this reproducible?


Not on demand, no. It has happened only this once as far as I know and
for unknown reasons.



> >If I understand historic occurrences of this correctly, there is
> >a possibility that it can be triggered in ENOMEM situations. Was your
> >machine running out of memory when this occurred?
> >
> Not really. I just checked my monitoring software and, at the time
> this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.

OOC, there weren't any I/O errors reported before this shutdown?


No. I looked but found none.

Let me know if there's anything I can do to help.

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread David Chinner
On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> >> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
> >>
> >> Call Trace:
> >> [] show_trace+0xb2/0x380
> >> [] dump_stack+0x15/0x20
> >> [] xfs_error_report+0x3c/0x50
> >> [] xfs_trans_cancel+0x6e/0x130
> >> [] xfs_create+0x5ee/0x6a0
> >> [] xfs_vn_mknod+0x156/0x2e0
> >> [] xfs_vn_create+0xb/0x10
> >> [] vfs_create+0x8c/0xd0
> >> [] nfsd_create_v3+0x31a/0x560
> >> [] nfsd3_proc_create+0x148/0x170
> >> [] nfsd_dispatch+0xf9/0x1e0
> >> [] svc_process+0x437/0x6e0
> >> [] nfsd+0x1cd/0x360
> >> [] child_rip+0xa/0x12
> >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> >> fs/xfs/xfs_trans.c.  Return address = 0x80359daa
> >
> >We shut down the filesystem because we cancelled a dirty transaction.
> >Once we start to dirty the incore objects, we can't roll back to
> >an unchanged state if a subsequent fatal error occurs during the
> >transaction and we have to abort it.
> >
> So you are saying that there's nothing I can do to prevent this from
> happening in the future?

Pretty much - we need to work out what is going wrong and
we can't from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

Is this reproducible?

> >If I understand historic occurrences of this correctly, there is
> >a possibility that it can be triggered in ENOMEM situations. Was your
> >machine running out of memory when this occurred?
> >
> Not really. I just checked my monitoring software and, at the time
> this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.

OOC, there weren't any I/O errors reported before this shutdown?
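
The kind of check that answers that question; a rough sketch, assuming a
standard syslog setup:

    dmesg | grep -i 'i/o error'
    grep -i 'i/o error' /var/log/messages    # exact log file varies by distro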

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread Jesper Juhl

On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:

On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> Hi,
>
> One of my NFS servers just gave me a nasty surprise that I think it is
> relevant to tell you about:

Thanks, Jesper.

> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
>
> Call Trace:
> [] show_trace+0xb2/0x380
> [] dump_stack+0x15/0x20
> [] xfs_error_report+0x3c/0x50
> [] xfs_trans_cancel+0x6e/0x130
> [] xfs_create+0x5ee/0x6a0
> [] xfs_vn_mknod+0x156/0x2e0
> [] xfs_vn_create+0xb/0x10
> [] vfs_create+0x8c/0xd0
> [] nfsd_create_v3+0x31a/0x560
> [] nfsd3_proc_create+0x148/0x170
> [] nfsd_dispatch+0xf9/0x1e0
> [] svc_process+0x437/0x6e0
> [] nfsd+0x1cd/0x360
> [] child_rip+0xa/0x12
> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c.  Return address = 0x80359daa

We shut down the filesystem because we cancelled a dirty transaction.
Once we start to dirty the incore objects, we can't roll back to
an unchanged state if a subsequent fatal error occurs during the
transaction and we have to abort it.


So you are saying that there's nothing I can do to prevent this from
happening in the future?


If I understand historic occurrences of this correctly, there is
a possibility that it can be triggered in ENOMEM situations. Was your
machine running out of memory when this occurred?


Not really. I just checked my monitoring software and, at the time
this happened, the box had ~5.9G RAM free (of 8G total) and no swap
used (but 11G available).



> Filesystem "dm-1": Corruption of in-memory data detected.  Shutting
> down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
> nfsd: non-standard errno: 5

EIO gets returned in certain locations once the filesystem has
been shutdown.


Makes sense.



> I unmounted the filesystem, ran xfs_repair which told me to try and
> mount it first to replay the log, so I did, unmounted it again, ran
> xfs_repair (which didn't find any problems) and finally mounted it and
> everything is good - the filesystem seems intact.

Yeah, the above error report typically is due to an in-memory
problem, not an on disk issue.


Good to know.



> The server in question is running kernel 2.6.18.1

Can happen to XFS on any kernel version - got a report of this from
someone running a 2.4 kernel a couple of weeks ago



Ok.  Thank you for your reply David.

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread David Chinner
On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
 On 29/11/06, David Chinner [EMAIL PROTECTED] wrote:
 On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
  Filesystem dm-1: XFS internal error xfs_trans_cancel at line 1138 of
  file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
 
  Call Trace:
  [8020b122] show_trace+0xb2/0x380
  [8020b405] dump_stack+0x15/0x20
  [80327b4c] xfs_error_report+0x3c/0x50
  [803435ae] xfs_trans_cancel+0x6e/0x130
  [8034b47e] xfs_create+0x5ee/0x6a0
  [80356556] xfs_vn_mknod+0x156/0x2e0
  [803566eb] xfs_vn_create+0xb/0x10
  [80284b2c] vfs_create+0x8c/0xd0
  [802e734a] nfsd_create_v3+0x31a/0x560
  [802ec838] nfsd3_proc_create+0x148/0x170
  [802e19f9] nfsd_dispatch+0xf9/0x1e0
  [8049d617] svc_process+0x437/0x6e0
  [802e176d] nfsd+0x1cd/0x360
  [8020ab1c] child_rip+0xa/0x12
  xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
  fs/xfs/xfs_trans.c.  Return address = 0x80359daa
 
 We shut down the filesystem because we cancelled a dirty transaction.
 Once we start to dirty the incore objects, we can't roll back to
 an unchanged state if a subsequent fatal error occurs during the
 transaction and we have to abort it.
 
 So you are saying that there's nothing I can do to prevent this from
 happening in the future?

Pretty much - we need to work out what is going wrong and
we can't from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

Is this reproducible?

 If I understand historic occurrences of this correctly, there is
 a possibility that it can be triggered in ENOMEM situations. Was your
 machine running out of memory when this occurred?
 
 Not really. I just checked my monitoring software and, at the time
 this happened, the box had ~5.9G RAM free (of 8G total) and no swap
 used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.
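
Concretely, "more error reporting points" means calling the existing reporting helper right where a failure is first detected, instead of only reporting from the generic cancel path. A purely hypothetical illustration follows; some_create_step() is a placeholder, not a real XFS function.

	/* Hypothetical error path inside xfs_create(): report at the point
	 * of failure so the log captures the file/line before the dirty
	 * transaction is cancelled and the filesystem shuts down. */
	error = some_create_step(tp, dp);	/* placeholder call */
	if (error) {
		XFS_ERROR_REPORT("xfs_create: step failed", XFS_ERRLEVEL_LOW, mp);
		goto error_return;		/* ends up in xfs_trans_cancel() */
	}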

OOC, there weren't any I/O errors reported before this shutdown?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-29 Thread Jesper Juhl

On 30/11/06, David Chinner [EMAIL PROTECTED] wrote:

On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
 On 29/11/06, David Chinner [EMAIL PROTECTED] wrote:
 On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
  Filesystem dm-1: XFS internal error xfs_trans_cancel at line 1138 of
  file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
 
  Call Trace:
  [8020b122] show_trace+0xb2/0x380
  [8020b405] dump_stack+0x15/0x20
  [80327b4c] xfs_error_report+0x3c/0x50
  [803435ae] xfs_trans_cancel+0x6e/0x130
  [8034b47e] xfs_create+0x5ee/0x6a0
  [80356556] xfs_vn_mknod+0x156/0x2e0
  [803566eb] xfs_vn_create+0xb/0x10
  [80284b2c] vfs_create+0x8c/0xd0
  [802e734a] nfsd_create_v3+0x31a/0x560
  [802ec838] nfsd3_proc_create+0x148/0x170
  [802e19f9] nfsd_dispatch+0xf9/0x1e0
  [8049d617] svc_process+0x437/0x6e0
  [802e176d] nfsd+0x1cd/0x360
  [8020ab1c] child_rip+0xa/0x12
  xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
  fs/xfs/xfs_trans.c.  Return address = 0x80359daa
 
 We shut down the filesystem because we cancelled a dirty transaction.
 Once we start to dirty the incore objects, we can't roll back to
 an unchanged state if a subsequent fatal error occurs during the
 transaction and we have to abort it.
 
 So you are saying that there's nothing I can do to prevent this from
 happening in the future?

Pretty much - we need to work out what is going wrong and
 we can't from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

 Is this reproducible?


Not on demand, no. It has happened only this once as far as I know and
for unknown reasons.



 If I understand historic occurrences of this correctly, there is
 a possibility that it can be triggered in ENOMEM situations. Was your
 machine running out of memory when this occurred?
 
 Not really. I just checked my monitoring software and, at the time
 this happened, the box had ~5.9G RAM free (of 8G total) and no swap
 used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.

OOC, there weren't any I/O errors reported before this shutdown?


No. I looked but found none.

Let me know if there's anything I can do to help.

--
Jesper Juhl [EMAIL PROTECTED]
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-28 Thread David Chinner
On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> Hi,
> 
> One of my NFS servers just gave me a nasty surprise that I think is
> relevant to tell you about:

Thanks, Jesper.

> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> file fs/xfs/xfs_trans.c.  Caller 0x8034b47e
> 
> Call Trace:
> [8020b122] show_trace+0xb2/0x380
> [8020b405] dump_stack+0x15/0x20
> [80327b4c] xfs_error_report+0x3c/0x50
> [803435ae] xfs_trans_cancel+0x6e/0x130
> [8034b47e] xfs_create+0x5ee/0x6a0
> [80356556] xfs_vn_mknod+0x156/0x2e0
> [803566eb] xfs_vn_create+0xb/0x10
> [80284b2c] vfs_create+0x8c/0xd0
> [802e734a] nfsd_create_v3+0x31a/0x560
> [802ec838] nfsd3_proc_create+0x148/0x170
> [802e19f9] nfsd_dispatch+0xf9/0x1e0
> [8049d617] svc_process+0x437/0x6e0
> [802e176d] nfsd+0x1cd/0x360
> [8020ab1c] child_rip+0xa/0x12
> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c.  Return address = 0x80359daa

We shut down the filesystem because we cancelled a dirty transaction.
Once we start to dirty the incore objects, we can't roll back to
an unchanged state if a subsequent fatal error occurs during the
transaction and we have to abort it.
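
For reference, this check lives in xfs_trans_cancel() itself. The following is a simplified sketch of that logic around the quoted line numbers, paraphrased rather than copied from the 2.6.18 source, so names and exact structure are approximate; the 0x8 passed to xfs_force_shutdown() in the report corresponds to the "corrupt in-core data" shutdown reason.

/*
 * Sketch (not verbatim 2.6.18 code): cancelling a transaction that has
 * already dirtied in-core objects cannot be rolled back, so the only
 * safe response is to shut the filesystem down.
 */
void
xfs_trans_cancel(
	xfs_trans_t	*tp,
	int		flags)
{
	xfs_mount_t	*mp = tp->t_mountp;

	if ((tp->t_flags & XFS_TRANS_DIRTY) && !XFS_FORCED_SHUTDOWN(mp)) {
		/* "XFS internal error xfs_trans_cancel at line ..." */
		XFS_ERROR_REPORT("xfs_trans_cancel", XFS_ERRLEVEL_LOW, mp);
		/* "xfs_force_shutdown(dm-1,0x8) called from line ...";
		 * the flag is assumed to be SHUTDOWN_CORRUPT_INCORE (0x8) */
		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
	}

	/* ... then release the log reservation and free the items ... */
}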

If I understand historic occurrences of this correctly, there is
a possibility that it can be triggered in ENOMEM situations. Was your
machine running out of memory when this occurred?

> Filesystem "dm-1": Corruption of in-memory data detected.  Shutting
> down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
> nfsd: non-standard errno: 5

EIO gets returned in certain locations once the filesystem has
been shutdown.
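
(Errno 5 is EIO. The short-circuit mentioned here usually has the shape below; this is a generic sketch rather than a quote of the 2.6.18 source, and xfs_some_operation() is only a placeholder name.)

/* Generic sketch of the post-shutdown check at the top of an XFS
 * operation; the exact macro spelling varies by kernel version. */
STATIC int
xfs_some_operation(
	xfs_mount_t	*mp)
{
	if (XFS_FORCED_SHUTDOWN(mp))
		return XFS_ERROR(EIO);	/* errno 5: what nfsd logs as "non-standard errno: 5" */
	/* ... normal work would continue here ... */
	return 0;
}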

> I unmounted the filesystem, ran xfs_repair which told me to try and
> mount it first to replay the log, so I did, unmounted it again, ran
> xfs_repair (which didn't find any problems) and finally mounted it and
> everything is good - the filesystem seems intact.

Yeah, the above error report typically is due to an in-memory
problem, not an on disk issue.

> The server in question is running kernel 2.6.18.1

Can happen to XFS on any kernel version - got a report of this from
someone running a 2.4 kernel a couple of weeks ago.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

2006-11-28 Thread Jesper Juhl

Hi,

One of my NFS servers just gave me a nasty surprise that I think is
relevant to tell you about:

Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
file fs/xfs/xfs_trans.c.  Caller 0x8034b47e

Call Trace:
[8020b122] show_trace+0xb2/0x380
[8020b405] dump_stack+0x15/0x20
[80327b4c] xfs_error_report+0x3c/0x50
[803435ae] xfs_trans_cancel+0x6e/0x130
[8034b47e] xfs_create+0x5ee/0x6a0
[80356556] xfs_vn_mknod+0x156/0x2e0
[803566eb] xfs_vn_create+0xb/0x10
[80284b2c] vfs_create+0x8c/0xd0
[802e734a] nfsd_create_v3+0x31a/0x560
[802ec838] nfsd3_proc_create+0x148/0x170
[802e19f9] nfsd_dispatch+0xf9/0x1e0
[8049d617] svc_process+0x437/0x6e0
[802e176d] nfsd+0x1cd/0x360
[8020ab1c] child_rip+0xa/0x12
xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
fs/xfs/xfs_trans.c.  Return address = 0x80359daa
Filesystem "dm-1": Corruption of in-memory data detected.  Shutting
down filesystem: dm-1
Please umount the filesystem, and rectify the problem(s)
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
 (the above message repeats 1670 times, then the following)
xfs_force_shutdown(dm-1,0x1) called from line 424 of file
fs/xfs/xfs_rw.c.  Return address = 0x80359daa

I unmounted the filesystem, ran xfs_repair which told me to try and
mount it first to replay the log, so I did, unmounted it again, ran
xfs_repair (which didn't find any problems) and finally mounted it and
everything is good - the filesystem seems intact.

Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Starting XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Ending XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Ending clean XFS mount for filesystem: dm-1


The server in question is running kernel 2.6.18.1


--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html