Re: PROBLEM: XFS in-memory corruption with reflinks and duperemove: XFS (dm-4): Internal error xfs_trans_cancel at line 1048 of file fs/xfs/xfs_trans.c. Caller xfs_reflink_remap_extent+0x100/0x560
On 5/6/20 6:20 PM, Edwin Török wrote:
>> (Obviously, a full metadump would be useful for confirming the shape
>> of the refcount btree, but...first things first let's look at the
>> filefrag output.)
>
> I'll try to gather one, and find a place to store/share it.
>
> Best regards,
> --Edwin

Metadumps are compact to start with and usually compress pretty well.
Obviously a very large filesystem with lots of metadata will take some
space, but it might not be that bad.

-Eric
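[Editor's note: a minimal sketch of the metadump-and-compress step discussed above. The device path and output filenames are placeholders, not from the thread; xfs_metadump copies only metadata (no file data), which is why the images compress well.]

```shell
# Hypothetical device/paths -- adjust for your system.
# -g shows progress; the dump contains metadata only, no file contents.
xfs_metadump -g /dev/mapper/storage-backup /tmp/storage.metadump

# Metadump images are mostly btree blocks and usually compress well:
zstd -19 /tmp/storage.metadump -o /tmp/storage.metadump.zst
```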
Re: PROBLEM: XFS in-memory corruption with reflinks and duperemove: XFS (dm-4): Internal error xfs_trans_cancel at line 1048 of file fs/xfs/xfs_trans.c. Caller xfs_reflink_remap_extent+0x100/0x560
On Wed, 2020-05-06 at 15:47 -0700, Darrick J. Wong wrote:
> On Wed, May 06, 2020 at 12:07:12AM +0100, Edwin Török wrote:
> > On 5 May 2020 01:58:11 BST, "Darrick J. Wong" <darrick.w...@oracle.com> wrote:
> > > On Mon, May 04, 2020 at 11:54:05PM +0100, Edwin Török wrote:
> > > > On Mon, 2020-05-04 at 08:21 -0700, Darrick J. Wong wrote:
> > > > > On Mon, Apr 27, 2020 at 10:15:57AM +0100, Edwin Török wrote:
> > > > > > On Tue, 2020-04-21 at 10:16 -0700, Darrick J. Wong wrote:
> > > > > > > On Sat, Apr 18, 2020 at 11:19:03AM +0100, Edwin Török wrote:
> > > > > > > > [1.] One line summary of the problem:
> > > > > > > >
> > > > > > > > I 100% reproducibly get XFS in-memory data corruption when
> > > > > > > > running duperemove on an XFS filesystem with reflinks, even
> > > > > > > > after running xfs_repair and repeating the operation.
> > > > > > > >
> > > > > > > > [2.] Full description of the problem/report:
> > > > > > > > Ubuntu bugreport here:
> > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1873555
> > > > > > >
> > > > > > > Hmm. I recently fixed an uninitialized variable error in
> > > > > > > xfs_reflink_remap_extent.
> > > > > > >
> > > > > > > Does applying that patch to the ubuntu kernel (or running the
> > > > > > > same workload on the 5.7-rc2 ubuntu mainline kernel) fix this?
> > > > > > >
> > > > > > > https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7-rc2/
> > > > > >
> > > > > > [...]
> > > > >
> > > > > A smaller testcase would help.
> > > >
> > > > Found it!
> > > > I don't think it is corruption at all, more like the function
> > > > caller not handling an expected error condition, see below:
> > > >
> > > > I wasn't able to create a testcase that doesn't also need all my
> > > > data, but I found a faster repro which takes only a few minutes:
> > > >
> > > > sudo lvconvert --merge /dev/mapper/storage-storage--snapshot
> > > > sudo lvcreate -L 32G -s -n storage-snapshot /dev/mapper/storage-backup
> > > > sudo mount -o noatime /dev/mapper/storage-backup /mnt/storage
> > > > sudo duperemove -d --hashfile hashes \
> > > >   /mnt/storage/from-restic/04683ed4/tmp/fast/c14-sd-92016-2-home/2017-06-07-233355/2017-06-07-233355/edwin/.mu/xapian/postlist.DB \
> > > >   /mnt/storage/from-restic/0fcf01c9/tmp/fast/c14-sd-92016-2-home/2017-05-23-233346/2017-05-23-233346/edwin/.mu/xapian/postlist.DB \
> > > >   /mnt/storage/from-restic/10350278/tmp/fast/c14-sd-92016-2-home/2017-06-17-233341/2017-06-17-233341/edwin/.mu/xapian/postlist.DB
> > > > sudo umount /mnt/storage
> > > >
> > > > > Or... if you have bpftrace handy, using kretprobes to figure out
> > > > > which function starts the -EIO return that ultimately causes the
> > > > > remap to fail.
> > > > >
> > > > > (Or failing that, 'trace-cmd -e xfs_buf_ioerr' to see if it
> > > > > uncovers anything.)
> > > >
> > > > With this xfs.bt:
> > > >
> > > > kprobe:xfs_iread_extents,
> > > > kprobe:xfs_bmap_add_extent_unwritten_real,
> > > > kprobe:xfs_bmap_del_extent_delay,
> > > > kprobe:xfs_bmap_del_extent_real,
> > > > kprobe:xfs_bmap_extents_to_btree,
> > > > kprobe:xfs_bmap_btree_to_extents,
> > > > kprobe:__xfs_bunmapi,
> > > > kprobe:xfs_defer_finish
> > > > {
> > > >     @st[tid] = kstack();
> > > >     @has[tid] = 1;
> > > > }
> > > >
> > > > kretprobe:xfs_iread_extents,
> > > > kretprobe:xfs_bmap_add_extent_unwritten_real,
> > > > kretprobe:xfs_bmap_del_extent_delay,
> > > > kretprobe:xfs_bmap_del_extent_real,
> > > > kretprobe:xfs_bmap_extents_to_btree,
> > > > kretprobe:xfs_bmap_btree_to_extents,
> > > > kretprobe:__xfs_bunmapi,
> > > > kretprobe:xfs_defer_finish
> > > > /(retval != 0) && (@has[tid])/ {
> > > >     // kstack does not work here
> > > >     @errors[(int32)retval, probe, @st[tid]] = count();
> > > >     delete(@st[tid]);
> > > >     delete(@has[tid]);
> > > > }
> > > >
> > > > BEGIN { printf("START\n"); }
> > > > END { print(@errors); }
> > > >
> > > > I got this:
> > > >
> > > > @errors[-28, kretprobe:xfs_bmap_del_extent_real,
> > > >     xfs_bmap_del_extent_real+1
> > > >     kretprobe_trampoline+0
> > > >     xfs_reflink_remap_blocks+286
> > > >     xfs_file_remap_range+272
> > > >     vfs_dedupe_file_range_one+301
> > > >     vfs_dedupe_file_range+342
> > > >     do_vfs_ioctl+832
> > > >     ksys_ioctl+103
> > > >     __x64_sys_ioctl+26
> > > >     do_syscall_64+87
> > > >     entry_SYSCALL_64_after_hwframe+68
> > > > ]: 1
> > > >
> > > > -28 is ENOSPC, but I have more than 1TiB free (and plenty of free
> > > > inodes too).
> > > >
> > > > Poking around in that code I found this block of code:
> > > > > /*
> > > > >  * If it's the case where the directory code is running with no block
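[Editor's note: the -28 in the kretprobe output above is the kernel's negative-errno convention. As an illustrative aside (not part of the original thread), Python's errno tables offer a quick way to decode such values:]

```python
import errno
import os

# Kernel functions return errors as negative errno values; -28 is -ENOSPC.
code = 28
print(errno.errorcode[code])  # symbolic name of errno 28
print(os.strerror(code))      # human-readable description
```

Running this prints `ENOSPC` and `No space left on device`, matching the diagnosis in the mail.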
Re: Internal error xfs_trans_cancel
On 06/26/2016 02:16 PM, Thorsten Leemhuis wrote:
> On 02.06.2016 15:29, Daniel Wagner wrote:
>>> Hmmm, Ok. I've been running the lockperf test and kernel builds all
>>> day on a filesystem that is identical in shape and size to yours
>>> (i.e. xfs_info output is the same) but I haven't reproduced it yet.
>>
>> I don't know if that is important: I run the lockperf test and after
>> they have finished I do a kernel build.
>>
>>> Is it possible to get a metadump image of your filesystem to see if
>>> I can reproduce it on that?
>>
>> Sure, see private mail.
>
> Dave, Daniel, what's the latest status on this issue?

I had no time to do more testing in the last couple of weeks. Tomorrow
I'll try to reproduce it again, though last time I tried I couldn't
trigger it.

> It made it to my list of known 4.7 regressions after Christoph
> suggested it should be listed. But this thread looks stalled, as
> afaics nothing happened for three weeks apart from Josh (added to CC)
> mentioning he also saw it. Or is this discussed elsewhere? Or fixed
> already?

The discussion wandered over to the thread called 'crash in xfs in
current', and there are some instructions by Al on what to test:

Message-ID: <20160622014253.GS12670@dastard>

cheers,
daniel
Re: Internal error xfs_trans_cancel
On 02.06.2016 15:29, Daniel Wagner wrote:
>> Hmmm, Ok. I've been running the lockperf test and kernel builds all
>> day on a filesystem that is identical in shape and size to yours
>> (i.e. xfs_info output is the same) but I haven't reproduced it yet.
>
> I don't know if that is important: I run the lockperf test and after
> they have finished I do a kernel build.
>
>> Is it possible to get a metadump image of your filesystem to see if
>> I can reproduce it on that?
>
> Sure, see private mail.

Dave, Daniel, what's the latest status on this issue? It made it to my
list of known 4.7 regressions after Christoph suggested it should be
listed. But this thread looks stalled, as afaics nothing happened for
three weeks apart from Josh (added to CC) mentioning he also saw it. Or
is this discussed elsewhere? Or fixed already?

Sincerely, your regression tracker for Linux 4.7 (http://bit.ly/28JRmJo)

Thorsten
Re: Internal error xfs_trans_cancel
On Wed, Jun 01, 2016 at 07:52:31AM +0200, Daniel Wagner wrote:
> Hi,
>
> I got the error message below while compiling a kernel
> on that system. I can't really say if I did something
> which made the file system unhappy before the crash.
>
> [Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of file fs/xfs/xfs_trans.c. Caller xfs_rename+0x453/0x960 [xfs]
> [  +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
> [  +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
> [  +0.48] 0286 c8be6bc3 885fa9473cb0 813d146e
> [  +0.56] 885fa9ac5ed0 0001 885fa9473cc8 a0213cdc
> [  +0.53] a02257b3 885fa9473cf0 a022eb36 883faa502d00
> [  +0.53] Call Trace:
> [  +0.28] [] dump_stack+0x63/0x85
> [  +0.69] [] xfs_error_report+0x3c/0x40 [xfs]
> [  +0.65] [] ? xfs_rename+0x453/0x960 [xfs]
> [  +0.64] [] xfs_trans_cancel+0xb6/0xe0 [xfs]
> [  +0.65] [] xfs_rename+0x453/0x960 [xfs]
> [  +0.62] [] xfs_vn_rename+0xb3/0xf0 [xfs]
> [  +0.40] [] vfs_rename+0x58c/0x8d0
> [  +0.32] [] SyS_rename+0x371/0x390
> [  +0.36] [] entry_SYSCALL_64_fastpath+0x1a/0xa4
> [  +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of file fs/xfs/xfs_trans.c. Return address = 0xa022eb4f
> [  +0.027680] XFS (sde1): Corruption of in-memory data detected. Shutting down filesystem
> [  +0.57] XFS (sde1): Please umount the filesystem and rectify the problem(s)
> [Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
> [ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.

I saw this today. I was just building/installing kernels, rebooting,
running kexec, running perf.
[ 1359.005573] [ cut here ]
[ 1359.010191] WARNING: CPU: 4 PID: 6031 at fs/inode.c:280 drop_nlink+0x3e/0x50
[ 1359.017231] Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_powerclamp coretemp kvm_intel kvm nfsd ipmi_ssif ipmi_devintf ipmi_si iTCO_wdt irqbypass iTCO_vendor_support ipmi_msghandler i7core_edac shpchp sg edac_core pcspkr wmi lpc_ich dcdbas mfd_core acpi_power_meter auth_rpcgss acpi_cpufreq nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod cdrom iw_cxgb3 ib_core mgag200 ata_generic pata_acpi i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mptsas scsi_transport_sas ata_piix mptscsih libata cxgb3 crc32c_intel i2c_core serio_raw mptbase bnx2 fjes mdio dm_mirror dm_region_hash dm_log dm_mod
[ 1359.088447] CPU: 4 PID: 6031 Comm: depmod Tainted: G I 4.7.0-rc3+ #4
[ 1359.095911] Hardware name: Dell Inc. PowerEdge R410/0N051F, BIOS 1.11.0 07/20/2012
[ 1359.103461] 0286 a0bc39d9 8802143dfd18 8134bb7f
[ 1359.110871] 8802143dfd58 8108b671
[ 1359.118280] 0118575f7d13 880222c9a6e8 8803ec3874d8 880428827000
[ 1359.125693] Call Trace:
[ 1359.128133] [] dump_stack+0x63/0x84
[ 1359.133259] [] __warn+0xd1/0xf0
[ 1359.138037] [] warn_slowpath_null+0x1d/0x20
[ 1359.143855] [] drop_nlink+0x3e/0x50
[ 1359.149017] [] xfs_droplink+0x28/0x60 [xfs]
[ 1359.154864] [] xfs_remove+0x231/0x350 [xfs]
[ 1359.160682] [] ? security_inode_permission+0x3a/0x60
[ 1359.167309] [] xfs_vn_unlink+0x58/0xa0 [xfs]
[ 1359.173213] [] ? selinux_inode_unlink+0x13/0x20
[ 1359.179379] [] vfs_unlink+0xda/0x190
[ 1359.184590] [] do_unlinkat+0x263/0x2a0
[ 1359.189974] [] SyS_unlinkat+0x1b/0x30
[ 1359.195272] [] do_syscall_64+0x62/0x110
[ 1359.200743] [] entry_SYSCALL64_slow_path+0x25/0x25
[ 1359.207178] ---[ end trace 0d397afdaff9f340 ]---
[ 1359.211830] XFS (dm-0): Internal error xfs_trans_cancel at line 984 of file fs/xfs/xfs_trans.c. Caller xfs_remove+0x1d1/0x350 [xfs]
[ 1359.223723] CPU: 4 PID: 6031 Comm: depmod Tainted: GW I 4.7.0-rc3+ #4
[ 1359.231185] Hardware name: Dell Inc. PowerEdge R410/0N051F, BIOS 1.11.0 07/20/2012
[ 1359.238736] 0286 a0bc39d9 8802143dfd60 8134bb7f
[ 1359.246147] 8803ec3874d8 0001 8802143dfd78 a03176bb
[ 1359.253559] a0328c21 8802143dfda0 a03327a6 880222e7e180
[ 1359.260969] Call Trace:
[ 1359.263407] [] dump_stack+0x63/0x84
[ 1359.268560] [] xfs_error_report+0x3b/0x40 [xfs]
[ 1359.274755] [] ? xfs_remove+0x1d1/0x350 [xfs]
[ 1359.280778] [] xfs_trans_cancel+0xb6/0xe0 [xfs]
[ 1359.286973] [] xfs_remove+0x1d1/0x350 [xfs]
[ 1359.292820] [] xfs_vn_unlink+0x58/0xa0 [xfs]
[ 1359.298724] [] ? selinux_inode_unlink+0x13/0x20
[ 1359.304890] [
Re: Internal error xfs_trans_cancel
> Hmmm, Ok. I've been running the lockperf test and kernel builds all
> day on a filesystem that is identical in shape and size to yours
> (i.e. xfs_info output is the same) but I haven't reproduced it yet.

I don't know if that is important: I run the lockperf test and after
they have finished I do a kernel build.

> Is it possible to get a metadump image of your filesystem to see if
> I can reproduce it on that?

Sure, see private mail.
Re: Internal error xfs_trans_cancel
On Thu, Jun 02, 2016 at 07:23:24AM +0200, Daniel Wagner wrote:
> > posix03 and posix04 just emit error messages:
> >
> > posix04 -n 40 -l 100
> > posix04: invalid option -- 'l'
> > posix04: Usage: posix04 [-i iterations] [-n nr_children] [-s]
>
> I screwed that up. I have patched my version of lockperf to make all
> tests use the same option names, though I forgot to send those
> patches. Will do now.
>
> In this case you can use '-i' instead of '-l'.
>
> > So I changed them to run "-i $l" instead, and that has a somewhat
> > undesired effect:
> >
> > static void
> > kill_children()
> > {
> > 	siginfo_t infop;
> >
> > 	signal(SIGINT, SIG_IGN);
> > 	kill(0, SIGINT);
> > 	while (waitid(P_ALL, 0, &infop, WEXITED) != -1);
> > }
> >
> > Yeah, it sends a SIGINT to everything with the same process group id.
> > It kills the parent shell:
>
> Ah, that rings a bell. I tuned the parameters so that I did not run
> into this problem. I'll do a patch for this one. It's pretty annoying.
>
> > $ ./run-lockperf-tests.sh /mnt/scratch/
> > pid 9597's current affinity list: 0-15
> > pid 9597's new affinity list: 0,4,8,12
> > sh: 1: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: Directory nonexistent
> > posix01 -n 8 -l 100
> > posix02 -n 8 -l 100
> > posix03 -n 8 -i 100
> >
> > So, I've just removed those tests from your script. I'll see if I
> > have any luck with reproducing the problem now.
>
> I was able to reproduce it again with the same steps.

Hmmm, Ok. I've been running the lockperf test and kernel builds all
day on a filesystem that is identical in shape and size to yours
(i.e. xfs_info output is the same) but I haven't reproduced it yet.

Is it possible to get a metadump image of your filesystem to see if
I can reproduce it on that?

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Internal error xfs_trans_cancel
> posix03 and posix04 just emit error messages:
>
> posix04 -n 40 -l 100
> posix04: invalid option -- 'l'
> posix04: Usage: posix04 [-i iterations] [-n nr_children] [-s]

I screwed that up. I have patched my version of lockperf to make all
tests use the same option names, though I forgot to send those
patches. Will do now.

In this case you can use '-i' instead of '-l'.

> So I changed them to run "-i $l" instead, and that has a somewhat
> undesired effect:
>
> static void
> kill_children()
> {
> 	siginfo_t infop;
>
> 	signal(SIGINT, SIG_IGN);
> 	kill(0, SIGINT);
> 	while (waitid(P_ALL, 0, &infop, WEXITED) != -1);
> }
>
> Yeah, it sends a SIGINT to everything with the same process group id.
> It kills the parent shell:

Ah, that rings a bell. I tuned the parameters so that I did not run
into this problem. I'll do a patch for this one. It's pretty annoying.

> $ ./run-lockperf-tests.sh /mnt/scratch/
> pid 9597's current affinity list: 0-15
> pid 9597's new affinity list: 0,4,8,12
> sh: 1: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: Directory nonexistent
> posix01 -n 8 -l 100
> posix02 -n 8 -l 100
> posix03 -n 8 -i 100
>
> So, I've just removed those tests from your script. I'll see if I
> have any luck with reproducing the problem now.

I was able to reproduce it again with the same steps.
Re: Internal error xfs_trans_cancel
On Wed, Jun 01, 2016 at 04:13:10PM +0200, Daniel Wagner wrote:
> >> Anything in the log before this?
> >
> > Just the usual stuff, as I remember. Sorry, I haven't copied the whole log.
>
> Just triggered it again. My steps for it are:
>
> - run all lockperf tests
>
>   git://git.samba.org/jlayton/lockperf.git
>
> via my test script:
>
> #!/bin/sh
>
> run_tests () { .
> for c in `seq 8 32 128`; do
>     for l in `seq 100 100 500`; do
>         time run_tests "posix01 -n $c -l $l " $DIR/posix01-$c-$l.data
>         time run_tests "posix02 -n $c -l $l " $DIR/posix02-$c-$l.data
>         time run_tests "posix03 -n $c -l $l " $DIR/posix03-$c-$l.data
>         time run_tests "posix04 -n $c -l $l " $DIR/posix04-$c-$l.data

posix03 and posix04 just emit error messages:

posix04 -n 40 -l 100
posix04: invalid option -- 'l'
posix04: Usage: posix04 [-i iterations] [-n nr_children] [-s]

So I changed them to run "-i $l" instead, and that has a somewhat
undesired effect:

static void
kill_children()
{
	siginfo_t infop;

	signal(SIGINT, SIG_IGN);
	kill(0, SIGINT);
	while (waitid(P_ALL, 0, &infop, WEXITED) != -1);
}

Yeah, it sends a SIGINT to everything with the same process group id.
It kills the parent shell:

$ ./run-lockperf-tests.sh /mnt/scratch/
pid 9597's current affinity list: 0-15
pid 9597's new affinity list: 0,4,8,12
sh: 1: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: Directory nonexistent
posix01 -n 8 -l 100
posix02 -n 8 -l 100
posix03 -n 8 -i 100

So, I've just removed those tests from your script. I'll see if I
have any luck with reproducing the problem now.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
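[Editor's note: the kill(0, SIGINT) pitfall above — signalling the caller's entire process group, parent shell included — can be demonstrated without lockperf. This is an illustrative sketch (Python for brevity, not part of the thread): a child that first moves itself into its own process group can kill(0, ...) without taking its parent down, which is one common fix for kill_children().]

```python
import os
import signal

pid = os.fork()
if pid == 0:
    # Child: leave the parent's process group first. Without this,
    # kill(0, ...) would signal the whole group -- including the parent,
    # which is exactly how lockperf's kill_children() kills the shell.
    os.setpgid(0, 0)
    os.kill(0, signal.SIGTERM)  # signals only the child's new group
    os._exit(1)                 # never reached; SIGTERM terminates us first

_, status = os.waitpid(pid, 0)
# The child died from SIGTERM; the parent survived to report it.
print(os.WIFSIGNALED(status), signal.Signals(os.WTERMSIG(status)).name)
```

Without the os.setpgid(0, 0) call, the SIGTERM would reach the parent too (unless the parent blocks or ignores it, as kill_children() does for SIGINT via signal(SIGINT, SIG_IGN)).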
Re: Internal error xfs_trans_cancel
> via my test script:

Looks like my email client did not agree with my formatting of the
script.

https://www.monom.org/data/lglock/run-tests.sh
Re: Internal error xfs_trans_cancel
>> Anything in the log before this?
>
> Just the usual stuff, as I remember. Sorry, I haven't copied the whole log.

Just triggered it again. My steps for it are:

- run all lockperf tests

  git://git.samba.org/jlayton/lockperf.git

via my test script:

#!/bin/sh

run_tests () {
	echo $1
	for i in `seq 10`; do
		rm -rf /tmp/a;
		$1 /tmp/a > /dev/null
		sync
	done
	for i in `seq 100`; do
		rm -rf /tmp/a;
		$1 /tmp/a >> $2
		sync
	done
}

PATH=~/src/lockperf:$PATH
DIR=$1-`uname -r`

if [ ! -d "$DIR" ]; then
	mkdir $DIR
fi
Re: Internal error xfs_trans_cancel
On 06/01/2016 09:10 AM, Dave Chinner wrote: > On Wed, Jun 01, 2016 at 07:52:31AM +0200, Daniel Wagner wrote: >> I got the error message below while compiling a kernel >> on that system. I can't really say if I did something >> which made the file system unhappy before the crash. >> >> >> [Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of >> file fs/xfs/xfs_trans.c. Caller xfs_rename+0x453/0x960 [xfs] > > Anything in the log before this? Just the usual stuff, as I remember. Sorry, I haven't copied the whole log. >> [ +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16 >> [ +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 >> 01/16/2014 >> [ +0.48] 0286 c8be6bc3 885fa9473cb0 >> 813d146e >> [ +0.56] 885fa9ac5ed0 0001 885fa9473cc8 >> a0213cdc >> [ +0.53] a02257b3 885fa9473cf0 a022eb36 >> 883faa502d00 >> [ +0.53] Call Trace: >> [ +0.28] [] dump_stack+0x63/0x85 >> [ +0.69] [] xfs_error_report+0x3c/0x40 [xfs] >> [ +0.65] [] ? xfs_rename+0x453/0x960 [xfs] >> [ +0.64] [] xfs_trans_cancel+0xb6/0xe0 [xfs] >> [ +0.65] [] xfs_rename+0x453/0x960 [xfs] >> [ +0.62] [] xfs_vn_rename+0xb3/0xf0 [xfs] >> [ +0.40] [] vfs_rename+0x58c/0x8d0 >> [ +0.32] [] SyS_rename+0x371/0x390 >> [ +0.36] [] entry_SYSCALL_64_fastpath+0x1a/0xa4 >> [ +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of >> file fs/xfs/xfs_trans.c. Return address = 0xa022eb4f >> [ +0.027680] XFS (sde1): Corruption of in-memory data detected. Shutting >> down filesystem >> [ +0.57] XFS (sde1): Please umount the filesystem and rectify the >> problem(s) >> [Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned. >> [ +30.081016] XFS (sde1): xfs_log_force: error -5 returned. > > Doesn't normally happen, and there's not a lot to go on here. Restarted the box and did a couple of kernel builds and everything was fine. > Can > you provide the info listed in the link below so we have some idea > of what configuration the error occurred on? 
Sure, forgot that in the first post.

> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

# uname -r
4.7.0-rc1-3-g1f55b0d

# xfs_repair -V
xfs_repair version 4.5.0

# cat /proc/cpuinfo | grep CPU | wc -l
64

# cat /proc/meminfo
MemTotal:          528344752 kB
MemFree:           526838036 kB
MemAvailable:      525265612 kB
Buffers:                2716 kB
Cached:               216896 kB
SwapCached:                0 kB
Active:               119924 kB
Inactive:             116552 kB
Active(anon):          17416 kB
Inactive(anon):         1108 kB
Active(file):         102508 kB
Inactive(file):       115444 kB
Unevictable:               0 kB
Mlocked:                   0 kB
SwapTotal:                 0 kB
SwapFree:                  0 kB
Dirty:                     0 kB
Writeback:                 0 kB
AnonPages:             16972 kB
Mapped:                25288 kB
Shmem:                  1616 kB
Slab:                 184920 kB
SReclaimable:          60028 kB
SUnreclaim:           124892 kB
KernelStack:           13120 kB
PageTables:             2292 kB
NFS_Unstable:              0 kB
Bounce:                    0 kB
WritebackTmp:              0 kB
CommitLimit:       264172376 kB
Committed_AS:         270612 kB
VmallocTotal:    34359738367 kB
VmallocUsed:               0 kB
VmallocChunk:              0 kB
HardwareCorrupted:         0 kB
AnonHugePages:             0 kB
CmaTotal:                  0 kB
CmaFree:                   0 kB
HugePages_Total:           0
HugePages_Free:            0
HugePages_Rsvd:            0
HugePages_Surp:            0
Hugepagesize:           2048 kB
DirectMap4k:          232256 kB
DirectMap2M:         7061504 kB
DirectMap1G:       531628032 kB

# cat /proc/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=264153644k,nr_inodes=66038411,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosu
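Since the reporter could not supply the log context preceding the shutdown, a minimal sketch of saving it before rebooting may be worth noting (the function name and output filename are hypothetical; it assumes either a systemd journal or a readable `dmesg`):

```shell
# Sketch: save the kernel log surrounding an XFS shutdown for a bug report.
# Defined as a function so nothing runs until invoked on the affected host.
capture_xfs_log() {
    out="${1:-xfs-shutdown.log}"
    if command -v journalctl >/dev/null 2>&1; then
        # -k: kernel messages only, -b: current boot
        journalctl -k -b > "$out" 2>/dev/null && { echo "saved to $out"; return 0; }
    fi
    dmesg > "$out" 2>/dev/null && echo "saved to $out" \
        || echo "need root to read the kernel log"
}
```

Capturing the whole boot's kernel log, rather than just the shutdown lines, preserves any I/O errors or allocation failures that preceded the transaction cancel.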
Re: Internal error xfs_trans_cancel
On Wed, Jun 01, 2016 at 07:52:31AM +0200, Daniel Wagner wrote:
> Hi,
>
> I got the error message below while compiling a kernel
> on that system. I can't really say if I did something
> which made the file system unhappy before the crash.
>
> [Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of file
> fs/xfs/xfs_trans.c. Caller xfs_rename+0x453/0x960 [xfs]

Anything in the log before this?

> [ +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
> [ +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20
> 01/16/2014
> [ +0.48] 0286 c8be6bc3 885fa9473cb0 813d146e
> [ +0.56] 885fa9ac5ed0 0001 885fa9473cc8 a0213cdc
> [ +0.53] a02257b3 885fa9473cf0 a022eb36 883faa502d00
> [ +0.53] Call Trace:
> [ +0.28] [] dump_stack+0x63/0x85
> [ +0.69] [] xfs_error_report+0x3c/0x40 [xfs]
> [ +0.65] [] ? xfs_rename+0x453/0x960 [xfs]
> [ +0.64] [] xfs_trans_cancel+0xb6/0xe0 [xfs]
> [ +0.65] [] xfs_rename+0x453/0x960 [xfs]
> [ +0.62] [] xfs_vn_rename+0xb3/0xf0 [xfs]
> [ +0.40] [] vfs_rename+0x58c/0x8d0
> [ +0.32] [] SyS_rename+0x371/0x390
> [ +0.36] [] entry_SYSCALL_64_fastpath+0x1a/0xa4
> [ +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of
> file fs/xfs/xfs_trans.c. Return address = 0xa022eb4f
> [ +0.027680] XFS (sde1): Corruption of in-memory data detected. Shutting
> down filesystem
> [ +0.57] XFS (sde1): Please umount the filesystem and rectify the
> problem(s)
> [Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
> [ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.

Doesn't normally happen, and there's not a lot to go on here. Can
you provide the info listed in the link below so we have some idea
of what configuration the error occurred on?

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

You didn't run out of space or something unusual like that?

Does 'xfs_repair -n' report any errors?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
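The basics the XFS FAQ asks for are all read-only to collect. A minimal sketch of that gathering step (the `command -v` guard is an addition, since xfsprogs may not be installed on the host used to write the report):

```shell
# Read-only collection of the configuration info the XFS FAQ requests.
uname -r                              # kernel version
grep -c '^processor' /proc/cpuinfo    # number of CPUs
head -n 3 /proc/meminfo               # memory summary
if command -v xfs_repair >/dev/null 2>&1; then
    xfs_repair -V                     # xfsprogs version
else
    echo "xfs_repair: not installed"
fi
```

None of these commands touch the filesystem, so they are safe to run even while the affected volume is shut down.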
Internal error xfs_trans_cancel
Hi,

I got the error message below while compiling a kernel
on that system. I can't really say if I did something
which made the file system unhappy before the crash.

[Jun 1 07:41] XFS (sde1): Internal error xfs_trans_cancel at line 984 of file fs/xfs/xfs_trans.c. Caller xfs_rename+0x453/0x960 [xfs]
[ +0.95] CPU: 22 PID: 8640 Comm: gcc Not tainted 4.7.0-rc1 #16
[ +0.35] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
[ +0.48] 0286 c8be6bc3 885fa9473cb0 813d146e
[ +0.56] 885fa9ac5ed0 0001 885fa9473cc8 a0213cdc
[ +0.53] a02257b3 885fa9473cf0 a022eb36 883faa502d00
[ +0.53] Call Trace:
[ +0.28] [] dump_stack+0x63/0x85
[ +0.69] [] xfs_error_report+0x3c/0x40 [xfs]
[ +0.65] [] ? xfs_rename+0x453/0x960 [xfs]
[ +0.64] [] xfs_trans_cancel+0xb6/0xe0 [xfs]
[ +0.65] [] xfs_rename+0x453/0x960 [xfs]
[ +0.62] [] xfs_vn_rename+0xb3/0xf0 [xfs]
[ +0.40] [] vfs_rename+0x58c/0x8d0
[ +0.32] [] SyS_rename+0x371/0x390
[ +0.36] [] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ +0.40] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 985 of file fs/xfs/xfs_trans.c. Return address = 0xa022eb4f
[ +0.027680] XFS (sde1): Corruption of in-memory data detected. Shutting down filesystem
[ +0.57] XFS (sde1): Please umount the filesystem and rectify the problem(s)
[Jun 1 07:42] XFS (sde1): xfs_log_force: error -5 returned.
[ +30.081016] XFS (sde1): xfs_log_force: error -5 returned.

cheers,
daniel
Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)
On 30/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> > On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> > > On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> > > > Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> > > > file fs/xfs/xfs_trans.c. Caller 0x8034b47e
> > > >
> > > > Call Trace:
> > > > [8020b122] show_trace+0xb2/0x380
> > > > [8020b405] dump_stack+0x15/0x20
> > > > [80327b4c] xfs_error_report+0x3c/0x50
> > > > [803435ae] xfs_trans_cancel+0x6e/0x130
> > > > [8034b47e] xfs_create+0x5ee/0x6a0
> > > > [80356556] xfs_vn_mknod+0x156/0x2e0
> > > > [803566eb] xfs_vn_create+0xb/0x10
> > > > [80284b2c] vfs_create+0x8c/0xd0
> > > > [802e734a] nfsd_create_v3+0x31a/0x560
> > > > [802ec838] nfsd3_proc_create+0x148/0x170
> > > > [802e19f9] nfsd_dispatch+0xf9/0x1e0
> > > > [8049d617] svc_process+0x437/0x6e0
> > > > [802e176d] nfsd+0x1cd/0x360
> > > > [8020ab1c] child_rip+0xa/0x12
> > > > xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> > > > fs/xfs/xfs_trans.c. Return address = 0x80359daa
> > >
> > > We shut down the filesystem because we cancelled a dirty transaction.
> > > Once we start to dirty the incore objects, we can't roll back to
> > > an unchanged state if a subsequent fatal error occurs during the
> > > transaction and we have to abort it.
> >
> > So you are saying that there's nothing I can do to prevent this from
> > happening in the future?
>
> Pretty much - we need to work out what is going wrong and we can't from
> the shutdown message above - the error has occurred in a path that
> doesn't have error report traps in it.
>
> Is this reproducible?

Not on demand, no. It has happened only this once as far as I know and
for unknown reasons.

> > > If I understand historic occurrences of this correctly, there is
> > > a possibility that it can be triggered in ENOMEM situations. Was your
> > > machine running out of memory when this occurred?
> >
> > Not really. I just checked my monitoring software and, at the time
> > this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> > used (but 11G available).
>
> Ok. Sounds like we need more error reporting points inserted into that
> code so we dump an error earlier and hence have some hope of working
> out what went wrong next time.
>
> OOC, there weren't any I/O errors reported before this shutdown?

No. I looked but found none.

Let me know if there's anything I can do to help.

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)
On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> > On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> > > Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> > > file fs/xfs/xfs_trans.c. Caller 0x8034b47e
> > >
> > > Call Trace:
> > > [8020b122] show_trace+0xb2/0x380
> > > [8020b405] dump_stack+0x15/0x20
> > > [80327b4c] xfs_error_report+0x3c/0x50
> > > [803435ae] xfs_trans_cancel+0x6e/0x130
> > > [8034b47e] xfs_create+0x5ee/0x6a0
> > > [80356556] xfs_vn_mknod+0x156/0x2e0
> > > [803566eb] xfs_vn_create+0xb/0x10
> > > [80284b2c] vfs_create+0x8c/0xd0
> > > [802e734a] nfsd_create_v3+0x31a/0x560
> > > [802ec838] nfsd3_proc_create+0x148/0x170
> > > [802e19f9] nfsd_dispatch+0xf9/0x1e0
> > > [8049d617] svc_process+0x437/0x6e0
> > > [802e176d] nfsd+0x1cd/0x360
> > > [8020ab1c] child_rip+0xa/0x12
> > > xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> > > fs/xfs/xfs_trans.c. Return address = 0x80359daa
> >
> > We shut down the filesystem because we cancelled a dirty transaction.
> > Once we start to dirty the incore objects, we can't roll back to
> > an unchanged state if a subsequent fatal error occurs during the
> > transaction and we have to abort it.
>
> So you are saying that there's nothing I can do to prevent this from
> happening in the future?

Pretty much - we need to work out what is going wrong and we can't from
the shutdown message above - the error has occurred in a path that
doesn't have error report traps in it.

Is this reproducible?

> > If I understand historic occurrences of this correctly, there is
> > a possibility that it can be triggered in ENOMEM situations. Was your
> > machine running out of memory when this occurred?
>
> Not really. I just checked my monitoring software and, at the time
> this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> used (but 11G available).

Ok. Sounds like we need more error reporting points inserted into that
code so we dump an error earlier and hence have some hope of working
out what went wrong next time.

OOC, there weren't any I/O errors reported before this shutdown?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)
On 29/11/06, David Chinner <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> > Hi,
> >
> > One of my NFS servers just gave me a nasty surprise that I think it is
> > relevant to tell you about:
>
> Thanks, Jesper.
>
> > Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> > file fs/xfs/xfs_trans.c. Caller 0x8034b47e
> >
> > Call Trace:
> > [8020b122] show_trace+0xb2/0x380
> > [8020b405] dump_stack+0x15/0x20
> > [80327b4c] xfs_error_report+0x3c/0x50
> > [803435ae] xfs_trans_cancel+0x6e/0x130
> > [8034b47e] xfs_create+0x5ee/0x6a0
> > [80356556] xfs_vn_mknod+0x156/0x2e0
> > [803566eb] xfs_vn_create+0xb/0x10
> > [80284b2c] vfs_create+0x8c/0xd0
> > [802e734a] nfsd_create_v3+0x31a/0x560
> > [802ec838] nfsd3_proc_create+0x148/0x170
> > [802e19f9] nfsd_dispatch+0xf9/0x1e0
> > [8049d617] svc_process+0x437/0x6e0
> > [802e176d] nfsd+0x1cd/0x360
> > [8020ab1c] child_rip+0xa/0x12
> > xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> > fs/xfs/xfs_trans.c. Return address = 0x80359daa
>
> We shut down the filesystem because we cancelled a dirty transaction.
> Once we start to dirty the incore objects, we can't roll back to
> an unchanged state if a subsequent fatal error occurs during the
> transaction and we have to abort it.

So you are saying that there's nothing I can do to prevent this from
happening in the future?

> If I understand historic occurrences of this correctly, there is
> a possibility that it can be triggered in ENOMEM situations. Was your
> machine running out of memory when this occurred?

Not really. I just checked my monitoring software and, at the time
this happened, the box had ~5.9G RAM free (of 8G total) and no swap
used (but 11G available).

> > Filesystem "dm-1": Corruption of in-memory data detected. Shutting
> > down filesystem: dm-1
> > Please umount the filesystem, and rectify the problem(s)
> > nfsd: non-standard errno: 5
>
> EIO gets returned in certain locations once the filesystem has been
> shutdown.

Makes sense.

> > I unmounted the filesystem, ran xfs_repair which told me to try and
> > mount it first to replay the log, so I did, unmounted it again, ran
> > xfs_repair (which didn't find any problems) and finally mounted it and
> > everything is good - the filesystem seems intact.
>
> Yeah, the above error report typically is due to an in-memory problem,
> not an on disk issue.

Good to know.

> > The server in question is running kernel 2.6.18.1
>
> Can happen to XFS on any kernel version - got a report of this from
> someone running a 2.4 kernel a couple of weeks ago.

Ok. Thank you for your reply David.

--
Jesper Juhl <[EMAIL PROTECTED]>
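The recovery sequence Jesper describes can be sketched as follows (device and mountpoint are placeholder arguments, and the function name is hypothetical; it is defined as a function so nothing destructive runs until it is invoked deliberately):

```shell
# Sketch of the recovery order for an XFS shutdown: xfs_repair refuses to
# touch a filesystem with a dirty log, so mount once to replay the log,
# then unmount and repair on the now-clean filesystem.
xfs_replay_and_repair() {
    dev="$1" mnt="$2"        # hypothetical device and mountpoint
    mount "$dev" "$mnt"      # mounting replays the XFS journal
    umount "$mnt"
    xfs_repair "$dev"        # safe now: the log has been replayed
    mount "$dev" "$mnt"      # remount for normal use
}
```

Running `xfs_repair -L` instead would zero the dirty log, but that discards pending metadata updates and is a last resort when the log cannot be replayed.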
Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)
On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> Hi,
>
> One of my NFS servers just gave me a nasty surprise that I think it is
> relevant to tell you about:

Thanks, Jesper.

> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> file fs/xfs/xfs_trans.c. Caller 0x8034b47e
>
> Call Trace:
> [8020b122] show_trace+0xb2/0x380
> [8020b405] dump_stack+0x15/0x20
> [80327b4c] xfs_error_report+0x3c/0x50
> [803435ae] xfs_trans_cancel+0x6e/0x130
> [8034b47e] xfs_create+0x5ee/0x6a0
> [80356556] xfs_vn_mknod+0x156/0x2e0
> [803566eb] xfs_vn_create+0xb/0x10
> [80284b2c] vfs_create+0x8c/0xd0
> [802e734a] nfsd_create_v3+0x31a/0x560
> [802ec838] nfsd3_proc_create+0x148/0x170
> [802e19f9] nfsd_dispatch+0xf9/0x1e0
> [8049d617] svc_process+0x437/0x6e0
> [802e176d] nfsd+0x1cd/0x360
> [8020ab1c] child_rip+0xa/0x12
> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c. Return address = 0x80359daa

We shut down the filesystem because we cancelled a dirty transaction.
Once we start to dirty the incore objects, we can't roll back to
an unchanged state if a subsequent fatal error occurs during the
transaction and we have to abort it.

If I understand historic occurrences of this correctly, there is
a possibility that it can be triggered in ENOMEM situations. Was your
machine running out of memory when this occurred?

> Filesystem "dm-1": Corruption of in-memory data detected. Shutting
> down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
> nfsd: non-standard errno: 5

EIO gets returned in certain locations once the filesystem has been
shutdown.

> I unmounted the filesystem, ran xfs_repair which told me to try and
> mount it first to replay the log, so I did, unmounted it again, ran
> xfs_repair (which didn't find any problems) and finally mounted it and
> everything is good - the filesystem seems intact.

Yeah, the above error report typically is due to an in-memory problem,
not an on disk issue.

> The server in question is running kernel 2.6.18.1

Can happen to XFS on any kernel version - got a report of this from
someone running a 2.4 kernel a couple of weeks ago.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)
Hi, One of my NFS servers just gave me a nasty surprise that I think it is relevant to tell you about: Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. Caller 0x8034b47e Call Trace: [] show_trace+0xb2/0x380 [] dump_stack+0x15/0x20 [] xfs_error_report+0x3c/0x50 [] xfs_trans_cancel+0x6e/0x130 [] xfs_create+0x5ee/0x6a0 [] xfs_vn_mknod+0x156/0x2e0 [] xfs_vn_create+0xb/0x10 [] vfs_create+0x8c/0xd0 [] nfsd_create_v3+0x31a/0x560 [] nfsd3_proc_create+0x148/0x170 [] nfsd_dispatch+0xf9/0x1e0 [] svc_process+0x437/0x6e0 [] nfsd+0x1cd/0x360 [] child_rip+0xa/0x12 xfs_force_shutdown(dm-1,0x8) called from line 1139 of file fs/xfs/xfs_trans.c. Return address = 0x80359daa Filesystem "dm-1": Corruption of in-memory data detected. Shutting down filesystem: dm-1 Please umount the filesystem, and rectify the problem(s) nfsd: non-standard errno: 5 nfsd: non-standard errno: 5 nfsd: non-standard errno: 5 nfsd: non-standard errno: 5 nfsd: non-standard errno: 5 (the above message repeates 1670 times, then the following) xfs_force_shutdown(dm-1,0x1) called from line 424 of file fs/xfs/xfs_rw.c. Return address = 0x80359daa I unmounted the filesystem, ran xfs_repair which told me to try an mount it first to replay the log, so I did, unmounted it again, ran xfs_repair (which didn't find any problems) and finally mounted it and everything is good - the filesystem seems intact. 
Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Starting XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Ending XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Ending clean XFS mount for filesystem: dm-1

The server in question is running kernel 2.6.18.1

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html