Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote: Well, I explained the problem, ext4 and others of course tell me which devid an error is on, hopefully btrfs will able to do so in the near future. So it is ok for you to print one of btrfs filesystem device(for example device name) ? maybe it is not really physical address the metadata locates in, this is easier. Yes, the device name is great, now I can see which of my 3 filesystems has a problem, that's a start. Next would be knowing which filename this occurred in, but I understand this would be harder to get from that point in the code. Ideally scrub should be able to find that problem and report it, at least I would know which filesystem to rescan for errors: Back to the original problem, would you agree that find / -type f -print0 | xargs grep . /dev/nul? I'll also have to try this to see if I get lucky with it :) + printk_ratelimited(BTRFS (device: %s) parent transid verify failed on %llu wanted %llu found %llu\n, + eb-fs_info-sb-s_id, eb-start, + parent_transid, btrfs_header_generation(eb)); That looks great. Ideally all such errors would look like this. Thanks for looking into this, I appreciate it. Best, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On 07/04/2014 02:02 PM, Marc MERLIN wrote: On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote: Well, I explained the problem, ext4 and others of course tell me which devid an error is on, hopefully btrfs will able to do so in the near future. So it is ok for you to print one of btrfs filesystem device(for example device name) ? maybe it is not really physical address the metadata locates in, this is easier. Yes, the device name is great, now I can see which of my 3 filesystems has a problem, that's a start. Next would be knowing which filename this occurred in, but I understand this would be harder to get from that point in the code. Ideally scrub should be able to find that problem and report it, at least I would know which filesystem to rescan for errors: Back to the original problem, would you agree that find / -type f -print0 | xargs grep . /dev/nul? I'll also have to try this to see if I get lucky with it :) + printk_ratelimited(BTRFS (device: %s) parent transid verify failed on %llu wanted %llu found %llu\n, + eb-fs_info-sb-s_id, eb-start, + parent_transid, btrfs_header_generation(eb)); That looks great. Ideally all such errors would look like this. OK, i found there is a mirror num recorded in struct @extent_buffer, so it is not diffcult to locate the real physical address that this corrupt metadata block locates. But another question is that if such problems happen, it should also help little, because usually other mirrors maybe have the same errors... Looking at btrfs codes, if it found such error which means checksum has passed but generation verification fails it dosen't even try other mirrors if existed, so i supposed such kinds of errors is fatal and could be corrected by btrfs itself.. Anyway, let me see if there are any other output even dose not output btrfs specific info. Thanks, Thanks for looking into this, I appreciate it. Best, Marc -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On 07/04/2014 02:02 PM, Marc MERLIN wrote: On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote: Well, I explained the problem, ext4 and others of course tell me which devid an error is on, hopefully btrfs will able to do so in the near future. So it is ok for you to print one of btrfs filesystem device(for example device name) ? maybe it is not really physical address the metadata locates in, this is easier. Yes, the device name is great, now I can see which of my 3 filesystems has a problem, that's a start. Next would be knowing which filename this occurred in, but I understand this would be harder to get from that point in the code. Ideally scrub should be able to find that problem and report it, at least I would know which filesystem to rescan for errors: So there is a problem, ususally such generation verifications errors is related to Btrfs metdata block.it is maybe just a Btrfs node , not related to any actually files, or even a leaf that contains more that one file If this is a read/write path from normal fs/file root, we may output its' root..but if this is something like extent root...i think it helps little Back to the original problem, would you agree that find / -type f -print0 | xargs grep . /dev/nul? I'll also have to try this to see if I get lucky with it :) + printk_ratelimited(BTRFS (device: %s) parent transid verify failed on %llu wanted %llu found %llu\n, + eb-fs_info-sb-s_id, eb-start, + parent_transid, btrfs_header_generation(eb)); That looks great. Ideally all such errors would look like this. Thanks for looking into this, I appreciate it. Best, Marc -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On Fri, Jul 04, 2014 at 02:12:36PM +0800, Wang Shilong wrote: OK, i found there is a mirror num recorded in struct @extent_buffer, so it is not diffcult to locate the real physical address that this corrupt metadata block locates. But another question is that if such problems happen, it should also help little, because usually other mirrors maybe have the same errors... I understand, although at least knowing which filesystem has the problem, whether it got auto corrected, or not, is good. Imagine if I get 500 of those errors in syslog. Even if btrfs can auto correct them, as an admin, I would want to know this because it means something likely bad happened and I may want to inspect that filesystem for damage, or the hardware for faults. So the extra output is very useful, thank you. Anyway, let me see if there are any other output even dose not output btrfs specific info. Thank you very much. On Fri, Jul 04, 2014 at 05:59:06PM +0800, Wang Shilong wrote: Marc argued that if there are several btrfs filesystems mounted, while users even don't know which filesystem hit the corrupted errors something like generation verification failure. Since @extent_buffer structure has a member @fs_info, let's output btrfs device info. I don't know the btrfs code much, but looking at the patch, it looks good, thank you for that. Marc Reported-by: Marc MERLIN m...@merlins.org Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com --- Tell me if missed anything here... --- fs/btrfs/disk-io.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 08e65e9..bd52ca0 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -351,9 +351,9 @@ static int verify_parent_transid(struct extent_io_tree *io_tree, ret = 0; goto out; } - printk_ratelimited(parent transid verify failed on %llu wanted %llu -found %llu\n, -eb-start, parent_transid, btrfs_header_generation(eb)); + printk_ratelimited(BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n, + eb-fs_info-sb-s_id, eb-start, + parent_transid, btrfs_header_generation(eb)); ret = 1; /* @@ -617,15 +617,15 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, found_start = btrfs_header_bytenr(eb); if (found_start != eb-start) { - printk_ratelimited(KERN_INFO BTRFS: bad tree block start + printk_ratelimited(KERN_INFO BTRFS (device %s): bad tree block start %llu %llu\n, -found_start, eb-start); +eb-fs_info-sb-s_id, found_start, eb-start); ret = -EIO; goto err; } if (check_tree_block_fsid(root, eb)) { - printk_ratelimited(KERN_INFO BTRFS: bad fsid on block %llu\n, -eb-start); + printk_ratelimited(KERN_INFO BTRFS (device %s): bad fsid on block %llu\n, +eb-fs_info-sb-s_id, eb-start); ret = -EIO; goto err; } -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
Marc MERLIN posted on Wed, 02 Jul 2014 13:41:52 -0700 as excerpted: This got triggered by an rsync I think. I'm not sure which of my btrfs FS has the issue yet since BUG_ON isn't very helpful as discussed earlier. [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 [160562.950297] [ cut here ] [160562.965904] kernel BUG at fs/btrfs/locking.c:269! But shouldn't messages like 'parent transid verify failed' print which device this happened on to give the operator a hint on where the problem is? Could someone do a pass at those and make sure they all print the device ID/name? kernel 3.16 series here, rc2+ when this happened, rc3+ now. IOW it's 3.15 series (Marc) and 3.16 series (me) both known affected. FWIW, I'm not sure what originally triggered it, but I recently had a couple bad shutdowns -- systemd would say it was turning off the system but it wouldn't shut off, and on manual shutoff and later reboot, the read-write-mounted btrfs (two separate filesystems, home and log, root, including the rest of the system, is read-only by default, here) failed to mount. The mount failures triggered on the above parent transid failed errors with kernel BUG at fs/btrs/locking.c -- I /believe/ 269, but I couldn't swear to it. The biggest difference here (other than the fact that it happened on mount of critical filesystems at boot, so probably double-digit seconds in since boot at most) was that the parent transid numbers were only a few off, something like wanted , found -2. [End of technical stuff. The rest is discussion of my recovery experience, both because it might be of help to others and because it lets me tell my experience. =:^) ] I have backups but they weren't as current as I would have liked, so I decided to try recovery. My rootfs is btrfs as well, a /separate/ btrfs, but it remains mounted read-only by default and is only mounted read-write for updates, so wasn't damaged. That includes the bits of /var that I can get away with being read-only, with various /var/lib/* subdirs being symlinks to /home/var/lib/* subdirs, where they need to be writable, and of course /var/log being a separate dedicated filesystem -- one of the two that was damaged, and the usual /var/run and /var/lock symlinks to /run and /run/lock (with /run being a tmpfs mount on standard systemd configurations). As a result, the rootfs mounted from the initramfs and systemd on it was invoked as the new PID 1 init in the transfer from initramfs. Systemd was in turn able to start early boot services and anything that didn't have a dependency on home (including the bits of / var/lib symlinked into home) or log being mounted. But that of course left a number of critical services failing due to dependency on the home and/or log mounts, since those mounts were failing. Fortunately, while some of the time the errors would trigger a full kernel lockup with the above parent transid and locking BUG, other times the mount attempt would simply error out, and systemd would drop me to the emergency-mode root-login prompt. (If it hadn't, I'd have had to switch to booting the backup.) Since the main rootfs including /usr, /etc and much of /var was already mounted and safely read-only so I wasn't too worried about damaging it, that left me with only a partly working system, but access to all the normal recovery tools, manpages, etc, I'd normally have. The only big thing (other than X/kde, of course) initially was network access, due to dependencies on the unmountable filesystems for local DNS and I think iptables logging. I could have reconfigured that if I had to, but after I got log back up, I found I had network access (presumably with fallback to the ISP DNS), and was able to get to the wiki to research recovery of home a bit further. I decided to tackle log (/var/log) before home since it was smaller and I figured I could use anything I learned in that process to help me save more of home. My policy is no backup log partition, since I don't do backups regularly enough for the logs thereon to be of much likely usefulness. That left me trying to repair or recover what I could and then doing a mkfs.btrfs on it. The various repair options I tried didn't help -- they either died without helpful output or triggered the same lockup. Mount with the recovery or recovery,ro wouldn't work, and neither would btrfs check. Btrfs rescue didn't look useful, as I couldn't find useful documentation on chunk-recover and the supers looked fine (btrfs-show-super) so super- recover was unnecessary. I tried btrfs-zero-log on the log partition, but it didn't make the problem better and might have made it worse, so didn't try it on home. That left btrfs restore. I used it on log without really understanding what I was doing, and lost an entire directory worth of logs. =:^( Fortunately, I was able to learn a bit,
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote: This got triggered by an rsync I think. I'm not sure which of my btrfs FS has the issue yet since BUG_ON isn't very helpful as discussed earlier. [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 [160562.950297] [ cut here ] [160562.965904] kernel BUG at fs/btrfs/locking.c:269! But shouldn't messages like 'parent transid verify failed' print which device this happened on to give the operator a hint on where the problem is? Could someone do a pass at those and make sure they all print the device ID/name? Bug below: Full log before the crash: INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task kworker/u8:13:13157 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. kworker/u8:13 D 0 13157 2 0x0080 Workqueue: btrfs-flush_delalloc normal_work_helper 8800041cfc00 0046 8800041cfbd0 8800041cffd8 8800034f40d0 000141c0 88021f2941c0 8800034f40d0 8800041cfca0 810fdc2f 0002 8800041cfc10 Call Trace: [810fdc2f] ? wait_on_page_read+0x3c/0x3c [8161d3c6] schedule+0x73/0x75 [8161d56b] io_schedule+0x60/0x7a [810fdc3d] sleep_on_page+0xe/0x12 [8161d7fd] __wait_on_bit+0x48/0x7a [810fdbdd] wait_on_page_bit+0x7a/0x7c [81084821] ? autoremove_wake_function+0x34/0x34 [810fef03] filemap_fdatawait_range+0x7e/0x126 [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4 [8122d7aa] ? btrfs_writepages+0x28/0x2a [811084c6] ? do_writepages+0x1e/0x2c [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57 [8124006f] btrfs_wait_ordered_range+0x6a/0x11a [8122fe01] btrfs_run_delalloc_work+0x27/0x69 [812508db] normal_work_helper+0xfe/0x240 [81065d7e] process_one_work+0x195/0x2d2 [81066020] worker_thread+0x136/0x205 [81065eea] ? process_scheduled_works+0x2f/0x2f [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On 07/03/2014 04:13 PM, Liu Bo wrote: On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote: This got triggered by an rsync I think. I'm not sure which of my btrfs FS has the issue yet since BUG_ON isn't very helpful as discussed earlier. [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 [160562.950297] [ cut here ] [160562.965904] kernel BUG at fs/btrfs/locking.c:269! But shouldn't messages like 'parent transid verify failed' print which device this happened on to give the operator a hint on where the problem is? Could someone do a pass at those and make sure they all print the device ID/name? Bug below: Full log before the crash: INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task kworker/u8:13:13157 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. kworker/u8:13 D 0 13157 2 0x0080 Workqueue: btrfs-flush_delalloc normal_work_helper 8800041cfc00 0046 8800041cfbd0 8800041cffd8 8800034f40d0 000141c0 88021f2941c0 8800034f40d0 8800041cfca0 810fdc2f 0002 8800041cfc10 Call Trace: [810fdc2f] ? wait_on_page_read+0x3c/0x3c [8161d3c6] schedule+0x73/0x75 [8161d56b] io_schedule+0x60/0x7a [810fdc3d] sleep_on_page+0xe/0x12 [8161d7fd] __wait_on_bit+0x48/0x7a [810fdbdd] wait_on_page_bit+0x7a/0x7c [81084821] ? autoremove_wake_function+0x34/0x34 [810fef03] filemap_fdatawait_range+0x7e/0x126 [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4 [8122d7aa] ? btrfs_writepages+0x28/0x2a [811084c6] ? do_writepages+0x1e/0x2c [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57 [8124006f] btrfs_wait_ordered_range+0x6a/0x11a [8122fe01] btrfs_run_delalloc_work+0x27/0x69 [812508db] normal_work_helper+0xfe/0x240 [81065d7e] process_one_work+0x195/0x2d2 [81066020] worker_thread+0x136/0x205 [81065eea] ? process_scheduled_works+0x2f/0x2f [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On Thu, Jul 03, 2014 at 04:20:47PM +0800, Wang Shilong wrote: On 07/03/2014 04:13 PM, Liu Bo wrote: On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote: This got triggered by an rsync I think. I'm not sure which of my btrfs FS has the issue yet since BUG_ON isn't very helpful as discussed earlier. [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 [160562.950297] [ cut here ] [160562.965904] kernel BUG at fs/btrfs/locking.c:269! But shouldn't messages like 'parent transid verify failed' print which device this happened on to give the operator a hint on where the problem is? Could someone do a pass at those and make sure they all print the device ID/name? Bug below: Full log before the crash: INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task kworker/u8:13:13157 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. kworker/u8:13 D 0 13157 2 0x0080 Workqueue: btrfs-flush_delalloc normal_work_helper 8800041cfc00 0046 8800041cfbd0 8800041cffd8 8800034f40d0 000141c0 88021f2941c0 8800034f40d0 8800041cfca0 810fdc2f 0002 8800041cfc10 Call Trace: [810fdc2f] ? wait_on_page_read+0x3c/0x3c [8161d3c6] schedule+0x73/0x75 [8161d56b] io_schedule+0x60/0x7a [810fdc3d] sleep_on_page+0xe/0x12 [8161d7fd] __wait_on_bit+0x48/0x7a [810fdbdd] wait_on_page_bit+0x7a/0x7c [81084821] ? autoremove_wake_function+0x34/0x34 [810fef03] filemap_fdatawait_range+0x7e/0x126 [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4 [8122d7aa] ? btrfs_writepages+0x28/0x2a [811084c6] ? do_writepages+0x1e/0x2c [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57 [8124006f] btrfs_wait_ordered_range+0x6a/0x11a [8122fe01] btrfs_run_delalloc_work+0x27/0x69 [812508db] normal_work_helper+0xfe/0x240 [81065d7e] process_one_work+0x195/0x2d2 [81066020] worker_thread+0x136/0x205 [81065eea] ? process_scheduled_works+0x2f/0x2f [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
Thanks for the patch. Hopefully this will make it to the next 3.15.x kernel. I also went back to 3.14 anyway since the 'blocked for 120 seconds' look like another instance of deadlocks we've been discussing here. But just curious: [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 What should I be doing about this? Does it mean that I do have some kind of corruption/damage on my filesystem? Also, is it possible to have all these messages state which devid they occurred on? I don't even know which device I should be worrying about right now, and although I'm running scrub now, my understanding is that scrub doesn't actually look at FS structures and is likely to miss this anyway. Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On Thu, Jul 03, 2014 at 06:44:21AM -0700, Marc MERLIN wrote: Thanks for the patch. Hopefully this will make it to the next 3.15.x kernel. I also went back to 3.14 anyway since the 'blocked for 120 seconds' look like another instance of deadlocks we've been discussing here. But just curious: [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 What should I be doing about this? Does it mean that I do have some kind of corruption/damage on my filesystem? If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read the copy and repair the crc with the good one, it's all we can do about it. Also, is it possible to have all these messages state which devid they occurred on? I don't even know which device I should be worrying about right now, and although I'm running scrub now, my understanding is that scrub doesn't actually look at FS structures and is likely to miss this anyway. Yes we can but it'd need a bit more effort, for now, all device msg we've seen in panic info comes from sb-s_id which points to @fs_info-latest_device. thanks, -liubo Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On Fri, Jul 04, 2014 at 11:07:22AM +0800, Liu Bo wrote: [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 What should I be doing about this? Does it mean that I do have some kind of corruption/damage on my filesystem? If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read the copy and repair the crc with the good one, it's all we can do about it. Right. It's not quite my question though. I mean I don't know what device it's on, never mind what file is affected. If I know which file is corrupted, I can simply delete it and restore from backup, no biggie. Right now I don't even know which one of my 3 btrfs filesystems (over 10TB) has this problem. That makes the message kind of problematic: you have a problem, but not I'm not giving you any fighting chance of finding out where :) Also, is it possible to have all these messages state which devid they occurred on? I don't even know which device I should be worrying about right now, and although I'm running scrub now, my understanding is that scrub doesn't actually look at FS structures and is likely to miss this anyway. Yes we can but it'd need a bit more effort, for now, all device msg we've seen in panic info comes from sb-s_id which points to @fs_info-latest_device. Food for though, as is the message is unfortunately close to useless, except to an FS developer with a system that has only one btrfs filesystem. On Fri, Jul 04, 2014 at 11:50:25AM +0800, Wang Shilong wrote: I am afraid, scrub maybe could not fix such kind of errors, all scrub doing is to verify whether checksums match and if possible use good mirrors to rewrite bad one. I wouldn't be bothered if scrub can't fix it, but it would be good if it could tell me. Such errors seem imply contention itself is corrupted, we may have passed checksum check after ending io, but we fail generation check afterwards. So should I really replace scrub with find / -type f -print0 | xargs grep . /dev/null ? Basically we need something that will scan the filesystem and ensure that all files are reachable correctly without causing filesystem problems, and if one is bad, output the name of the bad file(s). Scrub only does a half job of that it seems. To get physical device name, we still need mirror num to know which device we are locating. Ok, so it's missing for now and therefore the code can't easily report it, I understand. Well, I explained the problem, ext4 and others of course tell me which devid an error is on, hopefully btrfs will able to do so in the near future. Back to the original problem, would you agree that find / -type f -print0 | xargs grep . /dev/nul? may do a better job scanning the entire FS for problems than scrub would? Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On 07/04/2014 12:11 PM, Marc MERLIN wrote: On Fri, Jul 04, 2014 at 11:07:22AM +0800, Liu Bo wrote: [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 What should I be doing about this? Does it mean that I do have some kind of corruption/damage on my filesystem? If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read the copy and repair the crc with the good one, it's all we can do about it. Right. It's not quite my question though. I mean I don't know what device it's on, never mind what file is affected. If I know which file is corrupted, I can simply delete it and restore from backup, no biggie. Right now I don't even know which one of my 3 btrfs filesystems (over 10TB) has this problem. That makes the message kind of problematic: you have a problem, but not I'm not giving you any fighting chance of finding out where :) Also, is it possible to have all these messages state which devid they occurred on? I don't even know which device I should be worrying about right now, and although I'm running scrub now, my understanding is that scrub doesn't actually look at FS structures and is likely to miss this anyway. Yes we can but it'd need a bit more effort, for now, all device msg we've seen in panic info comes from sb-s_id which points to @fs_info-latest_device. Food for though, as is the message is unfortunately close to useless, except to an FS developer with a system that has only one btrfs filesystem. On Fri, Jul 04, 2014 at 11:50:25AM +0800, Wang Shilong wrote: I am afraid, scrub maybe could not fix such kind of errors, all scrub doing is to verify whether checksums match and if possible use good mirrors to rewrite bad one. I wouldn't be bothered if scrub can't fix it, but it would be good if it could tell me. Such errors seem imply contention itself is corrupted, we may have passed checksum check after ending io, but we fail generation check afterwards. So should I really replace scrub with find / -type f -print0 | xargs grep . /dev/null ? Basically we need something that will scan the filesystem and ensure that all files are reachable correctly without causing filesystem problems, and if one is bad, output the name of the bad file(s). Scrub only does a half job of that it seems. To get physical device name, we still need mirror num to know which device we are locating. Ok, so it's missing for now and therefore the code can't easily report it, I understand. Well, I explained the problem, ext4 and others of course tell me which devid an error is on, hopefully btrfs will able to do so in the near future. So it is ok for you to print one of btrfs filesystem device(for example device name) ? maybe it is not really physical address the metadata locates in, this is easier. Back to the original problem, would you agree that find / -type f -print0 | xargs grep . /dev/nul? may do a better job scanning the entire FS for problems than scrub would? Thanks, Marc -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269
On 07/04/2014 11:07 AM, Liu Bo wrote: On Thu, Jul 03, 2014 at 06:44:21AM -0700, Marc MERLIN wrote: Thanks for the patch. Hopefully this will make it to the next 3.15.x kernel. I also went back to 3.14 anyway since the 'blocked for 120 seconds' look like another instance of deadlocks we've been discussing here. But just curious: [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 What should I be doing about this? Does it mean that I do have some kind of corruption/damage on my filesystem? If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read the copy and repair the crc with the good one, it's all we can do about it. Also, is it possible to have all these messages state which devid they occurred on? I don't even know which device I should be worrying about right now, and although I'm running scrub now, my understanding is that scrub doesn't actually look at FS structures and is likely to miss this anyway. Yes we can but it'd need a bit more effort, for now, all device msg we've seen in panic info comes from sb-s_id which points to @fs_info-latest_device. You means something like this: + printk_ratelimited(BTRFS (device: %s) parent transid verify failed on %llu wanted %llu found %llu\n, + eb-fs_info-sb-s_id, eb-start, + parent_transid, btrfs_header_generation(eb)); thanks, -liubo Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 . -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
3.15.1: kernel BUG at fs/btrfs/locking.c:269
This got triggered by an rsync I think. I'm not sure which of my btrfs FS has the issue yet since BUG_ON isn't very helpful as discussed earlier. [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 found 18120 [160562.950297] [ cut here ] [160562.965904] kernel BUG at fs/btrfs/locking.c:269! But shouldn't messages like 'parent transid verify failed' print which device this happened on to give the operator a hint on where the problem is? Could someone do a pass at those and make sure they all print the device ID/name? Bug below: Full log before the crash: INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task kworker/u8:13:13157 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. kworker/u8:13 D 0 13157 2 0x0080 Workqueue: btrfs-flush_delalloc normal_work_helper 8800041cfc00 0046 8800041cfbd0 8800041cffd8 8800034f40d0 000141c0 88021f2941c0 8800034f40d0 8800041cfca0 810fdc2f 0002 8800041cfc10 Call Trace: [810fdc2f] ? wait_on_page_read+0x3c/0x3c [8161d3c6] schedule+0x73/0x75 [8161d56b] io_schedule+0x60/0x7a [810fdc3d] sleep_on_page+0xe/0x12 [8161d7fd] __wait_on_bit+0x48/0x7a [810fdbdd] wait_on_page_bit+0x7a/0x7c [81084821] ? autoremove_wake_function+0x34/0x34 [810fef03] filemap_fdatawait_range+0x7e/0x126 [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4 [8122d7aa] ? btrfs_writepages+0x28/0x2a [811084c6] ? do_writepages+0x1e/0x2c [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57 [8124006f] btrfs_wait_ordered_range+0x6a/0x11a [8122fe01] btrfs_run_delalloc_work+0x27/0x69 [812508db] normal_work_helper+0xfe/0x240 [81065d7e] process_one_work+0x195/0x2d2 [81066020] worker_thread+0x136/0x205 [81065eea] ? process_scheduled_works+0x2f/0x2f [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ? __kthread_parkme+0x61/0x61 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds. Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1 echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. btrfs-transacti D 0 3358 2 0x 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8 8800c6914390 000141c0 7fff 8801433b8f10 0002 8161c9b0 7fff 8800c50ebc60 Call Trace: [8161c9b0] ? sock_rps_reset_flow+0x32/0x32 [8161d3c6] schedule+0x73/0x75 [8161c9e9] schedule_timeout+0x39/0x129 [8107653d] ? get_parent_ip+0xd/0x3c [8162338f] ? preempt_count_add+0x7a/0x8d [8161dbac] __wait_for_common+0x11a/0x159 [8107810f] ? wake_up_state+0x12/0x12 [8161dc0f] wait_for_completion+0x24/0x26 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0 [81227b41] transaction_kthread+0xf8/0x1ab [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c [8106b4b4] kthread+0xae/0xb6 [8106b406] ? __kthread_parkme+0x61/0x61 [8162667c] ret_from_fork+0x7c/0xb0 [8106b406] ?