Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-04 Thread Marc MERLIN
On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote:
 Well, I explained the problem, ext4 and others of course tell me which 
 devid
 an error is on, hopefully btrfs will able to do so in the near future.

 So it is ok for you to print one of btrfs filesystem device(for example
 device name) ? maybe it is not really physical address the metadata
 locates in, this is easier.

Yes, the device name is great, now I can see which of my 3 filesystems has a
problem, that's a start.
Next would be knowing which filename this occurred in, but I understand this
would be harder to get from that point in the code.
Ideally scrub should be able to find that problem and report it, at least I
would know which filesystem to rescan for errors:

 Back to the original problem, would you agree that
 find / -type f -print0 | xargs grep . /dev/nul?

I'll also have to try this to see if I get lucky with it :)

 +   printk_ratelimited(BTRFS (device: %s) parent transid verify 
 failed on %llu wanted %llu found %llu\n,
 +   eb-fs_info-sb-s_id, eb-start,
 +   parent_transid, btrfs_header_generation(eb));

That looks great. Ideally all such errors would look like this.

Thanks for looking into this, I appreciate it.

Best,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-04 Thread Wang Shilong

On 07/04/2014 02:02 PM, Marc MERLIN wrote:

On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote:

Well, I explained the problem, ext4 and others of course tell me which
devid
an error is on, hopefully btrfs will able to do so in the near future.

So it is ok for you to print one of btrfs filesystem device(for example
device name) ? maybe it is not really physical address the metadata
locates in, this is easier.

Yes, the device name is great, now I can see which of my 3 filesystems has a
problem, that's a start.
Next would be knowing which filename this occurred in, but I understand this
would be harder to get from that point in the code.
Ideally scrub should be able to find that problem and report it, at least I
would know which filesystem to rescan for errors:


Back to the original problem, would you agree that
find / -type f -print0 | xargs grep . /dev/nul?

I'll also have to try this to see if I get lucky with it :)


+   printk_ratelimited(BTRFS (device: %s) parent transid verify
failed on %llu wanted %llu found %llu\n,
+   eb-fs_info-sb-s_id, eb-start,
+   parent_transid, btrfs_header_generation(eb));

That looks great. Ideally all such errors would look like this.
OK,  i found there is a mirror num recorded  in struct @extent_buffer, 
so it is not diffcult
to locate the real physical address that this corrupt metadata block 
locates.


But another question is that if such problems happen, it should also 
help little, because

usually other mirrors maybe have the same errors...

Looking at btrfs codes, if it found such error which means checksum has 
passed but generation
verification fails it  dosen't even try other mirrors if existed, so i 
supposed such kinds of errors

is fatal and could be corrected by btrfs itself..

Anyway, let me see if there are any other output even dose not output 
btrfs specific info.


Thanks,


Thanks for looking into this, I appreciate it.

Best,
Marc


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-04 Thread Wang Shilong

On 07/04/2014 02:02 PM, Marc MERLIN wrote:

On Fri, Jul 04, 2014 at 01:29:29PM +0800, Wang Shilong wrote:

Well, I explained the problem, ext4 and others of course tell me which
devid
an error is on, hopefully btrfs will able to do so in the near future.

So it is ok for you to print one of btrfs filesystem device(for example
device name) ? maybe it is not really physical address the metadata
locates in, this is easier.

Yes, the device name is great, now I can see which of my 3 filesystems has a
problem, that's a start.
Next would be knowing which filename this occurred in, but I understand this
would be harder to get from that point in the code.
Ideally scrub should be able to find that problem and report it, at least I
would know which filesystem to rescan for errors:
So there is a problem, ususally such generation verifications errors is 
related to
Btrfs metdata block.it is maybe just a Btrfs node , not related to any 
actually files,

or even a leaf that contains more that one file

If this is a read/write path from normal fs/file root, we may output 
its' root..but if this is

something like extent root...i think it helps little




Back to the original problem, would you agree that
find / -type f -print0 | xargs grep . /dev/nul?

I'll also have to try this to see if I get lucky with it :)


+   printk_ratelimited(BTRFS (device: %s) parent transid verify
failed on %llu wanted %llu found %llu\n,
+   eb-fs_info-sb-s_id, eb-start,
+   parent_transid, btrfs_header_generation(eb));

That looks great. Ideally all such errors would look like this.

Thanks for looking into this, I appreciate it.

Best,
Marc


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-04 Thread Marc MERLIN
On Fri, Jul 04, 2014 at 02:12:36PM +0800, Wang Shilong wrote:
 OK,  i found there is a mirror num recorded  in struct
 @extent_buffer, so it is not diffcult
 to locate the real physical address that this corrupt metadata block
 locates.
 
 But another question is that if such problems happen, it should also
 help little, because
 usually other mirrors maybe have the same errors...
 
I understand, although at least knowing which filesystem has the
problem, whether it got auto corrected, or not, is good.

Imagine if I get 500 of those errors in syslog. Even if btrfs can auto
correct them, as an admin, I would want to know this because it means
something likely bad happened and I may want to inspect that filesystem
for damage, or the hardware for faults.
So the extra output is very useful, thank you.

 Anyway, let me see if there are any other output even dose not
 output btrfs specific info.

Thank you very much.

On Fri, Jul 04, 2014 at 05:59:06PM +0800, Wang Shilong wrote:
 Marc argued that if there are several btrfs filesystems mounted,
 while users even don't know which filesystem hit the corrupted
 errors something like generation verification failure.
 
 Since @extent_buffer structure has a member @fs_info, let's output
 btrfs device info.

I don't know the btrfs code much, but looking at the patch, it looks
good, thank you for that.

Marc
 
 Reported-by: Marc MERLIN m...@merlins.org
 Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
 ---
 Tell me if missed anything here...
 ---
  fs/btrfs/disk-io.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)
 
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 08e65e9..bd52ca0 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -351,9 +351,9 @@ static int verify_parent_transid(struct extent_io_tree 
 *io_tree,
   ret = 0;
   goto out;
   }
 - printk_ratelimited(parent transid verify failed on %llu wanted %llu 
 -found %llu\n,
 -eb-start, parent_transid, btrfs_header_generation(eb));
 + printk_ratelimited(BTRFS (device %s): parent transid verify failed on 
 %llu wanted %llu found %llu\n,
 + eb-fs_info-sb-s_id, eb-start,
 + parent_transid, btrfs_header_generation(eb));
   ret = 1;
  
   /*
 @@ -617,15 +617,15 @@ static int btree_readpage_end_io_hook(struct 
 btrfs_io_bio *io_bio,
  
   found_start = btrfs_header_bytenr(eb);
   if (found_start != eb-start) {
 - printk_ratelimited(KERN_INFO BTRFS: bad tree block start 
 + printk_ratelimited(KERN_INFO BTRFS (device %s): bad tree block 
 start 
  %llu %llu\n,
 -found_start, eb-start);
 +eb-fs_info-sb-s_id, found_start, eb-start);
   ret = -EIO;
   goto err;
   }
   if (check_tree_block_fsid(root, eb)) {
 - printk_ratelimited(KERN_INFO BTRFS: bad fsid on block %llu\n,
 -eb-start);
 + printk_ratelimited(KERN_INFO BTRFS (device %s): bad fsid on 
 block %llu\n,
 +eb-fs_info-sb-s_id, eb-start);
   ret = -EIO;
   goto err;
   }
 -- 
 1.8.3.1
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Duncan
Marc MERLIN posted on Wed, 02 Jul 2014 13:41:52 -0700 as excerpted:

 This got triggered by an rsync I think. I'm not sure which of my btrfs
 FS has the issue yet since BUG_ON isn't very helpful as discussed
 earlier.
 
 [160562.925463] parent transid verify failed on 2776298520576
 wanted 41015 found 18120
 [160562.950297] [ cut here ]
 [160562.965904] kernel BUG at fs/btrfs/locking.c:269!
 
 But shouldn't messages like 'parent transid verify failed' print which
 device this happened on to give the operator a hint on where the problem
 is?
 
 Could someone do a pass at those and make sure they all print the device
 ID/name?

kernel 3.16 series here, rc2+ when this happened, rc3+ now.  IOW it's 
3.15 series (Marc) and 3.16 series (me) both known affected.

FWIW, I'm not sure what originally triggered it, but I recently had a 
couple bad shutdowns -- systemd would say it was turning off the system 
but it wouldn't shut off, and on manual shutoff and later reboot, the 
read-write-mounted btrfs (two separate filesystems, home and log, root, 
including the rest of the system, is read-only by default, here) failed 
to mount.

The mount failures triggered on the above parent transid failed errors 
with kernel BUG at fs/btrs/locking.c -- I /believe/ 269, but I couldn't 
swear to it.

The biggest difference here (other than the fact that it happened on 
mount of critical filesystems at boot, so probably double-digit seconds 
in since boot at most) was that the parent transid numbers were only a 
few off, something like wanted , found -2.

[End of technical stuff.  The rest is discussion of my recovery 
experience, both because it might be of help to others and because it 
lets me tell my experience. =:^) ]

I have backups but they weren't as current as I would have liked, so I 
decided to try recovery.

My rootfs is btrfs as well, a /separate/ btrfs, but it remains mounted 
read-only by default and is only mounted read-write for updates, so 
wasn't damaged.  That includes the bits of /var that I can get away with 
being read-only, with various /var/lib/* subdirs being symlinks to
/home/var/lib/* subdirs, where they need to be writable, and of course 
/var/log being a separate dedicated filesystem -- one of the two that was 
damaged, and the usual /var/run and /var/lock symlinks to /run and 
/run/lock (with /run being a tmpfs mount on standard systemd 
configurations).  As a result, the rootfs mounted from the initramfs and 
systemd on it was invoked as the new PID 1 init in the transfer from 
initramfs.  Systemd was in turn able to start early boot services and 
anything that didn't have a dependency on home (including the bits of /
var/lib symlinked into home) or log being mounted.  But that of course 
left a number of critical services failing due to dependency on the home 
and/or log mounts, since those mounts were failing.

Fortunately, while some of the time the errors would trigger a full 
kernel lockup with the above parent transid and locking BUG, other times 
the mount attempt would simply error out, and systemd would drop me to 
the emergency-mode root-login prompt.  (If it hadn't, I'd have had to 
switch to booting the backup.)

Since the main rootfs including /usr, /etc and much of /var was already 
mounted and safely read-only so I wasn't too worried about damaging it, 
that left me with only a partly working system, but access to all the 
normal recovery tools, manpages, etc, I'd normally have.  The only big 
thing (other than X/kde, of course) initially was network access, due to 
dependencies on the unmountable filesystems for local DNS and I think 
iptables logging.  I could have reconfigured that if I had to, but after 
I got log back up, I found I had network access (presumably with fallback 
to the ISP DNS), and was able to get to the wiki to research recovery of 
home a bit further.

I decided to tackle log (/var/log) before home since it was smaller and I 
figured I could use anything I learned in that process to help me save 
more of home.  My policy is no backup log partition, since I don't do 
backups regularly enough for the logs thereon to be of much likely 
usefulness.  That left me trying to repair or recover what I could and 
then doing a mkfs.btrfs on it.  The various repair options I tried didn't 
help -- they either died without helpful output or triggered the same 
lockup.  Mount with the recovery or recovery,ro wouldn't work, and 
neither would btrfs check.

Btrfs rescue didn't look useful, as I couldn't find useful documentation 
on chunk-recover and the supers looked fine (btrfs-show-super) so super-
recover was unnecessary.  I tried btrfs-zero-log on the log partition, 
but it didn't make the problem better and might have made it worse, so 
didn't try it on home.

That left btrfs restore.  I used it on log without really understanding 
what I was doing, and lost an entire directory worth of logs.
=:^(  Fortunately, I was able to learn a bit, 

Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Liu Bo
On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote:
 This got triggered by an rsync I think. I'm not sure which of my btrfs FS
 has the issue yet since BUG_ON isn't very helpful as discussed earlier.
 
 [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
 found 18120
 [160562.950297] [ cut here ]
 [160562.965904] kernel BUG at fs/btrfs/locking.c:269!
 
 But shouldn't messages like 'parent transid verify failed' print which device 
 this happened on to give the operator a hint on where the problem is?
 
 Could someone do a pass at those and make sure they all print the device
 ID/name?
 
 Bug below:
 
 Full log before the crash:
 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
   Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
 btrfs-transacti D  0  3358  2 0x
  8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
  8800c6914390 000141c0 7fff 8801433b8f10
  0002 8161c9b0 7fff 8800c50ebc60
 Call Trace:
  [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
  [8161d3c6] schedule+0x73/0x75
  [8161c9e9] schedule_timeout+0x39/0x129
  [8107653d] ? get_parent_ip+0xd/0x3c
  [8162338f] ? preempt_count_add+0x7a/0x8d
  [8161dbac] __wait_for_common+0x11a/0x159
  [8107810f] ? wake_up_state+0x12/0x12
  [8161dc0f] wait_for_completion+0x24/0x26
  [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
  [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
  [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
  [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
  [81227b41] transaction_kthread+0xf8/0x1ab
  [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
  [8106b4b4] kthread+0xae/0xb6
  [8106b406] ? __kthread_parkme+0x61/0x61
  [8162667c] ret_from_fork+0x7c/0xb0
  [8106b406] ? __kthread_parkme+0x61/0x61
 INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
   Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
 kworker/u8:13   D  0 13157  2 0x0080
 Workqueue: btrfs-flush_delalloc normal_work_helper
  8800041cfc00 0046 8800041cfbd0 8800041cffd8
  8800034f40d0 000141c0 88021f2941c0 8800034f40d0
  8800041cfca0 810fdc2f 0002 8800041cfc10
 Call Trace:
  [810fdc2f] ? wait_on_page_read+0x3c/0x3c
  [8161d3c6] schedule+0x73/0x75
  [8161d56b] io_schedule+0x60/0x7a
  [810fdc3d] sleep_on_page+0xe/0x12
  [8161d7fd] __wait_on_bit+0x48/0x7a
  [810fdbdd] wait_on_page_bit+0x7a/0x7c
  [81084821] ? autoremove_wake_function+0x34/0x34
  [810fef03] filemap_fdatawait_range+0x7e/0x126
  [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4
  [8122d7aa] ? btrfs_writepages+0x28/0x2a
  [811084c6] ? do_writepages+0x1e/0x2c
  [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57
  [8124006f] btrfs_wait_ordered_range+0x6a/0x11a
  [8122fe01] btrfs_run_delalloc_work+0x27/0x69
  [812508db] normal_work_helper+0xfe/0x240
  [81065d7e] process_one_work+0x195/0x2d2
  [81066020] worker_thread+0x136/0x205
  [81065eea] ? process_scheduled_works+0x2f/0x2f
  [8106b4b4] kthread+0xae/0xb6
  [8106b406] ? __kthread_parkme+0x61/0x61
  [8162667c] ret_from_fork+0x7c/0xb0
  [8106b406] ? __kthread_parkme+0x61/0x61
 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
   Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
 btrfs-transacti D  0  3358  2 0x
  8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
  8800c6914390 000141c0 7fff 8801433b8f10
  0002 8161c9b0 7fff 8800c50ebc60
 Call Trace:
  [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
  [8161d3c6] schedule+0x73/0x75
  [8161c9e9] schedule_timeout+0x39/0x129
  [8107653d] ? get_parent_ip+0xd/0x3c
  [8162338f] ? preempt_count_add+0x7a/0x8d
  [8161dbac] __wait_for_common+0x11a/0x159
  [8107810f] ? wake_up_state+0x12/0x12
  [8161dc0f] wait_for_completion+0x24/0x26
  [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
  [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
  [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
  [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
  [81227b41] transaction_kthread+0xf8/0x1ab
  [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
  

Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Wang Shilong

On 07/03/2014 04:13 PM, Liu Bo wrote:

On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote:

This got triggered by an rsync I think. I'm not sure which of my btrfs FS
has the issue yet since BUG_ON isn't very helpful as discussed earlier.

[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
found 18120
[160562.950297] [ cut here ]
[160562.965904] kernel BUG at fs/btrfs/locking.c:269!

But shouldn't messages like 'parent transid verify failed' print which device
this happened on to give the operator a hint on where the problem is?

Could someone do a pass at those and make sure they all print the device
ID/name?

Bug below:

Full log before the crash:
INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
   Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
btrfs-transacti D  0  3358  2 0x
  8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
  8800c6914390 000141c0 7fff 8801433b8f10
  0002 8161c9b0 7fff 8800c50ebc60
Call Trace:
  [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
  [8161d3c6] schedule+0x73/0x75
  [8161c9e9] schedule_timeout+0x39/0x129
  [8107653d] ? get_parent_ip+0xd/0x3c
  [8162338f] ? preempt_count_add+0x7a/0x8d
  [8161dbac] __wait_for_common+0x11a/0x159
  [8107810f] ? wake_up_state+0x12/0x12
  [8161dc0f] wait_for_completion+0x24/0x26
  [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
  [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
  [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
  [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
  [81227b41] transaction_kthread+0xf8/0x1ab
  [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
  [8106b4b4] kthread+0xae/0xb6
  [8106b406] ? __kthread_parkme+0x61/0x61
  [8162667c] ret_from_fork+0x7c/0xb0
  [8106b406] ? __kthread_parkme+0x61/0x61
INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
   Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
kworker/u8:13   D  0 13157  2 0x0080
Workqueue: btrfs-flush_delalloc normal_work_helper
  8800041cfc00 0046 8800041cfbd0 8800041cffd8
  8800034f40d0 000141c0 88021f2941c0 8800034f40d0
  8800041cfca0 810fdc2f 0002 8800041cfc10
Call Trace:
  [810fdc2f] ? wait_on_page_read+0x3c/0x3c
  [8161d3c6] schedule+0x73/0x75
  [8161d56b] io_schedule+0x60/0x7a
  [810fdc3d] sleep_on_page+0xe/0x12
  [8161d7fd] __wait_on_bit+0x48/0x7a
  [810fdbdd] wait_on_page_bit+0x7a/0x7c
  [81084821] ? autoremove_wake_function+0x34/0x34
  [810fef03] filemap_fdatawait_range+0x7e/0x126
  [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4
  [8122d7aa] ? btrfs_writepages+0x28/0x2a
  [811084c6] ? do_writepages+0x1e/0x2c
  [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57
  [8124006f] btrfs_wait_ordered_range+0x6a/0x11a
  [8122fe01] btrfs_run_delalloc_work+0x27/0x69
  [812508db] normal_work_helper+0xfe/0x240
  [81065d7e] process_one_work+0x195/0x2d2
  [81066020] worker_thread+0x136/0x205
  [81065eea] ? process_scheduled_works+0x2f/0x2f
  [8106b4b4] kthread+0xae/0xb6
  [8106b406] ? __kthread_parkme+0x61/0x61
  [8162667c] ret_from_fork+0x7c/0xb0
  [8106b406] ? __kthread_parkme+0x61/0x61
INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
   Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
btrfs-transacti D  0  3358  2 0x
  8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
  8800c6914390 000141c0 7fff 8801433b8f10
  0002 8161c9b0 7fff 8800c50ebc60
Call Trace:
  [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
  [8161d3c6] schedule+0x73/0x75
  [8161c9e9] schedule_timeout+0x39/0x129
  [8107653d] ? get_parent_ip+0xd/0x3c
  [8162338f] ? preempt_count_add+0x7a/0x8d
  [8161dbac] __wait_for_common+0x11a/0x159
  [8107810f] ? wake_up_state+0x12/0x12
  [8161dc0f] wait_for_completion+0x24/0x26
  [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
  [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
  [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
  [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
  [81227b41] transaction_kthread+0xf8/0x1ab
  [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c

Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Liu Bo
On Thu, Jul 03, 2014 at 04:20:47PM +0800, Wang Shilong wrote:
 On 07/03/2014 04:13 PM, Liu Bo wrote:
 On Wed, Jul 02, 2014 at 01:41:52PM -0700, Marc MERLIN wrote:
 This got triggered by an rsync I think. I'm not sure which of my btrfs FS
 has the issue yet since BUG_ON isn't very helpful as discussed earlier.
 
 [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
 found 18120
 [160562.950297] [ cut here ]
 [160562.965904] kernel BUG at fs/btrfs/locking.c:269!
 
 But shouldn't messages like 'parent transid verify failed' print which 
 device
 this happened on to give the operator a hint on where the problem is?
 
 Could someone do a pass at those and make sure they all print the device
 ID/name?
 
 Bug below:
 
 Full log before the crash:
 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
 btrfs-transacti D  0  3358  2 0x
   8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
   8800c6914390 000141c0 7fff 8801433b8f10
   0002 8161c9b0 7fff 8800c50ebc60
 Call Trace:
   [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
   [8161d3c6] schedule+0x73/0x75
   [8161c9e9] schedule_timeout+0x39/0x129
   [8107653d] ? get_parent_ip+0xd/0x3c
   [8162338f] ? preempt_count_add+0x7a/0x8d
   [8161dbac] __wait_for_common+0x11a/0x159
   [8107810f] ? wake_up_state+0x12/0x12
   [8161dc0f] wait_for_completion+0x24/0x26
   [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
   [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
   [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
   [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
   [81227b41] transaction_kthread+0xf8/0x1ab
   [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
   [8106b4b4] kthread+0xae/0xb6
   [8106b406] ? __kthread_parkme+0x61/0x61
   [8162667c] ret_from_fork+0x7c/0xb0
   [8106b406] ? __kthread_parkme+0x61/0x61
 INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
 kworker/u8:13   D  0 13157  2 0x0080
 Workqueue: btrfs-flush_delalloc normal_work_helper
   8800041cfc00 0046 8800041cfbd0 8800041cffd8
   8800034f40d0 000141c0 88021f2941c0 8800034f40d0
   8800041cfca0 810fdc2f 0002 8800041cfc10
 Call Trace:
   [810fdc2f] ? wait_on_page_read+0x3c/0x3c
   [8161d3c6] schedule+0x73/0x75
   [8161d56b] io_schedule+0x60/0x7a
   [810fdc3d] sleep_on_page+0xe/0x12
   [8161d7fd] __wait_on_bit+0x48/0x7a
   [810fdbdd] wait_on_page_bit+0x7a/0x7c
   [81084821] ? autoremove_wake_function+0x34/0x34
   [810fef03] filemap_fdatawait_range+0x7e/0x126
   [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4
   [8122d7aa] ? btrfs_writepages+0x28/0x2a
   [811084c6] ? do_writepages+0x1e/0x2c
   [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57
   [8124006f] btrfs_wait_ordered_range+0x6a/0x11a
   [8122fe01] btrfs_run_delalloc_work+0x27/0x69
   [812508db] normal_work_helper+0xfe/0x240
   [81065d7e] process_one_work+0x195/0x2d2
   [81066020] worker_thread+0x136/0x205
   [81065eea] ? process_scheduled_works+0x2f/0x2f
   [8106b4b4] kthread+0xae/0xb6
   [8106b406] ? __kthread_parkme+0x61/0x61
   [8162667c] ret_from_fork+0x7c/0xb0
   [8106b406] ? __kthread_parkme+0x61/0x61
 INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
 echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
 btrfs-transacti D  0  3358  2 0x
   8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
   8800c6914390 000141c0 7fff 8801433b8f10
   0002 8161c9b0 7fff 8800c50ebc60
 Call Trace:
   [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
   [8161d3c6] schedule+0x73/0x75
   [8161c9e9] schedule_timeout+0x39/0x129
   [8107653d] ? get_parent_ip+0xd/0x3c
   [8162338f] ? preempt_count_add+0x7a/0x8d
   [8161dbac] __wait_for_common+0x11a/0x159
   [8107810f] ? wake_up_state+0x12/0x12
   [8161dc0f] wait_for_completion+0x24/0x26
   [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
   [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
   [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
   

Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Marc MERLIN
Thanks for the patch. Hopefully this will make it to the next 3.15.x
kernel.

I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
like another instance of deadlocks we've been discussing here.

But just curious:

 [160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
 found 18120

What should I be doing about this?
Does it mean that I do have some kind of corruption/damage on my
filesystem?

Also, is it possible to have all these messages state which devid they
occurred on? I don't even know which device I should be worrying about
right now, and although I'm running scrub now, my understanding is that
scrub doesn't actually look at FS structures and is likely to miss this
anyway.

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Liu Bo
On Thu, Jul 03, 2014 at 06:44:21AM -0700, Marc MERLIN wrote:
 Thanks for the patch. Hopefully this will make it to the next 3.15.x
 kernel.
 
 I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
 like another instance of deadlocks we've been discussing here.
 
 But just curious:
 
  [160562.925463] parent transid verify failed on 2776298520576 wanted 
  41015 found 18120
 
 What should I be doing about this?
 Does it mean that I do have some kind of corruption/damage on my
 filesystem?
 

If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
the copy and repair the crc with the good one, it's all we can do about it.

 Also, is it possible to have all these messages state which devid they
 occurred on? I don't even know which device I should be worrying about
 right now, and although I'm running scrub now, my understanding is that
 scrub doesn't actually look at FS structures and is likely to miss this
 anyway.

Yes we can but it'd need a bit more effort, for now, all device msg we've seen
in panic info comes from sb-s_id which points to @fs_info-latest_device.

thanks,
-liubo

 
 Thanks,
 Marc
 -- 
 A mouse is a device used to point at the xterm you want to type in - A.S.R.
 Microsoft is to operating systems 
    what McDonalds is to gourmet 
 cooking
 Home page: http://marc.merlins.org/ | PGP 
 1024R/763BE901
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Marc MERLIN
On Fri, Jul 04, 2014 at 11:07:22AM +0800, Liu Bo wrote:
   [160562.925463] parent transid verify failed on 2776298520576 wanted 
   41015 found 18120
  
  What should I be doing about this?
  Does it mean that I do have some kind of corruption/damage on my
  filesystem?
  
 If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
 the copy and repair the crc with the good one, it's all we can do about it.

Right. It's not quite my question though.
I mean I don't know what device it's on, never mind what file is affected.
If I know which file is corrupted, I can simply delete it and restore from
backup, no biggie.
Right now I don't even know which one of my 3 btrfs filesystems (over 10TB)
has this problem. That makes the message kind of problematic: you have a
problem, but not I'm not giving you any fighting chance of finding out
where :)
 
  Also, is it possible to have all these messages state which devid they
  occurred on? I don't even know which device I should be worrying about
  right now, and although I'm running scrub now, my understanding is that
  scrub doesn't actually look at FS structures and is likely to miss this
  anyway.
 
 Yes we can but it'd need a bit more effort, for now, all device msg we've seen
 in panic info comes from sb-s_id which points to @fs_info-latest_device.

Food for though, as is the message is unfortunately close to useless, except
to an FS developer with a system that has only one btrfs filesystem.

On Fri, Jul 04, 2014 at 11:50:25AM +0800, Wang Shilong wrote:
 I am afraid, scrub maybe could not fix such kind of errors, all scrub
 doing is to verify whether checksums match and if possible use good
 mirrors to rewrite bad one.

I wouldn't be bothered if scrub can't fix it, but it would be good if it
could tell me.
 
 Such errors seem imply contention itself is corrupted, we may have passed
 checksum check after ending io, but we fail generation check afterwards.
 
So should I really replace scrub with
find / -type f -print0 | xargs grep . /dev/null ?

Basically we need something that will scan the filesystem and ensure that
all files are reachable correctly without causing filesystem problems, and
if one is bad, output the name of the bad file(s).
Scrub only does a half job of that it seems.

 To get physical device name, we still need mirror num to know which device
 we are locating.

Ok, so it's missing for now and therefore the code can't easily report it,
I understand.

Well, I explained the problem, ext4 and others of course tell me which devid
an error is on, hopefully btrfs will able to do so in the near future.

Back to the original problem, would you agree that 
find / -type f -print0 | xargs grep . /dev/nul?
may do a better job scanning the entire FS for problems than scrub would?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Wang Shilong

On 07/04/2014 12:11 PM, Marc MERLIN wrote:

On Fri, Jul 04, 2014 at 11:07:22AM +0800, Liu Bo wrote:

[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
found 18120

What should I be doing about this?
Does it mean that I do have some kind of corruption/damage on my
filesystem?


If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
the copy and repair the crc with the good one, it's all we can do about it.

Right. It's not quite my question though.
I mean I don't know what device it's on, never mind what file is affected.
If I know which file is corrupted, I can simply delete it and restore from
backup, no biggie.
Right now I don't even know which one of my 3 btrfs filesystems (over 10TB)
has this problem. That makes the message kind of problematic: you have a
problem, but not I'm not giving you any fighting chance of finding out
where :)
  

Also, is it possible to have all these messages state which devid they
occurred on? I don't even know which device I should be worrying about
right now, and although I'm running scrub now, my understanding is that
scrub doesn't actually look at FS structures and is likely to miss this
anyway.

Yes we can but it'd need a bit more effort, for now, all device msg we've seen
in panic info comes from sb-s_id which points to @fs_info-latest_device.

Food for though, as is the message is unfortunately close to useless, except
to an FS developer with a system that has only one btrfs filesystem.

On Fri, Jul 04, 2014 at 11:50:25AM +0800, Wang Shilong wrote:

I am afraid, scrub maybe could not fix such kind of errors, all scrub
doing is to verify whether checksums match and if possible use good
mirrors to rewrite bad one.

I wouldn't be bothered if scrub can't fix it, but it would be good if it
could tell me.
  

Such errors seem imply contention itself is corrupted, we may have passed
checksum check after ending io, but we fail generation check afterwards.
  
So should I really replace scrub with

find / -type f -print0 | xargs grep . /dev/null ?

Basically we need something that will scan the filesystem and ensure that
all files are reachable correctly without causing filesystem problems, and
if one is bad, output the name of the bad file(s).
Scrub only does a half job of that it seems.


To get physical device name, we still need mirror num to know which device
we are locating.

Ok, so it's missing for now and therefore the code can't easily report it,
I understand.

Well, I explained the problem, ext4 and others of course tell me which devid
an error is on, hopefully btrfs will able to do so in the near future.
So it is ok for you to print one of btrfs filesystem device(for example 
device name)  ?
maybe it is not really  physical address the metadata locates in, this 
is easier.





Back to the original problem, would you agree that
find / -type f -print0 | xargs grep . /dev/nul?
may do a better job scanning the entire FS for problems than scrub would?

Thanks,
Marc


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-03 Thread Wang Shilong

On 07/04/2014 11:07 AM, Liu Bo wrote:

On Thu, Jul 03, 2014 at 06:44:21AM -0700, Marc MERLIN wrote:

Thanks for the patch. Hopefully this will make it to the next 3.15.x
kernel.

I also went back to 3.14 anyway since the 'blocked for 120 seconds' look
like another instance of deadlocks we've been discussing here.

But just curious:


[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
found 18120

What should I be doing about this?
Does it mean that I do have some kind of corruption/damage on my
filesystem?


If there is another copy for the block(RAID1, DUP, RAID5/6), it'd try to read
the copy and repair the crc with the good one, it's all we can do about it.


Also, is it possible to have all these messages state which devid they
occurred on? I don't even know which device I should be worrying about
right now, and although I'm running scrub now, my understanding is that
scrub doesn't actually look at FS structures and is likely to miss this
anyway.

Yes we can but it'd need a bit more effort, for now, all device msg we've seen
in panic info comes from sb-s_id which points to @fs_info-latest_device.

You means something like this:

+   printk_ratelimited(BTRFS (device: %s) parent transid verify 
failed on %llu wanted %llu found %llu\n,

+   eb-fs_info-sb-s_id, eb-start,
+   parent_transid, btrfs_header_generation(eb));




thanks,
-liubo


Thanks,
Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
    what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901

.



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


3.15.1: kernel BUG at fs/btrfs/locking.c:269

2014-07-02 Thread Marc MERLIN
This got triggered by an rsync I think. I'm not sure which of my btrfs FS
has the issue yet since BUG_ON isn't very helpful as discussed earlier.

[160562.925463] parent transid verify failed on 2776298520576 wanted 41015 
found 18120
[160562.950297] [ cut here ]
[160562.965904] kernel BUG at fs/btrfs/locking.c:269!

But shouldn't messages like 'parent transid verify failed' print which device 
this happened on to give the operator a hint on where the problem is?

Could someone do a pass at those and make sure they all print the device
ID/name?

Bug below:

Full log before the crash:
INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
  Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
btrfs-transacti D  0  3358  2 0x
 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
 8800c6914390 000141c0 7fff 8801433b8f10
 0002 8161c9b0 7fff 8800c50ebc60
Call Trace:
 [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
 [8161d3c6] schedule+0x73/0x75
 [8161c9e9] schedule_timeout+0x39/0x129
 [8107653d] ? get_parent_ip+0xd/0x3c
 [8162338f] ? preempt_count_add+0x7a/0x8d
 [8161dbac] __wait_for_common+0x11a/0x159
 [8107810f] ? wake_up_state+0x12/0x12
 [8161dc0f] wait_for_completion+0x24/0x26
 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
 [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
 [81227b41] transaction_kthread+0xf8/0x1ab
 [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
 [8106b4b4] kthread+0xae/0xb6
 [8106b406] ? __kthread_parkme+0x61/0x61
 [8162667c] ret_from_fork+0x7c/0xb0
 [8106b406] ? __kthread_parkme+0x61/0x61
INFO: task kworker/u8:13:13157 blocked for more than 120 seconds.
  Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
kworker/u8:13   D  0 13157  2 0x0080
Workqueue: btrfs-flush_delalloc normal_work_helper
 8800041cfc00 0046 8800041cfbd0 8800041cffd8
 8800034f40d0 000141c0 88021f2941c0 8800034f40d0
 8800041cfca0 810fdc2f 0002 8800041cfc10
Call Trace:
 [810fdc2f] ? wait_on_page_read+0x3c/0x3c
 [8161d3c6] schedule+0x73/0x75
 [8161d56b] io_schedule+0x60/0x7a
 [810fdc3d] sleep_on_page+0xe/0x12
 [8161d7fd] __wait_on_bit+0x48/0x7a
 [810fdbdd] wait_on_page_bit+0x7a/0x7c
 [81084821] ? autoremove_wake_function+0x34/0x34
 [810fef03] filemap_fdatawait_range+0x7e/0x126
 [8122f1cf] ? btrfs_submit_direct+0x3f4/0x3f4
 [8122d7aa] ? btrfs_writepages+0x28/0x2a
 [811084c6] ? do_writepages+0x1e/0x2c
 [810ff38e] ? __filemap_fdatawrite_range+0x55/0x57
 [8124006f] btrfs_wait_ordered_range+0x6a/0x11a
 [8122fe01] btrfs_run_delalloc_work+0x27/0x69
 [812508db] normal_work_helper+0xfe/0x240
 [81065d7e] process_one_work+0x195/0x2d2
 [81066020] worker_thread+0x136/0x205
 [81065eea] ? process_scheduled_works+0x2f/0x2f
 [8106b4b4] kthread+0xae/0xb6
 [8106b406] ? __kthread_parkme+0x61/0x61
 [8162667c] ret_from_fork+0x7c/0xb0
 [8106b406] ? __kthread_parkme+0x61/0x61
INFO: task btrfs-transacti:3358 blocked for more than 120 seconds.
  Not tainted 3.15.1-amd64-i915-preempt-20140216jbp #1
echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this message.
btrfs-transacti D  0  3358  2 0x
 8800c50ebc50 0046 8800c50ebc20 8800c50ebfd8
 8800c6914390 000141c0 7fff 8801433b8f10
 0002 8161c9b0 7fff 8800c50ebc60
Call Trace:
 [8161c9b0] ? sock_rps_reset_flow+0x32/0x32
 [8161d3c6] schedule+0x73/0x75
 [8161c9e9] schedule_timeout+0x39/0x129
 [8107653d] ? get_parent_ip+0xd/0x3c
 [8162338f] ? preempt_count_add+0x7a/0x8d
 [8161dbac] __wait_for_common+0x11a/0x159
 [8107810f] ? wake_up_state+0x12/0x12
 [8161dc0f] wait_for_completion+0x24/0x26
 [81237ce6] btrfs_wait_and_free_delalloc_work+0x16/0x28
 [8123fd3a] btrfs_run_ordered_operations+0x1e7/0x21e
 [81229aa4] btrfs_flush_all_pending_stuffs+0x4e/0x55
 [8122b25a] btrfs_commit_transaction+0x20d/0x8b0
 [81227b41] transaction_kthread+0xf8/0x1ab
 [81227a49] ? btrfs_cleanup_transaction+0x44c/0x44c
 [8106b4b4] kthread+0xae/0xb6
 [8106b406] ? __kthread_parkme+0x61/0x61
 [8162667c] ret_from_fork+0x7c/0xb0
 [8106b406] ?