Re: Rebalancing RAID1

2013-02-22 Thread Fredrik Tolf

On Mon, 18 Feb 2013, Stefan Behrens wrote:

On Fri, 15 Feb 2013 22:56:19 +0100 (CET), Fredrik Tolf wrote:

The oops cut can be found here:
http://www.dolda2000.com/~fredrik/tmp/btrfs-oops


This scrub issue is fixed since Linux 3.8-rc1 with commit
4ded4f6 Btrfs: fix BUG() in scrub when first superblock reading gives EIO


I see, thanks!

Rebooting the system did get me running again, allowing me to remove the 
missing device from the filesystem. However, I encountered a couple of 
somewhat strange happenings as I did that. I don't know if they're 
considered bugs or not, but I thought I had best report them.


To begin with, the act of removing the missing device from the filesystem 
itself caused the resynchronization to the new device to happen in 
blocking mode, so the btrfs device delete missing operation took about a 
day to finish. My expectation had been that the device removal would be a 
fast operation and that I would have had to scrub the filesystem or 
something in order to resynchronize, but I can see how this could be 
intended behavior.


However, what's weirder is that while the resynchronization was underway, 
I couldn't mount subvolumes on other mountpoints. The mount commands 
blocked (disk-slept) until the entire synchronization was done, and I 
don't think this was intended behavior, because I had the kernel saying 
the following while it happened:


Feb 16 06:01:27 nerv kernel: [ 3482.512106] INFO: task mount:3525 blocked for more than 120 seconds.
Feb 16 06:01:28 nerv kernel: [ 3482.518484] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Feb 16 06:01:28 nerv kernel: [ 3482.526324] mount           D ffff88003e220e40     0  3525   3524 0x00000000
Feb 16 06:01:28 nerv kernel: [ 3482.533587]  ffff88003e220e40 0000000000000082 ffffffffa0067470 ffff88003e2300c0
Feb 16 06:01:28 nerv kernel: [ 3482.541088]  0000000000013b40 ffff88001126dfd8 0000000000013b40 ffff88001126dfd8
Feb 16 06:01:28 nerv kernel: [ 3482.548584]  0000000000013b40 ffff88003e220e40 0000000000013b40 ffff88001126c010
Feb 16 06:01:28 nerv kernel: [ 3482.556280] Call Trace:
Feb 16 06:01:28 nerv kernel: [ 3482.558776]  [<ffffffff81396132>] ? __mutex_lock_common+0x10d/0x175
Feb 16 06:01:28 nerv kernel: [ 3482.565078]  [<ffffffff81396260>] ? mutex_lock+0x1a/0x2c
Feb 16 06:01:28 nerv kernel: [ 3482.570661]  [<ffffffffa05a38c2>] ? btrfs_scan_one_device+0x40/0x133 [btrfs]
Feb 16 06:01:28 nerv kernel: [ 3482.577752]  [<ffffffffa0564e8b>] ? btrfs_mount+0x1c4/0x4d8 [btrfs]
Feb 16 06:01:28 nerv kernel: [ 3482.584080]  [<ffffffff810e56cb>] ? pcpu_next_pop+0x37/0x43
Feb 16 06:01:28 nerv kernel: [ 3482.589709]  [<ffffffff810e52c0>] ? cpumask_next+0x18/0x1a
Feb 16 06:01:28 nerv kernel: [ 3482.595226]  [<ffffffff811012aa>] ? alloc_pages_current+0xbb/0xd8
Feb 16 06:01:28 nerv kernel: [ 3482.601345]  [<ffffffff81113778>] ? mount_fs+0x6c/0x149
Feb 16 06:01:28 nerv kernel: [ 3482.606595]  [<ffffffff811291f7>] ? vfs_kern_mount+0x67/0xdd
Feb 16 06:01:28 nerv kernel: [ 3482.612292]  [<ffffffffa056516b>] ? btrfs_mount+0x4a4/0x4d8 [btrfs]
Feb 16 06:01:28 nerv kernel: [ 3482.618673]  [<ffffffff810e52c0>] ? cpumask_next+0x18/0x1a
Feb 16 06:01:28 nerv kernel: [ 3482.624178]  [<ffffffff811012aa>] ? alloc_pages_current+0xbb/0xd8
Feb 16 06:01:28 nerv kernel: [ 3482.630347]  [<ffffffff81113778>] ? mount_fs+0x6c/0x149
Feb 16 06:01:28 nerv kernel: [ 3482.635580]  [<ffffffff811291f7>] ? vfs_kern_mount+0x67/0xdd
Feb 16 06:01:28 nerv kernel: [ 3482.641258]  [<ffffffff811292e0>] ? do_kern_mount+0x49/0xd6
Feb 16 06:01:29 nerv kernel: [ 3482.646855]  [<ffffffff81129a98>] ? do_mount+0x72b/0x791
Feb 16 06:01:29 nerv kernel: [ 3482.652186]  [<ffffffff81129b86>] ? sys_mount+0x88/0xc3
Feb 16 06:01:29 nerv kernel: [ 3482.657464]  [<ffffffff8139d229>] ? system_call_fastpath+0x16/0x1b

Furthermore, it struck me that the consequences of having to mount a 
filesystem with missing devices with -o degraded can be a bit strange. I 
realize what the intention of the behavior is, of course, but I think it 
might cause quite some difficulties when trying to mount a degraded btrfs 
filesystem as root on a system that you don't have physical access to, 
like a hosted server, because it might be hard to manipulate the boot 
process so as to pass that mount flag to the initrd. Note that this is not 
a problem with md-raid; it will simply assemble its arrays in degraded 
mode automatically, without intervention. I'm not necessarily saying 
that's better, but I thought I should bring up the point.
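
(If the initrd honors the standard rootflags= kernel parameter, the flag 
can at least be injected from the bootloader; a rough sketch for GRUB, 
where the root= value is obviously installation-specific:)

  linux /vmlinuz root=/dev/sda1 ro rootflags=degraded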


--

Fredrik Tolf


Re: Rebalancing RAID1

2013-02-18 Thread Stefan Behrens
On Fri, 15 Feb 2013 22:56:19 +0100 (CET), Fredrik Tolf wrote:
 The oops cut can be found here:
 http://www.dolda2000.com/~fredrik/tmp/btrfs-oops

This scrub issue is fixed since Linux 3.8-rc1 with commit
4ded4f6 Btrfs: fix BUG() in scrub when first superblock reading gives EIO



Re: Rebalancing RAID1

2013-02-15 Thread Martin Steigerwald
Am Freitag, 15. Februar 2013 schrieb Fredrik Tolf:
 On Thu, 14 Feb 2013, Martin Steigerwald wrote:
[…]
  I'd restart the machine, see that BTRFS is using both devices again and
  then try the balance again.
 
 I mentioned it in another mail, but I'd very much prefer not to do that.
 I'd like to try and solve this as I normally should when a drive fails.

Well, if Hugo's solution of unmounting the FS and running btrfs dev scan 
does not work, then I see my suggestion to reboot as making sense.
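
As a sketch, with the mount point used in this thread, that sequence 
would be:

  # umount /mnt
  # btrfs device scan
  # mount /dev/sde1 /mnt
  # btrfs scrub start /mnt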

I have nothing more to add to my analysis.

If any BTRFS developer or expert knows another solution that works during 
runtime of the system, feel free :)

Either way, I think a kernel bug is involved here. And I think I remember 
having seen something like this during a balance attempt myself already, but 
it was just a test BTRFS and I was not sure of it.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Rebalancing RAID1

2013-02-15 Thread Fredrik Tolf

On Fri, 15 Feb 2013, Martin Steigerwald wrote:

Either way, I think a kernel bug is involved here.


Well, *some* kernel bug is certainly involved. :)

I did wipe the filesystem off the device and reinserted it as a new device 
into the filesystem. After that, btrfs fi show gave me the following:



$ sudo ./btrfs fi show
Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
Total devices 3 FS bytes used 2.66TB
devid3 size 2.73TB used 0.00 path /dev/sdi1
devid2 size 2.73TB used 2.67TB path /dev/sde1
*** Some devices missing


I then proceeded to try to remove the missing devices with btrfs dev del 
missing /mnt, but it made no difference whatever, with the kernel saying 
the following:


Feb 15 07:12:29 nerv kernel: [262110.799823] btrfs: no missing devices found to 
remove

This seems odd, seeing as how btrfs fi show says there are missing devices 
while the kernel contradicts that.


Either way, I tried to start a scrub on the filesystem, too, seeing if 
that would make a difference, but that oopsed the kernel. :)


The oops cut can be found here: 
http://www.dolda2000.com/~fredrik/tmp/btrfs-oops

So with that, I'm certainly going to reboot the machine. :)

--

Fredrik Tolf


Re: Rebalancing RAID1

2013-02-14 Thread Chris Murphy

On Feb 13, 2013, at 11:42 PM, Fredrik Tolf fred...@dolda2000.com wrote:

 It's worth noting that I still haven't un- and remounted the filesystem since 
 the drive disconnected. 

I suggest capturing the current dmesg, rebooting, and seeing if the btrfs 
volume will mount read-only without complaints in dmesg.

Also, is a virtual machine being used in any of this, either as host or guest?

Chris Murphy



Re: Rebalancing RAID1

2013-02-14 Thread Hugo Mills
On Thu, Feb 14, 2013 at 01:41:04AM -0700, Chris Murphy wrote:
 
 On Feb 14, 2013, at 12:58 AM, Fredrik Tolf fred...@dolda2000.com wrote:
  
  Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O 
  error on /dev/sdd1
 
 Well, someone else might comment on what that is exactly, I'm not getting 
 conclusive google hits on this. Sometimes it's fixed by going to a newer 
 kernel. Sometimes it's bad hardware. But it's apparently not a btrfs error. 
 But it's causing subsequent errors which are btrfs errors. So whatever it is, 
 it seems like btrfs doesn't like it.
 
 
  Feb 14 08:32:30 nerv kernel: [180511.764690] btrfs: bdev /dev/sdd1 errs: wr 
  288650, rd 26, flush 1, corrupt 0, gen 0
 
 So there continue to be write errors. Unsurprising as sdd1 seems to be 
 dropping pages.
 
  
  Scrubbing does not balance the volume. Based on the information you 
  supplied I don't really see the reason for a rebalance.
  
  Maybe my terminology is wrong again, then, because I do see a reason to get 
  the data properly replicated across the drives, which it doesn't seem to be 
  now. That's what I meant by rebalancing.
 
 How much data was copied to the drives? I'm continuously confused by how 
 btrfs reports data usage. What I have is this from fi show and fi df:
 
 Data, RAID1: total=2.66TB, used=2.66TB

   This is the amount of actual useful data (i.e. what you see with du
or ls -l). Double this (because it's RAID-1) to get the number of
bytes of raw storage used.

 Total devices 2 FS bytes used 1.64TB
 devid1 size 2.73TB used 1.64TB path /dev/sdi1
 devid2 size 2.73TB used 2.67TB path /dev/sde1

   This is the amount of raw disk space allocated. The total of used
here should add up to twice the total values above (for
Data+Metadata+System).

 So I can't tell if it's ~1.64TB copied or 2.6TB.

   Looks like /dev/sdi1 isn't actually being written to -- it should
be the same allocation as /dev/sde1.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Alert status mauve ocelot: Slight chance of brimstone. Be ---   
   prepared to make a nice cup of tea.   




Re: Rebalancing RAID1

2013-02-14 Thread Martin Steigerwald
Am Mittwoch, 13. Februar 2013 schrieb Fredrik Tolf:
 Dear list,

Hi Fredrik,

 I'm sorry if this is a dumb n3wb question, but I couldn't find anything 
 about it, so please bear with me.
 
 I just decided to try BtrFS for the first time, to replace an old ReiserFS 
 data partition currently on a mdadm mirror. To do so, I'm using two 3 TB 
 disks that were initially detected as sdd and sde, on which I have a 
 single large GPT partition, so the devices I'm using for btrfs are sdd1 
 and sde1.
 
 I created a filesystem on them using RAID1 from the start (mkfs.btrfs -d 
 raid -m raid1 /dev/sd{d,e}1), and started copying the data from the old 
 partition onto it during the night. As it happened, I immediately got 
 reason to try out BtrFS recovery because sometime during the copying 
 operation /dev/sdd had some kind of cable failure and was removed from the 
 system. A while later, however, it was apparently auto-redetected, this 
 time as /dev/sdi, and BtrFS seems to have inserted it back into the 
 filesystem somehow.
 
 The current situation looks like this:
 
  $ sudo ./btrfs fi show
  Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
  Total devices 2 FS bytes used 1.64TB
  devid1 size 2.73TB used 1.64TB path /dev/sdi1
  devid2 size 2.73TB used 2.67TB path /dev/sde1
  
  Btrfs v0.20-rc1-56-g6cd836d
 
 As you can see, /dev/sdi1 has much less space used, which I can only 
 assume is because extents weren't allocated on it while it was off-line. 
 I'm now trying to remedy this, but I'm not sure if I'm doing it right.
 
 What I'm doing is to run btrfs fi bal start /mnt , and it gives me a 
 ton of kernel messages that look like this:
 
 Feb 12 22:57:16 nerv kernel: [59596.948464] btrfs: relocating block group 
 2879804932096 flags 17
 Feb 12 22:57:45 nerv kernel: [59626.618280] btrfs_end_buffer_write_sync: 8 
 callbacks suppressed
 Feb 12 22:57:45 nerv kernel: [59626.621893] lost page write due to I/O error 
 on /dev/sdd1
 Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs_dev_stat_print_on_error: 8 
 callbacks suppressed
 Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs: bdev /dev/sdd1 errs: wr 
 66339, rd 26, flush 1, corrupt 0, gen 0
 Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error 
 on /dev/sdd1
 [Lots of the above, and occasionally a couple of lines like these]
 Feb 12 22:57:48 nerv kernel: [59629.569278] btrfs: found 46 extents
 Feb 12 22:57:50 nerv kernel: [59631.685067] btrfs_dev_stat_print_on_error: 5 
 callbacks suppressed
[…]
 Also, why does it say that the errors are occurring on /dev/sdd1? Is it 
 just remembering the whole filesystem by that name since that's how I 
 mounted it, or is it still trying to access the old removed instance of 
 that disk and is that, then, why it's giving all these errors?

You started the balance after the above btrfs fi show command?

Then it's obvious to me:

For some reason BTRFS is still trying to write to /dev/sdd, which isn't
there anymore. That perfectly explains those lost page writes for me. If
that is the case, this seems to me like a serious bug in BTRFS.

Also Hugo's observation points in that direction. At first, I would take
those log messages literally. 

There is a chance that BTRFS still displays /dev/sdd while actually writing
to /dev/sdi, but I doubt it. I think it's possible to find this out by
using iostat -x 1 or atop or something like that. And if it does write to
the correct device file, I think it makes sense to update and fix those
log messages.
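
A sketch of that check (assuming the sysstat package for iostat):

  $ iostat -x 1
  # then watch which of the sde/sdi rows shows write activity,
  # and whether any sdd row shows activity at all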

I'd restart the machine, see that BTRFS is using both devices again and
then try the balance again.

I'd do this while still having a backup on the ReiserFS volume or another
backup drive. After this I'd do a btrfs scrub start to see whether BTRFS
is happy with all the data on the drives.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


Re: Rebalancing RAID1

2013-02-14 Thread Chris Murphy

On Feb 14, 2013, at 1:59 AM, Hugo Mills h...@carfax.org.uk wrote:
 
 Data, RAID1: total=2.66TB, used=2.66TB
 
   This is the amount of actual useful data (i.e. what you see with du
 or ls -l). Double this (because it's RAID-1) to get the number of
 bytes or raw storage used.

Right, the decoder ring. Effectively no outsiders will understand this. It 
contradicts the behavior of conventional df with btrfs volumes. And it becomes 
untenable with per subvolume profiles.


 Total devices 2 FS bytes used 1.64TB
 devid1 size 2.73TB used 1.64TB path /dev/sdi1
 devid2 size 2.73TB used 2.67TB path /dev/sde1
 
   This is the amount of raw disk space allocated. The total of used
 here should add up to twice the total values above (for
 Data+Metadata+System).

I'm mostly complaining about the first line. If 2.67TB of writes to sde1 are 
successful enough to be stated as used on that device, then FS bytes used 
should be at least 2.67TB.

 
 So I can't tell if it's ~1.64TB copied or 2.6TB.
 
   Looks like /dev/sdi1 isn't actually being written to -- it should
 be the same allocation as /dev/sde1.

Yeah he's getting a lot of these, and I don't know what it is:

 Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error 
 on /dev/sdd1

It's not tied to btrfs or libata so I don't think it's the drive itself 
reporting the write error. I think maybe the kernel has become confused as a 
result of the original ICRC ABRT, and the subsequent change from sdd to sdi. 

Chris Murphy


Re: Rebalancing RAID1

2013-02-14 Thread Chris Murphy

On Feb 14, 2013, at 7:44 AM, Martin Steigerwald mar...@lichtvoll.de wrote:

 For some reason BTRFS is still trying to write to /dev/sdd, which isn't
 there anymore. That perfectly explains those lost page writes for me. If
 that is the case, this seems to me like a serious bug in BTRFS.

Following the ICRC ABRT error, /dev/sdd becomes /dev/sdi. Btrfs-progs 
recognizes this by only listing /dev/sdi and /dev/sde as devices in the volume. 
But the btrfs kernel space code continues to try to write to /dev/sdd, while 
/dev/sdi isn't getting any writes (at least, it's not filling up with data).

Btrfs kernel space code is apparently unaware that /dev/sdd is gone. That seems 
to be the primary problem.

A question is, if the kernel space code was aware of a member device vanishing 
and then reappearing, whether as the same or a different block device 
designation, should it automatically re-add the device to the volume? Upon 
being re-added, it would be out of sync, leading to a follow-up question about 
whether it should be auto-scrubbed to fix this. And yet another follow-up 
question: does the file system metadata contain information that can be used 
similarly to the md write-intent bitmap, reducing the time to catch the drive 
up and avoiding a full scrub?


Chris Murphy


Re: Rebalancing RAID1

2013-02-14 Thread Hugo Mills
On Thu, Feb 14, 2013 at 11:05:39AM -0700, Chris Murphy wrote:
 
 On Feb 14, 2013, at 1:59 AM, Hugo Mills h...@carfax.org.uk wrote:
  
  Data, RAID1: total=2.66TB, used=2.66TB
  
This is the amount of actual useful data (i.e. what you see with du
  or ls -l). Double this (because it's RAID-1) to get the number of
  bytes or raw storage used.
 
 Right, the decoder ring. Effectively no outsiders will understand
 this. It contradicts the behavior of conventional df with btrfs
 volumes. And it becomes untenable with per subvolume profiles.

   Correct, but *all* other single-value (or small-number-of-values)
displays of space usage fail in similar ways. We've(*) had this
discussion out on this mailing list many times before. All simple
displays of disk usage will cause someone to misinterpret something at
some point, and get cross.

(*) For non-you values of we.

   If you want a display of raw bytes used/free, then someone will
complain that they had 20GB free, wrote a 10GB file, and it's all
gone. If you want a display of usable data used/free, then we can't
predict the free part. There is no single set of values that will
make this simple.

  Total devices 2 FS bytes used 1.64TB
  devid1 size 2.73TB used 1.64TB path /dev/sdi1
  devid2 size 2.73TB used 2.67TB path /dev/sde1
  
This is the amount of raw disk space allocated. The total of used
  here should add up to twice the total values above (for
  Data+Metadata+System).
 
 I'm mostly complaining about the first line. If 2.67TB of writes to sde1 are 
 successful enough to be stated as used on that device, then FS bytes used 
 should be at least 2.67TB.

   The values shown above are for bytes *allocated* -- i.e. the
total values shown in btrfs fi df. You haven't added in the
metadata, which I'm willing to bet is another 100 GiB or so allocated
space, bringing you up to the 2.67 TiB.

   (There's another problem with this display, which is that it's
actually showing TiB, not TB. There have been patches for this, but I
don't know if any are current).

  
  So I can't tell if it's ~1.64TB copied or 2.6TB.

   2.66 TiB. The 1.64TiB is clearly wrong, given all the other values.
Hence my conclusion below.

Looks like /dev/sdi1 isn't actually being written to -- it should
  be the same allocation as /dev/sde1.
 
 Yeah he's getting a lot of these, and I don't know what it is:
 
  Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O 
  error on /dev/sdd1
 
 It's not tied to btrfs or libata so I don't think it's the drive itself 
 reporting the write error. I think maybe the kernel has become confused as a 
 result of the original ICRC ABRT, and the subsequent change from sdd to sdi. 

   That would be my conclusion, too. But with the newly-appeared
/dev/sdi1, btrfs fi show picks it up as belonging to the FS (because
it's got the same UUID), but it's not been picked up by the kernel, so
the kernel's not trying to write to it, and it's therefore massively
out of date.

   I think the solution, if it's certain that the drive is now
behaving sensibly again, is one of:

 * unmount, btrfs dev scan, remount, scrub
or
 * btrfs dev delete missing, add /dev/sdi1 to the FS, and balance

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- I must be musical:  I've got *loads* of CDs ---   




Re: Rebalancing RAID1

2013-02-14 Thread Chris Murphy

On Feb 14, 2013, at 1:56 PM, Hugo Mills h...@carfax.org.uk wrote:

 
 
   Correct, but *all* other single-value (or small-number-of-values)
 displays of space usage fail in similar ways. We've(*) had this
 discussion out on this mailing list many times before. All simple
 displays of disk usage will cause someone to misinterpret something at
 some point, and get cross.

The decoder ring method causes misinterpretation.

I reject the premise that there isn't a way to at least be consistent, and to 
use switches for alternate presentations.

   If you want a display of raw bytes used/free, then someone will
 complain that they had 20GB free, wrote a 10GB file, and it's all
 gone. If you want a display of usable data used/free, then we can't
 predict the free part. There is no single set of values that will
 make this simple.

This is exactly how (conventional) df -h works now. And it causes exactly the 
problem you describe. The df -h size and available numbers are double those of 
btrfs fi df/show. Not OK. Not consistent. Either df needs to change (likely) or 
btrfs fi needs to change.

2x 80GB array, btrfs

/dev/sdb1        160G  112K  158G   1% /mnt

2x 80GB array, md raid1 xfs

/dev/md0          80G   33M   80G   1% /mnt

And I think it's (regular) df that needs to change the most. btrfs fi df 
contains 50% superfluous information as far as I can tell:

[root@f18v ~]# btrfs fi df /mnt
Data, RAID1: total=1.00GB, used=0.00
*Data: total=8.00MB, used=0.00
*System, RAID1: total=8.00MB, used=8.00KB
*System: total=4.00MB, used=0.00
Metadata, RAID1: total=1.00GB, used=48.00KB
*Metadata: total=8.00MB, used=0.00

The lines marked * convey zero useful information, as far as I can see. And fi show:

[root@f18v ~]# btrfs fi show
Label: 'hello'  uuid: d5517733-7c9f-458a-9e99-5b832b8776b2
Total devices 2 FS bytes used 56.00KB
devid2 size 80.00GB used 2.01GB path /dev/sdc
devid1 size 80.00GB used 2.03GB path /dev/sdb

I don't know why I should care about allocated chunks, but if that's what used 
means in this case, it should say that rather than used. I'm sort of annoyed 
that the same words, total and used, have different meanings depending on their 
position, without other qualifiers. It's like being in school when the teacher 
would get pissed at students who wouldn't specify units or label axes, and now 
I'm one of those types. What do these numbers mean? If I have to infer this, 
then they're obscure, so why should I care about them?

And one thing btrfs fi df doesn't indicate at all, which could be more useful 
than regular df (simply because df has no room for it), is a:

Free Space Estimate: min - max
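
As a sketch of what I mean, computed from the fi show numbers above (the 
exact accounting is my own assumption):

  raw unallocated  = 2 x 80GB - (2.01GB + 2.03GB)  ~ 156GB
  min (all RAID1)  = 156GB / 2                     ~  78GB
  max (all single) =                                 156GB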


   I think the solution, if it's certain that the drive is now
 behaving sensibly again, is one of:
 
 * unmount, btrfs dev scan, remount, scrub
 or
 * btrfs dev delete missing, add /dev/sdi1 to the FS, and balance

The 2nd won't work because user space tools don't consider there to be a 
missing device.

So back to the question of how btrfs should behave in such a case. md would 
have tossed the drive and, as far as I know, doesn't automatically re-add it if 
it reappears as either the same or a different block device. And when the user 
uses --re-add there's a resync.
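
(For comparison, that md flow is roughly the following; the array and 
member names are assumptions:)

  # mdadm /dev/md0 --re-add /dev/sdd1   # re-attach the returned disk; md resyncs it
  # cat /proc/mdstat                    # watch the resync progress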


Chris Murphy


Re: Rebalancing RAID1

2013-02-13 Thread Chris Murphy

On Feb 12, 2013, at 11:18 PM, Fredrik Tolf fred...@dolda2000.com wrote:
 
 
 smartctl -l scterc /dev/sdX
 
 Warning: device does not support SCT Error Recovery Control command
 
 Doesn't seem that way to me; partly because of the SMART data, and partly 
 because of the errors that were logged as the drive failed:
 
 Feb 12 16:36:49 nerv kernel: [36769.546522] ata6.00: Ata error. fis:0x21
 Feb 12 16:36:49 nerv kernel: [36769.550454] ata6: SError: { Handshk }
 Feb 12 16:36:51 nerv kernel: [36769.554129] ata6.00: failed command: WRITE 
 FPDMA QUEUED
 Feb 12 16:36:51 nerv kernel: [36769.559375] ata6.00: cmd 
 61/00:00:00:ec:2e/04:00:cd:00:00/40 tag 0 ncq 524288 out
 Feb 12 16:36:51 nerv kernel: [36769.559375]  res 
 41/84:d0:00:98:2e/84:00:cd:00:00/40 Emask 0x10 (ATA bus error)
 Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
 Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }
 
 That's not typical for actual media problems, in my experience. :)

Quite typical, because these drives don't support SCT ERC, which almost 
certainly means their error timeouts are well above that of the Linux SCSI 
layer, which is 30 seconds. Their timeouts are likely around 2 minutes. So in 
fact they never report back a URE, because the command timer times out and 
resets the drive.
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html
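
(A minimal sketch of the knob that document describes; the device name is 
an assumption:)

  # let the drive's own ECC retries finish before the kernel resets it
  echo 120 > /sys/block/sdd/device/timeout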

For your use case, I'd reject these drives and get WDC Red, or even, 
reportedly, the Hitachi Deskstars, which still have a settable SCT ERC. Set it 
to something like 70 deciseconds. Then if a drive's ECC hasn't recovered a 
sector in 7 seconds, it will give up and report a read error with the problem 
LBA. Either btrfs or md can then recover the data from the other drive and 
cause the read error to be fixed.
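
(Setting it, on drives that support it, looks roughly like this; the times 
are in deciseconds and /dev/sdX is a placeholder:)

  smartctl -l scterc,70,70 /dev/sdX   # 7s read / 7s write recovery limits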

However, in your case, with both the kernel message ICRC ABRT and the 
following SMART entry, this is your cable problem. The ICRC and UDMA_CRC errors 
are the same problem reported by the actors at each end of the cable.

/dev/hdi
Serial Number:    WD-WMC1T1679668
199 UDMA_CRC_Error_Count    0x0032   200   192   000    Old_age   Always       -       91


So the question is whether the cable problem has actually been fixed, and if 
you're still getting ICRC errors from the kernel. As this is hdi, I'm wondering 
how many drives are connected, and if this could be power induced rather than 
just cable induced. Once that's solved, you should do a scrub, rather than a 
rebalance.

Chris Murphy


Re: Rebalancing RAID1

2013-02-13 Thread Fredrik Tolf

On Wed, 13 Feb 2013, Chris Murphy wrote:

On Feb 12, 2013, at 11:18 PM, Fredrik Tolf fred...@dolda2000.com wrote:

That's not typical for actual media problems, in my experience. :)


Quite typical, because these drives don't support SCTERC which almost 
certainly means their error timeouts are well above that of the linux 
SCSI layer which is 30 seconds. Their timeouts are likely around 2 
minutes. So in fact they never report back a URE because the command 
timer times out and resets the drive.


That's interesting to read. I haven't ever actually experienced a hard drive 
failing to report a bad sector, though; and not for lack of experience with 
bad sectors.


Either way, though, with the assumption that it actually was a cable 
problem rather than bad medium...


However, in your case, with both the kernel message ICRC ABRT, and the 
following SMART entry, this is your cable problem.


... I'd still like to solve the problem as it is, so that I know what to 
do the next time I get some device error.


So the question is whether the cable problem has actually been fixed, 
and if you're still getting ICRC errors from the kernel.


I'm not getting any block-layer errors from the kernel. The errors I 
posted originally are the only ones I'm getting.


As this is hdi, I'm wondering how many drives are connected, and if this 
could be power induced rather than just cable induced.


With the general change, I actually decreased the number of drives in the 
system from 10 to 8, so unless the new drives are incredibly more 
power-hungry than the old ones, that shouldn't be a problem.



Once that's solved, you should do a scrub, rather than a rebalance.


Oh, will scrubbing actually rebalance the array? I was under the 
impression that it only checked for bad checksums.


I'm still wondering what those errors actually mean, though. I'm still 
getting them occasionally, even when I'm not rebalancing (just not as 
often). I'm also very curious about what it means that it's still 
complaining about sdd rather than sdi.


It's worth noting that I still haven't un- and remounted the filesystem 
since the drive disconnected. I assumed that I shouldn't need to and that 
the multiple-device layer of btrfs should handle the situation correctly. 
Is that assumption correct?


--

Fredrik Tolf


Re: Rebalancing RAID1

2013-02-13 Thread Fredrik Tolf

On Thu, 14 Feb 2013, Chris Murphy wrote:

So the question is whether the cable problem has actually been fixed, and if 
you're still getting ICRC errors from the kernel.


I'm not getting any block-layer errors from the kernel. The errors I posted 
originally are the only ones I'm getting.


Previously you reported:
Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }

These are not block errors. You should not proceed until you're certain this 
isn't still intermittently occurring.


Sorry for being unclear. By block-layer errors I meant hardware/driver 
errors, as opposed to filesystem errors, but I guess that's not the 
vernacular use of the term.


To try to be clearer, then:

I am not getting ICRC errors anymore, or any driver-related errors 
whatsoever. I was only getting them when sdd was originally lost, and have 
not been getting any of them since.


The errors I am currently getting, and the ones I was getting during the 
rebalance, are those I reported in the original mail; that is:


Feb 14 08:32:30 nerv kernel: [180511.760850] lost page write due to I/O error 
on /dev/sdd1
Feb 14 08:32:30 nerv kernel: [180511.764690] btrfs: bdev /dev/sdd1 errs: wr 
288650, rd 26, flush 1, corrupt 0, gen 0

I am only getting those messages from the kernel, and nothing else. 
Currently, those two messages are the only ones I'm getting at all (except 
with slightly different numeric parameters, of course); while I was trying 
to rebalance, I also got messages looking like this:


Feb 12 22:57:16 nerv kernel: [59596.948464] btrfs: relocating block group 
2879804932096 flags 17
Feb 12 22:57:45 nerv kernel: [59626.618280] btrfs_end_buffer_write_sync: 8 
callbacks suppressed
Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs_dev_stat_print_on_error: 8 
callbacks suppressed
Feb 12 22:57:48 nerv kernel: [59629.569278] btrfs: found 46 extents

I hope that clears it up.


Once that's solved, you should do a scrub, rather than a rebalance.


Oh, will scrubbing actually rebalance the array? I was under the impression 
that it only checked for bad checksums.


Scrubbing does not balance the volume. Based on the information you 
supplied I don't really see the reason for a rebalance.


Maybe my terminology is wrong again, then, because I do see a reason to 
get the data properly replicated across the drives, which it doesn't seem 
to be now. That's what I meant by rebalancing.


What you do next depends on what your goal is for this data, on these 
two disks, using btrfs. If the idea is to trust the data on the volume, 
then since you still have the source data, I'd mkfs.btrfs on the disks and 
start over. If the idea is to experiment and learn, you might want to do a 
btrfsck, followed by a scrub.
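
(A sketch of that sequence; btrfsck wants the filesystem unmounted, and 
the device and mount point names are assumptions:)

  # umount /mnt
  # btrfsck /dev/sde1        # read-only check unless --repair is given
  # mount /dev/sde1 /mnt
  # btrfs scrub start /mnt   # verify checksums, repairing from the good mirror
  # btrfs scrub status /mnt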


I'm still keeping the original data just in case, of course. However, my 
primary goal right now is to learn how to manage redundancy reliably with 
btrfs. I mean, with md, I can easily handle a device failure and fix it up 
without having to remount or reboot; and I've assumed that I should be 
able to do that with btrfs as well (please correct me if that assumption 
is invalid, though).


Btrfs is stable on stable hardware. Your hardware most definitely was 
not stable during a series of writes. So I'd say all bets are off. That 
doesn't mean it can't be fixed, but the very fact you're still getting 
errors indicates something is still wrong.


Isn't btrfs' RAID1 supposed to be stable as long as only one disk fails, 
though?



This:
Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on 
/dev/sdd1
Are not btrfs errors.


I see. I thought that was a btrfs error, but I was wrong then. Since I'm 
not actually getting any driver errors, though, and it's referring to sdd, 
doesn't that just mean, as I suspect, that btrfs is still trying to use 
the old defunct sdd instead of sdi, which is what the drive was named 
after it was redetected?



This:
Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }


Just to be overly redundant: I'm not getting those anymore, and I only 
ever got them before the drive was redetected as sdi.


--

Fredrik Tolf


Rebalancing RAID1

2013-02-12 Thread Fredrik Tolf

Dear list,

I'm sorry if this is a dumb n3wb question, but I couldn't find anything 
about it, so please bear with me.


I just decided to try BtrFS for the first time, to replace an old ReiserFS 
data partition currently on a mdadm mirror. To do so, I'm using two 3 TB 
disks that were initially detected as sdd and sde, on which I have a 
single large GPT partition, so the devices I'm using for btrfs are sdd1 
and sde1.


I created a filesystem on them using RAID1 from the start (mkfs.btrfs -d 
raid -m raid1 /dev/sd{d,e}1), and started copying the data from the old 
partition onto it during the night. As it happened, I immediately got 
reason to try out BtrFS recovery because sometime during the copying 
operation /dev/sdd had some kind of cable failure and was removed from the 
system. A while later, however, it was apparently auto-redetected, this 
time as /dev/sdi, and BtrFS seems to have inserted it back into the 
filesystem somehow.


The current situation looks like this:


$ sudo ./btrfs fi show
Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
Total devices 2 FS bytes used 1.64TB
devid1 size 2.73TB used 1.64TB path /dev/sdi1
devid2 size 2.73TB used 2.67TB path /dev/sde1

Btrfs v0.20-rc1-56-g6cd836d


As you can see, /dev/sdi1 has much less space used, which I can only 
assume is because extents weren't allocated on it while it was off-line. 
I'm now trying to remedy this, but I'm not sure if I'm doing it right.


What I'm doing is to run btrfs fi bal start /mnt , and it gives me a 
ton of kernel messages that look like this:


Feb 12 22:57:16 nerv kernel: [59596.948464] btrfs: relocating block group 
2879804932096 flags 17
Feb 12 22:57:45 nerv kernel: [59626.618280] btrfs_end_buffer_write_sync: 8 
callbacks suppressed
Feb 12 22:57:45 nerv kernel: [59626.621893] lost page write due to I/O error on 
/dev/sdd1
Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs_dev_stat_print_on_error: 8 
callbacks suppressed
Feb 12 22:57:45 nerv kernel: [59626.621893] btrfs: bdev /dev/sdd1 errs: wr 
66339, rd 26, flush 1, corrupt 0, gen 0
Feb 12 22:57:45 nerv kernel: [59626.644110] lost page write due to I/O error on 
/dev/sdd1
[Lots of the above, and occasionally a couple of lines like these]
Feb 12 22:57:48 nerv kernel: [59629.569278] btrfs: found 46 extents
Feb 12 22:57:50 nerv kernel: [59631.685067] btrfs_dev_stat_print_on_error: 5 
callbacks suppressed

This barrage of messages, combined with the fact that the rebalance is 
going quite slowly (btrfs fi bal stat indicates about 1 extent per minute, 
where an extent seems to be about 1 GB; several times slower than it took 
to copy the data onto the filesystem), leads me to think that something is 
wrong. Is it, or should I just wait 2 days for it to complete, ignoring 
the errors?
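
(For reference, the progress check I keep running is roughly this; the 
exact subcommand spelling seems to vary between btrfs-progs versions:)

  $ sudo ./btrfs balance status /mnt   # block groups relocated so far
  $ sudo ./btrfs fi df /mnt            # watch the per-profile totals move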


Also, why does it say that the errors are occurring on /dev/sdd1? Is it just 
remembering the whole filesystem by that name since that's how I mounted 
it, or is it still trying to access the old removed instance of that disk 
and is that, then, why it's giving all these errors?


Thanks for reading!

--

Fredrik Tolf


Re: Rebalancing RAID1

2013-02-12 Thread Chris Murphy

On Feb 12, 2013, at 4:01 PM, Fredrik Tolf fred...@dolda2000.com wrote:
 
 mkfs.btrfs -d raid -m raid1 /dev/sd{d,e}1

Is that a typo? -d raid isn't valid.

What do you get for:
btrfs fi df /mnt

Please report the result for each drive:
smartctl -a /dev/sdX
smartctl -l scterc /dev/sdX

 
 Also, why does it say that the errors are occurring on /dev/sdd1? Is it just 
 remembering the whole filesystem by that name since that's how I mounted it, 
 or is it still trying to access the old removed instance of that disk and is 
 that, then, why it's giving all these errors?

I suspect bad sectors at the moment. But it could be other things too. What 
kernel version?


Chris Murphy


Re: Rebalancing RAID1

2013-02-12 Thread Fredrik Tolf

On Tue, 12 Feb 2013, Chris Murphy wrote:


On Feb 12, 2013, at 4:01 PM, Fredrik Tolf fred...@dolda2000.com wrote:


mkfs.btrfs -d raid -m raid1 /dev/sd{d,e}1


Is that a typo? -d raid isn't valid.


Ah yes, sorry. That was a typo.


What do you get for:
btrfs fi df /mnt


$ sudo ./btrfs fi df /mnt
Data, RAID1: total=2.66TB, used=2.66TB
Data: total=8.00MB, used=0.00
System, RAID1: total=8.00MB, used=388.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=4.00GB, used=3.66GB
Metadata: total=8.00MB, used=0.00


Please report the result for each drive:
smartctl -a /dev/sdX


As they're a bit long for mail, see here:
http://www.dolda2000.com/~fredrik/tmp/smart-hde
http://www.dolda2000.com/~fredrik/tmp/smart-hdi

There's not a whole lot to see, though.


smartctl -l scterc /dev/sdX


Warning: device does not support SCT Error Recovery Control command


Also, why does it say that the errors are occurring on /dev/sdd1? Is it just 
remembering the whole filesystem by that name since that's how I mounted it, or 
is it still trying to access the old removed instance of that disk and is that, 
then, why it's giving all these errors?


I suspect bad sectors at the moment.


Doesn't seem that way to me; partly because of the SMART data, and partly 
because of the errors that were logged as the drive failed:


Feb 12 16:36:49 nerv kernel: [36769.546522] ata6.00: Ata error. fis:0x21
Feb 12 16:36:49 nerv kernel: [36769.550454] ata6: SError: { Handshk }
Feb 12 16:36:51 nerv kernel: [36769.554129] ata6.00: failed command: WRITE 
FPDMA QUEUED
Feb 12 16:36:51 nerv kernel: [36769.559375] ata6.00: cmd 
61/00:00:00:ec:2e/04:00:cd:00:00/40 tag 0 ncq 524288 out
Feb 12 16:36:51 nerv kernel: [36769.559375]  res 
41/84:d0:00:98:2e/84:00:cd:00:00/40 Emask 0x10 (ATA bus error)
Feb 12 16:36:51 nerv kernel: [36769.574831] ata6.00: status: { DRDY ERR }
Feb 12 16:36:52 nerv kernel: [36769.578867] ata6.00: error: { ICRC ABRT }

That's not typical for actual media problems, in my experience. :)

What kernel version?

Oh, sorry, it's 3.7.1. The system is otherwise a pretty much vanilla 
Debian Squeeze (current Stable) that I've just compiled a newer kernel 
(and btrfs-tools) for.


Thanks for replying!

--

Fredrik Tolf


Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread David Sterba
On Thu, Apr 19, 2012 at 05:42:05PM +0200, Marco L. Crociani wrote:
 Apr 19 17:38:41 evo kernel: [  347.661915] Call Trace:
 Apr 19 17:38:41 evo kernel: [  347.661964]  [<ffffffffa00b76ac>] btrfs_ioctl_dev_info+0x15c/0x1a0 [btrfs]
 Apr 19 17:38:41 evo kernel: [  347.662013]  [<ffffffffa00ba9b1>] btrfs_ioctl+0x571/0x6c0 [btrfs]
 Apr 19 17:38:41 evo kernel: [  347.662024]  [<ffffffff81193839>] do_vfs_ioctl+0x99/0x330
 Apr 19 17:38:41 evo kernel: [  347.662032]  [<ffffffff8118d345>] ? putname+0x35/0x50
 Apr 19 17:38:41 evo kernel: [  347.662040]  [<ffffffff81193b71>] sys_ioctl+0xa1/0xb0
 Apr 19 17:38:41 evo kernel: [  347.662049]  [<ffffffff816691a9>] system_call_fastpath+0x16/0x1b

Fixed by
http://comments.gmane.org/gmane.comp.file-systems.btrfs/16302

reported earlier
http://article.gmane.org/gmane.comp.file-systems.btrfs/16796

and it's part of 3.4-rc5.


david


Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread Marco L. Crociani
On Wed, May 2, 2012 at 4:54 PM, David Sterba d...@jikos.cz wrote:

 On Thu, Apr 19, 2012 at 05:42:05PM +0200, Marco L. Crociani wrote:
  Apr 19 17:38:41 evo kernel: [  347.661915] Call Trace:
  Apr 19 17:38:41 evo kernel: [  347.661964]  [<ffffffffa00b76ac>] btrfs_ioctl_dev_info+0x15c/0x1a0 [btrfs]
  Apr 19 17:38:41 evo kernel: [  347.662013]  [<ffffffffa00ba9b1>] btrfs_ioctl+0x571/0x6c0 [btrfs]
  Apr 19 17:38:41 evo kernel: [  347.662024]  [<ffffffff81193839>] do_vfs_ioctl+0x99/0x330
  Apr 19 17:38:41 evo kernel: [  347.662032]  [<ffffffff8118d345>] ? putname+0x35/0x50
  Apr 19 17:38:41 evo kernel: [  347.662040]  [<ffffffff81193b71>] sys_ioctl+0xa1/0xb0
  Apr 19 17:38:41 evo kernel: [  347.662049]  [<ffffffff816691a9>] system_call_fastpath+0x16/0x1b

 Fixed by
 http://comments.gmane.org/gmane.comp.file-systems.btrfs/16302

 reported earlier
 http://article.gmane.org/gmane.comp.file-systems.btrfs/16796

 and it's part of 3.4-rc5.


I was on 3.4-rc5!

--
Marco Lorenzo Crociani,
marco.croci...@gmail.com


Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread David Sterba
On Mon, Apr 30, 2012 at 03:01:04PM +0200, Marco L. Crociani wrote:
 ./btrfs device delete missing /mnt/sda3
 ERROR: error removing the device 'missing' - Input/output error
 
 
 Apr 30 13:17:57 evo kernel: [  108.866205] btrfs: allowing degraded mounts
 Apr 30 13:17:57 evo kernel: [  108.866214] btrfs: disk space caching is 
 enabled
 Apr 30 13:18:32 evo kernel: [  143.274899] btrfs: relocating block
 group 1401002393600 flags 17
 Apr 30 13:19:25 evo kernel: [  196.888248] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  196.889900] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  196.890429] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  197.087419] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  197.087681] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154

the failed checksums prevent removing the data from the device, and then
the removal fails with the above error.

 ./btrfs inspect-internal inode-resolve -v 257 /mnt/sda3/
 ioctl ret=-1, error: No such file or directory

So it's not a visible file, possibly a deleted yet uncleaned snapshot or
the space_cache (guessing from the inode number). But AFAICS the
checksums are turned off for the free space inode so ...

 ./btrfs scrub status /mnt/sda3/
 scrub status for c87975a0-a575-405e-9890-d3f7f25bbd96
   scrub started at Mon Apr 30 13:26:26 2012 and was aborted after 4367 
 seconds
   total bytes scrubbed: 406.64GB with 2 errors
   error details: csum=2
   corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

Shouldn't the csum errors be included under uncorrectable?

 Apr 30 14:37:24 evo kernel: [ 4875.275776] btrfs: checksum error at
 logical 752871157760 on dev /dev/sda3, sector 873795352, root 259,
 inode 1580389, offset 612610048, length 4096, links 1 (path:
^^^

so the scrub catches different checksum errors than appeared during
balance (inode 257).

 Apr 30 14:37:24 evo kernel: [ 4875.275838] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
 Apr 30 14:37:24 evo kernel: [ 4875.275848] IP: [<ffffffff811ae841>] bio_add_page+0x11/0x60
 Apr 30 14:37:24 evo kernel: [ 4875.276022] RIP: 0010:[<ffffffff811ae841>]  [<ffffffff811ae841>] bio_add_page+0x11/0x60

this looks like something disappeared out from under scrub

1045 BUG_ON(!page->page);
1046 bio = bio_alloc(GFP_NOFS, 1);
1047 if (!bio)
1048 return -EIO;
1049 bio->bi_bdev = page->bdev;
1050 bio->bi_sector = page->physical >> 9;
1051 bio->bi_end_io = scrub_complete_bio_end_io;
1052 bio->bi_private = &complete;

1054 ret = bio_add_page(bio, page->page, PAGE_SIZE, 0);
1055 if (PAGE_SIZE != ret) {
1056 bio_put(bio);
1057 return -EIO;
1058 }

everything is initialized before use here, so it's hidden behind the
pointers; my bet is at page->bdev->something. Thinking again about how things
got here:

* unsuccessful device remove 'missing', due to csum errors in a
  non-regular file
* crashed scrub, after indirect access of a null pointer

Is there anything I missed for steps to reproduce it?


david


Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread David Sterba
On Wed, May 02, 2012 at 04:59:03PM +0200, Marco L. Crociani wrote:
  On Thu, Apr 19, 2012 at 05:42:05PM +0200, Marco L. Crociani wrote:
   Apr 19 17:38:41 evo kernel: [  347.661964]  [<ffffffffa00b76ac>] btrfs_ioctl_dev_info+0x15c/0x1a0 [btrfs]
[...]
 I was on 3.4-rc5!

You really saw this crash with 3.4-rc5? The patch should be there.
Anyway, your follow-up report was on top of 3.4-rc5, with a different
error.


david


Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread Marco L. Crociani
 Is there anything I missed for steps to reproduce it?

All the story is in previous mails.
http://thread.gmane.org/gmane.comp.file-systems.btrfs/16829
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg15949.html
First mail is missing from mail-archive...

Summary:
Some damaged sectors on one device. It seemed to be OK after rewriting, so
I started a scrub.
During the scrub (kernel 3.2.x) the device completely broke down with A LOT
of damaged sectors --- the other device filled up --- out of space ---
unclean shutdown.
With 3.3 kernels I was able to mount it and add a new device.
I tried 3.4-rc4 but the patch wasn't there.
I had problems compiling from git; first I tried DKMS, then the whole
kernel (is CONCURRENCY = 5 with a quad-core wrong?), so I waited for rc5.
With the tar from kernel.org I successfully compiled 3.4-rc5 (with
CONCURRENCY = 4).
Errors with scrub.
Here we are.


On Wed, May 2, 2012 at 5:27 PM, David Sterba d...@jikos.cz wrote:
 On Wed, May 02, 2012 at 04:59:03PM +0200, Marco L. Crociani wrote:
  On Thu, Apr 19, 2012 at 05:42:05PM +0200, Marco L. Crociani wrote:
   Apr 19 17:38:41 evo kernel: [  347.661964]  [a00b76ac] 
   btrfs_ioctl_dev_info+0x15c/0x1a0 [btrfs]
 [...]
 I was on 3.4-rc5!

 You really saw this crash with 3.4-rc5?

Yes.
I tell you now what I did before your response today.

From this point:

btrfs fi sh
Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
Total devices 3 FS bytes used 1015.83GB
devid3 size 1.75TB used 357.00GB path /dev/sdb3
devid1 size 1.75TB used 1.34TB path /dev/sda3
*** Some devices missing

I reached:

btrfs fi show
Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
Total devices 3 FS bytes used 1004.23GB
devid3 size 1.75TB used 1.25TB path /dev/sdb3
devid1 size 1.75TB used 1.33TB path /dev/sda3
*** Some devices missing

using btrfs balance start -dvrange=1..[group where it fails minus 1]
 a number of times (I started writing some notes on
http://btrfs.ipv5.de/index.php?title=User:Tyrael ).

These should be all the errors (sorry for the confusion):

---

Apr 30 19:53:13 evo kernel: [ 3163.927548] btrfs csum failed ino 510
off 910946304 csum 432355644 private 175165154



May  1 23:15:12 evo kernel: [101661.681997] btrfs: relocating block
group 1742452293632 flags 17
May  1 23:15:39 evo kernel: [101688.412777] btrfs: found 328 extents
May  1 23:15:47 evo kernel: [101696.543742] btrfs: found 328 extents
May  1 23:15:48 evo kernel: [101697.575754] btrfs: relocating block
group 1741378551808 flags 17
May  1 23:16:16 evo kernel: [101724.754908] btrfs: found 137 extents
May  1 23:16:24 evo kernel: [101732.915791] btrfs: found 137 extents
May  1 23:16:24 evo kernel: [101733.275939] btrfs: relocating block
group 1401002393600 flags 17
May  1 23:16:45 evo kernel: [101753.889479] btrfs csum failed ino 2876
off 910946304 csum 432355644 private 175165154

Apr 30 20:55:09 evo kernel: [ 6879.601004] btrfs: relocating block
group 1738157326336 flags 17
Apr 30 20:55:10 evo kernel: [ 6879.995377] btrfs: relocating block
group 1401002393600 flags 17
Apr 30 20:55:29 evo kernel: [ 6898.819546] btrfs csum failed ino 636
off 910946304 csum 432355644 private 175165154
Apr 30 20:55:29 evo kernel: [ 6898.849422] btrfs csum failed ino 636
off 910946304 csum 432355644 private 175165154
Apr 30 20:55:29 evo kernel: [ 6898.849689] btrfs csum failed ino 636
off 910946304 csum 432355644 private 175165154
Apr 30 20:55:29 evo kernel: [ 6898.878413] btrfs csum failed ino 636
off 910946304 csum 432355644 private 175165154
Apr 30 20:55:29 evo kernel: [ 6898.878668] btrfs csum failed ino 636
off 910946304 csum 432355644 private 175165154

May  1 15:26:26 evo kernel: [73542.827058] btrfs: relocating block
group 1394559942656 flags 17
May  1 15:26:38 evo kernel: [73555.038433] btrfs csum failed ino 1581
off 648593408 csum 283516648 private 3975454589

Apr 30 20:58:26 evo kernel: [ 7076.525087] btrfs: relocating block
group 1394559942656 flags 17
Apr 30 20:58:38 evo kernel: [ 7088.082493] btrfs csum failed ino 642
off 648593408 csum 283516648 private 3975454589
Apr 30 20:58:38 evo kernel: [ 7088.108851] btrfs csum failed ino 642
off 648593408 csum 283516648 private 3975454589

May  1 15:28:41 evo kernel: [73677.797363] btrfs: relocating block
group 1385970008064 flags 17
May  1 15:28:45 evo kernel: [73681.242643] btrfs csum failed ino 1582
off 229765120 csum 3096851068 private 993448323

Apr 30 21:30:46 evo kernel: [ 9016.216885] btrfs: found 223 extents
Apr 30 21:30:46 evo kernel: [ 9016.533470] btrfs: relocating block
group 1385970008064 flags 17
Apr 30 21:30:49 evo kernel: [ 9019.630665] btrfs csum failed ino 650
off 229765120 csum 3096851068 private 993448323

Apr 30 21:56:29 evo kernel: [10558.769597] btrfs: relocating block
group 1378453815296 flags 17
Apr 30 21:56:31 evo kernel: [10561.185029] btrfs csum failed ino 657
off 190976000 csum 3234929648 

Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread Stefan Behrens
On 5/2/2012 5:22 PM, David Sterba wrote:
 On Mon, Apr 30, 2012 at 03:01:04PM +0200, Marco L. Crociani wrote:
 ./btrfs device delete missing /mnt/sda3
 ERROR: error removing the device 'missing' - Input/output error


 Apr 30 13:17:57 evo kernel: [  108.866205] btrfs: allowing degraded mounts
 Apr 30 13:17:57 evo kernel: [  108.866214] btrfs: disk space caching is 
 enabled
 Apr 30 13:18:32 evo kernel: [  143.274899] btrfs: relocating block
 group 1401002393600 flags 17
 Apr 30 13:19:25 evo kernel: [  196.888248] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  196.889900] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  196.890429] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  197.087419] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 Apr 30 13:19:25 evo kernel: [  197.087681] btrfs csum failed ino 257
 off 910946304 csum 432355644 private 175165154
 
 the failed checksums prevent to remove the data from the device and then
 removing fails with the above error.
 
 ./btrfs inspect-internal inode-resolve -v 257 /mnt/sda3/
 ioctl ret=-1, error: No such file or directory
 
 So it's not a visible file, possibly a deleted yet uncleaned snapshot or
 the space_cache (guessing from the inode number). But AFAICS the
 checksums are turned off for the free space inode so ...
 
 ./btrfs scrub status /mnt/sda3/
 scrub status for c87975a0-a575-405e-9890-d3f7f25bbd96
  scrub started at Mon Apr 30 13:26:26 2012 and was aborted after 4367 
 seconds
  total bytes scrubbed: 406.64GB with 2 errors
  error details: csum=2
  corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
 
 Shouldn't the csum errors be included under uncorrectable?

uncorrectable errors would have been set to 2 if no crash had happened.

 
 Apr 30 14:37:24 evo kernel: [ 4875.275776] btrfs: checksum error at
 logical 752871157760 on dev /dev/sda3, sector 873795352, root 259,
 inode 1580389, offset 612610048, length 4096, links 1 (path:
 ^^^
 
 so the scrub catches different checksum errors than appeared during
 balance (inode 257).
 
  Apr 30 14:37:24 evo kernel: [ 4875.275838] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
  Apr 30 14:37:24 evo kernel: [ 4875.275848] IP: [<ffffffff811ae841>] bio_add_page+0x11/0x60
  Apr 30 14:37:24 evo kernel: [ 4875.276022] RIP: 0010:[<ffffffff811ae841>]  [<ffffffff811ae841>] bio_add_page+0x11/0x60
 
 this looks like something disappeared out from under scrub
 
 1045 BUG_ON(!page->page);
 1046 bio = bio_alloc(GFP_NOFS, 1);
 1047 if (!bio)
 1048 return -EIO;
 1049 bio->bi_bdev = page->bdev;
 1050 bio->bi_sector = page->physical >> 9;
 1051 bio->bi_end_io = scrub_complete_bio_end_io;
 1052 bio->bi_private = &complete;
 
 1054 ret = bio_add_page(bio, page->page, PAGE_SIZE, 0);
 1055 if (PAGE_SIZE != ret) {
 1056 bio_put(bio);
 1057 return -EIO;
 1058 }
 
 everything is initialized before use here, so it's hidden behind the
 pointers; my bet is at page->bdev->something. Thinking again about how things
 got here:
 
 * unsuccessful device remove 'missing', due to csum errors in a
   non-regular file
 * crashed scrub, after indirect access of a null pointer
 
 Is there anything I missed for steps to reproduce it?

Right. bdev is a NULL pointer for missing devices. Scrub tries to repair
the checksum error by reading the other mirror, and that mirror's device
is missing, so its bdev is NULL.
I'll send a patch tomorrow to prevent the scrub crash in this situation.

Thanks!
From 28fa74661f7a0e209a826e212b40d667516f5d1f Mon Sep 17 00:00:00 2001
From: Stefan Behrens sbehr...@giantdisaster.de
Date: Wed, 2 May 2012 18:49:57 +0200
Subject: [PATCH] Btrfs: fix crash in scrub correction code when device is 
missing

When scrub tries to fix an I/O or checksum error and one of the devices
containing the mirror is missing, it crashes on bdev being a NULL pointer.

Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/scrub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b679bf6..967bcf1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -998,6 +998,8 @@ static int scrub_setup_recheck_block(struct scrub_dev *sdev,
 		page = sblock->pagev + page_index;
 		page->logical = logical;
 		page->physical = bbio->stripes[mirror_index].physical;
+		if (bbio->stripes[mirror_index].dev->missing)
+			continue;
 		page->bdev = bbio->stripes[mirror_index].dev->bdev;
 		page->mirror_num = 

Re: Errors in rebalancing RAID1 array after disk failure.

2012-05-02 Thread Stefan Behrens
Oops, please scratch the attachment to the previous mail; that patch is
not finished yet. I forgot to remove it before hitting the send button :(

Sorry.

 I'll send a patch tomorrow to prevent the scrub crash in this situation.


Re: Errors in rebalancing RAID1 array after disk failure.

2012-04-30 Thread Marco L. Crociani
Hi all,
today another episode: I have compiled and tried kernel 3.4-rc5.

./btrfs fi sh
Label: 'RootFS'  uuid: c87975a0-a575-405e-9890-d3f7f25bbd96
Total devices 3 FS bytes used 1006.67GB
devid    3 size 1.75TB used 357.00GB path /dev/sdb3
devid    1 size 1.75TB used 1.34TB path /dev/sda3
*** Some devices missing

Btrfs Btrfs v0.19


./btrfs device delete missing /mnt/sda3
ERROR: error removing the device 'missing' - Input/output error


Apr 30 13:17:51 evo kernel: [  103.074835] device label RootFS devid 1
transid 47082 /dev/sda3
Apr 30 13:17:52 evo kernel: [  103.281796] device label RootFS devid 3
transid 47082 /dev/sdb3
Apr 30 13:17:57 evo kernel: [  108.865001] device label RootFS devid 1
transid 47082 /dev/sda3
Apr 30 13:17:57 evo kernel: [  108.866205] btrfs: allowing degraded mounts
Apr 30 13:17:57 evo kernel: [  108.866214] btrfs: disk space caching is enabled
Apr 30 13:18:32 evo kernel: [  143.274899] btrfs: relocating block
group 1401002393600 flags 17
Apr 30 13:19:25 evo kernel: [  196.888248] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 30 13:19:25 evo kernel: [  196.889900] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 30 13:19:25 evo kernel: [  196.890429] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 30 13:19:25 evo kernel: [  197.087419] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 30 13:19:25 evo kernel: [  197.087681] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154

./btrfs inspect-internal inode-resolve -v 257 /mnt/sda3/
ioctl ret=-1, error: No such file or directory
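
For context: inode-resolve is a thin wrapper around the
BTRFS_IOC_INO_PATHS ioctl. A minimal sketch of that call (hedged: it
needs root, decoding of the returned paths is omitted, and it assumes a
kernel/header combo that ships linux/btrfs.h, which is newer than what
this thread ran on; older setups used the header bundled with
btrfs-progs) shows why an inode with no visible path comes back as "No
such file or directory":

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_ino_path_args args;
	char buf[4096];
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <mountpoint> <inum>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);            /* e.g. /mnt/sda3 */
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&args, 0, sizeof(args));
	args.inum = strtoull(argv[2], NULL, 10); /* e.g. 257 */
	args.size = sizeof(buf);
	args.fspath = (uintptr_t)buf;            /* paths returned in here */
	/* For an inode that only exists internally (deleted snapshot,
	 * free space cache object, ...) this fails with ENOENT, which is
	 * exactly the error printed above. Decoding the returned
	 * btrfs_data_container on success is omitted in this sketch. */
	if (ioctl(fd, BTRFS_IOC_INO_PATHS, &args) < 0)
		perror("BTRFS_IOC_INO_PATHS");
	close(fd);
	return 0;
}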




./btrfs scrub status /mnt/sda3/
scrub status for c87975a0-a575-405e-9890-d3f7f25bbd96
scrub started at Mon Apr 30 13:26:26 2012 and was aborted after 4367 
seconds
total bytes scrubbed: 406.64GB with 2 errors
error details: csum=2
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0




Apr 30 14:37:24 evo kernel: [ 4875.275776] btrfs: checksum error at
logical 752871157760 on dev /dev/sda3, sector 873795352, root 259,
inode 1580389, offset 612610048, length 4096, links 1 (path:
.ecryptfs/[  ]
Apr 30 14:37:24 evo kernel: [ 4875.275838] BUG: unable to handle
kernel NULL pointer dereference at 0090
Apr 30 14:37:24 evo kernel: [ 4875.275848] IP: [811ae841]
bio_add_page+0x11/0x60
Apr 30 14:37:24 evo kernel: [ 4875.275862] PGD 0
Apr 30 14:37:24 evo kernel: [ 4875.275868] Oops:  [#1] SMP
Apr 30 14:37:24 evo kernel: [ 4875.275875] CPU 2
Apr 30 14:37:24 evo kernel: [ 4875.275878] Modules linked in:
ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT
xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables
bridge stp kvm_amd kvm rfcomm bnep dm_crypt parport_pc bluetooth ppdev
snd_hda_codec_realtek snd_hda_codec_hdmi uvcvideo videobuf2_core
snd_hda_intel snd_hda_codec videodev videobuf2_vmalloc snd_usb_audio
videobuf2_memops snd_hwdep snd_pcm snd_usbmidi_lib snd_seq_midi
snd_rawmidi eeepc_wmi asus_wmi snd_seq_midi_event snd_seq snd_timer
snd_seq_device mac_hid sparse_keymap snd binfmt_misc soundcore
snd_page_alloc dm_multipath k10temp i2c_piix4 microcode lp parport
raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov
raid6_pq async_tx raid0 multipath linear btrfs zlib_deflate libcrc32c
raid1 usbhid hid wmi r8169
Apr 30 14:37:24 evo kernel: [ 4875.276004]
Apr 30 14:37:24 evo kernel: [ 4875.276010] Pid: 3401, comm:
btrfs-scrub-1 Not tainted 3.4.0-rc5-mio01 #1 System manufacturer
System Product Name/F1A75-V EVO
Apr 30 14:37:24 evo kernel: [ 4875.276022] RIP:
0010:[811ae841]  [811ae841] bio_add_page+0x11/0x60
Apr 30 14:37:24 evo kernel: [ 4875.276033] RSP: 0018:88017135bba0
EFLAGS: 00010246
Apr 30 14:37:24 evo kernel: [ 4875.276038] RAX:  RBX:
8801710ac000 RCX: 
Apr 30 14:37:24 evo kernel: [ 4875.276044] RDX: 1000 RSI:
ea0004c2b8c0 RDI: 88017775b900
Apr 30 14:37:24 evo kernel: [ 4875.276050] RBP: 88017135bba0 R08:
8801bed16590 R09: 0001
Apr 30 14:37:24 evo kernel: [ 4875.276056] R10: 710d1001 R11:
0007 R12: 88017775b900
Apr 30 14:37:24 evo kernel: [ 4875.276061] R13: 8801710ac000 R14:
 R15: 88017135bbf8
Apr 30 14:37:24 evo kernel: [ 4875.276068] FS:  7f33e7e239c0()
GS:8801bed0() knlGS:f66a2b70
Apr 30 14:37:24 evo kernel: [ 4875.276074] CS:  0010 DS:  ES: 
CR0: 8005003b
Apr 30 14:37:24 evo kernel: [ 4875.276080] CR2: 0090 CR3:
00017b6e4000 CR4: 07e0
Apr 30 14:37:24 evo kernel: [ 4875.276086] DR0:  DR1:
 DR2: 
Apr 30 14:37:24 evo kernel: [ 4875.276092] DR3:  DR6:
0ff0 DR7: 0400

Re: Errors in rebalancing RAID1 array after disk failure.

2012-04-19 Thread Marco L. Crociani
Today I tried scrub...

Apr 19 17:36:01 evo kernel: [  187.932297] device label RootFS devid 1
transid 47046 /dev/sda3
Apr 19 17:36:02 evo kernel: [  188.145858] device label RootFS devid 3
transid 47046 /dev/sdb3
Apr 19 17:36:19 evo kernel: [  205.483044] device label RootFS devid 1
transid 47046 /dev/sda3
Apr 19 17:36:19 evo kernel: [  205.483730] btrfs: allowing degraded mounts
Apr 19 17:36:19 evo kernel: [  205.483737] btrfs: disk space caching is enabled
Apr 19 17:38:41 evo kernel: [  347.661603] BUG: unable to handle
kernel NULL pointer dereference at   (null)
Apr 19 17:38:41 evo kernel: [  347.661617] IP: [8131ff94]
strncpy+0x14/0x30
Apr 19 17:38:41 evo kernel: [  347.661633] PGD 17b672067 PUD 17b5ed067 PMD 0
Apr 19 17:38:41 evo kernel: [  347.661643] Oops:  [#1] SMP
Apr 19 17:38:41 evo kernel: [  347.661650] CPU 3
Apr 19 17:38:41 evo kernel: [  347.661654] Modules linked in:
ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT
xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables
bridge stp kvm_amd kvm rfcomm bnep bluetooth parport_pc ppdev dm_crypt
snd_hda_codec_realtek snd_hda_codec_hdmi snd_usb_audio snd_usbmidi_lib
snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi
snd_seq_midi_event uvcvideo snd_seq snd_timer snd_seq_device snd
videobuf2_core videodev v4l2_compat_ioctl32 videobuf2_vmalloc
soundcore videobuf2_memops dm_multipath eeepc_wmi mac_hid asus_wmi
binfmt_misc snd_page_alloc fglrx(PO) i2c_piix4 k10temp sparse_keymap
lp parport raid10 raid456 async_pq async_xor xor async_memcpy
async_raid6_recov raid6_pq async_tx raid0 multipath linear btrfs
zlib_deflate libcrc32c raid1 usbhid hid wmi r8169
Apr 19 17:38:41 evo kernel: [  347.661780]
Apr 19 17:38:41 evo kernel: [  347.661787] Pid: 3218, comm: btrfs
Tainted: P   O 3.3.2-030302-generic #201204131335 System
manufacturer System Product Name/F1A75-V EVO
Apr 19 17:38:41 evo kernel: [  347.661799] RIP:
0010:[8131ff94]  [8131ff94] strncpy+0x14/0x30
Apr 19 17:38:41 evo kernel: [  347.661810] RSP: 0018:880182559e08
EFLAGS: 00010206
Apr 19 17:38:41 evo kernel: [  347.661816] RAX: 8801b14eac00 RBX:
8801b14ea000 RCX: 
Apr 19 17:38:41 evo kernel: [  347.661822] RDX: 0400 RSI:
 RDI: 8801b14eac00
Apr 19 17:38:41 evo kernel: [  347.661827] RBP: 880182559e08 R08:
8801b048b8b8 R09: 0002
Apr 19 17:38:41 evo kernel: [  347.661833] R10: 0010 R11:
0206 R12: 8801b1741800
Apr 19 17:38:41 evo kernel: [  347.661839] R13: 00d55040 R14:
8801b14ea008 R15: 8801b048b898
Apr 19 17:38:41 evo kernel: [  347.661846] FS:  7f73c9f34760()
GS:8801bed8() knlGS:
Apr 19 17:38:41 evo kernel: [  347.661852] CS:  0010 DS:  ES: 
CR0: 80050033
Apr 19 17:38:41 evo kernel: [  347.661857] CR2:  CR3:
0001827db000 CR4: 06e0
Apr 19 17:38:41 evo kernel: [  347.661863] DR0:  DR1:
 DR2: 
Apr 19 17:38:41 evo kernel: [  347.661869] DR3:  DR6:
0ff0 DR7: 0400
Apr 19 17:38:41 evo kernel: [  347.661875] Process btrfs (pid: 3218,
threadinfo 880182558000, task 88017b5e44d0)
Apr 19 17:38:41 evo kernel: [  347.661880] Stack:
Apr 19 17:38:41 evo kernel: [  347.661884]  880182559e78
a00b76ac 8801b1504e00 
Apr 19 17:38:41 evo kernel: [  347.661895]  
 880182559f48 5bfc4f67
Apr 19 17:38:41 evo kernel: [  347.661905]  00012c2c
8801824a2600 00d55040 88018c7df800
Apr 19 17:38:41 evo kernel: [  347.661915] Call Trace:
Apr 19 17:38:41 evo kernel: [  347.661964]  [a00b76ac]
btrfs_ioctl_dev_info+0x15c/0x1a0 [btrfs]
Apr 19 17:38:41 evo kernel: [  347.662013]  [a00ba9b1]
btrfs_ioctl+0x571/0x6c0 [btrfs]
Apr 19 17:38:41 evo kernel: [  347.662024]  [81193839]
do_vfs_ioctl+0x99/0x330
Apr 19 17:38:41 evo kernel: [  347.662032]  [8118d345] ?
putname+0x35/0x50
Apr 19 17:38:41 evo kernel: [  347.662040]  [81193b71]
sys_ioctl+0xa1/0xb0
Apr 19 17:38:41 evo kernel: [  347.662049]  [816691a9]
system_call_fastpath+0x16/0x1b
Apr 19 17:38:41 evo kernel: [  347.662054] Code: 48 83 c2 01 84 c9 75
ef c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 d2 48 89 f8
48 89 e5 75 08 eb 18 66 90 48 83 c7 01 0f b6 0e 80 f9 01 88 0f 48 83
de ff 48 83 ea 01 75 ea c9 c3 0f
Apr 19 17:38:41 evo kernel: [  347.662128] RIP  [8131ff94]
strncpy+0x14/0x30
Apr 19 17:38:41 evo kernel: [  347.662137]  RSP 880182559e08
Apr 19 17:38:41 evo kernel: [  347.662141] CR2: 
Apr 19 17:38:41 evo kernel: [  347.662147] ---[ end trace 9a8c295d04917ed2 ]---


-- 
Marco Lorenzo Crociani,
marco.croci...@gmail.com

Re: Errors in rebalancing RAID1 array after disk failure.

2012-04-16 Thread David Sterba
On Sat, Apr 14, 2012 at 06:39:12PM +0200, Marco L. Crociani wrote:
 Apr 14 18:07:52 evo kernel: [  431.054709] btrfs: relocating block
 group 1401002393600 flags 17
 Apr 14 18:08:14 evo kernel: [  453.506541] btrfs csum failed ino 362
 off 910946304 csum 432355644 private 175165154

The failed checksums prevent balance from relocating the block group,
which is a needed step during 'dev delete'. Unless the csum is fixable
by using another copy, I think the only option left is to delete the
file (not counting the unsafe way of resetting the block's checksum).
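
Roughly the path involved, as far as I can tell (a hedged sketch; the
call chain below is simplified and from memory of that era's code, so
worth verifying against the tree):

    /*
     * btrfs device delete
     *   -> btrfs_rm_device()
     *     -> btrfs_shrink_device()   relocates every chunk off the device
     *       -> btrfs_relocate_chunk() / relocate_block_group()
     *            relocation reads the data back through the page cache,
     *            and the read path verifies each block's csum; a
     *            mismatch fails the read and aborts the relocation
     *   <- the error bubbles up, so 'device delete' reports an I/O error
     */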


david


Re: Errors in rebalancing RAID1 array after disk failure.

2012-04-16 Thread Marco L. Crociani
On Mon, Apr 16, 2012 at 3:46 PM, David Sterba d...@jikos.cz wrote:
 On Sat, Apr 14, 2012 at 06:39:12PM +0200, Marco L. Crociani wrote:
 Apr 14 18:07:52 evo kernel: [  431.054709] btrfs: relocating block
 group 1401002393600 flags 17
 Apr 14 18:08:14 evo kernel: [  453.506541] btrfs csum failed ino 362
 off 910946304 csum 432355644 private 175165154

 The failed checksums prevent balance from relocating the block group,
 which is a needed step during 'dev delete'. Unless the csum is fixable
 by using another copy, I think the only option left is to delete the
 file (not counting the unsafe way of resetting the block's checksum).


I deleted the files.
( Is find /mnt/sda3 -inum 362 -ls the correct way to find them? )

Now it gives me errors on inode 257. I deleted a file, but it still
reports errors on inode 257, and find /mnt/sda3 -inum 257 -ls now
returns nothing.

Apr 17 00:41:49 evo kernel: [  156.530441] device label RootFS devid 1
transid 47037 /dev/sda3
Apr 17 00:41:49 evo kernel: [  156.734993] device label RootFS devid 3
transid 47037 /dev/sdb3
Apr 17 00:42:12 evo kernel: [  179.496155] device label RootFS devid 1
transid 47037 /dev/sda3
Apr 17 00:42:12 evo kernel: [  179.496881] btrfs: allowing degraded mounts
Apr 17 00:42:12 evo kernel: [  179.496888] btrfs: disk space caching is enabled
Apr 17 00:42:24 evo kernel: [  191.290093] btrfs: relocating block
group 1401002393600 flags 17
Apr 17 00:42:53 evo kernel: [  220.417535] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 17 00:42:53 evo kernel: [  220.480570] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 17 00:42:53 evo kernel: [  220.480868] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 17 00:42:53 evo kernel: [  220.505168] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154
Apr 17 00:42:53 evo kernel: [  220.528368] btrfs csum failed ino 257
off 910946304 csum 432355644 private 175165154


-- 
Marco Lorenzo Crociani,


Re: Errors in rebalancing RAID1 array after disk failure.

2012-04-16 Thread Marco L. Crociani
On Tue, Apr 17, 2012 at 12:56 AM, Marco L. Crociani
marco.croci...@gmail.com wrote:

Running btrfs dev delete missing another time returns a different error
(something like invalid argument), with no log activity.
Then umount completely freezes the system: the keyboard's LEDs start
blinking, and even Alt Gr + Print Screen (SysRq) + REISUB doesn't work.

-- 
Marco Lorenzo Crociani,
marco.croci...@gmail.com
Telefono: +39 02320622509
Fax: +39 02700540121