Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"

2018-06-06 Thread Russell Coker
https://www.spinics.net/lists/linux-btrfs/msg77927.html

Thanks to Hans van Kranenburg and Holger Hoffstätte: the above message has the 
link to the patch for this, which was already included in kernel 4.16.11.  That 
kernel was uploaded to Debian on the 27th of May and got into testing at about 
the time that my message got to the SE Linux list.

The kernel from Debian/Stable still has the issue.  So using a testing kernel 
might be a good option to deal with this problem at the moment.

On Monday, 4 June 2018 11:14:52 PM AEST Russell Coker wrote:
> The command "reboot -nffd" (kernel reboot without flushing kernel buffers or
> writing status) when run on a BTRFS system with SE Linux will often result
> in /var/log/audit/audit.log being unlabeled.  It also results in some
> systemd-journald files like
> /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/system.journal being
> unlabeled but that is rarer.  I think that the same problem afflicts both
> systemd-journald and auditd but it's a race condition that on my systems
> (both production and test) is more likely to affect auditd.
> 
> root@stretch:/# xattr -l /var/log/audit/audit.log
> security.selinux:
> 0000   73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5F   system_u:object_
> 0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73   r:auditd_log_t:s
> 0020   30 00                                              0.
> 
> SE Linux uses the xattr "security.selinux".  You can see what it's doing with
> xattr(1), but generally using "ls -Z" is easiest.
> 
> If this issue just affected "reboot -nffd" then a solution might be to just
> not run that command.  However this affects systems after a power outage.
> 
> I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security
> update for Debian/Stretch which is the latest supported release of Debian). 
> I have also reproduced it in an identical manner with kernel 4.16.0-1-amd64
> (the latest from Debian/Unstable).  For testing I reproduced this with a 4G
> filesystem in a VM, but in production it has happened on BTRFS RAID-1
> arrays, both SSD and HDD.
> 
> #!/bin/bash
> set -e
> COUNT=$(ps aux|grep '[s]bin/auditd'|wc -l)
> date
> if [ "$COUNT" = "1" ]; then
>  echo "all good"
> else
>  echo "failed"
>  exit 1
> fi
> 
> Firstly, the above is the script /usr/local/sbin/testit.  I test for auditd
> running because it aborts if the context on its log file is wrong.  When SE
> Linux is in enforcing mode an incorrect/missing label on the audit.log file
> causes auditd to abort.
> 
> root@stretch:~# ls -liZ /var/log/audit/audit.log
> 37952 -rw-------. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun  1 12:23 /var/log/audit/audit.log
> Above is before I do the tests.
> 
> while ssh stretch /usr/local/sbin/testit ; do
>  ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 &
>  sleep 20
> done
> Above is the shell code I run to do the tests.  Note that the VM in question
> runs on SSD storage which is why it can consistently boot in less than 20
> seconds.
> 
> Fri  1 Jun 12:26:13 UTC 2018
> all good
> Fri  1 Jun 12:26:33 UTC 2018
> failed
> Above is the output from the shell code in question.  After the first reboot
> it fails.  The probability of failure on my test system is greater than
> 50%.
> 
> root@stretch:~# ls -liZ /var/log/audit/audit.log
> 37952 -rw-------. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun  1 12:26 /var/log/audit/audit.log
> Now the result.  Note that the inode has not changed.  I could understand a
> newly created file missing an xattr, but this is an existing file which
> shouldn't have had its xattr changed.  But somehow it gets corrupted.
> 
> The first possibility I considered was that SE Linux code might be at fault.
> I asked on the SE Linux mailing list (I haven't been involved in SE Linux
> kernel code for about 15 years) and was informed that this isn't likely at
> all.  There have been no problems like this reported with other
> filesystems.
> 
> Does anyone have any ideas of other tests I should run?  Anyone want me to
> try a different kernel?  I can give root on a VM to anyone who wants to
> poke at it.


-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/





BTRFS losing SE Linux labels on power failure or "reboot -nffd"

2018-06-04 Thread Russell Coker
The command "reboot -nffd" (kernel reboot without flushing kernel buffers or 
writing status) when run on a BTRFS system with SE Linux will often result in 
/var/log/audit/audit.log being unlabeled.  It also results in some systemd-
journald files like /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/
system.journal being unlabeled but that is rarer.  I think that the same 
problem afflicts both systemd-journald and auditd but it's a race condition 
that on my systems (both production and test) is more likely to affect auditd.

root@stretch:/# xattr -l /var/log/audit/audit.log 
security.selinux:
0000   73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5F   system_u:object_
0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73   r:auditd_log_t:s
0020   30 00                                              0.

SE Linux uses the xattr "security.selinux".  You can see what it's doing with 
xattr(1), but generally using "ls -Z" is easiest.
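
For anyone who wants to check or repair the label by hand, a minimal sketch 
(assuming a standard policy is installed, so that restorecon knows the right 
context for audit.log):

# Show the raw xattr, relabel the file from the loaded policy, then verify.
getfattr -n security.selinux /var/log/audit/audit.log
restorecon -v /var/log/audit/audit.log
ls -Z /var/log/audit/audit.log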

If this issue just affected "reboot -nffd" then a solution might be to just 
not run that command.  However this affects systems after a power outage.
 
I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security 
update for Debian/Stretch which is the latest supported release of Debian).  I 
have also reproduced it in an identical manner with kernel 4.16.0-1-amd64 (the 
latest from Debian/Unstable).  For testing I reproduced this with a 4G 
filesystem in a VM, but in production it has happened on BTRFS RAID-1 arrays, 
both SSD and HDD.
 
#!/bin/bash 
set -e 
COUNT=$(ps aux|grep '[s]bin/auditd'|wc -l)
date 
if [ "$COUNT" = "1" ]; then 
 echo "all good" 
else 
 echo "failed" 
 exit 1 
fi

Firstly, the above is the script /usr/local/sbin/testit.  I test for auditd 
running because it aborts if the context on its log file is wrong.  When SE 
Linux is in enforcing mode an incorrect/missing label on the audit.log file 
causes auditd to abort.
 
root@stretch:~# ls -liZ /var/log/audit/audit.log 
37952 -rw-------. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun  1 12:23 /var/log/audit/audit.log
Above is before I do the tests.
 
while ssh stretch /usr/local/sbin/testit ; do 
 ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 & 
 sleep 20 
done
Above is the shell code I run to do the tests.  Note that the VM in question 
runs on SSD storage which is why it can consistently boot in less than 20 
seconds.
 
Fri  1 Jun 12:26:13 UTC 2018 
all good 
Fri  1 Jun 12:26:33 UTC 2018 
failed
Above is the output from the shell code in question.  After the first reboot 
it fails.  The probability of failure on my test system is greater than 50%.
 
root@stretch:~# ls -liZ /var/log/audit/audit.log  
37952 -rw-------. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun  1 12:26 /var/log/audit/audit.log
Now the result.  Note that the inode has not changed.  I could understand a 
newly created file missing an xattr, but this is an existing file which 
shouldn't have had its xattr changed.  But somehow it gets corrupted.
 
The first possibility I considered was that SE Linux code might be at fault.  
I asked on the SE Linux mailing list (I haven't been involved in SE Linux 
kernel code for about 15 years) and was informed that this isn't likely at 
all.  There have been no problems like this reported with other filesystems.
 
Does anyone have any ideas of other tests I should run?  Anyone want me to try 
a different kernel?  I can give root on a VM to anyone who wants to poke at 
it.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/





Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Russell Coker
On Monday, 4 September 2017 2:57:18 PM AEST Stefan Priebe - Profihost AG 
wrote:
> > Then roughly make sure the complete set of metadata blocks fits in the
> > cache. For an fs of this size let's say/estimate 150G. Then maybe same
> > of double for data, so an SSD of 500G would be a first try.
> 
> I would use 1TB devices for each Raid or a 4TB PCIe card.

One thing I've considered is to create a filesystem with a RAID-1 of SSDs and 
then create lots of files with long names to use up a lot of space on the 
SSDs.  Then delete those files and add disks to the filesystem.  Then BTRFS 
should keep using the allocated metadata blocks on the SSD for all metadata 
and use disks for just data.

I haven't yet tried bcache, but would prefer something simpler with one less 
layer.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



read-only for no good reason on 4.9.30

2017-09-03 Thread Russell Coker
I have a system with less than 50% disk space used.  It just started rejecting 
writes due to lack of disk space.  I ran "btrfs balance" and then it started 
working correctly again.  It seems that a btrfs filesystem, if left alone, will 
eventually get fragmented enough that it rejects writes (I've had similar 
issues with other systems running BTRFS with other kernel versions).

Is this a known issue?

Is there any good way of recognising when it's likely to happen?  Is there 
anything I can do other than rewriting a medium size file to determine when 
it's happened?

# uname -a 
Linux trex 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 
GNU/Linux
# df -h / 
Filesystem  Size  Used Avail Use% Mounted on 
/dev/sdc239G  113G  126G  48% /
# btrfs fi df / 
Data, RAID1: total=117.00GiB, used=111.81GiB 
System, RAID1: total=32.00MiB, used=48.00KiB 
Metadata, RAID1: total=1.00GiB, used=516.00MiB 
GlobalReserve, single: total=246.59MiB, used=0.00B
# btrfs dev usa / 
/dev/sdc, ID: 1 
  Device size:   238.47GiB 
  Device slack:  0.00B 
  Data,RAID1:117.00GiB 
  Metadata,RAID1:  1.00GiB 
  System,RAID1:   32.00MiB 
  Unallocated:   120.44GiB 

/dev/sdd, ID: 2 
  Device size:   238.47GiB 
  Device slack:  0.00B 
  Data,RAID1:117.00GiB 
  Metadata,RAID1:  1.00GiB 
  System,RAID1:   32.00MiB 
  Unallocated:   120.44GiB
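
In case it's useful to others: a filtered balance is normally enough to reclaim 
mostly-empty chunks and is much quicker than a full balance.  A sketch, assuming 
a btrfs-progs version that supports usage filters:

# Rewrite only data chunks under 50% full and metadata chunks under 30% full,
# returning the freed chunks to the unallocated pool.
btrfs balance start -dusage=50 -musage=30 /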

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: Strange behavior after "rm -rf //"

2016-08-11 Thread Russell Coker
http://selinux.coker.com.au/play.html

There are a variety of ways of getting the same result that rm doesn't reject. 
"/*" wasn't caught last time I checked.  See the above URL if you want to test 
out various rm operations as root. ;)

On 10 August 2016 9:24:23 AM AEST, Christian Kujau  
wrote:
>On Mon, 8 Aug 2016, Ivan Sizov wrote:
>> I'd ran "rm -rf //" by mistake two days ago. I'd stopped it after
>five
>
>Out of curiosity, what version of coreutils is this? The
>--preserve-root 
>option is the default for quite some time now:
>
>> Don't include dirname.h, since system.h does it now.
>> (usage, main): --preserve-root is now the default.
>> 2006-09-03 02:53:58 +
>http://git.savannah.gnu.org/cgit/coreutils.git/commit/src/rm.c?id=89ffaa19909d31dffbcf12fb4498afb72666f6c9
>
>Even coreutils-6.10 from Debian/5 refuses to remove "/":
>
>$ sudo rm -rf /
>rm: cannot remove root directory `/'
>
>Christian.

-- 
Sent from my Nexus 6P with K-9 Mail.


BTRFS admin access control

2016-08-03 Thread Russell Coker
I've just written a script for "mon" to monitor BTRFS filesystems.  I had to 
use sudo because "btrfs device stats" needs to be run as root.

Would it be possible to do some of these things as non-root?  I think it would 
be ideal if there was a "btrfs tunefs" operation somewhat comparable to 
"tune2fs" that allowed us to specify which UID or GID could perform operations 
like "btrfs device stats".

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-13 Thread Russell Coker
On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
> 
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has the 
checksums for those blocks, and the blocks which link to that metadata (EG 
directory entries referencing file metadata) have checksums of those.  For each 
metadata block there is a new version that is eventually linked from a new 
version of the tree root.

This means that the regular checksum mechanisms can't work with nocow data.  A 
filesystem can have checksums just pointing to data blocks but you need to 
cater for the case where a corrupt metadata block points to an old version of 
a data block and matching checksum.  The way that BTRFS works with an entire 
checksummed tree means that there's no possibility of pointing to an old 
version of a data block.

The research NetApp published into hard drive errors indicates that errors are 
usually few in number and located in small areas of the disk.  So if BTRFS 
had a nocow file with any storage method other than dup you would have metadata 
and file data far enough apart that they are not likely to be hit by the same 
corruption (and the same thing would apply with most Ext4 Inode tables and 
data blocks).  I think that a file mode where there were checksums on data 
blocks with no checksums on the metadata tree would be useful.  But it would 
require a moderate amount of coding and there's lots of other things that the 
developers are working on.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


3.16.0 Debian kernel hang

2015-12-04 Thread Russell Coker
One of my test laptops started hanging on mounting the root filesystem.  I 
think that it had experienced an unexpected power outage prior to that, which 
may have caused corruption.

When I tried to mount the root filesystem the mount process would stick in D 
state, there would be no disk IO, and the computer would get hot - presumably 
due to kernel CPU use even though "top" didn't seem to indicate that.

When I mounted the filesystem with a 4.2.0 kernel it said "The free space cache 
file (1103101952) is invalid, skip it" and then things worked.  Now that the 
machine is running 4.2.0 everything is fine.

I know that there are no plans to backport things to 3.16 and I don't think 
the Debian people are going to be very interested in this.  So this message is 
a FYI for users, maybe consider not using the Debian/Jessie kernel for BTRFS 
systems.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Russell Coker
On Sat, 5 Dec 2015 12:53:07 AM Austin S Hemmelgarn wrote:
> > The only reason I'm not running Unstable kernels on my Debian systems is
> > because I run some Xen servers and upgrading Xen is problemmatic.  Linode
> > is moving from Xen to KVM so I guess I should consider doing the
> > same.  If I migrate my Xen servers to KVM I can use newer kernels with
> > less risk.
> 
> That's interesting, that must be something with how they do kernel 
> development in Debian, because I've never had any issues upgrading 
> either Xen or Linux on any of the systems I've run Xen on, and I 
> directly track mainline (with a small number of patches) for Linux, and 
> stay relatively close to mainline with Xen (Gentoo doesn't have all that 
> many patches on top of the regular release for Xen, aside from XSA
> patches).

I don't think that Debian does anything wrong in this regard.  It's just that 
my experience of Xen is that it is fragile at the best of times.  The fact 
that Red Hat packaged the Xen kernel in the Linux kernel package is a major 
indication of Xen problems IMHO; the concept of Xen is that it shouldn't be 
tied to a Linux kernel.

If you haven't had Xen issues then I think you have been lucky.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Russell Coker
On Sat, 5 Dec 2015 12:08:58 AM Austin S Hemmelgarn wrote:
> > I know that there are no plans to backport things to 3.16 and I don't
> > think the Debian people are going to be very interested in this.  So
> > this message is a FYI for users, maybe consider not using the
> > Debian/Jessie kernel for BTRFS systems.
> 
> I'd suggest extending that suggestion to:
> If you're not using an Enterprise distro (RHEL, SLES, CentOS, OEL), then 
> you should probably be building your own kernel, ideally using upstream 
> sources.

There are lots of ways of dealing with this.

Debian development doesn't stop.  Anyone who is running a Jessie system can 
easily run a kernel from Testing or Unstable (which really isn't particularly 
unstable).  It's generally expected that Debian user-space will work with a 
kernel from +- one release of Debian.  Also every time I've tried it Debian 
has worked well with a CentOS kernel of a similar version.

The only reason I'm not running Unstable kernels on my Debian systems is 
because I run some Xen servers and upgrading Xen is problematic.  Linode is 
moving from Xen to KVM so I guess I should consider doing the same.  If I 
migrate my Xen servers to KVM I can use newer kernels with less risk.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: LWN mention

2015-12-01 Thread Russell Coker
On Mon, 9 Nov 2015 08:10:13 AM Duncan wrote:
> Russell Coker posted on Sun, 08 Nov 2015 17:38:32 +1100 as excerpted:
> > https://lwn.net/Articles/663474/
> > http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500
> > 
> > Above is a BTRFS issue that is mentioned in a LWN comment.  Has this one
> > been fixed yet?
> 
> Good job linking to a subscription-only article, without using a link
> that makes it available for non-subscriber list readers to read.[1]

Well it's only subscription-only for a week, so it's available for everyone 
now.

While LWN has a feature for "subscribers" (which includes me because HP 
sponsors LWN access for all DDs) to send free links for other people I don't 
believe it would be appropriate to use that on a mailing list.  If anyone had 
asked by private mail I'd have been happy to send them a personal link for 
that.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Using Btrfs on single drives

2015-11-24 Thread Russell Coker
On Sun, 15 Nov 2015 03:01:57 PM Duncan wrote:
> That looks to me like native drive limitations.
> 
> Due to the fact that a modern hard drive spins at the same speed no 
> matter where the read/write head is located, when it's reading/writing to 
> the first part of the drive -- the outside -- much more linear drive 
> distance will pass under the read/write heads in say a tenth of a second 
> than will be the case as the last part of the drive is filled -- the 
> inside -- and throughput will be much higher at the first of the drive.

http://www.coker.com.au/bonnie++/zcav/results.html

The above page has the results of my ZCAV benchmark (part of the Bonnie++ 
suite) which shows this.  You can safely run ZCAV in read mode on a device 
that's got a filesystem on it so it's not too late to test these things.
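
For example, a read-only pass over a whole device writes nothing to the disk 
and just prints per-zone throughput figures (the device name is an example):

zcav /dev/sda > sda-zcav.txt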

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: How to replicate a Xen VM using BTRFS as the root filesystem.

2015-10-28 Thread Russell Coker
On Wed, 28 Oct 2015 11:07:20 PM Austin S Hemmelgarn wrote:
> Using this methodology, I can have a new Gentoo PV domain running in 
> about half an hour, whereas it takes me at least two and a half hours 
> (and often much longer than that) when using the regular install process 
> for Gentoo.

On my virtual servers I have a BTRFS subvol /xenstore for the block devices of 
virtual machines.  When I want to duplicate a VM I run
"cp -a --reflink=aways /xenstore/A /xenstore/B" which takes a few seconds.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Expected behavior of bad sectors on one drive in a RAID1

2015-10-20 Thread Russell Coker
On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote:
> > https://www.gnu.org/software/ddrescue/
> > 
> > At this stage I would use ddrescue or something similar to copy data from
> > the failing disk to a fresh disk, then do a BTRFS scrub to regenerate
> > the missing data.
> > 
> > I wouldn't remove the disk entirely because then you lose badly if you
> > get another failure.  I wouldn't use a BTRFS replace because you already
> > have the system apart and I expect ddrescue could copy the data faster. 
> > Also as the drive has been causing system failures (I'm guessing a
> > problem with the power connector) you REALLY don't want BTRFS to corrupt
> > data on the other disks.  If you have a system with the failing disk and
> > a new disk attached then there's no risk of further contamination.
> 
> BIG DISCLAIMER: For the filesystem to be safely mountable it is
> ABSOLUTELY NECESSARY to remove the old disk after doing a block level

You are correct, my message wasn't clear.

What I meant to say is that doing a "btrfs device remove" or "btrfs replace" 
is generally a bad idea in such a situation.  "btrfs replace" is pretty good 
if you are replacing a disk with a larger one or replacing a disk that has 
only minor errors (a disk that just gets a few bad sectors is unlikely to get 
many more in a hurry).

> copy of it.  By all means, keep the disk around, but do not keep it
> visible to the kernel after doing a block level copy of it.  Also, you
> will probably have to run 'btrfs device scan' after copying the disk and
> removing it for the filesystem to work right.  This is an inherent
> result of how BTRFS's multi-device functionality works, and also applies
> to doing stuff like LVM snapshots of BTRFS filesystems.

Good advice.  I recommend just rebooting the system.  I think that anyone 
who has the background knowledge to do such things without rebooting will 
probably just do it without needing to ask us for advice.

> >> Question 2 - Before having ran the scrub, booting off the raid with
> >> bad sectors, would btrfs "on the fly" recognize it was getting bad
> >> sector data with the checksum being off, and checking the other
> >> drives?  Or, is it expected that I could get a bad sector read in a
> >> critical piece of operating system and/or kernel, which could be
> >> causing my lockup issues?
> > 
> > Unless you have disabled CoW then BTRFS will not return bad data.
> 
> It is worth clarifying also that:
> a. While BTRFS will not return bad data in this case, it also won't
> automatically repair the corruption.

Really?  If so I think that's a bug in BTRFS.  When mounted rw I think that 
every time corruption is discovered it should be automatically fixed.

> b. In the unlikely event that both copies are bad, trying to read the
> data will return an IO error.
> c. It is theoretically possible (although statistically impossible) that
> the block could become corrupted, but the checksum could still be
> correct (CRC32c is good at detecting small errors, but it's not hard to
> generate a hash collision for any arbitrary value, so if a large portion
> of the block goes bad, then it can theoretically still have a valid
> checksum).

It would be interesting to see some research into how CRC32 fits with the more 
common disk errors.  For a disk to return bad data and claim it to be good the 
data must either be a misplaced write or read (which is almost certain to be 
caught by BTRFS as the metadata won't match), or a random sector that matches 
the disk's CRC.  Is generating a hash collision for a CRC32 inside a CRC 
protected block much more difficult?

> >> Question 3 - Probably doesn't matter, but how can I see which files
> >> (or metadata to files) the 40 current bad sectors are in?  (On extX,
> >> I'd use tune2fs and debugfs to be able to see this information.)
> > 
> > Read all the files in the system and syslog will report it.  But really
> > don't do that until after you have copied the disk.
> 
> It may also be possible to use some of the debug tools from BTRFS to do
> this without hitting the disks so hard, but it will likely take a lot
> more effort.

I don't think that you can do that without hitting the disks hard.

That said last time I checked (last time an executive of a hard drive 
manufacturer was willing to talk to me) drives were apparently designed to 
perform any sequence of operations for their warranty period.  So for a disk 
that is believed to be good this shouldn't be a problem.  For a disk that is 
known to be dying it would be a really bad idea to do anything other than copy 
the data off at maximum speed.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Expected behavior of bad sectors on one drive in a RAID1

2015-10-19 Thread Russell Coker
On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote:
> sda appears to be going bad, with my low threshold of "going bad", and
> will be replaced ASAP.  It just developed 16 reallocated sectors, and
> has 40 current pending sectors.
> 
> I'm currently running a "btrfs scrub start -B -d -r /terra", which
> status on another term shows me has found 32 errors after running for
> an hour.

https://www.gnu.org/software/ddrescue/

At this stage I would use ddrescue or something similar to copy data from the 
failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing 
data.
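
A rough sketch of that sequence (device names and the map file name are 
examples only - triple check which disk is which before running it):

# Copy as much as possible from the failing disk to the new one, keeping a
# map file so the copy can be resumed if it is interrupted.
ddrescue -f -n /dev/sdX /dev/sdY rescue.map
# After removing the old disk and rebooting, let scrub rebuild anything that
# could not be copied from the other half of the RAID-1.
btrfs scrub start -Bd /mnt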

I wouldn't remove the disk entirely because then you lose badly if you get 
another failure.  I wouldn't use a BTRFS replace because you already have the 
system apart and I expect ddrescue could copy the data faster.  Also as the 
drive has been causing system failures (I'm guessing a problem with the power 
connector) you REALLY don't want BTRFS to corrupt data on the other disks.  If 
you have a system with the failing disk and a new disk attached then there's 
no risk of further contamination.

> Question 2 - Before having ran the scrub, booting off the raid with
> bad sectors, would btrfs "on the fly" recognize it was getting bad
> sector data with the checksum being off, and checking the other
> drives?  Or, is it expected that I could get a bad sector read in a
> critical piece of operating system and/or kernel, which could be
> causing my lockup issues?

Unless you have disabled CoW then BTRFS will not return bad data.

> Question 3 - Probably doesn't matter, but how can I see which files
> (or metadata to files) the 40 current bad sectors are in?  (On extX,
> I'd use tune2fs and debugfs to be able to see this information.)

Read all the files in the system and syslog will report it.  But really don't 
do that until after you have copied the disk.

> I do have hourly snapshots, from when it was properly running, so once
> I'm that far in the process, I can also compare the most recent
> snapshots, and see if there's any changes that happened to files that
> shouldn't have.

Snapshots refer to the same data blocks, so if a data block is corrupted in a 
way that BTRFS doesn't notice (which should be almost impossible) then all 
snapshots will have it.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: BTRFS as image store for KVM?

2015-10-03 Thread Russell Coker
On Fri, 2 Oct 2015 10:07:24 PM Austin S Hemmelgarn wrote:
> > ARC presumably worked better than the other Solaris caching options.  It
> > was ported to Linux with zfsonlinux because that was the easy way of
> > doing it.
> 
> Actually, I think part of that was also the fact that ZFS is a COW 
> filesystem, and classical LRU caching (like the regular Linux pagecache) 
> often does horribly with COW workloads (and I'm relatively convinced 
> that this is a significant part of why BTRFS has such horrible 
> performance compared to ZFS).

Last time I checked a BTRFS RAID-1 filesystem would assign each process to read 
from one disk based on its PID.  Every RAID-1 implementation that has any 
sort of performance optimisation will allow a single process that's reading to 
use both disks to some extent.

When the BTRFS developers spend some serious effort optimising for performance 
it will be useful to compare BTRFS and ZFS.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


qgroup problem

2015-10-02 Thread Russell Coker
(sysadm_t:SystemLow-SystemHigh)root@unstable:~/pol# btrfs qgroup show -r -e /tmp
qgroupid rfer       excl       max_rfer  max_excl
-------- ----       ----       --------  --------
0/5      1647689728 1647689728 0         0
0/258    16384      16384      524288000 0
0/259    16384      16384      536870912 0
0/261    0          0          0         0
0/262    0          0          0         0
(sysadm_t:SystemLow-SystemHigh)root@unstable:~/pol# btrfs subvol list /|grep tmp
ID 259 gen 161316 top level 5 path tmp
(sysadm_t:SystemLow-SystemHigh)root@unstable:~/pol# uname -a
Linux unstable 4.2.0-1-amd64 #1 SMP Debian 4.2.1-2 (2015-09-27) x86_64 
GNU/Linux

Above is from a SE Linux test machine running BTRFS with quotas on kernel 
4.2.1.

# scp selinux-policy-default_2.20140421-11.1_all.deb bofh@unstable:/tmp
scp: /tmp/selinux-policy-default_2.20140421-11.1_all.deb: Disk quota exceeded

Above is the result of trying to scp a 2MB file to /tmp.

Is there anything I'm doing wrong or is quotas broken?
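
For reference, the only limit in play appears to be the 512MB max_rfer on 
qgroup 0/259 shown above; if that limit is somehow what's being hit it can be 
raised or removed (a sketch, nothing more):

# Remove the referenced-size limit on the /tmp subvolume's qgroup.
btrfs qgroup limit none 0/259 /tmp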

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: strange i/o errors with btrfs on raid/lvm

2015-10-01 Thread Russell Coker
On Sat, 26 Sep 2015 06:47:26 AM Chris Murphy wrote:
> And then
> 
> Aug 28 17:06:49 host mdadm[2751]: RebuildFinished event detected on md
> device /dev/md/0, component device  mismatches found: 2048 (on raid
> level 10)
> Aug 28 17:06:49 host mdadm[2751]: SpareActive event detected on md
> device /dev/md/0, component device /dev/sdd1

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919#41

For Linux Software RAID-1 it's expected that you get a multiple of 128 
mismatches found every time you do a scan.  I don't know why that is, but 
systems running like that don't appear to have any problems.  This even 
appears to happen on arrays that aren't in any noticeable use (EG swap on a 
server that has heaps of RAM so swap doesn't get used).  The above URL 
describes part of the problem but if that was the entire explanation then I'd 
expect multiples of 8 not 128.  Also the explanation that Neil offers doesn't 
seem to cover the case of 10,000+ mismatches occurring in the weekly scrubs.

Linux Software RAID-10 works the same way in this regard.  So I'm not sure 
that the log entries in question mean anything other than a known bug in Linux 
Software RAID.
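
For anyone who wants to check this on their own arrays, the counter is in 
sysfs (md0 is an example):

# Start a check pass, then read the mismatch count once /proc/mdstat shows
# that the check has finished.
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt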

I have servers that have been running well for years while repeatedly 
reporting such errors on Linux Software RAID-1 and not getting any FSCK 
problems.  Note that the servers in question don't run BTRFS because years ago 
it really wasn't suitable for server use.  It is possible that there are 
errors that aren't detected by e2fsck, don't result in corruption of gzip 
compressed data, don't give MySQL data consistency errors, and don't corrupt 
any user data in a noticeable way.  But it seems to be evidence in support of 
the theory that such mismatches don't matter.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: BTRFS as image store for KVM?

2015-10-01 Thread Russell Coker
On Sat, 26 Sep 2015 12:20:41 AM Austin S Hemmelgarn wrote:
> > FYI:
> > Linux pagecache use LRU cache algo, and in general case it's working good
> > enough
> 
> I'd argue that 'general usage' should be better defined in this 
> statement.  Obviously, ZFS's ARC implementation provides better 
> performance in a significant number of common use cases for Linux, 
> otherwise people wouldn't be using it to the degree they are.

No-one gets a free choice about this.  I have a number of servers running ZFS 
because I needed the data consistency features and BTRFS wasn't ready.  There 
is no choice of LRU vs ARC once you've made the BTRFS vs ZFS decision.

ARC presumably worked better than the other Solaris caching options.  It was 
ported to Linux with zfsonlinux because that was the easy way of doing it.

Some people here have reported that ARC worked well for them on Linux.  My 
experience was that the zfsonlinux kernel modules wouldn't respect the module 
load options to reduce the size of the ARC and the default size would cause 
smaller servers to have kernel panics due to lack of RAM.  My solution to that 
problem was to get more RAM for all ZFS servers as buying RAM is cheaper for 
my clients than paying me to diagnose the problems with ZFS.
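
For reference, the knob in question is the zfs_arc_max module parameter, which 
is normally set with something like the following (the 1GB value is only an 
example):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=1073741824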

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: BTRFS as image store for KVM?

2015-09-23 Thread Russell Coker
On Sat, 19 Sep 2015 12:13:29 AM Austin S Hemmelgarn wrote:
> The other option (which for some reason I almost never see anyone
> suggest), is to expose 2 disks to the guest (ideally stored on different
> filesystems), and do BTRFS raid1 on top of that.  In general, this is
> what I do (except I use LVM for the storage back-end instead of a
> filesystem) when I have data integrity requirements in the guest.  On
> the other hand of course, most of my VM's are trivial for me to
> recreate, so I don't often need this and just use DM-RAID via LVm.

I used to do that.  But it was very fiddly and snapshotting the virtual machine 
images required making a snapshot of half a RAID-1 array via LVM (or 
snapshotting both when the virtual machine wasn't running).

Now I just have a single big BTRFS RAID-1 filesystem and use regular files for 
the virtual machine images with the Ext3 filesystem.

On Sun, 20 Sep 2015 11:26:26 AM Jim Salter wrote:
> Performance will be fantastic... except when it's completely abysmal.  
> When I tried it, I also ended up with a completely borked (btrfs-raid1) 
> filesystem that would only mount read-only and read at hideously reduced 
> speeds after about a year of usage in a small office environment.  Did 
> not make me happy.

I've found performance to be acceptable, not great (you can't expect great 
performance from such things) but good enough for lightly loaded servers and 
test systems.

I even ran a training session on BTRFS and ZFS filesystems with the images 
stored on a BTRFS RAID-1 (of 15,000rpm SAS disks).  When more than 3 students 
ran a scrub at the same time performance dropped but it was mostly usable and 
there were no complaints.  Admittedly that server hit a BTRFS bug and needed 
"reboot -nf" half way through, but I don't think that was a BTRFS virtual 
machine issue, rather it was a more general BTRFS under load issue.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: BTRFS as image store for KVM?

2015-09-23 Thread Russell Coker
On Fri, 18 Sep 2015 12:00:15 PM Duncan wrote:
> The caveat here is that if the VM/DB is active during the backups (btrfs 
> send/receive or other), it'll still COW1 any writes during the existence 
> of the btrfs snapshot.  If the backup can be scheduled during VM/DB 
> downtime or at least when activity is very low, the relatively short COW1 
> time should avoid serious fragmentation, but if not, even only relatively 
> temporary snapshots are likely to trigger noticeable cow1 fragmentation 
> issues eventually.

One relevant issue for this is whether the working set of the database fits 
into RAM.  RAM has been getting bigger and cheaper while databases I run 
haven't been getting bigger.  Now every database I run has a working set that 
fits into RAM so read performance (and therefore fragmentation) doesn't matter 
for me except when rebooting - and database servers don't get rebooted that 
often.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Understanding BTRFS storage

2015-09-01 Thread Russell Coker
On Fri, 28 Aug 2015 07:35:02 PM Hugo Mills wrote:
>   On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
> > Running a traditional raid5 array of that size is statistically
> > guaranteed to fail in the event of a rebuild.
> 
>Except that if it were, you wouldn't see anyone running RAID-5
> arrays of that size and (considerably) larger. And successfully
> replacing devices in them.

Let's not assume that everyone who thinks that they are "successfully" running 
a RAID-5 array is actually doing so.

One of the features of BTRFS is that you won't get undetected data corruption.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Questions on use of NOCOW impact to subvolumes and snapshots

2015-08-20 Thread Russell Coker
On Thu, 20 Aug 2015 10:09:26 PM Austin S Hemmelgarn wrote:
> > 2:  Out of curiosity, why is data checksumming tied to COW?
> 
> There's no safe way to sanely handle checksumming without COW, because
> there is no way (at least on current hardware) to ensure that the data
> block and the checksums both get written at the exact same time, and
> that one of the writes aborting will cause the other too do so as well.
> In-place compression is disabled for nodatasum files for essentially
> the same reason (although compression can cause much worse data loss
> than a failed checksum).

A journaling filesystem could have checksums on data blocks.  If Ext4 was 
modified to have checksums it would do that.

But given that a major feature of BTRFS is snapshots it doesn't make much 
sense to implement a separate way of managing checksums.  I think that ZIL is 
the correct way of solving these problems.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: Questions on use of NOCOW impact to subvolumes and snapshots

2015-08-19 Thread Russell Coker
On Thu, 20 Aug 2015 11:55:43 AM Chris Murphy wrote:
> > Question 1:  If I apply the NOCOW attribute to a file or directory, how
> > does that affect my ability to run btrfs scrub?
> 
> nodatacow includes nodatasum and no compression. So it means these
> files are presently immune from scrub check and repair so long as it's
> based on checksums. I don't know if raid56 scrub compares to parity
> and recomputes parity (assumes data is correct), absent checksums,
> which would be similar to how md raid 56 does it.

Linux Software RAID could recreate a mismatched block from RAID-6 parity but 
doesn't do so.  It could be that a block was changed correctly but didn't get 
the parity data written, so such a correction would be reverting a change.  So 
Linux Software RAID only regenerates parity for a scrub and makes both disks 
have the same data for RAID-1.

There's no good solution to these problems without doing the sorts of things 
that WAFL, ZFS, and BTRFS do.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


lockup

2015-08-14 Thread Russell Coker
I have a Xen server with 14 DomUs that are being used for BTRFS and ZFS 
training.  About 5 people are corrupting virtual disks and scrubbing them, so 
there is a lot of IO.

All the virtual machine disk images are snapshots of a master image with copy 
on write.  I just had the following error, which ended with an NMI.  I copied 
what I could.  It's running the latest Debian/Jessie kernel 3.16.7.

[15780.056002] Code: 44 24 10 e9 1c ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 
66 
66 66 90 41 54 55 48 89 fd 53 4c 8b 67 50 66 66 66 66 90 f0 ff 4d 4c 74 35 5b 
5d 41 5c c3 48 8b 1d a9 07 07 00 48 85 db 74 1c 48 8b 
[15808.056003] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-i38:22730]
[15808.056003] Modules linked in: xt_tcpudp xt_physdev iptable_filter ip_tables 
x_tables xen_netback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss 
oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc ext4 crc16 
mbcache jbd2 ppdev psmouse serio_raw pcspkr k8temp joydev evdev ipmi_si ns558 
gameport parport_pc parport ipmi_msghandler snd_mpu401_uart snd_rawmidi 
snd_seq_device snd processor button soundcore edac_mce_amd edac_core 
i2c_nforce2 i2c_core shpchp thermal_sys loop autofs4 crc32c_generic btrfs xor 
raid6_pq raid1 md_mod sd_mod crc_t10dif crct10dif_generic crct10dif_common 
hid_generic usbhid hid sg sr_mod cdrom ata_generic ohci_pci mptsas 
scsi_transport_sas mptscsih mptbase e1000 pata_amd ehci_pci ohci_hcd ehci_hcd 
libata forcedeth scsi_mod usbcore usb_common
[15808.056003] CPU: 1 PID: 22730 Comm: qemu-system-i38 Not tainted 3.16.0-4-
amd64 #1 Debian 3.16.7-ckt11-1+deb8u3
[15808.056003] Hardware name: Sun Microsystems Sun Fire X4100 M2/Sun Fire 
X4100 M2, BIOS 0ABJX102 11/03/2008
[15808.056003] task: 8812e010 ti: 880001e9c000 task.ti: 
880001e9c000
[15808.056003] RIP: e030:[a024edb9]  [a024edb9] 
btrfs_put_ordered_extent+0x19/0xc0 [btrfs]
[15808.056003] RSP: e02b:880001e9fe08  EFLAGS: 0202
[15808.056003] RAX: 0583 RBX: 88000a4f0580 RCX: 06a4
[15808.056003] RDX: 88000a4f0580 RSI: 88000a4f0508 RDI: 88000a4f0508
[15808.056003] RBP: 88000a4f0508 R08: 88000a4f0560 R09: 8800502f29b0
[15808.056003] R10: 7ff0 R11: 0005 R12: 880053821950
[15808.056003] R13: 88000a4f0508 R14: 880004f7cf00 R15: 880001e9fe50
[15808.056003] FS:  7fdc312f5700() GS:88007744() 
knlGS:
[15808.056003] CS:  e033 DS:  ES:  CR0: 8005003b
[15808.056003] CR2: 7f4af0c74000 CR3: 2e534000 CR4: 
0660
[15808.056003] Stack:
[15808.056003]  88000a4f0580 880052d76800 880002503800 
a02342f4
[15808.056003]  880004f7cfa8 880002503000  
a02881e2
[15808.056003]  8800  880052d76800 
88000b7f7b18
[15808.056003] Call Trace:
[15808.056003]  [a02342f4] ? btrfs_wait_pending_ordered+0xc4/0x100 
[btrfs]
[15808.056003]  [a02881e2] ? __btrfs_run_delayed_items+0xf2/0x1d0 
[btrfs]
[15808.056003]  [a0236356] ? btrfs_commit_transaction+0x2d6/0xa10 
[btrfs]
[15808.056003]  [810a7a40] ? prepare_to_wait_event+0xf0/0xf0
[15808.056003]  [a0246529] ? btrfs_sync_file+0x1c9/0x2f0 [btrfs]
[15808.056003]  [811d53cb] ? do_fsync+0x4b/0x70
[15808.056003]  [811d564f] ? SyS_fdatasync+0xf/0x20
[15808.056003]  [8151158d] ? system_call_fast_compare_end+0x10/0x15
[15808.056003] Code: 44 24 10 e9 1c ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 
66 
66 66 90 41 54 55 48 89 fd 53 4c 8b 67 50 66 66 66 66 90 f0 ff 4d 4c 74 35 5b 
5d 41 5c c3 48 8b 1d a9 07 07 00 48 85 db 74 1c 48 8b 
[15818.440002] INFO: rcu_sched self-detected stall on CPU { 1}  (t=68266 
jiffies 
g=236497 c=236496 q=6784)
[15818.440002] sending NMI to all CPUs:
[15818.440002] NMI backtrace for cpu 1
[15818.440002] CPU: 1 PID: 22730 Comm: qemu-system-i38 Not tainted 3.16.0-4-
amd64 #1 Debian 3.16.7-ckt11-1+deb8u3
[15818.440002] Hardware name: Sun Microsystems Sun Fire X4100 M2/Sun Fire 
X4100 M2, BIOS 0ABJX102 11/03/2008
[15818.440002] task: 8812e010 ti: 880001e9c000 task.ti: 
880001e9c000
[15818.440002] RIP: e030:[8100130a]  [8100130a] 
xen_hypercall_vcpu_op+0xa/0x20
[15818.440002] RSP: e02b:880077443cc8  EFLAGS: 0046
[15818.440002] RAX:  RBX: 0001 RCX: 8100130a
[15818.440002] RDX:  RSI: 0001 RDI: 
000b
[15818.440002] RBP: 818e2900 R08: 818e23e0 R09: 880bcc40
[15818.440002] R10: 0855 R11: 0246 R12: 818e23e0
[15818.440002] R13: 0005 R14: 1a80 R15: 81853680
[15818.440002] FS:  7fdc312f5700() GS:88007744() 
knlGS:
[15818.440002] CS:  e033 DS:  ES:  CR0: 8005003b
[15818.440002] 

can we make balance delete missing devices?

2015-08-14 Thread Russell Coker
[ 2918.502237] BTRFS info (device loop1): disk space caching is enabled
[ 2918.503213] BTRFS: failed to read chunk tree on loop1
[ 2918.540082] BTRFS: open_ctree failed

I just had a test RAID-1 filesystem with a missing device.  I mounted it with 
the degraded option and added a new device.  I balanced it (to make it do 
RAID-1 again) and thought everything was good.  Then when I tried to mount it 
again it gave errors such as the above (not sure why).  Then I tried wiping 
/dev/loop1 and it refused to mount entirely due to having 2 missing devices.

Obviously it was my mistake to not remove the missing device, and wiping 
/dev/loop1 was a bad idea.  Failing to remove a missing device seems likely to 
be a common mistake.  Could we make the balance operation automatically delete 
the missing device?  I can't imagine a situation in which a balance would be 
desired but deleting the missing device wouldn't be desired.
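
For anyone else who hits this, the sequence that avoids the trap is roughly the 
following (device names are examples):

mount -o degraded /dev/loop0 /mnt
btrfs device add /dev/loop2 /mnt
# This is the step that is easy to forget: remove the record of the missing
# device before relying on the filesystem again.
btrfs device delete missing /mnt
btrfs balance start /mnt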

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


lack of scrub error data

2015-08-13 Thread Russell Coker
Below is the result of testing a corrupted filesystem.  What's going on here?  
The kernel message log and the btrfs output don't tell me how many errors 
there were.  Also the data is RAID-0 (the default for a filesystem created with 
2 devices), so if the corruption had hit a data area some data should have been 
lost; no data was lost, so apparently it didn't.

Debian kernel 4.0.8-2 and btrfs-tools 4.0-2.

# cp -r /lib /mnt/tmp
# btrfs scrub start -B /mnt/tmp
scrub done for c0992a26-3c7f-47b0-b8de-878dfe197469
scrub started at Fri Aug 14 10:17:53 2015 and finished after 0 seconds
total bytes scrubbed: 372.82MiB with 0 errors
# dd if=/dev/zero of=/dev/loop0 bs=1024k count=20 seek=100
20+0 records in
20+0 records out
20971520 bytes (21 MB) copied, 0.0267078 s, 785 MB/s
# btrfs scrub start -B /mnt/tmp
scrub done for c0992a26-3c7f-47b0-b8de-878dfe197469
scrub started at Fri Aug 14 10:19:15 2015 and finished after 0 seconds
total bytes scrubbed: 415.04MiB with 0 errors
WARNING: errors detected during scrubbing, corrected.
# btrfs scrub start -B /mnt/tmp
scrub done for c0992a26-3c7f-47b0-b8de-878dfe197469 
scrub started at Fri Aug 14 10:20:25 2015 and finished after 0 seconds
total bytes scrubbed: 415.04MiB with 0 errors
# btrfs fi df /mnt/tmp
Data, RAID0: total=800.00MiB, used=400.23MiB
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=400.00MiB, used=7.39MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
# btrfs device stats /mnt/tmp
[/dev/loop0].write_io_errs   0
[/dev/loop0].read_io_errs0
[/dev/loop0].flush_io_errs   0
[/dev/loop0].corruption_errs 0
[/dev/loop0].generation_errs 0
[/dev/loop1].write_io_errs   0
[/dev/loop1].read_io_errs0
[/dev/loop1].flush_io_errs   0
[/dev/loop1].corruption_errs 0
[/dev/loop1].generation_errs 0
# diff -ru /lib /mnt/tmp/lib
#

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: btrfs raid1 metadata, single data

2015-08-07 Thread Russell Coker
On Fri, 7 Aug 2015 06:49:58 PM Robert Krig wrote:
> What exactly is contained in btrfs metadata?

Much the same as in metadata for every other filesystem.

> I've read about some users setting up their btrfs volumes as
> data=single, but metadata=raid1
> 
> Is there any actual benefit to that? I mean, if you keep your data as
> single, but have multiple copies of metadata, does that still allow you
> to recover from data corruption? Or is metadata redundancy a benefit to
> ensure that your btrfs volume remains mountable/readable?

If you have redundant metadata and experience corruption then you will know 
the name of every file that has data corruption, which is really good for 
restoring from backup.  Also you will be protected against corruption of a 
root directory causing massive data loss.

If you have the bad luck to have certain metadata structures corrupted with no 
redundancy then you can face massive data loss and possibly have the entire 
filesystem become at least temporarily unusable.  While corruption of the root 
directory is unlikely it is possible.  With dup metadata I've seen a BTRFS 
filesystem remain usable after 12,000+ read errors.
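
For concreteness, the layout being discussed is just a mkfs profile choice 
(device names are examples):

mkfs.btrfs -m raid1 -d single /dev/sdb /dev/sdc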

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: mount btrfs takes 30 minutes, btrfs check runs out of memory

2015-08-01 Thread Russell Coker
On Sat, 1 Aug 2015 02:35:39 PM John Ettedgui wrote:
> > It seems that you're using Chromium while doing the dump. :)
> > If no CD drive, I'll recommend to use Archlinux installation iso to make
> > a bootable USB stick and do the dump.
> > (just download and dd would do the trick)
> > As its kernel and tools is much newer than most distribution.
> 
> So I did not have any usb sticks large enough for this task (only 4Gb)
> so I restarted into emergency runlevel with only / mounted and as ro,
> I hope that'll do.

The Debian/Jessie Netinst image is about 120M and allows you to launch a 
shell.  If you want a newer kernel you could rebuild the Debian Netinst 
yourself.

Also a basic text-only Linux installation takes a lot less than 4G of storage.  
I have a couple of 1G USB sticks with Debian installed that I use to fix 
things.
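
Writing such an image to a stick is just a dd; double-check the device name 
first (the ISO file name and /dev/sdX below are examples):

dd if=debian-netinst.iso of=/dev/sdX bs=4M
sync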

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: INFO: task btrfs-transacti:204 blocked for more than 120 seconds. (more like 8+min)

2015-07-30 Thread Russell Coker
On Fri, 24 Jul 2015 11:11:22 AM Duncan wrote:
> The option is mem=nn[KMG].  You may also need memmap=, presumably
> memmap=nn[KMG]$ss[KMG], to reserve the unused memory area, preventing its
> use for PCI address space, since that would collide with the physical
> memory that's there but unused due to mem=.
> 
> That should let you test with mem=2G, so double-memory becomes 4G. =:^)

That's a good thing to note.

> Meanwhile, does bonnie do pre-allocation for its tests?

http://www.coker.com.au/bonnie++/readme.html

No.  The above URL gives details on the tests.

> > What I did see from years ago seemed to be that you'd have to disable
> > COW where you knew there would be large files.  I'm really hoping
> > there's a way to avoid this type of locking, because I don't think I'd
> > be comfortable knowing a non-root user could bomb the system with a
> > large file in the wrong area.
> 
> The problem with cow isn't large files in general, it's rewrites into the
> middle of them (as opposed to append-writes).  If the writes are
> sequential appends, or if it's write-one-read-many, cow on large files
> doesn't tend to be an issue.

The Bonnie++ rewrite test might be a pathological case for BTRFS.  But it's a 
test that other filesystems have handled for decades.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: INFO: task btrfs-transacti:204 blocked for more than 120 seconds. (more like 8+min)

2015-07-30 Thread Russell Coker
On Fri, 24 Jul 2015 05:12:38 AM james harvey wrote:
> I started trying to run with a -s 4G option, to use 4GB files for
> performance measuring.  It refused to run, and said file size should
> be double RAM for good results.  I sighed, removed the option, and
> let it run, defaulting to **64GB files**.  So, yeah, big files.  But,
> I do work with Photoshop .PSB files that get that large.

You can use the -r0 option to stop it insisting on twice the RAM size.  
However, if the files are smaller than twice the RAM then the test results 
will be unrealistic, as read requests will mostly be satisfied from 
cache.
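
For example, an invocation along those lines might look like this (a sketch; 
check bonnie++(8) for your version, and /mnt/test is a placeholder):

# -d test directory, -s file size, -r 0 skips the RAM check, -u user to run as
bonnie++ -d /mnt/test -s 4g -r 0 -u root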

 During the first two lines (Writing intelligently... and
 Rewriting... the filesystem seems to be completely locked out for
 anything other than bonnie++.  KDE stops being able to switch focus,
 change tasks.  Can switch to tty's and log in, do things like ls,
 but attempting to write to the filesystem hangs.  Can switch back to
 KDE, but screen is black with cursor until bonnie++ completes.  top
 didn't show excessive CPU usage.

That sort of problem isn't unique to BTRFS.  BTRFS has had little performance 
optimisation so it might be worse than other filesystems in that regard.  But 
on any filesystem you can expect situations where one process that is doing 
non-stop writes fills up buffers and starves other processes.

Note that when a single disk access takes 8000ms+ (more than 8 seconds) then 
high level operations involving multiple files will take much longer.

 I think the Writing intelligently phase is sequential, and the old
 references I saw were regarding many re-writes sporadically in the
 middle.

The "Writing intelligently" phase is sequential writes; the "Rewriting" phase 
reads and rewrites the file sequentially.

 What I did see from years ago seemed to be that you'd have to disable
 COW where you knew there would be large files.  I'm really hoping
 there's a way to avoid this type of locking, because I don't think I'd
 be comfortable knowing a non-root user could bomb the system with a
 large file in the wrong area.

Disabling CoW won't solve all issues related to sharing disk IO capacity 
between users.  Also disabling CoW will remove all BTRFS benefits apart from 
subvols, and subvols aren't that useful when snapshots aren't an option.

 IF I do HAVE to disable COW, I know I can do it selectively.  But, if
 I did it everywhere... Which in that situation I would, because I
 can't afford to run into many minute long lockups on a mistake... I
 lose compression, right?  Do I lose snapshots?  (Assume so, but hope
 I'm wrong.)  What else do I lose?  Is there any advantage running
 btrfs without COW anywhere over other filesystems?

I believe that when you disable CoW and then take a snapshot, each block still 
gets one CoW operation the first time it is written after the snapshot.
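
For reference, the usual way to disable CoW selectively is per-directory (a 
sketch; the path is made up, and the +C attribute must be set before the files 
are created):

# new files created in this directory will be NOCOW
chattr +C /data/bigfiles
# verify the attribute
lsattr -d /data/bigfiles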

 How would one even know where the division is between a file small
 enough to allow on btrfs, vs one not to?

http://doc.coker.com.au/projects/memlockd/

If a hostile user wrote a program that used fsync() they could reproduce such 
problems with much smaller files.  My memlockd program alleviates such problems 
by locking the pages of important programs and libraries into RAM.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1: system stability

2015-07-22 Thread Russell Coker
On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:
 OK I actually don't know what the intended block layer behavior is
 when unplugging a device, if it is supposed to vanish, or change state
 somehow so that thing that depend on it can know it's missing or
 what. So the question here is, is this working as intended? If the
 layer Btrfs depends on isn't working as intended, then Btrfs is
 probably going to do wild and crazy things. And I don't know that the
 part of the block layer Btrfs depends on for this is the same (or
 different) as what the md driver depends on.

I disagree with that statement.  BTRFS should be expected to not do wild and 
crazy things regardless of what happens with block devices.

A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning 
any manner of corrupted data and should not lose data or panic the kernel.

A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by 
mounting read-only or failing all operations on the filesystem.  It should not 
affect any other filesystem or have any significant impact on the system unless 
it's the root filesystem.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


why does df spin up disks?

2015-06-28 Thread Russell Coker
When I have a mounted filesystem why doesn't the kernel store the amount of 
free space?  Why does it need to spin up a disk that had been spun down?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS RAID5 filesystem corruption during balance

2015-06-19 Thread Russell Coker
On Sun, 24 May 2015 01:02:21 AM Jan Voet wrote:
 Doing a 'btrfs balance cancel' immediately after the array was mounted
 seems to have done the trick.  A subsequent 'btrfs check' didn't show any
 errors at all and all the data seems to be there.  :-)

I add rootflags=skip_balance to the kernel command-line of all my Debian 
systems to solve this.  I've had problems with the balance resuming in the 
past which had similar results.  I've also never seen a situation where 
resuming the balance did any good.
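
On Debian with GRUB that is roughly the following (a sketch, using the usual 
Debian config locations):

# /etc/default/grub
GRUB_CMDLINE_LINUX="rootflags=skip_balance"
# then regenerate the boot config
update-grub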

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in


Re: Carefully crafted BTRFS-image causes kernel to crash

2015-04-21 Thread Russell Coker
On Tue, 21 Apr 2015, Qu Wenruo quwen...@cn.fujitsu.com wrote:
 Although we may add extra check for such problem to improve robustness,
 but IMHO it's not a real world problem.

Some of the ReiserFS developers gave a similar reaction to some of my bug 
reports.  ReiserFS wasn't the most robust filesystem.

I think that it should be EXPECTED that a kernel will have to occasionally deal 
with filesystem images that are created by hostile parties.  Userspace crashes 
and kernel freezes are not a suitable way of dealing with them.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: The FAQ on fsync/O_SYNC

2015-04-19 Thread Russell Coker
On Mon, 20 Apr 2015, Craig Ringer cr...@2ndquadrant.com wrote:
 PostgreSQL is its self copy-on-write (because of multi-version
 concurrency control), so it doesn't make much sense to have the FS
 doing another layer of COW.

That's a matter of opinion.

I think it's great if PostgreSQL can do internal checksums and error 
correction.  But I'd rather not have to test that functionality in the field.

Really I prefer to have the ZFS copies= option for databases.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to clone a btrfs filesystem

2015-04-18 Thread Russell Coker
On Sat, 18 Apr 2015, Christoph Anton Mitterer cales...@scientia.net wrote:
 On Sat, 2015-04-18 at 04:24 +, Russell Coker wrote:
  dd works.  ;)
  
  There are patches to rsync that make it work on block devices.  Of course
  that will copy space occupied by deleted files too.
 
 I think both are not quite the solutions I was looking for.

I know, but I don't think what you want is possible at this time.

 Guess for dd this is obvious, but for rsync I'd also loose all btrfs
 features like checksum verifications,... and even if these patches you
 mention would make it work on block devices, I'd guess it would at least
 need to read everything, it would not longer be a merge into another
 filesystem (perhaps I shouldn't have written clone)... and the target
 block device would need to have at least the size of the origin.

An rsync of the block device wouldn't lose BTRFS checksums; you could run a scrub 
on the target at any time to verify them.  For dd, or anything based on it, the 
target needs to be at least as big as the source.  But typical use of BTRFS for 
backup devices tends to involve keeping as many snapshots as possible without 
running out of space, which means that no matter how you copy it the target 
would need to be just as big.

 Can't one do something like the following:
 1) The source fs has several snapshots and subvols.
The target fs is empty (the first time).
 
 
 For the first time populating the target fs:
 
 2) Make ro snapshots of all non-ro snapshots and subvols on the
 source-fs.
 
 3) Send/receive the first of the ro snapshots to the target fs, with no
 parent and no clone-src.
 
 4) Send/receive all further ro snapshots to the target fs, with no
 parents, but each time specifying one further clone-src (i.e. all that
 have already been sent/received) so that they're used for reflinks and
 so on
 
 5) At the end somehow make rw-subvols from the snapshots/subvols that
 have been previously rw (how?).

Sure, for 5 I believe you can make a rw snapshot of a ro subvol.

 Does that sound as if it would somehow work like that? Especially would
 it preserve all the reflink statuses and everything else (sparse files,
 etc.)

Yes, but it would take a bit of scripting work.
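
A minimal sketch of that scripting (subvol names are made up; each later send 
lists the already-transferred snapshots as clone sources with -c):

# 1) read-only snapshots on the source
btrfs subvolume snapshot -r /src/vol1 /src/vol1.ro
btrfs subvolume snapshot -r /src/vol2 /src/vol2.ro
# 2) first snapshot with no parent and no clone source
btrfs send /src/vol1.ro | btrfs receive /dst
# 3) later snapshots reference what is already on the target
btrfs send -c /src/vol1.ro /src/vol2.ro | btrfs receive /dst
# 4) writable snapshot of a received read-only subvol
btrfs subvolume snapshot /dst/vol2.ro /dst/vol2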

 Some additional questions:
 a) Can btrfs send change anything(!) on the source fs?
 b) Can one abort (Ctrl-C) a send and/or receive... and make it continue
 at the same place were it was stopped?

(a) yes.  (b) I don't know.

Also I'm not personally inclined to trust send/recv at this time.  I don't 
think it's had a lot of testing.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to clone a btrfs filesystem

2015-04-17 Thread Russell Coker
On Fri, 17 Apr 2015 11:08:44 PM Christoph Anton Mitterer wrote:
 How can I best copy one btrfs filesystem (with snapshots and subvolumes)
 into another, especially with keeping the CoW/reflink status of all
 files?

dd works.  ;)

 And ideally incrementally upgrade it later (again with all
 snapshots/subvols, and again not loosing the shared blocks between these
 files).

There are patches to rsync that make it work on block devices.  Of course that 
will copy space occupied by deleted files too.
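
For example (a sketch; device names are placeholders, and --copy-devices comes 
from the out-of-tree rsync patch rather than stock rsync):

# whole-device copy, including free space and deleted data
dd if=/dev/sdX of=/dev/sdY bs=64M conv=fsync
# with the block-device patch applied to rsync
rsync --copy-devices -v /dev/sdX /dev/sdY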

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


directory defrag

2015-04-14 Thread Russell Coker
The current defragmentation options seem to only support defragmenting named 
files/directories or a recursive defragmentation of files and directories.

I'd like to recursively defragment just the directories.  One of my systems has a 
large number of large files; the files are write-once and read performance is good 
enough.  However the performance of "ls -al" is often very poor, presumably due to 
metadata fragmentation.

The other thing I'd still like is the ability to force all metadata allocation 
to be from specified disks.  I'd like to have a pair of SSDs for RAID-1 storage 
of metadata and a set of hard drives for RAID-1 storage of data.
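
For the directory case above, the closest I'm aware of with the current tools is 
something like this (a sketch; the path is made up), since defragmenting a 
directory without -r only touches the directory itself:

# defragment only the directory items, not the file contents
find /big/archive -type d -exec btrfs filesystem defragment {} +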

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: du accuracy

2015-04-12 Thread Russell Coker
On Fri, 10 Apr 2015, Liu Bo bo.li@oracle.com wrote:
 Above are some consecutive du runs.  Why does the space used go from 1.2G
 to  1.1G before going up again?  The file was created by "cat /dev/sde > 
 2gsd" so it definitely wasn't getting smaller.
 
  
 
  What's going on here?
 
 What's your mount options?  with autodefrag or compression?

Linux server 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt4-2 (2015-01-27) x86_64 
GNU/Linux

Above is the kernel I'm using, one of the recent ones from Debian/Jessie.

UUID= /big btrfs skip_balance,relatime,nosuid,nodev 0 0

Above is the /etc/fstab line.

/dev/sdb2 /big btrfs 
rw,seclabel,nosuid,nodev,relatime,space_cache,skip_balance 0 0

Above is the /proc/mounts line.  I'm not doing anything noteworthy here.  

# du -h tmp|tail -1 ; sleep 10 ; du -h tmp|tail -1 
1.6G    tmp
1.4G    tmp

I've just had something similar happen when rsyncing a directory full of files 
to the server and running du to check on the progress.  It's possible that a 
rename could happen at the wrong time and cause a file to be missed in the du 
count, so that case could be legitimate.  But the case described in my previous 
message concerned a single file that was being extended.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


snapshot space use

2015-04-08 Thread Russell Coker
# zfs list -t snapshot
NAME                        USED  AVAIL  REFER  MOUNTPOINT
hetz0/be0-mail@2015-03-10  2.88G  -   387G  -
hetz0/be0-mail@2015-03-11  1.12G  -   388G  -
hetz0/be0-mail@2015-03-12  1.11G  -   388G  -
hetz0/be0-mail@2015-03-13  1.19G  -   388G  -
hetz0/be0-mail@2015-03-14  1.02G  -   388G  -
hetz0/be0-mail@2015-03-15   989M  -   386G  -

Is there any way to do something similar to the above ZFS command?  It's handy 
to know which snapshots are taking up the most space, especially when multiple 
subvols are being snapshotted.
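
The closest thing I'm aware of on BTRFS is qgroups, roughly as follows (a 
sketch; qgroup accounting has overhead and the "excl" column is only an 
approximation of what deleting a snapshot would free):

btrfs quota enable /big
# after the rescan completes, the "excl" column shows space unique to each subvol/snapshot
btrfs qgroup show /big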

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs subvolume default subvolume

2015-04-07 Thread Russell Coker
On Tue, 7 Apr 2015 02:03:04 PM arnaud gaboury wrote:
 Would you mind give the return of # btrfs subvolume list and $ cat
 /etc/fstab ? It would help me. TY

ID 262 gen 1579103 top level 5 path mysql
ID 263 gen 1179578 top level 5 path var/spool/squid
ID 264 gen 1579262 top level 5 path xenstore
ID 265 gen 1579262 top level 5 path mailstore
ID 269 gen 1578978 top level 5 path xenswap
ID 318 gen 1569545 top level 5 path webstore
ID 328 gen 1574457 top level 262 path mysql/backup
ID 329 gen 1569555 top level 264 path xenstore/backup
ID 330 gen 1574457 top level 265 path mailstore/backup
ID 331 gen 1569555 top level 318 path webstore/backup
ID 618 gen 1569546 top level 5 path webdata
ID 657 gen 1569555 top level 618 path webdata/backup
ID 1456 gen 1569541 top level 5 path backup
ID 1621 gen 1556937 top level 330 path mailstore/backup/2015-04-01
ID 1622 gen 1556938 top level 328 path mysql/backup/2015-04-01
ID 1623 gen 1559462 top level 330 path mailstore/backup/2015-04-02
ID 1624 gen 1559463 top level 328 path mysql/backup/2015-04-02
ID 1625 gen 1564402 top level 330 path mailstore/backup/2015-04-03
ID 1626 gen 1564403 top level 328 path mysql/backup/2015-04-03
ID 1627 gen 1566974 top level 330 path mailstore/backup/2015-04-04
ID 1628 gen 1566975 top level 328 path mysql/backup/2015-04-04
ID 1629 gen 1569540 top level 330 path mailstore/backup/2015-04-05
ID 1630 gen 1569543 top 

Re: btrfs subvolume default subvolume

2015-04-07 Thread Russell Coker
On Tue, 7 Apr 2015 10:58:28 AM arnaud gaboury wrote:
 After more reading, it seems to me creating a top root subvolume is
 the right thing to do:
 # btrfs subvolume create root
 # btrfs subvolume create root/var
 # btrfs subvolume create root/home
 
 Am I right?

A filesystem is designed to support arbitrary names for files, directories, and 
(in the case of BTRFS) subvolumes.

You might ask whether /var/log is a good place for log files; it's generally 
agreed that it is, but that's not a question for filesystem developers.

I like to have / in the root subvolume so all subvolumes are available by 
default.  I have /var etc in the same subvolume so I can make an atomic 
snapshot of all such data.  It's not the one true answer, but it works for me 
and will work for other people.

The main thing is to have all data that needs to be atomically snapshotted on 
the same subvolume.
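
For example, the sort of atomic snapshot I mean is just this (a sketch; assumes 
a /snapshots directory on the same filesystem):

# one atomic, read-only snapshot covering / and everything in the same subvolume
btrfs subvolume snapshot -r / /snapshots/root-$(date +%Y-%m-%d)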

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs hangs 3.19-10

2015-04-06 Thread Russell Coker
On Mon, 6 Apr 2015 07:40:03 AM Pavel Volkov wrote:
 On Sunday, April 5, 2015 1:04:17 PM MSK, Hugo Mills wrote:
 That's these, I think:
  #define BTRFS_FEATURE_INCOMPAT_BIG_METADATA (1ULL << 5)
  #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF(1ULL << 6)
  
  so it's definitely -O^extref. I don't see where big_metadata comes
  from, though. That's not a -O option. Try with -O^extref and see where
  that gets you. (Also, don't mount the FS on a newer kernel -- it may
  be setting big metadata automatically, although it probably shouldn't
  do).
 
 By the way, is there any way to see which options are enabled on a local
 filesystem without having to try mounting it with old kernel and checking
 dmesg?

# file -s /dev/dm-0 
/dev/dm-0: BTRFS Filesystem sectorsize 4096, nodesize 4096, leafsize 4096, 
UUID=97d70558-ddea-493e-874c-ff74be9ce099, 92390113280/171796594688 bytes used, 
1 devices

Above is what file(1) reports on my laptop, which has been running BTRFS since 
the 3.2 days.  It gives the node size etc but not the feature flags.  Some time 
ago I submitted a patch to the Debian package that covered everything I could 
figure out; I'm sure they would accept a patch for feature flags if anyone can 
work out how to do it.
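
For a mounted filesystem, recent kernels also expose the enabled features via 
sysfs, which may be the easiest answer to Pavel's question (a sketch; the UUID 
is the one from the file(1) output above):

# each enabled feature appears as a file in this directory
ls /sys/fs/btrfs/97d70558-ddea-493e-874c-ff74be9ce099/features/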

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs hangs 3.19-10

2015-04-06 Thread Russell Coker
On Mon, 6 Apr 2015 03:21:18 AM Duncan wrote:
 So... for 3.2 compatibility, extref must not be enabled (tho it's now the 
 default and AFAIK there's no way to actually disable it, only enable, so 
 an old btrfs-tools would have to be used that doesn't enable it by 
 default), AND the nodesize must be set to page size, 4096 bytes for x86.

So basically we have to have an old version of mkfs.btrfs to make a filesystem 
that can be mounted on kernel 3.2.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs hangs 3.19-10

2015-04-04 Thread Russell Coker
On Fri, 3 Apr 2015 05:14:12 AM Duncan wrote:
 Well, btrfs itself isn't really stable yet...  Stable series should be 
 stable at least to the extent that whatever you're using in them is, but 
 with btrfs itself not yet entirely stable... 

Also for stable operation you want both forward and backward compatibility.  
You could make an Ext3 filesystem and expect that any random ancient Linux box 
you are likely to encounter can read it.  Even Ext4 has been supported for a 
long time and most systems you are likely to encounter won't have any problems 
with it.

I recently made a BTRFS filesystem on a Debian/Jessie system (kernel 3.16.7) 
with default options and discovered that Debian/Wheezy (kernel 3.2.65) can't 
read it.  I think that one criterion for "stable" in a filesystem is that 
kernels from a couple of previous releases can mount it.  By that criterion 
BTRFS won't be stable for use in Debian for about 4 years.

As an aside are there options to mkfs.btrfs that would make a filesystem 
mountable by kernel 3.2.65?  If so I'll file a Debian/Jessie bug report 
requesting that a specific mention be added to the man page.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs hangs 3.19-10

2015-04-04 Thread Russell Coker
On Sun, 5 Apr 2015 03:16:21 AM Duncan wrote:
 Hugo Mills posted on Sat, 04 Apr 2015 13:00:47 + as excerpted:
  On Sat, Apr 04, 2015 at 12:55:08PM +, Russell Coker wrote:
  As an aside are there options to mkfs.btrfs that would make a
  filesystem mountable by kernel 3.2.65?  If so I'll file a Debian/Jessie
  bug report requesting that a specific mention be added to the man page.
  
  Yes, there are. It's probably -O^extref, but if you can show the
  dmesg output from the 3.2 kernel on the failed mount (so that it shows
  what the actual failure was), we should be able to give you a more
  precise answer.

[698190.987065] Btrfs loaded
[698190.92] device fsid 118e2c64-6ce1-4f21-85e2-2d6aea8f0fa5 devid 1 transid 426 /dev/sdf1
[698191.000981] btrfs: disk space caching is enabled
[698191.000986] BTRFS: couldn't mount because of unsupported optional features (60).
[698191.018176] btrfs: open_ctree failed

 So I was thinking about this, and the several other earlier options where
 support wasn't added until kernel X, and had an idea...
 
 How easy and useful might it be to add to mkfs.btrfs appropriate option-
 group aliases such that if one knew the oldest kernel one was likely to
 deal with, all one would need to do for the mkfs would be to set for
 example, -O3.2, or even simply --3.2 (or maybe even --32), and have
 mkfs.btrfs automatically set/unset the appropriate options so it would
 just work with that kernel and anything newer?

That would be really useful.  Also it would be good if the code structure 
allowed adding extra aliases, so for Debian we could add an option -Owheezy.
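
Until such aliases exist, my understanding of the wheezy-compatible invocation is 
roughly the following (a sketch; option names vary between btrfs-progs versions, 
so check mkfs.btrfs(8)):

# disable extended inode refs and keep the node size at the page size
mkfs.btrfs -O ^extref -n 4096 /dev/sdX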

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs balance fails with no space errors (despite having plenty)

2015-03-24 Thread Russell Coker
I've been having ongoing issues with balance failing with "no space" errors in 
spite of having plenty.  Strangely it seems to happen most often from cron jobs; 
when a cron job fails I can count on a manual balance succeeding.

I'm running the latest Debian/Jessie kernel.
-- 
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs fstrim status (RAID5)

2015-03-24 Thread Russell Coker
Debian/Wheezy userspace can't be expected to work as well as desired with a 
3.19 kernel.

Wheezy with BTRFS single or RAID-1 works reasonably well as long as you have 
lots of free space, balance it regularly, and configure it not to resume a 
balance on reboot.

Debian/Jessie works well with BTRFS single or RAID-1 without any issues. But 
don't use RAID 5/6. Jessie userspace should work well with 3.19 and 4.0 kernels.

On March 23, 2015 4:38:49 AM GMT+11:00, Hugo Mills h...@carfax.org.uk wrote:
On Sun, Mar 22, 2015 at 05:30:59PM +, scruffters wrote:
 Hello list,
 
 I wondered if anybody could advise on the status of using fstrim with
 BTRFS. The motivation for me is to finally try using SSD's in a R5
 style configuration with some way of running maintenance overnight to
 maintain performance...
[snip]
 My kernel is 3.2.0-4 (Debian Wheezy).

   Then you shouldn't use btrfs RAID-5 at all, and probably shouldn't
be using btrfs.

   For parity RAID, 3.19 or later is what you should be using, as it
supports recovery and rebuild. You can get those kernels from the
Debian experimental repo. If you're not using parity RAID, then the
last few kernels have been OK, but keeping up with the latest version
is still a reasonably good idea.

   Hugo.

-- 
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: syntax for deleting subvolumes?

2015-03-20 Thread Russell Coker
On Fri, 20 Mar 2015 04:18:38 AM Duncan wrote:
 If cp --reflink=auto was the default, it'd just work, making a reflink 
 where possible, falling back to a normal copy where not possible to 
 reflink.
 
 However, I'd be wary of such a change, because admins are used to cp 
 creating a separate copy which may well be intended as a backup, guarding 
 against I/O errors on the original.  Btrfs does do checksum checking to 
 ensure validity, but if that fails, do you really want it failing for 
 BOTH copies, including the one the admin made specifically to AVOID such 

If you have a second copy on the same filesystem then an error in any of the 
parent metadata will corrupt both.  E.g. if you have 2 copies under your home 
directory then a directory error for /, /home, or /home/$USER will make both 
copies unavailable.

Also any errors on superblocks or other essential filesystem metadata risks 
losing both copies.

If BTRFS was to adopt something equivalent to the ZFS copies= feature then 
the sysadmin could specify that certain subtrees would have extra copies of 
data AND each metadata block would have 1 more copy than the data blocks it 
refers to.

In conclusion I think that the ZFS copies= feature is the correct solution to 
this problem.  Until/unless the BTRFS developers copy that design concept, the 
thing to do if you want backups on the same storage hardware is to copy the 
data to a filesystem on a different partition - the NetApp research shows that 
disk read errors tend to be correlated by location on disk.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: syntax for deleting subvolumes?

2015-03-20 Thread Russell Coker
On Thu, 19 Mar 2015 07:18:29 AM Erkki Seppala wrote:
  But as a user level facility, I want to be able to snapshot before
  making a change to a tree full of source code and (re)building it all
  over again.  I may want to keep my new build, but I may want to flush
  it and return to known good state.
 
 You may want to check out cp --reflink=always (different from cp
 --link), which creates copy-on-write copy of the data. It isn't quite as
 fast as snapshots to create, but it's still plenty fast and without the
 downsides of subvolumes.

In which situations is --reflink=always expected to fail on BTRFS?

I recently needed to move /home on one system to a subvol (instead of being in 
the root subvol) and used cp --reflink=always, which gave a few errors when 
running with BTRFS on kernel 3.16.7-ckt7-1.  I was in a hurry and didn't try 
to track down the cause; I just ran it again with --reflink=auto, which gave no 
errors and completed very quickly (obviously the files which couldn't be 
reflinked were small).

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it safe or useful to use NOCOW flag and autodefrag mount option at same time?

2015-03-16 Thread Russell Coker
On Sun, 15 Mar 2015, peer@gmx.net wrote:
 Following common recommendations [1], I use these mount options on my
 main developing machine: noatime,autodefrag. This is desktop machine and
 it works well so far. Now, I'm also going to install several KVM virtual
 machines on this system. I want to use qcow2 files stored on SSD with a
 btrfs on it. In order to avoid bad performance with the VMs, I want to
 disable the Copy-On-Write mechanism on the storage directory of my VM
 images as for example described in [2].

Why do you expect a great performance benefit from that?

As there is no real seek time SSDs probably won't give you much benefit from 
defragmenting.  As for disabling CoW, that will reduce the number of writes 
(as you don't need to write the metadata all the way up the tree) and improve 
performance, but not as much as on spinning media where you need to do seeks 
for all that.

Finally having checksums on everything to give the possibility of recognising 
corrupt data is a really good feature and something that you want on your VM 
images.

So far I have never even tried disabling CoW or using autodefrag.  All of 
my BTRFS filesystems either have low performance requirements or run on SSD.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


single GlobalReserve

2015-02-03 Thread Russell Coker
# btrfs fi df /big
Data, RAID1: total=2.56TiB, used=2.56TiB
System, RAID1: total=32.00MiB, used=388.00KiB
Metadata, RAID1: total=19.25GiB, used=14.06GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Why is GlobalReserve single?  That filesystem has been RAID-1 for ages (since 
long before the default leaf size became 16K).  When GlobalReserve was added 
it should have taken the same allocation policy as metadata.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ignoring bad blocks

2015-01-11 Thread Russell Coker
On Sun, 4 Jan 2015 13:46:30 Chris Murphy wrote:
 So do use dd you need to also use bs= setting it to a multiple of
 4096. Of course most people using dd for zeroing set bs= to a decently
 high value because it makes the process go much faster than the
 default block size of 512 bytes.

You could just use cat(1) to write to the disk.  Cat SHOULD write in buffers of 
at least page size unless the kernel tells it that the device block size is 
larger.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Debian/Jessie 3.16.7-ckt2-1 kernel error

2014-12-23 Thread Russell Coker
I've attached the kernel message log that I get after booting kernel 3.16.7 
from Debian/Unstable.  This is the kernel branch that will go into 
Debian/Jessie so it's important to get it fixed.

Below is the start of the errors; the attached file has everything from boot.  
I've got similar issues on another system - would it help if I collected the 
logs from multiple systems?

[6.618809] Btrfs loaded
[6.670878] BTRFS: device fsid a2d7cbbc-23be-4d97-b2bb-de99b0c58c7d devid 1 
t
ransid 548614 /dev/mapper/root-crypt
[6.706907] BTRFS info (device dm-0): disk space caching is enabled
[6.798762] BTRFS: detected SSD devices, enabling SSD mode
[6.881272] [ cut here ]
[6.881358] WARNING: CPU: 3 PID: 198 at /build/linux-CMiYW9/linux-3.16.7-
ckt2
/fs/btrfs/delayed-inode.c:1410 btrfs_commit_transaction+0x38a/0x9c0 [btrfs]()

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/


kern-err.gz
Description: GNU Zip compressed data


Re: Balance scrub defrag

2014-12-11 Thread Russell Coker
On Wed, 10 Dec 2014 17:17:28 Robert White wrote:
 A _monthly_ scrub is maybe worth scheduling if you have a lot of churn 
 in your disk contents.

I do weekly scrubs.  I recently had 2 disks in a RAID-1 array develop read 
errors within a month of each other.  The first scrub after replacing sdb 
revealed an error on sdc!
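
For reference, the weekly scrub can be as simple as a cron.d entry like this 
(a sketch; the path and schedule are placeholders):

# /etc/cron.d/btrfs-scrub - run every Sunday at 03:00
0 3 * * 0  root  /sbin/btrfs scrub start -Bd /big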

 Defragging should be done after significant content additions/changes 
 (like replacing a lot of files via package management) and limited to 
 the directories most likely changed.

I have never run defrag.  Currently all my BTRFS filesystems that have any 
performance requirements are on SSD, and I don't think that defragmenting an 
SSD does much good.

 Balancing is almost never necessary and can be anti-helpful if a 
 experiences random updates in batches (because the nicely packed file 
 may end up far, far away from the active data extent where its COW 
 events are taking place.

The problem of running out of metadata space means that an occasional data 
balance is needed.  If you set it to only balance chunks that are less than 10% 
used then it doesn't take much time.
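
Something like the following is what I mean (a sketch; the usage filter 
restricts the balance to nearly-empty chunks):

# only rewrite data and metadata chunks that are less than 10% used
btrfs balance start -dusage=10 -musage=10 /big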

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pro/cons of raid1 with mdadm/lvm2

2014-12-01 Thread Russell Coker
On Mon, 1 Dec 2014, Chris Murphy li...@colorremedies.com wrote:
 On Sun, Nov 30, 2014 at 3:06 PM, Russell Coker russ...@coker.com.au wrote:
  When the 2 disks have different data mdadm has no way of knowing which
  one is correct and has a 50% chance of overwriting good data. But BTRFS
  does checksums on all reads and solves the problem of corrupt data - as
  long as you don't have 2 corrupt sectors in matching blocks.
 
 Yeah. I'm not sure though if openSUSE 13.2 prevents users from
 creating btrfs raid1 volumes entirely, or if it's just an install time
 limitation.

With BTRFS you can make it RAID-1 afterwards.  The possibility of data loss 
during system install usually isn't something you are concerned about so this 
shouldn't be a problem.
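
Converting after the install is roughly the following (a sketch; the device 
name is a placeholder):

btrfs device add /dev/sdb /
btrfs balance start -dconvert=raid1 -mconvert=raid1 /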

 I know that Fedora's installer won't allow the user to create Btrfs on
 LVM, and it probably doesn't allow it on md raid either.

For LVM that's reasonable, for MD-RAID that would be a bug IMHO.

On Mon, 1 Dec 2014, Roman Mamedov r...@romanrm.net wrote:
   * mdadm RAID has much better read balancing;
 Btrfs reads are satisfied from what's in effect a random drive
 (PID-based balancing of threads to drives), mdadm reads from the
 less-loaded drive. Also mdadm has a way to specify some RAID1 array
 members as to be never used for reads if at all possible (write-mostly),
 which helps in RAID1 of HDD and SSD.

True.  But that's just a lack of performance tuning in the current code; it 
will be fixed at some future time.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pro/cons of raid1 with mdadm/lvm2

2014-11-30 Thread Russell Coker
When the 2 disks have different data mdadm has no way of knowing which one is 
correct and has a 50% chance of overwriting good data. But BTRFS does checksums 
on all reads and solves the problem of corrupt data - as long as you don't have 
2 corrupt sectors in matching blocks.
-- 
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Balance and RAID-1

2014-11-27 Thread Russell Coker
I had a RAID-1 filesystem with 2*3TB disks and 330G of disk space free 
according to df -h.  I replaced a 3TB disk with a 4TB disk and df reported no 
change in the free space (as expected).

I added a 1TB disk to the filesystem and there was still no change!  I 
expected that adding a 1TB disk would give 3TB+1TB on one side and 4TB on the 
other so I would instantly get an extra 1TB of free space according to df -h.

1072 out of about 2734 chunks balanced (1073 considered),  61% left

I ran btrfs balance for just over a day and a btrfs balance status reported 
the above.  But it still only showed 860G free according to df -h.  I've 
cancelled the balance because 860G of free space is enough for the moment and 
I don't want that server running slowly and making noise any more.

I think that the allocation of disk space needs to be improved.  If I added a 
1TB disk to a pair of 3TB disks then it would be quite reasonable for some 
serious reallocation to be required to make use of the extra space.  But when 
I add a 1TB disk to an array that has a 3TB and a 4TB disk then it should be 
able to make use of the space quite easily.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Balance and RAID-1

2014-11-27 Thread Russell Coker
On Fri, 28 Nov 2014, Zygo Blaxell ce3g8...@umail.furryterror.org wrote:
 On Fri, Nov 28, 2014 at 01:37:50AM +1100, Russell Coker wrote:
  I had a RAID-1 filesystem with 2*3TB disks and 330G of disk space free 
  according to df -h.  I replaced a 3TB disk with a 4TB disk and df
  reported no  change in the free space (as expected).
 
 Did you btrfs resize that 4TB disk?  If not, btrfs still thinks the 4TB
 disk is a 3TB disk, and the rest follows from that.

Thanks, you are correct I had missed that step.

It would be nice if the replace command would inform the user of the amount of 
space.  Something like "the new device is 1000GB larger than the old, you 
might want to resize".
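
For reference, the step I missed was roughly this (a sketch; the devid and 
mount point are placeholders, the devid comes from btrfs filesystem show):

# grow the replacement device to its full size
btrfs filesystem resize 2:max /big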

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


resetting device stats

2014-11-26 Thread Russell Coker
When running Debian kernel version 3.16.0-4-amd64 and btrfs-tools version 
3.17-1.1 I ran a btrfs replace operation to replace a 3TB disk that was giving 
read errors with a new 4TB disk.

After the replace the btrfs device stats command reported that the 4TB disk 
had 16 read errors.  It appears that the device stats are copied in a replace 
operation.

I believe that this is a bug.  There is no reason to claim that the new 4TB 
disk had read errors when it was the old 3TB disk that had the errors.
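
As a workaround the counters can be cleared after the replace (a sketch):

# print and then reset the per-device error counters
btrfs device stats -z /big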

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


strange device stats message

2014-11-25 Thread Russell Coker
I am in the middle of replacing /dev/sdb (which is 3TB SATA disk that gives a 
few read errors on every scrub) with /dev/sdc2 (a partition on a new 4TB SATA 
disk).  I am running btrfs-tools version 3.17-1.1 from Debian/Unstable and 
Debian kernel 3.16.0-4-amd64.  I get the following, the last section of which 
seems wrong.  Would this be a bug in the kernel or btrfs-tools?

# btrfs device stats /big
[/dev/sdc2].write_io_errs   0
[/dev/sdc2].read_io_errs0
[/dev/sdc2].flush_io_errs   0
[/dev/sdc2].corruption_errs 0
[/dev/sdc2].generation_errs 0
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs16
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[].write_io_errs   0
[].read_io_errs0
[].flush_io_errs   0
[].corruption_errs 0
[].generation_errs 0

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


open_ctree problem

2014-11-22 Thread Russell Coker
I have a workstation running Linux 3.14.something on a 120G SSD.  It recently 
had a problem and now the root filesystem can't be mounted, here is the 
message I get when trying to mount it read-only on Debian kernel 3.16.2-3:

[4703937.784447] BTRFS info (device loop0): disk space caching is enabled
[4703938.754247] BTRFS: log replay required on RO media
[4703938.794148] BTRFS: open_ctree failed

When I tried to boot it normally it gave a lot of kernel messages and failed 
to mount it.

Here's the error I get from the btrfs-zero-log in btrfs-tools 0.19+20130501-1:

# btrfs-zero-log yayia-corrupt 
extent buffer leak: start 157263929344 len 4096
*** Error in `btrfs-zero-log': corrupted double-linked list: 
0x01068960 ***
Aborted

I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error.  But 
when I tried to mount the filesystem I got the attached kernel error when 
trying to mount with Debian kernel 3.16.2-3.

Any suggestions on what I should do next?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/


dmesg.txt.gz
Description: GNU Zip compressed data


Re: open_ctree problem

2014-11-22 Thread Russell Coker
Strangely I repeated the same process on the same system (btrfs-zero-log and 
mount read-only) and it worked.  While it's a concern that repeating the same 
process gives different results it's nice that I'm getting all my data back.

On Sun, 23 Nov 2014, Russell Coker russ...@coker.com.au wrote:
 I have a workstation running Linux 3.14.something on a 120G SSD.  It
 recently had a problem and now the root filesystem can't be mounted, here
 is the message I get when trying to mount it read-only on Debian kernel
 3.16.2-3:
 
 [4703937.784447] BTRFS info (device loop0): disk space caching is enabled
 [4703938.754247] BTRFS: log replay required on RO media
 [4703938.794148] BTRFS: open_ctree failed
 
 When I tried to boot it normally it gave a lot of kernel messages and
 failed to mount it.
 
 Here's the error I get from the btrfs-zero-log in btrfs-tools
 0.19+20130501-1:
 
 # btrfs-zero-log yayia-corrupt
 extent buffer leak: start 157263929344 len 4096
 *** Error in `btrfs-zero-log': corrupted double-linked list:
 0x01068960 ***
 Aborted
 
 I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error. 
 But when I tried to mount the filesystem I got the attached kernel error
 when trying to mount with Debian kernel 3.16.2-3.
 
 Any suggestions on what I should do next?


-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NOCOW and Swap Files?

2014-10-23 Thread Russell Coker
Also it would be nice to have checksums on the swap data. It's a bit of a waste 
to pay for ECC RAM and then lose the ECC benefits as soon as data is paged out.
-- 
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device balance times

2014-10-23 Thread Russell Coker
Also a device replace operation requires that the replacement be the same size 
(or maybe larger), while a remove and add allows the replacement to be merely 
large enough to contain all the data.  Given the size variation in what 
manufacturers call the same size of disk this isn't uncommon - unless you just 
get a replacement of the next size up (which is a good option too).

On October 23, 2014 3:59:31 AM GMT+11:00, Bob Marley bobmar...@shiftmail.org 
wrote:
On 22/10/2014 14:40, Piotr Pawłow wrote:
 On 22.10.2014 03:43, Chris Murphy wrote:
 On Oct 21, 2014, at 4:14 PM, Piotr Pawłowp...@siedziba.pl  wrote:
 Looks normal to me. Last time I started a balance after adding 6th 
 device to my FS, it took 4 days to move 25GBs of data.
 It's long term untenable. At some point it must be fixed. It's way, 
 way slower than md raid.
 At a certain point it needs to fallback to block level copying, with 
 a ~ 32KB block. It can't be treating things as if they're 1K files, 
 doing file level copying that takes forever. It's just too risky that 
 another device fails in the meantime.

 There's device replace for restoring redundancy, which is fast, but 
 not implemented yet for RAID5/6.

Device replace on raid 0,1,10 works if the device to be replaced is 
still alive, otherwise the operation is as long as a rebalance and works 
similarly (AFAIR).
Which is way too long in terms of the likelihood of another disk failing.
Additionally, it seeks like crazy during the operation, which also 
greatly increases the likelihood of another disk failing.

Until this is fixed I am not confident in using btrfs on a production 
system which requires RAID redundancy.

The operation needs to be streamlined: it should be as sequential as 
possible (sort files according to their LBA before reading/writing), 
with the fewest number of seeks on every disk, and with large buffers, 
so that reads from the source disk(s) and writes to the replacement disk 
goes at platter-speed or near there.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs
in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: strange 3.16.3 problem

2014-10-21 Thread Russell Coker
On Tue, 21 Oct 2014, Zygo Blaxell zblax...@furryterror.org wrote:
 On Mon, Oct 20, 2014 at 04:38:28AM +, Duncan wrote:
  Russell Coker posted on Sat, 18 Oct 2014 14:54:19 +1100 as excerpted:
   # find . -name *546
   ./1412233213.M638209P10546 # ls -l ./1412233213.M638209P10546 ls:
   cannot access ./1412233213.M638209P10546: No such file or directory
  
  Does your mail server do a lot of renames?  Is one perhaps stuck?  If so,
  that sounds like the same thing Zygo Blaxell is reporting in the
  3.16.3..3.17.1 hang in renameat2() thread, OP on Sun, 19 Oct 2014
  15:25:26 -400, Msg-ID: 20141019192525.ga29...@hungrycats.org, as linked
  here:

It's a Maildir server so it does a lot of renames, but I don't think anything 
is stuck.  I've just rebooted the Dom0 and nothing has changed.

 For Russell's issue...most of the stuff I can think of has been
 tried already.  I didn't see if there was any attempt try to ls the
 file from the NFS server as well as the client side.  If ls is OK on
 the server but not the client, it's an NFS issue (possibly interacting
 with some btrfs-specific quirk); otherwise, it's likely a corrupted
 filesystem (mail servers seem to be unusually good at making these).

# ls -l *546
ls: cannot access *546: No such file or directory

Above is on the server.

# ls -l *546
ls: cannot access 1412233213.M638209P10546: No such file or directory

Above is on the client.  Note that wildcard expansion worked because readdir() 
found the file even though stat can't.

 Most of the I/O time on mail servers tends to land in the fsync() system
 call, and some nasty fsync() btrfs bugs were fixed in 3.17 (i.e. after
 3.16, and not in the 3.16.x stable update for x = 5 (the last one
 I've checked)).  That said, I'm not familiar with how fsync() translates
 over NFS, so it might not be relevant after all.

That's going to suck for people running mail servers on Debian.

 If the NFS server's view of the filesystem is OK, check the NFS protocol
 version from /proc/mounts on the client.  Sometimes NFS clients will
 get some transient network error during connection and fall back to some
 earlier (and potentially buggier) NFS version.  I've seen very different
 behavior in some important corner cases from v4 and v3 clients, for
 example, and if the client is falling all the way back to v2 the bugs
 and their workarounds start to get just plain _weird_ (e.g. filenames
 which produce specific values from some hash function or that contain
 specific character sequences are unusable).  v2 is so old it may even
 have issues with 64-bit inode numbers.

Rebooting the client multiple times and rebooting the server once haven't 
changed it.  I don't think it's any transient error.

On Tue, 21 Oct 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 Just now saw this thread, but IIRC 'No such file or directory' also gets 
 returned sometimes when trying to automount a share that can't be 
 enumerated by the client, and also sometimes when there is a stale NFS 
 file handle.

I think that rebooting both client and server precludes the possibility of a 
stale file handle.  Even rebooting the client (which I have done several 
times) should fix it.

On Tue, 21 Oct 2014, Robert White rwh...@pobox.com wrote:
 Okay, from the strace output the shell _is_ finding the file in the
 directory read and expand (readdir) pass. That is *546 is being
 expanded to the full file name text 1412233213.M638209P10546 but then
 the actual operation fails because the name is apparently not associated
 with anything.
 
 So what pass of scrub or btrfsck checks directory connectedness? Does
 that pass give your file system a clean bill of health?

That's inconvenient for a remote system with a single BTRFS filesystem.

 Also you said that you are using a 32bit user space copied from another
 server under a 64bit kernel. Is the ls command a 32 bit executable then?

Yes.

 What happens if you stop the Xen domain for the mail server and then
 mount the disks into a native 64bit environment and then ls the file name?

The filesystem in question is NFS mounted from a server with 64bit kernel+user 
to a virtual server with 64bit kernel+32bit user.  On the file server (the Xen 
Dom0) ls doesn't even see that file in readdir.

 I ask because the man page for lstat64 says its a wrapper for the
 underlying system call (fstatat64). It is not impossible that you might
 have a case where the wrapper is failing inside glibc due to some 32/64
 bit conversion taking place.

If there is a 32/64 bit conversion issue then we have another problem.  The mail 
server is configured to reject messages bigger than about 50M; I don't recall 
the exact number but it's a lot smaller than 2G.

On Tue, 21 Oct 2014, Goffredo Baroncelli kreij...@inwind.it wrote:
 Could this be related to the inode overflow in 32 bit system 
 (see inode_cache options) ? If so running a 64bit ls -i should
 work

I've just installed coreutils:amd64 on the NFS client and I get the same 
results

Re: strange 3.16.3 problem

2014-10-21 Thread Russell Coker
I've just upgraded the Dom0 (NFS server) from 3.16.3 to 3.16.5 and it all 
works.

Prior to upgrading the Dom0 I had the same problem occur with different file 
names.  All the names in question were truncated names of files that exist.  
It seems that 3.16.3 has a bug with NFS serving files with long names.

Thanks for all the suggestions.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: strange 3.16.3 problem

2014-10-18 Thread Russell Coker
On Sat, 18 Oct 2014, Michael Johnson - MJ m...@revmj.com wrote:
 The NFS client is part of the kernel iirc, so it should be 64 bit.  This
 would allow the creation of files larger than 4gb and create possible
 issues with a 32 bit user space utility.

A correctly written 32bit application will handle files over 4G in size.

While some applications may have problems, I'm fairly sure that ls will be ok.

# dd if=/dev/zero of=/tmp/test bs=1024k count=1 seek=5000
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00383089 s, 274 MB/s
# /bin/ls -lh /tmp/test
-rw-r--r--. 1 root root 4.9G Oct 18 20:47 /tmp/test
# file /bin/ls
/bin/ls: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically 
linked (uses shared libs), for GNU/Linux 2.6.26, 
BuildID[sha1]=0xd3280633faaabf56a14a26693d2f810a3e51, stripped

A quick test shows that a 32bit ls can handle this.

 I would mount from a client with 64 bit user space and see if the problem
 occurs there.  If so, it is probably not a btrfs issue (if I am
 understanding your environment correctly).

I'll try that later.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: strange 3.16.3 problem

2014-10-18 Thread Russell Coker
On Sun, 19 Oct 2014, Robert White rwh...@pobox.com wrote:
 On 10/17/2014 08:54 PM, Russell Coker wrote:
  # find . -name *546
  ./1412233213.M638209P10546
  # ls -l ./1412233213.M638209P10546
  ls: cannot access ./1412233213.M638209P10546: No such file or directory
  
  Any suggestions?
 
 Does ls -l *546 show the file to exist? e.g. what happens if you use
 the exact same wildcard in the ls command as you used in the find?

# ls -l *546 
ls: cannot access 1412233213.M638209P10546: No such file or directory

That gives the same result as find, the shell matches the file name but then 
ls can't view it.

lstat64(1412233213.M638209P10546, 0x9fab0c8) = -1 ENOENT (No such file or 
directory)

From strace, the lstat64 system call fails.
 
 It is possible (and back in the day it was quite common) for files to be
 created with non-renderable nonsense in the name. for instance if the
 first four characters of the name were 13^H4 (where ^H is the single
 backspace character) the file wold look like it was named 14* but it
 would be listed by ls using 13*. If the file name is damaged, which
 is usually a failing in the program that created the file, then it can
 be hidden in plain sight.

If that's the case then it's still a kernel bug somewhere.  Maildrop and 
Dovecot don't create files with any unusual characters in the names.

 Note that this sort of name is hidden from the copy-paste done in the
 terminal window because the binary nonsense is just not in the output
 any more by the time you select it with the mouse.
 
 It doesn't have to be a backspace, BTW, it can be any character that the
 terminal window will not render.
 
 If things get really ugly you may need to remove the file using
 
 find . -name *546 -exec rm {} \;

# find . -name *546 -exec rm {} \;
rm: cannot remove `./1412233213.M638209P10546': No such file or directory

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


strange 3.16.3 problem

2014-10-17 Thread Russell Coker
I have a system running the Debian 3.16.3-2 AMD64 kernel for the Xen Dom0 and 
the DomUs.

The Dom0 has a pair of 500G SATA disks in a BTRFS RAID-1 array.  The RAID-1 
array has some subvols exported by NFS as well as a subvol for the disk images 
for the DomUs - I am not using NoCOW as performance is fine without it and I 
like having checksums on everything.

I have started having some problems with a mail server that is running in a 
DomU.  The mail server has 32bit user-space because it was copied from a 32bit 
system and I had no reason to upgrade it to 64bit, but it's running a 64bit 
kernel so I don't think that 32bit user-space is related to my problem.

# find . -name *546
./1412233213.M638209P10546
# ls -l ./1412233213.M638209P10546
ls: cannot access ./1412233213.M638209P10546: No such file or directory

Above is the problem: find says that the file in question exists but ls 
doesn't think so.  The file in question is part of a Maildir spool that's NFS 
mounted.  This problem persisted across a reboot of the DomU, so it's a 
problem with the Dom0 (the NFS server).

The dmesg output on the Dom0 doesn't appear to have anything relevant, and a 
find command doesn't find the file.  I don't know if this is a NFS problem or 
a BTRFS problem.  I haven't rebooted the Dom0 yet because a remote reboot of a 
server running a kernel from Debian/Unstable is something I try to avoid.

Any suggestions?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deleting a dead device

2014-09-24 Thread Russell Coker
On Sun, 21 Sep 2014 11:05:46 Chris Murphy wrote:
 On Sep 20, 2014, at 7:39 PM, Russell Coker russ...@coker.com.au wrote:
  Anyway the new drive turned out to have some errors, writes failed and
  I've
  got a heap of errors such as the above.
 
 I'm curious if smartctl -t conveyance reveals any problems, it's not a full
 surface test but is designed to be a test for (typical?) problems drives
 have due to shipment damage, and doesn't take very long.

Unfortunately I got to your message after sending the defective drive to e-
waste.  But I expect that any test that involves real reads/writes would 
report a failure (if the USB-SATA device supported passing them through) as 
the drive seemed to fail for everything.

  # btrfs device delete /dev/sdc3 /
  ERROR: error removing the device '/dev/sdc3' - Invalid argument
  
  It seems that I can't remove the device because removing requires writing.
 
 What kernel message do you get associated with this? Try using the devid
 instead of /dev/.

I'll keep that in mind.

device delete dev [dev...] path
  Remove device(s) from a filesystem identified by path.

The man page has the above text which makes no mention of devid, so I think we 
need a documentation patch for this.
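
For reference, a hedged sketch of the devid form (assuming a btrfs-progs
version that accepts a devid, and that devid 3 is whatever number
btrfs fi show reports for the dead drive):

btrfs filesystem show /
btrfs device delete 3 /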

 For future reference, btrfs replace start is better to use than add+delete.
 It's an optimization but it also makes it possible to ignore the device
 being replaced for reads; and you can also get a status on the progress
 with btrfs replace status. And it looks like it does some additional
 error checking.

Oh yes, I've done this and it works well.  However it doesn't work if the 
replacement is smaller than the device being replaced.
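
As a sketch (device names are placeholders), the replace workflow is roughly:

btrfs replace start -r /dev/sdc3 /dev/sdd3 /
btrfs replace status /

The -r option tells it to avoid reading from the source device unless no
other good copy exists.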

  Also as an aside, while the stats about write errors are useful, in this
  case it would be really good if there was a count of successful writes,
  it would be useful to know if the successful write count was close to 0.
 
 I think this is for other tools. Btrfs is a file system, it's responsible for
 the integrity of the data it writes, I don't think it's responsible for
 prequalifying drives.

I agree that it doesn't have to prequalify drives.  But it should expose all 
data it has which can be of use to the sysadmin.  After it was too late I 
realised that I could have used iostat to get stats for the block device.  But 
it would still be nice to have stats from btrfs.

Also btrfs has to deal with the fact that drives may fail at any time.  
Admittedly I was using a drive I knew to be slightly sub-standard (I got it 
free because it gave an error in a client's RAID-Z array).  But sometimes 
drives like that last for years, it's difficult to predict.

 Even a simple dd if=/dev/zero of=/dev/sdc bs=64k count=1600 will write out
 100MB, and dmesg will show if there are any controller or drive problems on
 writes. You may have to do more than 100MB for problems to show up but you
 get the idea.

True.  But a drive can fail after 101MB.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: deleting a dead device

2014-09-21 Thread Russell Coker
On Sun, 21 Sep 2014, Duncan 1i5t5.dun...@cox.net wrote:
 Russell Coker posted on Sun, 21 Sep 2014 11:39:17 +1000 as excerpted:
  On a system running the Debian 3.14.15-2 kernel I added a new drive to a
  RAID-1 array.  My aim was to add a device and remove one of the old
  devices.
 
 That's an old kernel and presumably an old btrfs-progs.  Quite a number
 of device management fixes have gone in recently, and you'd likely not be
 in quite that predicament were you running a current kernel (make sure
 it's 3.16.2+ or 3.17-rc2+ to get the fix for the workqueues bug that
 affected 3.15 and thru 3.16.1 and 3.17-rc1).

3.16.2 is in Debian, I'm in the process of upgrading to it.

 And the recommended way to handle a device replace now would be btrfs
 replace, doing the add and delete in a single (albeit possibly long) step
 instead of separately.

I'm changing from a 500G single filesystem to a 200G RAID-1 (there's only 150G 
of data).  The change from 500G to 200G can't be done with a replace as a 
replace requires an equal or greater size.

I did a 2 step process, add/delete to go to a 200G USB attached device for 
half the array and then replace to go from 200G on USB to 200G internal.

  The drive is attached by USB so I turned off the USB device and then got
  the above result.  So it still seems impossible to remove the device
  even though it's physically not present.  I've connected a new USB disk
  which is now /dev/sdd, so it seems that BTRFS is keeping the name
  /dev/sdc locked.
  
  Should there be a way to fix this without rebooting or anything?
 
 Did you try btrfs device delete missing?  It's documented on the wiki but
 apparently not yet on the manpage.

I did that after rebooting.  It didn't occur to me to try a missing 
operation when the drive really wasn't missing.

 According to the wiki that deletes
 the first device that was in the metadata but not found when booting, so
 you may have to reboot to do it, but it should work.

That would be a bug.  There's no reason a reboot should be required if we can 
remove a drive and add a new one with the kernel recognising it.  Hot-swap 
disks aren't any sort of new feature.

 Tho with the recent
 stale-devices fixes, were that a current kernel you may not actually have
 to reboot to have delete missing work.  But you probably will on 3.14,
 and of course to upgrade kernels you'd have to reboot anyway, so...

Yes a reboot was needed anyway.  But I'd have liked to delay that.

  Also as an aside, while the stats about write errors are useful, in this
  case it would be really good if there was a count of successful writes,
  it would be useful to know if the successful write count was close to 0.
  
   My understanding of the BTRFS design is that there would be no
  
  performance penalty for adding counts of the number of successful reads
  and writes to the superblock.  Could this be done?
 
 Not necessarily for reads, consider the case when the filesystem is read-
 only as my btrfs root filesystem is by default -- lots of reads but
 likely no writes and no super-block updates for the entire uptime.  But I
 believe you're correct for writes, since they'd ultimately update the
 superblocks anyway.

For the case of a read-only filesystem it's OK to skip read stats.  It would 
also be a bad idea to update read stats without writing data.  But there's no 
reason why read stats couldn't be accumulated in-memory and written out the 
next time something was written to disk.  That would give a slight inaccuracy 
in the case where there was a power failure after some period of reading 
without writing, but that's an unusual corner case.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


deleting a dead device

2014-09-20 Thread Russell Coker
On a system running the Debian 3.14.15-2 kernel I added a new drive to a 
RAID-1 array.  My aim was to add a device and remove one of the old devices.

Sep 21 11:26:51 server kernel: [2070145.375221] BTRFS: lost page write due to 
I/O error on /dev/sdc3
Sep 21 11:26:51 server kernel: [2070145.375225] BTRFS: bdev /dev/sdc3 errs: wr 
269, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:21 server kernel: [2070175.517691] BTRFS: lost page write due to 
I/O error on /dev/sdc3
Sep 21 11:27:21 server kernel: [2070175.517699] BTRFS: bdev /dev/sdc3 errs: wr 
270, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:21 server kernel: [2070175.517712] BTRFS: lost page write due to 
I/O error on /dev/sdc3
Sep 21 11:27:21 server kernel: [2070175.517715] BTRFS: bdev /dev/sdc3 errs: wr 
271, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:51 server kernel: [2070205.665947] BTRFS: lost page write due to 
I/O error on /dev/sdc3
Sep 21 11:27:51 server kernel: [2070205.665955] BTRFS: bdev /dev/sdc3 errs: wr 
272, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:51 server kernel: [2070205.665967] BTRFS: lost page write due to 
I/O error on /dev/sdc3
Sep 21 11:27:51 server kernel: [2070205.665971] BTRFS: bdev /dev/sdc3 errs: wr 
273, rd 0, flush 0, corrupt 0, gen 0

Anyway the new drive turned out to have some errors, writes failed and I've 
got a heap of errors such as the above.  The errors started immediately after 
adding the drive and the system wasn't actively writing to the filesystem.  So 
very few (if any) writes made it to the device.

# btrfs device delete /dev/sdc3 /
ERROR: error removing the device '/dev/sdc3' - Invalid argument

It seems that I can't remove the device because removing requires writing.

# btrfs device delete /dev/sdc3 /
ERROR: error removing the device '/dev/sdc3' - No such file or directory
# btrfs device stats /
[/dev/sda3].write_io_errs   0
[/dev/sda3].read_io_errs0
[/dev/sda3].flush_io_errs   0
[/dev/sda3].corruption_errs 57
[/dev/sda3].generation_errs 0
[/dev/sdb3].write_io_errs   0
[/dev/sdb3].read_io_errs0
[/dev/sdb3].flush_io_errs   0
[/dev/sdb3].corruption_errs 0
[/dev/sdb3].generation_errs 0
[/dev/sdc3].write_io_errs   267
[/dev/sdc3].read_io_errs0
[/dev/sdc3].flush_io_errs   0
[/dev/sdc3].corruption_errs 0
[/dev/sdc3].generation_errs 0

The drive is attached by USB so I turned off the USB device and then got the 
above result.  So it still seems impossible to remove the device even though 
it's physically not present.  I've connected a new USB disk which is now 
/dev/sdd, so it seems that BTRFS is keeping the name /dev/sdc locked.

Should there be a way to fix this without rebooting or anything?

Also as an aside, while the stats about write errors are useful, in this case 
it would be really good if there was a count of successful writes, it would be 
useful to know if the successful write count was close to 0.  My understanding 
of the BTRFS design is that there would be no performance penalty for adding 
counts of the number of successful reads and writes to the superblock.  Could 
this be done?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


device delete progress

2014-09-20 Thread Russell Coker
We need to have a way to determine the progress of a device delete operation.  
Also for a balance of a RAID-1 that has more than 2 devices it would be good 
to know how much space is used on each device.

Could btrfs fi df be extended to show information separately for each device?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: No space on empty, degraded raid10

2014-09-11 Thread Russell Coker
On Mon, 8 Sep 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 Also, I've found out the hard way that system chunks really should be
 RAID1, NOT RAID10, otherwise it's very likely that the filesystem
 won't mount at all if you lose 2 disks.

Why would that be different?

In a RAID-1 you expect system problems if 2 disks fail, why would RAID-10 be 
different?

Also it would be nice if there was an N-way mirror option for system data.  As 
such data is tiny (32MB on the 120G filesystem in my workstation) the space 
used by having a copy on every disk in the array shouldn't matter.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how long should btrfs device delete missing ... take?

2014-09-11 Thread Russell Coker
It would be nice if a file system mounted ro counted as ro snapshots for btrfs 
send.

When a file system is so messed up it can't be mounted rw it should be regarded 
as ro for all operations.
-- 
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Debian 3.14.13-2 lockup

2014-08-18 Thread Russell Coker
I've attached the dmesg output from a system running Debian kernel 3.14.13 
which locked up.  Everything which needed to write to disk was blocked.  The 
dmesg output didn't catch the first messages which had scrolled out of the 
buffer.  As the disk wasn't writable there was nothing useful in 
/var/log/kern.log .

As an aside one reason for using tmpfs for /tmp is that when the root 
filesystem has a problem bash filename completion still works.  I lost a 
precious root shell on that system because I pressed TAB and bash hung.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


dmesg.txt.gz
Description: application/gzip


Re: Putting very big and small files in one subvolume?

2014-08-17 Thread Russell Coker
On Sun, 17 Aug 2014 12:31:42 Duncan wrote:
 OTOH, I tend to be rather more of an independent partition booster than 
 many.  The biggest reason for that is the too many eggs in one basket 
 problem.  Fully separate filesystems on separate partitions separate 
 those data eggs into separate baskets, so if the metaphorical bottom 
 drops out of one of those filesystem baskets, only the data eggs in that 
 filesystem basket are lost, while the eggs in the separate filesystem 
 baskets are still safe and sound, not affected at all. =:^)
 
 The thing that troubles me about replacing a bunch of independent 
 partitions and filesystems with a bunch of subvolumes on a single btrfs 
 filesystem is thus just that, you've nicely divided that big basket into 
 little subvolume compartments, but it's still one big basket, and if the 
 bottom falls out, you potentially lose EVERYTHING in that filesystem 
 basket!

I'll write the counter-point to this.

If you have several partitions for /, /var/log, and /home then losing any one 
of them will result in a system that's mostly unusable.  So for continuous 
service there doesn't seem to be a benefit in having multiple partitions.

When you have to restore a backup in adverse circumstances the restore time is 
important.  For example if you have 10*4TB disks and need RAID-1 redundancy 
(which you need on any BTRFS filesystem of note as I don't think RAID-5 and 
RAID-6 are trustworthy) then an advantage of 5*4TB RAID-1 filesystems over a 
20TB RAID-10 is that restore time will be a lot smaller.  But this isn't an 
issue for typical BTRFS users who are working with much smaller amounts of 
data, at this time I have to recommend ZFS over BTRFS for most systems that 
manage 20TB of data.

If you have a RAID-1 array of the biggest disks available (which is probably 
the biggest storage for 99% of BTRFS users) then you are looking at a restore 
time of maybe 4TB at 160MB/s == something less than 7 hours.  For a home 
network 7 hours delay in getting things going after a major failure is quite 
OK.
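
That estimate is just capacity divided by sequential transfer rate, e.g.:

awk 'BEGIN { print 4e12 / 160e6 / 3600, "hours" }'
# prints roughly 6.94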

Finally failures of filesystems on different partitions won't be independent.  
If one filesystem on a disk becomes unusable due to drive firmware issues or 
other serious problems then other filesystems on the same physical disk are 
likely to suffer the same fate.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 40TB volume taking over 16 hours to mount, any ideas?

2014-08-08 Thread Russell Coker
On Fri, 8 Aug 2014 16:35:29 Jose Ildefonso Camargo Tolosa wrote:
 uname -a
 Linux server1 3.15.8-031508-generic #201407311933 SMP Thu Jul 31
 23:34:33 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 
 The complete story:
 
 The filesystem was created on Ubuntu 12.04, running kernel 3.11.
 mount options included compress=zlib .
 
 After having some issues with it, specifically that it would mount
 itself read-only, and would then be stuck while trying to mount it
 again, we decided to upgrade 14.04, kernel 3.13, and 3.12 btrfs tools.

[...]

 Then, after reading here and there, decided to try to use a newer
 kernel, tried 3.15.8.  Well, it is still mounting after ~16 hours, and
 I got messages like these at first:

I recommend trying a 3.14 kernel.  I had ongoing problems with kernels before 
3.14 which included infinite loops in kernel space.  Based on reports on this 
list I haven't been inclined to test 3.15 kernels.  But 3.14 has been working 
well for me on many systems.

Trying 3.14 can't hurt.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ENOSPC with mkdir and rename

2014-08-04 Thread Russell Coker
On Mon, 4 Aug 2014 14:17:02 Peter Waller wrote:
 For anyone else having this problem, this article is fairly useful for
 understanding disk full problems and rebalance:
 
 http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem- 
 Full-Problems.html
 
 It actually covers the problem that I had, which is that a rebalance
 can't take place because it is full.
 
 I still am unsure what is really wrong with this whole situation. Is
 it that I wasn't careful to do a rebalance when I should have been
 doing? Is it that BTRFS doesn't do a rebalance automatically when it
 could in principle?

Yes and yes.  The fact that BTRFS can't avoid getting into such situations and 
can't recover when it does are both bugs in BTRFS.  The fact that you didn't 
run a balance to prevent this is due to not being careful enough with a 
filesystem that's still in a development stage.

 It's pretty bad to end up in a situation (with spare space) where the
 only way out is to add more storage, which may be impractical,
 difficult or expensive.

Absolutely.

 I conclude that now since I have added more storage, the rebalance
 won't fail and if I keep rebalancing from a cron job I won't hit this
 problem again

Yes.

 (unless the filesystem fills up very fast! what then?).
 I don't know however what value to assign to `-dusage` in general for
 the cron rebalance. Any hints?

If you regularly run a balance with options such as -dusage=50 -musage=10 then 
the amount of free space in metadata chunks will tend to be a lot greater than 
that in data chunks.

Another option I've considered is to write a program that creates millions of 
files with 1000 byte random file names.  After creating a filesystem I could 
run that program to cause a sufficient number of metadata chunks to be 
allocated and then remove the subvol containing all those files (which 
incidentally is a lot faster than rm -rf).
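
A rough sketch of that idea (mount point and file count are placeholders, and
as the kernel caps file names at 255 bytes the names here are ~200 characters
rather than 1000 bytes):

#!/bin/bash
set -e
btrfs subvolume create /mnt/fs/metadata-filler
cd /mnt/fs/metadata-filler
for i in $(seq 1 1000000); do
    # ~200 character names built from random hex
    touch "$(head -c 100 /dev/urandom | od -An -tx1 | tr -d ' \n')"
done
cd /
btrfs subvolume delete /mnt/fs/metadata-filler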

Another thing I've considered is making a filesystem for a file server with a 
RAID-1 array of SSDs and running the above program to allocate all chunks for 
metadata.  Then when the SSDs are totally assigned to metadata I would add a 
pair of SATA disks for data.  A filesystem with all metadata on SSD and all 
data on SATA disks should give great performance as well as having lots of 
space.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Threads being NUMA aware

2014-08-03 Thread Russell Coker
Please get yourself a NUMA system and test this out.
-- 
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Scan not being performed properly on boot

2014-08-03 Thread Russell Coker
On Mon, 4 Aug 2014 04:02:53 Peter Roberts wrote:
 I've just recently started testing btrfs on my server but after just 24 
 hours problems have started. I get booted to a busybox prompt user 
 ubuntu 14.04. I have a multi device FS setup and I can't say for sure if 
 it managed to boot initially or not but it worked fine in regular usage.

What is GRUB (or your boot loader) giving as parameters to the kernel?

What error messages appear on screen?  Sometimes it's helpful to photograph 
the screen and put the picture on a web server to help people diagnose the 
problem.

 I cannot mount my root with anything other than an explicit /dev/sdx 
 reference until I manually run a scan. Then UUID etc all work. I've 

That sounds like a problem with the Ubuntu initrd, probably filing an Ubuntu 
bug report would be the best thing to do.  Is BTRFS supported in that version 
of Ubuntu?

But just changing your boot configuration to use /dev/sdx is probably the best 
option.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Threads being NUMA aware

2014-08-03 Thread Russell Coker
On Sun, 3 Aug 2014 22:44:26 Nick Krause wrote:
 On Sun, Aug 3, 2014 at 7:48 PM, Russell Coker russ...@coker.com.au wrote:
  Please get yourself a NUMA system and test this out.
 
 Unfortunately I don't have money for an extra machine as of now as I
 am a student

If you can't get an extra machine then you probably can't contribute to 
developing filesystems, and your ability to do any sort of kernel development 
will be greatly limited.  It's probably best that you choose another area of 
software development until you get more hardware (and skill).

 so if x86 is NUMA I can test otherwise I can't.

If you had even read the NUMA Wikipedia page then you would already know the 
answer to this implied question.

But really programming computers isn't something that you are good at.  It's 
not something that you will become good at if you attempt tasks that are way 
above your skill level.  I think that the best strategy for you is to find a 
mailing list for Linux beginners, when your skills get to the level that you 
answer more questions than you ask (and people appreciate your answers) then 
you can move on to easy programming tasks.  Once you master the easy 
programming tasks you can move on to more difficult tasks.

I suggest that you develop a realistic plan.  Plan to start kernel programming 
in 2024 and have a series of goals on that path over the next 10 years.  Make 
your 2015 goal be answering lots of questions on a Linux beginners mailing 
list, that's a goal you can achieve in a year.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Scan not being performed properly on boot

2014-08-03 Thread Russell Coker
On Sun, 3 Aug 2014 21:00:19 George Mitchell wrote:
  But just changing your boot configuration to use /dev/sdx is probably the
  best option.
 
 Assuming you are booting with grub2, you will need to use /dev/sdx in 
 the grub2 configuration file.  This is known issue with grub2. Example 
 from my script:

That's not a GRUB issue it's an initrd issue.  GRUB just passes text to the 
kernel.  cat /proc/cmdline will show you what GRUB sent to the kernel, the 
programs in the initrd read /proc/cmdline and then do what they think is 
appropriate.

 
 echo 'Loading Linux desktop ...'
 linux   /vmlinuz-desktop root=/dev/sda7 ro splash quiet
 echo 'Loading initial ramdisk ...'
 initrd  /initrd-desktop.img

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Scan not being performed properly on boot

2014-08-03 Thread Russell Coker
On Sun, 3 Aug 2014 21:34:29 George Mitchell wrote:
 I see what you are saying.  Its a hack.  But I suspect that most of the 
 distros are not yet accommodating btrfs with their standard mkinitrd 
 process.  At this point modifying grub2 config does solve the problem.  
 If you know a reasonably easy way to fix initrd so that it can interpret 
 UUID and LABEL, I would certainly be all ears.

Debian/Wheezy works with BTRFS when using UUID= to specify the root 
filesystem.  Wheezy was released in May 2013, Ubuntu 14.04 was released in 
April 2014 and as Ubuntu is based on Debian it should have at least the same 
features as an older version of Debian.

I think that most Distributions are supporting BTRFS.  Debian/Wheezy has 
support for BTRFS (although I recommend that you don't use it unless you plan 
to take a kernel from Testing), most Debian derivatives will support it, 
Fedora supports it.

It's probably better to make a list of distributions that DON'T support BTRFS, 
it'll be a shorter list.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ENOSPC with mkdir and rename

2014-08-02 Thread Russell Coker
On Sun, 3 Aug 2014 00:35:28 Peter Waller wrote:
 I'm running Ubuntu 14.04. I wonder if this problem is related to the
 thread titled Machine lockup due to btrfs-transaction on AWS EC2
 
 Ubuntu 14.04 which I started on the 29th of July:
  http://thread.gmane.org/gmane.comp.file-systems.btrfs/37224
 
 Kernel: 3.15.7-031507-generic

As an aside, I'm still on 3.14 kernels for my systems and have no immediate 
plans to use 3.15.  There has been discussion here about a number of problems 
with 3.15, so I don't think that any testing I do with 3.15 will help the 
developers and it will just take more of my time.

 $ sudo btrfs fi df /path/to/volume
 Data, single: total=489.97GiB, used=427.75GiB
 Metadata, DUP: total=5.00GiB, used=4.50GiB

As has been noted you are using all the space in 1G data chunks and the system 
can't allocate more 256M metadata chunks (which are allocated in pairs because 
it's DUP, so 512M is allocated at a time).

 In this case, for example, metadata has 0.5GiB free (sounds like
 plenty for metadata for one mkdir to me). Data has 62GiB free. Why
 would I get ENOSPC for a file rename?

Some space is always reserved.  Due to the way BTRFS works, a change to a file 
requires writing a new copy of the tree.  So the amount of metadata space 
required for an operation that is conceptually simple can be significant.

One thing that can sometimes solve that problem is to delete a subvol.  But 
note that it can take a considerable amount of time to free the space, 
particularly if you are running out of metadata space.  So you could delete a 
couple of subvols, run sync a couple of times, and have a coffee break.

If possible avoid rebooting as that can make things much worse.  This was a 
particular problem with kernels 3.13 and earlier which could enter a CPU loop 
requiring a reboot and then you would have big problems.

 I tried a rebalance with btrfs balance start -dusage=10 and tried
 increasing the value until I saw reallocations in dmesg.

/sbin/btrfs fi balance start -dusage=30 -musage=10 /

It's a good idea to have a cron job running a rebalance.  Above is what I use 
on some of my systems, it will free data chunks that are up to 30% used and 
metadata chunks that are only 10% used.  It almost never frees metadata chunks 
and regularly frees data chunks which is what I want.
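
For example, a cron.d entry along these lines (schedule and mount point are
placeholders) runs that weekly:

# /etc/cron.d/btrfs-balance
30 3 * * 0   root   /sbin/btrfs fi balance start -dusage=30 -musage=10 / >/dev/null 2>&1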

 and enlarge the volume. When I did this, metadata grew by 1GiB:
  Data, single: total=490.97GiB, used=427.75GiB
  System, DUP: total=8.00MiB, used=60.00KiB
  System, single: total=4.00MiB, used=0.00
  Metadata, DUP: total=5.50GiB, used=4.50GiB
  Metadata, single: total=8.00MiB, used=0.00
  unknown, single: total=512.00MiB, used=0.00

Now that you have solved that problem you could balance the filesystem 
(deallocating ~60 data chunks) and then shrink it.  In the past I've added a 
USB flash disk to a filesystem to give it enough space to allow a balance and 
then removed it (NB you have to do a btrfs device delete before physically 
removing the USB stick).
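
As a sketch (the device name is a placeholder) that procedure is roughly:

btrfs device add /dev/sdX1 /
btrfs balance start -dusage=30 -musage=10 /
# the delete migrates data back off the stick, wait for it to finish
btrfs device delete /dev/sdX1 /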

 * Why didn't the metadata grow before enlarging the disk?
 * Why didn't the rebalance enable the metadata to grow?
 * Why is it necessary to rebalance? Can't it automatically take some
 free space from 'data'?

It would be nice if it could automatically rebalance.  It's theoretically 
possible as the btrfs program just asks the kernel to do it.  But there's 
nothing stopping you from having a regular cron job to do it.  You could even 
write a daemon to poll the status of a btrfs filesystem and run balance when 
appropriate if you were keen enough.

 * What is the best course of action to take (other than enlarging the
 disk or deleting files) if I encounter this situation again?

Have a cron job run a balance regularly.

On Sat, 2 Aug 2014 21:52:36 Nick Krause wrote:
 I have run into this error to and this seems to be a rather big issue as
 ext4 seems to never run of metadata room at least from my testing. I feel
 greatly that this part of btrfs needs be improved and moved into a function
 or set of functions for re balancing metadata in the kernel itself.

Ext4 has fixed size Inode tables that are assigned at mkfs time.  If you run 
out of Inodes then you can't create new files.  If you have too big Inode 
tables then you waste disk space and have a longer fsck time (at least before 
uninit_bg).

The other metadata for Ext4 is allocated from data blocks so it will run out 
when data space runs out (EG if mkdir fails due to lack of space on ext4 then 
you can delete a file to make it work).

But really BTRFS is just a totally different filesystem.  Ext4 lacks the 
features such as full data checksums and subvolume support that make these 
things difficult.

I always found the CP/M filesystem to be easier.  It was when they added 
support for directories that things started getting difficult.  :-#

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Questions on incremental backups

2014-07-18 Thread Russell Coker
On Fri, 18 Jul 2014 13:56:58 Sam Bull wrote:
 On ven, 2014-07-18 at 14:35 +1000, Russell Coker wrote:
  Ignoring directories in send/recv is done by subvol. Even if you use
  rsync it's a good idea to have different subvols for directory trees
  with different backup requirements.
 
 So, an inner subvol won't be backed up? If I wanted a full backup, I
 would presumably get snapshots of each subvol separately, right?

If you use btrfs send/recv then it won't get the inner subvol.  If you use 
rsync then by default it goes through the entire directory tree unless you use 
the -x option.
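
For example (host and paths are placeholders), -x stops rsync at subvolume
boundaries because each subvolume has its own device number:

rsync -aHAXx /home/ backuphost:/backup/home/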

  Displaying backups is an issue of backup software. It is above the
  level that BTRFS development touches. While people here can probably
  offer generic advice on backup software it's not the topic of the
  list.
 
 As said, I don't mind developing the software. But, is the required
 information easily available? Is there a way to get a diff, something
 like a list of changed/added/removed files between snapshots?

Your usual diff utility will do it.  I guess you could parse the output of 
btrfs send.
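
For example (snapshot paths are placeholders):

diff -qr /backup/snap-2014-07-17 /backup/snap-2014-07-18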

 And, finally, nobody has mentioned on the possibility of merging
 multiple snapshots into a single snapshot. Would this be possible, to
 create a snapshot that contains the most recent version of each file
 present across all of the snapshots (including files which may be
 present in only one of the snapshots)?

There is no btrfs functionality for that.  But I'm sure you could do something 
with standard Unix utilities and copying files around.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions on incremental backups

2014-07-17 Thread Russell Coker
Daily snapshots work well with kernel 3.14 and above (I had problems with 3.13 
and previous). I have snapshots every 15 mins on some subvols.

Very large numbers of snapshots can cause performance problems. I suggest 
keeping below 1000 snapshots at this time.

You can use send/recv functionality for remote backups. So far I've used rsync; 
it works well, and send/recv has some limitations regarding filesystem structure 
etc. Rsync can transfer to an ext4 or ZFS filesystem if you wish.

Ignoring directories in send/recv is done by subvol. Even if you use rsync it's 
a good idea to have different subvols for directory trees with different backup 
requirements.

Displaying backups is an issue of backup software. It is above the level that 
BTRFS development touches. While people here can probably offer generic advice 
on backup software it's not the topic of the list.

I use date based snapshots on my backup BTRFS filesystems and I can easily 
delete snapshots in the middle of the list.
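
Something along these lines (paths are placeholders):

btrfs subvolume snapshot -r /home /backup/home-$(date +%Y%m%d-%H%M)
# any snapshot can later be dropped independently of the others
btrfs subvolume delete /backup/home-20140610-0300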
-- 
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs and LBA errors

2014-07-15 Thread Russell Coker
On Tue, 15 Jul 2014 11:42:05 constantine wrote:
 Thank you very much for your advice. It worked!

Great!

 I verified that the superblocks 1 and 2 had similar information with
 btrfs-show-super -i 1 /dev/sdc1 (and -i 2) and then with crossed
 fingers:
 btrfs-select-super  -s 2 /dev/sdc1
 which restored my btrfs filesystem.
 
 Then I ran scrub. For future reference, scrub was stuck:
 
 # btrfs scrub status partition/
 scrub status for c1eb1aaf-665a-4337-9d04-3c3921aa67e0
 scrub started at Thu Jul 10 21:30:41 2014, running for 2743 seconds
total bytes scrubbed: 193.47GiB with 0 errors
 
 and could not cancel it:
 
 btrfs scrub cancel partition/
 ERROR: scrub cancel failed on Downloads/: not running
 
 After deleting
 /var/lib/btrfs/scrub.status.c1eb1aaf-665a-4337-9d04-3c3921aa67e0 it was
 completed successfully.

That's a known bug.

Now I hope you have made a good backup, the next problem you encounter may not 
be as easy to solve.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs transaction checksum corruption losing root of the tree bizarre UUID change.

2014-07-11 Thread Russell Coker
On Fri, 11 Jul 2014 10:38:22 Duncan wrote:
  I've moved all drives and move those to my main rig which got a nice
  16GB of ecc ram, so errors of ram, cpu, controller should be kept
  theoretically eliminated.
 
 It's worth noting that ECC RAM doesn't necessarily help when it's an in-
 transit bus error.  Some years ago I had one of the original 3-digit 
 Opteron machines, which of course required registered and thus ECC RAM.  
 The first RAM I purchased for that board was apparently borderline on its 
 timing certifications, and while it worked fine when the system wasn't 
 too stressed, including with memtest, which passed with flying colors, 
 under medium memory activity it would very occasionally give me, for 
 instance, a bad bzip2 csum, and with intensive memory activity, the 
 problem would be worse (more bz2 decompress errors, gcc would error out 
 too sometimes and I'd have to restart my build, very occasionally the 
 system would crash).

If bad RAM causes corrupt memory but no ECC error reports then it probably 
wouldn't be a bus error.  A bus error SHOULD give ECC reports.

One problem is that RAM errors aren't random.  From memory the Hamming codes 
used fix 100% of single bit errors, detect 100% of 2 bit errors, and let some 
3 bit errors through.  If you have a memory module with 3 chips on it (the 
later generation of DIMM for any given size) then an error in 1 chip can 
change 4 bits.

The other main problem is that if you have a read or write going to the wrong 
address then you lose as AFAIK there's no ECC on address lines.

But I still recommend ECC RAM, it just decreases the scope for problems.  
About half the serious problems I've had with BTRFS have been caused by a 
faulty DIMM...

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs and LBA errors

2014-07-11 Thread Russell Coker
On Fri, 11 Jul 2014 13:42:40 constantine wrote:
 Btrfs filesystem could not be mounted because /dev/sdc1 had unreadable
 sectors. It is/was  a single filesystem (not raid1 or raid0) over /dev/sda1
 and /dev/sdc1.

What does file -s /dev/sda1 /dev/sdc1 report?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs and LBA errors

2014-07-11 Thread Russell Coker
On Fri, 11 Jul 2014 21:29:07 constantine wrote:
 Thank you very much for your response:
 
 # file -s /dev/sda1 /dev/sdc1
 /dev/sda1: BTRFS Filesystem label partition, sectorsize 4096,
 nodesize 4096, leafsize 4096,
 UUID=c1eb1aaf-665a-4337-9d04-3c3921aa67e0, 1683870334976/3010310701056
 bytes used, 2 devices
 /dev/sdc1: data

Looks like the primary superblock on /dev/sdc1 is corrupt.

http://www.funtoo.org/BTRFS_Fun#btrfs-select-super_.2F_btrfs-zero-log

The above URL documents how to use the other superblocks.  It notes that you 
only get one chance - if you do this wrong you will dramatically reduce the 
probability of getting your data back.  Do not do this on the master copy of 
your data!

Buy 2 disks of equal or greater capacity, copy the block devices, and then 
work with the copy.  sdc needs to be replaced anyway no matter what you do 
and getting a replacement for sda would be a good strategy anyway.  Buy 2*4TB 
disks and you can make a new RAID-1 array that has more capacity than the non-
RAID filesystem you currently have.
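
To copy a partly unreadable device something like GNU ddrescue is the usual
tool (device names are placeholders); plain dd also works if you accept
unreadable sectors being zero-filled:

ddrescue /dev/sdc1 /dev/sdX1 /root/sdc1.map
# or
dd if=/dev/sdc1 of=/dev/sdX1 bs=1M conv=noerror,sync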

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-09 Thread Russell Coker
On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
  - for someone using SAS or enterprise SATA drives with Linux, I
  understand btrfs gives the extra benefit of checksums, are there any
  other specific benefits over using mdadm or dmraid?
 
 I think I can answer this one.
 
 Most important advantage I think is BTRFS is aware of which blocks of the
 RAID are in use and need to be synced:
 
 - Instant initialization of RAID regardless of size (unless at some
 capacity mkfs.btrfs needs more time)

From mdadm(8):

   --assume-clean
  Tell mdadm that the array pre-existed and is known to be  clean.
  It  can be useful when trying to recover from a major failure as
  you can be sure that no data will be affected unless  you  actu‐
  ally  write  to  the array.  It can also be used when creating a
  RAID1 or RAID10 if you want to avoid the initial resync, however
  this  practice  — while normally safe — is not recommended.  Use
  this only if you really know what you are doing.

  When the devices that will be part of a new  array  were  filled
  with zeros before creation the operator knows the array is actu‐
  ally clean. If that is the case,  such  as  after  running  bad‐
  blocks,  this  argument  can be used to tell mdadm the facts the
  operator knows.

While it might be regarded as a hack, it is possible to do a fairly instant 
initialisation of a Linux software RAID-1.
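
Roughly (device names are placeholders):

mdadm --create /dev/md0 --level=1 --raid-devices=2 --assume-clean /dev/sda1 /dev/sdb1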

 - Rebuild after disk failure or disk replace will only copy *used* blocks

Have you done any benchmarks on this?  The down-side of copying used blocks is 
that you first need to discover which blocks are used.  Given that seek time is 
a major bottleneck, above some proportion of space used it will be faster to just 
copy the entire disk.

I haven't done any tests on BTRFS in this regard, but I've seen a disk 
replacement on ZFS run significantly slower than a dd of the block device 
would.

 Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID
 should be able to do this as well. But also for scrubbing: BTRFS only
 check and repairs used blocks.

When you scrub Linux Software RAID (and in fact pretty much every RAID) it 
will only correct errors that the disks flag.  If a disk returns bad data and 
says that it's good then the RAID scrub will happily copy the bad data over 
the good data (for a RAID-1) or generate new valid parity blocks for bad data 
(for RAID-5/6).

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

Page 12 of the above document says that nearline disks (IE the ones people 
like me can afford for home use) have a 0.466% incidence of returning bad data 
and claiming it's good in a year.  Currently I run about 20 such disks in a 
variety of servers, workstations, and laptops.  Therefore the probability of 
having no such errors on all those disks would be .99534^20=.91081.  The 
probability of having no such errors over a period of 10 years would be 
(.99534^20)^10=.39290 which means that over 10 years I should expect to have 
such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are 
necessary features.
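
Those figures are just the per-disk annual rate compounded, e.g.:

awk 'BEGIN { p = 0.99534 ^ 20; printf "1 year: %.5f  10 years: %.5f\n", p, p ^ 10 }'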

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs loopback problems

2014-07-06 Thread Russell Coker
root@yoyo:/# btrfs fi df /
Data, RAID1: total=9.00GiB, used=6.95GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=82.95MiB
root@yoyo:/# df -h /
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2   273G   15G  257G   6% /

I have a Xen server that has a RAID-1 array of 2*140G SAS disks, above is the 
df output.

The Xen server is for training and testing (including training people to use 
BTRFS), hence the name.  I have a subvol /xenstore which has image files for 
BTRFS RAID-1 filesystems (RAID-1 within RAID-1 is going to suck for 
performance but be good for training, and apparently testing).

root@yoyo:/# mount -o loop,degraded /xenstore/btrfsa /mnt/tmp

I mounted one of them loopback with the above command and then tried doing an 
apt-get update in a chroot.  The result was that apt-get entered D state and 
the following was in the kernel message log.  The system is running Debian 
kernel 3.14.9.

[ 2280.784105] INFO: task btrfs-flush_del:1339 blocked for more than 120 
seconds.
[ 2280.784126]   Not tainted 3.14-1-amd64 #1
[ 2280.784136] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables 
this message.
[ 2280.784152] btrfs-flush_del D 8800547f0cf8 0  1339  2 
0x
[ 2280.784173]  8800547f08e0 0246 00014380 
8800547f5fd8
[ 2280.784232]  00014380 8800547f08e0 880077454c10 
8800c028
[ 2280.784290]  0002 8111f240 8800547f5ce0 
8800547f5dc8
[ 2280.784347] Call Trace:
[ 2280.784376]  [8111f240] ? wait_on_page_read+0x60/0x60
[ 2280.784410]  [814bcd54] ? io_schedule+0x94/0x130
[ 2280.784440]  [8111f245] ? sleep_on_page+0x5/0x10
[ 2280.784470]  [814bd0c4] ? __wait_on_bit+0x54/0x80
[ 2280.784501]  [8111f04f] ? wait_on_page_bit+0x7f/0x90
[ 2280.784534]  [8109e300] ? autoremove_wake_function+0x30/0x30
[ 2280.784566]  [8112c008] ? pagevec_lookup_tag+0x18/0x20
[ 2280.784597]  [8111f130] ? filemap_fdatawait_range+0xd0/0x160
[ 2280.784654]  [a0230da5] ? btrfs_wait_ordered_range+0x65/0x120 
[btrfs]
[ 2280.784709]  [a021d7a1] ? btrfs_run_delalloc_work+0x21/0x80 
[btrfs]
[ 2280.784750]  [a02449b0] ? worker_loop+0x140/0x520 [btrfs]
[ 2280.784782]  [814bc5e9] ? __schedule+0x2a9/0x700
[ 2280.784820]  [a0244870] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[ 2280.784854]  [8107f858] ? kthread+0xb8/0xd0
[ 2280.784884]  [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.784917]  [814c7acc] ? ret_from_fork+0x7c/0xb0
[ 2280.784948]  [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.784980] INFO: task btrfs-transacti:1343 blocked for more than 120 
seconds.
[ 2280.785028]   Not tainted 3.14-1-amd64 #1
[ 2280.785055] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables 
this message.
[ 2280.785105] btrfs-transacti D 8800547c4968 0  1343  2 
0x
[ 2280.785142]  8800547c4550 0246 00014380 
880054055fd8
[ 2280.785200]  00014380 8800547c4550 88004f063f70 
880054055d78
[ 2280.785257]  88004f063f68 8800547c4550 880056047de0 
880054055df0
[ 2280.785314] Call Trace:
[ 2280.785340]  [814bbe49] ? schedule_timeout+0x209/0x2a0
[ 2280.785372]  [81095b6f] ? enqueue_task_fair+0x2bf/0xdf0
[ 2280.785403]  [81091337] ? sched_clock_cpu+0x47/0xb0
[ 2280.785434]  [8131dc6b] ? notify_remote_via_irq+0x2b/0x50
[ 2280.785466]  [8108bee5] ? check_preempt_curr+0x65/0x90
[ 2280.785497]  [8108bf1f] ? ttwu_do_wakeup+0xf/0xc0
[ 2280.785528]  [814bd3f0] ? wait_for_completion+0xa0/0x110
[ 2280.785560]  [8108e7a0] ? wake_up_state+0x10/0x10
[ 2280.785599]  [a02273dd] ? 
btrfs_wait_and_free_delalloc_work+0xd/0x20 [btrfs]
[ 2280.785656]  [a02304f6] ? 
btrfs_run_ordered_operations+0x1e6/0x2b0 [btrfs]
[ 2280.785712]  [a02182b7] ? btrfs_commit_transaction+0x217/0x990 
[btrfs]
[ 2280.785768]  [a0218abb] ? start_transaction+0x8b/0x550 [btrfs]
[ 2280.785807]  [a021432d] ? transaction_kthread+0x1ad/0x240 [btrfs]
[ 2280.785846]  [a0214180] ? btrfs_cleanup_transaction+0x510/0x510 
[btrfs]
[ 2280.785894]  [8107f858] ? kthread+0xb8/0xd0
[ 2280.785924]  [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.785956]  [814c7acc] ? ret_from_fork+0x7c/0xb0
[ 2280.785985]  [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.786017] INFO: task apt-get:1358 blocked for more than 120 seconds.
[ 2280.786047]   Not tainted 3.14-1-amd64 #1
[ 2280.786074] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables 
this message.
[ 2280.786122] apt-get D 88005bce2d78 0  1358   1357 
0x
[ 2280.786159]  88005bce2960 0286 00014380 
88005a681fd8
[ 2280.786218]  00014380 88005bce2960 880077494c10 

Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?

2014-07-03 Thread Russell Coker
On Thu, 3 Jul 2014 18:19:38 Marc MERLIN wrote:
 I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been
 running out of memory and deadlocking (panic= doesn't even work).
 I downgraded back to 3.14, but I already had the problem once since then.

Is there any correlation between such problems and BTRFS operations such as 
creating snapshots or running a scrub/balance?

Back in ~3.10 days I had serious problems with BTRFS memory use when removing 
multiple snapshots or balancing.  But at about 3.13 they all seemed to get 
fixed.

I usually didn't have a kernel panic when I had such problems (although I 
sometimes had a system lock up solid such that I couldn't even determine what 
it's problem was).  Usually the Oom handler started killing big processes such 
as chromium when it shouldn't have needed to.

Note that I haven't verified that the BTRFS memory use is reasonable in all 
such situations.  Merely that it doesn't use enough to kill my systems.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 3+ drives

2014-06-28 Thread Russell Coker
On Sat, 28 Jun 2014 04:26:43 Duncan wrote:
 Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:
  On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
  Can I get more protection by using more than 2 drives?
  
  I had an onboard RAID a few years back that would let me use RAID1
  across up to 4 drives.
  
  Currently the only RAID level that fully works in BTRFS is RAID-1 with
  data on 2 disks.
 
 Not /quite/ correct.  Raid0 works, but of course that isn't exactly
 RAID as it's not redundant.  And raid10 works.  But that's simply
 raid0 over raid1.  So depending on whether you consider raid0 actually

http://en.wikipedia.org/wiki/Linux_MD_RAID_10#LINUX-MD-RAID-10

There are a number of ways of doing RAID-0 over RAID-1, but BTRFS doesn't do 
any of them.  When you have more than 2 disks and tell BTRFS to do RAID-1 you 
get a result that might be somewhat comparable to Linux software RAID-10, 
except for the issue of having disks of different sizes and adding more disks 
after creating the RAID.

 RAID or not, which in turn depends on how strict you are with the
 redundant part, there is or is not more than btrfs raid1 working.

The way BTRFS, ZFS, and WAFL work is quite different to anything described in 
any of the original papers on RAID.  One could make a case that what these 
filesystems do shouldn't be called RAID, but then we would be searching for 
another term for it.

  If you have 4 disks in the array then each block will
  be on 2 of the disks.
 
 Correct.
 
 FWIW I'm told that the paper that laid out the original definition of
 RAID (which was linked on this list in a similar discussion some months
 ago) defined RAID-1 as paired redundancy, no matter the number of
 devices.  Various implementations (including Linux' own mdraid soft-raid,
 and I believe dmraid as well) feature multi-way-mirroring aka N-way-
 mirroring such that N devices equals N way mirroring, but that's an
 implementation extension and isn't actually necessary to claim RAID-1
 support.

The paper is a little ambiguous as to whether a 3 disk mirror can be RAID-1.

 So look for N-way-mirroring when you go RAID shopping, and no, btrfs does
 not have it at this time, altho it is roadmapped for implementation after
 completion of the raid5/6 code.
 
 FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for
 device redundancy, but to take full advantage of btrfs data integrity
 features, allowing to scrub a checksum-mismatch copy with the content
 of a checksum-validated copy if available.  That's currently possible,
 but due to the pair-mirroring-only restriction, there's only one
 additional copy, and if it happens to be bad as well, there's no
 possibility of a third copy to scrub from.  As it happens my personal
 sweet-spot between cost/performance and reliability would be 3-way
 mirroring, but once they code beyond N=2, N should go unlimited, so N=3,
 N=4, N=50 if you have a way to hook them all up... should all be possible.

What I want is the ZFS copies= feature.

  If you want to have 4 disks in a fully redundant configuration (IE you
  could lose 3 disks without losing any data) then the thing to do is to
  have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1
  on top of that.
 
 The caveat with that is that at least mdraid1/dmraid1 has no verified
 data integrity, and while mdraid5/6 does have 1/2-way-parity calculation,
 it's only used in recovery, NOT cross-verified in ordinary use.

Linux Software RAID-6 only uses the parity when you have a hard read error.  
If you have a disk return bad data and say it's good then you just lose.

That said the rate of disks returning such bad data is very low.  If you had a 
hypothetical array of 4 disks as I suggested then to lose data you need to 
have one pair of disks entirely fail and another disk return corrupt data or 
have 2 disks in separate RAID-1 pairs return corrupt data on matching sectors 
(according to BTRFS data copies) such that Linux software RAID copies the 
corrupt data to the good disk.

That sort of thing is much less likely than having a regular BTRFS RAID-1 
array of 2 disks failing.

Also if you were REALLY paranoid you could have 2 BTRFS RAID-1 filesystems 
that each contain a single large file.  Those 2 large files could be run via 
losetup and used for another BTRFS RAID-1 filesystem.  That gets you 
redundancy at both levels.  Of course if you had 2 disks in one pair fail then 
the loopback BTRFS filesystem would still be OK.
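
A rough sketch of that layout (mount points and sizes are placeholders):

# one large file on each of the two underlying BTRFS RAID-1 filesystems
truncate -s 100G /mnt/pair1/blob
truncate -s 100G /mnt/pair2/blob
losetup /dev/loop0 /mnt/pair1/blob
losetup /dev/loop1 /mnt/pair2/blob
mkfs.btrfs -m raid1 -d raid1 /dev/loop0 /dev/loop1
mount /dev/loop0 /mnt/nested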

How does the BTRFS kernel code handle a loopback device read failure?

 In fact, with md/dmraid and its reasonable possibility of silent
 corruption since at that level any of the copies could be returned and
 there's no data integrity checking, if whatever md/dmraid level copy /is/
 returned ends up being bad, then btrfs will consider that side of the
 pair bad, without any way to check additional copies at the underlying md/
 dmraid level.  Effectively you only have two verified copies

Re: RAID1 3+ drives

2014-06-28 Thread Russell Coker
On Sat, 28 Jun 2014 11:38:47 Duncan wrote:
 And with the size of disks we have today, the statistics on multiple
 whole device reliability are NOT good to us!  There's a VERY REAL chance,
 even likelihood, that at least one block on the device is going to be
 bad, and not be caught by its own error detection!

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The above paper suggests that it's about 10% of SATA disks getting such errors 
per year and that typically a disk that has such a problem has it for ~50 
sectors.  The probability of having 2 disks randomly get such errors (if they 
are truly random and independent) would be something like 1% per year.  The 
probability that the ~50 sectors on each of 2*3TB disks happening to match up 
is much lower.

  Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
  filesystems that each contain a single large file.  Those 2 large files
  could be run via losetup and used for another BTRFS RAID-1 filesystem.
  That gets you redundancy at both levels.  Of course if you had 2 disks
  in one pair fail then the loopback BTRFS filesystem would still be OK.
 
 But the COW and fragmentation issues on the bottom level... OUCH!  And
 you can't simply set NOCOW, because that turns off the checksumming as
 well, leaving you right back where you were without the integrity
 checking!

It really depends on how much performance you need.  I've got some virtual 
servers running BTRFS within BTRFS and with modern hardware and a light load 
it works OK.
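
For reference NOCOW isn't usually set filesystem-wide, it's set per file or 
per directory before the data is written, and as noted above it also turns 
off checksumming for those files.  A sketch with a made-up path:

# new files created in this directory inherit the NOCOW attribute
mkdir /var/lib/vm-images
chattr +C /var/lib/vm-images
lsattr -d /var/lib/vm-images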

 *BUT* at a cost of essentially *CONSTANT* scrubbing.  Constant because at
 the multi-TBs we're talking, just completing a single scrub cycle could
 well take more than a standard 8-hour work-day, so by the time you
 finish, it's already about time to start the next scrub cycle.

Scrubbing my BTRFS RAID-1 filesystem with 2.4TB of data stored on a pair of 
3TB disks takes 5 hours.
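
For reference that scrub is just started and checked with the standard 
commands, /mnt/data being an example mount point:

# start a scrub on the mounted filesystem (runs in the background)
btrfs scrub start /mnt/data
# show progress and error counts
btrfs scrub status /mnt/data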

 That sort of constant scrubbing is going to take its toll both on device
 life and on I/O thruput for whatever data you're actually storing on the
 device, since a good share of the time it's going to be scrubbing as
 well, slowing down the speed of the real I/O.

Some years ago I asked an executive from a company that manufactured hard 
drives about this.  The engineering manager who was directed to answer my 
question told me that the drives were designed to perform any sequence of 
legal operations continually for the warranty period.  So if a disk had a 3 
year warranty then it should be able to survive a scrubbing loop for 3 years.

But scrubbing a system that runs 24*7 is a problem.  Hopefully we will get a 
speed limit feature for BTRFS scrubbing as there is for Linux software RAID 
rebuild/scrub.
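
For comparison the Linux software RAID limits are plain sysctls, and as far 
as I know the closest BTRFS gets at the moment is lowering the scrub's I/O 
priority (the mount point is an example):

# md resync/check speed limits, in KiB/s per device
sysctl dev.raid.speed_limit_min
sysctl dev.raid.speed_limit_max
# BTRFS has no bandwidth limit, but the scrub can run in the idle I/O class
btrfs scrub start -c 3 /mnt/data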

  No.  I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub
  every Sunday night.  If I had an array of 4 disks then I could do scrubs
  on Saturday night as well.
 
 But are you scrubbing at both the btrfs and the md/dmraid level?  That'll
 effectively double the scrub-time.

It's a BTRFS RAID-1; there is no mdadm on that system.

 And while that might not take a full 24 hours, it's likely to take a
 significant enough portion of 24 hours, that if you're doing a full mdraid
 and btrfs level both scrub every two days, some significant fraction (say
 a third to a half) of the time will be spent scrubbing, during which
 normal I/O speeds will be significantly reduced, while also reducing
 device lifetime due to the relatively high duty cycle seek activity.

When the expected error rate for SATA disks is ~10% of disks having errors 
per year, a scrub every second day seems rather paranoid.

But if you are that paranoid then the wisc.edu paper suggests that you should 
be buying enterprise disks that have a much lower error rate.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: [Question] Btrfs on iSCSI device

2014-06-27 Thread Russell Coker
On Fri, 27 Jun 2014 18:34:34 Goffredo Baroncelli wrote:
 I don't think that it is possible to mount the _same device_ at the _same
 time_ on two different machines. And this doesn't depend on the filesystem.

If you use a clustered filesystem then you can safely mount it on multiple 
machines.

If you use a non-clustered filesystem it can still mount and even appear to 
work for a while.  It's surprising how many writes you can make to a 
dual-mounted filesystem that's not designed for such things before you get a 
totally broken filesystem.

On Fri, 27 Jun 2014 13:15:16 Austin S Hemmelgarn wrote:
 The reason it appears to work when using iSCSI and not with directly
 connected parallel SCSI or SAS is that iSCSI doesn't provide low level
 hardware access.

I've tried this with dual-attached FC and had no problems mounting.  In what 
way is directly connected SCSI different from FC?

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/



Re: RAID1 3+ drives

2014-06-27 Thread Russell Coker
On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
 Can I get more protection by using more than 2 drives?
 
 I had an onboard RAID a few years back that would let me use RAID1
 across up to 4 drives.

Currently the only RAID level that fully works in BTRFS is RAID-1 with data on 
2 disks.  If you have 4 disks in the array then each block will be on 2 of the 
disks.  RAID-5/6 code mostly works but the last report I read indicated that 
some situations for recovery and disk replacement didn't work - presumably 
anyone who's afraid of multiple disks failing isn't going to want to trust 
BTRFS RAID-6 code at the moment.
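
For example, even with 4 devices a BTRFS RAID-1 filesystem keeps only 2 
copies of each block (device names are examples only):

# both metadata and data are mirrored, but each block lands on only 2 disks
mkfs.btrfs -m raid1 -d raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd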

If you want to have 4 disks in a fully redundant configuration (IE you could 
lose 3 disks without losing any data) then the thing to do is to have 2 RAID-1 
arrays with Linux software RAID and then run BTRFS RAID-1 on top of that.
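
A sketch of that stacking, with made-up device names:

# build 2 Linux software RAID-1 pairs
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
# BTRFS then mirrors across the 2 md arrays, so any 3 disks can fail
mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1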

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


