Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"
https://www.spinics.net/lists/linux-btrfs/msg77927.html

Thanks to Hans van Kranenburg and Holger Hoffstätte: the above link leads to a message with the patch for this, which was already included in kernel 4.16.11. That kernel was uploaded to Debian on the 27th of May and got into testing about the time that my message reached the SE Linux list.

The kernel from Debian/Stable still has the issue, so using a testing kernel might be a good option for dealing with this problem at the moment.

On Monday, 4 June 2018 11:14:52 PM AEST Russell Coker wrote:
> The command "reboot -nffd" (kernel reboot without flushing kernel buffers
> or writing status) when run on a BTRFS system with SE Linux will often
> result in /var/log/audit/audit.log being unlabeled. It also results in
> some systemd-journald files like
> /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/system.journal being
> unlabeled, but that is rarer. I think that the same problem afflicts both
> systemd-journald and auditd, but it's a race condition that on my systems
> (both production and test) is more likely to affect auditd.
>
> root@stretch:/# xattr -l /var/log/audit/audit.log
> security.selinux:
> 0000   73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5F    system_u:object_
> 0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73    r:auditd_log_t:s
> 0020   30 00                                               0.
>
> SE Linux uses the xattr "security.selinux". You can see what it's doing
> with xattr(1), but generally using "ls -Z" is easiest.
>
> If this issue only affected "reboot -nffd" then a solution might be to
> just not run that command. However, this also affects systems after a
> power outage.
>
> I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security
> update for Debian/Stretch, which is the latest supported release of
> Debian). I have also reproduced it in an identical manner with kernel
> 4.16.0-1-amd64 (the latest from Debian/Unstable).
> For testing I reproduced this with a 4G filesystem in a VM, but in
> production it has happened on BTRFS RAID-1 arrays, both SSD and HDD.
>
> #!/bin/bash
> set -e
> COUNT=$(ps aux|grep [s]bin/auditd|wc -l)
> date
> if [ "$COUNT" = "1" ]; then
>   echo "all good"
> else
>   echo "failed"
>   exit 1
> fi
>
> The above is the script /usr/local/sbin/testit. I test for auditd running
> because it aborts if the context on its log file is wrong. When SE Linux
> is in enforcing mode, an incorrect or missing label on the audit.log file
> causes auditd to abort.
>
> root@stretch:~# ls -liZ /var/log/audit/audit.log
> 37952 -rw-------. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun  1 12:23 /var/log/audit/audit.log
>
> The above is from before I ran the tests.
>
> while ssh stretch /usr/local/sbin/testit ; do
>   ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 &
>   sleep 20
> done
>
> The above is the shell code I run to do the tests. Note that the VM in
> question runs on SSD storage, which is why it can consistently boot in
> less than 20 seconds.
>
> Fri  1 Jun 12:26:13 UTC 2018
> all good
> Fri  1 Jun 12:26:33 UTC 2018
> failed
>
> The above is the output from that shell code. After the first reboot it
> fails. The probability of failure on my test system is greater than 50%.
>
> root@stretch:~# ls -liZ /var/log/audit/audit.log
> 37952 -rw-------. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun  1 12:26 /var/log/audit/audit.log
>
> Now the result. Note that the inode has not changed. I could understand a
> newly created file missing an xattr, but this is an existing file which
> shouldn't have had its xattr changed. But somehow it gets corrupted.
>
> The first possibility I considered was that SE Linux code might be at
> fault. I asked on the SE Linux mailing list (I haven't been involved in
> SE Linux kernel code for about 15 years) and was informed that this isn't
> likely at all. There have been no problems like this reported with other
> filesystems.
> Does anyone have any ideas of other tests I should run? Anyone want me to
> try a different kernel? I can give root on a VM to anyone who wants to
> poke at it.

--
My Main Blog          http://etbe.coker.com.au/
My Documents Blog     http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
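The quoted testit script detects the problem indirectly, by whether auditd survives. A direct check could look for the unlabeled_t context in "ls -Z" output instead; the following is a rough sketch (the function name and the canned input lines are mine, not from the original scripts):

```shell
# Hypothetical helper: print "failed" if an input line shows the
# unlabeled_t context, "all good" otherwise, mirroring testit's output.
label_check() {
    awk '{ print ($0 ~ /unlabeled_t/) ? "failed" : "all good" }'
}

# Canned examples matching the before/after "ls -Z" output in the post;
# on a real system you would pipe in: ls -Z /var/log/audit/audit.log
echo "system_u:object_r:auditd_log_t:s0 /var/log/audit/audit.log" | label_check
echo "system_u:object_r:unlabeled_t:s0 /var/log/audit/audit.log" | label_check
```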
BTRFS losing SE Linux labels on power failure or "reboot -nffd"
The command "reboot -nffd" (kernel reboot without flushing kernel buffers or writing status) when run on a BTRFS system with SE Linux will often result in /var/log/audit/audit.log being unlabeled. It also results in some systemd-journald files like /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/system.journal being unlabeled, but that is rarer. I think that the same problem afflicts both systemd-journald and auditd, but it's a race condition that on my systems (both production and test) is more likely to affect auditd.

root@stretch:/# xattr -l /var/log/audit/audit.log
security.selinux:
0000   73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5F    system_u:object_
0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73    r:auditd_log_t:s
0020   30 00                                               0.

SE Linux uses the xattr "security.selinux". You can see what it's doing with xattr(1), but generally using "ls -Z" is easiest.

If this issue only affected "reboot -nffd" then a solution might be to just not run that command. However, this also affects systems after a power outage.

I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security update for Debian/Stretch, which is the latest supported release of Debian). I have also reproduced it in an identical manner with kernel 4.16.0-1-amd64 (the latest from Debian/Unstable). For testing I reproduced this with a 4G filesystem in a VM, but in production it has happened on BTRFS RAID-1 arrays, both SSD and HDD.

#!/bin/bash
set -e
COUNT=$(ps aux|grep [s]bin/auditd|wc -l)
date
if [ "$COUNT" = "1" ]; then
  echo "all good"
else
  echo "failed"
  exit 1
fi

The above is the script /usr/local/sbin/testit. I test for auditd running because it aborts if the context on its log file is wrong. When SE Linux is in enforcing mode, an incorrect or missing label on the audit.log file causes auditd to abort.

root@stretch:~# ls -liZ /var/log/audit/audit.log
37952 -rw-------. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun  1 12:23 /var/log/audit/audit.log

The above is from before I ran the tests.
while ssh stretch /usr/local/sbin/testit ; do
  ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 &
  sleep 20
done

The above is the shell code I run to do the tests. Note that the VM in question runs on SSD storage, which is why it can consistently boot in less than 20 seconds.

Fri  1 Jun 12:26:13 UTC 2018
all good
Fri  1 Jun 12:26:33 UTC 2018
failed

The above is the output from that shell code. After the first reboot it fails. The probability of failure on my test system is greater than 50%.

root@stretch:~# ls -liZ /var/log/audit/audit.log
37952 -rw-------. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun  1 12:26 /var/log/audit/audit.log

Now the result. Note that the inode has not changed. I could understand a newly created file missing an xattr, but this is an existing file which shouldn't have had its xattr changed. But somehow it gets corrupted.

The first possibility I considered was that SE Linux code might be at fault. I asked on the SE Linux mailing list (I haven't been involved in SE Linux kernel code for about 15 years) and was informed that this isn't likely at all. There have been no problems like this reported with other filesystems.

Does anyone have any ideas of other tests I should run? Anyone want me to try a different kernel? I can give root on a VM to anyone who wants to poke at it.
Re: speed up big btrfs volumes with ssds
On Monday, 4 September 2017 2:57:18 PM AEST Stefan Priebe - Profihost AG wrote:
> > Then roughly make sure the complete set of metadata blocks fits in the
> > cache. For an fs of this size let's say/estimate 150G. Then maybe same
> > or double for data, so an SSD of 500G would be a first try.
>
> I would use 1TB devices for each RAID or a 4TB PCIe card.

One thing I've considered is to create a filesystem with a RAID-1 of SSDs and then create lots of files with long names to use up a lot of space on the SSDs. Then delete those files and add disks to the filesystem. After that, BTRFS should keep using the metadata chunks already allocated on the SSDs for all metadata and use the disks for data only.

I haven't yet tried bcache, but I would prefer something simpler with one less layer.
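That procedure could be sketched roughly as follows. This is untested on real hardware; the device names are hypothetical, and the destructive mkfs/device-add steps are commented out, leaving only the metadata-padding part runnable:

```shell
#!/bin/sh
# Sketch of the SSD metadata pre-allocation idea described above.
set -e

MNT=${1:-$(mktemp -d)}   # on a real system: the new filesystem's mountpoint

# 1. Create and mount the filesystem on the SSD pair first (destructive,
#    hence commented out; device names are examples only):
# mkfs.btrfs -m raid1 -d raid1 /dev/ssd1 /dev/ssd2
# mount /dev/ssd1 "$MNT"

# 2. Force lots of metadata chunk allocation: many empty files with long
#    names live entirely in the metadata trees.
mkdir -p "$MNT/pad"
i=0
while [ "$i" -lt 1000 ]; do
    : > "$MNT/pad/$(printf 'long_name_%0200d' "$i")"
    i=$((i + 1))
done

# 3. Delete them; the metadata chunks stay allocated on the SSDs.
rm -r "$MNT/pad"

# 4. Now add the rotating disks; the expectation in the post is that new
#    metadata keeps using the chunks already allocated on the SSDs.
# btrfs device add /dev/sdc /dev/sdd "$MNT"
echo "padded and removed $i files under $MNT"
```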
read-only for no good reason on 4.9.30
I have a system with less than 50% of its disk space used, and it just started rejecting writes due to lack of disk space. I ran "btrfs balance" and then it started working correctly again.

It seems that a BTRFS filesystem, if left alone, will eventually get fragmented enough that it rejects writes (I've had similar issues with other systems running BTRFS on other kernel versions). Is this a known issue? Is there any good way of recognising when it's likely to happen? Is there anything I can do, other than rewriting a medium size file, to determine when it's happened?

# uname -a
Linux trex 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux

# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc2       239G  113G  126G  48% /

# btrfs fi df /
Data, RAID1: total=117.00GiB, used=111.81GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=516.00MiB
GlobalReserve, single: total=246.59MiB, used=0.00B

# btrfs dev usage /
/dev/sdc, ID: 1
   Device size:         238.47GiB
   Device slack:            0.00B
   Data,RAID1:          117.00GiB
   Metadata,RAID1:        1.00GiB
   System,RAID1:         32.00MiB
   Unallocated:         120.44GiB

/dev/sdd, ID: 2
   Device size:         238.47GiB
   Device slack:            0.00B
   Data,RAID1:          117.00GiB
   Metadata,RAID1:        1.00GiB
   System,RAID1:         32.00MiB
   Unallocated:         120.44GiB
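One rough heuristic for the "how to recognise it" question is to watch how full the allocated data chunks are, since ENOSPC of this kind tends to arrive when the existing chunks are nearly full. This is a sketch, not an official btrfs tool; the function name and the 90% threshold are mine, and it is exercised here on canned "btrfs fi df" output:

```shell
# Parse "btrfs fi df" output and warn when data chunks are nearly full.
check_data_usage() {
    awk '
        /^Data/ {
            total = $0; used = $0
            sub(/.*total=/, "", total); sub(/GiB.*/, "", total)
            sub(/.*used=/, "", used);   sub(/GiB.*/, "", used)
            pct = 100 * used / total
            printf "data chunks %.0f%% full\n", pct
            if (pct > 90)
                print "consider: btrfs balance start -dusage=50 <mountpoint>"
        }'
}

# Canned example using the output shown above; on a real system you would
# pipe in: btrfs fi df /
check_data_usage <<'EOF'
Data, RAID1: total=117.00GiB, used=111.81GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=1.00GiB, used=516.00MiB
GlobalReserve, single: total=246.59MiB, used=0.00B
EOF
```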
Re: Strange behavior after "rm -rf //"
http://selinux.coker.com.au/play.html

There are a variety of ways of giving the same result that rm doesn't reject. "/*" wasn't caught last time I checked. See the above URL if you want to test out various rm operations as root. ;)

On 10 August 2016 9:24:23 AM AEST, Christian Kujau wrote:
>On Mon, 8 Aug 2016, Ivan Sizov wrote:
>> I'd ran "rm -rf //" by mistake two days ago. I'd stopped it after five
>
>Out of curiosity, what version of coreutils is this? The --preserve-root
>option is the default for quite some time now:
>
>> Don't include dirname.h, since system.h does it now.
>> (usage, main): --preserve-root is now the default.
>> 2006-09-03 02:53:58 +
>http://git.savannah.gnu.org/cgit/coreutils.git/commit/src/rm.c?id=89ffaa19909d31dffbcf12fb4498afb72666f6c9
>
>Even coreutils-6.10 from Debian/5 refuses to remove "/":
>
>$ sudo rm -rf /
>rm: cannot remove root directory `/'
>
>Christian.
--
Sent from my Nexus 6P with K-9 Mail.
BTRFS admin access control
I've just written a script for "mon" to monitor BTRFS filesystems. I had to use sudo because "btrfs device stats" needs to be run as root.

Would it be possible to do some of these things as non-root? I think it would be ideal if there was a "btrfs tunefs" operation, somewhat comparable to "tune2fs", that allowed us to specify which UID or GID could perform operations like "btrfs device stats".
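For reference, the parsing side of such a mon check is small. This is a rough sketch (the function name is mine, and the sudo call is shown commented out so the parser can be exercised on canned output in the format "btrfs device stats" prints):

```shell
# Count error counters with a non-zero value in "btrfs device stats" output.
count_btrfs_errors() {
    # On a real system the input would come from: sudo btrfs device stats /
    awk '$1 ~ /_errs$/ && $2 > 0 { n++ } END { print n + 0 }'
}

# Canned example: one device with two corruption errors recorded.
count_btrfs_errors <<'EOF'
[/dev/sdc2].write_io_errs    0
[/dev/sdc2].read_io_errs     0
[/dev/sdc2].flush_io_errs    0
[/dev/sdc2].corruption_errs  2
[/dev/sdc2].generation_errs  0
EOF
```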
Re: dear developers, can we have notdatacow + checksumming, plz?
On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
>
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has the checksums for those blocks, and the blocks which link to that metadata (EG directory entries referencing file metadata) have checksums of those. For each metadata block there is a new version that is eventually linked from a new version of the tree root. This means that the regular checksum mechanisms can't work with nocow data.

A filesystem can have checksums just pointing to data blocks, but then you need to cater for the case where a corrupt metadata block points to an old version of a data block and a matching checksum. The way that BTRFS works, with an entire checksummed tree, means that there's no possibility of pointing to an old version of a data block.

The NetApp published research into hard drive errors indicates that they usually occur in small numbers and in small areas of the disk. So if BTRFS had a nocow file with any storage method other than dup, you would have metadata and file data far enough apart that they are not likely to be hit by the same corruption (and the same thing would apply to most Ext4 inode tables and data blocks).

I think that a file mode where there were checksums on data blocks but no checksums in the metadata tree would be useful. But it would require a moderate amount of coding, and there are lots of other things that the developers are working on.
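For context, the nodatacow mode under discussion is normally enabled per-file or per-directory with chattr +C (files created inside a +C directory inherit the flag), and such files get no checksums. A minimal sketch, assuming the directory is on BTRFS (on other filesystems chattr simply reports an error):

```shell
# Create a directory with the NOCOW attribute; files created inside it
# will be nodatacow (and therefore unchecksummed) on BTRFS. The flag must
# be set while the file or directory is still empty.
d=$(mktemp -d)
chattr +C "$d" 2>/dev/null || echo "note: this filesystem does not support +C"
touch "$d/vm-disk.img"                    # inherits NOCOW on BTRFS
lsattr -d "$d/vm-disk.img" 2>/dev/null || true
```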
3.16.0 Debian kernel hang
One of my test laptops started hanging on mounting the root filesystem. I think that it had experienced an unexpected power outage prior to that, which may have caused corruption.

When I tried to mount the root filesystem, the mount process would stick in D state, there would be no disk IO, and the computer would get hot - presumably due to kernel CPU use, even though "top" didn't seem to indicate that.

When I mounted the filesystem with a 4.2.0 kernel it said "The free space cache file (1103101952) is invalid, skip it" and then things worked. Now that the machine is running 4.2.0 everything is fine.

I know that there are no plans to backport things to 3.16 and I don't think the Debian people are going to be very interested in this. So this message is a FYI for users: maybe consider not using the Debian/Jessie kernel for BTRFS systems.
Re: 3.16.0 Debian kernel hang
On Sat, 5 Dec 2015 12:53:07 AM Austin S Hemmelgarn wrote:
> > The only reason I'm not running Unstable kernels on my Debian systems is
> > because I run some Xen servers and upgrading Xen is problematic. Linode
> > is moving from Xen to KVM so I guess I should consider doing the
> > same. If I migrate my Xen servers to KVM I can use newer kernels with
> > less risk.
>
> That's interesting, that must be something with how they do kernel
> development in Debian, because I've never had any issues upgrading
> either Xen or Linux on any of the systems I've run Xen on, and I
> directly track mainline (with a small number of patches) for Linux, and
> stay relatively close to mainline with Xen (Gentoo doesn't have all that
> many patches on top of the regular release for Xen, aside from XSA
> patches).

I don't think that Debian does anything wrong in this regard. It's just that my experience of Xen is that it is fragile at the best of times. The fact that Red Hat packaged the Xen kernel in the Linux kernel package is a major indication of Xen problems IMHO; the concept of Xen is that it shouldn't be tied to a Linux kernel. If you haven't had Xen issues then I think you have been lucky.
Re: 3.16.0 Debian kernel hang
On Sat, 5 Dec 2015 12:08:58 AM Austin S Hemmelgarn wrote:
> > I know that there are no plans to backport things to 3.16 and I don't
> > think the Debian people are going to be very interested in this. So
> > this message is a FYI for users, maybe consider not using the
> > Debian/Jessie kernel for BTRFS systems.
>
> I'd suggest extending that suggestion to:
> If you're not using an Enterprise distro (RHEL, SLES, CentOS, OEL), then
> you should probably be building your own kernel, ideally using upstream
> sources.

There are lots of ways of dealing with this. Debian development doesn't stop: anyone who is running a Jessie system can easily run a kernel from Testing or Unstable (which really isn't particularly unstable). It's generally expected that Debian user-space will work with a kernel from plus or minus one release of Debian. Also, every time I've tried it, Debian has worked well with a CentOS kernel of a similar version.

The only reason I'm not running Unstable kernels on my Debian systems is because I run some Xen servers and upgrading Xen is problematic. Linode is moving from Xen to KVM, so I guess I should consider doing the same. If I migrate my Xen servers to KVM I can use newer kernels with less risk.
Re: LWN mention
On Mon, 9 Nov 2015 08:10:13 AM Duncan wrote:
> Russell Coker posted on Sun, 08 Nov 2015 17:38:32 +1100 as excerpted:
> > https://lwn.net/Articles/663474/
> > http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500
> >
> > Above is a BTRFS issue that is mentioned in a LWN comment. Has this one
> > been fixed yet?
>
> Good job linking to a subscription-only article, without using a link
> that makes it available for non-subscriber list readers to read.[1]

Well, it's only subscription-only for a week, so it's available for everyone now. While LWN has a feature for "subscribers" (which includes me, because HP sponsors LWN access for all DDs) to send free links to other people, I don't believe it would be appropriate to use that on a mailing list. If anyone had asked by private mail I'd have been happy to send them a personal link for that.
Re: Using Btrfs on single drives
On Sun, 15 Nov 2015 03:01:57 PM Duncan wrote:
> That looks to me like native drive limitations.
>
> Due to the fact that a modern hard drive spins at the same speed no
> matter where the read/write head is located, when it's reading/writing to
> the first part of the drive -- the outside -- much more linear drive
> distance will pass under the read/write heads in say a tenth of a second
> than will be the case as the last part of the drive is filled -- the
> inside -- and throughput will be much higher at the first of the drive.

http://www.coker.com.au/bonnie++/zcav/results.html

The above page has the results of my ZCAV benchmark (part of the Bonnie++ suite), which shows this. You can safely run ZCAV in read mode on a device that's got a filesystem on it, so it's not too late to test these things.
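What ZCAV measures can be crudely approximated with dd: read a fixed-size zone at increasing offsets and time each one. A sketch (the function name is mine; it is demonstrated on a scratch file so it is safe to run, though a real measurement would read from the whole device):

```shell
# Read `zones` zones of `zone_mb` MB each from the start of `dev`,
# printing rough throughput per zone. Read-only with respect to `dev`.
zone_read() {
    dev=$1; zone_mb=$2; zones=$3
    i=0
    while [ "$i" -lt "$zones" ]; do
        start=$(date +%s%N)
        dd if="$dev" of=/dev/null bs=1M count="$zone_mb" skip=$((i * zone_mb)) 2>/dev/null
        end=$(date +%s%N)
        ms=$(( (end - start) / 1000000 ))
        [ "$ms" -lt 1 ] && ms=1          # avoid divide-by-zero on cached reads
        echo "zone $i: $(( zone_mb * 1000 / ms )) MB/s"
        i=$((i + 1))
    done
}

# Safe demo on a 32MB scratch file; on real hardware something like
# "zone_read /dev/sdX 1024 100" would show outer vs inner zone speeds.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=32 2>/dev/null
zone_read "$f" 8 4
rm -f "$f"
```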
Re: How to replicate a Xen VM using BTRFS as the root filesystem.
On Wed, 28 Oct 2015 11:07:20 PM Austin S Hemmelgarn wrote:
> Using this methodology, I can have a new Gentoo PV domain running in
> about half an hour, whereas it takes me at least two and a half hours
> (and often much longer than that) when using the regular install process
> for Gentoo.

On my virtual servers I have a BTRFS subvol /xenstore for the block devices of virtual machines. When I want to duplicate a VM I run "cp -a --reflink=always /xenstore/A /xenstore/B", which takes a few seconds.
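For anyone trying this, a sketch of the clone step follows. The paths are examples; --reflink=always requires source and destination on the same BTRFS filesystem, so this runnable demo uses --reflink=auto, which falls back to a normal copy elsewhere:

```shell
# Make a small stand-in for a VM image directory.
store=$(mktemp -d)
mkdir "$store/A"
dd if=/dev/zero of="$store/A/disk.img" bs=1M count=4 2>/dev/null

# On BTRFS: cp -a --reflink=always "$store/A" "$store/B" shares all
# extents and finishes in seconds regardless of image size.
cp -a --reflink=auto "$store/A" "$store/B"
cmp -s "$store/A/disk.img" "$store/B/disk.img" && echo "clone matches"
```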
Re: Expected behavior of bad sectors on one drive in a RAID1
On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote:
> > https://www.gnu.org/software/ddrescue/
> >
> > At this stage I would use ddrescue or something similar to copy data from
> > the failing disk to a fresh disk, then do a BTRFS scrub to regenerate
> > the missing data.
> >
> > I wouldn't remove the disk entirely because then you lose badly if you
> > get another failure. I wouldn't use a BTRFS replace because you already
> > have the system apart and I expect ddrescue could copy the data faster.
> > Also as the drive has been causing system failures (I'm guessing a
> > problem with the power connector) you REALLY don't want BTRFS to corrupt
> > data on the other disks. If you have a system with the failing disk and
> > a new disk attached then there's no risk of further contamination.
>
> BIG DISCLAIMER: For the filesystem to be safely mountable it is
> ABSOLUTELY NECESSARY to remove the old disk after doing a block level

You are correct, my message wasn't clear. What I meant to say is that doing a "btrfs device remove" or "btrfs replace" is generally a bad idea in such a situation. "btrfs replace" is pretty good if you are replacing a disk with a larger one, or replacing a disk that has only minor errors (a disk that just gets a few bad sectors is unlikely to get many more in a hurry).

> copy of it. By all means, keep the disk around, but do not keep it
> visible to the kernel after doing a block level copy of it. Also, you
> will probably have to run 'btrfs device scan' after copying the disk and
> removing it for the filesystem to work right. This is an inherent
> result of how BTRFS's multi-device functionality works, and also applies
> to doing stuff like LVM snapshots of BTRFS filesystems.

Good advice. I recommend just rebooting the system. I think that anyone who has the background knowledge to do such things without rebooting will probably just do it without needing to ask us for advice.
> >> Question 2 - Before having run the scrub, booting off the raid with
> >> bad sectors, would btrfs "on the fly" recognize it was getting bad
> >> sector data with the checksum being off, and checking the other
> >> drives? Or, is it expected that I could get a bad sector read in a
> >> critical piece of operating system and/or kernel, which could be
> >> causing my lockup issues?
> >
> > Unless you have disabled CoW then BTRFS will not return bad data.
>
> It is worth clarifying also that:
> a. While BTRFS will not return bad data in this case, it also won't
> automatically repair the corruption.

Really? If so, I think that's a bug in BTRFS. When mounted rw, I think that every time corruption is discovered it should be automatically fixed.

> b. In the unlikely event that both copies are bad, trying to read the
> data will return an IO error.
> c. It is theoretically possible (although statistically impossible) that
> the block could become corrupted, but the checksum could still be
> correct (CRC32c is good at detecting small errors, but it's not hard to
> generate a hash collision for any arbitrary value, so if a large portion
> of the block goes bad, then it can theoretically still have a valid
> checksum).

It would be interesting to see some research into how CRC32 fits with the more common disk errors. For a disk to return bad data and claim it to be good, the data must either be a misplaced write or read (which is almost certain to be caught by BTRFS, as the metadata won't match) or a random sector that matches the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC-protected block much more difficult?

> >> Question 3 - Probably doesn't matter, but how can I see which files
> >> (or metadata to files) the 40 current bad sectors are in? (On extX,
> >> I'd use tune2fs and debugfs to be able to see this information.)
> >
> > Read all the files in the system and syslog will report it.
> > But really don't do that until after you have copied the disk.
>
> It may also be possible to use some of the debug tools from BTRFS to do
> this without hitting the disks so hard, but it will likely take a lot
> more effort.

I don't think that you can do that without hitting the disks hard. That said, last time I checked (the last time an executive of a hard drive manufacturer was willing to talk to me), drives were apparently designed to perform any sequence of operations for their warranty period. So for a disk that is believed to be good this shouldn't be a problem. For a disk that is known to be dying, it would be a really bad idea to do anything other than copy the data off at maximum speed.
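As a footnote to the CRC32c point earlier in this thread: the asymmetry is easy to demonstrate with plain cksum (CRC32). Any single-byte change alters the checksum, while a colliding block has to be deliberately solved for; random corruption essentially never preserves it. A small sketch:

```shell
# Build a 4K block of zeros and record its CRC32.
blk=$(mktemp)
dd if=/dev/zero of="$blk" bs=4096 count=1 2>/dev/null
before=$(cksum "$blk" | awk '{print $1}')

# Corrupt a single byte: write 0xFF at offset 1000 without truncating.
printf '\377' | dd of="$blk" bs=1 seek=1000 conv=notrunc 2>/dev/null
after=$(cksum "$blk" | awk '{print $1}')

[ "$before" != "$after" ] && echo "corruption detected"
rm -f "$blk"
```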
Re: Expected behavior of bad sectors on one drive in a RAID1
On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote:
> sda appears to be going bad, with my low threshold of "going bad", and
> will be replaced ASAP. It just developed 16 reallocated sectors, and
> has 40 current pending sectors.
>
> I'm currently running a "btrfs scrub start -B -d -r /terra", which
> status on another term shows me has found 32 errors after running for
> an hour.

https://www.gnu.org/software/ddrescue/

At this stage I would use ddrescue or something similar to copy data from the failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing data.

I wouldn't remove the disk entirely, because then you lose badly if you get another failure. I wouldn't use a BTRFS replace, because you already have the system apart and I expect ddrescue could copy the data faster. Also, as the drive has been causing system failures (I'm guessing a problem with the power connector), you REALLY don't want BTRFS to corrupt data on the other disks. If you have a system with the failing disk and a new disk attached then there's no risk of further contamination.

> Question 2 - Before having run the scrub, booting off the raid with
> bad sectors, would btrfs "on the fly" recognize it was getting bad
> sector data with the checksum being off, and checking the other
> drives? Or, is it expected that I could get a bad sector read in a
> critical piece of operating system and/or kernel, which could be
> causing my lockup issues?

Unless you have disabled CoW, BTRFS will not return bad data.

> Question 3 - Probably doesn't matter, but how can I see which files
> (or metadata to files) the 40 current bad sectors are in? (On extX,
> I'd use tune2fs and debugfs to be able to see this information.)

Read all the files in the system and syslog will report it. But really don't do that until after you have copied the disk.
> I do have hourly snapshots, from when it was properly running, so once
> I'm that far in the process, I can also compare the most recent
> snapshots, and see if there's any changes that happened to files that
> shouldn't have.

Snapshots refer to the same data blocks, so if a data block is corrupted in a way that BTRFS doesn't notice (which should be almost impossible) then all snapshots will have it.
Re: BTRFS as image store for KVM?
On Fri, 2 Oct 2015 10:07:24 PM Austin S Hemmelgarn wrote:
> > ARC presumably worked better than the other Solaris caching options. It
> > was ported to Linux with zfsonlinux because that was the easy way of
> > doing it.
>
> Actually, I think part of that was also the fact that ZFS is a COW
> filesystem, and classical LRU caching (like the regular Linux pagecache)
> often does horribly with COW workloads (and I'm relatively convinced
> that this is a significant part of why BTRFS has such horrible
> performance compared to ZFS).

Last time I checked, a BTRFS RAID-1 filesystem would assign each process to read from one disk based on its PID. Every RAID-1 implementation that has any sort of performance optimisation will allow a single process that's reading to use both disks to some extent. When the BTRFS developers spend some serious effort optimising for performance it will be useful to compare BTRFS and ZFS.
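The PID-based assignment described above amounts to something like the following (a simplification for illustration, not the actual kernel code):

```shell
# Each reading process is pinned to one mirror by pid modulo mirror count,
# so a single sequential reader only ever uses one disk.
pick_mirror() {
    pid=$1; mirrors=$2
    echo $(( pid % mirrors ))
}

pick_mirror 1234 2    # one process reads from mirror 0...
pick_mirror 1235 2    # ...a neighbouring PID reads from mirror 1
```

Two processes with consecutive PIDs happen to balance across the mirrors, but neither ever gets the combined throughput of both disks, which is the limitation the paragraph above is pointing at.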
qgroup problem
(sysadm_t:SystemLow-SystemHigh)root@unstable:~/pol# btrfs qgroup show -r -e /tmp
qgroupid   rfer        excl        max_rfer   max_excl
0/5        1647689728  1647689728  0          0
0/258      16384       16384       524288000  0
0/259      16384       16384       536870912  0
0/261      0           0           0          0
0/262      0           0           0          0
(sysadm_t:SystemLow-SystemHigh)root@unstable:~/pol# btrfs subvol list / | grep tmp
ID 259 gen 161316 top level 5 path tmp
(sysadm_t:SystemLow-SystemHigh)root@unstable:~/pol# uname -a
Linux unstable 4.2.0-1-amd64 #1 SMP Debian 4.2.1-2 (2015-09-27) x86_64 GNU/Linux

The above is from a SE Linux test machine running BTRFS with quotas on kernel 4.2.1.

# scp selinux-policy-default_2.20140421-11.1_all.deb bofh@unstable:/tmp
scp: /tmp/selinux-policy-default_2.20140421-11.1_all.deb: Disk quota exceeded

The above is the result of trying to scp a 2MB file to /tmp. Is there anything I'm doing wrong, or are quotas broken?
Re: strange i/o errors with btrfs on raid/lvm
On Sat, 26 Sep 2015 06:47:26 AM Chris Murphy wrote:
> And then
>
> Aug 28 17:06:49 host mdadm[2751]: RebuildFinished event detected on md
> device /dev/md/0, component device mismatches found: 2048 (on raid
> level 10)
> Aug 28 17:06:49 host mdadm[2751]: SpareActive event detected on md
> device /dev/md/0, component device /dev/sdd1

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919#41

For Linux Software RAID-1 it's expected that you get a multiple of 128 mismatches found every time you do a scan. I don't know why that is, but systems running like that don't appear to have any problems. This even appears to happen on arrays that aren't in any noticeable use (EG swap on a server that has heaps of RAM so swap doesn't get used). The above URL describes part of the problem, but if that was the entire explanation then I'd expect multiples of 8, not 128. Also the explanation that Neil offers doesn't seem to cover the case of 10,000+ mismatches occurring in the weekly scrubs. Linux Software RAID-10 works the same way in this regard.

So I'm not sure that the log entries in question mean anything other than a known bug in Linux Software RAID. I have servers that have been running well for years while repeatedly reporting such errors on Linux Software RAID-1 and not getting any FSCK problems. Note that the servers in question don't run BTRFS because years ago it really wasn't suitable for server use.

It is possible that there are errors that aren't detected by e2fsck, don't result in corruption of gzip compressed data, don't give MySQL data consistency errors, and don't corrupt any user data in a noticeable way. But it seems to be evidence in support of the theory that such mismatches don't matter.
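One hedged explanation for the multiple-of-128 counts: md reports mismatch_cnt in 512-byte sectors, so if comparisons are done in 64KiB windows, every mismatching window adds 128 sectors to the count. This is an assumption about the accounting, not a statement from the md source:

```shell
#!/bin/sh
# If md compares data in 64KiB windows and reports mismatches in
# 512-byte sectors, a single bad window shows up as 128 sectors
# (assumed accounting, consistent with the multiples observed above).
window_bytes=$(( 64 * 1024 ))
sector_bytes=512
echo "sectors counted per mismatching window: $(( window_bytes / sector_bytes ))"
```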
Re: BTRFS as image store for KVM?
On Sat, 26 Sep 2015 12:20:41 AM Austin S Hemmelgarn wrote:
> > FYI:
> > Linux pagecache use LRU cache algo, and in general case it's working
> > good enough
>
> I'd argue that 'general usage' should be better defined in this
> statement. Obviously, ZFS's ARC implementation provides better
> performance in a significant number of common use cases for Linux,
> otherwise people wouldn't be using it to the degree they are.

No-one gets a free choice about this. I have a number of servers running ZFS because I needed the data consistency features and BTRFS wasn't ready. There is no choice of LRU vs ARC once you've made the BTRFS vs ZFS decision.

ARC presumably worked better than the other Solaris caching options. It was ported to Linux with zfsonlinux because that was the easy way of doing it.

Some people here have reported that ARC worked well for them on Linux. My experience was that the zfsonlinux kernel modules wouldn't respect the module load options to reduce the size of the ARC, and the default size would cause smaller servers to have kernel panics due to lack of RAM. My solution to that problem was to get more RAM for all ZFS servers, as buying RAM is cheaper for my clients than paying me to diagnose the problems with ZFS.
Re: BTRFS as image store for KVM?
On Sat, 19 Sep 2015 12:13:29 AM Austin S Hemmelgarn wrote:
> The other option (which for some reason I almost never see anyone
> suggest), is to expose 2 disks to the guest (ideally stored on different
> filesystems), and do BTRFS raid1 on top of that. In general, this is
> what I do (except I use LVM for the storage back-end instead of a
> filesystem) when I have data integrity requirements in the guest. On
> the other hand of course, most of my VM's are trivial for me to
> recreate, so I don't often need this and just use DM-RAID via LVM.

I used to do that. But it was very fiddly, and snapshotting the virtual machine images required making a snapshot of half a RAID-1 array via LVM (or snapshotting both halves when the virtual machine wasn't running). Now I just have a single big BTRFS RAID-1 filesystem and use regular files for the virtual machine images, with Ext3 as the filesystem inside the images.

On Sun, 20 Sep 2015 11:26:26 AM Jim Salter wrote:
> Performance will be fantastic... except when it's completely abysmal.
> When I tried it, I also ended up with a completely borked (btrfs-raid1)
> filesystem that would only mount read-only and read at hideously reduced
> speeds after about a year of usage in a small office environment. Did
> not make me happy.

I've found performance to be acceptable, not great (you can't expect great performance from such things) but good enough for lightly loaded servers and test systems. I even ran a training session on BTRFS and ZFS filesystems with the images stored on a BTRFS RAID-1 (of 15,000rpm SAS disks). When more than 3 students ran a scrub at the same time performance dropped, but it was mostly usable and there were no complaints. Admittedly that server hit a BTRFS bug and needed "reboot -nf" half way through, but I don't think that was a BTRFS virtual machine issue, rather a more general BTRFS under load issue.
Re: BTRFS as image store for KVM?
On Fri, 18 Sep 2015 12:00:15 PM Duncan wrote:
> The caveat here is that if the VM/DB is active during the backups (btrfs
> send/receive or other), it'll still COW1 any writes during the existence
> of the btrfs snapshot. If the backup can be scheduled during VM/DB
> downtime or at least when activity is very low, the relatively short COW1
> time should avoid serious fragmentation, but if not, even only relatively
> temporary snapshots are likely to trigger noticeable cow1 fragmentation
> issues eventually.

One relevant issue for this is whether the working set of the database fits into RAM. RAM has been getting bigger and cheaper while the databases I run haven't been getting bigger. Now every database I run has a working set that fits into RAM, so read performance (and therefore fragmentation) doesn't matter for me except when rebooting - and database servers don't get rebooted that often.
Re: Understanding BTRFS storage
On Fri, 28 Aug 2015 07:35:02 PM Hugo Mills wrote:
> On Fri, Aug 28, 2015 at 10:50:12AM +0200, George Duffield wrote:
> > Running a traditional raid5 array of that size is statistically
> > guaranteed to fail in the event of a rebuild.
>
> Except that if it were, you wouldn't see anyone running RAID-5
> arrays of that size and (considerably) larger. And successfully
> replacing devices in them.

Let's not assume that everyone who thinks that they are "successfully" running a RAID-5 array is actually doing so. One of the features of BTRFS is that you won't get undetected data corruption.
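The "statistically guaranteed to fail" claim can be put in rough numbers. Assuming the commonly quoted consumer-disk unrecoverable read error (URE) rate of 1 per 1e14 bits and 12TB read during a rebuild (both assumptions, not figures from this thread), the chance of hitting at least one URE is:

```shell
#!/bin/sh
# P(at least one URE) = 1 - (1 - p)^bits, with p = 1e-14 per bit and
# 12TB read during the rebuild (both assumed values).
awk 'BEGIN {
    tb = 12
    bits = tb * 8e12                 # TB -> bits
    p = 1e-14                        # assumed URE rate per bit
    printf "P(>=1 URE over %dTB read) = %.2f\n", tb, 1 - exp(bits * log(1 - p))
}'
```

High, but not "guaranteed"; and note that with checksumming filesystems a URE found during a rebuild costs one block, not the whole array.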
Re: Questions on use of NOCOW impact to subvolumes and snapshots
On Thu, 20 Aug 2015 10:09:26 PM Austin S Hemmelgarn wrote:
> > 2: Out of curiosity, why is data checksumming tied to COW?
>
> There's no safe way to sanely handle checksumming without COW, because
> there is no way (at least on current hardware) to ensure that the data
> block and the checksums both get written at the exact same time, and
> that one of the writes aborting will cause the other to do so as well.
> In-place compression is disabled for nodatasum files for essentially
> the same reason (although compression can cause much worse data loss
> than a failed checksum).

A journaling filesystem could have checksums on data blocks; if Ext4 were modified to have checksums that's how it would do it. But given that a major feature of BTRFS is snapshots it doesn't make much sense to implement a separate way of managing checksums. I think that ZIL is the correct way of solving these problems.
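The non-atomicity argument can be illustrated with a toy sketch: if data and checksum are updated in place as two separate writes, a crash between them leaves a stored checksum that no longer matches the data, indistinguishable from real corruption (the file layout here is hypothetical):

```shell
#!/bin/sh
# Toy illustration of in-place (non-COW) checksumming: the data block
# is rewritten, then a simulated crash prevents the checksum update.
tmp=$(mktemp -d)
printf 'old data' > "$tmp/block"
sha256sum "$tmp/block" | cut -d' ' -f1 > "$tmp/csum"   # checksum of old data
printf 'new data' > "$tmp/block"                       # "crash" before csum update
stored=$(cat "$tmp/csum")
actual=$(sha256sum "$tmp/block" | cut -d' ' -f1)
if [ "$stored" = "$actual" ]; then
    echo "checksum ok"
else
    echo "checksum mismatch (torn update)"
fi
rm -r "$tmp"
```

With COW the new data and its checksum land in new locations and become visible atomically via the tree update, so this window doesn't exist.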
Re: Questions on use of NOCOW impact to subvolumes and snapshots
On Thu, 20 Aug 2015 11:55:43 AM Chris Murphy wrote:
> > Question 1: If I apply the NOCOW attribute to a file or directory, how
> > does that affect my ability to run btrfs scrub?
>
> nodatacow includes nodatasum and no compression. So it means these
> files are presently immune from scrub check and repair so long as it's
> based on checksums. I don't know if raid56 scrub compares to parity and
> recomputes parity (assumes data is correct), absent checksums, which
> would be similar to how md raid 56 does it.

Linux Software RAID could recreate a mismatched block from RAID-6 parity but doesn't do so. It could be that a block was changed correctly but the parity data didn't get written, in which case such a correction would be reverting a change. So Linux Software RAID only regenerates parity for a scrub, and makes both disks have the same data for RAID-1. There's no good solution to these problems without doing the sorts of things that WAFL, ZFS, and BTRFS do.
lockup
I have a Xen server with 14 DomUs that are being used for BTRFS and ZFS training. About 5 people are corrupting virtual disks and scrubbing them, lots of IO. All the virtual machine disk images are snapshots of a master image with copy on write. I just had the following error which ended with a NMI. I copied what I could. It's running the latest Debian/Jessie kernel 3.16.7. [15780.056002] Code: 44 24 10 e9 1c ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 55 48 89 fd 53 4c 8b 67 50 66 66 66 66 90 f0 ff 4d 4c 74 35 5b 5d 41 5c c3 48 8b 1d a9 07 07 00 48 85 db 74 1c 48 8b [15808.056003] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-i38:22730] [15808.056003] Modules linked in: xt_tcpudp xt_physdev iptable_filter ip_tables x_tables xen_netback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc ext4 crc16 mbcache jbd2 ppdev psmouse serio_raw pcspkr k8temp joydev evdev ipmi_si ns558 gameport parport_pc parport ipmi_msghandler snd_mpu401_uart snd_rawmidi snd_seq_device snd processor button soundcore edac_mce_amd edac_core i2c_nforce2 i2c_core shpchp thermal_sys loop autofs4 crc32c_generic btrfs xor raid6_pq raid1 md_mod sd_mod crc_t10dif crct10dif_generic crct10dif_common hid_generic usbhid hid sg sr_mod cdrom ata_generic ohci_pci mptsas scsi_transport_sas mptscsih mptbase e1000 pata_amd ehci_pci ohci_hcd ehci_hcd libata forcedeth scsi_mod usbcore usb_common [15808.056003] CPU: 1 PID: 22730 Comm: qemu-system-i38 Not tainted 3.16.0-4- amd64 #1 Debian 3.16.7-ckt11-1+deb8u3 [15808.056003] Hardware name: Sun Microsystems Sun Fire X4100 M2/Sun Fire X4100 M2, BIOS 0ABJX102 11/03/2008 [15808.056003] task: 8812e010 ti: 880001e9c000 task.ti: 880001e9c000 [15808.056003] RIP: e030:[a024edb9] [a024edb9] btrfs_put_ordered_extent+0x19/0xc0 [btrfs] [15808.056003] RSP: e02b:880001e9fe08 EFLAGS: 0202 [15808.056003] RAX: 0583 RBX: 88000a4f0580 RCX: 06a4 [15808.056003] RDX: 88000a4f0580 RSI: 88000a4f0508 
RDI: 88000a4f0508 [15808.056003] RBP: 88000a4f0508 R08: 88000a4f0560 R09: 8800502f29b0 [15808.056003] R10: 7ff0 R11: 0005 R12: 880053821950 [15808.056003] R13: 88000a4f0508 R14: 880004f7cf00 R15: 880001e9fe50 [15808.056003] FS: 7fdc312f5700() GS:88007744() knlGS: [15808.056003] CS: e033 DS: ES: CR0: 8005003b [15808.056003] CR2: 7f4af0c74000 CR3: 2e534000 CR4: 0660 [15808.056003] Stack: [15808.056003] 88000a4f0580 880052d76800 880002503800 a02342f4 [15808.056003] 880004f7cfa8 880002503000 a02881e2 [15808.056003] 8800 880052d76800 88000b7f7b18 [15808.056003] Call Trace: [15808.056003] [a02342f4] ? btrfs_wait_pending_ordered+0xc4/0x100 [btrfs] [15808.056003] [a02881e2] ? __btrfs_run_delayed_items+0xf2/0x1d0 [btrfs] [15808.056003] [a0236356] ? btrfs_commit_transaction+0x2d6/0xa10 [btrfs] [15808.056003] [810a7a40] ? prepare_to_wait_event+0xf0/0xf0 [15808.056003] [a0246529] ? btrfs_sync_file+0x1c9/0x2f0 [btrfs] [15808.056003] [811d53cb] ? do_fsync+0x4b/0x70 [15808.056003] [811d564f] ? SyS_fdatasync+0xf/0x20 [15808.056003] [8151158d] ? 
system_call_fast_compare_end+0x10/0x15 [15808.056003] Code: 44 24 10 e9 1c ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 55 48 89 fd 53 4c 8b 67 50 66 66 66 66 90 f0 ff 4d 4c 74 35 5b 5d 41 5c c3 48 8b 1d a9 07 07 00 48 85 db 74 1c 48 8b [15818.440002] INFO: rcu_sched self-detected stall on CPU { 1} (t=68266 jiffies g=236497 c=236496 q=6784) [15818.440002] sending NMI to all CPUs: [15818.440002] NMI backtrace for cpu 1 [15818.440002] CPU: 1 PID: 22730 Comm: qemu-system-i38 Not tainted 3.16.0-4- amd64 #1 Debian 3.16.7-ckt11-1+deb8u3 [15818.440002] Hardware name: Sun Microsystems Sun Fire X4100 M2/Sun Fire X4100 M2, BIOS 0ABJX102 11/03/2008 [15818.440002] task: 8812e010 ti: 880001e9c000 task.ti: 880001e9c000 [15818.440002] RIP: e030:[8100130a] [8100130a] xen_hypercall_vcpu_op+0xa/0x20 [15818.440002] RSP: e02b:880077443cc8 EFLAGS: 0046 [15818.440002] RAX: RBX: 0001 RCX: 8100130a [15818.440002] RDX: RSI: 0001 RDI: 000b [15818.440002] RBP: 818e2900 R08: 818e23e0 R09: 880bcc40 [15818.440002] R10: 0855 R11: 0246 R12: 818e23e0 [15818.440002] R13: 0005 R14: 1a80 R15: 81853680 [15818.440002] FS: 7fdc312f5700() GS:88007744() knlGS: [15818.440002] CS: e033 DS: ES: CR0: 8005003b [15818.440002]
can we make balance delete missing devices?
[ 2918.502237] BTRFS info (device loop1): disk space caching is enabled
[ 2918.503213] BTRFS: failed to read chunk tree on loop1
[ 2918.540082] BTRFS: open_ctree failed

I just had a test RAID-1 filesystem with a missing device. I mounted it with the degraded option and added a new device. I balanced it (to make it do RAID-1 again) and thought everything was good. Then when I tried to mount it again it gave errors such as the above (not sure why). Then I tried wiping /dev/loop1 and it refused to mount entirely due to having 2 missing devices.

Obviously it was my mistake to not remove the missing device, and wiping /dev/loop1 was a bad idea. Failing to remove a missing device seems likely to be a common mistake. Could we make the balance operation automatically delete the missing device? I can't imagine a situation in which a balance would be desired but deleting the missing device wouldn't be desired.
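For reference, a sketch of the replacement procedure that avoids this trap (device names and mount point are hypothetical; this is an admin recipe requiring root and a real degraded array, not something to run blindly):

```shell
# Mount degraded, add the replacement, and remove the missing device
# BEFORE balancing (hypothetical devices).
mount -o degraded /dev/sdb1 /mnt
btrfs device add /dev/sdc1 /mnt
btrfs device delete missing /mnt   # drop the dead device from the array
btrfs balance start /mnt           # restore RAID-1 across the remaining devices
```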
lack of scrub error data
Below is the result of testing a corrupted filesystem. What's going on here? The kernel message log and the btrfs output don't tell me how many errors there were. Also the data is RAID-0 (the default for a filesystem created with 2 devices), so if the corruption was in a data area it should have lost data, but no data was lost, so apparently not. Debian kernel 4.0.8-2 and btrfs-tools 4.0-2.

# cp -r /lib /mnt/tmp
# btrfs scrub start -B /mnt/tmp
scrub done for c0992a26-3c7f-47b0-b8de-878dfe197469
        scrub started at Fri Aug 14 10:17:53 2015 and finished after 0 seconds
        total bytes scrubbed: 372.82MiB with 0 errors
# dd if=/dev/zero of=/dev/loop0 bs=1024k count=20 seek=100
20+0 records in
20+0 records out
20971520 bytes (21 MB) copied, 0.0267078 s, 785 MB/s
# btrfs scrub start -B /mnt/tmp
scrub done for c0992a26-3c7f-47b0-b8de-878dfe197469
        scrub started at Fri Aug 14 10:19:15 2015 and finished after 0 seconds
        total bytes scrubbed: 415.04MiB with 0 errors
WARNING: errors detected during scrubbing, corrected.
# btrfs scrub start -B /mnt/tmp
scrub done for c0992a26-3c7f-47b0-b8de-878dfe197469
        scrub started at Fri Aug 14 10:20:25 2015 and finished after 0 seconds
        total bytes scrubbed: 415.04MiB with 0 errors
# btrfs fi df /mnt/tmp
Data, RAID0: total=800.00MiB, used=400.23MiB
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=400.00MiB, used=7.39MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
# btrfs device stats /mnt/tmp
[/dev/loop0].write_io_errs   0
[/dev/loop0].read_io_errs    0
[/dev/loop0].flush_io_errs   0
[/dev/loop0].corruption_errs 0
[/dev/loop0].generation_errs 0
[/dev/loop1].write_io_errs   0
[/dev/loop1].read_io_errs    0
[/dev/loop1].flush_io_errs   0
[/dev/loop1].corruption_errs 0
[/dev/loop1].generation_errs 0
# diff -ru /lib /mnt/tmp/lib
#
Re: btrfs raid1 metadata, single data
On Fri, 7 Aug 2015 06:49:58 PM Robert Krig wrote:
> What exactly is contained in btrfs metadata?

Much the same as in the metadata of every other filesystem.

> I've read about some users setting up their btrfs volumes as
> data=single, but metadata=raid1
> Is there any actual benefit to that? I mean, if you keep your data as
> single, but have multiple copies of metadata, does that still allow you
> to recover from data corruption? Or is metadata redundancy a benefit to
> ensure that your btrfs volume remains mountable/readable?

If you have redundant metadata and experience corruption then you will know the name of every file that has data corruption, which is really good for restoring from backup. Also you will be protected against corruption of a root directory causing massive data loss.

If you have the bad luck to have certain metadata structures corrupted with no redundancy then you can face massive data loss and possibly have the entire filesystem become at least temporarily unusable. While corruption of the root directory is unlikely, it is possible. With dup metadata I've seen a BTRFS filesystem remain usable after 12,000+ read errors.
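For reference, the layout discussed above is chosen at mkfs time, roughly like this (device names are hypothetical; requires btrfs-progs and root):

```shell
# Single-copy data, mirrored metadata across two devices
# (hypothetical devices - do not run against disks in use).
mkfs.btrfs -d single -m raid1 /dev/sdX /dev/sdY
# On a single device, duplicated metadata is the analogous option:
mkfs.btrfs -m dup /dev/sdX
```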
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On Sat, 1 Aug 2015 02:35:39 PM John Ettedgui wrote:
> > It seems that you're using Chromium while doing the dump. :)
> > If no CD drive, I'll recommend to use Archlinux installation iso to
> > make a bootable USB stick and do the dump. (just download and dd would
> > do the trick) As its kernel and tools is much newer than most
> > distribution.
>
> So I did not have any usb sticks large enough for this task (only 4Gb)
> so I restarted into emergency runlevel with only / mounted and as ro, I
> hope that'll do.

The Debian/Jessie Netinst image is about 120M and allows you to launch a shell. If you want a newer kernel you could rebuild the Debian Netinst yourself. Also a basic text-only Linux installation takes a lot less than 4G of storage. I have a couple of 1G USB sticks with Debian installed that I use to fix things.
Re: INFO: task btrfs-transacti:204 blocked for more than 120 seconds. (more like 8+min)
On Fri, 24 Jul 2015 11:11:22 AM Duncan wrote:
> The option is mem=nn[KMG]. You may also need memmap=, presumably
> memmap=nn[KMG]$ss[KMG], to reserve the unused memory area, preventing
> its use for PCI address space, since that would collide with the
> physical memory that's there but unused due to mem=. That should let
> you test with mem=2G, so double-memory becomes 4G. =:^)

That's a good thing to note.

> Meanwhile, does bonnie do pre-allocation for its tests?

http://www.coker.com.au/bonnie++/readme.html

No. The above URL gives details on the tests.

> > What I did see from years ago seemed to be that you'd have to disable
> > COW where you knew there would be large files. I'm really hoping
> > there's a way to avoid this type of locking, because I don't think I'd
> > be comfortable knowing a non-root user could bomb the system with a
> > large file in the wrong area.
>
> The problem with cow isn't large files in general, it's rewrites into
> the middle of them (as opposed to append-writes). If the writes are
> sequential appends, or if it's write-one-read-many, cow on large files
> doesn't tend to be an issue.

The Bonnie++ rewrite test might be a pathological case for BTRFS. But it's a test that other filesystems have handled for decades.
Re: INFO: task btrfs-transacti:204 blocked for more than 120 seconds. (more like 8+min)
On Fri, 24 Jul 2015 05:12:38 AM james harvey wrote:
> I started trying to run with a -s 4G option, to use 4GB files for
> performance measuring. It refused to run, and said file size should be
> double RAM for good results. I sighed, removed the option, and let it
> run, defaulting to **64GB files**. So, yeah, big files. But, I do
> work with Photoshop .PSB files that get that large.

You can use the -r0 option to stop it insisting on twice the RAM size. However if you have files that are less than twice the RAM then the test results will be unrealistic, as read requests will be mostly satisfied from cache.

> During the first two lines (Writing intelligently... and Rewriting...)
> the filesystem seems to be completely locked out for anything other
> than bonnie++. KDE stops being able to switch focus, change tasks.
> Can switch to tty's and log in, do things like ls, but attempting to
> write to the filesystem hangs. Can switch back to KDE, but screen is
> black with cursor until bonnie++ completes. top didn't show excessive
> CPU usage.

That sort of problem isn't unique to BTRFS. BTRFS has had little performance optimisation so it might be worse than other filesystems in that regard. But on any filesystem you can expect situations where one process that is doing non-stop writes fills up buffers and starves other processes. Note that when a single disk access takes 8000ms+ (more than 8 seconds) then high level operations involving multiple files will take much longer.

> I think the Writing intelligently phase is sequential, and the old
> references I saw were regarding many re-writes sporadically in the
> middle.

The "Writing intelligently" phase is sequential writes; the "Rewriting" phase reads and writes back sequentially.

> What I did see from years ago seemed to be that you'd have to disable
> COW where you knew there would be large files. I'm really hoping
> there's a way to avoid this type of locking, because I don't think I'd
> be comfortable knowing a non-root user could bomb the system with a
> large file in the wrong area.
Disabling CoW won't solve all issues related to sharing disk IO capacity between users. Also disabling CoW will remove all BTRFS benefits apart from subvols, and subvols aren't that useful when snapshots aren't an option.

> IF I do HAVE to disable COW, I know I can do it selectively. But, if I
> did it everywhere... Which in that situation I would, because I can't
> afford to run into many minute long lockups on a mistake... I lose
> compression, right? Do I lose snapshots? (Assume so, but hope I'm
> wrong.) What else do I lose? Is there any advantage running btrfs
> without COW anywhere over other filesystems?

I believe that when you disable CoW and make a snapshot there will be one CoW stage for each block until it's copied somewhere else.

> How would one even know where the division is between a file small
> enough to allow on btrfs, vs one not to?

http://doc.coker.com.au/projects/memlockd/

If a hostile user wrote a program that used fsync() they could reproduce such problems with much smaller files. My memlockd program (above URL) alleviates such problems by locking the pages of important programs and libraries into RAM.
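Selectively disabling COW, as discussed above, is usually done per directory so that files created afterwards inherit the attribute (the path is hypothetical, and the attribute only has its COW meaning on BTRFS):

```shell
# New files created in this directory will be NOCOW - and therefore
# also without data checksums or compression. The attribute only
# affects files created after it is set.
mkdir -p /srv/bigfiles          # hypothetical path
chattr +C /srv/bigfiles
lsattr -d /srv/bigfiles         # should show the 'C' flag on BTRFS
```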
Re: RAID1: system stability
On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote:
> OK I actually don't know what the intended block layer behavior is when
> unplugging a device, if it is supposed to vanish, or change state
> somehow so that things that depend on it can know it's missing or what.
> So the question here is, is this working as intended? If the layer
> Btrfs depends on isn't working as intended, then Btrfs is probably
> going to do wild and crazy things. And I don't know that the part of
> the block layer Btrfs depends on for this is the same (or different) as
> what the md driver depends on.

I disagree with that statement. BTRFS should be expected not to do wild and crazy things regardless of what happens with block devices. A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning any manner of corrupted data, and should not lose data or panic the kernel. A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by mounting read-only or failing all operations on the filesystem. It should not affect any other filesystem or have any significant impact on the system unless it's the root filesystem.
why does df spin up disks?
When I have a mounted filesystem why doesn't the kernel store the amount of free space? Why does it need to spin up a disk that had been spun down?
Re: BTRFS RAID5 filesystem corruption during balance
On Sun, 24 May 2015 01:02:21 AM Jan Voet wrote:
> Doing a 'btrfs balance cancel' immediately after the array was mounted
> seems to have done the trick. A subsequent 'btrfs check' didn't show
> any errors at all and all the data seems to be there. :-)

I add rootflags=skip_balance to the kernel command-line of all my Debian systems to solve this. I've had problems with the balance resuming in the past which had similar results. I've also never seen a situation where resuming the balance did any good.
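A sketch of how that looks in /etc/default/grub on a Debian system (config fragment; the existing value of the variable, if any, should be preserved and update-grub run afterwards):

```shell
# /etc/default/grub fragment: stop an interrupted balance from
# auto-resuming when the root filesystem is mounted at boot.
GRUB_CMDLINE_LINUX="rootflags=skip_balance"
# then regenerate the grub config:
# update-grub
```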
Re: Carefully crafted BTRFS-image causes kernel to crash
On Tue, 21 Apr 2015, Qu Wenruo quwen...@cn.fujitsu.com wrote:
> Although we may add extra check for such problem to improve robustness,
> but IMHO it's not a real world problem.

Some of the ReiserFS developers gave a similar reaction to some of my bug reports. ReiserFS wasn't the most robust filesystem.

I think that it should be EXPECTED that a kernel will occasionally have to deal with filesystem images that are created by hostile parties. A userspace crash or kernel freeze is not a suitable way of dealing with them.
Re: The FAQ on fsync/O_SYNC
On Mon, 20 Apr 2015, Craig Ringer cr...@2ndquadrant.com wrote:
> PostgreSQL is itself copy-on-write (because of multi-version
> concurrency control), so it doesn't make much sense to have the FS
> doing another layer of COW.

That's a matter of opinion. I think it's great if PostgreSQL can do internal checksums and error correction. But I'd rather not have to test that functionality in the field. Really I prefer to have the ZFS copies= option for databases.
Re: how to clone a btrfs filesystem
On Sat, 18 Apr 2015, Christoph Anton Mitterer cales...@scientia.net wrote:
> On Sat, 2015-04-18 at 04:24 +, Russell Coker wrote:
> > dd works. ;)
> > There are patches to rsync that make it work on block devices. Of
> > course that will copy space occupied by deleted files too.
>
> I think both are not quite the solutions I was looking for.

I know, but I don't think what you want is possible at this time.

> Guess for dd this is obvious, but for rsync I'd also lose all btrfs
> features like checksum verifications,... and even if these patches you
> mention would make it work on block devices, I'd guess it would at
> least need to read everything, it would no longer be a merge into
> another filesystem (perhaps I shouldn't have written clone)... and the
> target block device would need to have at least the size of the origin.

An rsync on block devices wouldn't lose BTRFS checksums; you could run a scrub on the target at any time to verify them. For a dd or anything based on it, the target needs to be at least as big as the source. But typical use of BTRFS for backup devices tends to result in keeping as many snapshots as possible without running out of space, which means that no matter how you copied it the target would need to be as big.

> Can't one do something like the following:
> 1) The source fs has several snapshots and subvols. The target fs is
> empty (the first time). For the first time populating the target fs:
> 2) Make ro snapshots of all non-ro snapshots and subvols on the
> source-fs.
> 3) Send/receive the first of the ro snapshots to the target fs, with no
> parent and no clone-src.
> 4) Send/receive all further ro snapshots to the target fs, with no
> parents, but each time specifying one further clone-src (i.e. all that
> have already been sent/received) so that they're used for reflinks and
> so on
> 5) At the end somehow make rw-subvols from the snapshots/subvols that
> have been previously rw (how?).

Sure, for 5 I believe you can make a rw snapshot of a ro subvol.
> Does that sound as if it would somehow work like that? Especially would
> it preserve all the reflink statuses and everything else (sparse files,
> etc.)?

Yes, but it would take a bit of scripting work.

> Some additional questions:
> a) Can btrfs send change anything(!) on the source fs?
> b) Can one abort (Ctrl-C) a send and/or receive... and make it continue
> at the same place where it was stopped?

A: yes. B: I don't know. Also I'm not personally inclined to trust send/recv at this time. I don't think it's had a lot of testing.
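The scheme in steps 2-4 above would look roughly like this with btrfs send/receive. The subvolume names are hypothetical, and the -c (clone source) flag assumes the listed snapshots already exist on both sides:

```shell
# Step 2: read-only snapshots of the rw subvols (names hypothetical)
btrfs subvolume snapshot -r /src/root /src/root.ro
btrfs subvolume snapshot -r /src/home /src/home.ro
# Step 3: first snapshot, no parent and no clone source
btrfs send /src/root.ro | btrfs receive /dst
# Step 4: further snapshots, naming the already-transferred ones as
# clone sources so reflinks are preserved on the target
btrfs send -c /src/root.ro /src/home.ro | btrfs receive /dst
# Step 5: a writable snapshot of a received read-only subvol
btrfs subvolume snapshot /dst/home.ro /dst/home
```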
Re: how to clone a btrfs filesystem
On Fri, 17 Apr 2015 11:08:44 PM Christoph Anton Mitterer wrote:
> How can I best copy one btrfs filesystem (with snapshots and subvolumes) into another, especially with keeping the CoW/reflink status of all files?

dd works. ;)

> And ideally incrementally upgrade it later (again with all snapshots/subvols, and again not losing the shared blocks between these files).

There are patches to rsync that make it work on block devices. Of course that will copy space occupied by deleted files too.
directory defrag
The current defragmentation options seem to only support defragmenting named files/directories or a recursive defragmentation of files and directories. I'd like to recursively defragment directories only. One of my systems has a large number of large files; the files are write-once and read performance is good enough. However performance of "ls -al" is often very poor, presumably due to metadata fragmentation.

The other thing I'd still like is the ability to force all metadata allocation to be from specified disks. I'd like to have a pair of SSDs for RAID-1 storage of metadata and a set of hard drives for RAID-1 storage of data.
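In the absence of a directory-only mode, the effect can be approximated by feeding only the directories to the defragment command. A sketch under assumptions: the mount point is a placeholder, and the command is printed rather than executed (drop the `echo` to run it for real, as root with btrfs-progs installed).

```shell
# Defragment only the directories (i.e. the metadata) under a mount point.
defrag_dirs() {
    # -xdev keeps find on one filesystem; batches of 64 dirs per invocation
    find "$1" -xdev -type d -print0 |
        xargs -0 -n 64 echo btrfs filesystem defragment --
}
```

For example `defrag_dirs /big` prints the batched commands that would be run.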
Re: du accuracy
On Fri, 10 Apr 2015, Liu Bo bo.li@oracle.com wrote:
> > Above are some consecutive du runs. Why does the space used go from 1.2G to 1.1G before going up again? The file was created by cat /dev/sde 2gsd so it definitely wasn't getting smaller. What's going on here?
> What's your mount options? with autodefrag or compression?

Linux server 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt4-2 (2015-01-27) x86_64 GNU/Linux
Above is the kernel I'm using, one of the recent ones from Debian/Jessie.

UUID= /big btrfs skip_balance,relatime,nosuid,nodev 0 0
Above is the /etc/fstab line.

/dev/sdb2 /big btrfs rw,seclabel,nosuid,nodev,relatime,space_cache,skip_balance 0 0
Above is the /proc/mounts line. I'm not doing anything noteworthy here.

# du -h tmp|tail -1 ; sleep 10 ; du -h tmp|tail -1
1.6G    tmp
1.4G    tmp

I've just had something similar happen when rsyncing a directory full of files to the server and running du to check on the progress. It's possible that a rename could happen at the wrong time to cause a file to be missed in the du count, so this could be legit. But the case described in my previous message concerned a single file that was being extended.
snapshot space use
# zfs list -t snapshot
NAME                        USED  AVAIL  REFER  MOUNTPOINT
hetz0/be0-mail@2015-03-10  2.88G      -   387G  -
hetz0/be0-mail@2015-03-11  1.12G      -   388G  -
hetz0/be0-mail@2015-03-12  1.11G      -   388G  -
hetz0/be0-mail@2015-03-13  1.19G      -   388G  -
hetz0/be0-mail@2015-03-14  1.02G      -   388G  -
hetz0/be0-mail@2015-03-15   989M      -   386G  -

Is there any way to do something similar to the above ZFS command? It's handy to know which snapshots are taking up the most space, especially when multiple subvols are being snapshotted.
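One hedged approximation, assuming quotas are acceptable on the filesystem: btrfs qgroups track per-subvolume referenced and exclusive byte counts, and the "exclusive" column is roughly the space a snapshot would free if deleted, similar to the USED column above. A sketch with /big as a placeholder mount point; the commands are printed rather than run (drop the `echo` to execute, as root, and note that enabling quotas has a performance cost).

```shell
snapshot_usage() {
    # one-off: enable quota accounting (can be slow on a large filesystem)
    echo btrfs quota enable "$1"
    # list qgroups sorted by exclusive usage, largest first
    echo btrfs qgroup show --sort=-excl "$1"
}
```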
Re: btrfs subvolume default subvolume
On Tue, 7 Apr 2015 02:03:04 PM arnaud gaboury wrote: Would you mind give the return of # btrfs subvolume list and $ cat /etc/fstab ? It would help me. TY ID 262 gen 1579103 top level 5 path mysql ID 263 gen 1179578 top level 5 path var/spool/squid ID 264 gen 1579262 top level 5 path xenstore ID 265 gen 1579262 top level 5 path mailstore ID 269 gen 1578978 top level 5 path xenswap ID 318 gen 1569545 top level 5 path webstore ID 328 gen 1574457 top level 262 path mysql/backup ID 329 gen 1569555 top level 264 path xenstore/backup ID 330 gen 1574457 top level 265 path mailstore/backup ID 331 gen 1569555 top level 318 path webstore/backup ID 618 gen 1569546 top level 5 path webdata ID 657 gen 1569555 top level 618 path webdata/backup ID 1456 gen 1569541 top level 5 path backup ID 1621 gen 1556937 top level 330 path mailstore/backup/2015-04-01 ID 1622 gen 1556938 top level 328 path mysql/backup/2015-04-01 ID 1623 gen 1559462 top level 330 path mailstore/backup/2015-04-02 ID 1624 gen 1559463 top level 328 path mysql/backup/2015-04-02 ID 1625 gen 1564402 top level 330 path mailstore/backup/2015-04-03 ID 1626 gen 1564403 top level 328 path mysql/backup/2015-04-03 ID 1627 gen 1566974 top level 330 path mailstore/backup/2015-04-04 ID 1628 gen 1566975 top level 328 path mysql/backup/2015-04-04 ID 1629 gen 1569540 top level 330 path mailstore/backup/2015-04-05 ID 1630 gen 1569543 top
Re: btrfs subvolume default subvolume
On Tue, 7 Apr 2015 10:58:28 AM arnaud gaboury wrote:
> After more reading, it seems to me creating a top root subvolume is the right thing to do:
> # btrfs subvolume create root
> # btrfs subvolume create root/var
> # btrfs subvolume create root/home
> Am I right?

A filesystem is designed to support arbitrary names for files, directories, and (in the case of BTRFS) subvolumes. You might ask whether /var/log is a good place for log files; it's generally regarded that yes is the correct answer to that question, but it's not a question for filesystem developers.

I like to have / in the root subvolume so all subvolumes are available by default. I have /var etc in the same subvolume so I can make an atomic snapshot of all such data. It's not the one true answer, but it works for me and will work for other people. The main thing is to have all data that needs to be atomically snapshotted on the same subvolume.
Re: Btrfs hangs 3.19-10
On Mon, 6 Apr 2015 07:40:03 AM Pavel Volkov wrote:
> On Sunday, April 5, 2015 1:04:17 PM MSK, Hugo Mills wrote:
> > That's these, I think:
> > #define BTRFS_FEATURE_INCOMPAT_BIG_METADATA (1ULL << 5)
> > #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6)
> > so it's definitely -O^extref. I don't see where big_metadata comes from, though. That's not a -O option. Try with -O^extref and see where that gets you. (Also, don't mount the FS on a newer kernel -- it may be setting big metadata automatically, although it probably shouldn't do.)
> By the way, is there any way to see which options are enabled on a local filesystem without having to try mounting it with an old kernel and checking dmesg?

# file -s /dev/dm-0
/dev/dm-0: BTRFS Filesystem sectorsize 4096, nodesize 4096, leafsize 4096, UUID=97d70558-ddea-493e-874c-ff74be9ce099, 92390113280/171796594688 bytes used, 1 devices

Above is what file(1) reports on my laptop which has been running BTRFS since 3.2 days. It gives the node size etc but not the feature flags. Some time ago I submitted a patch to the Debian package that covered everything I could figure out; I'm sure that they would accept a patch for feature flags if anyone can work out how to do it.
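As an aside, newer btrfs-progs (an assumption: a version shipping `btrfs inspect-internal dump-super`, roughly v4.x onward) can print the feature flags directly from the superblock without mounting. A printed-command sketch with a placeholder device:

```shell
show_features() {
    # the incompat_flags field in the dump-super output encodes features
    # such as extref and big_metadata
    echo btrfs inspect-internal dump-super "$1"
}
```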
Re: Btrfs hangs 3.19-10
On Mon, 6 Apr 2015 03:21:18 AM Duncan wrote:
> So... for 3.2 compatibility, extref must not be enabled (tho it's now the default and AFAIK there's no way to actually disable it, only enable, so an old btrfs-tools would have to be used that doesn't enable it by default), AND the nodesize must be set to page size, 4096 bytes for x86.

So basically we have to have an old version of mkfs.btrfs to make a filesystem that can be mounted on kernel 3.2.
Re: Btrfs hangs 3.19-10
On Fri, 3 Apr 2015 05:14:12 AM Duncan wrote:
> Well, btrfs itself isn't really stable yet... Stable series should be stable at least to the extent that whatever you're using in them is, but with btrfs itself not yet entirely stable...

Also for stable operation you want both forward and backward compatibility. You could make an Ext3 filesystem and expect that any random ancient Linux box you are likely to encounter can read it. Even Ext4 has been supported for a long time and most systems you are likely to encounter won't have any problems with it.

I recently made a BTRFS filesystem on a Debian/Jessie system (kernel 3.16.7) with default options and discovered that Debian/Wheezy (kernel 3.2.65) can't read it. I think that one criterion for "stable" in a filesystem is that kernels from a couple of previous releases can mount it. By that criterion BTRFS won't be stable for use in Debian for about 4 years.

As an aside, are there options to mkfs.btrfs that would make a filesystem mountable by kernel 3.2.65? If so I'll file a Debian/Jessie bug report requesting that a specific mention be added to the man page.
Re: Btrfs hangs 3.19-10
On Sun, 5 Apr 2015 03:16:21 AM Duncan wrote:
> Hugo Mills posted on Sat, 04 Apr 2015 13:00:47 + as excerpted:
> > On Sat, Apr 04, 2015 at 12:55:08PM +, Russell Coker wrote:
> > > As an aside are there options to mkfs.btrfs that would make a filesystem mountable by kernel 3.2.65? If so I'll file a Debian/Jessie bug report requesting that a specific mention be added to the man page.
> > Yes, there are. It's probably -O^extref, but if you can show the dmesg output from the 3.2 kernel on the failed mount (so that it shows what the actual failure was), we should be able to give you a more precise answer.

[698190.987065] Btrfs loaded
[698190.92] device fsid 118e2c64-6ce1-4f21-85e2-2d6aea8f0fa5 devid 1 transid 426 /dev/sdf1
[698191.000981] btrfs: disk space caching is enabled
[698191.000986] BTRFS: couldn't mount because of unsupported optional features (60).
[698191.018176] btrfs: open_ctree failed

> So I was thinking about this, and the several other earlier options where support wasn't added until kernel X, and had an idea... How easy and useful might it be to add to mkfs.btrfs appropriate option-group aliases such that if one knew the oldest kernel one was likely to deal with, all one would need to do for the mkfs would be to set for example -O3.2, or even simply --3.2 (or maybe even --32), and have mkfs.btrfs automatically set/unset the appropriate options so it would just work with that kernel and anything newer?

That would be really useful. Also it would be good if the code structure allowed adding extra aliases, so for Debian we could add an option -Owheezy.
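In the absence of such aliases, a hedged sketch of what a "-O3.2" alias might expand to, based on the thread above: disable extended inode refs and keep the node size at the x86 page size. The device path is a placeholder and the command is printed rather than run; whether `-O ^extref` is accepted depends on the btrfs-progs version.

```shell
mkfs_for_3_2() {
    # ^extref disables extended inode refs; -n 4096 keeps the nodesize at
    # page size - per the thread, both appear needed for a 3.2 mount
    echo mkfs.btrfs -O ^extref -n 4096 "$1"
}
```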
Re: btrfs balance fails with no space errors (despite having plenty)
I've been having ongoing issues with balance failing with no space errors in spite of having plenty. Strangely it seems to most often happen from cron jobs, when a cron job fails I can count on a manual balance succeeding. I'm running the latest Debian/Jessie kernel.

-- Sent from my Samsung Galaxy Note 3 with K-9 Mail.
Re: btrfs fstrim status (RAID5)
Debian/Wheezy userspace can't be expected to work as well as desired with a 3.19 kernel. Wheezy with BTRFS single or RAID-1 works reasonably well as long as you have lots of free space, balance it regularly, and configure it not to resume a balance on reboot. Debian/Jessie works well with BTRFS single or RAID-1 without any issues. But don't use RAID-5/6. Jessie userspace should work well with 3.19 and 4.0 kernels.

On March 23, 2015 4:38:49 AM GMT+11:00, Hugo Mills h...@carfax.org.uk wrote:
> On Sun, Mar 22, 2015 at 05:30:59PM +, scruffters wrote:
> > Hello list, I wondered if anybody could advise on the status of using fstrim with BTRFS. The motivation for me is to finally try using SSDs in a R5 style configuration with some way of running maintenance overnight to maintain performance... [snip] My kernel is 3.2.0-4 (Debian Wheezy).
> Then you shouldn't use btrfs RAID-5 at all, and probably shouldn't be using btrfs. For parity RAID, 3.19 or later is what you should be using, as it supports recovery and rebuild. You can get those kernels from the Debian experimental repo. If you're not using parity RAID, then the last few kernels have been OK, but keeping up with the latest version is still a reasonably good idea. Hugo.

-- Sent from my Samsung Galaxy Note 3 with K-9 Mail.
Re: syntax for deleting subvolumes?
On Fri, 20 Mar 2015 04:18:38 AM Duncan wrote:
> If cp --reflink=auto was the default, it'd just work, making a reflink where possible, falling back to a normal copy where not possible to reflink. However, I'd be wary of such a change, because admins are used to cp creating a separate copy which may well be intended as a backup, guarding against I/O errors on the original. Btrfs does do checksum checking to ensure validity, but if that fails, do you really want it failing for BOTH copies, including the one the admin made specifically to AVOID such

If you have a second copy on the same filesystem then an error on any of the metadata that is a parent will corrupt both. EG if you have 2 copies under your home directory then a directory error for /, /home, or /home/$USER will make both copies unavailable. Also any errors on superblocks or other essential filesystem metadata risk losing both copies.

If BTRFS was to adopt something equivalent to the ZFS copies= feature then the sysadmin could specify that certain subtrees would have extra copies of data AND each metadata block would have 1 more copy than the data blocks it refers to.

In conclusion I think that the ZFS copies= feature is the correct solution to this problem. Until/unless the BTRFS developers copy that design concept, the thing to do if you want backups on the same storage hardware is to copy the data to a filesystem on a different partition - the NetApp research shows that disk read errors tend to be correlated by location on disk.
Re: syntax for deleting subvolumes?
On Thu, 19 Mar 2015 07:18:29 AM Erkki Seppala wrote:
> > But as a user level facility, I want to be able to snapshot before making a change to a tree full of source code and (re)building it all over again. I may want to keep my new build, but I may want to flush it and return to a known good state.
> You may want to check out cp --reflink=always (different from cp --link), which creates a copy-on-write copy of the data. It isn't quite as fast as snapshots to create, but it's still plenty fast and without the downsides of subvolumes.

In which situations is --reflink=always expected to fail on BTRFS? I recently needed to move /home on one system to a subvol (instead of being in the root subvol) and used cp --reflink=always which gave a few errors when running with BTRFS on kernel 3.16.7-ckt7-1. I was in a hurry and didn't try to track down the cause, I just ran it again with --reflink=auto which gave no errors and completed very quickly (obviously the files which couldn't be reflinked were small).
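For reference, the difference between the two modes: --reflink=always fails outright when extents can't be shared, while --reflink=auto falls back to a plain byte copy, so the copy itself always succeeds. A minimal runnable illustration (GNU coreutils cp; works on any filesystem precisely because auto degrades gracefully):

```shell
reflink_copy() {
    # try to share extents; silently fall back to an ordinary copy if the
    # filesystem (or this particular pair of files) can't reflink
    cp --reflink=auto -- "$1" "$2"
}
```

On BTRFS the copy is near-instant and shares extents; elsewhere it behaves like plain cp.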
Re: Is it safe or useful to use NOCOW flag and autodefrag mount option at same time?
On Sun, 15 Mar 2015, peer@gmx.net wrote:
> Following common recommendations [1], I use these mount options on my main developing machine: noatime,autodefrag. This is a desktop machine and it works well so far. Now, I'm also going to install several KVM virtual machines on this system. I want to use qcow2 files stored on SSD with a btrfs on it. In order to avoid bad performance with the VMs, I want to disable the Copy-On-Write mechanism on the storage directory of my VM images as for example described in [2].

Why do you expect a great performance benefit from that? As there is no real seek time, SSDs probably won't give you much benefit from defragmenting. As for disabling CoW, that will reduce the number of writes (as you don't need to write the metadata all the way up the tree) and improve performance, but not as much as on spinning media where you need to do seeks for all that. Finally, having checksums on everything to give the possibility of recognising corrupt data is a really good feature and something that you want on your VM images.

So far I have never even tried disabling CoW or using autodefrag. All of my BTRFS filesystems either have low performance requirements or run on SSD.
single GlobalReserve
# btrfs fi df /big
Data, RAID1: total=2.56TiB, used=2.56TiB
System, RAID1: total=32.00MiB, used=388.00KiB
Metadata, RAID1: total=19.25GiB, used=14.06GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Why is GlobalReserve "single"? That filesystem has been RAID-1 for ages (since long before the default leaf size became 16K). When GlobalReserve was added it should have taken the same allocation policy as metadata.
Re: ignoring bad blocks
On Sun, 4 Jan 2015 13:46:30 Chris Murphy wrote:
> So to use dd you need to also use bs=, setting it to a multiple of 4096. Of course most people using dd for zeroing set bs= to a decently high value because it makes the process go much faster than the default block size of 512 bytes.

You could just use cat(1) to write to the disk. Cat SHOULD write a minimum of page size buffers unless the kernel tells it that the device block size is larger.
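For illustration, the two approaches side by side; /dev/sdX is a placeholder and the commands are printed rather than executed, since both are destructive:

```shell
zero_disk() {
    # bs=1M is a multiple of the 4096-byte physical sector size and large
    # enough to keep the device streaming
    echo dd if=/dev/zero of="$1" bs=1M
    # alternative: let cat pick its own (page-size or larger) buffers
    echo "cat /dev/zero > $1"
}
```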
Debian/Jessie 3.16.7-ckt2-1 kernel error
I've attached the kernel message log that I get after booting kernel 3.16.7 from Debian/Unstable. This is the kernel branch that will go into Debian/Jessie so it's important to get it fixed. Below has the start of the errors, the attached file has everything from boot. I've got similar issues on another system, would it help if I collect the logs from multiple systems?

[6.618809] Btrfs loaded
[6.670878] BTRFS: device fsid a2d7cbbc-23be-4d97-b2bb-de99b0c58c7d devid 1 transid 548614 /dev/mapper/root-crypt
[6.706907] BTRFS info (device dm-0): disk space caching is enabled
[6.798762] BTRFS: detected SSD devices, enabling SSD mode
[6.881272] [ cut here ]
[6.881358] WARNING: CPU: 3 PID: 198 at /build/linux-CMiYW9/linux-3.16.7-ckt2/fs/btrfs/delayed-inode.c:1410 btrfs_commit_transaction+0x38a/0x9c0 [btrfs]()

kern-err.gz Description: GNU Zip compressed data
Re: Balance scrub defrag
On Wed, 10 Dec 2014 17:17:28 Robert White wrote:
> A _monthly_ scrub is maybe worth scheduling if you have a lot of churn in your disk contents.

I do weekly scrubs. I recently had 2 disks in a RAID-1 array develop read errors within a month of each other. The first scrub after replacing sdb revealed an error on sdc!

> Defragging should be done after significant content additions/changes (like replacing a lot of files via package management) and limited to the directories most likely changed.

I have never run defrag. Currently all my BTRFS filesystems that have any performance requirements are on SSD and I don't think that defragmenting a SSD does much good.

> Balancing is almost never necessary and can be anti-helpful if a filesystem experiences random updates in batches (because the nicely packed file may end up far, far away from the active data extent where its COW events are taking place).

The problem with running out of metadata space creates a need for an occasional data balance. If you set it to only balance chunks that are less than 10% used then it doesn't take much time.
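The usage-limited balance described above can be expressed with the balance filters; /big is a placeholder mount point and the command is printed rather than run (drop the `echo` to execute, as root):

```shell
light_balance() {
    # only rewrite data chunks that are under 10% full, cheaply returning
    # their mostly-empty allocations to the pool
    echo btrfs balance start -dusage=10 "$1"
}
```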
Re: pro/cons of raid1 with mdadm/lvm2
On Mon, 1 Dec 2014, Chris Murphy li...@colorremedies.com wrote:
> On Sun, Nov 30, 2014 at 3:06 PM, Russell Coker russ...@coker.com.au wrote:
> > When the 2 disks have different data mdadm has no way of knowing which one is correct and has a 50% chance of overwriting good data. But BTRFS does checksums on all reads and solves the problem of corrupt data - as long as you don't have 2 corrupt sectors in matching blocks.
> Yeah. I'm not sure though if openSUSE 13.2 prevents users from creating btrfs raid1 volumes entirely, or if it's just an install time limitation.

With BTRFS you can make it RAID-1 afterwards. The possibility of data loss during system install usually isn't something you are concerned about so this shouldn't be a problem.

> I know that Fedora's installer won't allow the user to create Btrfs on LVM, and it probably doesn't allow it on md raid either.

For LVM that's reasonable, for MD-RAID that would be a bug IMHO.

On Mon, 1 Dec 2014, Roman Mamedov r...@romanrm.net wrote:
> * mdadm RAID has much better read balancing; Btrfs reads are satisfied from what's in effect a random drive (PID-based balancing of threads to drives), mdadm reads from the less-loaded drive. Also mdadm has a way to specify some RAID1 array members as to be never used for reads if at all possible (write-mostly), which helps in RAID1 of HDD and SSD.

True. But that's just a lack of performance tuning in the current code, it will be fixed at some future time.
Re: pro/cons of raid1 with mdadm/lvm2
When the 2 disks have different data mdadm has no way of knowing which one is correct and has a 50% chance of overwriting good data. But BTRFS does checksums on all reads and solves the problem of corrupt data - as long as you don't have 2 corrupt sectors in matching blocks.

-- Sent from my Samsung Galaxy Note 3 with K-9 Mail.
Balance and RAID-1
I had a RAID-1 filesystem with 2*3TB disks and 330G of disk space free according to "df -h". I replaced a 3TB disk with a 4TB disk and df reported no change in the free space (as expected). I added a 1TB disk to the filesystem and there was still no change! I expected that adding a 1TB disk would give 3TB+1TB on one side and 4TB on the other so I would instantly get an extra 1TB of free space according to "df -h".

1072 out of about 2734 chunks balanced (1073 considered), 61% left

I ran btrfs balance for just over a day and "btrfs balance status" reported the above. But it still only showed 860G free according to "df -h". I've cancelled the balance because 860G of free space is enough for the moment and I don't want that server running slowly and making noise any more.

I think that the allocation of disk space needs to be improved. If I added a 1TB disk to a pair of 3TB disks then it would be quite reasonable for some serious reallocation to be required to make use of the extra space. But when I added a 1TB disk to an array that had a 3TB and a 4TB then it should be able to make use of the space quite easily.
Re: Balance and RAID-1
On Fri, 28 Nov 2014, Zygo Blaxell ce3g8...@umail.furryterror.org wrote:
> On Fri, Nov 28, 2014 at 01:37:50AM +1100, Russell Coker wrote:
> > I had a RAID-1 filesystem with 2*3TB disks and 330G of disk space free according to df -h. I replaced a 3TB disk with a 4TB disk and df reported no change in the free space (as expected).
> Did you btrfs resize that 4TB disk? If not, btrfs still thinks the 4TB disk is a 3TB disk, and the rest follows from that.

Thanks, you are correct, I had missed that step. It would be nice if the replace command would inform the user of the amount of space. Something like "the new device is 1000GB larger than the old, you might want to resize".
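The missed step, as a sketch: after replacing onto a bigger device, grow that device's slice of the filesystem to its full size. The devid and mount point are placeholders; the command is printed rather than executed (drop the `echo` to run it, as root):

```shell
grow_after_replace() {
    # $1 = devid of the new device (see: btrfs filesystem show)
    # $2 = mount point; "max" grows to the whole device
    echo btrfs filesystem resize "$1:max" "$2"
}
```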
resetting device stats
When running Debian kernel version 3.16.0-4-amd64 and btrfs-tools version 3.17-1.1 I ran a btrfs replace operation to replace a 3TB disk that was giving read errors with a new 4TB disk. After the replace the "btrfs device stats" command reported that the 4TB disk had 16 read errors. It appears that the device stats are copied in a replace operation. I believe that this is a bug. There is no reason to claim that the new 4TB disk had read errors when it was the old 3TB disk that had the errors.
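As a workaround, the carried-over counters can be zeroed by hand: the -z flag prints the stats and then resets them. /big is a placeholder mount point; the command is printed rather than run (drop the `echo` to execute, as root):

```shell
reset_dev_stats() {
    # print the current counters for every device, then reset them to zero
    echo btrfs device stats -z "$1"
}
```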
strange device stats message
I am in the middle of replacing /dev/sdb (which is a 3TB SATA disk that gives a few read errors on every scrub) with /dev/sdc2 (a partition on a new 4TB SATA disk). I am running btrfs-tools version 3.17-1.1 from Debian/Unstable and Debian kernel 3.16.0-4-amd64. I get the following, the last section of which seems wrong. Would this be a bug in the kernel or btrfs-tools?

# btrfs device stats /big
[/dev/sdc2].write_io_errs    0
[/dev/sdc2].read_io_errs     0
[/dev/sdc2].flush_io_errs    0
[/dev/sdc2].corruption_errs  0
[/dev/sdc2].generation_errs  0
[/dev/sdb].write_io_errs     0
[/dev/sdb].read_io_errs      16
[/dev/sdb].flush_io_errs     0
[/dev/sdb].corruption_errs   0
[/dev/sdb].generation_errs   0
[/dev/sdd].write_io_errs     0
[/dev/sdd].read_io_errs      0
[/dev/sdd].flush_io_errs     0
[/dev/sdd].corruption_errs   0
[/dev/sdd].generation_errs   0
[].write_io_errs     0
[].read_io_errs      0
[].flush_io_errs     0
[].corruption_errs   0
[].generation_errs   0
open_ctree problem
I have a workstation running Linux 3.14.something on a 120G SSD. It recently had a problem and now the root filesystem can't be mounted; here is the message I get when trying to mount it read-only on Debian kernel 3.16.2-3:

[4703937.784447] BTRFS info (device loop0): disk space caching is enabled
[4703938.754247] BTRFS: log replay required on RO media
[4703938.794148] BTRFS: open_ctree failed

When I tried to boot it normally it gave a lot of kernel messages and failed to mount it. Here's the error I get from the btrfs-zero-log in btrfs-tools 0.19+20130501-1:

# btrfs-zero-log yayia-corrupt
extent buffer leak: start 157263929344 len 4096
*** Error in `btrfs-zero-log': corrupted double-linked list: 0x01068960 ***
Aborted

I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error. But when I tried to mount the filesystem I got the attached kernel error when trying to mount with Debian kernel 3.16.2-3. Any suggestions on what I should do next?

dmesg.txt.gz Description: GNU Zip compressed data
Re: open_ctree problem
Strangely I repeated the same process on the same system (btrfs-zero-log and mount read-only) and it worked. While it's a concern that repeating the same process gives different results it's nice that I'm getting all my data back.

On Sun, 23 Nov 2014, Russell Coker russ...@coker.com.au wrote:
> I have a workstation running Linux 3.14.something on a 120G SSD. It recently had a problem and now the root filesystem can't be mounted, here is the message I get when trying to mount it read-only on Debian kernel 3.16.2-3:
> [4703937.784447] BTRFS info (device loop0): disk space caching is enabled
> [4703938.754247] BTRFS: log replay required on RO media
> [4703938.794148] BTRFS: open_ctree failed
> When I tried to boot it normally it gave a lot of kernel messages and failed to mount it. Here's the error I get from the btrfs-zero-log in btrfs-tools 0.19+20130501-1:
> # btrfs-zero-log yayia-corrupt
> extent buffer leak: start 157263929344 len 4096
> *** Error in `btrfs-zero-log': corrupted double-linked list: 0x01068960 ***
> Aborted
> I installed btrfs-tools 3.17-1 and then btrfs-zero-log ran without error. But when I tried to mount the filesystem I got the attached kernel error when trying to mount with Debian kernel 3.16.2-3. Any suggestions on what I should do next?
Re: NOCOW and Swap Files?
Also it would be nice to have checksums on the swap data. It's a bit of a waste to pay for ECC RAM and then lose the ECC benefits as soon as data is paged out.

--
Sent from my Samsung Galaxy Note 3 with K-9 Mail.
Re: device balance times
Also a device replace operation requires that the replacement be the same size (or maybe larger), while a remove and replace allows the replacement to be merely large enough to contain all the data. Given the size variation in what manufacturers call the same size of disk this isn't uncommon - unless you just get a replacement of the next size up (which is a good option too).

On October 23, 2014 3:59:31 AM GMT+11:00, Bob Marley bobmar...@shiftmail.org wrote:

On 22/10/2014 14:40, Piotr Pawłow wrote:
On 22.10.2014 03:43, Chris Murphy wrote:
On Oct 21, 2014, at 4:14 PM, Piotr Pawłow p...@siedziba.pl wrote:

Looks normal to me. Last time I started a balance after adding a 6th device to my FS, it took 4 days to move 25GBs of data.

It's long term untenable. At some point it must be fixed. It's way, way slower than md raid. At a certain point it needs to fall back to block level copying, with a ~32KB block. It can't be treating things as if they're 1K files, doing file level copying that takes forever. It's just too risky that another device fails in the meantime.

There's device replace for restoring redundancy, which is fast, but not implemented yet for RAID5/6.

Device replace on raid 0,1,10 works if the device to be replaced is still alive, otherwise the operation takes as long as a rebalance and works similarly (AFAIR).

Which is way too long in terms of the likelihood of another disk failing. Additionally, it seeks like crazy during the operation, which also greatly increases the likelihood of another disk failing. Until this is fixed I am not confident in using btrfs on a production system which requires RAID redundancy. The operation needs to be streamlined: it should be as sequential as possible (sort files according to their LBA before reading/writing), with the fewest number of seeks on every disk, and with large buffers, so that reads from the source disk(s) and writes to the replacement disk go at platter speed or near there.
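The two replacement paths discussed above differ in both speed and size requirements. A sketch of each, with assumed device names and mount point; the commands need root on a real array, so they are only printed here:

```shell
# Two ways to swap a disk out of a btrfs array; /dev/sdc3, /dev/sdd3 and
# /mnt are assumptions. The commands are printed, not executed.
printf '%s\n' \
  '# same-size (or larger) spare: one streamlined operation' \
  'btrfs replace start /dev/sdc3 /dev/sdd3 /mnt' \
  'btrfs replace status /mnt' \
  '# smaller spare that still fits the data: add, then delete' \
  'btrfs device add /dev/sdd3 /mnt' \
  'btrfs device delete /dev/sdc3 /mnt'
```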
Re: strange 3.16.3 problem
On Tue, 21 Oct 2014, Zygo Blaxell zblax...@furryterror.org wrote:
On Mon, Oct 20, 2014 at 04:38:28AM +0000, Duncan wrote:
Russell Coker posted on Sat, 18 Oct 2014 14:54:19 +1100 as excerpted:

# find . -name *546
./1412233213.M638209P10546
# ls -l ./1412233213.M638209P10546
ls: cannot access ./1412233213.M638209P10546: No such file or directory

Does your mail server do a lot of renames? Is one perhaps stuck? If so, that sounds like the same thing Zygo Blaxell is reporting in the 3.16.3..3.17.1 hang in renameat2() thread, OP on Sun, 19 Oct 2014 15:25:26 -0400, Msg-ID: 20141019192525.ga29...@hungrycats.org, as linked here:

It's a Maildir server so it does a lot of renames, but I don't think anything is stuck. I've just rebooted the Dom0 and nothing has changed.

For Russell's issue... most of the stuff I can think of has been tried already. I didn't see if there was any attempt to ls the file from the NFS server as well as the client side. If ls is OK on the server but not the client, it's an NFS issue (possibly interacting with some btrfs-specific quirk); otherwise, it's likely a corrupted filesystem (mail servers seem to be unusually good at making these).

# ls -l *546
ls: cannot access *546: No such file or directory

Above is on the server.

# ls -l *546
ls: cannot access 1412233213.M638209P10546: No such file or directory

Above is on the client. Note that wildcard expansion worked because readdir() found the file even though stat can't.

Most of the I/O time on mail servers tends to land in the fsync() system call, and some nasty fsync() btrfs bugs were fixed in 3.17 (i.e. after 3.16, and not in the 3.16.x stable updates for x <= 5, the last one I've checked). That said, I'm not familiar with how fsync() translates over NFS, so it might not be relevant after all.

That's going to suck for people running mail servers on Debian.

If the NFS server's view of the filesystem is OK, check the NFS protocol version from /proc/mounts on the client.
Sometimes NFS clients will get some transient network error during connection and fall back to some earlier (and potentially buggier) NFS version. I've seen very different behavior in some important corner cases from v4 and v3 clients, for example, and if the client is falling all the way back to v2 the bugs and their workarounds start to get just plain _weird_ (e.g. filenames which produce specific values from some hash function or that contain specific character sequences are unusable). v2 is so old it may even have issues with 64-bit inode numbers.

Rebooting the client multiple times and rebooting the server once doesn't change it. I don't think it's any transient error.

On Tue, 21 Oct 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote:
Just now saw this thread, but IIRC 'No such file or directory' also gets returned sometimes when trying to automount a share that can't be enumerated by the client, and also sometimes when there is a stale NFS file handle.

I think that rebooting both client and server precludes the possibility of a stale file handle. Even rebooting the client (which I have done several times) should fix it.

On Tue, 21 Oct 2014, Robert White rwh...@pobox.com wrote:
Okay, from the strace output the shell _is_ finding the file in the directory read and expand (readdir) pass. That is, *546 is being expanded to the full file name text 1412233213.M638209P10546 but then the actual operation fails because the name is apparently not associated with anything. So what pass of scrub or btrfsck checks directory connectedness? Does that pass give your file system a clean bill of health?

That's inconvenient to check for a remote system with a single BTRFS filesystem.

Also you said that you are using a 32bit user space copied from another server under a 64bit kernel. Is the ls command a 32 bit executable then?

Yes.

What happens if you stop the Xen domain for the mail server and then mount the disks into a native 64bit environment and then ls the file name?
The filesystem in question is NFS mounted from a server with 64bit kernel+user to a virtual server with 64bit kernel+32bit user. On the file server (the Xen Dom0) ls doesn't even see that file in readdir.

I ask because the man page for lstat64 says it's a wrapper for the underlying system call (fstatat64). It is not impossible that you might have a case where the wrapper is failing inside glibc due to some 32/64 bit conversion taking place.

If there is a 32/64 conversion then we have another problem. The mail server is configured to reject messages bigger than about 50M, I don't recall the exact number but it's a lot smaller than 2G.

On Tue, 21 Oct 2014, Goffredo Baroncelli kreij...@inwind.it wrote:
Could this be related to the inode overflow on 32 bit systems (see the inode_cache option)? If so, running a 64bit ls -i should work.

I've just installed coreutils:amd64 on the NFS client and I get the same results.
Re: strange 3.16.3 problem
I've just upgraded the Dom0 (NFS server) from 3.16.3 to 3.16.5 and it all works. Prior to upgrading the Dom0 I had the same problem occur with different file names. All the names in question were truncated names of files that exist. It seems that 3.16.3 has a bug with NFS serving files with long names.

Thanks for all the suggestions.
Re: strange 3.16.3 problem
On Sat, 18 Oct 2014, Michael Johnson - MJ m...@revmj.com wrote:
The NFS client is part of the kernel IIRC, so it should be 64 bit. This would allow the creation of files larger than 4GB and create possible issues with a 32 bit user space utility.

A correctly written 32bit application will handle files over 4G in size. While some applications may have problems, I'm fairly sure that ls will be OK.

# dd if=/dev/zero of=/tmp/test bs=1024k count=1 seek=5000
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00383089 s, 274 MB/s
# /bin/ls -lh /tmp/test
-rw-r--r--. 1 root root 4.9G Oct 18 20:47 /tmp/test
# file /bin/ls
/bin/ls: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0xd3280633faaabf56a14a26693d2f810a3e51, stripped

A quick test shows that a 32bit ls can handle this.

I would mount from a client with 64 bit user space and see if the problem occurs there. If so, it is probably not a btrfs issue (if I am understanding your environment correctly).

I'll try that later.
Re: strange 3.16.3 problem
On Sun, 19 Oct 2014, Robert White rwh...@pobox.com wrote:
On 10/17/2014 08:54 PM, Russell Coker wrote:

# find . -name *546
./1412233213.M638209P10546
# ls -l ./1412233213.M638209P10546
ls: cannot access ./1412233213.M638209P10546: No such file or directory

Any suggestions?

Does ls -l *546 show the file to exist? e.g. what happens if you use the exact same wildcard in the ls command as you used in the find?

# ls -l *546
ls: cannot access 1412233213.M638209P10546: No such file or directory

That gives the same result as find: the shell matches the file name but then ls can't view it.

lstat64("1412233213.M638209P10546", 0x9fab0c8) = -1 ENOENT (No such file or directory)

From strace, the lstat64 system call fails.

It is possible (and back in the day it was quite common) for files to be created with non-renderable nonsense in the name. For instance if the first four characters of the name were 13^H4 (where ^H is the single backspace character) the file would look like it was named 14* but it would be listed by ls using 13*. If the file name is damaged, which is usually a failing in the program that created the file, then it can be hidden in plain sight.

If that's the case then it's still a kernel bug somewhere. Maildrop and Dovecot don't create files with any unusual characters in the names.

Note that this sort of name is hidden from the copy-paste done in the terminal window because the binary nonsense is just not in the output any more by the time you select it with the mouse. It doesn't have to be a backspace, BTW, it can be any character that the terminal window will not render. If things get really ugly you may need to remove the file using find . -name *546 -exec rm {} \;

# find . -name *546 -exec rm {} \;
rm: cannot remove `./1412233213.M638209P10546': No such file or directory
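The hidden-character failure mode described above is easy to demonstrate with a throwaway file; a minimal sketch (the 13^H4 name is just the example used in the thread):

```shell
# Create a file whose name contains a literal backspace, then compare
# plain ls with ls -b, which escapes non-printable characters instead of
# letting the terminal eat them.
dir=$(mktemp -d)
cd "$dir"
: > "$(printf '13\b4')"
ls      # the terminal may render this as just "14"
ls -b   # shows the real name: 13\b4
```

Files damaged this way can still be removed by inode, e.g. `find . -inum N -delete`, which sidesteps name rendering entirely.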
strange 3.16.3 problem
I have a system running the Debian 3.16.3-2 AMD64 kernel for the Xen Dom0 and the DomUs. The Dom0 has a pair of 500G SATA disks in a BTRFS RAID-1 array. The RAID-1 array has some subvols exported by NFS as well as a subvol for the disk images for the DomUs - I am not using NoCOW as performance is fine without it and I like having checksums on everything.

I have started having some problems with a mail server that is running in a DomU. The mail server has 32bit user-space because it was copied from a 32bit system and I had no reason to upgrade it to 64bit, but it's running a 64bit kernel so I don't think that 32bit user-space is related to my problem.

# find . -name *546
./1412233213.M638209P10546
# ls -l ./1412233213.M638209P10546
ls: cannot access ./1412233213.M638209P10546: No such file or directory

Above is the problem: find says that the file in question exists but ls doesn't think so. The file in question is part of a Maildir spool that's NFS mounted. This problem persisted across a reboot of the DomU, so it's a problem with the Dom0 (the NFS server). The dmesg output on the Dom0 doesn't appear to have anything relevant, and a find command there doesn't find the file. I don't know if this is an NFS problem or a BTRFS problem. I haven't rebooted the Dom0 yet because a remote reboot of a server running a kernel from Debian/Unstable is something I try to avoid.

Any suggestions?
Re: deleting a dead device
On Sun, 21 Sep 2014 11:05:46 Chris Murphy wrote:
On Sep 20, 2014, at 7:39 PM, Russell Coker russ...@coker.com.au wrote:

Anyway the new drive turned out to have some errors, writes failed and I've got a heap of errors such as the above.

I'm curious if smartctl -t conveyance reveals any problems. It's not a full surface test but is designed to be a test for (typical?) problems drives have due to shipment damage, and doesn't take very long.

Unfortunately I got to your message after sending the defective drive to e-waste. But I expect that any test that involves real reads/writes would report a failure (if the USB-SATA device supported passing them through) as the drive seemed to fail for everything.

# btrfs device delete /dev/sdc3 /
ERROR: error removing the device '/dev/sdc3' - Invalid argument
It seems that I can't remove the device because removing requires writing.

What kernel message do you get associated with this? Try using the devid instead of /dev/.

I'll keep that in mind.

device delete dev [dev...] path
Remove device(s) from a filesystem identified by path.

The man page has the above text which makes no mention of devid, so I think we need a documentation patch for this.

For future reference, btrfs replace start is better to use than add+delete. It's an optimization but it also makes it possible to ignore the device being replaced for reads; and you can also get a status on the progress with btrfs replace status. And it looks like it does some additional error checking.

Oh yes, I've done this and it works well. However it doesn't work if the replacement is smaller than the device being replaced.

Also as an aside, while the stats about write errors are useful, in this case it would be really good if there was a count of successful writes; it would be useful to know if the successful write count was close to 0.

I think this is for other tools.
Btrfs is a file system; it's responsible for the integrity of the data it writes. I don't think it's responsible for prequalifying drives.

I agree that it doesn't have to prequalify drives. But it should expose all the data it has which can be of use to the sysadmin. After it was too late I realised that I could have used iostat to get stats for the block device. But it would still be nice to have stats from btrfs. Also btrfs has to deal with the fact that drives may fail at any time. Admittedly I was using a drive I knew to be slightly sub-standard (I got it free because it gave an error in a client's RAID-Z array). But sometimes drives like that last for years; it's difficult to predict.

Even a simple dd if=/dev/zero of=/dev/sdc bs=64k count=1600 will write out 100MB, and dmesg will show if there are any controller or drive problems on writes. You may have to do more than 100MB for problems to show up but you get the idea.

True. But a drive can fail after 101MB.
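The iostat point above can be done without any extra tools: the kernel already counts successful writes per block device in /proc/diskstats. A hypothetical helper (the function name and the optional stats-file argument, used here for testing, are my own inventions; "sda" is an assumed device name):

```shell
# Field 3 of /proc/diskstats is the device name; field 8 is the number
# of writes completed successfully since boot.
writes_completed() {
    # usage: writes_completed DEVICE [STATSFILE]
    awk -v dev="$1" '$3 == dev { print $8 }' "${2:-/proc/diskstats}"
}
writes_completed sda   # prints a number on a system with an sda disk
```

A count near zero shortly after adding a device would confirm that almost no writes ever reached it.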
Re: deleting a dead device
On Sun, 21 Sep 2014, Duncan 1i5t5.dun...@cox.net wrote:
Russell Coker posted on Sun, 21 Sep 2014 11:39:17 +1000 as excerpted:

On a system running the Debian 3.14.15-2 kernel I added a new drive to a RAID-1 array. My aim was to add a device and remove one of the old devices.

That's an old kernel and presumably an old btrfs-progs. Quite a number of device management fixes have gone in recently, and you'd likely not be in quite that predicament were you running a current kernel (make sure it's 3.16.2+ or 3.17-rc2+ to get the fix for the workqueues bug that affected 3.15 and thru 3.16.1 and 3.17-rc1).

3.16.2 is in Debian, I'm in the process of upgrading to it.

And the recommended way to handle a device replace now would be btrfs replace, doing the add and delete in a single (albeit possibly long) step instead of separately.

I'm changing from a 500G single filesystem to a 200G RAID-1 (there's only 150G of data). The change from 500G to 200G can't be done with a replace as a replace requires an equal or greater size. I did a 2 step process: add/delete to go to a 200G USB attached device for half the array and then replace to go from 200G on USB to 200G internal.

The drive is attached by USB so I turned off the USB device and then got the above result. So it still seems impossible to remove the device even though it's physically not present. I've connected a new USB disk which is now /dev/sdd, so it seems that BTRFS is keeping the name /dev/sdc locked. Should there be a way to fix this without rebooting or anything?

Did you try btrfs device delete missing? It's documented on the wiki but apparently not yet on the manpage.

I did that after rebooting. It didn't occur to me to try a missing operation when the drive really wasn't missing.

According to the wiki that deletes the first device that was in the metadata but not found when booting, so you may have to reboot to do it, but it should work.

That would be a bug.
There's no reason a reboot should be required if we can remove a drive and add a new one with the kernel recognising it. Hot-swap disks aren't any sort of new feature.

Tho with the recent stale-devices fixes, were that a current kernel you may not actually have to reboot to have delete missing work. But you probably will on 3.14, and of course to upgrade kernels you'd have to reboot anyway, so...

Yes a reboot was needed anyway. But I'd have liked to delay that.

Also as an aside, while the stats about write errors are useful, in this case it would be really good if there was a count of successful writes; it would be useful to know if the successful write count was close to 0. My understanding of the BTRFS design is that there would be no performance penalty for adding counts of the number of successful reads and writes to the superblock. Could this be done?

Not necessarily for reads; consider the case when the filesystem is read-only, as my btrfs root filesystem is by default -- lots of reads but likely no writes and no super-block updates for the entire uptime. But I believe you're correct for writes, since they'd ultimately update the superblocks anyway.

For the case of a read-only filesystem it's OK to skip read stats. It would also be a bad idea to update read stats without writing data. But there's no reason why read stats couldn't be accumulated in-memory and written out the next time something was written to disk. That would give a slight inaccuracy in the case where there was a power failure after some period of reading without writing, but that's an unusual corner case.
deleting a dead device
On a system running the Debian 3.14.15-2 kernel I added a new drive to a RAID-1 array. My aim was to add a device and remove one of the old devices.

Sep 21 11:26:51 server kernel: [2070145.375221] BTRFS: lost page write due to I/O error on /dev/sdc3
Sep 21 11:26:51 server kernel: [2070145.375225] BTRFS: bdev /dev/sdc3 errs: wr 269, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:21 server kernel: [2070175.517691] BTRFS: lost page write due to I/O error on /dev/sdc3
Sep 21 11:27:21 server kernel: [2070175.517699] BTRFS: bdev /dev/sdc3 errs: wr 270, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:21 server kernel: [2070175.517712] BTRFS: lost page write due to I/O error on /dev/sdc3
Sep 21 11:27:21 server kernel: [2070175.517715] BTRFS: bdev /dev/sdc3 errs: wr 271, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:51 server kernel: [2070205.665947] BTRFS: lost page write due to I/O error on /dev/sdc3
Sep 21 11:27:51 server kernel: [2070205.665955] BTRFS: bdev /dev/sdc3 errs: wr 272, rd 0, flush 0, corrupt 0, gen 0
Sep 21 11:27:51 server kernel: [2070205.665967] BTRFS: lost page write due to I/O error on /dev/sdc3
Sep 21 11:27:51 server kernel: [2070205.665971] BTRFS: bdev /dev/sdc3 errs: wr 273, rd 0, flush 0, corrupt 0, gen 0

Anyway the new drive turned out to have some errors: writes failed and I've got a heap of errors such as the above. The errors started immediately after adding the drive and the system wasn't actively writing to the filesystem. So very few (if any) writes made it to the device.

# btrfs device delete /dev/sdc3 /
ERROR: error removing the device '/dev/sdc3' - Invalid argument

It seems that I can't remove the device because removing requires writing.
# btrfs device delete /dev/sdc3 /
ERROR: error removing the device '/dev/sdc3' - No such file or directory
# btrfs device stats /
[/dev/sda3].write_io_errs   0
[/dev/sda3].read_io_errs    0
[/dev/sda3].flush_io_errs   0
[/dev/sda3].corruption_errs 57
[/dev/sda3].generation_errs 0
[/dev/sdb3].write_io_errs   0
[/dev/sdb3].read_io_errs    0
[/dev/sdb3].flush_io_errs   0
[/dev/sdb3].corruption_errs 0
[/dev/sdb3].generation_errs 0
[/dev/sdc3].write_io_errs   267
[/dev/sdc3].read_io_errs    0
[/dev/sdc3].flush_io_errs   0
[/dev/sdc3].corruption_errs 0
[/dev/sdc3].generation_errs 0

The drive is attached by USB so I turned off the USB device and then got the above result. So it still seems impossible to remove the device even though it's physically not present. I've connected a new USB disk which is now /dev/sdd, so it seems that BTRFS is keeping the name /dev/sdc locked. Should there be a way to fix this without rebooting or anything?

Also as an aside, while the stats about write errors are useful, in this case it would be really good if there was a count of successful writes; it would be useful to know if the successful write count was close to 0. My understanding of the BTRFS design is that there would be no performance penalty for adding counts of the number of successful reads and writes to the superblock. Could this be done?
device delete progress
We need to have a way to determine the progress of a device delete operation.

Also, for a balance of a RAID-1 array that has more than 2 devices it would be good to know how much space is used on each device. Could btrfs fi df be extended to show information separately for each device?
Re: No space on empty, degraded raid10
On Mon, 8 Sep 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote:
Also, I've found out the hard way that system chunks really should be RAID1, NOT RAID10, otherwise it's very likely that the filesystem won't mount at all if you lose 2 disks.

Why would that be different? With RAID-1 you expect system problems if 2 disks fail, so why would RAID-10 be different?

Also it would be nice if there was an N-way mirror option for system data. As such data is tiny (32MB on the 120G filesystem in my workstation) the space used by having a copy on every disk in the array shouldn't matter.
Re: how long should btrfs device delete missing ... take?
It would be nice if a file system mounted ro counted as a ro snapshot for btrfs send. When a file system is so messed up it can't be mounted rw it should be regarded as ro for all operations.

--
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
Debian 3.14.13-2 lockup
I've attached the dmesg output from a system running Debian kernel 3.14.13 which locked up. Everything which needed to write to disk was blocked. The dmesg output didn't catch the first messages, which had scrolled out of the buffer. As the disk wasn't writable there was nothing useful in /var/log/kern.log.

As an aside, one reason for using tmpfs for /tmp is that when the root filesystem has a problem bash filename completion still works. I lost a precious root shell on that system because I pressed TAB and bash hung.

[Attachment: dmesg.txt.gz]
Re: Putting very big and small files in one subvolume?
On Sun, 17 Aug 2014 12:31:42 Duncan wrote:
OTOH, I tend to be rather more of an independent partition booster than many. The biggest reason for that is the too many eggs in one basket problem. Fully separate filesystems on separate partitions separate those data eggs into separate baskets, so if the metaphorical bottom drops out of one of those filesystem baskets, only the data eggs in that filesystem basket are lost, while the eggs in the separate filesystem baskets are still safe and sound, not affected at all. =:^) The thing that troubles me about replacing a bunch of independent partitions and filesystems with a bunch of subvolumes on a single btrfs filesystem is thus just that: you've nicely divided that big basket into little subvolume compartments, but it's still one big basket, and if the bottom falls out, you potentially lose EVERYTHING in that filesystem basket!

I'll write the counter-point to this.

If you have several partitions for /, /var/log, and /home then losing any one of them will result in a system that's mostly unusable. So for continuous service there doesn't seem to be a benefit in having multiple partitions.

When you have to restore a backup in adverse circumstances the restore time is important. For example if you have 10*4TB disks and need RAID-1 redundancy (which you need on any BTRFS filesystem of note, as I don't think RAID-5 and RAID-6 are trustworthy) then an advantage of 5*4TB RAID-1 filesystems over a 20TB RAID-10 is that restore time will be a lot smaller. But this isn't an issue for typical BTRFS users who are working with much smaller amounts of data; at this time I have to recommend ZFS over BTRFS for most systems that manage 20TB of data.

If you have a RAID-1 array of the biggest disks available (which is probably the biggest storage for 99% of BTRFS users) then you are looking at a restore time of maybe 4TB at 160MB/s == something less than 7 hours.
For a home network a 7 hour delay in getting things going after a major failure is quite OK.

Finally, failures of filesystems on different partitions won't be independent. If one filesystem on a disk becomes unusable due to drive firmware issues or other serious problems then other filesystems on the same physical disk are likely to suffer the same fate.
Re: 40TB volume taking over 16 hours to mount, any ideas?
On Fri, 8 Aug 2014 16:35:29 Jose Ildefonso Camargo Tolosa wrote:
uname -a
Linux server1 3.15.8-031508-generic #201407311933 SMP Thu Jul 31 23:34:33 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The complete story: The filesystem was created on Ubuntu 12.04, running kernel 3.11. Mount options included compress=zlib. After having some issues with it, specifically that it would mount itself read-only, and would then be stuck while trying to mount it again, we decided to upgrade to 14.04, kernel 3.13, and 3.12 btrfs tools. [...] Then, after reading here and there, decided to try to use a newer kernel, tried 3.15.8. Well, it is still mounting after ~16 hours, and I got messages like these at first:

I recommend trying a 3.14 kernel. I had ongoing problems with kernels before 3.14 which included infinite loops in kernel space. Based on reports on this list I haven't been inclined to test 3.15 kernels. But 3.14 has been working well for me on many systems. Trying 3.14 can't hurt.
Re: ENOSPC with mkdir and rename
On Mon, 4 Aug 2014 14:17:02 Peter Waller wrote:
For anyone else having this problem, this article is fairly useful for understanding disk full problems and rebalance:
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
It actually covers the problem that I had, which is that a rebalance can't take place because it is full. I still am unsure what is really wrong with this whole situation. Is it that I wasn't careful to do a rebalance when I should have been doing? Is it that BTRFS doesn't do a rebalance automatically when it could in principle?

Yes and yes. The fact that BTRFS can't avoid getting into such situations and can't recover when it does are both bugs in BTRFS. The fact that you didn't run a balance to prevent this is due to not being careful enough with a filesystem that's still in a development stage.

It's pretty bad to end up in a situation (with spare space) where the only way out is to add more storage, which may be impractical, difficult or expensive.

Absolutely.

I conclude that now since I have added more storage, the rebalance won't fail and if I keep rebalancing from a cron job I won't hit this problem again

Yes.

(unless the filesystem fills up very fast! what then?). I don't know however what value to assign to `-dusage` in general for the cron rebalance. Any hints?

If you regularly run a balance with options such as -dusage=50 -musage=10 then the amount of free space in metadata chunks will tend to be a lot greater than that in data chunks.

Another option I've considered is to write a program that creates millions of files with 1000 byte random file names. After creating a filesystem I could run that program to cause a sufficient number of metadata chunks to be allocated and then remove the subvol containing all those files (which incidentally is a lot faster than rm -rf).
Another thing I've considered is making a filesystem for a file server with a RAID-1 array of SSDs and running the above program to allocate all chunks for metadata. Then when the SSDs are totally assigned to metadata I would add a pair of SATA disks for data. A filesystem with all metadata on SSD and all data on SATA disks should give great performance as well as having lots of space.
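The file-creation program described above might look like the following sketch. fill_dir is a hypothetical helper, not an existing tool; real use would run it between "btrfs subvolume create" and "btrfs subvolume delete" with millions of files. Note that Linux limits a single file name to 255 bytes, so the ~1000-byte names mentioned would have to be shortened or spread across subdirectories:

```shell
# fill_dir DIR COUNT: create COUNT empty files with 200-character random
# names in DIR. Long names inflate directory metadata, forcing BTRFS to
# allocate extra metadata chunks. Hypothetical stand-in for the program
# described above.
fill_dir() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # 100 random bytes, hex-encoded -> a 200-character file name
        name=$(head -c 100 /dev/urandom | od -An -tx1 | tr -d ' \n')
        : > "$dir/$name"
        i=$((i + 1))
    done
}
```

After deleting the throwaway subvolume the metadata chunks stay allocated, which is the point of the exercise.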
Re: Threads being NUMA aware
Please get yourself a NUMA system and test this out.

--
Sent from my Samsung Galaxy Note 2 with K-9 Mail.
Re: Scan not being performed properly on boot
On Mon, 4 Aug 2014 04:02:53 Peter Roberts wrote:
> I've just recently started testing btrfs on my server but after just 24 hours problems have started. I get booted to a busybox prompt under ubuntu 14.04. I have a multi device FS setup and I can't say for sure if it managed to boot initially or not but it worked fine in regular usage.

What is GRUB (or your boot loader) giving as parameters to the kernel? What error messages appear on screen? Sometimes it's helpful to photograph the screen and put the picture on a web server to help people diagnose the problem.

> I cannot mount my root with anything other than an explicit /dev/sdx reference until I manually run a scan. Then UUID etc all work. I've

That sounds like a problem with the Ubuntu initrd, probably filing an Ubuntu bug report would be the best thing to do. Is BTRFS supported in that version of Ubuntu? But just changing your boot configuration to use /dev/sdx is probably the best option.
Re: Threads being NUMA aware
On Sun, 3 Aug 2014 22:44:26 Nick Krause wrote:
> On Sun, Aug 3, 2014 at 7:48 PM, Russell Coker russ...@coker.com.au wrote:
> > Please get yourself a NUMA system and test this out.
> Unfortunately I don't have money for an extra machine as of now as I am a student

If you can't get an extra machine then you probably can't contribute to developing filesystems, and your ability to do any sort of kernel development will be greatly limited. It's probably best that you choose another area of software development until you get more hardware (and skill).

> so if x86 is NUMA I can test otherwise I can't.

If you had even read the NUMA Wikipedia page then you would already know the answer to this implied question. But really programming computers isn't something that you are good at. It's not something that you will become good at if you attempt tasks that are way above your skill level.

I think that the best strategy for you is to find a mailing list for Linux beginners; when your skills get to the level that you answer more questions than you ask (and people appreciate your answers) then you can move on to easy programming tasks. Once you master the easy programming tasks you can move on to more difficult tasks.

I suggest that you develop a realistic plan. Plan to start kernel programming in 2024 and have a series of goals on that path over the next 10 years. Make your 2015 goal be answering lots of questions on a Linux beginners mailing list, that's a goal you can achieve in a year.
Re: Scan not being performed properly on boot
On Sun, 3 Aug 2014 21:00:19 George Mitchell wrote:
> > But just changing your boot configuration to use /dev/sdx is probably the best option.
> Assuming you are booting with grub2, you will need to use /dev/sdx in the grub2 configuration file. This is a known issue with grub2. Example from my script:
> echo 'Loading Linux desktop ...'
> linux /vmlinuz-desktop root=/dev/sda7 ro splash quiet
> echo 'Loading initial ramdisk ...'
> initrd /initrd-desktop.img

That's not a GRUB issue, it's an initrd issue. GRUB just passes text to the kernel. "cat /proc/cmdline" will show you what GRUB sent to the kernel; the programs in the initrd read /proc/cmdline and then do what they think is appropriate.
Re: Scan not being performed properly on boot
On Sun, 3 Aug 2014 21:34:29 George Mitchell wrote:
> I see what you are saying. Its a hack. But I suspect that most of the distros are not yet accommodating btrfs with their standard mkinitrd process. At this point modifying grub2 config does solve the problem. If you know a reasonably easy way to fix initrd so that it can interpret UUID and LABEL, I would certainly be all ears.

Debian/Wheezy works with BTRFS when using UUID= to specify the root filesystem. Wheezy was released in May 2013, Ubuntu 14.04 was released in April 2014, and as Ubuntu is based on Debian it should have at least the same features as an older version of Debian.

I think that most distributions are supporting BTRFS. Debian/Wheezy has support for BTRFS (although I recommend that you don't use it unless you plan to take a kernel from Testing), most Debian derivatives will support it, Fedora supports it. It's probably better to make a list of distributions that DON'T support BTRFS, it'll be a shorter list.
Re: ENOSPC with mkdir and rename
On Sun, 3 Aug 2014 00:35:28 Peter Waller wrote:
> I'm running Ubuntu 14.04. I wonder if this problem is related to the thread titled "Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04" which I started on the 29th of July:
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/37224
> Kernel: 3.15.7-031507-generic

As an aside, I'm still on 3.14 kernels for my systems and have no immediate plans to use 3.15. There has been discussion here about a number of problems with 3.15, so I don't think that any testing I do with 3.15 will help the developers and it will just take more of my time.

> $ sudo btrfs fi df /path/to/volume
> Data, single: total=489.97GiB, used=427.75GiB
> Metadata, DUP: total=5.00GiB, used=4.50GiB

As has been noted you are using all the space in 1G data chunks and the system can't allocate more 256M metadata chunks (which are allocated in pairs because it's DUP, so allocating 512M at a time).

> In this case, for example, metadata has 0.5GiB free (sounds like plenty for metadata for one mkdir to me). Data has 62GiB free. Why would I get ENOSPC for a file rename?

Some space is always reserved. Due to the way BTRFS works, changes to a file require writing a new copy of the tree. So the amount of metadata space required for an operation that is conceptually simple can be significant.

One thing that can sometimes solve that problem is to delete a subvol. But note that it can take a considerable amount of time to free the space, particularly if you are running out of metadata space. So you could delete a couple of subvols, run sync a couple of times, and have a coffee break. If possible avoid rebooting as that can make things much worse. This was a particular problem with kernels 3.13 and earlier which could enter a CPU loop requiring a reboot, and then you would have big problems.

> I tried a rebalance with "btrfs balance start -dusage=10" and tried increasing the value until I saw reallocations in dmesg.
/sbin/btrfs fi balance start -dusage=30 -musage=10 /

It's a good idea to have a cron job running a rebalance. Above is what I use on some of my systems, it will free data chunks that are up to 30% used and metadata chunks that are only 10% used. It almost never frees metadata chunks and regularly frees data chunks, which is what I want.

> and enlarge the volume. When I did this, metadata grew by 1GiB:
> Data, single: total=490.97GiB, used=427.75GiB
> System, DUP: total=8.00MiB, used=60.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=5.50GiB, used=4.50GiB
> Metadata, single: total=8.00MiB, used=0.00
> unknown, single: total=512.00MiB, used=0.00

Now that you have solved that problem you could balance the filesystem (deallocating ~60 data chunks) and then shrink it. In the past I've added a USB flash disk to a filesystem to give it enough space to allow a balance and then removed it (NB you have to do a btrfs remove before removing the USB stick).

> * Why didn't the metadata grow before enlarging the disk?
> * Why didn't the rebalance enable the metadata to grow?
> * Why is it necessary to rebalance? Can't it automatically take some free space from 'data'?

It would be nice if it could automatically rebalance. It's theoretically possible as the btrfs program just asks the kernel to do it. But there's nothing stopping you from having a regular cron job to do it. You could even write a daemon to poll the status of a btrfs filesystem and run balance when appropriate if you were keen enough.

> * What is the best course of action to take (other than enlarging the disk or deleting files) if I encounter this situation again?

Have a cron job run a balance regularly.

On Sat, 2 Aug 2014 21:52:36 Nick Krause wrote:
> I have run into this error too and this seems to be a rather big issue as ext4 seems to never run out of metadata room, at least from my testing.
> I feel greatly that this part of btrfs needs to be improved and moved into a function or set of functions for rebalancing metadata in the kernel itself.

Ext4 has fixed size Inode tables that are assigned at mkfs time. If you run out of Inodes then you can't create new files. If you have too big Inode tables then you waste disk space and have a longer fsck time (at least before uninit_bg). The other metadata for Ext4 is allocated from data blocks so it will run out when data space runs out (EG if mkdir fails due to lack of space on ext4 then you can delete a file to make it work).

But really BTRFS is just a totally different filesystem. Ext4 lacks the features such as full data checksums and subvolume support that make these things difficult.

I always found the CP/M filesystem to be easier. It was when they added support for directories that things started getting difficult. :-#
Re: Questions on incremental backups
On Fri, 18 Jul 2014 13:56:58 Sam Bull wrote:
> On ven, 2014-07-18 at 14:35 +1000, Russell Coker wrote:
> > Ignoring directories in send/recv is done by subvol. Even if you use rsync it's a good idea to have different subvols for directory trees with different backup requirements.
> So, an inner subvol won't be backed up? If I wanted a full backup, I would presumably get snapshots of each subvol separately, right?

If you use btrfs send/recv then it won't get the inner subvol. If you use rsync then by default it goes through the entire directory tree unless you use the -x option.

> > Displaying backups is an issue of backup software. It is above the level that BTRFS development touches. While people here can probably offer generic advice on backup software it's not the topic of the list.
> As said, I don't mind developing the software. But, is the required information easily available? Is there a way to get a diff, something like a list of changed/added/removed files between snapshots?

Your usual diff utility will do it. I guess you could parse the output of btrfs send.

> And, finally, nobody has mentioned the possibility of merging multiple snapshots into a single snapshot. Would this be possible, to create a snapshot that contains the most recent version of each file present across all of the snapshots (including files which may be present in only one of the snapshots)?

There is no btrfs functionality for that. But I'm sure you could do something with standard Unix utilities and copying files around.
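Besides diff and parsing btrfs send, a list of changed files between two snapshots can be sketched with "btrfs subvolume find-new" (a real btrfs-progs subcommand). The trick of asking for an impossibly high generation to learn a snapshot's last transid is a common idiom; the field positions assumed below match typical find-new output, and file names containing spaces would confuse the last-field extraction, so treat this as a sketch rather than a robust tool:

```shell
# snapshot_diff OLD NEW: list files in snapshot NEW written after the last
# generation committed in snapshot OLD.
snapshot_diff() {
    old=$1
    new=$2
    # With an impossibly high generation, find-new prints only the
    # "transid marker was N" line, giving OLD's last generation.
    gen=$(btrfs subvolume find-new "$old" 9999999 | awk '/transid marker/ {print $NF}')
    # Each changed file appears on an "inode ..." line ending in its path.
    btrfs subvolume find-new "$new" "$gen" | awk '$1 == "inode" {print $NF}' | sort -u
}
```

Deleted files don't show up this way; find-new only reports new or rewritten extents, which is one reason send/recv or rsync remain the full answer.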
Re: Questions on incremental backups
Daily snapshots work well with kernel 3.14 and above (I had problems with 3.13 and previous). I have snapshots every 15 mins on some subvols. Very large numbers of snapshots can cause performance problems; I suggest keeping below 1000 snapshots at this time.

You can use send/recv functionality for remote backups. So far I've used rsync; it works well, and send/recv has some limitations about filesystem structure etc. Rsync can transfer to an ext4 or ZFS filesystem if you wish.

Ignoring directories in send/recv is done by subvol. Even if you use rsync it's a good idea to have different subvols for directory trees with different backup requirements.

Displaying backups is an issue of backup software. It is above the level that BTRFS development touches. While people here can probably offer generic advice on backup software it's not the topic of the list.

I use date based snapshots on my backup BTRFS filesystems and I can easily delete snapshots in the middle of the list.
Re: Btrfs and LBA errors
On Tue, 15 Jul 2014 11:42:05 constantine wrote:
> Thank you very much for your advice. It worked!

Great!

> I verified that the superblocks 1 and 2 had similar information with "btrfs-show-super -i 1 /dev/sdc1" (and -i 2) and then with crossed fingers:
> btrfs-select-super -s 2 /dev/sdc1
> which restored my btrfs filesystem. Then I ran scrub. For future reference, scrub was stuck:
> # btrfs scrub status partition/
> scrub status for c1eb1aaf-665a-4337-9d04-3c3921aa67e0
> scrub started at Thu Jul 10 21:30:41 2014, running for 2743 seconds
> total bytes scrubbed: 193.47GiB with 0 errors
> and could not cancel it:
> # btrfs scrub cancel partition/
> ERROR: scrub cancel failed on Downloads/: not running
> After deleting /var/lib/btrfs/scrub.status.c1eb1aaf-665a-4337-9d04-3c3921aa67e0 it was completed successfully.

That's a known bug. Now I hope you have made a good backup, the next problem you encounter may not be as easy to solve.
Re: Btrfs transaction checksum corruption losing root of the tree bizarre UUID change.
On Fri, 11 Jul 2014 10:38:22 Duncan wrote:
> > I've moved all drives and move those to my main rig which got a nice 16GB of ecc ram, so errors of ram, cpu, controller should be kept theoretically eliminated.
> It's worth noting that ECC RAM doesn't necessarily help when it's an in-transit bus error. Some years ago I had one of the original 3-digit Opteron machines, which of course required registered and thus ECC RAM. The first RAM I purchased for that board was apparently borderline on its timing certifications, and while it worked fine when the system wasn't too stressed, including with memtest, which passed with flying colors, under medium memory activity it would very occasionally give me, for instance, a bad bzip2 csum, and with intensive memory activity, the problem would be worse (more bz2 decompress errors, gcc would error out too sometimes and I'd have to restart my build, very occasionally the system would crash).

If bad RAM causes corrupt memory but no ECC error reports then it probably wouldn't be a bus error. A bus error SHOULD give ECC reports.

One problem is that RAM errors aren't random. From memory the Hamming codes used fix 100% of single bit errors, detect 100% of 2 bit errors, and let some 3 bit errors through. If you have a memory module with 3 chips on it (the later generation of DIMM for any given size) then an error in 1 chip can change 4 bits. The other main problem is that if you have a read or write going to the wrong address then you lose, as AFAIK there's no ECC on address lines.

But I still recommend ECC RAM, it just decreases the scope for problems. About half the serious problems I've had with BTRFS have been caused by a faulty DIMM...
Re: Btrfs and LBA errors
On Fri, 11 Jul 2014 13:42:40 constantine wrote:
> Btrfs filesystem could not be mounted because /dev/sdc1 had unreadable sectors. It is/was a single filesystem (not raid1 or raid0) over /dev/sda1 and /dev/sdc1.

What does "file -s /dev/sda1 /dev/sdc1" report?
Re: Btrfs and LBA errors
On Fri, 11 Jul 2014 21:29:07 constantine wrote:
> Thank you very much for your response:
> # file -s /dev/sda1 /dev/sdc1
> /dev/sda1: BTRFS Filesystem label partition, sectorsize 4096, nodesize 4096, leafsize 4096, UUID=c1eb1aaf-665a-4337-9d04-3c3921aa67e0, 1683870334976/3010310701056 bytes used, 2 devices
> /dev/sdc1: data

Looks like the primary superblock on /dev/sdc1 is corrupt.

http://www.funtoo.org/BTRFS_Fun#btrfs-select-super_.2F_btrfs-zero-log

The above URL documents how to use the other superblocks. It notes that you only get one chance - if you do this wrong you will dramatically reduce the probability of getting your data back. Do not do this on the master copy of your data! Buy 2 disks of equal or greater capacity, copy the block devices, and then work with the copy.

sdc needs to be replaced anyway no matter what you do and getting a replacement for sda would be a good strategy anyway. Buy 2*4TB disks and you can make a new RAID-1 array that has more capacity than the non-RAID filesystem you currently have.
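Copying the block devices before any recovery attempt might look like the following sketch. The device names are assumptions, and DRY_RUN=1 (the default here) only prints the commands. GNU ddrescue, if installed, is the better tool for a disk with unreadable sectors since it retries and maps bad areas; plain dd with conv=noerror,sync at least keeps going past read errors:

```shell
#!/bin/sh
# Hedged sketch: duplicate the failing device onto a spare of equal or
# larger size, then attempt superblock recovery on the copy only.
SRC=${SRC:-/dev/sdc1}
DST=${DST:-/dev/sde1}   # assumed spare partition, at least as large as SRC
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# -f forces writing to an existing device; the map file records bad areas
# so an interrupted copy can resume.
run ddrescue -f "$SRC" "$DST" /root/sdc1.map
# Fallback without ddrescue:
# run dd if="$SRC" of="$DST" bs=1M conv=noerror,sync
```

Only after the copy succeeds would btrfs-select-super be run against the duplicate.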
Re: btrfs RAID with enterprise SATA or SAS drives
On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
> > - for someone using SAS or enterprise SATA drives with Linux, I understand btrfs gives the extra benefit of checksums, are there any other specific benefits over using mdadm or dmraid?
> I think I can answer this one. Most important advantage I think is BTRFS is aware of which blocks of the RAID are in use and need to be synced:
> - Instant initialization of RAID regardless of size (unless at some capacity mkfs.btrfs needs more time)

From mdadm(8):

  --assume-clean
    Tell mdadm that the array pre-existed and is known to be clean. It can be useful when trying to recover from a major failure as you can be sure that no data will be affected unless you actually write to the array. It can also be used when creating a RAID1 or RAID10 if you want to avoid the initial resync, however this practice — while normally safe — is not recommended. Use this only if you really know what you are doing.
    When the devices that will be part of a new array were filled with zeros before creation the operator knows the array is actually clean. If that is the case, such as after running badblocks, this argument can be used to tell mdadm the facts the operator knows.

While it might be regarded as a hack, it is possible to do a fairly instant initialisation of a Linux software RAID-1.

> - Rebuild after disk failure or disk replace will only copy *used* blocks

Have you done any benchmarks on this? The down-side of copying used blocks is that you first need to discover which blocks are used. Given that seek time is a major bottleneck, at some portion of space used it will be faster to just copy the entire disk. I haven't done any tests on BTRFS in this regard, but I've seen a disk replacement on ZFS run significantly slower than a dd of the block device would.

> - Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID should be able to do this as well. But also for scrubbing: BTRFS only checks and repairs used blocks.
When you scrub Linux Software RAID (and in fact pretty much every RAID) it will only correct errors that the disks flag. If a disk returns bad data and says that it's good then the RAID scrub will happily copy the bad data over the good data (for a RAID-1) or generate new valid parity blocks for bad data (for RAID-5/6).

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

Page 12 of the above document says that nearline disks (IE the ones people like me can afford for home use) have a 0.466% incidence of returning bad data and claiming it's good in a year. Currently I run about 20 such disks in a variety of servers, workstations, and laptops. Therefore the probability of having no such errors on all those disks would be .99534^20=.91081. The probability of having no such errors over a period of 10 years would be (.99534^20)^10=.39290, which means that over 10 years I should expect to have such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are necessary features.
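The arithmetic above can be checked with a couple of lines of awk:

```shell
# Verify the failure-probability figures from the paragraph above:
# a 0.466%/year silent-bad-data rate per disk, 20 disks, over 1 and 10 years.
one_year=$(awk 'BEGIN { printf "%.4f", 0.99534 ^ 20 }')
ten_years=$(awk 'BEGIN { printf "%.4f", (0.99534 ^ 20) ^ 10 }')
echo "P(all 20 disks clean for 1 year):   $one_year"
echo "P(all 20 disks clean for 10 years): $ten_years"
```

This reproduces the .91081 and .3929 figures, i.e. a roughly 60% chance of at least one silent-corruption incident somewhere in the fleet over a decade.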
btrfs loopback problems
root@yoyo:/# btrfs fi df /
Data, RAID1: total=9.00GiB, used=6.95GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=82.95MiB
root@yoyo:/# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       273G   15G  257G   6% /

I have a Xen server that has a RAID-1 array of 2*140G SAS disks, above is the df output. The Xen server is for training and testing (including training people to use BTRFS), hence the name. I have a subvol /xenstore which has image files for BTRFS RAID-1 filesystems (RAID-1 within RAID-1 is going to suck for performance but be good for training, and apparently testing).

root@yoyo:/# mount -o loop,degraded /xenstore/btrfsa /mnt/tmp

I mounted one of them loopback with the above command and then tried doing an apt-get update in a chroot. The result was that apt-get entered D state and the following was in the kernel message log. The system is running Debian kernel 3.14.9.

[ 2280.784105] INFO: task btrfs-flush_del:1339 blocked for more than 120 seconds.
[ 2280.784126] Not tainted 3.14-1-amd64 #1
[ 2280.784136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.784152] btrfs-flush_del D 8800547f0cf8 0 1339 2 0x
[ 2280.784173] 8800547f08e0 0246 00014380 8800547f5fd8
[ 2280.784232] 00014380 8800547f08e0 880077454c10 8800c028
[ 2280.784290] 0002 8111f240 8800547f5ce0 8800547f5dc8
[ 2280.784347] Call Trace:
[ 2280.784376] [8111f240] ? wait_on_page_read+0x60/0x60
[ 2280.784410] [814bcd54] ? io_schedule+0x94/0x130
[ 2280.784440] [8111f245] ? sleep_on_page+0x5/0x10
[ 2280.784470] [814bd0c4] ? __wait_on_bit+0x54/0x80
[ 2280.784501] [8111f04f] ? wait_on_page_bit+0x7f/0x90
[ 2280.784534] [8109e300] ? autoremove_wake_function+0x30/0x30
[ 2280.784566] [8112c008] ? pagevec_lookup_tag+0x18/0x20
[ 2280.784597] [8111f130] ? filemap_fdatawait_range+0xd0/0x160
[ 2280.784654] [a0230da5] ? btrfs_wait_ordered_range+0x65/0x120 [btrfs]
[ 2280.784709] [a021d7a1] ? btrfs_run_delalloc_work+0x21/0x80 [btrfs]
[ 2280.784750] [a02449b0] ? worker_loop+0x140/0x520 [btrfs]
[ 2280.784782] [814bc5e9] ? __schedule+0x2a9/0x700
[ 2280.784820] [a0244870] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[ 2280.784854] [8107f858] ? kthread+0xb8/0xd0
[ 2280.784884] [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.784917] [814c7acc] ? ret_from_fork+0x7c/0xb0
[ 2280.784948] [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.784980] INFO: task btrfs-transacti:1343 blocked for more than 120 seconds.
[ 2280.785028] Not tainted 3.14-1-amd64 #1
[ 2280.785055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.785105] btrfs-transacti D 8800547c4968 0 1343 2 0x
[ 2280.785142] 8800547c4550 0246 00014380 880054055fd8
[ 2280.785200] 00014380 8800547c4550 88004f063f70 880054055d78
[ 2280.785257] 88004f063f68 8800547c4550 880056047de0 880054055df0
[ 2280.785314] Call Trace:
[ 2280.785340] [814bbe49] ? schedule_timeout+0x209/0x2a0
[ 2280.785372] [81095b6f] ? enqueue_task_fair+0x2bf/0xdf0
[ 2280.785403] [81091337] ? sched_clock_cpu+0x47/0xb0
[ 2280.785434] [8131dc6b] ? notify_remote_via_irq+0x2b/0x50
[ 2280.785466] [8108bee5] ? check_preempt_curr+0x65/0x90
[ 2280.785497] [8108bf1f] ? ttwu_do_wakeup+0xf/0xc0
[ 2280.785528] [814bd3f0] ? wait_for_completion+0xa0/0x110
[ 2280.785560] [8108e7a0] ? wake_up_state+0x10/0x10
[ 2280.785599] [a02273dd] ? btrfs_wait_and_free_delalloc_work+0xd/0x20 [btrfs]
[ 2280.785656] [a02304f6] ? btrfs_run_ordered_operations+0x1e6/0x2b0 [btrfs]
[ 2280.785712] [a02182b7] ? btrfs_commit_transaction+0x217/0x990 [btrfs]
[ 2280.785768] [a0218abb] ? start_transaction+0x8b/0x550 [btrfs]
[ 2280.785807] [a021432d] ? transaction_kthread+0x1ad/0x240 [btrfs]
[ 2280.785846] [a0214180] ? btrfs_cleanup_transaction+0x510/0x510 [btrfs]
[ 2280.785894] [8107f858] ? kthread+0xb8/0xd0
[ 2280.785924] [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.785956] [814c7acc] ? ret_from_fork+0x7c/0xb0
[ 2280.785985] [8107f7a0] ? kthread_create_on_node+0x180/0x180
[ 2280.786017] INFO: task apt-get:1358 blocked for more than 120 seconds.
[ 2280.786047] Not tainted 3.14-1-amd64 #1
[ 2280.786074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.786122] apt-get D 88005bce2d78 0 1358 1357 0x
[ 2280.786159] 88005bce2960 0286 00014380 88005a681fd8
[ 2280.786218] 00014380 88005bce2960 880077494c10
Re: Is btrfs related to OOM death problems on my 8GB server with both 3.15.1 and 3.14?
On Thu, 3 Jul 2014 18:19:38 Marc MERLIN wrote:
> I upgraded my server from 3.14 to 3.15.1 last week, and since then it's been running out of memory and deadlocking (panic= doesn't even work). I downgraded back to 3.14, but I already had the problem once since then.

Is there any correlation between such problems and BTRFS operations such as creating snapshots or running a scrub/balance?

Back in ~3.10 days I had serious problems with BTRFS memory use when removing multiple snapshots or balancing. But at about 3.13 they all seemed to get fixed. I usually didn't have a kernel panic when I had such problems (although I sometimes had a system lock up solid such that I couldn't even determine what its problem was). Usually the OOM handler started killing big processes such as chromium when it shouldn't have needed to.

Note that I haven't verified that the BTRFS memory use is reasonable in all such situations. Merely that it doesn't use enough to kill my systems.
Re: RAID1 3+ drives
On Sat, 28 Jun 2014 04:26:43 Duncan wrote:
> Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:
> > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> > > Can I get more protection by using more than 2 drives? I had an onboard RAID a few years back that would let me use RAID1 across up to 4 drives.
> > Currently the only RAID level that fully works in BTRFS is RAID-1 with data on 2 disks.
> Not /quite/ correct. Raid0 works, but of course that isn't exactly RAID as it's not redundant. And raid10 works. But that's simply raid0 over raid1. So depending on whether you consider raid0 actually

http://en.wikipedia.org/wiki/Linux_MD_RAID_10#LINUX-MD-RAID-10

There are a number of ways of doing RAID-0 over RAID-1, but BTRFS doesn't do any of them. When you have more than 2 disks and tell BTRFS to do RAID-1 you get a result that might be somewhat comparable to Linux software RAID-10, except for the issue of having disks of different sizes and adding more disks after creating the RAID.

> RAID or not, which in turn depends on how strict you are with the redundant part, there is or is not more than btrfs raid1 working.

The way BTRFS, ZFS, and WAFL work is quite different to anything described in any of the original papers on RAID. One could make a case that what these filesystems do shouldn't be called RAID, but then we would be searching for another term for it.

> > If you have 4 disks in the array then each block will be on 2 of the disks.
> Correct. FWIW I'm told that the paper that laid out the original definition of RAID (which was linked on this list in a similar discussion some months ago) defined RAID-1 as paired redundancy, no matter the number of devices. Various implementations (including Linux' own mdraid soft-raid, and I believe dmraid as well) feature multi-way-mirroring aka N-way-mirroring such that N devices equals N way mirroring, but that's an implementation extension and isn't actually necessary to claim RAID-1 support.
The paper is a little ambiguous as to whether a 3 disk mirror can be RAID-1.

> So look for N-way-mirroring when you go RAID shopping, and no, btrfs does not have it at this time, altho it is roadmapped for implementation after completion of the raid5/6 code. FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for device redundancy, but to take full advantage of btrfs data integrity features, allowing to scrub a checksum-mismatch copy with the content of a checksum-validated copy if available. That's currently possible, but due to the pair-mirroring-only restriction, there's only one additional copy, and if it happens to be bad as well, there's no possibility of a third copy to scrub from. As it happens my personal sweet-spot between cost/performance and reliability would be 3-way mirroring, but once they code beyond N=2, N should go unlimited, so N=3, N=4, N=50 if you have a way to hook them all up... should all be possible.

What I want is the ZFS copies= feature.

> > If you want to have 4 disks in a fully redundant configuration (IE you could lose 3 disks without losing any data) then the thing to do is to have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1 on top of that.
> The caveat with that is that at least mdraid1/dmraid1 has no verified data integrity, and while mdraid5/6 does have 1/2-way-parity calculation, it's only used in recovery, NOT cross-verified in ordinary use.

Linux Software RAID-6 only uses the parity when you have a hard read error. If you have a disk return bad data and say it's good then you just lose. That said the rate of disks returning such bad data is very low. If you had a hypothetical array of 4 disks as I suggested then to lose data you need to have one pair of disks entirely fail and another disk return corrupt data, or have 2 disks in separate RAID-1 pairs return corrupt data on matching sectors (according to BTRFS data copies) such that Linux software RAID copies the corrupt data to the good disk.
That sort of thing is much less likely than having a regular BTRFS RAID-1
array of 2 disks failing.

> > Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
> > filesystems that each contain a single large file. Those 2 large
> > files could be run via losetup and used for another BTRFS RAID-1
> > filesystem. That gets you redundancy at both levels. Of course if you
> > had 2 disks in one pair fail then the loopback BTRFS filesystem would
> > still be OK.

How does the BTRFS kernel code handle a loopback device read failure?

> In fact, with md/dmraid and its reasonable possibility of silent
> corruption since at that level any of the copies could be returned and
> there's no data integrity checking, if whatever md/dmraid level copy
> /is/ returned ends up being bad, then btrfs will consider that side of
> the pair bad, without any way to check additional copies at the
> underlying md/dmraid level. Effectively you only have two verified
> copies.
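The doubly-paranoid loopback scheme quoted above could be sketched like this (paths and sizes are hypothetical, and the commands need root):

```shell
# One large backing file on each of two separate BTRFS RAID-1
# filesystems (each filesystem is itself a 2-disk mirror).
truncate -s 100G /mnt/pair1/backing.img
truncate -s 100G /mnt/pair2/backing.img

# Attach them as loop devices; --show prints the allocated device name.
LOOP1=$(losetup --show -f /mnt/pair1/backing.img)
LOOP2=$(losetup --show -f /mnt/pair2/backing.img)

# A third BTRFS RAID-1 on top of the loop devices gives checksummed
# redundancy at both levels.
mkfs.btrfs -d raid1 -m raid1 "$LOOP1" "$LOOP2"
mount "$LOOP1" /mnt/nested
```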
Re: RAID1 3+ drives
On Sat, 28 Jun 2014 11:38:47 Duncan wrote:
> And with the size of disks we have today, the statistics on multiple
> whole device reliability are NOT good to us! There's a VERY REAL
> chance, even likelihood, that at least one block on the device is going
> to be bad, and not be caught by its own error detection!

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The above paper suggests that about 10% of SATA disks get such errors per
year and that a disk which has such a problem typically has it on ~50
sectors. The probability of having 2 disks randomly get such errors (if
they are truly random and independent) would be something like 1% per
year. The probability of the ~50 bad sectors on each of 2*3TB disks
happening to match up is much lower.

> > Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
> > filesystems that each contain a single large file. Those 2 large
> > files could be run via losetup and used for another BTRFS RAID-1
> > filesystem. That gets you redundancy at both levels. Of course if you
> > had 2 disks in one pair fail then the loopback BTRFS filesystem would
> > still be OK.
>
> But the COW and fragmentation issues on the bottom level... OUCH! And
> you can't simply set NOCOW, because that turns off the checksumming as
> well, leaving you right back where you were without the integrity
> checking!

It really depends on how much performance you need. I've got some virtual
servers running BTRFS within BTRFS, and with modern hardware and a light
load it works OK.

> *BUT* at a cost of essentially *CONSTANT* scrubbing. Constant because
> at the multi-TBs we're talking, just completing a single scrub cycle
> could well take more than a standard 8-hour work-day, so by the time
> you finish, it's already about time to start the next scrub cycle.

Scrubbing my BTRFS RAID-1 filesystem with 2.4TB of data stored on a pair
of 3TB disks takes 5 hours.
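The rough arithmetic behind the ~1% figure, assuming the two failures really are independent events:

```shell
# Per the FAST'08 paper, take ~10% of SATA disks developing silent
# corruption per year. Under independence, the chance that both disks of
# a 2-disk mirror are affected in the same year is p squared.
p=0.10
awk -v p="$p" 'BEGIN { printf "both disks affected: %.2f%% per year\n", p * p * 100 }'
# -> both disks affected: 1.00% per year
```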
> That sort of constant scrubbing is going to take its toll both on
> device life and on I/O thruput for whatever data you're actually
> storing on the device, since a good share of the time it's going to be
> scrubbing as well, slowing down the speed of the real I/O.

Some years ago I asked an executive from a company that manufactured hard
drives about this. The engineering manager who was directed to answer my
question told me that the drives were designed to perform any sequence of
legal operations continually for the warranty period. So if a disk has a
3 year warranty then it should be able to survive a scrubbing loop for 3
years. But scrubbing a system that runs 24*7 is a problem. Hopefully we
will get a speed limit feature for BTRFS scrubbing as there is for Linux
software RAID rebuild/scrub.

No. I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub
every Sunday night. If I had an array of 4 disks then I could do scrubs
on Saturday night as well.

> But are you scrubbing at both the btrfs and the md/dmraid level?
> That'll effectively double the scrub-time.

It's a BTRFS RAID-1; there is no mdadm on that system.

> And while that might not take a full 24 hours, it's likely to take a
> significant enough portion of 24 hours, that if you're doing a full
> mdraid and btrfs level both scrub every two days, some significant
> fraction (say a third to a half) of the time will be spent scrubbing,
> during which normal I/O speeds will be significantly reduced, while
> also reducing device lifetime due to the relatively high duty cycle
> seek activity.

When the expected error rate for SATA disks is ~10% of disks having
errors per year, a scrub every second day seems rather paranoid. But if
you are that paranoid then the wisc.edu paper suggests that you should be
buying enterprise disks, which have a much lower error rate.
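For comparison, the md speed limit mentioned above is a sysctl, while BTRFS scrub can at least be demoted to idle I/O priority so real I/O wins the contention (mount point hypothetical; commands need root):

```shell
# Linux software RAID caps resync/check bandwidth via sysctl (KiB/s):
cat /proc/sys/dev/raid/speed_limit_max
echo 50000 > /proc/sys/dev/raid/speed_limit_max

# BTRFS has no equivalent bandwidth knob, but scrub accepts an I/O
# priority class: -c 3 is the idle class, so normal I/O takes priority.
btrfs scrub start -c 3 /mnt/raid1
btrfs scrub status /mnt/raid1
```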
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Question] Btrfs on iSCSI device
On Fri, 27 Jun 2014 18:34:34 Goffredo Baroncelli wrote:
> I don't think that it is possible to mount the _same device_ at the
> _same time_ on two different machines. And this doesn't depend by the
> filesystem.

If you use a clustered filesystem then you can safely mount it on
multiple machines. If you use a non-clustered filesystem it can still
mount and even appear to work for a while. It's surprising how many
writes you can make to a dual-mounted filesystem that's not designed for
such things before you get a totally broken filesystem.

On Fri, 27 Jun 2014 13:15:16 Austin S Hemmelgarn wrote:
> The reason it appears to work when using iSCSI and not with directly
> connected parallel SCSI or SAS is that iSCSI doesn't provide low level
> hardware access.

I've tried this with dual-attached FC and had no problems mounting. In
what way is directly connected SCSI different from FC?
Re: RAID1 3+ drives
On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> Can I get more protection by using more than 2 drives? I had an onboard
> RAID a few years back that would let me use RAID1 across up to 4
> drives.

Currently the only RAID level that fully works in BTRFS is RAID-1 with
data on 2 disks. If you have 4 disks in the array then each block will be
on 2 of the disks.

The RAID-5/6 code mostly works, but the last report I read indicated that
some recovery and disk replacement situations didn't work - presumably
anyone who's afraid of multiple disks failing isn't going to want to
trust the BTRFS RAID-6 code at the moment.

If you want to have 4 disks in a fully redundant configuration (IE you
could lose 3 disks without losing any data) then the thing to do is to
have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1
on top of that.